USB-drive-DS RT2020 Proc PDF
Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy
beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom
of the first page, provided the per-copy fee indicated in the code is paid through Copyright Clearance Center, 222 Rosewood
Drive, Danvers, MA 01923. For reprint or republication permission, email to IEEE Copyrights Manager at
pubs-permissions@ieee.org. All rights reserved. Copyright ©2020 by IEEE.
Proceedings of the 2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real
Time Applications (DS-RT)
1st edition
Contact:
Miroslav Voznak, VSB-Technical University of Ostrava, Faculty of Electrical Engineering and Computer
Science, 17. listopadu 2172/15, 708 00 Ostrava, Czech Republic
miroslav.voznak{at}vsb.cz
Editors:
Dusan Maga
Jiri Hajek
Each paper has been reviewed. The responsibility for the content and language of each paper rests
solely on its author(s).
Association for Computing Machinery Special Interest Group on Simulation and Modeling - ACM
SIGSIM
Miroslav Voznak
VSB-Technical University of Ostrava
17. listopadu 2172/15
708 00 Ostrava
Czech Republic
tel.: +420 596 995 940
e-mail: miroslav.voznak{at}vsb.cz
http://ds-rt.com/2020/home#
A message from the chairs
A warm welcome to the 2020 IEEE/ACM 24th International Symposium on Distributed Simulation and
Real Time Applications (DS-RT), originally planned to be organized in Prague. The coronavirus
pandemic situation forced us to switch this event to online conferencing. The decision was made after
many discussions. Rules for entering the territory of the Czech Republic have changed frequently
since March 2020, and it is impossible to guarantee adequate conditions in
Prague. We are well aware that a conference is above all a place for meetings and
discussions; nevertheless, we have done our best to prepare an attractive conference program
for you in cyberspace.
DS-RT serves as a forum for simulationists from academia, industry and research labs, to present recent
research results that target the growing overlap between large distributed simulations and real-time
applications. About 91 papers were submitted, of which 10 were withdrawn (some of
them after review) and 24 were accepted as regular papers. In addition to the regular papers, 8 papers
were accepted as short papers. These 32 papers are divided into eight sessions running sequentially over
three days.
We selected the most popular videoconferencing tool to support the event. Nevertheless, each of
the conference days starts with a testing session in which participants can resolve
any technical issues. We have also arranged a live stream of DS-RT
2020 on a private YouTube channel. We expect this option to be the best for participants who do not
want to connect directly to the conference room. Scheduling the presentations was more
difficult this year because the speakers are in different time zones, which we treated as a new
parameter in the multi-criteria planning. The conference sessions are the following: Distributed Simulations;
Scheduling & Simulations; Real-Time Simulations; Cloud, Fog & Edge Computing; Vehicular & Edge
Computing; Secure & Efficient Computing; Simulations & Modelling; and UAVs & Simulations.
In addition to the sessions above, the program also includes three keynote speeches opening individual
conference days. The first keynote is on "Evolutionary Algorithms and its Use in Modelling and
Simulations of the Complex Systems", given by Prof. Ivan Zelinka from VSB-Technical University of
Ostrava, Czechia. The second conference day starts with a keynote on "Modeling & Simulation Based
Framework for Interoperability Driven Enterprise Design" delivered by Prof. Gregory Zacharewicz from
IMT – Mines Ales, France. The last keynote "Stability and Hidden Attractors in the Simulation and
Theoretical Study of Dynamical Models" is given by Prof. Nikolay V. Kuznetsov from Saint-Petersburg
State University, Russia. The best paper award announcement is scheduled for the closing session, when
we will also disclose the venue of IEEE/ACM DS-RT 2021.
In recognition of the papers' quality and originality, the following IEEE/ACM DS-RT 2019 awards
were given last year:
The Best Paper Award to Marco Rapelli, Claudio E. Casetti and Giandomenico Gagliardi for the
paper "TuST: from Raw Data to Vehicular Traffic Simulation in Turin."
The Best Paper Runner-up Award to Shingo Igarashi, Takuya Azumi, Yuto Kitagawa, Tasuku
Ishigooka and Tatsuya Horiguchi for the paper "Multi-rate DAG Scheduling Considering
Communication Contention for NoC-based Embedded Many-core Processor."
- in the category of short papers
The Best Short Paper Award to Robert Chodorek, Agnieszka Chodorek and Krzysztof Wajda for
the paper "Media and non-media WebRTC communication between a terrestrial station and a
drone: the case of a flying IoT system to monitor parking."
The Best Short Paper Runner-up Award to Armir Bujari, Jordan Gottardo, Claudio E. Palazzi
and Daniele Ronzani for the paper "Message Dissemination in Urban IoV."
We would like to express our sincere gratitude to the members of the organizing, steering and program
committees, and also to the reviewers, speakers and especially all the authors for their contributions, effort,
and time. Special thanks also go to our sponsors, IEEE Computer Society and ACM SIGSIM, and to the
editorial staff at IEEE Conference Publication Services for their work in producing these proceedings.
Thanks to the VSB-Technical University of Ostrava, CZ and its staff for managing all duties
related to organizing the conference. Last but not least, we appreciate the support provided by
CESNET (the national research and education network association in Czechia) in switching IEEE/ACM
DS-RT 2020 to the virtual conference mode.
Miroslav Voznak
Floriano De Rango
Organizing Committees
General Chair
Miroslav Voznak
VSB – Technical University of Ostrava
Czech Republic
Program Co-Chairs
Carlos Tavares Calafate
Universitat Politècnica de València
Spain
Floriano De Rango
University of Calabria
Italy
Special Sessions Chair
Rodolfo W. L. Coutinho
Concordia University
Canada
Posters/Demo Chair
Hakki Gokhan Ilk
Ankara University, Turkey
Publicity Co-Chairs
Mirela Sechi Notare
University of Technology in Fly Transportation
Brazil
Eirini Eleni Tsiropoulou
University of New Mexico
USA
Mauro Tropea
University of Calabria
Italy
Finance Chair
Lukas Sevcik
Technical University of Ostrava
Publication/Proceedings Co-Chairs
Dusan Maga
Czech Technical University in Prague
Czech Republic
Mauro Tropea
University of Calabria
Italy
Local Organization Chair
Robert Bestak
Czech Technical University in Prague
Czech Republic
Web Chair
Noura Aljeri
University of Ottawa
Canada
Program Committee
Adelinde Uhrmacher University of Rostock, Germany
Alfredo Garro University of Calabria, Italy
Angelo Furfaro University of Calabria, Italy
Armir Bujari University of Padua, Italy
Andrea D'Ambrogio University of Rome Tor Vergata, Italy
Chun-Wei Lin Western Norway University of Applied Sciences, Norway
Claudia Campolo Mediterranea University of Reggio Calabria, Italy
Danilo Amendola University of Trieste, Italy
Emanuel Puschita Technical University of Cluj-Napoca, Romania
Enrique Hernández-Orallo Universitat Politècnica de València, Spain
Franco Cicirelli University of Calabria, DIMES, Italy
Francesco Quaglia University of Rome "La Sapienza", Italy
Gabriele D'Angelo University of Bologna, Italy
Gabriel Wainer Carleton University, Canada
Georgios Keramidas Think Silicon, Greece
Giandomenico Spezzano Consiglio Nazionale delle Ricerche (CNR), Italy
Greg Zacharewicz IMT - Mines Ales, France
Hakki Ilk Ankara University, Turkey
Helen Karatza Aristotle University of Thessaloniki, Greece
Hoang-Sy Nguyen Binh Duong University, Vietnam
Iqbal Khan Qualcomm Technologies Inc., USA
Ivan Zelinka Technical University of Ostrava, Czech Republic
Jan Martinovic IT4Innovations, Czech Republic
Jerry Chun-Wei Lin Western Norway University of Applied Sciences, Norway
Juan-Carlos Cano Universidad Politecnica de Valencia, Spain
Libero Nigro University of Calabria, Italy
Marcin Niemiec AGH University of Science and Technology, Poland
Marco Morana University of Palermo, Italy
Mauro Tropea Università della Calabria, Italy
Michal Stepanovsky Czech Technical University in Prague, Czech Republic
Miralem Mehic University of Sarajevo, Bosnia and Herzegovina
Mirko Stoffers RWTH Aachen University, Germany
Pavel Tvrdik Czech Technical University in Prague, Czech Republic
Pierre Siron University of Toulouse, France
Philip Wilsey University of Cincinnati, USA
Pietro Manzoni Universitat Politècnica de València, Spain
Radek Fujdiak Brno University of Technology, Czech Republic
Seilendria Hadiwardoyo IMEC
Simon Taylor Brunel University, UK
Stefan Rass Universität Klagenfurt, Austria
Tan Nhat Nguyen Ton Duc Thang University, Vietnam
Vaclav Snasel Technical University of Ostrava, Czech Republic
Wentong Cai Nanyang Technological University, Singapore
Steering Committee
Azzedine Boukerche University of Ottawa, Canada
Sajal K. Das Missouri University of Science and Technology, USA
Paul Reynolds University of Virginia, USA
Stephen J. Turner KMUTT, Thailand
Albert Zomaya University of Western Australia, Australia
Rodolfo W. L. Coutinho Concordia University, Canada
Table of Contents
Claudia Campolo, Giacomo Genovese, Antonella Molinaro, Bruno Pizzimenti: Digital Twins
at the Edge to Track Mobility for MaaS Applications
Martin Drašar, Stephen Moskal, Shanchieh Yang, Pavol Zaťko: Session-Level Adversary
Intent-Driven Cyberattack Simulator
Michael Kyesswa, Philipp Schmurr, Hueseyin Kemal Cakmak, Uwe Kuehnapfel, Veit Hagenmeyer:
A New Julia-Based Parallel Time-Domain Simulation Algorithm for Analysis of Power System
Dynamics
Anselm Erdmann, Anna Marcellan, Dominik Hering, Michael Suriyah, Carolin Ulbrich, Martin Henke,
André Xhonneux, Dirk Müller, Rutger Schlatmann, Veit Hagenmeyer: On Verification of Designed
Energy Systems Using Distributed Co-Simulations
Mauro Tropea, Abdon Serianni: Bio-Inspired Drones Recruiting Strategy for Precision Agriculture
Domain
Alexander Puzicha, Peter Buchholz: Real-Time Simulation of Robot Swarms with Restricted
Communication Skills
Shingo Igarashi, Tasuku Ishigooka, Tatsuya Horiguchi, Ryotaro Koike, Takuya Azumi: Heuristic
Contention-Free Scheduling Algorithm for Multi-core Processor Using LET Model
Maryan Rab, Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, Francesco Quaglia:
NUMA-Aware Non-Blocking Calendar Queue
Andrea Piccione, Alessandro Pellegrini: Agent-Based Modeling and Simulation for Emergency
Scenarios: A Holistic Approach
Nicolas Nevigato, Mauro Tropea, Floriano De Rango: Collision Avoidance Proposal in a MEC Based
VANET Environment
Sung woon Park, Azzedine Boukerche, Shichao Guan: A Novel Deep Reinforcement Learning
Based Service Migration Model for Mobile Edge Computing
Diogo Torres, João Pedro Dias, André Restivo, Hugo Ferreira: Real-time Feedback in Node-RED
for IoT Development: An Empirical Study
Franco Cicirelli, Libero Nigro: Model Checking Actor-based Cyber-Physical Systems
Moritz Gütlein, Wojciech Baron, Christopher Renner, Anatoli Djanatliev: Performance Evaluation
of HLA RTI Implementations
Sergey Suslov, Michael Schiek, Markus Robens, Christian Grewing, Stefan van Waasen: Simulating
Heterogeneous Models on Multi-Core Platforms Using Julia's Computing Language Parallel
Potential
Alberto Falcone, Alfredo Garro: Pitfalls and Remedies in Modeling and Simulation of Cyber
Physical Systems
Lorenzo Donatiello, Lorenzo Gasparini, Gustavo Marfia: Laying the Path to Consumer-Level
Immersive Simulation Environments
Emilie Bout, Valeria Loscrí, Antoine Gallais: Energy and Distance Evaluation for Jamming Attacks
in Wireless Networks
Awais Aziz Shah, Marco Mussini, Francesco Nicassio, Giorgio Parladori, Francesco Triggiani, Giovanni
Grieco, Giuseppe Iaffaldano, Giuseppe Piro: A Real-Time Simulation Framework for Complex and
Large-Scale Optical Transport Networks Based on the SDN Paradigm
Franco Cicirelli, Antonio Gentile, Emilio Greco, Antonio Guerrieri, Giandomenico Spezzano,
Andrea Vinci: An Energy Management System at the Edge Based on Reinforcement Learning
Jalil Boudjadar, Mohammad Hassan Khooban: A Cost-effective Scheduling Control for a Safety
Critical Hybrid Power System
Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M. Mustafa Rafique: CoSim: A Simulator
for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters
Jamie Wubben, Pablo Aznar, Francisco Fabra, Carlos T. Calafate, Juan-Carlos Cano, Pietro Manzoni:
Toward Secure, Efficient, and Seamless Reconfiguration of UAV Swarm Formations
Youssra Cheriguene, Soumia Djellikh, Fatima Zohra Bousbaa, Nasreddine Lagraa, Abderrahmane
Lakas, Chaker Abdelaziz Kerrache, Abdou El Karim Tahari: SEMRP: an Energy-Efficient Multicast
Routing Protocol for UAV Swarms
Giovanni Iacovelli, Pietro Boccadoro, Luigi Alfredo Grieco: An Iterative Stochastic Approach to
Constrained Drones' Communications
Nasos Grigoropoulos, Spyros Lalis: Simulation and Digital Twin Support for Managed Drone
Applications
Alessandro Ciociola, Michele Cocca, Danilo Giordano, Luca Vassio, Marco Mellia:
E-Scooter Sharing: Leveraging Open Data for System Design
Peppino Fazio, Miralem Mehic, Pavol Partila, Jaromir Tovarek, Miroslav Voznak: A New Mobility
Samples Encoding Scheme Based on Pairing Functions and Data Analytics
2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
Abstract—Research into wireless communication and mobile computing is called upon to formulate novel smart mobility solutions that improve the quality of citizens' lives in smart cities. In this context, we elaborate on the role of technologies such as multi-access edge computing (MEC), Internet of Things (IoT) messaging protocols, such as the Constrained Application Protocol (CoAP) and Message Queue Telemetry Transport (MQTT), and virtualization (i.e., digital twins) in the design of a framework enabling the collection and processing of data about the mobility of commuters and public transport vehicles. Such data are intended to feed mobility monitoring and transport planning solutions. A Proof-of-Concept (PoC) is developed to validate the framework under realistic experimental settings. Results in terms of efficiency and effectiveness of the considered messaging protocols are reported.
Index Terms—Multi-access Edge Computing, CoAP, Digital Twin, OMA LwM2M, MQTT, MaaS
978-1-7281-7343-6/20/$31.00 © 2020 IEEE
I. INTRODUCTION
The proliferation of devices equipped with positioning capabilities has recently opened the way to the development of a plethora of location-based applications. Tracking user mobility can help identify the best transport solution to satisfy a user's needs, infer her preferences, and predict future travel demands. Retrieving mobility data from fleets of Public Transport (PT) vehicles can support route planning and enable the design of improved solutions for customers. Both user and vehicle mobility data can feed Mobility as a Service (MaaS) applications tracking the mobility of commuters who need to dynamically compose their trips through solutions of different travel operators, spanning different means of transportation, e.g., bikes, cars, buses, planes, and trains [1], [2].
Besides positioning systems, Information and Communication Technologies (ICT) are needed to enable the collection, delivery, processing, and presentation of the mobility-related information to all the interested parties.
In our previous work [3] we identified prominent solutions addressing most of the aforementioned functionalities by encouraging the joint usage of the Constrained Application Protocol (CoAP) [4], the lightweight messaging protocol specifically devised for Internet of Things (IoT) environments, and the Open Mobile Alliance (OMA) Lightweight Machine-to-Machine (LwM2M) [5] resource description model. This results in interoperable and efficient data retrieval and description. Moreover, there we argued in favour of Multi-access Edge Computing (MEC) [6] facilities to host MaaS services by offering lower latency and higher context-awareness compared to the remote cloud.
In this paper we make a step forward and extend the work in [3] by providing the following main innovative contributions:
• We propose the usage of Digital Twins (DTs), acting as digital counterparts of physical entities, e.g., smartphones of commuters and On Board Systems (OBSs) of PT vehicles. They are in charge of retrieving mobility data of the corresponding physical entities and can be queried by stakeholders interested in such data.
• We design DTs as virtualized applications to be hosted at the network edge. This is a clear departure from the current literature [7], according to which DTs are hosted in the remote cloud. Moreover, we align the deployment to the European Telecommunications Standards Institute (ETSI) MEC reference architecture [8].
• We consider two different messaging protocols. In addition to CoAP [4], coupled with OMA LwM2M investigated in [3], we also leverage Message Queue Telemetry Transport (MQTT) [9].
• We assess the feasibility of the proposed framework through a realistic experimental validation conducted during a trip taken by a bus in an urban environment. During the trip, position information is retrieved through a Global Positioning System (GPS) receiver attached to a Raspberry Pi device [10].
• We quantitatively compare the two considered messaging protocols with respect to efficiency, in terms of bandwidth usage, and effectiveness, in terms of packet reliability.
The rest of the paper is organized as follows. Section II provides background material about the enabling technologies and concepts of the proposed framework, which is then described in Section III. The experimental setup is described in Section IV, as well as the main findings of the evaluation study. Final remarks are reported in Section V.
II. MAIN TECHNOLOGY ENABLERS
A. Multi-access edge computing: a primer
MEC has been proposed by ETSI as a prominent paradigm offering computing and storage resources at the edge of the mobile network, close to the subscribers. It provides high-bandwidth and ultra-low latency access to the radio network as well as to context information, which can be exploited by many
verticals, such as transportation, industrial automation, entertainment, media, and healthcare.
The reference ETSI architecture has been specified in [8] with a set of Application Programming Interfaces (APIs) for key MEC interfaces, along with the main functionalities.
The ME host is the entity that contains the ME platform and a virtualization infrastructure which offers computing, storage, and network resources for the ME applications. Such applications can interact with the ME platform to consume and offer ME services. Services offered by the platform are, for instance, the Radio Network Information Service (RNIS) and the Location Service (LS), respectively providing radio-related and augmented positioning information.
The ME orchestrator is the core of the ETSI MEC architecture. It decides which ME host(s) is (are) the most appropriate one(s) for application instantiation (and relocation) according to application demands (e.g., latency), monitored available resources, and also mobility conditions.
The deployment of ME hosts in the edge domain is operator-specific: an ME host can be associated either with each Base Station (BS) or with a set of them, either covering a small area or offering urban coverage.
B. Digital Twins
Originally developed to improve manufacturing processes, digital twins now have a wider scope, representing the digital replicas of living as well as non-living entities [11]. Such replicas are designed as the semantic description of the sensed physical world [12], also incorporating contextual and sensor data from them. This makes service discovery easier, since metadata is used to index the virtual devices, and the introduced semantic description is able to cope with heterogeneity to provide interoperability among physical devices at a virtual level. As a result, the digital twin enables data to be seamlessly transmitted between the physical and virtual worlds, hence ensuring real-time monitoring of systems, which is helpful for their maintenance and for future upgrades.
The DT concept can also be leveraged in the automotive domain. Starting from collected mobility data, vehicle, bus and truck manufacturers can devise solutions to improve customer satisfaction and vehicle fleet management [7]. Today, the common practice is storing and updating the DTs of vehicles and other physical entities in the remote cloud [7].
III. THE PROPOSED FRAMEWORK
In this paper, we build on the previously described concepts and technologies and apply them to efficiently and effectively track the mobility of commuters and of PT vehicles in order to collect valuable data for PT service monitoring and planning. To this aim, we propose the framework graphically sketched in Figure 1. We distinguish three different domains: (i) the ground domain, with commuters and PT vehicles which are connected to the mobile network; (ii) the edge domain, hosting the virtual counterparts of the physical entities, both for the commuters and the PT vehicles; (iii) the remote applications through which third parties, such as the road authorities, the transport operators, the municipality and the MaaS provider, can get mobility data of commuters/PT vehicles.
A. The ground domain
We track the mobility of vehicles like buses, as well as of commuters. The former are equipped with multi-interface OBSs, whereas commuters carry a smartphone, both acting as User Equipment (UE). Each UE runs a UE App that interacts with the corresponding counterpart in the ME host, which is the DT App.
The OBS is equipped with a GPS receiver and, similarly to [13], with a Bluetooth Low Energy (BLE) beacon that broadcasts a series of identifiers. BLE on board the bus is leveraged to facilitate e-ticketing procedures [13].
Information about the commuter position is retrieved through the GPS receiver of the smartphone. However, whenever the UE App infers the smartphone to be on board the bus, the GPS receiver is switched off to reduce the energy consumption of the device. Hence, the smartphone offloads to the OBS the task of updating the DT with its location data.
Unlike in [3], where the smartphone of the commuter is detected to be on board once it connects to the on-board Wi-Fi network, in this work the smartphones of passengers infer that they are on board the bus when they receive the BLE identifiers.
As also defined in the Android LocationManager¹, in order to affect battery lifetime as little as possible, the UE App on the smartphone communicates with the corresponding DT App only when necessary, i.e., whenever its position changes by at least minDistance meters. The same approach applies to the OBS. Since the OBS is powered, battery consumption is not a concern for it; nevertheless, for the OBS such an approach has the advantage of reducing the amount of data exchanged over the mobile network.
¹ https://developer.android.com/reference/android/location/LocationManager
B. The edge domain
DTs are deployed as virtualized applications at the edge. In particular, without loss of generality, we assume that they are instantiated at the closest ME host. The selection of the most appropriate ME host where the DT of a given commuter/PT vehicle has to be placed is performed by the ME orchestrator and is outside the scope of the present work.
Each DT App interacts with the UE App to get mobility data. Compared to the work in [7], which considers only vehicles as physical entities, we extend the concept of DT to commuters.
Data are locally stored at the edge and periodically transferred to the remote cloud for long-term storage and analytics purposes. Besides benefiting from low-latency access to storage and computing services close to where they are needed, DT applications hosted in MEC facilities can further access additional (context) information and provide accurate positioning via data fusion from multiple available sources, also improved by the ME LS. Retrieved mobility data typically have a city-wide scope and could benefit from local processing [3].
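The update rule stated in the ground-domain subsection (the UE App contacts the DT App only when the position has changed by at least minDistance meters) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `UpdateFilter` class, the use of the haversine formula, and the choice to measure distance from the last reported fix (rather than the last observed one) are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius used by the haversine formula


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class UpdateFilter:
    """Hypothetical helper: decides whether a new GPS fix should be sent to the DT."""

    def __init__(self, min_distance_m):
        self.min_distance_m = min_distance_m
        self.last_sent = None  # last (lat, lon) actually reported to the DT

    def should_send(self, lat, lon):
        # Always report the first fix; afterwards report only sufficiently large moves.
        if self.last_sent is None or haversine_m(*self.last_sent, lat, lon) >= self.min_distance_m:
            self.last_sent = (lat, lon)
            return True
        return False
```

With a filter of this kind, a bus waiting at a stop generates no traffic at all, which is the bandwidth and battery saving the text describes; the threshold value itself is a tuning parameter that the source does not specify.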
C. Interactions between physical devices and DTs
In alignment with the existing literature [14], for the interaction between physical devices and their corresponding DTs we consider two of the most widespread IoT messaging protocols, as detailed in the following.
1) MQTT: The first option relies on MQTT, a lightweight messaging protocol largely used in IoT contexts in the presence of small sensors and mobile devices, optimized for unreliable networks [9]. The MQTT protocol [9] leverages a publish/subscribe approach, in which a client subscribes to a topic and receives notifications via a server whenever a new message is generated on that topic by the node acting as publisher. An MQTT server plays the role of message broker between publishers and subscribers. MQTT leverages the Transmission Control Protocol (TCP) at the transport layer. In addition, it uses three levels of message transmission reliability. With Quality of Service (QoS)=0 (a.k.a. at most once delivery), messages are simply sent once and are not acknowledged. With QoS=1 (a.k.a. at least once delivery), acknowledgements are used and messages are retransmitted if no acknowledgement is received before the expiration of a timeout. With QoS=2 (a.k.a. exactly once delivery), a four-way handshake is used to ensure that a message arrives exactly once.
In our framework, the MQTT publisher is implemented within the UE App. The DT hosts an MQTT subscriber which is interested in retrieving data published by the corresponding physical device (either the smartphone or the OBS). The MQTT broker is deployed as an ME app. Several publishers and subscribers may exchange messages through the same broker.
2) CoAP and OMA LwM2M: As a second option we consider CoAP [4], a well-known protocol which allows IoT devices to operate in a Web-like fashion. CoAP request/response methods allow the interaction between a client, which in the CoAP terminology refers to the node requesting data, and the server, i.e., the node hosting the resource (e.g., the value of a temperature, a location), which is typically an IoT device.
The connectionless User Datagram Protocol (UDP) is leveraged as a transport protocol and therefore retransmissions are managed at the application layer. In particular, applications can send reliable (confirmable) or non-reliable (non-confirmable) CoAP messages. Confirmable messages are retransmitted until acknowledged by the receiver or until a maximum number of retransmissions has been reached.
In addition to the request/response approach typical of the HyperText Transfer Protocol (HTTP), CoAP can also work in a publish/subscribe manner through the OBSERVE extension, which enables efficient asynchronous monitoring of IoT resources. CoAP clients can send a request with an OBSERVE header option to a CoAP resource, described through a Uniform Resource Identifier (URI). The CoAP server tracks such subscriptions and sends a Notification message to the clients whenever the observed resource changes. In the reference scenario of our study, the position of a commuter does not change when she is at the bus stop; the same holds for a PT vehicle stopped at a red traffic light. Hence, the OBSERVE extension saves bandwidth resources and, consequently, the battery of the (mobile) device acting as server, compared to the GET/POST primitives [3].
The OMA LwM2M protocol complements CoAP with a simple object-based resource model to facilitate interoperability in resource description and discovery. Its usage in vehicular domains is also suggested in [3] and [15].
In our framework, we assume that an OMA LwM2M client is implemented within the UE App. It is in charge of retrieving data from the GPS (of the smartphone, of the OBS) and sending them through CoAP to the OMA LwM2M server, which is implemented in the ME host and which the DT has access to.
3) Mobility data: For both messaging protocol options, the same information is transmitted. More precisely, starting from the Location object (object id 6) defined in the OMA LwM2M Object and Resource Registry², only the essential mobility-related data are transmitted. Such fields are Latitude, Longitude, Timestamp and Speed, as reported in Table I and requested through a URI and a topic for the CoAP and OMA LwM2M and the MQTT options, respectively. The meaning of such fields is reported in Table II.
² http://www.openmobilealliance.org/wp/omna/lwm2m/lwm2mregistry.html
TABLE I
LOCATION DATA AND HOW THEY ARE REQUESTED BY THE TWO MESSAGING PROTOCOLS.
Information | Resource ID (CoAP and OMA LwM2M) | OMA LwM2M Obj URI path (CoAP and OMA LwM2M) | MQTT Topic
Latitude    | 0                                | /6/0/                                       | ntf/raspberrypi/6/0/
Longitude   | 1                                | /6/0/                                       | ntf/raspberrypi/6/0/
Timestamp   | 5                                | /6/0/                                       | ntf/raspberrypi/6/0/
Speed       | 6                                | /6/0/                                       | ntf/raspberrypi/6/0/
Adafruit Ultimate GPS Breakout receiver [16], a high-quality and energy-efficient GPS module that can track up to 22 satellites on 66 channels, with an excellent high-sensitivity receiver, and a built-in antenna.
To collect position data to be sent to the DT, a trip of around 7 km has been performed in an urban environment, i.e., the city of Reggio Calabria, along the trajectory shown in Fig. 2. The trip includes bus stops, during which the position is not updated.
Edge facilities hosting the DT App have been emulated through an Asus laptop with an Intel Core i7-6500U CPU, 12 GB RAM and a 512 GB SSD.
The connectivity between the OBS and the ME host is emulated through a wired link, and the tc Linux utility [17] has been used to reproduce different packet loss settings over such a link.
TABLE II 2) Software modules: For the case in which CoAP is se-
M EANING OF THE CONSIDERED Location FIELDS . lected as messaging protocol, our platform relies on Leshan3 ,
the implementation in Java provided by the Eclipse foundation
Information Description
Latitude The decimal notation of latitude. which allows to develop OMA LwM2M-compliant server and
Longitude The decimal notation of longitude. clients. Such implementation covers the majority of the OMA
Timestamp The timestamp of when the location measurement was LwM2M specifications4 . It is based on the Californium CoAP
performed.
Speed The time rate of change in position without regard for implementation.
direction: the scalar component of velocity. Measurements have been performed when the Non-
confirmable Message exchange and the Confirmable Message
exchange options are considered.
D. The remote applications
For what concerns MQTT, we rely on Mosquitto [18], which
Different stakeholders may be interested in the mobility is an open-source implementation of the message broker.
data of commuters and PT vehicles. In order to preserve
the security/privacy of potentially sensitive data related to B. Metrics
commuters and/or owned by transport operators, each DT may The following metrics have been measured:
expose such data to requesting authorized applications through
• Message delivery ratio: it is computed as the ratio between
properly configured views.
the number of messages successfully received at the DT
Traditional HTTP primitives can then be used by the remote
side and the number of messages generated by the UE
applications to query the DTs. Indeed, there is no need for
App during the trip.
lightweight protocols (like CoAP) as when, instead, interacting
• Byte overhead: it is derived as the overall number of
with resource-constrained physical devices.
bytes transmitted by the involved entities, i.e., Leshan
IV. P ERFORMANCE EVALUATION client/server or MQTT publisher/broker, over the num-
A. Experimental setup ber of actual bytes corresponding to the position data
generated by the UE App (i.e., 57 bytes). The metric
The objective of the evaluation study is twofold and specifi-
includes also TCP Acknowledgements in the case MQTT
cally it aims (i) to provide a Proof of Concept (PoC) of the
is considered.
proposed framework by leveraging off-the-shelf components
and (ii) to compare the considered messaging protocols in terms The metrics have been evaluated through the Wireshark5
of effectiveness and efficiency under different settings. protocol analyzer.
More in detail, the analysis focuses on capturing mobility Results averaged over 10 independent experimental runs are
data of a PT vehicle during a bus ride in a urban context. Hence reported.
the interactions between the OBS and the corresponding DT are C. Results
only considered. In so doing, thanks to the offloading policy
for the positioning tasks, the mobility of commuters on board Fig. 3 shows the message delivery ratio for CoAP with Non-
the bus can be also tracked. confirmable Message exchange, CoAP with the Confirmable
1) Hardware components: For the OBS implementation, we Message exchange, and MQTT QoS 0. Both MQTT and CoAP
leveraged a Raspberry Pi [10], an inexpensive, fully customiz- with the Confirmable Message exchange achieve full reliability.
able and programmable single board computer with support 3 https://www.eclipse.org/leshan/
for a large number of input/output peripherals and network 4 https://github.com/eclipse/leshan/wiki/LWM2M-Supported-features
Such reliability is possible in MQTT thanks to TCP-triggered retransmissions, whereas CoAP with the
Confirmable Message exchange emulates TCP acknowledgments at the application layer. If the
unreliable option of CoAP is considered, the percentage of received messages equals the link-layer
reliability settings.

Fig. 3. Message delivery ratio for the three compared messaging protocols under different packet
loss settings.

Fig. 4 reports the byte overhead metric. It can be observed that although providing the same
reliability performance, the byte overhead incurred by CoAP with the Confirmable Message

V. CONCLUSION AND FUTURE WORKS
In this paper we have presented a framework to track commuters’ and PT vehicles’ mobility. The
proposal builds upon emerging IoT technologies for the collection, delivery, processing and
presentation of mobility-related data to serve MaaS applications and services by other interested
stakeholders. The implementation of a realistic PoC confirms the viability of the proposal and
provides helpful insights about the effectiveness and efficiency of the candidate messaging
protocols for the interactions of physical devices with the corresponding DTs.
Preliminary results about the computation footprint of the DT application (not shown in the paper)
indicate that it is negligible. Hence, as future work we plan the deployment of the DT application
as a Docker container, as well as the evaluation of its memory and CPU footprint when varying the
available processing resources and the number of commuters for a given ME host, to figure out
potential scalability issues for the actual deployment.

ACKNOWLEDGMENT
This work has been partially supported by the “Mobility for Passengers as a Service” (MyPasS)
project, funded by the Italian Government (through the PON 2014-2020 initiative).
REFERENCES
[1] G. Smith, J. Sochor, and I. M. Karlsson, “Mobility as a service: De-
velopment scenarios and implications for public transport,” Research in
Transportation Economics, vol. 69, pp. 592–599, 2018.
[2] A. Nikitas, I. Kougias, E. Alyavina, and E. Njoya Tchouamou, “How
can autonomous and connected vehicles, electromobility, brt, hyperloop,
shared use mobility and mobility-as-a-service shape transport futures for
the context of smart cities?” Urban Science, vol. 1, no. 4, p. 36, 2017.
[3] C. Campolo, D. Cuzzocrea, G. Genovese, A. Iera, and A. Molinaro, “An
OMA lightweight M2M-compliant MEC framework to track multi-modal
commuters for MaaS applications,” in 2019 IEEE/ACM 23rd International
Symposium on Distributed Simulation and Real Time Applications (DS-
RT). IEEE, 2019, pp. 1–8.
[4] C. Bormann, A. P. Castellani, and Z. Shelby, “CoAP: An application
protocol for billions of tiny internet nodes,” IEEE Internet Computing,
vol. 16, no. 2, pp. 62–67, 2012.
[5] “Open Mobile Alliance, Lightweight Machine to Machine Technical
Specification Core; v1 1-20180612-c,” 2018.
[6] Q.-V. Pham, F. Fang, V. N. Ha, M. Le, Z. Ding, L. B. Le, and
W.-J. Hwang, “A survey of multi-access edge computing in 5G and
beyond: Fundamentals, technology integration, and state-of-the-art,” arXiv
preprint arXiv:1906.08452, 2019.
[7] D. Persson Proos and N. Carlsson, “Performance comparison of messaging
protocols and serialization formats for digital twins in IoV,” in IFIP
Networking, 2020.
[8] “ETSI GS MEC 003 v1.1.1. Mobile Edge Computing (MEC); Framework
and Reference Architecture,” March 2016.
[9] A. Banks and R. Gupta, “MQTT version 3.1.1,” OASIS standard, vol. 29,
p. 89, 2014.
[10] “Raspberry pi, https://www.raspberrypi.org/.”
[11] A. El Saddik, “Digital twins: The convergence of multimedia technolo-
gies,” IEEE MultiMedia, vol. 25, no. 2, pp. 87–92, 2018.
[12] M. Nitti, V. Pilloni, G. Colistra, and L. Atzori, “The virtual object as a
major element of the internet of things: a survey,” IEEE Communications
Surveys & Tutorials, vol. 18, no. 2, pp. 1228–1240, 2015.
[13] G. Tuveri, M. Garau, E. Sottile, L. Pintor, M. Gravellu, L. Atzori, and
I. Meloni, “Automating ticket validation: A key strategy for fare clearing
and service planning,” in 2019 6th International Conference on Models
and Technologies for Intelligent Transportation Systems (MT-ITS). IEEE,
2019, pp. 1–10.
[14] Z. Laaroussi, R. Morabito, and T. Taleb, “Service provisioning in vehicu-
lar networks through edge and cloud: an empirical analysis,” in 2018 IEEE
Conference on Standards for Communications and Networking (CSCN),
pp. 1–6.
[15] S. K. Datta, J. Haerri, C. Bonnet, and R. F. Da Costa, “Vehicles as
connected resources: Opportunities and challenges for the future,” IEEE
Vehicular Technology Magazine, vol. 12, no. 2, pp. 26–35, 2017.
[16] Adafruit. Adafruit Ultimate GPS Breakout. [Online]. Available:
http://www.adafruit.com/product/746#description-anchor
[17] M. A. Brown. Traffic control HOWTO, 2017. [Online]. Available:
http://www.tldp.org/howto/traffic-control-howto/
[18] “Mosquitto, MQTT open-source implementation, https://mosquitto.org/.”
Abstract—Recognizing the need for proactive analysis of cyber adversary behavior, this paper
presents a new event-driven simulation model and implementation to reveal the efforts needed by
attackers who have various entry points into a network. Unlike previous models which focus on the
impact of attackers’ actions on the defender’s infrastructure, this work focuses on the attackers’
strategies and actions. By operating on a request-response session level, our model provides an
abstraction of how the network infrastructure reacts to access credentials the adversary might have
obtained through a variety of strategies. We present the current capabilities of the simulator by
showing three variants of Bronze Butler APT on a network with different user access levels.
Index Terms—DEVS, cybersecurity, adversary behavior, APT

This research was supported by ERDF "CyberSecurity, CyberCrime and Critical Information
Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822), and by US NSF Award
# 1742789.
978-1-7281-7343-6/20/$31.00 ©2020 IEEE

I. INTRODUCTION
Historically, cybersecurity research on adversary behavior was reactive rather than proactive. This
is because most of the automated attacks were built around a limited set of vulnerabilities and
followed a predefined tree of actions in a fire-and-forget manner. Recognizing these actions and
tracing them in system artifacts was therefore usually enough to either prevent the attacks or
predict their evolution. Complex and creative attacks were deemed a domain of trained human
professionals and were analyzed only partially as their action-effect components. This changed,
however, with the advent of Advanced Persistent Threats (APT), typically reflecting malware and
tactics used by state-sponsored actors. This type of malware, e.g., Stuxnet [1], which is infamous
for its destruction of Iranian nuclear centrifuges, exhibits traits attributed to human attackers
and often favours stealth above all else. Due to the relative rarity of such malware and limited
observation of its effects, reactive approaches are limited and effective defense needs proactive
approaches to simulate and evaluate adversarial behavior. Existing works mostly focus on the impact
of attackers’ actions and not on the actions and attack strategies themselves. This work addresses
such limitations in the current state of cyber attack simulation research.
To address the lack of appropriate adversary-focused simulation tools, this paper brings two main
contributions:
• Introduction of the concept and the implementation of a new simulation model, enabling evaluation
  of adversary behavior on the session level.
• Enabling integration of different attack models within one simulation engine, demonstrating its
  flexibility.
This paper is structured as follows. Section II provides a review of relevant state of the art. In
Section III, we introduce the proposed simulation model and describe its implementation and
integration with different attack models. Section IV presents our case study referencing the Bronze
Butler APT (BB) and its implementation in the simulator engine. In Section V, we evaluate the
simulator engine by showing and reasoning about different BB attack strategies using random and
learning attackers. We conclude the paper and discuss future opportunities in Section VI.

II. STATE OF THE ART
There exist several approaches to simulate the behavior of adversaries in networked systems. Some
are designed specifically to test intrusion detection systems (IDS), e.g., [2], while others
expertly define static attack scenarios with little configurability [3]. Some use real networks
(virtual machines) [4], and the data is often tailored to a specific type of attack [5], [6]. In
this section, we review three branches of approaches modelling the interactions between adversaries
and the networked systems, which are relevant to the adversary simulation approach we introduce in
this paper.

A. Attack Graphs
Attack graphs were first proposed by Swiler et al. [7] and are used to simulate steps an attacker
can take within the infrastructure. They describe an abstracted network topology and show the
nodes, paths and consequences of network attacks. Once an attack graph is constructed, it enables
various tasks of network security analysis. The use of attack graphs is a widely researched topic
including attack graph generation [8], [9], application scenarios [10], [11] and analytic methods
[12]. To support automatic graph generation, tools such as MulVAL, NetSPA, or TVA were developed,
as summarized in [13].
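As a rough illustration of the idea, an attack graph can be held as a directed graph whose edges are attacker steps, and the possible attack paths can be enumerated by depth-first search. The sketch below is a toy example: the host names and edges are hypothetical and do not reproduce the output or the input format of any of the cited tools.

```python
from typing import Dict, List, Optional

# Hypothetical attack graph: an edge u -> v means that an attacker
# who controls u can take a step that gives control of v.
ATTACK_GRAPH: Dict[str, List[str]] = {
    "internet": ["web_server", "vpn_server"],
    "web_server": ["db_server"],
    "vpn_server": ["workstation"],
    "workstation": ["db_server"],
    "db_server": [],
}


def attack_paths(graph: Dict[str, List[str]], source: str, target: str,
                 path: Optional[List[str]] = None) -> List[List[str]]:
    """Return all simple paths from source to target (depth-first search)."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    paths: List[List[str]] = []
    for nxt in graph.get(source, []):
        if nxt not in path:  # skip nodes already on the path to avoid cycles
            paths.extend(attack_paths(graph, nxt, target, path))
    return paths


for p in attack_paths(ATTACK_GRAPH, "internet", "db_server"):
    print(" -> ".join(p))
```

Note that the whole graph has to be known before any path can be listed, which is the kind of complete knowledge about the infrastructure that such models depend on.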
The downside to attack graphs is that they require the totality of knowledge about the target
infrastructure and known vulnerabilities. While they model possible attacker actions, they are in
effect centered on the defense and represent a vulnerability model rather than an attack model.
They offer only limited options to analyze adversarial behavior.

B. Game Theoretic Approaches
Game theoretic approaches applied to cyber security are well researched and traditionally involve
an attacker-defender model where the defender optimizes their defensive strategy for risk
minimization [14]–[17] or maximizes the uptime of network assets [18]–[20]. The games played rely
on some degree of information sharing (complete or incomplete), where typically the defender
observes the attacker and responds according to their objective function [21]. Many works aimed at
the attacker either focus on a specific attack type like distributed denial of service (DDoS),
which abuses a large number of machines to disable a target service by overloading it with requests
[22], or rely on unspecified mission models [23]. Liang et al. mention that the attacker-defender
model specifically for impact assessments requires extensive data to understand the dynamic
relationships between the attacker and defender, creating complex models that may or may not have a
solution [21].

C. Simulation Approaches
Similar to game theory, the impacts of attacks and attackers can be realized through the use of
configurable cyber attack simulation platforms. NeSSi2, an agent-based simulation platform by
Grunewald et al. [24], models a packet-level description of a network with the primary focus on
simulating the effects of DDoS attacks. NeSSi2 models the effects of various worm behaviors and how
worms propagate through a network. This technique proves to be useful in other contexts such as
smart grid networks [25]. Moskal et al. [26], [27] present a knowledge-based cyber attack simulator
CASCADES, where the attacker’s actions are determined by the “Attacker Behavior Model” (ABM) and
the knowledge obtained about the target network through performing actions on the network. CASCADES
focuses on a Monte-Carlo style approach to attack simulation and generates thousands of plausible
attack scenarios given the ABM and a detailed network description known as the Virtual Terrain
(VT).
Several other simulation platforms exist and are mostly based on a discrete event formalism, e.g.,
Chi et al. [28], Liljenstam et al. [29], Futoransky et al. [30], and Kuhl et al. [31]. These
platforms offer synthetic network and attack emulation and evaluation. To overcome limited realism,
emulators using virtual machines and integrated with offensive tools are available, such as DCAFE
[32] and SVED [33]. It is also worth noting that others have experimented with cybersecurity
simulations based on general-purpose simulators such as OMNeT++ [34].

III. SIMULATION MODEL
In this section we present a new non-stochastic simulation model based on discrete event formalism,
which enables synthetic network attack emulation and evaluation, dynamic selection of attack
models, and integration with non-simulated IDS systems. This model thus occupies a space between
various simulation models and tools described earlier.
The simulator implementing the model as well as the evaluation scripts can be freely downloaded
from here: https://muni.cz/go/565e43

A. Model goals
This work aims at developing a cyberattack simulator that models the interactions between the
progression in adversary intended outcomes and the network session level responses. Here, the
session level means that the units of interaction between adversaries and the attacked environment
are requests and responses, i.e., a rough equivalent of TCP sessions. Such interactions are meant
to maximize autonomy for both the attackers and defenders. The model described in this paper is a
step towards the longer-term and broader goals to enable:
• lightweight simulation of multi-agent cybersecurity scenarios,
• integration of different attack models,
• non-stochastic simulation of interaction between attackers and defenders for in-depth analysis of
  attack strategies,
• rapid prototyping of attack and defense strategies,
• *smooth transition of simulated actors into emulated and real-world settings,
• *modelling of environments, which can be emulated in virtual environments using already provided
  data,
• *integration of simulation and emulation to remove the need to re-implement existing
  cyberdefense mechanisms.
Note that the last three goals (marked with *) are outside of the scope of this paper; yet they
influence the current model design and development. Section VI briefly describes the relevant
projects and activities beyond this paper linking to the long-term goals.

B. Model components
The simulation model adopts the message-based approach and consists of a number of components,
which can be divided into four levels: environment, network, host, and logical.
1) Environment level: The environment level is the top-most layer of components, which are used for
orchestration of particular scenario runs. There are two components present: message and
environment.
Message is a unit of information exchanged between actors in the simulation. The message carries
routing and statistical information, activity descriptions and actors’ responses.
Environment keeps track of all simulation elements, manages interaction between these elements by
passing messages, controls the simulation time and evaluates the impact of actors’ activities. It
is the only point of interaction between actors and the simulation.
2) Network level: The network level represents the topology of simulation. The components used
mimic the components of the network, with some simplifications enabled by the conceptual level the
simulation happens on. The components are: nodes, firewalls, connections, routers, and sessions.
Nodes represent physical or virtual machines. Each node is accessible from the outside via a set of
network ports, which are a simplification of an Ethernet port, i.e., these ports have an IP address
and can be uniquely identified (although no explicit MAC addressing is used).
Firewalls function as their real-world counterparts by controlling inbound and outbound messages.
They implement an equivalent of a simplified filter table of iptables with source-destination
filtering and default filtering policies.
Connections represent links between ports of particular nodes. Messages go through the connections
and can be affected by connection properties.
Routers partition the networks. Unlike the real network settings, they are the only active
switching elements. Routers control permeability between different networks and enable fine-grained
control depending on both sources and destinations of messages.
Sessions represent sets of connections going through the network, which are not subject to routing
policies in the intermediate routers. An example of a session is a VPN tunnel or a tunnel through
several layers of NAT. The name stems from attacker taxonomy, where an attacker exploits weaknesses
in infrastructure to open sessions to or from their targets, which would be otherwise prohibited by
intermediate active network elements.
3) Host level: The host level covers activities happening at a node. In addition to network ports,
a node is modelled as a set of services representing running processes. A node does not define an
OS, as this is instead expressed as a set of services representing OS functions required for
simulation. Services come in two variants, which are the components of the host level: active
services and passive services.
Active services can initiate the communication with their surroundings by sending messages through
the environment. They also process inbound messages and react according to their programmed
behavior. Thus, they must understand the semantics of the incoming messages. Typical examples of
active services are attackers and defenders, i.e., actors whose behavior is the focus of a
simulation.
Passive services, on the other hand, do not initiate communication. Inbound messages are instead
evaluated by the environment based on the definition of a passive service. The definition contains
service name, version, ability to create sessions, locality, etc. Passive services are used to
create a believable environment for the active services, while not pushing the burden of
implementation on the user of the simulator.
4) Logical level: The logical level represents the activity domain for active actors, and relations
between scenario elements and scenario goals. There are four components: data, authorizations,
exploits, and actions.
Data represent units of information, which may be interesting to an attacker, such as trade secrets
or employee records. Obtaining data does not help an attacker with the actual attack, but they may
be essential to reaching the goal of the given scenario.
Authorizations encode the ability of actors to access particular services or data. They can be
defined in the scenario configuration or they can be created as a result of actors’ activities,
e.g., a new authorization resulting from successful privilege escalation. Terminology-wise, they
conflate both authorization and authentication for the sake of simplicity.
Exploits represent mechanisms to abuse vulnerabilities of particular services and are tied to the
name and the version of a service. To enable better machine reasoning about exploits and their
effects, they are categorized by their effect and locality and allow only limited parametrization
to create a bounded exploit domain from which an attacker can choose. The exploits are expected to
map to real-life exploits, such as those listed in services like National Vulnerability Database
[35] or Common Vulnerabilities and Exposures [36].
Actions represent the type of activity of actors. They can comprise anything from getting the
simulation time to launching a DDoS attack. They are a means to express an attack model within a
simulator. In this case, we understand the attack model as an abstraction of activities an attacker
can perform. The actions are then the elements of the action space defined by a particular model.
One such model, which we used for our simulation evaluation, is presented in the following text.

C. Attack model
Defining the action space of the adversary is a particularly challenging task for cyber-attack
simulators as the action space is effectively infinite, constantly expanding, and extremely diverse
in the types of actions that can be performed. Modelling each vulnerability in the simulator is
time-consuming and unsustainable, so we choose to represent the action space as an abstraction of
the objective or intent of an attacker given the simulated attack stage of the attacker. The
Action-Intent Framework (AIF) [37] is a cyber-attack action classification framework where the
focus is to describe attack actions with respect to the intended objective of performing a specific
action such as: information discovery, privilege escalation, data exfiltration, etc. The AIF
differentiates itself from other attack descriptions by providing significantly more detail than
typical Cyber Attack Kill Chains, while finding a middle ground between these and the highly
detailed MITRE ATT&CK® [38] by remaining network and service agnostic.
The AIF is broken up into two layers of abstraction: the Macro Action-Intent States (Macro-AIS)
describe the effect of the actions at a high level such as reconnaissance or destroy information,
whereas the Micro Action-Intent States (Micro-AIS) describe the method used to achieve the
corresponding Macro-AIS. An example is a brute-force credential access Micro-AIS for the privilege
escalation Macro-AIS. We use the Micro-AIS to represent our desired simulated attack scenarios and
as a method to select simulated actions given the network topology and the services running on the
network. Given the session-based approach of our proposed simulation architecture the abstracted
model of the attacker’s process will allow for attack scenarios to be quickly created and then
applied to the network topology without the need for detailed
exploit definitions. In Section IV we demonstrate how a known description of a real cyber-attack
can be described using the AIF and then we use that description as the driving force of our
simulation engine.

Fig. 1. Diagram of model components

D. Integration of Components and Simulation Execution
Fig. 1 illustrates the component relations and how messages traverse between components. When a
simulation run begins, all active services are executed. The services produce messages, which are
inserted into environment queues and distributed on a hop-by-hop basis to their intended targets.
Thus the message transport mimics the packet transport over a network. The messages trigger
component responses, simulating the actions and responses when the network is attacked.
A message passing a connection can arrive at a router, active service, or passive service. Each
component computes a simulation-relative processing time, which models link delays and processing
complexity. On arriving at a router, the message can either be forwarded or dropped based on
firewall and routing rules. If the message arrives at a passive service, it is evaluated by the
environment, which has an implementation of attack models’ semantics (in our case the AIF). The
model’s implementation decides on the response given the action (in our case a Micro-AIS), message,
and passive service parameters. If the message arrives at an active service, the service decides on
the response based on the observable properties of the message. Note that active services do not
have access to the action description, as it would be equivalent to knowing an attacker’s intent
just by looking at the packet and would bypass the hard problem of cybersecurity analysis.

IV. CASE STUDY: BRONZE BUTLER APT
To present the expressive power of the simulation engine and to show how it can be used to reason
about attackers’ and defenders’ abilities, we consider various scenarios simulating Bronze Butler
APT breaching into an organization with the goal of data theft. We model the Bronze Butler APT with
the Micro-AIS described in Sec. III-C and compare it to the MITRE ATT&CK framework [39]. We then
present the common network topology and an attack graph modelling the three variants of the breach
scenario, which we further discuss.

tween 2012–2017. Bronze Butler has been reported to use a variety of spearphishing techniques,
remote access exploits, and web-based zero-day malware to target high-profile executives to obtain
sensitive business strategies and sales information. Depending on the target, Bronze Butler
employed two techniques to gain initial access to their target: 1) a spearphishing email to an
executive with a malicious attachment [40], or 2) exploited VPN services to gain access to the
target network [41]. The end goal of Bronze Butler is to exfiltrate critical business or user
information through the use of file-share servers.
We choose to use Bronze Butler for our case study as the techniques employed by Bronze Butler are
sufficiently complex to demonstrate the capabilities of our simulation engine, exhibiting distinct
behaviors that are well represented in attack action descriptions such as MITRE ATT&CK and the AIF.
Bronze Butler consists of a team of highly skilled attackers. However, we abstract the behaviors of
Bronze Butler as a single entity and represent the behaviors as a set of Micro-AIS to represent
their scenario in our simulation engine. Using the threat reports from SecureWorks [41] and
technique description from MITRE ATT&CK [42], we map attack action evidence to a corresponding
Micro-AIS to capture some of the key behavioral properties of Bronze Butler that will be used as
the basis of our simulation experiments.
Table I summarizes Bronze Butler capabilities in terms of MITRE ATT&CK, the AIF, and the simulation
engine. The table demonstrates that the simulator using Micro-AIS as an attack model is able to
simulate most of Bronze Butler behavior, with the exception of user interaction and host-level
interaction, by means of simulated actions and their parameters.

B. Network Topology and Access Control
To emulate the various ways in which Bronze Butler can penetrate a network, we prepare a
small-scale network as depicted in Fig. 2. The topology is partitioned into four logical segments,
separated by routers with firewalls. The first segment is outside of the organization and
represents the attacker. Note that Bronze Butler may compromise the organization’s partner in a
different network domain. For simplicity, we consider all external sources in the same network. The
second segment is the DMZ with a Web server, VPN server and an Email server. Each server has two
network interfaces, one accessible from the outside and the other accessible from the inside. The
third segment contains the desktop machines of an employee and a CTO. Machines in this segment can
access the DMZ and can only be accessed from the SRV segment. The fourth is the SRV segment
containing an API gateway to the organization’s services, a database server (DB), and a domain
controller (DC). This segment is accessible from the DMZ and from the CTO’s PC. The API gateway is
also accessible
TABLE I
BRONZE BUTLER APT CAPABILITIES IN TERMS OF MITRE ATT&CK FRAMEWORK, THE AIF, AND THE SIMULATION ENGINE.

ATT&CK Technique | Technique example | Micro-AIS
-----------------|-------------------|----------
T1087 | used net user /domain to identify account information. | information discovery
T1088 | malware xxmm contains a UAC bypass tool for privilege escalation. | user privilege escalation
T1003 | used various tools to perform credential dumping. | information discovery
T1005 | exfiltrated files stolen from local systems. | data exfiltration
T1039 | exfiltrated files stolen from file shares. | data exfiltration
T1140 | downloads encoded payloads and decodes them on the victim. | lateral movement
T1083 | collected a list of files from the victim and uploaded it to its C2 server, and then created a new list of specific files to steal. | data exfiltration
T1107 | uses command to delete the RAR archives after they have been exfiltrated. | data destruction
T1097 | created forged Kerberos Ticket Granting Ticket (TGT) and Ticket Granting Service (TGS) tickets to maintain administrative access. | root privilege escalation
T1060 | used a batch script that adds a Registry Run key to establish malware persistence. | lateral movement
T1105 | used various tools to download files, including DGet (a similar tool to wget). | lateral movement
T1018 | use ping and Net to enumerate systems. | host discovery
T1053 | used at and schtasks to register a scheduled task to execute malware during lateral movement. | lateral movement
T1102 | MSGET downloader uses a dead drop resolver to access malicious payloads. | lateral movement
T1113 | used a tool to capture screenshots. | information discovery
T1124 | used net time to check the local time on a target system. | -
T1210 | used a CVE-2016-7836 to exploit VPN connection | command and control
T1024 | used a tool called RarStar that encodes data with a custom XOR algorithm when posting it to a C2 server. | -
T1193 | used spearphishing emails with malicious Microsoft Word attachments to infect victims. | -
T1036 | given malware the same name as an existing file on the file share server to cause users to unwittingly launch and install the malware on additional systems. | -

Row-group labels: Simulated actions; Action parametrization; User.
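The mapping in Table I can be treated as a simple lookup from ATT&CK technique to Micro-AIS. The following is a minimal sketch of that idea in Python; the dictionary is transcribed from the table, while the names TECHNIQUE_TO_MICRO_AIS, micro_ais_histogram, and unmapped_techniques are illustrative and are not part of the simulator described in the paper.

```python
from collections import Counter

# Technique ID -> Micro-AIS, transcribed from Table I.
# None marks the rows where the table shows "-" (no Micro-AIS assigned).
TECHNIQUE_TO_MICRO_AIS = {
    "T1087": "information discovery",
    "T1088": "user privilege escalation",
    "T1003": "information discovery",
    "T1005": "data exfiltration",
    "T1039": "data exfiltration",
    "T1140": "lateral movement",
    "T1083": "data exfiltration",
    "T1107": "data destruction",
    "T1097": "root privilege escalation",
    "T1060": "lateral movement",
    "T1105": "lateral movement",
    "T1018": "host discovery",
    "T1053": "lateral movement",
    "T1102": "lateral movement",
    "T1113": "information discovery",
    "T1124": None,
    "T1210": "command and control",
    "T1024": None,
    "T1193": None,
    "T1036": None,
}


def micro_ais_histogram(mapping):
    """Count how many techniques map onto each Micro-AIS."""
    return Counter(ais for ais in mapping.values() if ais is not None)


def unmapped_techniques(mapping):
    """Return the techniques that have no Micro-AIS in the table."""
    return sorted(tid for tid, ais in mapping.items() if ais is None)


if __name__ == "__main__":
    for ais, count in micro_ais_histogram(TECHNIQUE_TO_MICRO_AIS).most_common():
        print(f"{ais}: {count}")
    print("no Micro-AIS:", unmapped_techniques(TECHNIQUE_TO_MICRO_AIS))
```

Such a lookup makes it easy to see which techniques share one simulated action and which ones are not covered by a Micro-AIS.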
via a VPN tunnel from the outside. The simulated machines are populated with various services and access credentials to give the attacker multiple paths through the network.

To limit the scope of the demonstration and to better reason about the simulation engine, the modelled network does not contain any active defenses. The defenses are passive and based on network and host access control. While the resulting configuration cannot be considered secure, it will be shown later that it is hardened enough to resist inept attack attempts.

C. Attack Graph and Scenario Variations

The combination of the network configuration, the deployed services, and the access credentials on the simulated hosts gives Bronze Butler multiple ways to achieve the ultimate goal of exfiltrating data from the DB server. We have manually crafted an attack graph from the total knowledge of the scenario. The attack graph covering all shortest paths is depicted in Fig. 3. To preserve the clarity of the graph, the paths that do not lead to the goal are excluded; these paths, however, can be explored by the attacker in the simulation and can greatly prolong the attack duration. This can lead to counter-intuitive behavior of an attacker with elevated privileges, which is discussed later in Section V-C.

Despite the possible variations, each path in the attack graph starts with a successful spearphishing attempt, because it is the predominant entry point for the Bronze Butler APT. Note that the current simulation concentrates on modeling the interactions between the attacker and the network, and does not include, for example, the exact user interactions with the attacker’s phishing emails. For each scenario variant the attacker begins with the simulated artifact of the phishing attempt: an opened session to the target machine.

There are three main variants of the scenario, which differ by the successfully spearphished machine. The first one represents a successful attack on the CTO of the organization, who can access the DB server and also has the necessary credentials to extract the goal data from the database. The second one represents a successful attack on the partner, who has API access via VPN into the SRV segment. The attacker has to use the API server as a stepping stone to get access to the domain controller and then forge a golden Kerberos ticket, which is then abused to get access to the data. The most complicated variant begins with a successful attack on an employee. The attacker has to go through the Web server in the DMZ, discover domain controller credentials there, and continue the attack inside the SRV segment as in the previous case.

These three variants require the attacker to execute 2(3), 6(7), and 7(9) appropriate actions, respectively, to achieve the goal. The numbers above represent the minimal number of steps in the attack graph from the given start to the terminal node. The numbers in parentheses also include the reconnaissance steps, which would be necessary in a real-world setting. Note that each step could actually be one or many activities in the simulation, especially in the case of reconnaissance.

V. EVALUATION AND DISCUSSION

In this section, we introduce three different implementations of Bronze Butler, each representing a different attack strategy: a scripted, a random, and a learning attacker. We deploy these attackers into the simulator and let them attempt the three scenario variants. By analyzing the results and extracting insights into the attack strategies, we demonstrate how the simulator can be used to reason about particular attack strategies and about attackers’ behavior.

A. Scripted attacker

The scripted attacker represents the idealized situation and follows the shortest path in the attack graph from each of the
[Figure labels: DMZ, SRV, VPN, Partner, Mail, PC, Employee, CTO]
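The access control described in Sec. IV-B can be sketched as a small allow-list over segments and hosts. The example below is only an illustration under stated assumptions: the host names, the selector syntax ("seg:" and "host:"), the segment name "PC" for the desktop segment, and the choice to allow traffic within a segment are all hypothetical and are not taken from the simulator; only the reachability rules themselves follow the text.

```python
# Hypothetical host -> segment assignment following Sec. IV-B.
# OUT holds all external sources (attacker and partner), as in the paper.
SEGMENT_OF = {
    "attacker": "OUT", "partner": "OUT",
    "web": "DMZ", "vpn": "DMZ", "mail": "DMZ",
    "employee_pc": "PC", "cto_pc": "PC",
    "api": "SRV", "db": "SRV", "dc": "SRV",
}

# (source selector, destination selector) pairs that are allowed.
ALLOW_RULES = [
    ("seg:OUT", "seg:DMZ"),      # DMZ servers expose an outside interface
    ("seg:PC", "seg:DMZ"),       # desktops can access the DMZ
    ("seg:SRV", "seg:PC"),       # desktops are reachable only from SRV
    ("seg:DMZ", "seg:SRV"),      # SRV is accessible from the DMZ
    ("host:cto_pc", "seg:SRV"),  # ... and from the CTO's PC
    ("seg:OUT", "host:api"),     # API gateway via the VPN tunnel
]


def _matches(selector, host):
    """Check whether a host matches a 'seg:NAME' or 'host:NAME' selector."""
    kind, name = selector.split(":", 1)
    if kind == "host":
        return host == name
    return SEGMENT_OF[host] == name


def is_allowed(src, dst):
    """Return True if the rules let src open a connection to dst."""
    if SEGMENT_OF[src] == SEGMENT_OF[dst]:
        return True  # assumption: no filtering inside a segment
    return any(_matches(s, src) and _matches(d, dst) for s, d in ALLOW_RULES)


if __name__ == "__main__":
    for pair in [("attacker", "web"), ("attacker", "db"),
                 ("cto_pc", "db"), ("employee_pc", "db")]:
        print(pair, is_allowed(*pair))
```

Under these rules the CTO's PC reaches the DB server directly, whereas the employee's PC has to go through the DMZ, which matches the difference between the scenario variants.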
three starting points, as depicted in Fig. 3. This attacker type is implemented as omniscient, i.e., knowing the topology of the infrastructure and all system weaknesses, so it does not need to perform reconnaissance tasks. Therefore, its number of actions is the lower bound on the actions needed to finish each scenario variation. For the three scenario variants (CTO, VPN, and Employee), the abstract actions are:

• CTO: Acquire CTO credentials and Exfiltrate data from the DB server (2 total).
• VPN: Access infrastructure via VPN exploit, Acquire DC credentials, Establish session to the DC, Get root access to the DC, Get the golden ticket, Exfiltrate data from the DB server (6 total).
• Employee: Acquire Employee credentials, Establish session to the Web server, Acquire DC credentials, Establish session to the DC, Get root access to the DC, Get the golden ticket, Exfiltrate data from the DB server (7 total).

The total counts shown above are the minimal steps for each variant, and the attacker cannot take a shorter path due to the lack of access or authorization. It is not surprising that the CTO case presents a much shorter path than the other two in this idealized setting. In reality, however, getting the CTO’s credentials might be harder than getting those of one of the many employees, especially if the CTO is well versed in cybersecurity hygiene.

Note that the minimal step counts will be used to calculate the so-called normalized attack difficulty, i.e., the average number of attack actions between advancing from one step to the next in the attack graph. This will enable a comparison of the effort, and thus the difficulty, for the attacker to achieve the ultimate goal in each of the CTO, VPN, and Employee cases when the idealistic assumption is lifted.

B. Random attacker

The random attacker, as the name implies, selects random actions from the entirety of the action space until the goal is reached or the number of actions in a run exceeds a given threshold. The network being considered seems to be small scale, but there is a significant number of sessions and accesses for each attack step. While not resembling real-world attackers, the random attacker provides an important benchmark to evaluate:

• correctness of the system behavior through fuzzing of the simulation environment.
• complexity and effect of different scenarios.
• efficiency of different attack strategies.

Fig. 4 illustrates how the random attacker is used to evaluate the complexity of the action space and the impact of different strategies applied to the underlying CTO scenario variant. It shows the number of actions needed to reach the goal over 100 runs. Note that the CTO scenario requires a sequence of only two correct actions to reach the goal. Yet it can take
[Figure node labels: SRV:DC credentials extracted; SRV:DC session established; SRV:DC Administrator privileges obtained; Golden ticket generated]
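The minimal step counts of the scripted attacker and the normalized attack difficulty can be illustrated with a short sketch. The graph below is assembled from the abstract actions listed for the three variants; the node names and both function names are hypothetical, and the NAD formula (actions taken divided by the minimal number of steps) is one reading of the definition in the text, not necessarily the exact computation used by the authors.

```python
from collections import deque

# Edges follow the abstract action lists of the CTO, VPN and Employee variants.
# Node names are illustrative; each edge stands for one abstract action.
ATTACK_GRAPH = {
    "start:CTO": ["cto_credentials"],
    "cto_credentials": ["data_exfiltrated"],
    "start:VPN": ["api_access"],
    "api_access": ["dc_credentials"],
    "start:Employee": ["employee_credentials"],
    "employee_credentials": ["web_session"],
    "web_session": ["dc_credentials"],
    "dc_credentials": ["dc_session"],
    "dc_session": ["dc_root"],
    "dc_root": ["golden_ticket"],
    "golden_ticket": ["data_exfiltrated"],
}


def minimal_steps(graph, start, goal="data_exfiltrated"):
    """Breadth-first search for the shortest number of steps to the goal."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"goal {goal!r} is not reachable from {start!r}")


def normalized_attack_difficulty(actions_taken, steps):
    """Average number of actions spent per step of the attack graph."""
    return actions_taken / steps


if __name__ == "__main__":
    for variant in ("CTO", "VPN", "Employee"):
        print(variant, minimal_steps(ATTACK_GRAPH, f"start:{variant}"))
```

The search returns 2, 6, and 7 steps for the CTO, VPN, and Employee variants, matching the totals given in the list of abstract actions.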
a significant number of actions to reach the goal due to the large action space the CTO has access to. We consider the following random attacker strategies (rules followed by the random attacker) to reduce the action space:

• Known services: the attacker targeted only services it knew were running on particular hosts.
• Live machines: the attacker did not retry a combination of a session and a target if it received a network failure.

[Fig. 4 plot: x-axis categories: No reduction strategy, Known services, Live machines, Live machines + Known services; y-axis ticks from 1000 to 150000]

Fig. 4. Number of actions needed to reach the goal when simulating different random CTO attacker strategies and their combinations. Each box-plot represents 100 runs with up to 150,000 actions.

Fig. 4 provides insight into the impact of particular strategies as well as into the usage of randomized attackers to test simulated cyberattack scenarios. To begin with, having a random attacker utilize the entire action space without any effort to reduce it is pointless. The current scenario variant had at least 1.5 million possible actions which could double with each

reward values for successful and unsuccessful activities, it is able to compute the upper confidence bound and choose the next action accordingly. To prevent nonsensical ordering of actions, such as data deletion before extraction, it uses fixed action priorities. Essentially, the learning attacker acts more intelligently by selecting the viable targets and sessions. This is a step closer to mimicking real-world attacks and is compared with the random attacker on the number of actions needed to reach the goal when interacting with the network components.

We formulate the following two hypotheses:

• On average, the learning attacker will require considerably fewer actions to finish the scenario than a random attacker employing the live machines strategy.
• The normalized attack difficulty (NAD) of the learning attacker will decrease over time, whereas for the random attacker it will remain constant.

The first hypothesis is based on the learning attacker’s ability to gradually add possible targets, rather than removing them from the entire target space as the random attacker employing the live machines strategy does. The second hypothesis is based on the random attacker not understanding any relations between actions and their consequences and selecting actions by chance, whereas the learning attacker learns the appropriate actions over time.

Figures 5 and 6 show the raw and normalized number of actions required to finish each of the scenario variants for the random and the learning attackers. For both attackers, each scenario variant was run 1000 times. It is apparent that the first hypothesis holds, and even a cursory glance at the graphs shows that the learning attacker’s strategy is between one and two orders of magnitude more efficient.
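The upper-confidence-bound selection used by the learning attacker can be sketched as follows. This is a minimal illustration that assumes the standard UCB1 rule; the exact bandit variant, the reward values, and the fixed action priorities used by the authors are not reproduced here, and the names ucb1_select and run_bandit as well as the toy reward table are hypothetical.

```python
import math


def ucb1_select(counts, reward_sums, total_pulls, c=math.sqrt(2)):
    """Pick the action with the highest UCB1 score.

    counts[i] is how often action i was tried and reward_sums[i] is the
    reward it accumulated. Untried actions are selected first.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        reward_sums[i] / counts[i]
        + c * math.sqrt(math.log(total_pulls) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_of_action, rounds):
    """Run UCB1 against fixed (deterministic) rewards and return pull counts."""
    k = len(reward_of_action)
    counts = [0] * k
    reward_sums = [0.0] * k
    for t in range(rounds):
        action = ucb1_select(counts, reward_sums, t)
        counts[action] += 1
        reward_sums[action] += reward_of_action[action]
    return counts


if __name__ == "__main__":
    # Toy setting: only the third action ever succeeds (reward 1.0).
    print(run_bandit([0.0, 0.0, 1.0], rounds=200))
```

In the toy run the successful action quickly receives most of the pulls, while the unsuccessful ones are retried only occasionally. This is the kind of behavior that lets a learning attacker concentrate on viable targets and sessions.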
[Fig. 6 plot: box plots labelled Learning and Learning Normalized for each of the CTO, VPN, and Employee variants; y-axis ticks from 10 to 1000]

Fig. 6. Number of actions for learning attacker to finish under each scenario.

a new session via exploitation and this new session doubles the attacker’s action space. The second observation can be seen in Fig. 6. Even though the median NAD decreases over time as expected, the CTO variant displays unexpectedly large variation, and in many cases the easier scenario took longer to finish than the other, more complex variants. This counter-intuitive behavior is rooted in the fact that the attacker under the CTO variant has visibility into the entirety of the infrastructure and is free to explore unfruitful branches of the attack graph. This leads to more possibilities and thus large variations, where the CTO variant can have very small or very high NADs compared to the other two variants, which are constrained in what the attackers have access to.

VI. SUMMARY AND FUTURE WORK

This paper introduced a new event-driven simulation tailored to analyzing and evaluating adversarial behavior. The model fills a gap between different cybersecurity simulation works and tools by focusing on attackers’ intent and actions, by enabling integration of different attack models, and by operating on a session level. The model and its implementation were evaluated with three variants of Bronze Butler APT (BB). These variants of attacking agents possess the BB’s capabilities and were launched against a simulated corporate infrastructure with an insecure configuration. We simulated random and learning attackers for each of the three variants, and assessed the effort needed in each case to complete the attack goal. Our results provided not only insights on how to realize session-level cyber adversary simulation, but also showed how different levels of access (CTO, VPN, and Employee) can lead to orders of magnitude differences in the number of actions needed to retrieve critical data.

The presented simulation engine is the first stage in a coordinated effort to create an infrastructure for the development of autonomous cybersecurity agents, spearheaded by the NATO IST-152 research group [44], and as such will be expanded in several key areas:

• Action space: implementing the rest of the AIF which can be expressed within the simulation model, adding support for volumetric attacks, such as DDoS, adding a framework for inter-attacker communication and especially support for stealthy and distributed operations, and automating integration with exploit databases, such as NVD.
• Actor autonomy: adding support for automated generation of realistic cybersecurity scenarios. Currently, an approach that uses the presented simulation model as a basis for expressing the scenarios as a satisfiability problem is being explored, and results should be available soon.
• Defender support: adding support for methods of active defense, such as firewall manipulation, decoy services, etc., to transition from passive to active defense and to enable a feedback loop between attackers and defenders.
• Deployability: enabling the transition from the simulation environment to emulated and real-world environments, such as KYPO [45], while maintaining a 1:1 mapping in attacker capabilities by using the agent algorithms to drive real-world attacking platforms, such as Cryton [46].
• User experience: creation of an IDE to facilitate easier creation and analysis of cybersecurity scenarios.

REFERENCES

[1] R. Langner, “Stuxnet: Dissecting a cyberwarfare weapon,” IEEE Security & Privacy, vol. 9, no. 3, pp. 49–51, 2011.
[2] F. Erlacher and F. Dressler, “How to test an ids?: Genesids: An automated system for generating attack traffic,” in Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity. ACM, 2018, pp. 46–51.
[3] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Towards generating real-life datasets for network intrusion detection.” IJ Network Security, vol. 17, no. 6, pp. 683–701, 2015.
[4] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization.” in Proceedings of the International Conference on Information Systems Security and Privacy, 2018, pp. 108–116.
[5] S. Alzahrani and L. Hong, “Generation of ddos attack dataset for effective ids development and evaluation,” Journal of Information Security, vol. 9, no. 04, p. 225, 2018.
[6] M. Cermak, T. Jirsik, P. Velan, J. Komarkova, S. Spacek, M. Drasar, and T. Plesnik, “Towards provable network traffic measurement and analysis via semi-labeled trace datasets,” in Proceedings of 2018 Network Traffic Measurement and Analysis Conference (TMA). IEEE, 2018, pp. 1–8.
[7] C. Phillips and L. P. Swiler, “A graph-based system for network-vulnerability analysis,” in Proceedings of the 1998 workshop on New security paradigms, 1998, pp. 71–79.
[8] K. Kaynar, “A taxonomy for attack graph generation and usage in network security,” Journal of Information Security and Applications, vol. 29, pp. 27–56, 2016.
[9] X. Ou and A. Singhal, “Attack graph techniques,” in Quantitative Security Risk Assessment of Enterprise Networks. Springer, 2012, pp. 5–8.
[10] Z. Ye, Y. Guo, C. Wang, and A. Ju, “Survey on application of attack graph technology,” Journal of Communications, vol. 38, no. 11, pp. 121–132, 2017.
[11] V. Shandilya, C. B. Simmons, and S. Shiva, “Use of attack graphs in security systems,” Journal of Computer Networks and Communications, vol. 2014, 2014.
[12] J. Zeng, S. Wu, Y. Chen, R. Zeng, and C. Wu, “Survey of attack graph analysis methods from the perspective of data and knowledge processing,” Security and Communication Networks, vol. 2019, pp. 1–16, 12 2019.
[13] S. Yi, Y. Peng, Q. Xiong, T. Wang, Z. Dai, H. Gao, J. Xu, J. Wang, and L. Xu, “Overview on attack graph generation and visualization technology,” in 2013 International Conference on Anti-Counterfeiting, Security and Identification (ASID), 2013, pp. 1–6.
[14] C. Xiaolin, T. Xiaobin, Z. Yong, and X. Hongsheng, “A markov game theory-based risk assessment model for network information system,” in Proceedings of International Conference on Computer Science and Software Engineering, vol. 3. IEEE, 2008, pp. 1057–1061.
[15] K. C. Nguyen, T. Alpcan, and T. Basar, “Security games with incomplete information,” in Proceedings of IEEE International Conference on Communications. IEEE, 2009, pp. 1–6.
[16] B. Wang, J. Cai, S. Zhang, and J. Li, “A network security assessment model based on attack-defense game theory,” in Proceedings of 2010 International Conference on Computer Application and System Modeling (ICCASM), vol. 3. IEEE, 2010, pp. V3–639.
[17] K. Chung, C. A. Kamhoua, K. A. Kwiat, Z. T. Kalbarczyk, and R. K. Iyer, “Game theory with learning for cyber security monitoring,” in Proceedings of 2016 IEEE 17th International Symposium on High Assurance Systems Engineering (HASE). IEEE, 2016, pp. 1–8.
[18] K. Sallhammar, B. E. Helvik, and S. J. Knapskog, “Towards a stochastic model for integrated security and dependability evaluation,” in Proceedings of First International Conference on Availability, Reliability and Security. IEEE, 2006, pp. 8–pp.
[19] H. Wang, Y. Liang, and X. Liu, “Stochastic game theoretic method of quantification for network situational awareness,” in Proceedings of 2008 International Conference on Internet Computing in Science and Engineering. IEEE, 2008, pp. 312–316.
[20] K. Sallhammar, B. E. Helvik, and S. J. Knapskog, “A framework for predicting security and dependability measures in real-time,” International Journal of Computer Science and Network Security, vol. 7, no. 3, pp. 169–183, 2007.
[21] X. Liang and Y. Xiao, “Game theory for network security,” IEEE Communications Surveys & Tutorials, vol. 15, no. 1, pp. 472–486, 2013.
[22] M. Fallah, “A puzzle-based defense strategy against flooding attacks using game theory,” IEEE transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 5–19, 2010.
[23] S. Musman and A. Turner, “A game theoretic approach to cyber security risk management,” The Journal of Defense Modeling and Simulation, vol. 15, no. 2, pp. 127–146, 2018.
[24] D. Grunewald, M. Lützenberger, J. Chinnow, R. Bye, K. Bsufka, and S. Albayrak, “Agent-based network security simulation,” in The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 3, ser. AAMAS ’11. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2011, p. 1325–1326.
[25] J. Chinnow, J. Tonn, K. Bsufka, T. Konnerth, and S. Albayrak, “A tool set for the evaluation of security and reliability in smart grids,” in Smart Grid Security, 01 2013, pp. 45–57.
[26] S. Moskal, B. Wheeler, D. Kreider, M. E. Kuhl, and S. J. Yang, “Context model fusion for multistage network attack simulation,” in Proceedings of Military Communications Conference (MILCOM), 2014 IEEE. IEEE, 2014, pp. 158–163.
[27] S. Moskal, S. J. Yang, and M. E. Kuhl, “Cyber threat assessment via attack scenario simulation using an integrated adversary and network modeling approach,” The Journal of Defense Modeling and Simulation, vol. 15, no. 1, pp. 13–29, 2018.
[28] S.-D. Chi, J. S. Park, K.-C. Jung, and J.-S. Lee, “Network security modeling and cyber attack simulation methodology,” in Proceedings of the 6th Australasian Conference on Information Security and Privacy, ser. ACISP ’01. Berlin, Heidelberg: Springer-Verlag, 2001, p. 320–333.
[29] M. Liljenstam, J. Liu, D. Nicol, Y. Yuan, G. Yan, and C. Grier, “Rinse: the real-time immersive network simulation environment for network security exercises,” in Workshop on Principles of Advanced and Distributed Simulation (PADS’05), 2005, pp. 119–128.
[30] A. Futoransky, F. Miranda, J. Orlicki, and C. Sarraute, “Simulating cyber-attacks for fun and profit,” in Proceedings of the 2nd International Conference on Simulation Tools and Techniques. ICST, 5 2010.
[31] M. E. Kuhl, M. Sudit, J. Kistner, and K. Costantini, “Cyber attack modeling and simulation for network security analysis,” in 2007 Winter Simulation Conference, 2007, pp. 1180–1188.
[32] G. Rush, D. R. Tauritz, and A. D. Kent, “Dcafe: A distributed cyber security automation framework for experiments,” 2014 IEEE 38th International Computer Software and Applications Conference Workshops, pp. 134–139, 2014.
[33] H. Holm and T. Sommestad, “Sved: Scanning, vulnerabilities, exploits and detection,” in MILCOM 2016 - 2016 IEEE Military Communications Conference, 2016, pp. 976–981.
[34] A. Varga, OMNeT++. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 35–59. [Online]. Available: https://doi.org/10.1007/978-3-642-12331-3_3
[35] National Institute of Standards and Technology. NVD - General FAQs. [Online]. Available: https://nvd.nist.gov/general/FAQ-Sections/General-FAQs
[36] The MITRE Corporation. About CVE. [Online]. Available: https://cve.mitre.org/about/index.html
[37] S. Moskal and S. J. Yang, “Cyberattack action-intent-framework for mapping intrusion observables,” 2020.
[38] The MITRE Corporation. MITRE ATT&CK. [Online]. Available: https://attack.mitre.org/
[39] B. E. Strom, J. A. Battaglia, M. S. Kemmerer, W. Kupersanin, D. P. Miller, C. Wampler, S. M. Whitley, and R. D. Wolf. (2017) Finding cyber threats with ATT&CK-based analytics.
[40] J. DiMaggio, “Tick cyberespionage group zeros in on japan,” https://muni.cz/go/de26e1, 2016, [Online; accessed 13-May-2020].
[41] Counter Threat Research Team, “Bronze butler targets japanese enterprises,” https://www.secureworks.com/research/bronze-butler-targets-japanese-businesses, 2017, [Online; accessed 13-May-2020].
[42] MITRE ATT&CK Team, “Bronze butler,” https://attack.mitre.org/groups/G0060/, 2019, [Online; accessed 13-May-2020].
[43] M. M. Drugan and A. Nowe, “Designing multi-objective multi-armed bandits algorithms: A study,” in The 2013 International Joint Conference on Neural Networks (IJCNN), 2013, pp. 1–8.
[44] P. Theron, A. Kott, M. Drašar, K. Rzadca, B. LeBlanc, M. Pihelgas, L. Mancini, and F. de Gaspari, Reference Architecture of an Autonomous Agent for Cyber Defense of Complex Military Systems. Cham: Springer International Publishing, 2020, pp. 1–21. [Online]. Available: https://doi.org/10.1007/978-3-030-33432-1_1
[45] P. Čeleda, J. Vykopal, V. Švábenský, and K. Slavíček, “Kypo4industry: A testbed for teaching cybersecurity of industrial control systems,” in Proceedings of the 51st ACM Technical Symposium on Computer Science Education, ser. SIGCSE ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 1026–1032. [Online]. Available: https://doi.org/10.1145/3328778.3366908
[46] I. Nutár, “Automation of complex attack scenarios [online],” Master’s thesis, Masaryk University, Faculty of Informatics, Brno, 2017. [Online]. Available: https://is.muni.cz/th/cry3j/
978-1-7281-7343-6/20/$31.00 ©2020 IEEE

Abstract—The present paper describes a new parallel time-domain simulation algorithm using a high performance computing environment – Julia – for the analysis of power system dynamics in large networks. The parallel algorithm adapts a parallel-in-space decomposition scheme to a previously sequential algorithm in order to develop a new parallelizable numerical solution of the power system equations. The parallel-in-space decomposition is based on the block bordered diagonal form, which reformulates the network admittance matrix into sub-blocks that can be solved in parallel. For the optimal spatial decomposition of the network, a new extended graph partitioning strategy is developed for load balancing and minimizing the communication between subnetworks. The new parallel simulation algorithm is tested using standard test networks of varying complexity. The simulation results are compared to those obtained from a sequential implementation in order to validate the solution accuracy and to determine the performance improvement in terms of computational speedup. Test simulations are conducted using the ForHLR II supercomputing cluster and show a huge potential in computational speedup with increasing network complexity.

Index Terms—Graph partitioning, parallel computing, power systems, time-domain simulation, transient stability analysis

I. INTRODUCTION

The power system sector has seen an increase in the integration of renewable and distributed generation as a contribution to the inter-sectoral effort to address the climate change challenges. Due to the variable nature of renewable energy sources, flexibilities such as demand side management and storage devices are integrated in the power system towards a successful energy transition. In addition, the complexity of the power system is further increasing in light of the current operation of large interconnected networks and an increase in electricity demand from, e.g., electric vehicles and heat pumps. From the power system analysis perspective, the impact of these changes in operating conditions is an increase in computational complexity in the simulation tools applied for stability and control studies. This implies that the traditional simulation tools require significant improvements through the use of more efficient state-of-the-art computing environments and the application of high performance computing hardware to cope with the increasing complexity in the power system analysis problem. This has led to an increasing interest in the implementation of parallel and distributed algorithms for power system analysis [1]. The need for parallel solutions is further supported by the current advances in computational technology shifting from running applications on single computers with a single processor to a distributed and parallel computing architecture.

A number of methods and applications of high performance computing in power system studies have been reported in the literature [2]. Specific to dynamic simulations, parallel computing techniques are developed by decomposing the system and models into subsystems that can be parallelized, and devising alternative algorithms which offer more parallelization potential. The algorithms for decomposing the system into parallelizable operations can be broadly divided into parallel-in-space, parallel-in-time, and waveform relaxation [3]. In the parallel-in-space algorithms, the network is partitioned into independent subnetworks and the subnetwork equations are assigned to different processors. This method was applied in [4] using a Block Bordered Diagonal Form (BBDF) and in [5] using a Multi-Area Thévenin Equivalent (MATE) algorithm. The parallel-in-time algorithms consider the combination of the differential and algebraic equations over several time steps to create a larger system, which can then be solved simultaneously as described in [6]. The waveform relaxation method was introduced in [7] for transient stability analysis and implemented on a parallel computer in [8]. This method separates the system of differential algebraic equations into subsystems and distributes them to different processors to be solved simultaneously.

The above approaches address the fundamental aspects required for the parallelization of the power system problem using decomposition schemes and numerical solutions based on implicit integration methods for the discretization of the differential equations. However, these approaches have not been applied to adapt the optimized and efficient sequential algorithms based on explicit integration methods to parallel computations using state-of-the-art high performance computing environments. Moreover, the iterative solution in implicit methods implies a larger amount of computations at each time step than in the solution using explicit methods, which influences the numerical efficiency. Furthermore, since the fundamental parallel approaches rely on system decomposition, it is of interest to apply optimal network decomposition in the formulation of the parallel solutions in order to balance the processor loads and minimize communication between subnetworks. An optimized network partitioning results in a small interconnecting system between the subnetworks. This necessitates new efficient solution techniques for the network algebraic equations, since the conjugate gradient method in the parallel-in-space approach presented in [4] is considered to be inefficient for small optimized dimensions of the interconnect partition.

In the present paper, a parallel-in-space decomposition scheme is used to adapt a fundamentally sequential algorithm to a parallel simulation algorithm using a high performance computing environment, Julia [9]. The proposed method adapts parallelism to the sequential program on the algorithmic level for the solution of the network algebraic equation. This is due to the fact that analysis of the sequential time-domain numerical solution shows that solving the network equation consumes a huge amount of time [5]. The proposed parallel method therefore restructures the sequential solution of the network algebraic equation in such a way that allows parallelization of the network solution. This restructuring is based on the block bordered diagonal form for reformulation of the network coefficient matrix.

The network spatial decomposition applied in the present

restructuring of the network equation coefficient matrix and applying a graph partitioning strategy.

A. General Power System Representation

The general model of the power system for transient stability analysis is described by a set of differential and algebraic equations of the form

$$\dot{x} = f(x, y, u) \tag{1}$$

$$0 = g(x, y, u) \tag{2}$$

where x is a vector of dynamic state variables, y is a vector of algebraic variables, and u denotes the system parameters. The set of differential equations represents uncoupled subsets used to model the dynamic behavior of synchronous generators and the connected controllers. The generator controllers include the excitation control system for regulating the generator terminal voltage [11] and the turbine-governor system for controlling the rotational speed and input mechanical power [12]. The generator subsystems are coupled to each other through the transmission network. Other components that are represented by differential equations include Static Var Compensators (SVC), dynamic loads and HVDC devices. The set of algebraic equations comprises the stator equations of each generator, coupled to the equations of the transmission network and static loads. The interface between the generators and network
paper, however, requires a grid partitioning strategy. For this, system is through the stator algebraic equations included in the
a new extended optimal graph partitioning approach is pro- network nodal equation given by
posed to obtain balanced subnetworks which can be solved in YV =I (3)
parallel and only linked via an interconnect partition to share where Y is the network nodal admittance matrix of order n×n,
information at every time step. In such a decomposition, the for n nodes in the system, V and I are vectors of node voltages
set of differential equations is solved in parallel due to the and current injections of order n, respectively.
natural decoupling of the machines. The proposed numerical
In the present paper, the system of equations in (1) and (2)
solution of the algebraic equations in the presented algorithm
is solved based on the alternating solution scheme in which
is based on an efficient implementation of the direct LU
the differential equations and algebraic equations are solved
factorization method [10]. Combined with the optimization
separately at every integration step [13]. The set of discretized
of the network partitions, the approach in the present paper
equations in (1) is solved for xn+1 , which is then substituted
provides a computational advantage for the solution of the
into equation (2) to solve for yn+1 . The discretization method
algebraic equations.
applied in this case is an explicit integration scheme using the
The rest of the paper is organized as follows: Section II Runge-Kutta fourth order method [14]. The following steps
gives a detailed description of the applied methodology for summarize the sequential solution of the transient stability
the parallel formulation of the dynamic simulation solution and analysis problem:
the applied network partitioning strategy. The implementation
of the extended network partitioning method and the new (a) Determine initial steady state operating conditions at t =
proposed parallel dynamic simulation are described in Section 0 : V0
III. Section IV presents results to validate the accuracy and (b) Initialize dynamic state variables x0 and compute the
evaluate the performance of the proposed parallel method. initial algebraic variables
Finally, Section V concludes the paper, including an outlook (c) At time t + 1, calculate the dynamic state variables xt+1
for the future work. from the discretized form of (1) using the known values
at time t.
(d) Compute the algebraic variables yt+1 from 0 =
II. M ETHODOLOGY
g(xt+1 , yt+1 ) using the known values of xt+1 .
The present section gives an overview of the general repre- (e) Repeat steps (c) and (d) at every further time step.
sentation of the power system for transient stability analysis Among the solved algebraic equations in step (d) is the
considered in this paper and then describes the formulation network equation in (3) for the unknown node voltages.
of the parallelizable solution of the network equation by The node current injections I from the system machines are
calculated from the dynamic state variables. The solution of the network equation is based on LU factorization [14] of the admittance matrix Y and solving for V from a linear set of equations using the defined node current injections. Further details about the models and the sequential solution process are given in [15], [16]. Analysis of the computation time of the individual stages of the sequential dynamic simulation algorithm shows that the solution of (3) takes up a large percentage of the total runtime in a single simulation, as also stated in [5]. It is therefore of interest to reformulate the equation into a form where effective parallelization of the solution approach can be applied as explained in the following section.

B. Formulation of Parallel Solution

In the present paper, a parallel-in-space approach is applied to reformulate the network equation into an alternative form that can be easily parallelized during the network solution step. The parallel form of the problem is formulated based on the block bordered diagonal form as described in [17]. This spatial network decomposition forms the basis of the parallel time-domain simulation introduced in the present paper. The BBDF formulation restructures the network in such a way that the n network nodes are grouped into p+1 sub-blocks; where p is the number of subnetworks and the (p + 1)th sub-block represents the boundary nodes interconnecting the different subnetworks. The subnetworks are created from partitioning the grid.

The result of the BBDF formulation is shown in the restructured network admittance matrix and the network nodal equation

\[
\begin{bmatrix}
Y_1 & & & & \overline{Y}_1 \\
& Y_2 & & & \overline{Y}_2 \\
& & \ddots & & \vdots \\
& & & Y_p & \overline{Y}_p \\
\overline{Y}_1^{T} & \overline{Y}_2^{T} & \cdots & \overline{Y}_p^{T} & Y_s
\end{bmatrix}
\cdot
\begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_p \\ V_s \end{bmatrix}
=
\begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_p \\ I_s \end{bmatrix}
\tag{4}
\]

where $Y_i$ are the elements of the original Y matrix within subnetwork i; $Y_s$ is the nodal admittance matrix formed by the boundary nodes in the (p + 1)th sub-block. The $\overline{Y}_i$ elements consist of data regarding the branches that connect subnetwork i to the (p + 1)th sub-block.

The new formulation of the network nodal equation in (4) rearranges the solution of the equation into two steps; the preprocessing step and one step for every subnetwork of the matrix given in (5) – (8).

\[ \widehat{Y}_s V_s = \widehat{I}_s \tag{5} \]

\[ \widehat{Y}_s = Y_s - \sum_{i=1}^{p} \overline{Y}_i^{T} Y_i^{-1} \overline{Y}_i \tag{6} \]

\[ \widehat{I}_s = I_s - \sum_{i=1}^{p} \overline{Y}_i^{T} Y_i^{-1} I_i \tag{7} \]

\[ Y_i V_i = I_i - \overline{Y}_i V_s \tag{8} \]

Equation (5) shows the preprocessing step where the matrix $\widehat{Y}_s$ and vector $\widehat{I}_s$ are calculated in (6) and (7), respectively. The preprocessing step is used to compute the boundary node voltages $V_s$, which are required for the second step to separately solve for the node voltages $V_i$ in the ith subnetwork as given in (8). The solution of the subnetwork node voltages in (8) can therefore be performed in parallel and constitutes the parallelizable task in the network solution implemented in the present paper.

Further speedup can be realized by optimizing specific computations in the preprocessing step. For instance, the solution of matrix $\widehat{Y}_s$ in (6) can be completely computed at the beginning of the simulation. In (7), the vector $\widehat{I}_s$ is calculated at every time step of the simulation from current vector $I_s$, which changes throughout the simulation. However, the matrix product $\overline{Y}_i^{T} \cdot Y_i^{-1}$ in (7) can be pre-computed as part of the simulation optimization strategy. The solutions of the main equations (5) and (8) are based on the LU factorization [14] of the respective admittance matrices. The matrices $\widehat{Y}_s$ and $Y_i$ can be factorized by LU decomposition beforehand to speed up the calculation during the simulation loop.

From the above formulation, solving the block bordered diagonal form initially requires a sequential solution for a vector of boundary node voltages from (5), which grows with more than linear complexity. An important consideration in the formulation of the parallel solution is an efficient partitioning strategy of the network with a minimum number of boundary nodes in the (p+1)th sub-block (interconnect partition) for an optimal performance. The graph partitioning strategy extended for application to dynamic simulations is described in the following section.

C. Network Partitioning

Since power grids can be naturally represented as graphs, network partitioning can be formulated as a graph partitioning optimization problem. The main criteria for optimizing performance during the solution of (4) defined by the system of equations in (5) – (8) are a minimum number of branches that connect partitions and a balance in the sizes of the subnetworks or partitions. These criteria are similar to the main requirements in graph partitioning [18], where the objective is to minimize the number of cut edges. However, the power grids in the present paper are considered to be unweighted graphs with a weight function of one for every branch. Thereby, a minimum number of cut edges results in a minimal number of branches that interconnect to other partitions. This section describes the partitioning strategy applied for the dynamic simulations presented in this work: the basic partitioning format using graph partitioning and the extension to the interconnect partition format.

[Figure: schematic contrasting the basic partition format (Subsystems 1 to 3) with the interconnect partition format, in which each subsystem i solves Eqn (8) with $Y_i$, $\overline{Y}_i$ and the interconnect partition with $Y_s$ solves Eqns (5) to (7); the links between them are labelled $I_i$, $Y_i$, $\overline{Y}_i$ and $V_s$.]

1) Basic Graph Partitioning Format: A multilevel graph partitioning approach, known as the Karlsruhe Fast Flow Partitioner (KaFFPa) algorithm [19], is used in the present paper to generate equally sized partitions that have a minimal number of cut branches. This partitioning output of the algorithm is referred to as the “basic partitioning format” in this paper. A multilevel graph partitioning process is defined by the following three steps: coarsening, initial partitioning, and refinement [20], [21]. In the coarsening step, the algorithm contracts the input graph to create a smaller representation of the graph. The contraction is based on a matching strategy, which identifies a set of edges that do not have a common end point (vertex) [22]. A matching is then contracted by combining the start and endpoint of every edge in the set; thus decreasing the size of the input graph. As soon as the graph is small enough, the initial partitioning step is applied using a global partitioning algorithm. The KaFFPa algorithm uses the SCOTCH global partitioning algorithm [23] [...]
is implemented to convert the basic partition format into the interconnect partition format. The implementation of the converter algorithm is based on the procedure described in Section II-C and summarized in Algorithm 2.

B. Parallel Dynamic Grid Simulation

The inputs to the parallel time-domain simulation algorithm include the network casefile in Matpower format and the corresponding subnetworks formed by the preprocessing partitioning procedure described in Section III-A. The initial step in any dynamic simulation algorithm is the power flow solution to establish a quasi-steady state starting point for the simulation. In the present implementation, the power flow calculation is based on the PowerModels package [26] in Julia. In terms of implementation, the advantage of applying the PowerModels package is that the network data format is consistent with the Matpower file format. With this property, the network casefiles defined in Matpower can be directly used in the Julia simulation. Additional inputs for the parallel dynamic simulation are the dynamic model parameter and network events files. These files are defined in Matlab and directly called within the Julia algorithm using the Julia package Matlab.jl.

The parallelization in the dynamic simulations is limited to a single time step, since the solutions are based on the step-by-step numerical solution. The first parallelization step is the computation of the decoupled machine differential equations using the fourth order Runge-Kutta method to obtain the node current injections. The second step is the computation of the BBDF formulated network equation. The solution of the network consists of the precomputation steps, which are mainly matrix construction steps required for the sequential solution of the interconnect partition equations, and the parallel solution of the subnetwork equations. For the task of solving the linear network system, the sparse LU factorization solver using the UMFPACK library [10] is applied. The main steps of the parallel algorithm are summarized in Algorithm 3.

C. Communication Aspects

The parallel dynamic simulation algorithm proposed in the present paper is based on a single node parallelization. The main computations in the algorithm are memory bound tasks dealing with vector arithmetic, matrix multiplication, and solving of the network equation. In the Julia environment, such a parallelization problem can be effectively handled using the multithreading construct [9].

Fig. 2 illustrates the simulation time line for the implemented parallel dynamic simulation for an example with two partitions ($p_1$, $p_2$) and the interconnect partition (p + 1). At the initialization step, the unpartitioned network is used to establish the quasi steady state conditions of the system. The steady state dynamic and algebraic variables of each subnetwork are derived from the unpartitioned network conditions. The subnetworks precompute their corresponding internal admittance matrices and the boundary matrix elements. The precomputation also includes the LU factorization of the internal admittance matrices required for the solution of (8). After all the subnetworks have finalized the matrix precomputation step, the interconnect partition starts receiving the matrices of the individual subnetworks to precompute the interconnect [...]

Algorithm 1: Initial graph partitioning
  Inputs: Network casefile; n nodes, b branches
          Required partitions p
  Eliminate recycling branches
  Build graph G in text format from network topology
  Partition G into p subsystems
  Return: Graph G and p subsystems

Algorithm 2: Interconnect partitioning format
  Inputs: b branches, partition indices (1 ... p) for nodes
  Assign interconnect partition index p + 1
  for each branch do
    Determine partition indices of from and to nodes
    if from ≠ to & partition index ≠ p + 1 then
      branch → cut branch; nodes → boundary nodes
    end
    Ranking boundary nodes:
    for boundary nodes do
      Rank based on branch count to other partitions
      if nodes exist with equal branches then
        Rank based on partition size of node location
      end
      Move highest ranking node to index p + 1
      Update list of boundary nodes
      Repeat Until set of boundary nodes is empty
    end
  end
  Return: partitions 1 to p and interconnect partition p + 1

Algorithm 3: Parallel computation procedure
  Inputs: Network casefile and partitions
  Initialization: $V_0$, $X_0$
  Precomputation:
    Form subsystem matrices $Y_i$, $\overline{Y}_i$ and $Y_s$
    Compute $\widehat{Y}_s$ and product $\overline{Y}_i^{T} Y_i^{-1}$
    LU factorize $Y_i$ and $\widehat{Y}_s$
  for each partition do
    Calculate machine state variables $X_i$
    Compute current injection in each partition $I_i$
  end
  Compute link currents $\widehat{I}_s$
  Solve for the interconnect subnetwork voltages $V_s$
  for each partition do
    Solve for subnetwork node voltages $V_i$
  end
  Return: State and algebraic variables at each time step
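The network-solution part of Algorithm 3, i.e. equations (5) to (8), can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Julia code: exact real-valued fractions stand in for complex admittances, a dense Gaussian elimination (`solve`) stands in for the sparse LU solver, the per-partition loops run sequentially rather than in threads, and the names `bbdf_solve`, `Ybar` and the toy data are hypothetical.

```python
from fractions import Fraction as F

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination (stand-in for a sparse LU solver)."""
    n = len(A)
    M = [[F(a) for a in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)  # assumes A is nonsingular
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * pc for a, pc in zip(M[r], M[c])]
    x = [F(0)] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - dot(M[r][r + 1:n], x[r + 1:])) / M[r][r]
    return x

def bbdf_solve(Y, Ybar, Ys, I, Is):
    """Boundary voltages from (5)-(7), then each subnetwork from (8)."""
    m = len(Ys)
    Yhat = [[F(a) for a in row] for row in Ys]
    Ihat = [F(a) for a in Is]
    for Yi, Ybi, Ii in zip(Y, Ybar, I):          # independent per partition
        YbiT = [list(col) for col in zip(*Ybi)]  # rows of Ybar_i^T
        X = [solve(Yi, col) for col in YbiT]     # X[b] = Yi^{-1} * column b of Ybar_i
        z = solve(Yi, Ii)                        # Yi^{-1} * Ii
        for a in range(m):
            for b in range(m):
                Yhat[a][b] -= dot(YbiT[a], X[b])  # equation (6)
            Ihat[a] -= dot(YbiT[a], z)            # equation (7)
    Vs = solve(Yhat, Ihat)                        # equation (5), sequential step
    V = [solve(Yi, [Ii[k] - dot(Ybi[k], Vs) for k in range(len(Ii))])  # equation (8)
         for Yi, Ybi, Ii in zip(Y, Ybar, I)]
    return V, Vs

# Toy system: two subnetworks with two nodes each, two boundary nodes.
Y = [[[4, -1], [-1, 3]], [[5, -2], [-2, 4]]]
Ybar = [[[-1, 0], [0, -1]], [[0, -1], [-1, 0]]]
Ys = [[3, -1], [-1, 3]]
V_true, Vs_true = [[1, 2], [3, 4]], [5, 6]
# Injections consistent with (4) for the chosen voltages.
I = [[dot(r, Vi) + dot(br, Vs_true) for r, br in zip(Yi, Ybi)]
     for Yi, Ybi, Vi in zip(Y, Ybar, V_true)]
Is = [dot(Ys[a], Vs_true)
      + sum(dot([row[a] for row in Ybi], Vi) for Ybi, Vi in zip(Ybar, V_true))
      for a in range(2)]
V, Vs = bbdf_solve(Y, Ybar, Ys, I, Is)
```

The sketch recomputes the products in (6) on every call for brevity; the text above notes that they do not change during the simulation and can be formed and factorized once in the precomputation stage. Only the two per-partition loops are candidates for parallel execution, while the solve for $V_s$ stays sequential.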
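The conversion summarized in Algorithm 2 can be read as a greedy loop that keeps moving the highest ranking boundary node into partition p + 1 until no branch joins two subnetworks directly. The sketch below follows that reading and is an interpretation, not the authors' converter: the function name `to_interconnect_format`, integer bus numbers, the tie-break that prefers the node from the larger partition, and the final tie-break on the lowest bus number are assumptions.

```python
from collections import Counter

def to_interconnect_format(branches, part, p):
    """Return a node -> partition map in which no branch joins two
    different subnetworks without passing through partition p + 1."""
    part = dict(part)        # node -> partition index in 1..p (copied, not mutated)
    inter = p + 1            # index of the interconnect partition
    while True:
        # Cut branches: endpoints in different partitions, neither in p + 1.
        cut = [(u, v) for u, v in branches
               if part[u] != part[v] and inter not in (part[u], part[v])]
        if not cut:
            return part
        degree = Counter(n for edge in cut for n in edge)  # cut branches per boundary node
        size = Counter(part.values())                      # current partition sizes
        # Rank by cut-branch count, then by size of the home partition
        # (assumed direction), then by lowest bus number for determinism.
        best = max(degree, key=lambda n: (degree[n], size[part[n]], -n))
        part[best] = inter

# Hypothetical 6-bus example: bus 3 (partition 1) touches buses 4 and 5 (partition 2).
branches = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (5, 6)]
basic = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
result = to_interconnect_format(branches, basic, p=2)
```

Moving bus 3 alone removes both cut branches, so the interconnect partition ends up with a single node. Ranking by cut-branch count is what keeps the (p + 1)th sub-block small, which the formulation in Section II-B identifies as the main lever on the sequential part of the network solution.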
there is a perfect match in the results between the proposed parallel simulation and the sequential simulation. This shows that the BBDF formulation of the network equation and solution using subnetworks correctly replicates the results of the original network equation formulation. The perfect match can be attributed to application of the same numerical solution strategy in both simulation cases based on the Runge-Kutta method for discretization of the differential equations and a series of LU factorization for the solution of the algebraic equations.

[Plot: Relative angle [deg] versus Time [s]; curves MatDyn δ21, MatDyn δ31, Parallel δ21, Parallel δ31]
Fig. 3. Comparison of generator relative rotor angles following a bus fault in MatDyn toolbox and the new Julia-based parallel dynamic simulation tool

[Plot: Voltage [pu] versus Time [s]; legend entries recovered: MatDyn Bus3, MatDyn Bus5, Parallel Bus1, Parallel Bus2, Parallel Bus3, Parallel Bus5]
Fig. 5. Comparison of bus voltage response following a fault on bus 5 in MatDyn toolbox and the new Julia-based parallel dynamic simulation tool

C. Evaluation of Performance

In order to evaluate the performance of the parallel dynamic grid simulation, the sequential version of the algorithm is initially extended to a similar programming environment, Julia. The evaluation is performed in terms of computational speedup. The performance of the sequential and parallel algorithms in the Julia environment is evaluated using the different test networks in reference to the Matlab-based sequential simulation toolbox in [27]. For the parallel simulation, the optimal partitioning count is considered for the comparison. Table II gives a summary of the optimal partitioning results for the different network structures. The simulation runtimes in the three algorithms are illustrated in Fig. 6 to compare the minimum computation runtimes.

TABLE II
OPTIMAL NETWORK PARTITIONING COUNT

Network      Optimal partition   Average size   Interconnect size
Case9        2                   3              2
Case30       4                   6              6
Case118      6                   16             18
Case300      5                   57             15
Case1354     7                   188            37
Case9241     16                  567            161
Case13659    10                  1356           97

Fig. 6 shows that the sequential and parallel Julia implementations are faster than the Matlab-based implementation for all test cases. This performance difference is attributed to the high performance capability provided by the Julia programming environment. With this in mind, the sequential Julia computation runtimes are used for further analysis of the speedup attained by the parallel dynamic simulation. The computed speedup is shown in Table III with various partition sizes for all tested networks. The speedup is computed according to the relation Speedup = $T_s / T_p$, where $T_s$ is the runtime in the sequential simulation and $T_p$ is the runtime in the parallel simulation. The optimal partitioning resulting in the best speedup for each simulated network is highlighted in Table III.

[Fig. 6: Runtime [s] on a logarithmic axis ($10^1$ to $10^3$) for the three implementations; bar labels as extracted, assignment to networks not recoverable: 1,274.38, 556.45, 379.56, 246.82, 212.31, 134.83, 48.6, 25.22, 18.52, 17.16, 15.34, 13.28, 12.81, 6.64, 4.73, 3.95, 3.19, 2.11, 1.93, 1.78, 1.54]

TABLE III
(speedup per number of partitions; column assignment inferred from the optimal partition counts in Table II)

Partitions   Case9    Case30   Case118  Case300  Case1354  Case9241  Case13659
1            1.0000   1.0000   1.0000   1.0000   1.0000    1.0000    1.0000
2            0.8006   0.7776   1.0251   1.0641   0.9064    0.9046    0.9009
3            0.7805   0.7770   0.9821   0.7486   1.0742    1.0389    1.0255
4            -        0.8449   1.1902   1.3385   1.2949    1.1079    1.1625
5            -        0.8176   1.2385   1.4028   1.3297    1.2058    1.3096
6            -        0.7448   1.2390   1.3491   1.3590    1.2201    1.2706
7            -        0.7760   1.0787   1.2970   1.4696    1.3118    1.3962
8            -        0.7669   1.1326   1.2765   1.3981    1.4295    1.3682
10           -        0.7130   1.0745   1.2657   1.2079    1.5049    1.5380
12           -        -        1.0185   0.7766   1.2572    1.5445    1.5183
14           -        -        0.9790   1.1573   1.3206    1.5532    1.4350
16           -        -        0.9633   1.1328   1.4160    1.5746    1.4905
18           -        -        0.9319   1.0927   1.2374    1.5742    1.5192
20           -        -        0.9023   1.0657   1.3920    1.5090    1.5188

From the above results, the simulations in the proposed parallel dynamic algorithm are relatively slower than the corresponding sequential simulations for small network cases. However, the parallel runtime shows significant improvements with increasing network sizes as shown in Fig. 6, where Case9241 and Case13659 achieve 57.46% and 53.8% speedup, respectively. Furthermore, the speedup is observed to vary with the number of partitions. For each of the tested networks, an optimal partitioning exists at which the computation runtime is minimum as shown in Table III. This behavior can be explained as follows: A higher number of partitions results in [...] runtime of the network solution. At the same time, the parallelizable partitions differ and decrease in size, causing an increase in waiting times and less parallelizable processing than the sequential task. Therefore, the quality of partitioning [...]

The factors influencing the performance in the presented [...] due to the restructuring using the BBDF formulation and (ii) the data exchange in a simulation step. For small networks, [...]

[Plot: solver runtime for Case300, Case1354, Case9241 and Case13659; legend entries recovered: Parallel solver (seq time), Parallel solver (dataIO)]
Fig. 7. Summary of runtime of the numerical solver stage in the new Julia-based parallel dynamic simulation tool; showing the components of the parallel solver runtime compared to the sequential solver runtime

However, the low speed up observed in the presented results can be explained as follows: The communication flow in Fig. 2 shows that a simulation step consists of two parallel stages. Between the two stages, variables have to be collected by the main partition in order to execute the sequential step of the network solution and then send the results back to the individual partitions for the second parallel stage. Fig. 7 shows a summary of the simulation runtime for the numerical solver stage, which is the main computation in a simulation step. The runtime of the numerical solver stage in the parallel algorithm consists of the parallel component for the solution of the differential equations and the network equations for the node voltages (par time), the sequential solution for the link voltages (seq time) and the data exchange time (dataIO) as shown in Fig. 7. For comparison, the runtime of the numerical solver in the sequential algorithm (Sequential solver runtime) is also included in the figure.

Analysis of Fig. 7 shows the solver parallel component – parallel solver (par time) – is small compared to the total solver runtime. Furthermore, the parallel solver sequential component – parallel solver (seq time) – is a very small percentage of the solver runtime. This is due to the fact that the sequential
step is optimized using an efficient LU solver and optimized network partitioning. Therefore, the major factor for the low speedup is the data exchange resulting in a high number of synchronizations in the parallel execution. Part of the future work is to further optimize the simulation algorithm in order to reduce the data exchange overhead.

V. CONCLUSION

The present paper introduces a new parallel time-domain simulation tool based on the block bordered diagonal form for reformulation of the network equation solution. The implementation relies on an extended partitioning strategy for decomposing the network structure into parallelizable subnetworks exchanging information at every simulation time step through the boundary buses in the interconnect partition. The accuracy of the simulation is compared to a validated sequential simulation toolbox and found to perfectly match in terms of the derived response profiles. For the tested network cases, a performance improvement is achieved as the network size increases. In addition, the computation runtime is seen to be dependent on the quality of the partitioning.

It is important to note that the presented results of the parallel dynamic simulation algorithm are obtained using Julia version 0.6.3 (with experimental multi-threading status), which lacks support for multi-threading of nested loops. Part of the future work is to extend the algorithm to Julia versions that provide general task parallelism properties as described in [30]. Furthermore, the parallel algorithm will be extended for testing on more than one computing node of the ForHLR II computing cluster.

ACKNOWLEDGMENT

This work is part of the Energy Systems 2050 project, an initiative of the Helmholtz Association. The work was performed on the supercomputer ForHLR funded by the Ministry of Science, Research and Arts Baden-Württemberg and by the Federal Ministry of Education and Research.

REFERENCES

[1] R. C. Green, L. Wang and M. Alam, “High performance computing for electric power systems: Applications and trends,” in IEEE Power and Energy Society General Meeting, San Diego, CA, 2011.
[2] D. M. Falcao, “High performance computing in power system applications,” in International Conference on Vector and Parallel Processing, Porto, Portugal, 1997.
[3] C. Dufour, V. Jalili-Marandi, J. Bélanger and L. Snider, “Power system simulation algorithms for parallel computer architectures,” in IEEE Power and Energy Society General Meeting, San Diego, CA, 2012.
[4] I. Decker, D. Falcao and E. Kaszkurewicz, “Conjugate gradient methods for power system dynamic simulation on parallel computers,” IEEE Transactions on Power Systems, vol. 11, no. 3, pp. 1218–1227, 1996.
[5] M. Tomim, J. Martí and L. Wang, “Parallel solution of large power system networks using the Multi-Area Thévenin Equivalents (MATE) algorithm,” International Journal of Electrical Power & Energy Systems, vol. 31, no. 9, pp. 497–503, 2009.
[6] M. L. Scala, R. Sbrizzai and F. Torelli, “A pipelined-in-time parallel algorithm for transient stability analysis (power systems),” IEEE Transactions on Power Systems, vol. 6, no. 2, pp. 715–722, May 1991.
[7] M. Crow and M. Ilic, “The parallel implementation of waveform relaxation methods for transient stability simulations,” IEEE Transactions on Power Systems, vol. 5, no. 3, pp. 922–932, Aug. 1990.
[8] L. Hou and A. Bose, “Implementation of the waveform relaxation algorithm on a shared memory computer for the transient stability problem,” IEEE Transactions on Power Systems, vol. 12, no. 3, pp. 1053–1060, Aug. 1997.
[9] J. Bezanson, A. Edelman, S. Karpinski and V. Shah, “Julia: A fresh approach to numerical computing,” SIAM Review, vol. 59, no. 1, pp. 65–98, 2017.
[10] T. A. Davis, “UMFPACK user guide, version 5.6.2,” https://www.suitesparse.com, 2013.
[11] IEEE Std 421.5-2005, “IEEE recommended practice for excitation system models for power system stability studies,” IEEE, New York, 2006.
[12] Task Force on Turbine-Governor Modeling, “Dynamic models for turbine-governors in power system studies,” IEEE Power & Energy Society, 2013.
[13] S. Soman, S. Khaparde and S. Pandit, Computational methods for large sparse power systems analysis; An object oriented approach, Kluwer Academic Publishers, 2002.
[14] M. L. Crow, Computational methods for electric power systems, New York: CRC Press, 2009.
[15] M. Kyesswa, H. Çakmak, U. Kühnapfel and V. Hagenmeyer, “A Matlab-based dynamic simulation module for power system transients analysis in the eASiMOV framework,” in European Modelling Symposium, Manchester, UK, Nov. 2017.
[16] M. Kyesswa, H. K. Çakmak, U. Kühnapfel and V. Hagenmeyer, “A Matlab-based simulation tool for the analysis of unsymmetrical power system transients in large networks,” in European Conference on Modelling and Simulation (ECMS), Wilhelmshaven, Germany, May 2018.
[17] I. Decker, D. Falcao and E. Kaszkurewicz, “An efficient parallel method for transient stability analysis,” in Proceedings of the Tenth Power Systems Computation Conference, 1990.
[18] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders and C. Schulz, “Recent advances in graph partitioning,” in Algorithm Engineering, Cham, Springer, pp. 117–158, 2016.
[19] P. Sanders and C. Schulz, “High quality graph partitioning,” in 10th DIMACS implementation challenge workshop: Graph Partitioning and Graph Clustering, 2013.
[20] P. Sanders and C. Schulz, “Think locally, act globally: Highly balanced graph partitioning,” in Experimental Algorithms, Springer Berlin Heidelberg, pp. 164–175, 2013.
[21] P. Sanders and C. Schulz, “Engineering multilevel graph partitioning algorithms,” in 19th European Symposium on Algorithms, 2011.
[22] J. Maue and P. Sanders, “Engineering algorithms for approximate weighted matching,” in International Workshop on Experimental and Efficient Algorithms, 2007.
[23] F. Pellegrini, “Distillating knowledge about SCOTCH,” in Combinatorial Scientific Computing, 2009.
[24] R. D. Zimmerman, C. E. Murillo-Sanchez and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
[25] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, pp. 359–392, 1998.
[26] C. Coffrin, R. Bent, K. Sundar, Y. Ng and M. Lubin, “PowerModels.jl: An open-source framework for exploring power flow formulations,” in 2018 Power Systems Computation Conference (PSCC), 2018.
[27] S. Cole and R. Belmans, “MatDyn, A new Matlab-based toolbox for power system dynamic simulation,” IEEE Transactions on Power Systems, vol. 26, no. 3, pp. 1129–1136, Aug. 2011.
[28] “KIT - SCC - ForHLR II,” [Online]. Available: https://www.scc.kit.edu/dienste/forhlr2.php.
[29] C. Josz, S. Fliscounakis, J. Maeght and P. Panciatici, “AC power flow data in MATPOWER and QCQP format: iTesla, RTE snapshots, and PEGASE,” 2016. [Online]. Available: http://arxiv.org/abs/1603.01533.
[30] J. Bezanson, J. Nash and K. Pamnany, “Announcing composable multi-threaded parallelism in Julia,” Julia, 23 July 2019. [Online]. Available: https://julialang.org/blog/2019/07/multithreading.
Abstract—Simulation is an essential part of the energy systems design procedure, since it serves as a tool for verifying the respective design. It verifies the stable operation of a developed energy systems infrastructure before it is realized. As energy systems integration becomes an important part of future low-carbon energy scenarios, the cooperation of experts specialized in the various domains crucial to single aspects of the energy system is indispensable. Co-simulation enables modelling in the familiar environment of each expert, but requires a detailed coordination of the simulation interfaces between the specific expert models. Hence, standardized interfaces are crucial to the efficient use of expert knowledge in distributed co-simulations. Therefore, this paper introduces a workflow for the co-simulation development of energy systems simulations, which simplifies the coordination procedure significantly by standardizing the interfaces between the models and their simulations. The approach is applied by way of example to the energy system design of a district comprising electricity and heat in order to demonstrate its successful performance.
Index Terms—co-simulation, energy systems integration, energy systems design, interfaces
978-1-7281-7343-6/20/$31.00 ©2020 IEEE

I. INTRODUCTION
As the international focus on reducing the number of available fossil power plants intensifies [1]–[4], cheap and strongly fluctuating generation from renewable energies is growing. The challenging transition of the current energy system into a sustainable multimodal energy system needs new operating strategies [5].
Common approaches and surveys about the future design of the energy system take a top-level approach from a national or international perspective (e.g. [6]–[9]). Yet, operating issues and local needs have to be considered in further steps. In lower-level simulations, the operability of the designed energy system can be verified. The modelling of energy systems comprising many different technologies [10]–[12] requires a high effort to consider all system attributes in the same simulation development environment and therefore requires specific expertise.
In this context, distributing simulations into co-simulations [13]–[15] enables the collaboration of specialists from the various areas of expertise. By doing so, existing simulations, including the latest research and insights of specialized experts, can be fully integrated into the co-simulation. Nevertheless, this approach requires a high coordination effort between the cooperation partners. Reducing this effort by defining a unique interface between the simulation parts at the beginning simplifies the collaboration fundamentally. Therefore, introducing a new approach for obtaining an interface definition is the main purpose of the present paper.
The remainder of this paper is organized as follows: Section II discusses related work in this area of interest. Section III and Section IV introduce a coordinating approach to the distribution procedure of simulations by defining the interfaces between the simulations on multiple layers. This approach is applied by way of example to the energy system design of a district in Section V. Finally, Section VI concludes with a discussion and outlook.

II. RELATED WORK
Distributed co-simulations for multimodal energy systems are rare in the literature. This rarity is probably due in part to the lack of expert knowledge across all the respective fields. In this context, an approach that takes into account the whole hierarchy between the European transmission grid and local prosumers is contained in the Energy System Development Plan (ESDP) [16]. This approach builds local energy cells with assumptions for distribution grids. These energy cells are interconnected via the real European transmission grid [16]. Thereby, the gap between locally available flexibility and transregional flexibility demand is bridged by simulating a market-driven mode of operation. In these cells, the flexibility is available for the entire network. Heat is considered in
the form of demands and heat-supplying energy conversion units. The simulation consists of different steps, in which a market simulation determines the schedule and the transmission grid behaviour is simulated in a steady-state power flow simulation [16]. Since the simulation is split into a sequentially executable toolchain, the data exchange between the simulations is straightforward.
Another approach with a focus on various heating systems is MESCOS [17], which also includes an electrical grid simulation. In [17], models are developed in a couple of modelling tools and co-simulated via network communication. Thereby, the interfaces between the different simulators are defined in a scenario description file.
Furthermore, in [18] a gas and a heat grid simulation are integrated into the OpSim platform [19] by linking already existing simulation tools for the respective domains. The simulation tools used are adapted to the OpSim message bus according to the OpSim Proxy/Client concept [18].
Moreover, a simulation framework for multi-carrier energy systems is presented in [20]. It is designed in particular for the cooperation between experts of various domains. The central orchestration of the co-simulation is performed in MATLAB®. The interface design between the different simulations is recognized as a central challenge. However, a high number of different simulation tools is seen as an obstacle [20].
In [21], requirements for coupling of simulators are identified. Besides the special application field of multimodal energy system simulations, there are more general ambitions for coupling physical simulations. Initial efforts to standardize physical simulation interfaces on the technical level are already in use with the Functional Mockup Interface (FMI) definitions [22], [23]. In the future, the new standard for network co-simulation DCP [24] could also play an important role.
In this context, to the best of our knowledge, there is no holistic procedure for defining interfaces for the distributed simulation of energy systems. Hence, it is an open scientific problem to define these interfaces so as to enable a frictionless model development by the individual domain experts. Therefore, the present paper introduces a clear interface definition for energy systems simulations in order to avoid time-consuming adaptations of the simulation models and incompatibilities that may occur in the end.

III. INTERFACING SIMULATIONS
The interfacing has to be carried out in several layers; the corresponding procedure is presented in Section IV. In the present section, the stack of simulation interfaces for distributed simulations is introduced. For this, a single simulation that is a part of the distributed simulation is called a simulation module. Table I gives an overview of the layers.

TABLE I
INTERFACE LAYERS FOR ENERGY SYSTEM CO-SIMULATIONS
Layer  Name                    Description
3      semantic                scenario(s), infrastructure, energy flow
2      information flow        variables, information direction
1      simulators interaction  simulation control and communication
0      simulation technology   (simulator dependent)

A. Semantic layer
The semantic layer considers the energy system that is intended to be modelled. It further defines the respective scenario(s) for which this modelling is undertaken. It consists of energy grids and connected devices. In this perspective, the energy flow through the system is considered. All scientific investigations of the energy system design are dealt with in this layer. Furthermore, the initial splitting of the logical simulation parts is performed.

B. Information flow layer
Information exchange is an essential part of distributed simulations. The simulation modules are considered as black boxes with interfaces for data exchange with other simulation modules. The calculation of the current state in the black box can require input data delivered from external simulation modules.
In the information flow layer, the interfaces of the simulation modules and their dependencies are specified. Each interface contains the value of a floating point variable, which represents a physical quantity with one unit. Formally, an interface consists of a quadruple {identifier, type, value, unit}, where the identifier is a unique name, the type is either input or output, the value is a floating point number, and the unit is an SI unit. An input interface is defined for each variable that a simulation module requires from another simulation module. Similarly, an output interface is defined for each state variable that a simulation module provides for transmission to other simulation modules. To ensure the operability of the simulations, every input interface needs to be connected with an output interface of a fitting quantity.
The assignment of the output interfaces to the corresponding input interfaces is performed according to the scenario defined in the semantic layer. The information flow is always directed from an output interface to an input interface. Bidirectional connections can be represented by two oppositely directed connections.
A complete definition of this layer contains a list of all interfaces of all simulation modules and, additionally, the corresponding output interface for each input interface.

C. Simulators interaction layer
The simulators interaction layer provides the information exchange between the simulation modules according to the connections determined in the information flow layer. The communication between the simulation modules has to be clearly specified and implemented by each associated simulation module. Alternatively, simulation tools can be adapted to the specified communication format with an interconnected module. Most of the co-simulation standards use this concept, like the FMI standard [23] for local co-simulation
or DCP [24] for distributed co-simulation as well as the co-simulation frameworks [17], [19], [25].
During the simulation run, the values of the single variables are transmitted from the output interfaces to the input interfaces according to the specification in the information flow layer.

D. Simulation technology layer
On the bottom layer, each simulation tool has its own simulation technology. To provide the required interface for the simulators interaction layer, an intuitive solution is the implementation of individual adaptations for existing simulation tools by using the program code APIs of the simulation tools (as applied in [17], [18]) or adapting them otherwise. In [20], a Python wrapper is implemented to adapt commercial simulation tools to their co-simulation framework.
Simulation tools that already provide the widespread FMI standard can be easily integrated by creating an adaptation for the FMI standard once, as applied in [25] or for a smart grid simulation framework in [26].

IV. DISTRIBUTING A SIMULATION
Verification by simulation is an important part of designing energy systems. For this verification, detailed simulation modules of different domains are needed. For an efficient modelling of multimodal energy systems by experts of the respective domains in parallel, a coordination procedure is required. In the present section, the procedure of distributing the simulation is introduced. The splitting begins after an appropriate simulation scenario is determined. Finally, the independently developed simulation models are united in a common co-simulation without any adaptation effort.

A. Finding an appropriate level of detail
Energy systems can have very different dynamic behaviours. While dynamic investigations of electricity grids consider sub-second time periods, the dynamic behaviour of gas and heat grids is much slower. From a technical point of view, it is possible to create a co-simulation respecting all these issues. However, the execution of this type of co-simulation requires an enormous amount of computing power. But if effects occurring in one subsystem have a negligible influence on other subsystems, these effects can be excluded from the co-simulation and investigated in an independent additional simulation. For this reason, the level of detail is determined by the issues to be investigated.
Seasonal behaviour is an important factor in energy systems integration. Investigations usually consider periods of one year or even more (e.g. [6], [12]). The computational executability within an acceptable execution time has to be taken into account by every simulation module (see also subsection IV-C). A typical example for the adjustment of the level of detail in the simulations is given in this case for the modelling of the power grid. To consider large time scales, the electricity grid is simulated in a steady-state power flow simulation. This simulation is repeated periodically with a period of typically 15 minutes to one hour.

B. Splitting the simulation
To achieve a frictionless fitting of the distributed simulation parts, all interfaces are defined before the individual models are created.
After the scenario is clarified, the interface definition can be performed according to the interface layer stack introduced in Section III. On the semantic layer, the energy system is sketched in the shape of grids and connected components like consumers or generators. Energy and material like fuel are exchanged between the connected components across the grid. From this sketch, the responsibilities for components and grids are divided among the participating experts. The subsequent independent modelling of these components in separate simulation modules is the responsibility of the experts.
In the next step, the correspondingly assigned experts identify the required information transfer on the information flow layer for each energy or material flow. To do this, they draw up a list of individual interfaces, each consisting of a variable with the corresponding unit. For physical simulations, the variables depend on a real-world system. Modelled representations of these systems are highly similar. Thus, for the time-step-based simulation of physical systems, a generalization of the interfaces is possible due to the low-level modelling of a real system. A reusable definition of interfaces on this layer for electricity and heat grids is shown in Section V. In simulations considering communication behaviour or other kinds of event-based simulation, reusable definitions of interfaces are not feasible in general.
A jointly used co-simulation platform has to provide the connections between the simulation modules according to the elaborated list of interfaces. The co-simulation platform will conduct the composed distributed simulation in the end. An important question regarding the choice of the platform is whether the simulation should be performed locally on one machine or distributed on several machines. Furthermore, confidentiality obligations can necessitate the execution of some simulation models in a geographically restricted area, which needs a platform supporting co-simulations over large distances.
The development and simulation tool for each simulation module is individually chosen by each expert, but the compatibility between the co-simulation platform and the chosen simulation tool is mandatory. It is advisable to take this into account already when choosing the co-simulation platform to avoid costly simulation tool adaptations. In [20], the selection of an appropriate simulation environment is actually recommended as the first step.
The whole procedure is applied by way of example to a district combined heat and power simulation case in Section V.
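The interface list drawn up in this step can be written down in a machine-checkable form. The following Python sketch is only an illustration under stated assumptions: the class name `Interface`, the function `check_connections`, and all identifiers and values in the example are hypothetical and are not taken from the paper or from any co-simulation platform. It represents each interface as the quadruple {identifier, type, value, unit} of Section III-B and checks the rule that every input interface is connected to an output interface of a fitting quantity, here simplified to "same unit".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Interface:
    """Quadruple {identifier, type, value, unit} of the information flow layer."""
    identifier: str  # unique name, e.g. "chp.T_ret" (hypothetical naming scheme)
    type: str        # either "input" or "output"
    value: float     # current value of the variable
    unit: str        # SI unit, e.g. "K" or "kg/s"


def check_connections(interfaces: List[Interface],
                      connections: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every input interface
    is connected to an existing output interface with the same unit.

    `connections` maps an input identifier to the output identifier feeding it.
    """
    by_id = {iface.identifier: iface for iface in interfaces}
    problems = []
    for iface in interfaces:
        if iface.type != "input":
            continue
        source_id = connections.get(iface.identifier)
        if source_id is None:
            problems.append(f"{iface.identifier}: not connected")
            continue
        source = by_id.get(source_id)
        if source is None or source.type != "output":
            problems.append(f"{iface.identifier}: {source_id} is not an output interface")
        elif source.unit != iface.unit:
            problems.append(f"{iface.identifier}: unit {iface.unit} does not fit {source.unit}")
    return problems


# Hypothetical example values, chosen only for illustration.
interfaces = [
    Interface("heat_grid.T_ret", "output", 323.15, "K"),
    Interface("weather.T_amb", "output", 283.15, "K"),
    Interface("chp.T_ret", "input", 0.0, "K"),
    Interface("chp.T_amb", "input", 0.0, "K"),
    Interface("chp.m_flow", "input", 0.0, "kg/s"),
]
connections = {
    "chp.T_ret": "heat_grid.T_ret",
    "chp.T_amb": "weather.T_amb",
}
problems = check_connections(interfaces, connections)
print(problems)
```

In this sketch the mass flow input of the CHP module is deliberately left unconnected, so the check reports it. Running such a check before model development starts is one possible way to detect missing or mismatched interfaces early.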
C. Independent model development
The great advantage of distributing simulations is the separation of the simulation models. Different research groups can develop and implement simulation models that represent different systems. The individual models are developed independently of each other and differ both in their representation and in their runtime environment. They can be represented as differential equation systems, discrete automata, etc. The only restriction is the supply of interfaces according to the specified interface list in the information flow layer. Thus, the modelling is done on the level of subsystems without having the coupled problem in mind. Ideally, these models have been created by experts in their respective field and are properly validated and thus recognized. Since the subdomains involved already use numerous established technologies and simulation tools, distributed simulation enables the easy reuse of these simulation models. In addition, these existing models do not need to be adapted to the respective simulator architecture or even reformulated. Each model is developed on its own platform and in such a way that the later unified distributed simulation is carried out by black box operation of the subsystems with minimal computational effort. The simulation modules calculate their output variables as a function of the incoming input variables and the simulation time. In some cases, precalculations during the model development can significantly reduce the calculation effort during the simulation run.

D. Unifying the distributed simulation
After the independent modelling is completed, all simulations are linked together. If the interface specifications are met, no further configuration in the simulation modules is required. The mapping of the simulation outputs and interfaces according to the specification on the information flow layer, as well as further global settings, is handled by the co-simulation framework. When a simulation is started, the different simulation modules exchange information via the specified interfaces. At the semantic level, the operation of the energy system is simulated in order to achieve the specific goals initially set (e.g. low emissions, low costs).
The physical simulation represents a continuous process, which is usually simulated in time steps. Control can also be integrated time-step-based via specified interfaces. Alternatively, a direct event-based addressing of the controlled units can be implemented independently, outside the physical simulation. The co-simulation platform has to ensure a deadlock-free execution sequence of the simultaneously running simulations.

E. Effects of distributing a simulation
Distributing a simulation is accompanied by accuracy issues. One issue concerns the splitting of the equations describing the system behaviour. Representing the whole considered system in a closed system of equations is impractical due to the huge computational effort of solving this system. This can be overcome by splitting the simulated system into weakly coupled subsystems as described in [27] and accepting the accompanying inaccuracies. For this reason, a skilled selection of the subsystem borders is beneficial. If there are unavoidable delays in the simulation, it is possible to capitalize on them as described in [28]. In energy systems, this could become interesting when control systems are considered and the time step size is appropriately small.
Increasing the accuracy by reducing the time step size is possible. However, this is accompanied by a higher computational time demand. Hence, the optimal compromise between the required accuracy and computation time has to be determined for every application case. The usage of various time step sizes is discussed in [20].

V. EXEMPLARY APPLICATION
The approach described in the previous sections is demonstrated in the following using the example of a district simulation with a photovoltaic system, heat storage, electricity storage and mini-CHP (Combined Heat and Power unit). This test case has been selected as a useful addition in the context of large transmission grid simulations, which are usually limited to optimization strategies at a higher dispatch level with long-term optimization. It is comparable to the energy cell simulations of the ESDP [16]. This example can be used to consider effects on the shorter time scale, such as fluctuations in energy demand and availability, energy costs, maintenance planning, or storage capacity. Different operating strategies can be tested as well as the effects of fluctuations in weather conditions. Furthermore, universal interface definitions on the information flow layer for power grids and for heat grids are provided.

A. Application case
A demonstration district with exemplary consumption data is simulated with individual system models representing state-of-the-art technologies. A power grid model based on real-world data is integrated with a generic heat grid model. A photovoltaic installation, a heat storage, an electricity storage and a mini-CHP are connected to each other and to the power and heat grid. In this test case, the mini-CHP and the power storage are operated in order to use a high share of the generated electrical energy and photovoltaic feed-in while simultaneously ensuring the thermal supply.

[Figure 1: block diagram showing the heat grid, heat storage, fuel supply, mini-CHP, three households (two of them with PV), external grid, power grid, and power storage.]
Fig. 1. Distributed simulation case on the semantic layer
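The time-step-based exchange between black-box modules described in Section IV-D, and the delay it introduces (Section IV-E), can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the function `run_cosimulation`, the two toy modules `chp` and `heat_grid`, their fixed temperature offsets, and the start values are hypothetical and do not represent the models or the co-simulation platform used in the paper. In each step, all modules first compute their outputs from the inputs they received in the previous exchange; only afterwards are the outputs transmitted to the connected inputs.

```python
from typing import Callable, Dict, List

Values = Dict[str, float]
# A black-box module: (input values, simulation time in s) -> output values
Module = Callable[[Values, float], Values]


def run_cosimulation(modules: Dict[str, Module],
                     connections: Dict[str, str],
                     initial_inputs: Dict[str, Values],
                     step_s: float,
                     n_steps: int) -> List[Values]:
    """Fixed-step exchange: compute all outputs, then transmit them.

    `connections` maps "module.input" to the "module.output" that feeds it.
    Returns the outputs of all modules for every time step.
    """
    inputs = {name: dict(values) for name, values in initial_inputs.items()}
    history: List[Values] = []
    for k in range(n_steps):
        t = k * step_s
        outputs: Values = {}
        # 1) Every module uses only the inputs from the previous exchange,
        #    so no module has to wait for another one within a step.
        for name, module in modules.items():
            for key, value in module(inputs[name], t).items():
                outputs[f"{name}.{key}"] = value
        # 2) Transmit each output value to its connected input interface.
        for target, source in connections.items():
            module_name, key = target.split(".", 1)
            inputs[module_name][key] = outputs[source]
        history.append(outputs)
    return history


def chp(inputs: Values, t: float) -> Values:
    # Toy model (assumption): heats the return water by a fixed 20 K.
    return {"T_flow": inputs["T_ret"] + 20.0}


def heat_grid(inputs: Values, t: float) -> Values:
    # Toy model (assumption): consumers cool the flow water by a fixed 15 K.
    return {"T_ret": inputs["T_flow"] - 15.0}


history = run_cosimulation(
    modules={"chp": chp, "heat_grid": heat_grid},
    connections={"chp.T_ret": "heat_grid.T_ret",
                 "heat_grid.T_flow": "chp.T_flow"},
    initial_inputs={"chp": {"T_ret": 320.0}, "heat_grid": {"T_flow": 340.0}},
    step_s=900.0,  # 15 minutes
    n_steps=3,
)
for step, outputs in enumerate(history):
    print(step, outputs)
```

With these toy numbers, a change of the CHP flow temperature in one step reaches the return temperature output of the heat grid only one step later. This is the kind of coupling delay that has to be weighed against the time step size.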
B. Applying the interface determination procedure
The interface determination procedure introduced in Section IV is applied to the presented simulation case. First, the simulation scenario is split into the responsibilities of the experts. Figure 1 shows the distribution for the introduced application case. The individual simulation modules are a heat grid simulation including a storage and the heat consumers, a mini-CHP simulation, a photovoltaics and electricity consumer simulation, an electricity storage simulation, and a power grid simulation. Additionally, there is a (not depicted) weather simulation, which provides ambient air temperature and radiation data for the other simulation modules.
Subsequently, the interfaces on the information flow layer are determined. The district heating grid connects a heat supply to multiple consumers with pipes. In the majority of existing district heating systems, consumers use heat exchangers or direct connections to the grid. The power transport in a heat exchanger is mainly driven by the fluid temperatures and the fluid mass flow rates at the primary and secondary side. For the representation of connections, generalized connector input interfaces for the heating grid are shown in Table II. The interfaces on the information flow layer regarding the heating grid and its connectors can be deduced from Table II. The grid itself needs the ambient air temperature (1.1) as an input. This information is processed internally in the heating curve. The heating curve determines the setpoint for the flow temperature in the heating grid. Furthermore, for every connector of the district heating, an input interface describing the mass flow entering the heating grid (2.2) and an input interface containing the corresponding temperature of the incoming fluid (2.1) are defined. Heat source connectors (3) need the return temperature (3.1) from the grid and heat sink connectors (4) the flow temperature (4.1). For heat sources, the mass flow rate (2.2) can alternatively be shifted to the source side (3.2), if the water pump is part of the grid model.

TABLE II (generalized connector input interfaces for the heating grid)
Pos.   Interface                         Unit    Note
(1)    District heating
(1.1)  ambient air temperature           K
(2)    District heating per connector
(2.1)  water temp. in grid direction ¹   K
(2.2)  mass flow rate ²                  kg/s
(3)    Connector heat source
(3.1)  return temperature                K
(3.2)  mass flow rate ³                  kg/s
(3.3)  setpoint(s)                       (data)
(4)    Connector heat sink
(4.1)  flow temperature                  K
(4.2)  setpoint return temperature       (data)
(5)    Gas turbine                               like (3), additionally
(5.1)  ambient air temperature           K
(6)    Heat storage                              like (3) + (4) ⁴, additionally
(6.1)  ambient air temperature           K
¹ return temperature for heat sinks, flow temperature for heat sources
² for sources (2.2) is only used if the pump is placed on the source side
³ (3.2) is only used if the pump is placed on the grid side
⁴ without setpoints in passive operation mode

Since the interface definition is oriented towards the input interfaces, the output of the heating grid is listed as the input interfaces of the connectors (3) and (4). Items (5) and (6) are special instances of (3) and (4). Every heating grid model requires the ambient air temperature (1.1) once and the input interfaces (2) once for each linked connector. Thereby, the design decision is made that the heating grid model requires an input interface (2.2) containing mass flow rate information for each connector. This decision represents a typical operation of a district heating grid. Usually, the grid does not have information about the demand at each consumer. Instead, each consumer increases or decreases the mass flow rate individually. Hence, the demand sets the mass flow rate (2.2). The total mass flow rate of the grid is the sum of all demands and needs to be supplied by the heating grid model.
Setpoints (printed in italic in the original table, unit "(data)") are not really physical quantities, since they represent only transmitted information. This information consists of simple data without any physical quantity. They can either be treated as the physical quantity this information is associated with, or be excluded and treated event-based outside the physical simulation.
The input interface list deduced by way of example for a generic CHP is shown in Table III. Next to the interfaces, the connection to the associated simulator is noted in the last column. The associated simulators have to provide an appropriate output interface. The list of utilized output interfaces is determined from the connection information of the overall input interfaces of all simulators.

TABLE III (input interface list for a generic CHP)
Interface name  Pos.   Type   Unit   Connected to simulator
T_ret           (3.1)  input  K      heat grid
m_flow          (3.2)  input  kg/s   heat grid
P_th_set        (3.3)  input  (kW)   controller
T_amb           (5.1)  input  K      weather

In the presented simulation case, the water pump is shifted from the heat grid simulation module to the mini-CHP simulation module. Consequently, the mass flow input interface has to be moved from the mini-CHP module to the heat grid module. In return, the mini-CHP module receives an additional setpoint for the flow temperature.
The generalized interface specification for steady-state electricity grid simulations with slack bus is shown in Table IV. The electricity grid receives the active and the reactive power for every connected load (1) or generator (2). The only difference between the input interfaces is the sign of the value of the active power. While the active power of a load (1.2) is represented by a positive value, the active power of generators (2.1) is negative. Items (3), (4), and (5) are generators; (6) is an instance of both a generator and a load.
For connecting the specified interfaces on the simulators
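Two rules from this subsection lend themselves to a short worked example: the output interfaces a simulator has to provide follow from the "connected to simulator" column of the input lists, and loads and generators differ only in the sign of their active power. The Python sketch below is a minimal illustration under assumptions. The rows of `chp_inputs` restate Table III, whereas the helper names `required_outputs` and `net_active_power` and the power values are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Rows of Table III: (interface name, position in Table II, unit, connected simulator)
chp_inputs: List[Tuple[str, str, str, str]] = [
    ("T_ret", "3.1", "K", "heat grid"),
    ("m_flow", "3.2", "kg/s", "heat grid"),
    ("P_th_set", "3.3", "kW", "controller"),
    ("T_amb", "5.1", "K", "weather"),
]


def required_outputs(input_rows: List[Tuple[str, str, str, str]]
                     ) -> Dict[str, List[Tuple[str, str]]]:
    """Group the input interfaces by the simulator that has to feed them.

    The result lists, per simulator, the (name, unit) pairs for which that
    simulator must offer an output interface.
    """
    needed: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name, _position, unit, simulator in input_rows:
        needed[simulator].append((name, unit))
    return dict(needed)


def net_active_power(loads_kw: List[float], generators_kw: List[float]) -> float:
    """Sum of active powers with the stated sign convention:
    loads are positive values, generators are negative values."""
    return sum(loads_kw) + sum(generators_kw)


outputs_by_simulator = required_outputs(chp_inputs)
print(outputs_by_simulator)

# Hypothetical values: two loads and one generator feeding in 4 kW.
balance_kw = net_active_power(loads_kw=[3.0, 2.5], generators_kw=[-4.0])
print(balance_kw)
```

In this sketch a positive balance means that the district draws the remainder from elsewhere, for instance via the slack bus. That interpretation is an assumption that follows from the sign convention, not a statement from the paper.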
[Plot, caption not recovered: heat power in kW and state of charge over time in days (0 to 3); series: Demand, CHP P_th, State of charge.]
Fig. 3. Electrical power profiles [series over time in days (0 to 3): Demand, PV, CHP P_el, Storage, Export, State of charge]
[Plot, caption not recovered: temperature in °C and in K over time in days (0 to 3); series: T_amb, T_ret, T_flow real.]
Fig. 5. Mini-CHP water flow and emission profiles [series over time in days (0 to 3): m_flow, CO2, m_fuel]
For the design of energy systems, the verification of developed concepts by simulation is an established approach. Carrying out the associated investigations, including design and operational optimization, in interdisciplinary research cooperations is a future goal.

ACKNOWLEDGEMENT
This work was funded by the Initiative and Networking Fund of the Helmholtz Association in the future topic “Energy Systems Integration” under grant number ZT-0002.

REFERENCES
[1] P. Ekins, “Step changes for decarbonising the energy system: research needs for renewables, energy efficiency and nuclear power,” Energy Policy, vol. 32, no. 17, pp. 1891–1904, 2004.
[2] T. S. Schmidt, M. Schneider, and V. H. Hoffmann, “Decarbonising the power sector via technological change – differing contributions from heterogeneous firms,” Energy Policy, vol. 43, pp. 466–479, 2012.
[3] J.-F. Mercure et al., “The dynamics of technology diffusion and the impacts of climate policy instruments in the decarbonisation of the global electricity sector,” Energy Policy, vol. 73, pp. 686–700, 2014.
[4] T. Gerres, J. Ávila, P. Llamas, and T. San Román, “A review of cross-sector decarbonisation potentials in the european energy intensive industry,” Journal of cleaner production, vol. 210, pp. 585–601, 2019.
[5] E. Lachapelle, R. MacNeil, and M. Paterson, “The political economy of decarbonisation: from green energy ‘race’ to green ‘division of labour’,” New Political Economy, vol. 22, no. 3, pp. 311–327, 2017.
[6] T. Brown, D. Schlachtberger, A. Kies, S. Schramm, and M. Greiner, “Synergies of sector coupling and transmission reinforcement in a cost-optimised, highly renewable European energy system,” Energy, vol. 160, pp. 720–739, 2018.
[7] C. Müller et al., “Integrated planning and evaluation of multi-modal energy systems for decarbonization of Germany,” Energy Procedia, vol. 158, pp. 3482–3487, 2019.
[8] P. Capros et al., “European decarbonisation pathways under alternative technological and policy choices: A multi-model analysis,” Energy Strategy Reviews, vol. 2, no. 3-4, pp. 231–245, 2014.
[9] J. Després, N. Hadjsaid, P. Criqui, and I. Noirot, “Modelling the impacts of variable renewable sources on the power sector: Reconsidering the typology of energy modelling tools,” Energy, vol. 80, pp. 486–495, 2015.
[10] M. Zimmerlin, F. Mueller, M. Wilferth, L. Held, M. R. Suriyah, and T. Leibfried, “Mixed integer nonlinear optimization of coupled power and gas distribution network operation,” in 2018 53rd International Universities Power Engineering Conference (UPEC), 2018, pp. 257–262.
[11] L. Andresen, P. Dubucq, R. Peniche Garcia, G. Ackermann, A. Kather, and G. Schmitz, “Status of the TransiEnt library: Transient simulation of coupled energy networks with high share of renewable energy,” in Proceedings of the 11th International Modelica Conference, Versailles, France, September 21-23, 2015, no. 118, 2015, pp. 695–705.
[12] S. Clegg and P. Mancarella, “Storing renewables in the gas network: Modelling of power-to-gas seasonal storage flexibility in low-carbon power systems,” IET Generation, Transmission & Distribution, vol. 10, pp. 566–575, Feb. 2016.
[17] C. Molitor, S. Groß, J. Zeitz, and A. Monti, “MESCOS—a multienergy system cosimulator for city district energy systems,” IEEE Transactions on Industrial Informatics, vol. 10, no. 4, pp. 2247–2256, Nov. 2014.
[18] S. R. Drauz, C. Spalthoff, M. Würtenberg, T. M. Kneikse, and M. Braun, “A modular approach for co-simulations of integrated multi-energy systems: Coupling multi-energy grids in existing environments of grid planning & operation tools,” in 2018 Workshop on Modeling and Simulation of Cyber-Physical Energy Systems (MSCPES), 2018, pp. 12–17.
[19] F. Marten, A.-L. Mand, A. Bernard, B. K. Mielsch, and M. Vogt, “Result processing approaches for large smart grid co-simulations,” Computer Science - Research and Development, vol. 33, no. 1, pp. 199–205, Feb. 2018.
[20] J. Ruf et al., “Simulation framework for multi-carrier energy systems with power-to-gas and combined heat and power,” in 2018 53rd International Universities Power Engineering Conference (UPEC), 2018, pp. 526–531.
[21] R. Egert, A. Tundis, and M. Mühlhäuser, “On the simulation of smart grid environments,” in Proceedings of the 2019 Summer Simulation Conference, 2019.
[22] T. Blochwitz et al., “The functional mockup interface for tool independent exchange of simulation models,” in Proceedings 8th Modelica Conference, Dresden, Germany, March 20-22, 2011, 2011.
[23] T. Blochwitz et al., “Functional mockup interface 2.0: The standard for tool independent exchange of simulation models,” in Proceedings of the 9th International Modelica Conference, Munich, Germany, September 3-5, 2012, 2012.
[24] M. Krammer et al., “The distributed co-simulation protocol for the integration of real-time systems and simulation environments,” in Proceedings of the 50th Computer Simulation Conference, 2018.
[25] A. Erdmann, H. K. Çakmak, U. Kühnapfel, and V. Hagenmeyer, “A new communication concept for efficient configuration of energy systems integration co-simulation,” in 2019 IEEE/ACM 23rd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), 2019, pp. 235–242.
[26] S. Rohjans, E. Widl, W. Müller, S. Schütte, and S. Lehnhoff, “Gekoppelte Simulation komplexer Energiesysteme mittels MOSAIK und FMI,” at – Automatisierungstechnik, vol. 62, no. 5, pp. 325–336, 2014.
[27] P. Palensky, A. A. Van Der Meer, C. D. López, A. Joseph, and K. Pan, “Cosimulation of intelligent power systems: Fundamentals, software architecture, numerics, and coupling,” IEEE Industrial Electronics Magazine, vol. 11, no. 1, pp. 34–50, Mar. 2017.
[28] C. Michel and P. Siron, “Delay-based distribution and optimization of a simulation model,” in 2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), 2018, pp. 21–28.
[29] D. Müller, M. Lauster, A. Constantin, M. Fuchs, and P. Remmen, “AixLib – an open-source Modelica library within the IEA-EBC Annex 60 framework,” in Proceedings of the CESBP Central European Symposium on Building Physics and BauSIM 2016, 2016, pp. 3–9.
[30] B. van der Heijde et al., “Dynamic equation-based thermo-hydraulic pipe model for district heating and cooling systems,” Energy Conversion and Management, vol. 151, pp. 158–169, 2017.
[31] T. Krummrein, M. Henke, and P. Kutne, “A highly flexible approach on the steady-state analysis of innovative micro gas turbine cycles,” Journal of Engineering for Gas Turbines and Power, vol. 140, no. 12, Dec. 2018.
[32] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, “MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, Feb. 2011.
[13] M. Geimer, T. Krüger, and P. Linsel, “Co-Simulation, gekoppelte Simu-
lation oder Simulatorkopplung?” O+P Ölhydraulik und Pneumatik, no. [33] DWD Climate Data Center (CDC), “Historical 10-minute station obser-
11-12, pp. 572–576, 2006. vations of pressure, air temperature (at 5cm and 2m height), humidity,
[14] F. Schloegl, S. Rohjans, S. Lehnhoff, J. Velasquez, C. Steinbrink, dew point, solar incoming radiation, longwave downward radiation,
and P. Palensky, “Towards a classification scheme for co-simulation sunshine duration, mean wind speed and wind direction for Germany,
approaches in energy systems,” in 2015 International Symposium on version V1,” last accessed: May 26th, 2020.
Smart Electric Distribution Systems and Technologies (EDST), 2015, [34] P. Remmen, M. Lauster, M. Mans, M. Fuchs, T. Osterhage, and
pp. 516–521. D. Müller, “TEASER: an open tool for urban energy modelling of
[15] C. Steinbrink et al., “Simulation-based validation of smart grids–status building stocks,” Journal of Building Performance Simulation, vol. 11,
quo and future research trends,” in International Conference on Indus- no. 1, pp. 84–98, 2018.
trial Applications of Holonic and Multi-Agent Systems, 2017, pp. 171– [35] N. Pflugradt and U. Muntwyler, “Synthesizing residential load profiles
185. using behavior simulation,” Energy Procedia, vol. 122, pp. 655 – 660,
[16] S. Raths et al., “The energy system development plan (ESDP),” in 2017, cISBAT 2017 International Conference – Future Buildings &
International ETG Congress 2015; Die Energiewende – Blueprints for Districts – Energy Efficiency from Nano to Urban Scale.
the new energy age, 2015, pp. 267–274.
2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
Abstract—An Unmanned Aerial Vehicle (UAV) is a flying device characterized by the absence of a pilot on board. It can relieve the human operator in many situations and applications thanks to its rapid deployment and its ability to act quickly. To carry out its tasks, every UAV/Drone is equipped with a set of on-board sensors specific to each task. One application field is precision agriculture, where a UAV equipped with on-board cameras can perform a detailed analysis of the health status of the plants and intervene promptly if needed. In this paper, coordination protocols applied to the problem of controlling a swarm of UAVs/Drones against parasite attacks on crops are analysed, studying different approaches in order to measure their performance and costs. One of the problems with these devices is the limited quantity of fuel and pesticide. A possible approach to this issue is to ask other UAVs/Drones for help in order to destroy the parasites completely. The idea is to apply the concepts of the bio-inspired approach to the recruitment protocols and to provide a performance evaluation that shows the merit of the proposal.
Index Terms—UAV, Drone, Precision Agriculture, Coordination Protocol, Bio-Inspired
I. INTRODUCTION
Precision agriculture was born in the United States of America in the early nineties and is also referred to as Precision Farming or Site Specific Farming Management. Its birth and evolution have been favored and supported by the potential deriving from the widespread application of new technical solutions to the primary sector. This practice consists in applying technologies, principles and strategies for the spatial and temporal management of the variability associated with aspects of agricultural production, in relation to the real needs of the plot [1].
The application of this innovative approach requires an in-depth knowledge of the physical, chemical and biological characteristics of the fields, as well as their mapping and storage, so that they can then be managed by computer control of the crop operations, placed on board the machines [2]. The environmental benefits derive from a more targeted use of chemicals, better efficiency or, in the case of pesticides, the reduced development of resistance to various active ingredients. All this has effects on water quality and the reduction of water consumption, on the quality of soil and air, on climate mitigation and on the energy issue.
As far as precision agriculture is concerned, these new flying devices make it possible to follow plant growth and to intervene in case of parasite infection. They can also fly at a specific height in order to acquire images without interfering with satellites [3], [4].
The possibility of creating a team of UAVs/Drones is a big advantage in this application domain, because UAVs can collaborate to accomplish the assigned task even though each of them carries a limited amount of battery energy and, for example, a limited amount of pesticide when farmers have to fight against parasites. The cooperation between these devices is therefore an important aspect, and it is important to study coordination techniques able to create groups of UAVs that collaborate, paying attention to energy saving [5] and energy harvesting [7], [8]. Other important topics concerning UAVs/Drones are the object of research in the scientific community. A widely studied topic is the possibility of providing coverage in particular scenarios (such as emergency situations) [6], together with bandwidth management based on mobility prediction of the users and appropriate call admission control [9], [10], while limiting packet loss [11], [12] in the network. This type of device can be used in cooperation with satellite platforms [13], [14] in order to give a more ubiquitous connection, or with a VANET in order to support networks of vehicles [15]. Different routing techniques can also be used in these new networks, based on new approaches such as opportunistic mechanisms [16], [17]. In this paper, two recruiting protocols are compared: a classical flooding mechanism and a bio-inspired approach. After briefly explaining the protocol functionality, the results obtained by comparing the two approaches in the Precision Agriculture domain are presented. The paper is organized as follows: related works are presented in Section II; a high-level overview of coordination protocols is given in Section III; two different coordination strategies and their comparison are detailed in Section IV; finally, conclusions are presented in Section V.
II. RELATED WORK
In the following, some works that deal with coordination issues in UAV platforms are reviewed.
In [18], the authors deal with the coordinated movement of swarms of UAVs/Drones using mobile networks. In [19], the authors propose a biologically inspired mechanism to coordinate
978-1-7281-7343-6/20/$31.00 ©2020 IEEE propose a biologically-inspired mechanisms to coordinate
UAVs performing target search with imperfect sensors. In [20], the authors' proposal is based on a stigmergic approach that envisages depositing digital pheromone on those locations where UAVs sense a potential target. In [21], the authors propose a control strategy for a group of UAVs that uses a bio-inspired approach to create a robust control and coordination strategy. The paper [22] proposes a solution to the non-linear constrained optimization problem of UAV motion coordination, in which a reference UAV can be seen as the leader of the group.
A Route Reply (RRep) message is sent back towards the source when it arrives on a node that knows the destination. This behavior is depicted in Fig. 1 through the time epochs T1, T2, and T3.
A. Flooding-based recruitment
When a UAV needs help, it sends a broadcast message, starts a timeout and waits for a UAV within Wi-Fi range to receive this message and, if available, send a reply. When the timeout expires, the requesting UAV considers the replies and chooses the UAV with the maximum pesticide level or, for equal pesticide, the one with the maximum energy level or, for equal pesticide and energy, the nearest UAV. In case of equal pesticide, energy and distance, the first answering UAV is chosen. The UAV that needs help then sends a recruiting message to the chosen UAV and returns to the recharging base. If the UAV does not receive any response to its help request, it stores its coordinates in order to come back after recharging.
B. Bio-inspired recruitment
The bio-inspired technique is based on an Ant Colony Optimization (ACO) approach for performing recruitment: each node periodically sends Forward ANT (FANT) messages. The probability of sending the FANT message from node i to node j is computed with the following formula:
$$p_{i,j} = \frac{\pi_{i,j}^{\alpha} \cdot \eta_{i,j}^{\beta}}{\sum_{k \in K} \pi_{i,k}^{\alpha} \cdot \eta_{i,k}^{\beta}} \tag{1}$$
where $\pi_{i,j}$ is the pheromone of the routing table entry whose destination is the destination to be reached and whose next hop is node j, $\eta_{i,j}$ is the local heuristic on the connection between node i and node j, represented by a random number, $\alpha$ is the incidence of the pheromone on the choice, $\beta$ is the incidence of the heuristic on the choice, and K is the set of nodes at a distance of 1 hop from i that are able to reach the destination. Once the FANT reaches the destination node, the latter sends a new message called Backward ANT (BANT) on the reverse path.
The reinforcement of the pheromone for a given destination takes place when the BANT packet crosses a node. In particular, at the BANT crossing the pheromone is strengthened in the routing table entry whose destination is the node that sent the packet and whose next hop is the node from which the packet is received.
Evaporation of the pheromone occurs periodically as follows (to manage those paths that are no longer crossed by packets):
$$\pi_{i,j} = (1 - \rho) \cdot \pi_{i,j-1} \tag{2}$$
where $\pi_{i,j-1}$ is the pheromone present before evaporation in the routing table of node i towards a known destination passing through the next hop j, and $\rho$ is the evaporation coefficient of the pheromone, a value between 0 and 1.
C. Recruiting Protocol Comparison
Unlike previous work [27], which presented a comparative analysis of reactive flooding versus a link state approach in a precision agriculture domain, this contribution compares a classical flooding mechanism with a recruiting protocol based on a bio-inspired approach, considering consumed energy, killed parasites and protocol overhead metrics. To perform this comparative analysis, a UAV simulator designed for evaluating UAV performance in the Precision Agriculture domain has been used [26].
Fig. 2 shows the number of killed parasites. The bio-inspired approach is able to kill a higher number of parasites than the reactive flooding mechanism. This is due to the capacity of the bio-inspired approach to perform a better UAV recruitment thanks to FANT and BANT messages, unlike reactive flooding, which floods messages over the whole network.
Fig. 2. Number of killed parasites comparison
Fig. 3 compares the two approaches in terms of the number of recruiting requests. The bio-inspired approach sends a higher number of UAV recruiting requests. This means that more UAVs are recruited and, in turn, that a greater number of parasites is killed in the considered area.
Fig. 4 shows the trend of the consumed energy in both approaches. As can be observed in the figure, the bio-inspired technique consumes more energy than the reactive flooding approach. Reactive flooding sends a smaller quantity of data than the bio-inspired approach, so its energy consumption is lower, but it results in a lower number of killed parasites.
V. CONCLUSION
This paper presents a comparative analysis of two different recruiting approaches for coordinating UAVs in a Precision Agriculture domain in the fight against parasites. A simulator specifically designed for this application context has been used. The simulation results showed that the bio-inspired approach performs better than the reactive flooding one, being able to kill a greater number of parasites and to better exploit the recruitment of other UAVs, even if it presents a drawback: a greater energy consumption.
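As an illustration of the two recruitment rules compared in this paper, the following minimal Python sketch implements the tie-breaking choice of the flooding-based scheme, the forwarding probability of Equation (1) and the evaporation step of Equation (2). It is only a sketch under stated assumptions, not the simulator code of [26]: the `Reply` record, the function names and the dictionary-based routing table are hypothetical, and the heuristic values are passed in explicitly instead of being drawn at random.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """One reply to a help request (all field names are illustrative assumptions)."""
    uav_id: str
    pesticide: float  # remaining pesticide level
    energy: float     # remaining energy level
    distance: float   # distance to the requesting UAV
    arrival: int      # order in which the reply arrived (0 = first)


def choose_helper(replies):
    """Flooding rule: most pesticide, then most energy, then nearest, then first reply."""
    if not replies:
        return None  # no reply before the timeout expired
    return min(replies, key=lambda r: (-r.pesticide, -r.energy, r.distance, r.arrival))


def forward_probabilities(pheromone, heuristic, alpha, beta):
    """Eq. (1): probability of forwarding a FANT to each one-hop neighbour j in K."""
    weights = {j: pheromone[j] ** alpha * heuristic[j] ** beta for j in pheromone}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def evaporate(pheromone, rho):
    """Eq. (2): scale every pheromone entry by (1 - rho), with rho between 0 and 1."""
    return {j: (1.0 - rho) * value for j, value in pheromone.items()}
```

For example, with replies from three UAVs where two share the highest pesticide and energy levels, `choose_helper` returns the nearer of the two; calling `evaporate` periodically lets the entries of paths that are no longer reinforced by BANT messages decay towards zero.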
Fig. 3. Number of recruiting request comparison
Fig. 4. Consumed energy comparison
REFERENCES
[1] Pierce, F. J. and Nowak, P., “Aspects of precision agriculture,” Advances in agronomy, 67, pp.1–85, Elsevier (1999).
[2] Faiçal, B. S., Costa, F. G., Pessin, G., Ueyama, J., Freitas, H., Colombo, A., ... and Braun, T., “The use of unmanned aerial vehicles and wireless sensor networks for spraying pesticides,” Journal of Systems Architecture, 60(4), pp.393–404, (2014).
[3] Primicerio, J., Di Gennaro, S. F., Fiorillo, E., Genesio, L., Lugato, E., Matese, A., and Vaccari, F. P., “A flexible unmanned aerial vehicle for precision agriculture,” Precision Agriculture 13(4), pp.517–523, (2012).
[4] Pederi, Y. and Cheporniuk, H., “Unmanned aerial vehicles and new technological methods of monitoring and crop protection in precision agriculture,” IEEE International Conference Actual Problems of Unmanned Aerial Vehicles Developments (APUAVD), pp.298–301, (2015).
[5] De Rango, F. and Tropea, M., “Swarm intelligence based energy saving and load balancing in wireless adhoc networks,” Proceedings of the 2009 workshop on Bio-inspired algorithms for distributed systems, pp.77–84, ACM (2009).
[6] De Rango, F., Tropea, M., Fazio, P., “Bio-inspired routing over FANET in emergency situations to support multimedia traffic,” Proceedings of the ACM MobiHoc workshop on innovative aerial communication solutions for FIrst REsponders network in emergency scenarios, pp.12–17, (July 2019).
[7] Nguyen, T. N., Duy, T. T., Luu, G.-T., Tran, P. T., and Voznak, M., “Energy harvesting-based spectrum access with incremental cooperation, relay selection and hardware noises,” (2017).
[8] Nguyen, H.-S., Bui, A.-H., Do, D.-T., and Voznak, M., “Imperfect channel state information of af and df energy harvesting cooperative networks,” China Communications 13(10), pp.11–19, (2016).
[9] Fazio, P., Tropea, M., Sottile, C., Marano, S., Voznak, M., and Strangis, F., “Mobility prediction in wireless cellular networks for the optimization of call admission control schemes,” IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp.1–5, (2014).
[10] Fazio, P., Tropea, M., Veltri, F., and Marano, S., “A novel rate adaptation scheme for dynamic bandwidth management in wireless networks,” IEEE 75th Vehicular Technology Conference (VTC Spring), pp.1–5, (May 2012).
[11] Frnda, J., Voznak, M., and Sevcik, L., “Impact of packet loss and delay variation on the quality of real-time video streaming,” Telecommunication Systems 62, pp.265–275, (Jun 2016).
[12] Voznak, M., Kovac, A., and Halas, M., “Effective packet loss estimation on voip jitter buffer,” International Conference on Research in Networking, Springer, Berlin, Heidelberg, pp.157–162, (2012).
[13] De Rango, F., Tropea, M., Santamaria, A. F., and Marano, S., “An enhanced qos cbt multicast routing protocol based on genetic algorithm in a hybrid hap-satellite system,” Comput. Commun. 30, pp.3126–3143, (Nov. 2007).
[14] De Rango, F., Tropea, M., and Marano, S., “Integrated services on high altitude platform: Receiver driven smart selection of hap-geo satellite wireless access segment and performance evaluation,” International Journal of Wireless Information Networks 13, pp.77–94, (Jan 2006).
[15] Fazio, P., Tropea, M., Sottile, C., and Lupia, A., “Vehicular networking and channel modeling: a new Markovian approach,” 12th Annual IEEE Consumer Communications and Networking Conference (CCNC), pp.702–707, (Jan 2015).
[16] Socievole, A., De Rango, F., and Coscarella, C., “Routing approaches and performance evaluation in delay tolerant networks,” Wireless Telecommunications Symposium (WTS), pp.1–6, (April 2011).
[17] Socievole, A., Yoneki, E., De Rango, F., Crowcroft, J., “Opportunistic message routing using multi-layer social networks,” Proceedings of the 2nd ACM workshop on High performance mobile opportunistic systems, pp.39–46, (Nov 2013).
[18] de Souza, B. J. O. and Endler, M., “Coordinating movement within swarms of uavs through mobile networks,” IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), pp.154–159, (2015).
[19] Alfeo, A. L., Cimino, M. G., De Francesco, N., Lazzeri, A., Lega, M., and Vaglini, G., “Swarm coordination of mini-uavs for target search using imperfect sensors,” Intelligent Decision Technologies (Preprint), pp.1–14, (2018).
[20] Cimino, M. G., Lazzeri, A., and Vaglini, G., “Combining stigmergic and flocking behaviors to coordinate swarms of drones performing target search,” 6th International Conference on Information, Intelligence, Systems and Applications (IISA), pp.1–6, IEEE (2015).
[21] Zelenka, J. and Kasanicky, T., “Outdoor uav control and coordination system supported by biological inspired method,” 23rd International Conference on Robotics in Alpe-Adria-Danube Region (RAAD), pp.1–7, IEEE (2014).
[22] Meng, W., Xie, L., and Xiao, W., “Communication aware uav motion coordination for source localization and tracking,” Proceedings of the 32nd Chinese Control Conference, pp.7451–7455, IEEE (2013).
[23] Bekmezci, I., Sahingoz, O. K., and Temel, S., “Flying ad-hoc networks (fanets): A survey,” Ad Hoc Networks 11(3), pp.1254–1270, (2013).
[24] Husain, A. and Sharma, S. C., “Comparative analysis of location and zone based routing in vanet with ieee 802.11p in city scenario,” International Conference on Advances in Computer Engineering and Applications, pp.294–299, (March 2015).
[25] Jung, E. S., and Vaidya, N. H., “Power aware routing using power control in ad hoc networks,” ACM SIGMOBILE Mobile Computing and Communications Review, 9(3), pp.7–18, (2005).
[26] De Rango, F., Palmieri, N., Tropea, M., and Potrino, G., “Uavs team and its application in agriculture: A simulation environment,” SIMULTECH 2017, pp.374–379, (2017).
[27] Tropea, M., Santamaria, A. F., De Rango, F., and Potrino, G., “Reactive flooding versus link state routing for fanet in precision agriculture,” 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp.1–6, (2019).
Abstract—The paper presents a new approach and a related software environment for the parallel simulation of swarms of autonomous robots in real time. The software environment has been developed for the model-based analysis of algorithms to control large swarms of distributed autonomous mobile robots communicating over an unreliable and capacity-restricted wireless network. It includes a physical simulation of static obstacles, dynamic obstacles with scriptable movement, soil condition, active jammers, static and dynamic link obstacles with configurable damping as well as noise floors. The simulated ground-based mobile robots use control particle belief propagation (C-PBP) as a randomized and sample-based model predictive closed-loop controller in combination with cost functions to evaluate the situations. We emphasize where the use of shared memory parallelism is beneficial and which inaccuracies in computations are acceptable to increase performance without losing realism.
Index Terms—Real-Time Simulations, Parallel Simulations, Simulation-Based Virtual Environments, Swarm Robotics
I. INTRODUCTION
We present a novel control algorithm technique for robot swarms to establish mobile ad hoc networks in disaster areas and validate it with a real-time simulation based on radio signal propagation physics. In application areas like industrial production, disaster management or exploration of unknown terrains, autonomous robots are increasingly important [1]. Usually, swarms of robots jointly solve some problem or perform some tasks. Especially in disaster situations, e.g. after an earthquake or a serious industrial accident, tasks that are dangerous for humans, like terrain exploration, rescue operations or setting up a communication network, are extremely important and have to be done as soon as possible. Robot swarms are in principle well suited to perform these operations. Ideally, a human operator defines the general goal of the mission and the robots perform their individual tasks on their own, which implies that they have to synchronize during operation and have to make decisions autonomously, often with limited knowledge about their environment and the state of other robots in the swarm.
The outlined scenario is challenging because the control software of the robots has to be tested thoroughly, which usually cannot be done in the real environment after the disaster. Thus, virtual environments have to be built and the control software has to be deployed in these environments. This means a real-time simulation environment is necessary to provide realistic test conditions.
Contribution of the Paper: The simulation tool Gazebo [2] of the Robot Operating System (ROS) framework [3] is broadly used. It is used for Defense Advanced Research Projects Agency (DARPA) [4] and National Aeronautics and Space Administration (NASA) challenges in robotics, but currently lacks the capability to simulate larger swarms, physical transmission influences and huge areas of up to 25 km² in real time. It is more focused on a broad range of different robotics applications and supports many different systems and sensors. In contrast, the tool presented in this paper is intended as an environment for the development and analysis of control algorithms for swarms and makes use of parallelization techniques for selected aspects of physics and control technology. On the other hand, we focus on realistic assumptions with dynamics in a continuous world, which differs significantly from previous research on swarm intelligence in grid-based worlds (see [5] and [6]). Currently three basic missions, which can be combined in any way, are implemented. The first two missions are the exploration of dynamic unknown areas and the surveillance, as well as escort, of vehicles at a given distance. The last mission is establishing a Mobile Ad Hoc Network (MANET) to support rescue teams.
Structure of the Paper: In the second Section we present the basic concept of control theory which the autonomous agents use to find their own goal with respect to the behavior of the agents in the vicinity. Afterwards, the main parts of the simulation and its mathematical background are outlined. Section IV explains how the previous simulation parts act together and synchronize as efficiently as possible while still keeping a realistic and correct view of the entire system. Then, detailed examples and their analysis follow, which lead to the conclusions in Section VI.
II. MODEL PREDICTIVE CONTROL OF ROBOTS
Each robot agent is controlled by its own model predictive controller (MPC) [7], which creates trajectories based on information gathered from the sensors of the robot and acquired via communication with other agents. The MPC consists of three main parts (see Fig. 1). First, a model of the controllable part of the system is needed, which is the robot itself. It is used to estimate and predict system states. These states are rated through absolute cost functions aggregated to a cost value that represents the usefulness of
the system state in space and time. The cost functions and constraints depend on the mission targets, on configurations made by an operator and on the states of the other robots. The third part is the optimization, which minimizes the cost value by generating new trajectories with the help of the model. After the solution has been found, the corresponding control vector is set as an input of the real system. The evaluation of the dynamic environment inside the cost function and its influence on the robots is classified into three basic categories. The first category contains all static obstacles such as walls, barriers and obstructions. Those objects can be detected by the robots or can be preloaded based on detailed maps. The second group consists of dynamic obstacles that can move around. They are separated into predictable obstacles, like other robots or vehicles inside a communication group, and obstacles with unknown behavior. The last category contains tiles of explored area that describe the accessibility of the terrain. As an optimizer, the control particle belief propagation (C-PBP) [8] algorithm is used. It combines parallel, locally refined guided random walkers using discrete sampling of cost functions and knowledge transfer between optimization steps.
Fig. 1. Four components of the robot: communication module, model predictive controller (MPC), actuator and sensor
III. A VIRTUAL ENVIRONMENT FOR ROBOT SWARMS
The model predictive control core as well as the object handling structures and communication interfaces are written as an independent kernel. Hence, they can be used in simulations and can be deployed on real robots. However, this paper focuses on the simulation and evaluation of virtual environments and robot swarms to test swarm control algorithms. Therefore, we describe the four main parts of the simulation.
A. Simulation Of The Environment
The environment is represented by a set of cost functions. In a completely unknown environment there exists in principle a global minimum as the desired working state and often an optimal path towards it, but both are unknown. Thus, a rating relative to the optimum is impossible. Consequently, the environment is modeled with absolute costs referring to time and space. Due to the use of C-PBP, there are almost no restrictions concerning these functions. They just have to be defined on the entire area, but in the worst case they may depend on every parameter and on randomness, which can cause a high complexity. Additionally, it has to be taken into account that the actual cost function and the representation that the robot has of it usually differ. The robots have three categories to evaluate the environment: static obstacles, dynamic obstacles and terrain accessibility. Most of the objects in the first category can be represented by locally limited distance-based functions to prevent collisions:
$$d = \lVert \mathbf{p}_k - \mathbf{x}_k \rVert_{2D} = \sqrt{(\Delta x)^2 + (\Delta y)^2}, \quad \text{with } \mathbf{p}_k = \begin{bmatrix} x'_k \\ y'_k \\ \theta'_k \end{bmatrix}, \; \mathbf{x}_k = \begin{bmatrix} x_k \\ y_k \\ \theta_k \end{bmatrix} \tag{1}$$
Let $d \in \mathbb{R}^+_0$ denote the distance between an object position $\mathbf{p}_k \in \mathbb{R}^3$ and a system state $\mathbf{x}_k \in \mathbb{R}^3$ for a discrete time step $k$. Each position and system state consists of a two-dimensional position $(x_k, y_k)$ and an orientation $\theta_k$. These functions can represent walls, trees, lakes, rivers and social behavior (see [9] and Fig. 2). More complex objects like surveillance towers with a bounded angle of view need the whole system state because, based on the orientation and range of view, the view direction and the field of view have to be calculated. Then, the robot maps a special cost function to this area to avoid entering it [10].
[Plot: costs $l_{social}(|d|)$ over $d$]
Fig. 2. Robot social function with desired minimum costs $L_{min} = -10$, maximum costs $L_{max} = 20$ and a desired distance $d_{desired} = 10$ for the minimum (see [11])
Vehicles and time-dependent structures belong to the second category. Independently of their actual behavior, however, robots only distinguish cooperative objects, which transmit their planned movement or behavior (for example, as a trajectory or as a velocity vector), from non-cooperative objects, which are treated as unknown static objects. As a possible improvement, movement estimation can be implemented for those objects. We have to distinguish between the view of the simulation, which captures the complete system state, and the information that can be gathered from the sensors of a robot, which is much more limited.
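The planar distance of Equation 1 and a locally limited cost built on top of it can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the linear cost shape and the parameters `max_cost` and `radius` are hypothetical and are not taken from the paper, which does not give the concrete form of its obstacle cost functions.

```python
import math


def planar_distance(p, x):
    """Equation 1: Euclidean distance between the (x, y) parts of two (x, y, theta) tuples."""
    dx = p[0] - x[0]
    dy = p[1] - x[1]
    return math.hypot(dx, dy)  # the orientation theta does not enter the distance


def obstacle_cost(d, max_cost=20.0, radius=5.0):
    """Hypothetical locally limited cost: max_cost at d = 0, decreasing linearly to 0 at radius."""
    if d >= radius:
        return 0.0  # outside the local region the obstacle contributes no cost
    return max_cost * (1.0 - d / radius)
```

A robot at state `(0.0, 0.0, 1.0)` and an obstacle at `(3.0, 4.0, 0.5)` are 5 units apart, so with the assumed radius of 5 the obstacle no longer contributes any cost. Because the cost vanishes outside the radius, only nearby objects have to be evaluated for each sampled trajectory.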
Equation 2 presents the cost function for the last category. It depends on the control vector $u_k$ and on the terrain accessibility $G(x_{k-1})$, which is based on the system state $x_{k-1}$.

\[ l_{\mathrm{movement}}(x_{k-1}, u_k) = v_k^2 \cdot G(x_{k-1}) \cdot \rho_T + \dot{\theta}_k^2 \cdot G(x_{k-1}) \cdot \rho_R, \quad \text{with } u_k = \begin{bmatrix} v_k \\ \dot{\theta}_k \end{bmatrix} \tag{2} \]

Let $v$ and $\dot{\theta}$ denote the translation and rotation velocity of the robot, respectively, and let $\rho_T$ and $\rho_R$ be parameters to adjust the influence for different types of robots. Hence, the equation describes how much effort/energy is needed to move on the current terrain. For example, movement on streets requires less energy than movement on mud or rocks. This has an influence on the battery capacity of the robot and therefore also on the decision how to continue the mission. This information is based on a tile map of the terrain and affects only the movement.

B. Simulation of Robot Dynamics
A nonholonomic system model is used as dynamic model for the state prediction inside the MPC. This model covers tracked vehicles as well as walking mobile robots.

\[ x_k = f_k(x_{k-1}, u_k) = \underbrace{\begin{bmatrix} x_k \\ y_k \\ \theta_k \end{bmatrix}}_{x_k} = \underbrace{\begin{bmatrix} x_{k-1} \\ y_{k-1} \\ \theta_{k-1} \end{bmatrix}}_{x_{k-1}} + \begin{bmatrix} v_k \cdot \sin(\theta_{k-1}) \\ v_k \cdot \cos(\theta_{k-1}) \\ \dot{\theta}_k \end{bmatrix} \cdot \Delta k, \quad \text{with } u_k = \begin{bmatrix} v_k \\ \dot{\theta}_k \end{bmatrix} \tag{3} \]

Let $f_k$ denote in general a discrete nonlinear map that maps the current system state $x_{k-1}$ and control vector $u_k$ to a successor state $x_k$. The system state $x_k$ consists of a two-dimensional position $(x_k, y_k)$ and an orientation $\theta$. The time delta between the current state and the successor is denoted by $\Delta k$. Each robot is equipped with a light detection and ranging sensor (LIDAR) to detect obstacles in vicinity.

C. Simulation Of Communication
A necessary condition for autonomous agents to form a swarm is the information transfer via communication. The robots are equipped with configurable wireless network interfaces to achieve emergence based on distributed and shared knowledge. Therefore, we created a network simulation model based on physical wave propagation including signal power dissemination depending on specified frequencies, free-space path loss and obstacles, but it does not cover reflections yet. Moreover, noise levels of the environment and active signal jammers can be placed. This realistic network simulation serves as an analysis tool and supports developing strategic behavior for signal changes and losses within groups of swarm agents. Furthermore, one of the basic missions is to create a MANET in disaster areas. Thus, it is mandatory to predict the connection status and quality. Each message gets a unique identification number to enable a message filter, and slightly modified Lamport clocks [12] are used to create a logical as well as temporal order on the messages.

Signals are transmitted via Orthogonal Frequency-Division Multiplexing (OFDM), as in Long-Term Evolution (LTE), and are divided into subcarriers. Hence, a channel width of $\Delta f_c$, a carrier distance of $\Delta f_t$ and protection distances can be specified to calculate the subcarriers. The carrier distance determines the symbol time [13]:

\[ t_s = \frac{1}{\Delta f_t} \tag{4} \]

Each subcarrier is modulated with one symbol per symbol time. The symbol width $w_{\mathrm{symbol}}$ depends on the modulation method. The simulation environment supports Quadrature Phase-Shift Keying (QPSK), 16 Quadrature Amplitude Modulation (QAM), 64 QAM and 256 QAM. To increase bandwidth, Multiple-Input Multiple-Output (MIMO) [14] methods are available. Based on this maximum, the Received Signal Strength Indicator (RSSI) and the Signal to Noise Ratio (SNR) are used to estimate the real transmission rate.

\[ \mathrm{RSSI} = \underbrace{P_t + G_t + G_r}_{\text{transmission power}} + \underbrace{20 \cdot \log_{10}\left(\frac{c}{f \cdot 4\pi \cdot d}\right)}_{\text{free space damping}} - \underbrace{\sum_i P^i_{\mathrm{obstacle}}}_{\text{obstacle damping in line of sight}} \tag{5} \]

The extended Friis Equation 5 [15] calculates the RSSI of the receiver in decibels relative to one milliwatt (dBm). Let $P_t$ denote the sending power, and let $G_t$ and $G_r$ denote the antenna gain of the sender and receiver, respectively. The transmission power is reduced by free space damping, which depends on the speed of light $c$, the sending frequency $f$ and the distance $d$ between the stations. For calculating the SNR, a noise floor power $P_{\mathrm{noise}}$ has to be specified, measured or calculated by Johnson–Nyquist noise (see [16]).

\[ \mathrm{SNR} = \mathrm{RSSI} - P_{\mathrm{noise}} \tag{6} \]

For example, with a channel width $\Delta f_c = 10$ MHz, a carrier distance $\Delta f_t = 15$ kHz, a 256 QAM modulation and 2x2 MIMO, a maximum transmission rate of 144 Mbit/s can be achieved. However, with SNR = 20 dB and RSSI = −69 dBm only 144 Mbit/s · 2/3 = 96 Mbit/s remain [10].

D. Operator
The simulation allows the operator to configure all parameters of the entire simulation including the robot physics, its planning technique and communication module bandwidth, antenna gain and frequency. Moreover, the behavior of objects can be scripted. The description is done by YAML files, so no programming knowledge is needed. During runtime, commands can be sent by convoy vehicles, which are steered by the operator, to change the missions of the swarm at any time.
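The state update of Equation 3 and the movement cost of Equation 2 can be illustrated with a minimal Python sketch. The names `State`, `step` and `movement_cost` are hypothetical and are not taken from the simulator. The assignment of sine and cosine to the two position components follows Equation 3 as printed, and the terrain accessibility G is passed in as a plain number instead of being looked up in a tile map.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """System state x_k: two-dimensional position and orientation."""
    x: float
    y: float
    theta: float


def step(state: State, v: float, theta_dot: float, dt: float) -> State:
    """One discrete step of the nonholonomic model (Equation 3).

    v and theta_dot form the control vector u_k, dt is the time delta.
    """
    return State(
        x=state.x + v * math.sin(state.theta) * dt,
        y=state.y + v * math.cos(state.theta) * dt,
        theta=state.theta + theta_dot * dt,
    )


def movement_cost(v: float, theta_dot: float, accessibility: float,
                  rho_t: float, rho_r: float) -> float:
    """Movement cost of Equation 2 for one control vector.

    accessibility is G(x_{k-1}); rho_t and rho_r weight translation
    and rotation for a given robot type.
    """
    return v ** 2 * accessibility * rho_t + theta_dot ** 2 * accessibility * rho_r


if __name__ == "__main__":
    start = State(0.0, 0.0, 0.0)
    # Illustrative values only: 1 m/s forward, no rotation, 0.5 s step.
    print(step(start, v=1.0, theta_dot=0.0, dt=0.5))
    print(movement_cost(2.0, 1.0, accessibility=3.0, rho_t=0.5, rho_r=0.25))
```

Keeping the state immutable makes it straightforward to roll out many candidate trajectories from the same start state, which is what a sampling-based planner needs.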
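The relation between carrier distance, symbol time (Equation 4) and the maximum transmission rate in Section III-C can be checked with a short Python sketch. The function names are hypothetical. The total protection distance of 1 MHz is an assumption chosen here so that the example reproduces the 144 Mbit/s quoted in the text; the paper does not state the protection distances, and the sketch ignores cyclic prefix and control overhead.

```python
# Bits carried by one symbol for the supported modulation methods.
BITS_PER_SYMBOL = {"QPSK": 2, "16QAM": 4, "64QAM": 6, "256QAM": 8}


def symbol_time(carrier_distance_hz: int) -> float:
    """Symbol time t_s = 1 / carrier distance (Equation 4), in seconds."""
    return 1.0 / carrier_distance_hz


def max_rate_bit_per_s(channel_width_hz: int, carrier_distance_hz: int,
                       guard_hz: int, modulation: str, mimo_streams: int) -> int:
    """Upper bound on the transmission rate in bit/s.

    guard_hz is the total protection distance (an assumed value, see above).
    Each subcarrier carries one symbol per symbol time, so the symbol rate
    per subcarrier equals the carrier distance in Hz.
    """
    subcarriers = (channel_width_hz - guard_hz) // carrier_distance_hz
    symbols_per_second = carrier_distance_hz
    return subcarriers * BITS_PER_SYMBOL[modulation] * symbols_per_second * mimo_streams


if __name__ == "__main__":
    rate = max_rate_bit_per_s(10_000_000, 15_000, 1_000_000, "256QAM", 2)
    print(f"t_s = {symbol_time(15_000) * 1e6:.1f} us, max rate = {rate / 1e6:.0f} Mbit/s")
```

Integer arithmetic is used on purpose so that the subcarrier count is exact.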
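The link budget of Equations 5 and 6 can be sketched in Python as follows. The function names and the example values (2.4 GHz, 100 m, 20 dBm sending power) are assumptions for illustration only and do not come from the paper; obstacle damping is passed as a list of per-obstacle losses in dB.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def rssi_dbm(p_t_dbm: float, g_t_db: float, g_r_db: float,
             frequency_hz: float, distance_m: float,
             obstacle_losses_db=()) -> float:
    """Received signal strength after Equation 5, in dBm.

    The free space term is negative for realistic distances, so it
    reduces the transmission power; obstacle losses are subtracted.
    """
    free_space = 20.0 * math.log10(
        SPEED_OF_LIGHT / (frequency_hz * 4.0 * math.pi * distance_m)
    )
    return p_t_dbm + g_t_db + g_r_db + free_space - sum(obstacle_losses_db)


def snr_db(rssi: float, noise_floor_dbm: float) -> float:
    """Signal to noise ratio after Equation 6, in dB."""
    return rssi - noise_floor_dbm


if __name__ == "__main__":
    rssi = rssi_dbm(20.0, 0.0, 0.0, frequency_hz=2.4e9, distance_m=100.0)
    print(f"RSSI = {rssi:.2f} dBm, SNR = {snr_db(rssi, -89.0):.2f} dB")
```

Doubling the distance lowers the RSSI by about 6 dB in this model, which is a quick sanity check for any implementation of the free space term.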
IV. PARALLEL SIMULATION OF SWARM BEHAVIOR

[Figure 3: scenario sketch with an agent and a convoy vehicle, annotated with trajectory data, obstacle data, mission data, movement data, object detection and sensor range]
Fig. 3. Example scenario to indicate message types

[...] indicates the four available message types. In general, all in[...] sends its movement data with another message. The robots themselves broadcast information about obstacles, which enter their sensor range, and transmit the planned trajectory to all [...] horizon of 32.5 s. When obstacles or an increasing noise floor, e.g. created by signal jammers, cause a connection loss, then a special cost function [10] leads the robot to reduce the distance to the other agents. Therefore, the last transmitted trajectory is used to predict the position and movement of those. But each agent is developed as an independent autonomous agent as well.

The simulation of swarms and dynamic obstacles is inherently concurrent because every robot forms an autonomous instance consisting of parallel hardware and all robots run in parallel. A robot is equipped with sensors for object detection, a communication module that can send and receive in parallel and a single board computer to process the collected data and to run the control algorithm. Thus, four independent threads per robot are created (see Fig. 4) for representing the reality. In order to analyze the real-time constraints, we measure the timings of these threads and check them for deadline misses. Thereby all threads have the same priority in the simulator. Thus, for this paper we present the data for the trajectory planning thread, which is the most computationally intensive thread with the smallest hard deadline of 300 ms among these four threads. The deadline is a result of the maximum robot velocity and its size.

In addition, the simulation of the surrounding environment is split into two independent threads. One thread updates connection qualities between the robots. It should be configured with a low update rate between 5 Hz and 20 Hz, because on the one hand slightly changed positions often do not cause changes in signal quality in outdoor scenarios and on the other hand network interface cards react relatively slowly to signal changes. The second thread calculates the rest of the physics like positions of convoy vehicles and robots based on their control values. As a result, it mostly represents actuators, which need a higher update rate of approximately 60 Hz to simulate smooth movement without glitching through obstructions. Furthermore, there exists a rendering thread with read-only access to all data structures. Hence, it has almost no influence on the simulation performance.

[Figure 4: diagram of the update, rendering, environment, convoy and robot threads (including OpenMP threads) and of the shared data structures: convoy list, static and dynamic obstacle lists, terrain, radiation, noise floor, missions, robot models, trajectory, and planning samples 1 to N, with sending and receiving paths]
Fig. 4. Thread and data structure of simulation

A. Object Detection
Each robot recognizes obstacles of the environment and receives missions. This information is internally transformed to cost functions and processed by the control algorithm in combination with the dynamic model resulting in the future trajectory of the robot. Afterwards, the generated trajectory is sent to other robots. Although the data flow seems to be simple, because the sensor just takes information from the environment, the parallelism leads to concurrent access on those environmental data. To avoid conflicts, the data structures allow just read-only access from robots, whereas the common environment update thread is allowed to modify the data. Because of the high update rate in comparison to the slower rate of the object detection and trajectory planning thread, the possible small inaccuracy is comparable to sensor noise. Consequently, no synchronization is needed here.

B. Data Transfer
The sending thread is triggered by flags which are set by the planning thread after a new trajectory was calculated or by the object detection after a new static object was detected. Additionally, it can be triggered to forward received packages. To provide a realistic data transfer, all data is copied by this thread and packed into a message object. Thus, read only
access to the data is granted and no synchronization is required because only the newest data has to be transmitted. Since data can only be transmitted by the sender thread, it is guaranteed that at most one upload to one station is performed as in a real system. Based on the message size and the transmission rate, the transfer time is calculated and causes the thread to sleep for the specified time. Afterwards, the receive callback thread of the receiver is called.

The flags themselves do not have to be protected by locks either, because an inconsistent flag leads to package duplication or loss. Both consequences are typical for wireless ad hoc networks and make the simulation more realistic. Furthermore, duplicates get caught by the message filter based on identification numbers, and package losses cause other robots to predict the state of the sender based on a previous trajectory, which is in general about 300 ms old if no consecutive losses occur, but covers 30 s. If messages containing obstacle information get lost, then the robot will detect this obstacle on its own if it is in its sensor range.

C. Process Received Information
In contrast to the sending thread, the receiving thread has to be aware of memory consistency. Therefore, it has at first a general lock which prevents receiving multiple packages at the same time, which is not possible in a real system either. This causes the sending thread to block after calling the receiving callback. This blocking is similar to the IEEE 802.11 Carrier Sense Multiple Access (CSMA) Media Access Control (MAC) protocol for wireless networks, where the sender waits for a random back-off time if the channel is currently used. However, the blocking time is short because of the small data packages. The largest package is the message that contains the trajectory. It includes 1472 B, where the header, consisting of an identification number, a timestamp, the size information and the message type, consumes 16 B. The other packages are much smaller, because the movement data of the convoy vehicle just contains the position, the velocity and its identification number, which are 28 B + 16 B = 44 B in total. The mission and obstacle messages including the header are only 37 B and 28 B in size, respectively.

As shown in Figure 4, the receiver processes the data to update the obstacle data, the missions and the robot models. A robot model is a data structure to manage the received information about other robots. It offers state and connection quality prediction based on the planned trajectory as well as information about the explored area through the robot that is linked to the model. As a result, these data structures have to be protected by locks to guarantee consistency of the data, whereas the temporal order is not that important, because all data get constantly updated.

D. Control And Planning
The computationally intensive trajectory planning thread to control robot behavior needs to access almost all data structures, because each entity in the environment, such as obstacles and missions, forms an independent cost function. These are used to evaluate discrete samples, which are created by the guided random walkers of C-PBP [8] (see Fig. 1). For a planning horizon of 32.5 s, which corresponds to 60 samples, and 24 parallel random walkers, 1440 function calls per entity have to be evaluated. Hence, it is necessary to plan the access to reduce the number of locks and synchronizations as much as possible. The position vector is read without any locks. This will possibly lead to inaccuracies if the vector gets updated during reading. The resulting inaccuracy is significantly smaller than inaccuracies produced by simultaneous localization and mapping algorithms used on real robots; therefore it can be tolerated. Processing of the obstacle data needs no synchronization because they are only accessed by read operations. Static objects in these lists do not get deleted because they never change, whereas the influence of dynamic obstacles decreases and they will be deactivated by a flag after some time in case they leave the sensor range. When the lists are too long, a synchronized purge for these can be applied, which guarantees the consistency of the lists. Nevertheless, the essential parts of robot models and the missions have to be protected by locks. Those parts are the prediction of states and connections based on current information. The models and missions are updated via the receiving thread and the update thread. Thereby the receiver data is more important because it contains the correct information from the senders. On the contrary, the update thread estimates those data based on older information. After this computation finishes, the trajectory data will be overwritten by this thread and the sending thread is activated by a flag afterwards.

It does not matter if the sending thread is scheduled too late because only the newest trajectory should be transmitted. The time stamp of each message is used to drop older trajectory information. Furthermore, if an inconsistency of the flag occurs because of parallel writing, it will cause a package loss. But for each pair of system state and control vector of the trajectory the corresponding time is transmitted as well. Thus, it is possible to predict the behavior of the agent for at least 32.5 s, which is equivalent to more than 100 consecutive package losses. The trajectory has exponentially distributed time samples. Hence, the beginning can be predicted very accurately, and the later the point in time of the trajectory, the more inaccurate the estimation is. This corresponds to the general observation that the older the message the more inaccurate the information is.

E. Timing Analysis And Performance Improvement With OpenMP®
The planning thread runs periodically with a relative deadline of 300 ms, but its computation time obviously increases with the number of robots in the swarm. Consequently, no fixed timing analysis is possible because the execution times of the threads depend quadratically on the swarm size. However, this statement only applies to the simulation because the number of objects to be managed per robot only increases linearly in a real system but simulation also has to handle the
increased number of robots. Equation 7 emphasizes that the simulation must also manage more robots.

\[ \mathrm{objects}_{\mathrm{managed}} = \sum_{\mathrm{robot}=1}^{R} \left( \sum_{\mathrm{model}=1}^{R} 1 + \sum_{\mathrm{object}=1}^{E} 1 + \sum_{\mathrm{mission}=1}^{M} 1 \right) \in O(R^2 + R \cdot E), \quad \text{with } M \ll R \le E \tag{7} \]

Let $R$ denote the number of robots, $E$ the number of objects in the environment without the robots and $M$ the number of missions, which is currently at most 3.

In general, the simulation generates many more threads than available CPU cores even on machines with a large number of cores like compute servers.

\[ \#_t = \#_{\mathrm{env}} + \#_{\mathrm{con}} + \sum_{\mathrm{robot}=1}^{R} \#_{\mathrm{robot}} = 1 + 1 + 4 \cdot R \tag{8} \]

Equation 8 shows that the number of threads $\#_t$ exceeds the number of cores of standard PCs even with 3 robots. In this equation $\#_{\mathrm{env}}$ denotes the number of threads for the update of the environment and $\#_{\mathrm{con}}$ the number of threads for the connection update. $\#_{\mathrm{robot}}$ corresponds to the number of threads per robot, as presented in Figure 4. To deal with different swarm sizes and timings which can be configured by the user, these threads are split into smaller workloads. Therefore, the scheduler can utilize each core as well as possible to prevent deadline misses. The separation into smaller workloads is done with OpenMP® and is indicated by the orange rectangles with rounded corners in Figure 4.

During development, care was taken to ensure that each data item of a data structure resides in consecutive memory blocks, but the structures themselves are in different regions of memory. Hence, processing of data can be parallelized by subjects on different cores. On the contrary, the data structure access should not be parallelized to avoid cache conflicts.

The control software is designed to run on single board computers on real robots, which are in general multi-core systems. The trajectory planning thread handles 24 independent random walkers for 60 time steps. So, each of the 24 samples can be processed on different cores as well as each data structure (see Figure 4). The benefit of parallel updates and processing of data is that ideally the planning thread works with the newest information and that the information of a data structure corresponds to the same time instance. Thus, they are not changed during planning nor do there exist different time levels inside the structures (see Figure 5).

[Figure 5: timelines over one time period for two cores, comparing subject based parallel processing with standard parallel processing of the obstacles, mission, models and planning workloads]
Fig. 5. Advantages and disadvantages of subject based parallelization

The disadvantage of these fine granular threads is increasing computation time because of context switches, which are marked in black in Figure 5. Orange, yellow and green illustrate parts of the update thread for different subjects and blue indicates the trajectory planning thread. Equation 9 underlines the increasing number of threads because of this technique.

\[ \#_{\mathrm{env}} = 4 + \#_{\mathrm{omp}}, \quad \#_{\mathrm{con}} = 1, \quad \#_{\mathrm{robot}} = 3 + \#_{\mathrm{omp}} \cdot 5 \tag{9} \]

\[ \rightarrow \frac{52}{3} \le \frac{4 + 16 + 1 + (3 + 16 \cdot 5) \cdot R}{1 + 1 + 4 \cdot R} < \frac{83}{4} \tag{10} \]

Let $\#_{\mathrm{omp}}$ denote the number of threads that are created by OpenMP®, which is per default the number of virtual cores. The loop to update the robot models and the loop for the samples are unrolled into this number of threads (see Figure 4). Therefore, the latter causes an approximately $\frac{5}{4} \#_{\mathrm{omp}}$-fold increase of threads. For a modern octa-core CPU with 16 virtual cores and $R = 60$ simulated robots, the fine granular parallelization produces around $\frac{5001}{242} \approx 20.67$ times more threads (see Equation 10).

V. ANALYSIS
In the following we illustrate how to create missions and analyze the usability of the simulation environment. Furthermore, the performance of the previously presented parallelization techniques for simulations is evaluated on a real system.

A. MANET
The MANET cost function should create a redundant reliable network with the robots as base stations that covers as much area as possible. Hence, as long as the connection quality is constant, a large distance between the robots should be achieved to cover a larger area. If the quality decreases it has to be decided whether a larger distance or a higher transmission rate is desired, whereas if the connection breaks down, the distance is not a positive aspect anymore and has to be reduced. Only two reasons for a connection loss exist; either the distance or a change in the environment, which cannot always be undone by the robot itself. Therefore, the only possibility to reestablish the connection is distance reduction. Equation 11 combines these aspects.

\[ \mathrm{costs}(d, q) = \begin{cases} -(d \cdot f_d + q \cdot f_i) & q > 0 \\ d & \text{else} \end{cases} \tag{11} \]

Let $d$ be the distance and $q \in [0, 1]$ the signal quality. $f_i$ and $f_d$ denote weights to prefer better connection quality or larger covered area. In general, negative costs indicate desired and positive costs repulsive areas for the optimizer.

The evaluation of this function does not lead to a circular or star-shaped arrangement, which corresponds to a high and slightly redundant network coverage of an area. The experiments present approximately equilateral triangles (see Figure 6). It can be explained by optimal characteristics of equilateral triangles with regard to maximizing distances (see Figure 7). Another idea of always maximizing the minimum distance is not appropriate as well because the connection with the minimum distance does not have to be the optimal connection due to obstacles and sources of interference,
Fig. 8. Ring topology with inner and outer circle consisting of 30 robots [10]
Fig. 9. Combined ring and star topology with 30 robots [10]

B. How Much Parallelization Is Needed?
Figure 10 shows the comparison between fine granular and coarse thread level parallelization for a simulation of an empty environment with different swarm sizes. The results are tested at a significance level of α = 0.01 to prove that the context switches and thread handling cost more performance than the parallel calculation achieves. This holds even for complex environments with all different types of terrain accessibility, noise floors with two signal jammers at different positions, 20 static and 20 dynamic obstacles in vicinity and 22 static and 20 dynamic obstacles for the connections, which are important for the prediction. Moreover, the exploration mission was [...]

[Plot: performance comparison with and without OpenMP; trajectory calculation time in ms over the number of simulated robots (3, 6, 12, 18, 24, 30, 45, 60)]
Fig. 11. Performance comparison for complex environments. Numbers with stars indicate that OpenMP® was used additionally.

In both cases the deadline is never missed, but the distance of the means between Figure 10 and 11 decreases almost linearly with a slope of −0.2 with the number of robots, which proves the performance increase of the parallel processing. However, it is not sufficient for this simulation. In contrast to this, reducing the parallelism to the number of cores with consecutive access to the data structure is not possible either. If
we reduce the number of threads to the number of cores, then some threads with different deadlines would be combined. So, the computation time would increase by reducing the deadline to the smallest at the same time or active polling has to be used, which wastes CPU cycles as well. In addition to that, the realism of the hardware abstraction and the portability to real systems would be lost. Hence, the OpenMP® parallelization should be used only on real robots, which do not have to handle so many threads. For real robots only the planning and the object detection thread remain because the rest is done by hardware or exists in reality and does not need to be simulated.

VI. CONCLUSIONS
We present a novel approach to control large autonomous robot swarms. For this purpose, a real-time simulation tool to evaluate control algorithms for autonomous robots in a swarm is explained in detail. It is described how potential functions can be used to create different behavior and missions for the entire swarm without directly assigning tasks to specific agents. To achieve emergent behavior which is based on limited local knowledge, realistic communication conditions are necessary. Thus, we outline an integrated model of wireless communication channels which is used inside the simulation to form a realistic environment. Afterwards, techniques to improve calculation times of the computationally intensive parts of the simulation are explained in detail. As a result, efficient data structures are introduced and we discuss their access strategies from different threads. A sophisticated implementation reduces the number of synchronization steps and increases performance. After that, all parallelization possibilities are pointed out and analyzed. As a result, the fine granular parallelization with OpenMP® has a measurable performance increase, but the overhead caused by the context switches outweighs it. That is why this fine grained parallelization cannot be recommended for the simulation. Contrary to this observation, fine grained parallelism is valuable for implementation of the control algorithms on real robots, because significantly fewer threads have to be managed on a robot than in the simulation. So there are usually cores available for additional threads.

In the next step, the control software will be evaluated on real robots. By using emulation, the simulation environment and the real robots are fused. This offers the ability to test real robots as agents of a large swarm without having the complete swarm available; instead, most of the robots are still virtual, only some are physically available to perform their tasks. Additionally, the behavior in different complex scenarios can be evaluated without building them in reality. This becomes necessary in extreme environmental conditions after disasters which can hardly be reconstructed in a test bed. On the other hand, the simulation is tested on real data and can monitor real robots.

In further research we will expand the available missions by logistics and formation missions as well as rendezvous missions with robots and objects. In addition to that, the control theory will be improved to support hard and soft constraints to give guarantees and [...]

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 276879186/GRK2193 [Gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – 276879186/GRK2193]

REFERENCES
[1] L. E. Parker, “Distributed intelligence: overview of the field and its application in multi-robot systems,” Journal of Physical Agents (JoPha), vol. 2, no. 1, pp. 5–14, 2008.
[2] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an open-source multi-robot simulator,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, 2004, pp. 2149–2154.
[3] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Ng, “Ros: an open-source robot operating system,” ICRA Workshop on Open Source Software, vol. 3, 2009.
[4] T. Chung, “Darpa subterranean (subt) challenge,” https://www.darpa.mil/program/darpa-subterranean-challenge, 2017. [Online]. Available: https://www.darpa.mil/program/darpa-subterranean-challenge
[5] N. Palmieri, X.-S. Yang, F. D. Rango, and A. F. Santamaria, “Self-adaptive decision-making mechanisms to balance the execution of multiple tasks for a multi-robots team,” Neurocomputing, vol. 306, pp. 17–36, 2018.
[6] N. Palmieri, X.-S. Yang, F. D. Rango, and S. Marano, “Comparison of bio-inspired algorithms applied to the coordination of mobile robots considering the energy consumption,” Neural Computing and Applications, vol. 31, no. 1, pp. 263–286, 2019.
[7] L. Grüne and J. Pannek, Nonlinear Model Predictive Control: Theory and Algorithms, 2nd ed., ser. SpringerLink Bücher. Cham: Springer, 2017.
[8] P. Hämäläinen, J. Rajamäki, and C. K. Liu, “Online control of simulated humanoids using particle belief propagation,” in Proc. SIGGRAPH ’15. New York, NY, USA: ACM, 2015.
[9] T. Laue, “Eine verhaltenssteuerung für autonome mobile roboter auf der basis von potentialfeldern,” Diplomarbeit, Universität Bremen, Bremen, 5. Januar 2004.
[10] A. Puzicha, “Modeling and analysis of a distributed non-linear model-predictive control for swarms of autonomous robots with limited communication skills (in german),” Masterarbeit, TU Dortmund, Dortmund, 2019.
[11] J. H. Reif and H. Wang, “Social potential fields: A distributed behavioral control for autonomous robots,” Robotics and Autonomous Systems, vol. 27, no. 3, pp. 171–194, 1999.
[12] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978. [Online]. Available: http://doi.acm.org/10.1145/359545.359563
[13] LTE-Anbieter.info, “Maximale datenrate der luftschnittstelle bei lte: Wie errechnet sich diese eigentlich?” LTE-Anbieter.info, 2019. [Online]. Available: https://www.lte-anbieter.info/technik/datenrate-luftschnittstelle.php
[14] Ernst Ahlers, “Funk-übersicht: Wlan-wissen für gerätewahl und fehlerbeseitigung,” c’t, vol. 2015, no. 15, pp. 178–181, 2015. [Online]. Available: https://www.heise.de/ct/ausgabe/2015-15-WLAN-Wissen-fuer-Geraetewahl-und-Fehlerbeseitigung-2717917.html
[15] H. T. Friis, “A note on a simple transmission formula,” Proceedings of the IRE, vol. 34, no. 5, pp. 254–256, 1946.
[16] W. Heywang and R. Müller, Rauschen. Berlin, Heidelberg: Springer Berlin Heidelberg, 1990, vol. 15.
[17] W. J. Fokkink, Distributed algorithms: An intuitive approach, second edition ed. Cambridge, Massachusetts and London, England: The MIT Press, 2018.
Abstract—Embedded systems, e.g., self-driving systems and advanced driver-assistance systems (ADAS), require computing platforms with high computing power and low power consumption. Multi-/many-core platforms satisfy these requirements effectively. However, for hard real-time applications, multiple demands on shared resources can impede real-time performance, and memory is one resource that can impair the desired performance significantly. Therefore, it is important that memory access timing be deterministic to facilitate predictability. To realize this, the Logical Execution Time (LET) paradigm is currently attracting attention. This paper proposes a theoretical scheduling method for a model applying the LET paradigm to directed acyclic graph (DAG) nodes for a multi-/many-core platform. The proposed method considers communication timing between nodes and generates a schedule that does not cause communication contentions. In addition, the proposed method attempts to distribute tasks and reduce LET intervals to address increased execution times due to the implementation of the LET paradigm. In the evaluation, we observed that the proposed method improved the schedule length by up to 40%.

Index Terms—multi-rate DAG, multi-/many-core, communication contention, list-scheduling, logical execution time

I. INTRODUCTION
Embedded systems, such as self-driving systems (e.g., Autoware [1]) and advanced driver-assistance systems (ADAS), require high computing capacity and low power consumption. Multi-/many-core hardware for embedded systems (e.g., Kalray MPPA [2] and Tilera TILE-Gx [3]) satisfies these demands and is the focus of active study. Multi-/many-core hardware for embedded systems is suitable for parallel task processing and large-scale computations. In addition, clustered many-core systems, such as the Kalray MPPA processor, provide highly scalable and isolated areas of computation.

In a self-driving system, multiple applications run simultaneously, and each application has a deadline. Different applications, e.g., automatic brake and collision warning systems, utilize various types of sensor data and execute multiple processes. Such applications must execute (from sensor data acquisition to termination) before the deadline. The procurement of a directed acyclic graph (DAG) with an end-to-end deadline to achieve parallel distributed processing can be considered as a real-time application of automotive systems. In this paper, we propose a static DAG scheduling method to satisfy deadlines by accelerating processing using multi-/many-core hardware. Multi-/many-core hardware can speed up task processing; however, such processors have problems in terms of predictability and temporal determinism.

In clustered many-core systems, resource contentions, e.g., contentions induced by shared memory and cache, prohibit the system from satisfying real-time requirements. Therefore, avoiding contentions and accurately calculating delay caused by contentions are required. We address this issue using the Logical Execution Time (LET) model [4] [5]. The LET model is to perform communication at fixed timing determined by the LET section. A task to which the LET model is applied accesses memory at the same timing in each period. We allow concurrent executions on multiple cores and coordinate access to shared memory using a time-triggered schedule. Since access timing is adjusted to avoid overlap, communication time and task execution time are always constant. However, when adopting the LET model, the task execution time is set to be greater than the actual time; thus, the time from sensor data acquisition to application execution, i.e., end-to-end latency, may increase. Furthermore, there are multiple different periodic tasks in an automotive application. To address these issues, it is necessary to schedule jobs to occur in a hyperperiod in consideration of job dependencies and communication timing. In addition, we provide application modeling that combines the LET paradigm and the DAG. We propose a method based on a list scheduling mechanism that supports distributed processing and multi-rate periods.

Contributions: Our primary contributions are summarized as follows.
• We propose a theoretical method for parallel and distributed processing of LET tasks using multi-/many-core processors while avoiding communication contentions.
• We propose DAG scheduling, which executes distributed processing and reduces idle time in the overestimated LET interval to reduce increased execution time by implementing the LET paradigm.
• We reconstruct the job dependency generated from the task in the hyperperiod to make it compatible with applications with multi-rate periods.

The remainder of this paper is organized as follows. The system model and motivation of this study are described in Section II. Section III discusses assumptions about the scheduling problem. A scheduling approach is provided in
Section IV, and Section V proposes methods to reduce execu-
978-1-7281-7343-6/20/$31.00 ©2020 IEEE tion time to eliminate deadline-miss. Experimental methods,
49
2020 IEEE/ACM 24 ͭ ͪ International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
''5 ,2FRUHV
3&,H
3&,H
*%
1R&LQWHUIDFH
time
execution
Instruction Instruction Instruction Instruction
(WKHUQHW
(DVW,26XEV\VWHP,26
Compute Compute Compute Compute
Cluster Cluster Cluster Cluster
IO SMEM (SRAM: 2MB) (CC0) (CC1) (CC2) (CC3)
1R&LQWHUIDFH
Fig. 1. Task execution using the LET model.
,2FRUHV
Compute Compute Compute Compute
:HVW,26XEV\VWHP,26
Cluster Cluster Cluster Cluster
(CC4) (CC5) (CC6) (CC7)
Resource Manager (RM) Debug Support Unit
1R&LQWHUIDFH
IC DC (DSU)
,2FRUHV
results, and considerations are presented in Section VI, and Compute Compute Compute Compute
60(065$00%
Cluster Cluster Cluster Cluster
PE 0 PE 1 PE 2 PE 3 (CC8) (CC9) (CC10) (CC11)
IC DC IC DC IC DC IC DC
(WKHUQHW
related work is discussed in Section VII. Finally, conclusions PE 4
IC DC
PE 5
IC DC
PE 6
IC DC
PE 7
IC DC
Compute
Cluster
(CC12)
Compute
Cluster
(CC13)
Compute
Cluster
(CC14)
Compute
Cluster
(CC15)
PE 8 PE 9 PE 10 PE 11
and directions for future work are presented in Section VIII. IC DC IC DC IC DC IC DC
1R&LQWHUIDFH
3&,H
''5
PE 12 PE 13 PE 14 PE 15 3&,H
,2FRUHV *%
IC DC IC DC IC DC IC DC
6RXWK,26XEV\VWHP,26
II. S YSTEM MODEL
NoC Interface NoC
DMA Rx Tx Micro Core (UC) Router
,2&OXVWHU,2&
&RPSXWH&OXVWHU&&
[Fig. 3. Memory bank privatization. Recovered labels: 16 cores in one CC; 2MB SRAM (128KB × 16 = 2MB); one bank is assigned to each core; global bank (shared memory bank); do not use this core for computation.]
[Fig. 4. Motivation of this study. Recovered panels: (c) LET task scheduling; (d) Reduction of the latency. Nodes n1 to n4 are placed on Core 1 and Core 2; in panel (d), n4 is split into n4.1 and n4.2.]
Fig. 4(c) shows a task schedule image using the LET model. Execution and communication processing are scheduled in each LET task, where communication processing is scheduled so that it does not overlap with the communication processing of other LET tasks. Applying the LET model to tasks eliminates the need to consider communication delay; therefore, application execution timing is expected to be constant. The method without idle time is referred to as read-execute-write semantics, which is frequently used for scheduling methods that consider contentions. Compared to this method, the LET model is very flexible because the execution time of each process may change during development due to the addition of functions. In read-execute-write semantics, if functions are added, in the worst case, we may need to change the overall schedule (i.e., core allocation and execution order). However, the extra idle time in the LET model eliminates the need for such schedule changes, thereby improving development efficiency.
While the LET model has these advantages, it also has disadvantages. The time taken by the chain of tasks from sensor data acquisition (i.e., the entry node) to application execution (i.e., the exit node) is referred to as end-to-end latency. The LET section is longer than the task execution time; therefore, when the LET model is adopted, end-to-end latency can be longer than when other models are adopted (Fig. 4(c)). As a result, application execution is delayed, and a deadline-miss may occur in the worst case.
We address this issue by using a multi-/many-core processor to distribute LET tasks and by setting the idle time in the LET section appropriately. Fig. 4(d) shows the results of applying these techniques to Fig. 4(c). For tasks n_2 and n_3, the idle time in the LET section is reduced. However, the longer the idle time in the LET section, the greater the merit of introducing LET (but latency will increase). Thus, our primary goal is to ensure that applications do not miss deadlines, and the LET section is reduced only gradually. This makes it possible to reduce deadline-misses while leaving as much idle time as possible in the LET section. For task n_4, the execution time is reduced by distributing processing across two cores.

III. SCHEDULING ASSUMPTION
This section describes the scheduling assumptions, the settings, and constraints for the target problem.

A. Application Notation
In the following, we describe the proposed scheduling method using the DAG example shown in Fig. 5. Note that all parameters are expressed in unit time. Here, DAG G comprises nodes V(G) and edges E(G). The computation time of node n_i ∈ V(G) is expressed as comp(n_i). This computation time represents the measured execution time of the task when no contention occurs with other tasks in a single core. This value is assumed to be acquired offline. Data are communicated between the nodes. The edges entering a node indicate the amount of data to be read by the node, and the edges exiting the node indicate the amount of data to be written. The required communication time of node n_i, i.e., comm(n_i), for the entire node is calculated as follows.
\[ comm(n_i) = \left( \frac{data_i^r}{D_{slot}} \times T_{slot} \right) + \left( \frac{data_i^w}{D_{slot}} \times T_{slot} \right) \tag{1} \]
Here, data_i^r and data_i^w represent the amount of data to be read and written, respectively, and T_slot is the maximum allocation time determined by a round robin policy [9]. D_slot is the amount of data that can be transmitted during T_slot. The communication time for edge e_{i,j} ∈ E(G), i.e., comm(n_i, n_j), is calculated as follows.
\[ comm(n_i, n_j) = \frac{data(e_{i,j})}{D_{slot}} \times T_{slot} \tag{2} \]
Here, e_{i,j} represents an edge connecting node n_i to node n_j, comm(n_i) is used to set the LET section and determine the communication time of the LET, and comm(n_i, n_j) is used to assign task priority.
The LET section includes both communication and execution processing; thus, the Worst-Case Execution Time (WCET) of node n_i can be calculated as follows.
\[ WCET(n_i) = comp(n_i) + comm(n_i) \tag{3} \]
The WCET value is used to set the length of the LET section for the task (Section III-B).
The blue nodes (n_2 in the figure) can be computed in parallel. The need for parallel computation must be confirmed by offline analysis. Hereafter, we label a node that requires parallel computation as a parallel node. In order to reduce the length of the schedule, a parallel node is computed with multiple cores in Section V-B.
The entrance node of the DAG must correspond to the period of the sensor data to be acquired; therefore, we set the period at the entrance node. Nodes other than entrance nodes
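Equations (1) to (3) above can be checked numerically with a few lines of code. The following is a minimal sketch, not the authors' implementation: the `Node` class and the function names are hypothetical, and the default values T_slot = 3 and D_slot = 10 are the ones reported in the evaluation settings of this paper.

```python
from dataclasses import dataclass

T_SLOT = 3    # time units per round-robin slot (value used in the evaluation)
D_SLOT = 10   # data units transferable during one slot (value used in the evaluation)


@dataclass
class Node:
    """Hypothetical container for the per-node quantities of Section III-A."""
    comp: float        # computation time measured offline, without contention
    data_read: float   # total data on the edges entering the node
    data_write: float  # total data on the edges exiting the node


def comm_node(node: Node, t_slot: float = T_SLOT, d_slot: float = D_SLOT) -> float:
    """Eq. (1): communication time of a node = read time + write time."""
    read_time = node.data_read / d_slot * t_slot
    write_time = node.data_write / d_slot * t_slot
    return read_time + write_time


def comm_edge(data: float, t_slot: float = T_SLOT, d_slot: float = D_SLOT) -> float:
    """Eq. (2): communication time of a single edge carrying `data` units."""
    return data / d_slot * t_slot


def wcet(node: Node) -> float:
    """Eq. (3): WCET = computation time + communication time."""
    return node.comp + comm_node(node)


example = Node(comp=100, data_read=20, data_write=10)
print(comm_node(example), comm_edge(20), wcet(example))
```

For the example node, reading 20 data units takes two slots and writing 10 units takes one slot, so the communication time is three slots of three time units each, which is then added to the computation time to obtain the WCET.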
of the task obtained by the offline analysis is used without modification. Here, read time r(n_{i,j}) and write time w(n_{i,j}) can be calculated based on the amount of data required for reading and writing.

By considering job priorities, laxity makes it possible to assign priorities in consideration of application deadlines. In addition, the calculation is performed in order from the end node; thus, job dependencies can be protected.
(offloaded) task must be executed using the determined cores and the number of cores.
The pseudocode of the proposed algorithm is given as Algorithm 2. The algorithm keeps the result of parallelization with 1-16 cores (lines 4-16) and adopts the number of cores that yields the best schedule (line 17). If a deadline-miss still occurs after offloading has been performed, the idle time in the LET section is reduced gradually (lines 18-26). Fig. 7 shows the results of scheduling the DAG in Fig. 5. The memory access phases in all LET tasks are adjusted so that they do not overlap. In addition, the proposed algorithm determines that offloading the parallel nodes works best when they are distributed across four cores. As can be seen from the figure, multiple jobs generated from the same task are executed using the same core and the same number of cores.

[Fig. 7. Schedule result of sample DAG. Per-core timelines (core 0 to core 4) over one hyperperiod; legend: task before offloading, offloading task, task after offloading.]

VI. EVALUATION
A. Simulation Method
As input data, we used applications from the StreamIt benchmark suite modeled as fork-join graphs [9] [10]. Furthermore, we used task graphs generated by Task Graphs For Free (TGFF) [11]. TGFF can generate random DAGs for various parameters, such as the number of tasks, the number of entry nodes, and the maximum in-degree and out-degree. We set the experimental parameters as shown in Table I.

TABLE I
TASK GRAPH CHARACTERISTICS GENERATED BY TGFF.
Parameter             Value
#Task-graphs          100
#Tasks                <40, 70, 100> (min, avg, max)
WCET                  [1;2000]
Exchanged data        [1;100]
Maximum in-degree     3
Maximum out-degree    3
#Entry nodes          <2, 5, 8>

The evaluation was performed using a single cluster of the Kalray MPPA, i.e., 16 cores. We also assume that one node in the original DAG, the one with the longest computation time, is a parallel node and can be offloaded. Throughout, we set margins as follows: α = 1.0, δ = 0.2, K = 0.7, T_slot = 3, and D_slot = 10.

B. Evaluation Results for Benchmark Application
The StreamIt benchmarks are fork-join graphs; that is, each DAG has exactly one entry node and one exit node. Therefore, they were used to evaluate how much the schedule length (i.e., makespan) could be reduced by the proposed method. As the LET margin β increases, the makespan increases, as shown in Fig. 8. In Fig. 8, DCT verify is scheduled while changing LET margin β, and the obtained makespan is shown. Here, we simply schedule LET tasks to avoid contention, and do not reduce the LET interval or offload tasks. We call such a schedule the normal schedule. Here, the deadline margin γ was set to 0.2. As can be seen, if the LET interval is too large, the set deadline may not be satisfied. Therefore, we set a LET section and a deadline, and we investigated satisfying the deadline while reducing the LET section as little as possible.

[Fig. 8. Impact of LET margin on schedule. Horizontal axis: LET margin β (0 to 0.5); vertical axis: time unit (0 to 120000); series: makespan, deadline.]

In Fig. 9, the amount by which the proposed methods (LET reduction, offloading, and both) shorten the schedule relative to the normal schedule is shown as gain. Here, we set margins as follows: β = 0.5, γ = 0.2. Above the bar of "LET Reduction", the LET margin β at which the deadline-miss disappears is written. Above the bar of "Offloading", the number of cores used for task offloading is shown. Finally, above the bar of "Offloading + LET Reduction", both the LET margin β and the number of cores are shown.
As the margin of the deadline becomes smaller, the deadline also becomes smaller. Therefore, it is difficult to satisfy the deadline using a normal scheduling method. However, when the deadline setting is severe (i.e., deadline margin γ is smaller), much idle time of the LET is reduced. This reduces the benefits of the LET model. With the offloading method, the schedule is reduced by distributing parallel nodes across multiple cores without reducing LET sections. Here, we offloaded the node with the longest computation time in the DAG. In addition, when multiple parallel nodes are present, the schedule can be reduced further. However, the amount of schedule reduction depends on the position of the offloaded node in the graph, and for some graphs offloading has no effect (e.g., Beamformer and DCT comp). The number of cores used for distributed processing varied depending on the graph and ranged from 1 to 13. On the other hand, the proposed method (i.e., offloading + LET reduction) reduces the LET section until the deadline is satisfied after task offloading. As a result, a schedule that satisfies the deadline can be generated without excessively reducing the LET section. If there are no tasks with greater WCET, the offloading method may not work well; therefore, it is preferable to select an appropriate scheduling method so as to protect the deadline according to the application. For all benchmarks, the algorithmic resolution time was less than 20 minutes.
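The interplay of the two techniques evaluated above (first pick the core count for the offloaded node, then shrink the LET margin only as far as needed) can be sketched as a small search loop. This is an illustrative sketch under stated assumptions, not Algorithm 2 itself: the scheduler is abstracted as a hypothetical callable `makespan(cores, beta)`, the step size of 0.1 for β is assumed, and `toy_makespan` is an invented stand-in used only to make the example runnable.

```python
from typing import Callable, NamedTuple


class Choice(NamedTuple):
    cores: int          # cores used for the offloaded parallel node
    beta: float         # LET margin finally applied
    makespan: float     # resulting schedule length
    deadline_met: bool  # False if the deadline is missed even with beta = 0


def choose_configuration(
    makespan: Callable[[int, float], float],  # hypothetical scheduler interface
    deadline: float,
    max_cores: int = 16,      # one compute cluster
    beta_start: float = 0.5,
    beta_step: float = 0.1,   # assumed step size
) -> Choice:
    # Offloading: try every core count and keep the one with the shortest schedule.
    cores = min(range(1, max_cores + 1), key=lambda c: makespan(c, beta_start))

    # LET reduction: shrink the margin step by step until the deadline holds.
    steps = round(beta_start / beta_step)
    beta, length = beta_start, makespan(cores, beta_start)
    for k in range(steps + 1):
        beta = round(beta_start - k * beta_step, 10)  # avoid float drift
        length = makespan(cores, beta)
        if length <= deadline:
            return Choice(cores, beta, length, True)
    return Choice(cores, beta, length, False)


def toy_makespan(cores: int, beta: float) -> float:
    """Invented model: 600 serial units plus 400 parallel units that stop
    scaling beyond 4 cores, stretched by the LET margin beta."""
    return (1 + beta) * (600 + 400 / min(cores, 4))


result = choose_configuration(toy_makespan, deadline=900)
print(result)
```

With this toy model the search settles on 4 cores and β = 0.2, which mirrors the design goal stated in the text: the deadline is met while as much idle time as possible is left in the LET section.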
[Fig. 9. Schedule reduction by proposed method. Vertical axis: Gain (%). Bar groups: LET Reduction (annotated with the LET margin), Offloading (annotated with the number of cores used for offloading), Offloading + LET Reduction (annotated with the LET margin and the number of cores used for offloading).]
[Fig. 10. Impact of deadline margin γ. Horizontal axis: deadline margin γ (0.1 to 0.5); vertical axis: deadline-miss ratio (0 to 1); series: normal, proposed method (offloading + LET Reduction).]

C. Evaluation Results for Multi-rate DAGs
Next, we evaluate task graphs created by TGFF. We evaluate the performance of the proposed method using the average value of the schedule results for each task graph. Unlike the benchmark applications, these are evaluations for multi-rate DAGs that have multiple periods. We observe changes in the deadline-miss ratio due to changes in each margin. The deadline-miss ratio is the proportion of end nodes in each DAG that could not meet their deadline.
The fluctuation of the deadline-miss ratio when the deadline margin γ is changed is shown in Fig. 10. Here, we set the LET margin to β = 0.5. In addition, the tendency can be observed by fixing the number of period values to two. The smaller the deadline margin, the more severe the deadline. Furthermore, the created graphs have many end nodes after jobs are created. Therefore, a large number of deadline-misses will occur in a normal scheduling method that simply prevents memory contention. On the other hand, the proposed method, which offloads parallel nodes and reduces the idle time of the LET section, can significantly reduce the deadline-miss ratio. This method, which avoids contentions and preserves as much of the LET section as possible, is considered to be very effective in hard real-time application development. Margins α and δ had very little effect on the entire schedule, and no tendency in the deadline-miss ratio could be observed when these margins were changed.

VII. RELATED WORK
Task scheduling using multi-/many-core platforms is generally considered an NP-hard problem. Thus, heuristic scheduling algorithms have been studied extensively, and most of such algorithms are based on list scheduling. For example, Igarashi et al. [12] proposed a list scheduling based heuristic technique for the Kalray MPPA2-256 Bostan. They performed scheduling for parallel computations according to the amount of task computation, and they successfully reduced the makespan over the existing method. Rouxel et al. [9] [13] introduced heuristic contention-aware scheduling strategies that generate a time-triggered schedule for application tasks.
In addition, when using a multi-/many-core processor, contentions for shared resources can hinder real-time performance, which is a significant problem. Many studies have improved the predictability of access timing to shared memory by dividing a task into an execution phase and multiple communication phases (e.g., PREM (PRedictable Execution Model) [14] [15] and read-execute-write semantics [6] [8] [9]). As a result, it is possible to avoid contention, accurately estimate delay, and increase application execution speed. From a hardware perspective focusing on NoC communication, a division strategy that reduces contention has been proposed previously [16]. Perret et al. [17] provided an execution model that limits the behavior of the application on the platform in order to perform temporally isolated partition mapping on the Kalray MPPA2-256 Bostan.
The LET paradigm is attracting significant attention as a way to consider contention on multi-/many-core processors. The LET paradigm was originally proposed within a programming language for embedded systems [4]. Recently, to bring determinism to a real-time system, research into an autonomous driving system has been conducted [5].
Note that the number of tasks that use memory increases when adopting the LET model; therefore, memory usage is expected to increase. To address this problem, a memory usage reduction method using double buffering has been proposed previously [18] [19]. Biondi et al. [20] described how to apply the LET model to the AUTOSAR model for implementation in actual automotive systems using multi-core platforms.
In an automobile engine management system, end-to-end latency, including execution and communication of multiple tasks, must be constant. To address this, Jorge et al. [21] employed offset allocation to reduce output jitter when using
Maryan Rab∗ , Romolo Marotta† , Mauro Ianni† , Alessandro Pellegrini† , Francesco Quaglia∗†
∗ University of Rome “Tor Vergata”, Italy
Email: maryan.rab@gmail.com, francesco.quaglia@uniroma2.it
† Lockless S.r.l., Rome, Italy
Email: {marotta, ianni, pellegrini}@lockless.it
Abstract: Modern computing platforms are based on multi-processor/multi-core technology. This allows running applications with a high degree of hardware parallelism. However, medium-to-high end machines pose a problem related to the asymmetric delays threads experience when accessing shared data. Specifically, Non-Uniform-Memory-Access (NUMA) is the dominating technology, thanks to its capability for scaled-up memory bandwidth; however, it imposes asymmetric distances between CPU-cores and memory banks, so that an access by a thread to data placed on a far NUMA node severely impacts performance. In this article, we tackle this problem in the context of shared event-pool management, a relevant aspect in many fields, like parallel discrete event simulation. Specifically, we present a NUMA-aware calendar queue, which also has the advantage of making concurrent threads coordinate via a non-blocking scalable approach. Our proposal is based on work deferring combined with dynamic re-binding of the calendar queue operations (insertions/extractions) to the best suited among the concurrent threads hosted by the underlying computing platform. This changes the locality of the operations by threads in a way positively reflected onto NUMA tasks at the hardware level. We report the results of an experimental study, demonstrating the capability of our solution to achieve on the order of 15% better performance compared to state-of-the-art solutions already suited for multi-core environments.
Index Terms: NUMA, calendar queue, non-blocking data structures

1. Introduction
The current trend in computing architectures is characterized by an ever-increasing core-level parallelism. This is motivated by the need for scaled-up computing capabilities in face of the physical limits imposed by transistor technology [1], [2], [3]. This trend has made concurrent and parallel programming paradigms mandatory for current and next-generation applications. Furthermore, it has brought non-blocking thread coordination [4] to assume a central role in the design and implementation of modern concurrent applications. Incidentally, this type of coordination, which avoids critical sections and simply exploits atomic Read-Modify-Write (RMW) instructions offered by the ISA to let concurrent threads gather information on whether conflicting accesses to shared data structures have occurred, also provides resilience to performance degradation in CPU-stealing contexts, like Cloud computing.
However, modern hardware platforms are also characterized by asymmetries, which also play a role in the actual performance deliverable by parallel/concurrent applications. One of the most important asymmetries is the so-called Non-Uniform-Memory-Access (NUMA). It is based on having memory banks organized in a configuration where each processor has some close bank(s), which form the local NUMA node, and farther ones, which form the far NUMA nodes. Consequently, the need for accessing data from far NUMA nodes induces higher latency and traffic on the memory-interconnection hardware components. In these architectures, locality in the accesses not only plays a role for cache exploitation, but also for RAM exploitation, since accesses to far RAM banks should be avoided as much as possible.
The challenges posed by multi-processor/multi-core NUMA platforms have long been addressed in the literature. In fact, most Operating System (OS) implementations offer APIs to directly control the placement of logical pages in RAM, or to dynamically migrate them across NUMA nodes if required. Also, OS-level solutions have the capability to migrate threads, and the data they are currently touching, to favor accesses to the nearest (local) NUMA node of a given CPU core.
However, OS-level solutions only provide mid/long term binding between threads/data and NUMA nodes. Furthermore, the concept at the base of these solutions is to pack threads and their hot data on the same NUMA node, which is not adequate for the case of very large thread counts (and CPU-bound threads) that share very large amounts of logical memory, possibly performing frequent fine-grain operations on it. This is the case of last-generation parallel simulation platforms, especially those based on speculative processing schemes [5]. In these scenarios, the "same NUMA-node" packing approach does not work since threads would simply be brought to compete for the same CPU-cores, leading to performance degradation.
Based on the above considerations, we feel that the
NUMA aspect should be directly incorporated into the design of algorithms for managing shared data. Hence, in this article we present a NUMA-aware design of a shared event-pool based on the Calendar-Queue archetype. Our solution explicitly controls the locality of the accesses to hardware level memory resources, namely NUMA nodes. This is done by relying on cross-thread insertions of elements in the calendar, where the thread starting the insertion will not complete it in case the target time-bucket of the calendar is hosted by a far NUMA node. In these scenarios, we adopt a deferred-work strategy, leading other threads participating in the application, which are hosted by those far NUMA nodes, to actually finalize (flush) these insertion operations. On the other hand, we explicitly control the delay in the deferring scheme so as to avoid that, when the elements whose insertion was deferred need to be extracted, the whole burden of managing the flush of the deferred insertions related to the target time-bucket is put on an inconvenient thread, i.e., one running on a far NUMA node. In our scheme, we do not only share the data structure among threads, but we also share the work to be done on the data structure so that it is carried out by the most convenient threads. Our solution nevertheless relies on non-blocking coordination of the threads in all of their operations, including posting the deferred work and flushing it. This enables us to actually achieve non-blocking insertions/extractions from the calendar, which has already been shown to play a core role in concurrent event-pool management applications [5]. Based on all its features, we have called our data structure Lock-Free Deferred-Work Calendar Queue (LFDWCQ).
We also report data for an experimental comparison of LFDWCQ with state-of-the-art non-blocking versions of the Calendar Queue, which are however not NUMA-aware, showing how our proposal can achieve up to 15% better performance when running a classical event-pool benchmark on top of a commodity machine equipped with 32 CPU-cores and 64GB of memory organized in 8 NUMA nodes.
The remainder of this article is structured as follows. In Section 2 we discuss related work. LFDWCQ is presented in Section 3. In Section 4 we report experimental results.

2. Related Work
As pointed out, an approach to cope with NUMA is based on OS level (or middleware level) facilities that dynamically place threads and their working set of logical pages on the same NUMA node [6], [7], [8], [9], [10]. These approaches have been shown to be effective in scenarios where threads actually express locality in the access to groups of logical pages. They are not suited for scenarios where the access pattern to data is highly variable, like when there is no stable binding of tasks by threads to portions of the shared data [5]. We cope with this limitation for the case of fully shared event-pool management, which is a core aspect in modern simulation systems to be run on top of multi-core machines.
A different approach has been provided in [11]. It is based on a general technique for NUMA awareness, called Fast Fly-weight Delegation (FFWD) introduced in [12], which resorts to a dedicated server thread to operate on remote memory banks in a NUMA topology. With this solution, locality at the hardware level is achieved since data hosted by a given NUMA node are only touched by specific server threads, which are bound to CPU-cores on that NUMA node. However, this approach is implicitly blocking, since a thread that asks a server thread to operate on the data structure is blocked until the server reply arrives. In contrast, our solution is fully non-blocking, hence implicitly more scalable, and does not require pre-partitioning of threads into clients and servers (with respect to NUMA oriented data access).
The issue of memory-access latency asymmetries in NUMA architectures has also been tackled by using strategies where the same data are replicated across multiple NUMA nodes [13]. This makes them quickly accessible to threads running on CPU-cores hosted by any NUMA node, at the cost of using mechanisms for making the replicated data instances coherent; this cost may become prohibitive for intensive and/or fine-grain data update operations. Our solution avoids this cost entirely since we do not use replication. Furthermore, we are able to manage NUMA optimized accesses in scenarios with fine-grain tasks operating in update mode on the data structures; in fact, insertions and extractions from LFDWCQ are actual update operations.
As for the specific problem we tackle in this article, namely event-pool management, a wide literature exists on making the event pool efficient in terms of both asymptotic and actual costs, like Calendar [14], Ladder [15] and LOCT [16] queues. Furthermore, enhancements of these data structures (or of other data structure flavors like lists or trees) have been proposed for the case of concurrent accesses, in particular by making the data structures accessible via non-blocking algorithms that enable scalability (see, e.g., [17], [18], [19], [20], [21]). However, the proposed algorithms have no intent to improve the locality of accesses with respect to architectures characterized by highly asymmetric memory, such as NUMA platforms. Hence, our proposal is an improvement over these literature solutions, as we also demonstrate via experimental results. In fact, beyond NUMA awareness, we also retain the lock-freedom property, since our data structure provides fully non-blocking operations.

3. Lock-Free Deferred-Work Calendar Queue
LFDWCQ is built on top of a non-blocking and conflict-resilient Calendar Queue [21] (CRCQ). This data structure splits the domain of event timestamps into partitions, called virtual buckets, and maps them to a circular array of ordered linked lists, denoted as physical buckets. Also, the number of events per bucket is guaranteed to be bounded by a constant which is independent from the queue size, delivering amortized constant-time accesses. Whenever the number of
items is no longer balanced across the physical buckets, a resize phase doubles/halves the number of physical buckets in order to restore the balance, so as to keep control over the number of steps performed during insertions and extractions (since these steps depend on the number of events in each bucket). CRCQ provides all these features in a non-blocking fashion and jointly delivers conflict resiliency for extractions, which highly contend in the access to the bucket that keeps the minimum-timestamp event and are well-known to be critical for any concurrent priority queue because of their impact on caches in multi-core platforms [17], [22]. We designed our solution in order to maintain the same progress (lock-freedom) and scalability (conflict-resiliency) guarantees of the CRCQ.
In the following sections, we introduce the main idea at the core of LFDWCQ and describe its actual structure and the operations taking place on it.

[Figure 1. Layout of the front-end DWQ. Recovered labels: DWSTR; physical bucket (non-blocking linked list); DWB (virtual bucket); DWV (array of events).]

3.1. The idea in a nutshell
LFDWCQ has three main design principles:
1) postpone (defer) far-future events insertions;
2) group them in a batch to control locality of memory accesses;

(DWSTR), of linked lists. Differently from ordinary calendar queues, virtual buckets are explicitly maintained by means of individual nodes, called Deferred Work Buckets (DWB), within physical buckets. In turn, each node (i.e., a virtual bucket) maintains deferred events in an unsorted array of fixed size, denoted as Deferred Work Values (DWV)¹. Essentially, our DWQ is a three-dimensional calendar queue where two dimensions are materialized within arrays and one with the usage of ordered non-blocking linked lists [24].
Before items can be extracted, they have to be migrated from DWQ to the underlying CRCQ. Generally, inserting an item into a calendar queue is a cache-unfriendly operation because it involves the traversal of linked lists (the physical buckets) that are well-known to have poor spatial locality and to be sub-optimal in terms of cache usage (at least for large-sized lists). This is exacerbated in the case of NUMA architectures, where a miss in the Last Level Cache (LLC) might trigger a request to a remote cache and/or memory component. All these shortcomings are still present when migrating events from DWQ to the back-end CRCQ. However, our solution alleviates all these problems by migrating items falling in the same virtual bucket in batch. This avoids repeated traversals of nodes within the same physical bucket and allows reusing most of the steps performed during a previous migration of an event, significantly increasing the temporal locality of insertions.
Since extractions are performed from the underlying calendar queue, we need to ensure that all the items belonging to the current virtual bucket have been migrated from DWQ, allowing threads to obtain events by resorting to the original extraction logic provided by CRCQ (events with lower timestamps must be extracted before others). To achieve this goal, threads have to trigger a migration whenever a bucket becomes the new target for extractions, i.e., it becomes hot. However, this reactive approach is not suitable for scalability and NUMA-awareness. In fact, regardless of its placement within the NUMA topology, any thread might
3) provide non-blocking progress of threads. trigger such a reactive migration, increasing the probability
of conflicts.
To achieve all these goals, we paired CRCQ with a front-end In more detail, the current bucket (the one keeping the
data structure called Deferred Work Queue (DWQ) aimed at lower timestamp events) has become hot because it is con-
maintaining events whose management has been deferred. currently targeted by all the threads performing extractions.
In more detail, these events are not directly connected to Also, it is frequently updated by memory-write accesses
the underlying calendar queue upon their insertion; rather, to signal item removals. This has a dramatic impact on
they are appended to DWQ in order to be processed later performance because of the costs associated with the cache-
along a more favorable phase of execution of some thread. coherency protocols running on firmware. In order to do
On the other hand, extractions are performed directly from not worsen this already challenging scenario, we adopted a
the CRCQ, which—as mentioned—embeds advanced tech- proactive approach for migrating events from DWQ to the
niques towards conflict resiliency and scalability. Overall, calendar. In particular, instead of migrating items belonging
events inserted into the front-end DWQ will be eventually to the already hot buckets, we flush in advance “mid-
migrated to the underlying (back-end) CRCQ. temperature” virtual buckets, namely those that are in the
In order to make our approach effective, adding items (near) future of the currently hot bucket. This guarantees
to DWQ has to be a low latency and low memory-footprint that, whenever a flushed bucket becomes hot all its items
operation, otherwise the costs for posting events overpass have been already migrated. To further reduce the proba-
the benefits given by their postponed batch-insertion. For bility of conflict during migration phases, we also adopt
this reason, the DWQ layout, shown in Figure 1, is mainly
based on arrays. It resembles the classical calendar queue 1. We could easily support dynamically sized arrays by resorting to lock-
arrangement by having an array, called Deferred Work Struct free dynamic vectors [23], but this is not the main focus of this work.
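To make the layout of Figure 1 more concrete, the following C sketch shows one possible representation of a DWB with its fixed-size DWV, together with a lock-free append of a deferred event. It is only an illustration under stated assumptions, not the authors' implementation: the names `event_t`, `dwb_t`, `DWV_CAPACITY`, `virtual_bucket`, `physical_bucket` and `dwv_try_append` are hypothetical, the timestamp-to-bucket mapping follows the classical calendar-queue arrangement, and claiming a slot with a fetch-and-add is an assumed technique. Publishing a written slot to a concurrent migrator is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DWV_CAPACITY 8 /* hypothetical fixed size of each DWV array */

typedef struct event {
    double timestamp;
    void *payload;
} event_t;

/* Deferred Work Bucket: a node of a physical bucket (linked list) that
 * represents one virtual bucket and stores its events unsorted. */
typedef struct dwb {
    unsigned long vb_index;    /* index of the virtual bucket */
    atomic_size_t next_slot;   /* next free position in the DWV */
    event_t dwv[DWV_CAPACITY]; /* Deferred Work Values */
    struct dwb *next;          /* next DWB in the same physical bucket */
} dwb_t;

/* Assumed mapping: timestamps are partitioned into virtual buckets of
 * equal width, as in a classical calendar queue. */
static unsigned long virtual_bucket(double ts, double bucket_width)
{
    assert(ts >= 0.0 && bucket_width > 0.0);
    return (unsigned long)(ts / bucket_width);
}

/* Virtual buckets are mapped onto a circular array of physical buckets. */
static size_t physical_bucket(unsigned long vb, size_t n_physical)
{
    assert(n_physical > 0);
    return (size_t)(vb % n_physical);
}

/* Claims a slot with a single atomic fetch-and-add. Returns false when
 * the DWV is full, in which case the caller would insert the event
 * directly into the back-end calendar queue. A complete design would
 * also need a way to tell a migrator that the slot has been written. */
static bool dwv_try_append(dwb_t *b, event_t ev)
{
    size_t slot = atomic_fetch_add(&b->next_slot, 1);
    if (slot >= DWV_CAPACITY)
        return false;
    b->dwv[slot] = ev;
    return true;
}
```

In this sketch an append touches a single cache line of the target DWB in the common case, which is in the spirit of the low-latency requirement stated above; whether the actual DWQ claims slots this way is not stated in the text.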
accesses and the effect of executing read-modify-write instructions. The first one is the average number of items that belong to a virtual bucket. This clearly has an effect on the utilization of DWV arrays within DWBs of our DWQ. In fact, when a DWV is full, threads insert items directly into the underlying calendar queue. Consequently, the denser the virtual buckets, the more threads can rely on DWQ to reduce the insertion latency. Moreover, since deferred enqueues are processed in a batch, we can tolerate longer lists (physical buckets) than both blocking and non-blocking calendar queues. In more detail, we used the same approach for bucket sizing adopted in CRCQ, which makes the number of events per bucket proportional to the average number of concurrent threads accessing the data structure.

The second key point for optimizing LFDWCQ consists in controlling the timeline of item migrations from DWQ to CRCQ. Such an operation is carried out by extractions. On the one hand, since threads have to handle deferred work, the contention upon extractions, and hence the impact of conflicts, is reduced. On the other hand, migrating items from DWQ to CRCQ is an operation characterized by a heavy-weight usage of RMW instructions, which might hamper performance due to their impact on caches. These effects can be alleviated if the updated cache lines are unlikely to be shared. To this aim, we make extractions proactively migrate items by flushing those in a bucket subsequent to the current one, namely the virtual bucket currently targeted by extractions. The idea is to ensure that when a virtual bucket becomes hot, namely it becomes the new target for extractions, it is already filled with all its (already migrated) items. This prevents multiple threads from trying to migrate the same events, sharing cache lines and increasing the pressure on the cache-coherency firmware.

Virtual buckets that immediately follow the hot one in the priority domain will be active for extractions soon. Hence, we can define the hotness of a bucket B as the distance in the timestamp-based priority domain between B and the hottest one, namely the bucket currently used for extractions. A visualization of this concept is provided in Figure 3. As suggested before, migrating an already hot bucket increases the likelihood of conflicts. On the other hand, migrating cold buckets, those far away from the hottest one, might reduce the utilization of the DWQ and cause most of the insertions to be performed directly on the underlying calendar queue, losing control on locality. Consequently, our proactive migration targets warm buckets, whose distance from the hottest one is large enough to avoid conflicts, yet not so large as to reduce DWQ utilization. Since the advancement of the highest priority is fast as threads extract events, the beginning and the end of the warm region are strictly related to the actual level of concurrency insisting on the data structure. Based on this consideration, we consider the first 2N buckets after the one currently used for extractions as either hot or warm, where N is the number of active threads. Since the hot region shifts by one bucket at a time, buckets near the hottest one have a higher likelihood of being proactively migrated and the utilization of cold buckets is not hampered.

Figure 3. Visual representation of the bucket hotness, namely the likelihood that it becomes active for extractions along wall-clock time. [Diagram labels: current bucket; hot buckets; warm buckets; cold buckets; bucket hotness axis from hot to cold.]

Whenever a thread detects a conflict during a proactive migration phase, we make just one thread proceed while the other can fall back by executing its extraction from the current bucket. Such a back-off scheme can be easily implemented by exploiting the result of an individual Compare&Swap performed by a thread. In particular, if a swap performed by a thread A fails, it means that another thread B is working on the same bucket. Thus, we avoid any additional conflict by making thread A stop running the migration protocol and proceed with a classical extraction from the CRCQ. To further reduce the probability of conflicts upon migration phases, buckets of both calendars are assigned to NUMA nodes in a circular fashion and we make threads proactively migrate only buckets assigned to the NUMA nodes on which they are running. The benefits introduced by this simple scheme are two-fold. On the one hand, we reduce the set of threads that might compete for migrating a given bucket, hence the likelihood of conflicts. On the other hand, memory requests issued during migrations do not propagate towards far NUMA nodes, reducing latency for accessing and updating memory.

In order to make such a binding effective, we need to ensure that the memory buffers used for a virtual bucket and its events are effectively allocated on the target NUMA node. Ensuring this requires anything from no adjustment to small adjustments to the underlying memory allocator. In fact, if threads are pinned to run on a specific core, no countermeasures have to be taken at all because the OS (e.g. the Linux kernel) typically allocates memory frames on the NUMA node nearest to the core issuing the request and/or first accessing the just allocated page. Conversely, if thread execution might migrate across different cores and NUMA nodes, a NUMA-aware allocator is required to have full control of memory buffer placement.

4. Experimental evaluation

We have compared the behavior of LFDWCQ with that of recent implementations of non-blocking calendar queues, namely the Non-blocking Calendar Queue (NBCQ) presented in [20] and the Conflict-Resilient Calendar Queue (CRCQ) [21]. All the data structures use the Epoch-Based
Garbage Collector described in [26]. The benchmark used is the well-known Classic HOLD [27], where the queue is initially pre-populated to reach a target size and then it is stressed by having multiple threads perform a hold operation, namely an extraction immediately followed by an insertion. The timestamp increment of the newly inserted event (compared to the last extracted one) is obtained via an exponential distribution with mean equal to 1. Our aim was to evaluate the performance of the data structure at steady state, so we ran it for 10 seconds after the pre-population phase had completed. The performance metric we used is the average throughput computed over 10 different executions. All the experiments have been carried out on an HP ProLiant server equipped with 4 AMD Opteron 6128 processors running at 2 GHz. Each processor has 8 cores for a total of 32 hardware threads. The machine has 64GB of RAM arranged in 8 NUMA nodes and runs Debian 9.2 (version 5.4.0 of the Linux kernel) as Operating System. All the code of the tested solutions is written in C and compiled with gcc 9.2.1 with the highest optimization flag (O3).

Figure 4. Average Throughput for different queue sizes and thread counts. [Four panels; horizontal axis: #Threads, from 0 to 32.]

Figure 4 shows the average throughput (and its standard deviation) while running the benchmark with queue size ranging from 4 · 10^5 to 4 · 10^6 and different thread counts (from 1 to 32). It clearly shows that our approach pays off and has an improved behavior across all evaluated scenarios. To make this concept clear, consider the trend of NBCQ and CRCQ while increasing both queue size and concurrency level. On the one hand, if the thread count is lower than 16, NBCQ has higher throughput than CRCQ. On the other hand, when the concurrency level increases, the roles are reversed. This is because the conflict resiliency provided by CRCQ is traded off with latency (a trend well-known in the literature [17]), penalizing performance at lower concurrency/contention.

Thanks to its NUMA-awareness, LFDWCQ alleviates the CRCQ inefficiencies with lower thread counts by improving the locality of memory accesses. In fact, the throughput of LFDWCQ always stands between the one of CRCQ and NBCQ when the number of threads is lower than 16, maintaining a distance from the optimum bounded to 15% and reducing the performance loss of CRCQ compared to NBCQ by at least 50% and up to 100%. This is extremely relevant since the actual concurrency level of applications can vary. Consequently, improving the behavior of non-blocking calendar queues in a wider range of scenarios relieves the developer from the burden of choosing the implementation that best fits her/his use case. When the number of active threads exceeds 16, LFDWCQ provides up to a 15% performance improvement w.r.t. the optimum in 3 out of 4 cases, showing that our approach towards NUMA-awareness is effective also at high levels of concurrency/contention. In particular, when the queue size is smaller than 4 million, the trend is still rising, suggesting that, if we could increase the number of cores, the gap between CRCQ and LFDWCQ would likely increase. Consequently, an improved scalability has emerged as a secondary benefit of our approach. However, when the queue size is set to 4 million, we achieve the same performance as the original approach, showing no gain at full concurrency. This is because the benefits of batch insertions are reduced when the density of the events per virtual bucket increases too much.

This behavior is more evident when observing the latency of both insertion and extraction routines shown in Figures 5 and 6, respectively. The cost per insertion (enqueue) decreases when the event density increases because we have more opportunities to exploit DWQ for insertion of future events, leading up to a 50% improvement at full concurrency and largest queue size. On the other hand, the batch migration of items performed by extraction (dequeue) invocations alleviates the impact of high contention with smaller queue size (up to a 33% improvement w.r.t. pure CRCQ). However, the latency becomes larger than the one provided by CRCQ when the queue size increases, offsetting the gains achieved by the enqueues. This suggests that our approach can be further extended by removing the batch migration from the critical path of dequeue executions, e.g. resorting to helper threads whose role is just the one of migrating items from DWQ to the underlying calendar queue. This is a direction we will explore as future work.

Finally, we analyzed the impact on caches of the different calendar queue implementations by monitoring the miss ratio of LLC accesses. This has been computed by resorting to Hardware Performance Counter statistics gathered with LIKWID [28], a well-known suite to access such low-level monitors; the hardware events to be sampled have been chosen according to the formula for LLC miss-ratio given in [29]. Figure 7 shows that our approach reduces the miss ratio by 50% independently of the queue size. This shows that relying on work deferring to insert events in batch is an
effective technique to improve locality.

[Plot residue for the figures on this page (Figures 5 to 7 are referenced in the text). Recoverable labels: Enqueue latency (us) and Dequeue latency (us) versus #Threads (0 to 32), with series LFDWCQ, NBCQ and CRCQ; a third group of panels has an unlabeled vertical axis from 0 to 80.]

5. Conclusions

The management of event pools plays a central role in many applications, including simulation. The recent hardware trend towards multi/many-core platforms has therefore generated a great interest in having event-pool management algorithms capable of providing scalability in the face of concurrent accesses. However, far less work has been done to take into account another factor characterizing modern parallel machines, namely Non-Uniform-Memory-Access (NUMA). In this article, we have presented the Lock-Free Deferred-Work Calendar Queue, an event pool, based on the Calendar-Queue archetype, which jointly offers scalable thread coordination (via non-blocking solutions) and NUMA-awareness. The latter feature has been achieved via an approach that changes the actual locality of operations by threads in the different memory banks in the NUMA architecture, an objective that has been reached by incorporating work deferring concepts into our solution. Performance tests with a classical benchmark, executed on top of an off-the-shelf medium-end machine equipped with 32 physical cores and 8 NUMA nodes (globally entailing 64GB of RAM), have shown how our solution can provide up to a 15% performance boost compared to state-of-the-art event-pool management algorithms already suited for multi-core machines.

References

[1] D. W. Wall, “Limits of instruction-level parallelism,” in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS IV. New York, NY, USA: ACM, 1991, pp. 176–188. [Online]. Available: http://doi.acm.org/10.1145/106972.106991
[2] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995. [Online]. Available: http://doi.acm.org/10.1145/216585.216588
[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 365–376. [Online]. Available: https://doi.org/10.1145/2000064.2000108
[4] M. Herlihy and N. Shavit, “On the nature of progress,” in Proceedings of the 15th International Conference on Principles of Distributed Systems, ser. OPODIS '11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 313–328. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-25873-2_22
[5] M. Ianni, R. Marotta, D. Cingolani, A. Pellegrini, and F. Quaglia, “The ultimate share-everything PDES system,” in Proceedings of the 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, Rome, Italy, May 23-25, 2018, F. Quaglia, A. Pellegrini, and G. K. Theodoropoulos, Eds. ACM, 2018, pp. 73–84. [Online]. Available: https://doi.org/10.1145/3200921.3200931
[6] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth, “Traffic management: A holistic approach to memory placement on NUMA systems,” in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 381–394. [Online]. Available: https://doi.org/10.1145/2451116.2451157
[7] L. Tang, J. Mars, X. Zhang, R. Hagmann, R. Hundt, and E. Tune, “Optimizing Google's warehouse scale computers: The NUMA experience,” in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), ser. HPCA '13. USA: IEEE Computer Society, 2013, pp. 188–197. [Online]. Available: https://doi.org/10.1109/HPCA.2013.6522318
[8] B. Lepers, V. Quema, and A. Fedorova, “Thread and memory placement on NUMA systems: Asymmetry matters,” in 2015 USENIX Annual Technical Conference (USENIX ATC 15). Santa Clara, CA: USENIX Association, Jul. 2015, pp. 277–289. [Online]. Available: https://www.usenix.org/conference/atc15/technical-session/presentation/lepers
[9] A. Pellegrini and F. Quaglia, “NUMA time warp,” in Proceedings of the 3rd ACM Conference on SIGSIM-Principles of Advanced Discrete Simulation, London, United Kingdom, June 10-12, 2015, S. J. E. Taylor, N. Mustafee, and Y. Son, Eds. ACM, 2015, pp. 59–70. [Online]. Available: https://doi.org/10.1145/2769458.2769479
[10] I. D. Gennaro, A. Pellegrini, and F. Quaglia, “OS-based NUMA optimization: Tackling the case of truly multi-thread applications with non-partitioned virtual page accesses,” in IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, May 16-19, 2016. IEEE Computer Society, 2016, pp. 291–300. [Online]. Available: https://doi.org/10.1109/CCGrid.2016.91
[11] F. Strati, C. Giannoula, D. Siakavaras, G. Goumas, and N. Koziris, “An adaptive concurrent priority queue for NUMA architectures,” in Proceedings of the 16th ACM International Conference on Computing Frontiers, ser. CF '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 135–144. [Online]. Available: https://doi.org/10.1145/3310273.3323164
[12] S. Roghanchi, J. Eriksson, and N. Basu, “Ffwd: Delegation is (much) faster than you think,” in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 342–358. [Online]. Available: https://doi.org/10.1145/3132747.3132771
[13] I. Calciu, S. Sen, M. Balakrishnan, and M. K. Aguilera, “Black-box concurrent data structures for NUMA architectures,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 207–221. [Online]. Available: https://doi.org/10.1145/3037697.3037721
[14] R. Brown, “Calendar queues: A fast O(1) priority queue implementation for the simulation event set problem,” Commun. ACM, vol. 31, no. 10, pp. 1220–1227, Oct. 1988. [Online]. Available: https://doi.org/10.1145/63039.63045
[15] W. T. Tang, R. S. M. Goh, and I. L.-J. Thng, “Ladder queue: An O(1) priority queue structure for large-scale discrete event simulation,” ACM Trans. Model. Comput. Simul., vol. 15, no. 3, pp. 175–204, Jul. 2005. [Online]. Available: https://doi.org/10.1145/1103323.1103324
[16] F. Quaglia, “A low-overhead constant-time lowest-timestamp-first CPU scheduler for high-performance optimistic simulation platforms,” Simulation Modelling Practice and Theory, vol. 53, pp. 103–122, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1569190X15000209
[17] J. Lindén and B. Jonsson, “A skiplist-based concurrent priority queue with minimal memory contention,” in Principles of Distributed Systems, R. Baldoni, N. Nisse, and M. van Steen, Eds. Cham: Springer International Publishing, 2013, pp. 206–220.
[18] S. Gupta and P. A. Wilsey, “Lock-free pending event set management in time warp,” in Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ser. SIGSIM PADS '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 15–26. [Online]. Available: https://doi.org/10.1145/2601381.2601393
[19] R. Marotta, M. Ianni, A. Pellegrini, and F. Quaglia, “A non-blocking priority queue for the pending event set,” in Proceedings of the 9th EAI International Conference on Simulation Tools and Techniques, ser. SIMUTOOLS '16. Brussels, BEL: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016, pp. 46–55.
[20] R. Marotta, M. Ianni, A. Pellegrini, and F. Quaglia, “A lock-free O(1) event pool and its application to share-everything PDES platforms,” in Proceedings of the 20th International Symposium on Distributed Simulation and Real-Time Applications, ser. DS-RT '16. IEEE Press, 2016, pp. 53–60. [Online]. Available: https://doi.org/10.1109/DS-RT.2016.33
[21] R. Marotta, M. Ianni, A. Pellegrini, and F. Quaglia, “A conflict-resilient lock-free calendar queue for scalable share-everything PDES platforms,” in Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, ser. SIGSIM-PADS '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 15–26. [Online]. Available: https://doi.org/10.1145/3064911.3064926
[22] D. Alistarh, J. Kopinsky, J. Li, and N. Shavit, “The SprayList: A scalable relaxed priority queue,” SIGPLAN Not., vol. 50, no. 8, pp. 11–20, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2858788.2688523
[23] D. Dechev, P. Pirkelbauer, and B. Stroustrup, “Lock-free dynamically resizable arrays,” in Proceedings of the 10th International Conference on Principles of Distributed Systems, ser. OPODIS '06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 142–156. [Online]. Available: http://dx.doi.org/10.1007/11945529_11
[24] T. L. Harris, “A pragmatic implementation of non-blocking linked-lists,” in Proceedings of the 15th International Conference on Distributed Computing, ser. DISC '01. London, UK: Springer-Verlag, 2001, pp. 300–314. [Online]. Available: http://dl.acm.org/citation.cfm?id=645958.676105
[25] M. P. Herlihy and J. M. Wing, “Linearizability: A correctness condition for concurrent objects,” ACM Trans. Program. Lang. Syst., vol. 12, no. 3, pp. 463–492, Jul. 1990. [Online]. Available: http://doi.acm.org/10.1145/78969.78972
[26] K. Fraser, “Practical lock-freedom,” Ph.D. dissertation, University of Cambridge, 2004.
[27] R. Rönngren and R. Ayani, “A comparative study of parallel and sequential priority queue algorithms,” ACM Trans. Model. Comput. Simul., vol. 7, no. 2, pp. 157–209, Apr. 1997. [Online]. Available: https://doi.org/10.1145/249204.249205
[28] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments,” in Proceedings of the 2010 39th International Conference on Parallel Processing Workshops, ser. ICPPW '10. USA: IEEE Computer Society, 2010, pp. 207–216. [Online]. Available: https://doi.org/10.1109/ICPPW.2010.38
[29] P. J. Drongowski and B. D. Center, “Basic performance measurements for AMD Athlon 64, AMD Opteron and AMD Phenom processors,” AMD whitepaper, vol. 25, 2008.
Abstract: Agent-based Modeling and Simulation is a powerful technique which makes it possible to study the interactions in complex systems, and to explore or even foresee the emergence of more complicated properties or behaviors related to the interaction among the simpler agents in the environment. In the context of emergency or crisis scenarios, Agent-based Modeling and Simulation makes it possible to effectively study emergency plans, with the goal of assessing their viability, also with respect to the number of possible fatalities. In this paper, we analyze Agent-based Modeling and Simulation for crisis scenarios from a methodological and empirical point of view, with the goal of identifying the behavioral parameters that a model should encompass in order for the results of the simulation to be useful for emergency plan assessment and/or compilation. We also experimentally provide a characterization of the effects of such behavioral parameters.

Keywords: Agent-Based Modeling and Simulation, Emergency Simulation, Planning.

978-1-7281-7343-6/20/$31.00 ©2020 IEEE

I. INTRODUCTION

Agent-Based Modeling and Simulation (ABMS) is a powerful paradigm in which the system is represented by a collection of autonomous decision-making entities (the agents) which are set out in an environment [1], [2]. Each agent individually assesses the surrounding environment, also taking into account the presence of other agents, and makes decisions on the basis of a certain set of rules which implement its behavior. During its lifetime, an agent can decide to change its behavior, also depending on the environment state and interactions with other agents. The actions that agents take might also have effects on other agents and/or on the surrounding environment; for example, an agent can produce, consume, or exchange items.

ABMS is considered incredibly powerful for multiple applications and real-world business problems for a number of reasons. First of all, the model developer can concentrate on the design of the agents' behavior independently of where the agents will act. This significantly simplifies the development of complex models, making it possible to reach results which could be difficult to obtain when relying on more traditional mathematical methods [3], [4]. Second, the interaction of multiple agents in a system can exhibit complex behavioral patterns [5], able also to show (or even anticipate) what is commonly referred to as emergent behavior. Emergence occurs when an entity is observed to have properties its parts do not have on their own. These properties or behaviors emerge only when the parts interact in a wider whole. Several approaches have also coupled sophisticated models with neural networks [6], evolutionary algorithms [7], or other learning techniques in order to provide the agents with behavioral adaptation, making ABMS even more powerful and realistic.

ABMS can be regarded as an effective methodology to address the problem of studying the behavior of crowds when emergency situations arise. This is particularly important for large-scale events, which are prone to natural disasters and chaos generated by people, which could cause severe threat to crowds. Among the possible events which should be subject to careful analysis we can enumerate religious services, sport events, cultural shows, public demonstrations and marches of any sort and kind. In many countries, the organization of these events must be accompanied by the compilation of ad-hoc security and evacuation plans, to reduce the risk of accidents and fatalities. When these plans are compiled, it is fundamental to identify solutions which allow the crowd to escape from catastrophic events in the shortest possible amount of time and/or minimize the number of people injured or subject to death. In many real-world scenarios, simply following the shortest path to an exit might not deliver optimal results. Plans should also consider the possibility that some security exits are blocked, or that the direction to be followed should change during the escape; this could be the case, for example, of cascading catastrophic events, such as the collapse of part of a building due to a fire.

In many scenarios, compiling these plans is difficult. Indeed, only real-world experience based on real accidents (which involve real people) could provide the required information to compile the plans. Of course, this is not viable: real-world experience or experiments with real people can be too costly, dangerous, or might be simply impossible, as in the case of the compilation of evacuation plans for buildings or architectonic ensembles which are not yet built.

In the case of a catastrophic event, a fundamental aspect to be taken into account (especially in large environments) is that people do not immediately become aware of the risk or the occurrence of the event itself. In these circumstances, the panic generated by the event could be worsened by detrimental behavior due to people observing escaping crowds, without knowing the reason for it. For this reason, the law in several countries demands the escape plans to explicitly consider the presence of police (or other law/security
enforcement agencies) which should monitor the emergency situation, inform the people by technological means such as loudspeakers, and/or guide the crowd towards the best-suited security exit. Disregarding the possibility that the crowd ignores the information provided by security agents (we will deal with this possibility in Section III) could be a source of ineffectiveness of the plan itself. Additionally, when compiling (or actuating) a security plan, a fundamental question is: “how many security agents should be used to minimize the number of fatalities, and what is their best-suited position in the environment?”

In this paper, we explore ABMS as a technique to support the compilation of these security plans, explicitly accounting for different behavioral aspects which should be considered when designing the logic behind single agents, so as to capture in a highly realistic way the emergent behavior of crowds. ABMS has features (autonomy, reactivity, pro-activity, and social interaction of the agents) which make this method a natural choice for scenarios requiring autonomous and adaptive participating agents [8]. Nevertheless, particular care must be put in the design of such models. Indeed, one way of modeling such scenarios is to focus on global flow considerations [9], or on local interactions only [10]. Structurally, an egress scenario can be studied taking into account all the reachable exits, while distributing the population evenly (in terms of egress time), as is typically done in flow control [11], [12]. Nevertheless, at an individual level, agents are not particles, but social entities [13].

We define several building blocks of the agents which we consider fundamental to execute significant ABMS simulations of evacuation scenarios. We believe that such an analysis could be helpful for people studying the behavior of crowds, and for practitioners who are involved in the development of assistive tools for the compilation of security plans. In particular, we consider the modeling methodology presented here as effective for evacuation simulations in the context of earthquakes, landslides, floods, fires, terrorist attacks, crazy drivers, shootings, collapses, bombings, panic caused by misbehaving people, or abandoned objects which could be thought to be bombs, just to mention a few. Anyhow, depending on the specific scenario, fewer aspects of the holistic modeling approach which we propose can be considered, as the behavior of the agents is fully probabilistic.

We complete our exploration with an experimental characterization of the effects of the different behavioral aspects and parameters on the final results of the simulations. With this study, we stress the need for a holistic approach in ABMS for evacuation scenarios.

The remainder of this paper is structured as follows. In Section II we discuss related work. Our modeling methodology is presented in Section III. The experimental characterization is reported in Section IV.

II. RELATED WORK

A lot of work has been done on ABMS, especially in the context of frameworks and runtime environments to support their execution on large-scale clusters. We refer the reader to the comprehensive work in [14] for a thorough discussion on the technical aspects related to the deployment of agent-based models.

From the point of view of ABMS as a methodology to study crowds in the context of evacuations, it has been proven to be an effective way to model and analyze the movements and the behavior of very dense crowds. This approach has been applied to many diverse scenarios, such as malls, airports, or parks. Abdelghany et al. [15] have presented a simulation-optimization modeling framework to study the evacuation of large-scale pedestrian facilities with multiple exit gates. In their work, they couple genetic algorithms and ABMS to generate optimal evacuation plans for hypothetical crowded exhibition halls. The authors assume that the involved people receive evacuation instructions, which is an important aspect, but they nevertheless do not take into account the possibility that security exits become unavailable while the crowd is evacuating the building. Moreover, they assume that the people will follow the provided instructions accurately and unequivocally, which is a strong assumption for real-world emergency scenarios.

Wang and Wainer [16] have presented a distributed framework for modeling evacuation of crowds which models the environment in a realistic way starting from CAD/BIM authoring tools. This work illustrates the importance of relying on realistic environments for real-world models. We consider the environment to be a fundamental aspect in the modeling methodology, and we discuss how general environments should be modeled, although we do not retain the capability of using authoring tools out of the box.

Zheng et al. [17] have evaluated different methodologies to carry out crowd evacuation simulations. The evaluated methodologies include cellular automata models, lattice gas models, social force models, fluid-dynamic models, agent-based models, game theoretic models, and approaches based on experiments with animals. The authors conclude that psychological and physiological elements affecting individual and collective behaviors should also be incorporated into the evacuation models, the assessment of which is exactly part of the characterization which we carry out in this paper.

The importance of aspects such as physiological, emotional, and social group attributes has been studied in [18]. This work shows that when social group and crowd-related behaviors are modeled according to findings and theories observed from social psychology, and when the interactions among individuals are realized by means of agent-based execution processes, it becomes easier to simulate persons' awareness of the situation and consequent changes on the internal attributes, and the results are realistic at both individual and group level.

Du et al. [19] have shown that evacuation plans could be significantly suboptimal if the involved people are significantly older than in average situations. In their work, they have shown that older people are often not taken into account with great care, even when compiling evacuation plans for senior apartment buildings. Older people typically have a different behavior in emergency situations, as they move slower and might ask for help [20], and have a higher fall
probability [21]. Puts et al. [22] have shown that by 2050 the world population with an age greater than 60 years will be composed of 22 billion people, and Prot and Clements [23] have shown that older people are more subject to accidents than other people. All in all, from this body of work, it is quite clear that it is not possible to avoid considering age in ABMS of evacuation plans, as the presence of elderly people might also lead to unexpected emergent behavior of the crowd.

Chu et al. [24] have shown that egress simulations produce significantly different results when taking into account different agent behavioral models, namely following familiar exits, following cues from building features, navigating with social groups, and following crowds. Similarly, Zia and Ferscha [25] have shown that it is fundamental to combine individual, social and technological models of people during evacuation, in order to obtain results which are close to real-world scenarios. These are aspects which we explicitly retain, while we combine them with additional behavioral characteristics.

Overall, we consider all the aforementioned aspects in this paper (and additional ones), we try to orchestrate the concepts in a holistic way with respect to the modeling strategy, and we provide an experimental characterization of the effects of these behavioral parameters on the overall simulation results.

III. THE MODELING APPROACH

The modeling approach which we propose and study in this paper can be regarded as a tool for analysis, study, and forecast of the behavior of crowds in closed or open space environments, with a special focus on evacuation in case of crisis scenarios. The approach is based on ABMS, and we define and combine the characteristics of each behavioral aspect which we consider fundamental for a significant simulation able to also produce realistic emergent behavior. The ultimate goal of this modeling approach is to allow for a what-if analysis of the evacuation plans of buildings and/or public events.

A. Representation of the Environment and Management of Correlated/Timed Events

A fundamental aspect for effective ABMS of crowd egress scenarios is to provide a high degree of parameterization and behavioral capabilities at the level of the agents and the environment. As far as the environment is concerned, it is fundamental to specify an accurate representation of the obstacles that the agents moving around could find on their way. We consider traditional grid-based representations to be only partially suited for the purpose. In particular, the work in [26] has shown the importance in ABMS of relying on a graph-based topology to represent more complex environments. In our modeling approach, we envisage the reliance on more traditional grid-based environments to represent portions of the overall space, which are then linked in a graph-like fashion from/to specific points of the grids. This solution makes it easy to represent multi-level buildings, or areas which can be reached only from specific entrance points, and provides a good degree of flexibility in the configuration of the environment.

Moreover, it is fundamental to be able to specify the initial condition for the crowd distribution, and possible source points of other people entering the environment. In our approach, the steady state of the crowd distribution can be reached thanks to mobility models and/or by specifying the initial distribution of the crowd in the environment. There has been extensive research on this aspect in the literature, and we refer the reader to the work in [27] for a discussion and a possible methodology with respect to this specific aspect.

As mentioned in Section I, we target in our modeling approach several different emergency scenarios. At the same time we advocate that, for a reliable assistive tool for the compilation of evacuation plans, it is important to take into account the integration of multiple catastrophic events. Therefore, an ABMS model must provide the possibility to consider that, during a single simulation, multiple events occur at different time instants. It is also fundamental to correlate such events. Therefore, the modeling approach should consider that, given the occurrence of some event in the environment, correlated events could take place after a certain amount of time, either in a fixed way, or by creating relations which are based on probability distributions. This is the case, e.g., of parts of the building collapsing some time after an explosion took place. Another example is that of combined terrorist attacks, which take place shortly one after the other, also while the crowd is already escaping. Often, it is extremely hard to make an analysis of such events when compiling an evacuation plan, given the high number and stochasticity of variables to account for, thus making ABMS a fundamental assistive methodology.

Another aspect to account for is the timely intervention of rescuers or police. This is an aspect that also depends on the environment. As an example, a catastrophic event happening at a concert might be more difficult to manage for rescuers, as the high density of the crowd could prevent rescue vehicles from reaching the critical points quickly. Also, the mixture of people and vehicles in the same environment could create more security risks, or increase the level of panic in the people attending the event. Additionally, the social behavior of the people is such that they could seek rescuers, even if they do not actually need assistance, thus slowing down the intervention, or creating variations in the evacuation flow as soon as rescuers reach the incident location.

As already highlighted, the way in which evacuation starts can play a fundamental role in the evacuation process. In large environments, different people could be informed in different ways of the occurrence of an event for which they should egress. People near the accident will likely notice the event by themselves, while people farther away might be notified by loudspeakers, they could observe part of the crowd running away, or they could be notified “remotely” by some kind of gossip dissemination (social networks or messaging applications could also play a role here). This kind of remote interaction could also be misinterpreted, driving part of the crowd towards the critical place(s) in the environment, rather than in the opposite direction. This is a kind of emergent behavior which could lead to the adoption of different notification systems in the environment, or which could drive
the selection of the best-suited position of law enforcement officers in the environment, e.g., during some event.

B. Behavioral Characteristics of Crowds

All the aspects which we have discussed so far have a different effect on the evacuation of the crowd depending on the characteristics of the single person who is involved in the evacuation. We advocate that there are some fundamental aspects which must be considered for an evacuation simulation to be reliable, and we stress that these aspects cannot be studied separately from each other. In the following, we describe the aspects which must be taken into account when describing the behavior of an agent in a simulation model.

a) Emotionality and Emotional Contamination: This is a fundamental aspect to take into account to describe the behavior of the individuals during emergency situations. Anxiety, panic attacks, fear, and bewilderment are all aspects of the personality of an individual which could lead to “erroneous” or dangerous actions, both for the single individual and for the community during an evacuation. Emotional attitude should be described and considered, and it must also be combined with environmental aspects which can change the actions that an individual is performing during the evacuation. We model emotionality as a numerical value which is increased taking into account the presence of a number n of people nearby (the concept of crowdedness), the distance d_c from the catastrophic event, and the observability of the exit point, along with its estimated distance d_e (if the exit is not observable, we set d_e = ∞). Each individual is characterized by an emotional factor η ∈ [0, 1] which drives the speed according to which the emotionality value is updated towards the critical threshold. Overall, emotionality (which is always in the range [0, 1]) is updated according to Equation (1), which accounts for a very high emotionality ramp up after the occurrence of the critical event:

$$E' = \eta \cdot \frac{1}{e} \left( 1 + \frac{d_c}{n \cdot d_e} \right)^{\frac{n \cdot d_e}{d_c}} + (1 - \eta) E, \tag{1}$$

where e is a control variable which is set to ∞ until the occurrence of the catastrophic event, and to 1 afterwards; this prevents the emotionality value from increasing in a normal environment.

Every time that the emotional value E for an individual exceeds a certain threshold Ē, the agent starts to misbehave. Misbehavior entails forgetting about its heading towards an exit, and starting to move according to a random walk, also possibly seeking rescuers if they are nearby. This misbehavior continues until the emotional value is reduced below the threshold, e.g., thanks to the agent getting closer to an exit.

Emotional contamination is also taken into account in this computation: two or more individuals who are in proximity could “contaminate each other” with respect to their behavior. As an example, if multiple anxious people are gathered together, without any “leader” or “stronger” individual in proximity, they might generate collective panic crises which are detrimental to security and safety.

b) Age: As already mentioned, age is an increasingly important aspect to take into account when compiling security plans. The age of single individuals could alter the way in which they move and orientate in the surrounding environment. In particular, the speed at which an individual moves in the environment is inversely proportional to their age. A fall probability is also defined depending on the age, which is exponential with respect to the age.

c) Grouping: Studying the emergent behavior of the crowd must be done also taking into account that multiple individuals might know each other beforehand, and that they are set in the environment as a group; a simple example is a family, or a group of friends. It is likely that such groups will exhibit a “pack behavior”, in which the interactions with the environment and the movements happen as a group. These groups will exhibit a behavior which will try to maximize the probability for the group to stay together, and it is something that could potentially affect the emergent behavior. Different environments can be characterized by a different probability of grouping, and this should be explicitly taken into account when compiling an evacuation plan. In our approach, the grouping probability P_g gives the probability that an agent is grouped with nearby agents.

d) Remote Grouping: The wide spread of social networks and the ubiquitous presence of communication means add the need to account for a different kind of grouping. In particular, if a group of people entered the environment together, but later split for any reason, it is likely that if an emergency scenario arises, they will try to regroup or get in contact before leaving the environment. This could clearly create delays in the evacuation, or counter-intuitive behaviors (e.g., moving towards the accident point). Again, this is an aspect which must be taken into account to deliver reliable simulations. In our modeling approach, the remote grouping probability P_rg determines whether the agent, after becoming aware of the emergency situation, will stop for a random amount of time or will start to move towards the other individuals forming its group. The two behaviors are chosen uniformly at random.

e) Memory and Knowledge: Different individuals might have a different knowledge of the surrounding environment, and their memory could play a fundamental role. A motivating example is a person who enters a mall for the first time in their life. They do not know where exit locations could be found, but they do remember the route they traveled to reach a certain position. During the evacuation, also depending on their emotional state, they might decide to travel towards the entrance which led them into the building, possibly ignoring other factors such as the presence of more suitable exits, or information provided by security officers. Similarly, the lack of knowledge of the surrounding environment might make the decision-making process slower, or it could possibly exacerbate the “herd effect”, in which people simply follow other people during the escape. The memory probability P_m and the environmental knowledge probability P_kn determine how the agents will behave, once they become aware of the occurrence of a critical event.
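The age-dependent and probability-driven traits described in items b) to e) can be illustrated with a short sketch. It is only a sketch under stated assumptions: the constants `SPEED_SCALE`, `FALL_BASE` and `FALL_RATE`, the function and field names, and the example probability values are hypothetical and are not taken from the model. Only the functional forms follow the text: speed inversely proportional to age, fall probability exponential in age, Bernoulli draws for P_g, P_rg, P_m and P_kn, and a uniform choice between the two remote-grouping behaviors.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical constants, chosen only to make the sketch runnable.
SPEED_SCALE = 40.0   # speed = SPEED_SCALE / age (inverse proportionality)
FALL_BASE = 1e-4     # fall probability extrapolated to age 0
FALL_RATE = 0.05     # exponential growth rate of the fall probability per year


def speed_for_age(age: float) -> float:
    """Movement speed, inversely proportional to age."""
    return SPEED_SCALE / age


def fall_probability(age: float) -> float:
    """Fall probability, exponential in age and capped at 1."""
    return min(1.0, FALL_BASE * math.exp(FALL_RATE * age))


@dataclass
class Traits:
    age: float
    speed: float
    fall_prob: float
    grouped: bool              # drawn with probability P_g
    remote_grouping: str       # "none", "wait" or "regroup" (driven by P_rg)
    remembers_entrance: bool   # drawn with probability P_m
    knows_environment: bool    # drawn with probability P_kn


def sample_traits(rng, age, p_g, p_rg, p_m, p_kn) -> Traits:
    """Draw the behavioral traits of one agent from the given probabilities."""
    if rng.random() < p_rg:
        # The two remote-grouping behaviors are chosen uniformly at random.
        remote = rng.choice(["wait", "regroup"])
    else:
        remote = "none"
    return Traits(
        age=age,
        speed=speed_for_age(age),
        fall_prob=fall_probability(age),
        grouped=rng.random() < p_g,
        remote_grouping=remote,
        remembers_entrance=rng.random() < p_m,
        knows_environment=rng.random() < p_kn,
    )


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_traits(rng, age=35.0, p_g=0.3, p_rg=0.2, p_m=0.5, p_kn=0.5))
```

Passing an explicit `random.Random` instance keeps every draw tied to a seed, which makes it straightforward to repeat a configuration with the same random stream.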
f) Knowledge of Environmental Risks: This is another behavioral aspect which is fundamental, especially in cascading catastrophic events. A motivating example is an individual who, during a fire, moves in proximity of pillars or poles. These architectonic elements could be easily damaged by the fire, and their collapse might produce additional fatalities. In this sense, an individual who has a higher knowledge of these risks might leave the shortest path to a security exit, just to avoid several environmental risks. If the environment is extremely crowded, this behavior could create blockages or a slowing down, which are extremely important factors from an emergent behavior point of view. The probability P_er determines whether an agent is aware of environmental risks. In the positive case, the agent will try to avoid all the regions in the environment which it considers to be risky to traverse.

g) Trustfulness in Other People and Institutions: This behavioral aspect describes whether an individual will likely trust other people in the escape (therefore, “joining them” and possibly forming a group), or whether they will abide by the indications of security officials and/or rescuers. It is possible that this aptitude will negatively influence the choices taken, even when useful information to leave the risky environment is available. The probability P_t tells how probable it is that an agent will trust (and implement) the evacuation plans suggested by surrounding agents, or whether it will continue to evacuate according to its own strategy.

h) Social Networks: Social networks are increasingly important in daily life. Also in emergency situations, it is possible that people will spend some time seeking information; this is an aspect also related to the aforementioned grouping aspect. The timeliness and quality of the retrieved information might be questionable as well. In particular, inaccurate information might make people take wrong decisions, even in proximity of secure points such as emergency exits. At the same time, this phenomenon could generate additional delays in the evacuation of some people. The probability P_sn determines whether an agent, upon the occurrence of the critical event, will spend a random amount of time stopped, consulting social networks.

i) Lack of Understanding or Confusion: During an evacuation, individuals might not fully understand the information that is provided to them, also by rescuers, or they might enter a confusional state; this is also related to the aforementioned emotional aspect. These states might make the individuals forget or disregard important information related to the correct path to an exit, which they already acquired. Some people in a confusional state might take a wrong path in the evacuation, or they could simply stop, impeding the egress of other individuals. The confusion probability P_c determines whether an agent gets into the confused state. This condition can be checked multiple times during the simulation.

j) Chaos-generating Individuals: With specific respect to terrorist attacks, it is possible to suppose the presence of people who generate chaos on purpose. These people will explicitly act against the evacuation of crowds. This kind of byzantine behavior should be explicitly modeled in complex scenarios, especially because it could make an otherwise good plan completely ineffective.

From this classification of the aspects which we consider fundamental in ABMS for evacuation purposes, it is clear that some of them partially overlap, either in the cause and/or in the effects. This is exactly the reason why we advocate that a holistic approach towards ABMS in these scenarios should consider all of them at once. In particular, we consider that, for each individual, the behavioral description should be based on probability distributions, which feed different explanatory variables describing the agents. In this sense, an evacuation plan should be considered reliable if and only if it respects some Key Performance Indicator (KPI) levels under a high variability of the agents’ behavioral characterization. Of course, in the context of specific public events (e.g., concerts or religious services) some configurations might be excluded. As an example, the distribution of the age of individuals can be tailored to the kind of public event. Nevertheless, an assistive ABMS-based tool could greatly simplify the compilation of a security plan, if the model is able to account for all the aspects which we have discussed at once.

C. Behavioral Characteristics of Rescuers

With respect to rescuers, we consider several different aspects and entities. One general aspect is related to the timeliness of the intervention. In particular, we consider the possibility that actual rescuers or law enforcement agents require some time to start acting after the catastrophic event. Indeed, this modeling approach can account also for the fact that, before intervening, individuals in charge need to coordinate. The configuration of this general aspect works at the level of the single individual, because different people might also react according to a different timeliness. The timeliness of intervention is therefore a configuration parameter which is drawn from a Gaussian distribution. The mean value of this distribution can be specified at simulation configuration, accounting also for the aforementioned delay for coordination.

Another aspect is related to the possibility, the delay, and the period of repetition according to which all individuals in the environment are notified of the fact that an emergency is occurring. This aspect mimics the fact that, as we have already discussed, rescuers could inform all people of an accident by means of loudspeakers. Whether the crowd takes this information into account or not depends on the individuals’ state. The modeling of this aspect is similar in spirit to that of timed events which we have discussed before.

Rescuers can be classified into people and vehicles. In the modeling approach, vehicles are set towards a specific place in the environment and they move at a speed which is inversely proportional to the crowdedness of the region. Vehicles can be of any kind, as they could represent firefighters reaching the spot of a fire, or policemen trying to reach the area of a shooting. Their target position is updated during the simulation, every time that their target changes location (think, again, of shooters). We account for a delay in the notification of the change of their target, which represents coordination and/or communication time.
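The rescuer-related parameters of Section III-C can be illustrated as follows. This is a minimal sketch under stated assumptions, not the implementation used for the experiments: the function and class names, the truncation of negative delays at zero, and the `1 + crowdedness` denominator (which avoids a division by zero in an empty region) are hypothetical choices. The text only states that the timeliness of intervention is drawn from a Gaussian distribution, that vehicle speed is inversely proportional to crowdedness, and that changes of the target are notified with a delay.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[int, int]


def sample_intervention_delay(rng: random.Random, mean: float, std: float) -> float:
    """Time a rescuer waits before acting, drawn from a Gaussian distribution.

    Negative draws are truncated to zero (an assumption of this sketch).
    """
    return max(0.0, rng.gauss(mean, std))


def vehicle_speed(base_speed: float, crowdedness: float) -> float:
    """Speed that decreases as the region gets more crowded.

    The 1 + crowdedness denominator is an assumption; it keeps the speed
    finite when the region is empty.
    """
    return base_speed / (1.0 + crowdedness)


@dataclass
class Vehicle:
    target: Position
    notification_delay: float              # coordination/communication time
    pending_target: Optional[Position] = None
    pending_since: float = 0.0

    def report_target_moved(self, new_target: Position, now: float) -> None:
        """Record that the target moved; the vehicle is not yet aware of it."""
        self.pending_target = new_target
        self.pending_since = now

    def update(self, now: float) -> None:
        """Adopt the new target once the notification delay has elapsed."""
        if (self.pending_target is not None
                and now - self.pending_since >= self.notification_delay):
            self.target = self.pending_target
            self.pending_target = None
```

Keeping the pending target separate from the current one means the vehicle keeps driving towards the stale position until the delay expires, which is the effect the delay is meant to capture.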
to P_t = 0.1, the probability to rely on social networks to P_sn = 0, and the probability of confusion to P_c = 0.01. An individual is chaos-generating with probability P_ch = 0.01. Such an individual will randomly walk against the escaping crowd, while also actively trying to scare and confuse the other agents.

We modify this baseline configuration to study what happens to a subset of the KPIs which we have introduced when one single parameter is changed; for the sake of brevity we are not able to report results related to all KPIs in this paper. We have run complete simulation scenarios, meaning that the simulation is halted only when all agents have either evacuated the map or died. Each point in the plot is averaged over 5 different simulation runs, for a total of 55 runs for each plot. All different configurations of the models have been run with the same set of 5 random seeds for random-number generators, to allow for a more stable comparison. We present results associated with the variation of the percentage of fatalities over the total number of individuals in the simulations, the number of individuals evacuated per unit of time, and the time to evacuate all people. These results are studied when varying different parameters, namely age (Figure 2), the confusion probability (Figure 3), the grouping probability (Figure 4), the knowledge of the environment (Figure 5), trustfulness (Figure 6), and the presence of chaos-generating individuals (Figure 7).

Experimental data show that there is a positive correlation between the fatality rate and the individuals’ average age (Figure 2). This is expected, since a higher age reduces mobility. However, the mortality does not increase dramatically. The standard deviation of the age distribution has been set to 15 years in order to reduce the noise over multiple simulation runs. This means that rescuers and grouped agents do not often lose sight of each other due to vast differences in mobility. This effect is also witnessed in the egress per time unit and the time to evacuate. The former decreases as the average age increases, while the latter increases linearly.

Grouping (Figure 4) has a negligible effect on the number of fatalities in our simulation scenario. As explained earlier, age variability is limited, therefore grouped agents do not often lose sight of each other. Moreover, in our simplified model, an agent that loses sight of its group simply tries to escape alone. This does not directly impact survival chance. Confusion probability (Figure 3), on the other hand, is strongly correlated with mortality rate. It is noted that all conditions impacting agents' mobility have a huge influence on the outcome of this simulated scenario. This phenomenon is also reflected in the egress per time unit, which drops almost to zero for higher probabilities, and is similarly observed in the time to evacuate, which grows exponentially. This is also an expected result, because if a large fraction of the agents are confused, they start misbehaving, moving themselves farther from exits.

The effect of knowledge of the environment is interesting (Figure 5), because it illustrates a predominant behavioral effect. A zero-knowledge probability generates a significantly high number of fatalities, and drastically increases the time to evacuate. This is a result which is associated with the fact that the largest part of the agents has no idea about where to go to leave the disaster scenario, and starts wandering around. Rescuers, on the other hand, try to drive people towards the exits, but are anyhow subject to the same behavior as the other agents: they tend to move farther from the explosion points. In this sense, the zero-knowledge agents are continuously subject to random movement, until they reach an exit by chance. By looking at the adversarial map, this could require a significant amount of time, and given the subsequent explosions it can be fatal for a large number of agents.

The most interesting (and possibly unexpected) result is associated with trustfulness (Figure 6). A slight increase in trustfulness generates an increase in the number of fatalities and in the time to evacuate. This is an emergent behavior related to contrasting information. In this scenario, there is a subset of the agents who have a non-minimal knowledge of the environment, and are already heading towards a known exit on the map. If these agents are also associated with a high trustfulness, while heading towards the exit, they might change their plan and start following indications from the rescuers, which entails heading towards a different exit. Given the adversarial map, the time to evacuate increases, and this also creates conglomerates of agents who slow down the stampede of others. When the trustfulness is increased to a higher extent, the egress becomes more organized, and the agents can reach exits in a more ordered way. It is interesting to note that a trustfulness factor of 100% is required to obtain a 50% reduction in fatalities.

Chaos-generating individuals are a small minority of the agent population. Also, they do not directly restrict the movement of the population. In this extreme simulated scenario, the effect of chaos-generating individuals is therefore insignificant (almost 40% mortality rate with default settings). Nevertheless, a minimal increase in the time to evacuate can be observed.

To conclude the experimental assessment, we report that, on average, the simulation of a complete scenario requires 55 seconds of wall-clock time.

V. CONCLUSIONS AND FUTURE WORK

In this paper we have discussed a holistic approach with respect to ABMS for emergency management, with a special focus on crowds escaping environments in the case of cascading catastrophic events. We have shown experimentally what the effect of the different behavioral characteristics of individuals is on the overall results of the simulation. Our results confirm that it is important to consider multiple aspects at once, because the outcome of the simulations could lead to very different results.

In our future work we plan to perform an in-vitro reconstruction of real-world accidents from the past. This effort will allow us to determine whether and to what extent our modeling approach is able to recreate actual evacuations in real-world accident situations. Moreover, we plan to integrate our model into a framework which will make it possible to automate the exploration of different parameters given a configuration, so as to determine what could be the best-suited characteristics of a final evacuation plan.
[Figure: three panels plotted against Average Age (years), 30 to 70: (a) Fatalities, (b) Egress per time unit, (c) Time to evacuate (logical time)]
Fig. 2: Effects of Average Age
[Figure: three panels plotted against Probability, 0 to 0.9: (a) Fatalities (%), (b) Egress per time unit (saved per time unit), (c) Time to evacuate (logical time)]
Fig. 3: Effects of Confusion Probability
[Figure: three panels plotted against Probability, 0 to 1: (a) Fatalities (%), (b) Egress per time unit (saved per time unit), (c) Time to evacuate (logical time)]
Fig. 4: Effects of Grouping Probability
[Figure: three panels plotted against Probability, 0 to 1: (a) Fatalities (%), (b) Egress per time unit (saved per time unit), (c) Time to evacuate (logical time)]
Fig. 5: Effects of Knowledge of the Environment
[Figure: three panels plotted against Probability, 0 to 1: (a) Fatalities, (b) Egress per time unit, (c) Time to evacuate (logical time)]
Fig. 6: Effects of Trustfulness
[Figure: three panels plotted against Probability, 0 to 1: (a) Fatalities, (b) Egress per time unit, (c) Time to evacuate (logical time)]
Fig. 7: Effects of Chaos-Generating Individuals
The messages exchanged between vehicles and RSUs are processed and used to understand when emergency braking is taking place, and therefore to anticipate any collisions. In case of an emergency event, the MEC server generates an alert message and immediately sends it to the following vehicles within a certain range. When the alert message reaches the vehicles, the braking system is automatically activated to stop the vehicle safely and avoid a collision. A dynamic switching algorithm has also been implemented that allows the vehicle to instantly decide whether to send the message to the MEC or to the Cloud server, in order to avoid overloading the roadside devices when it is not necessary.

A. Docker and Kubernetes

The MEC server integrated in the RSUs is realized using a virtualization mechanism. A container is an isolated environment sharing the same kernel of the operating system. Docker is an open source project that automates the deployment of applications within software containers, providing additional abstraction thanks to virtualization at the operating system level of Linux [8], [22]. It uses the resource isolation features of the Linux kernel. Docker implements high-level APIs to manage containers that run processes in isolated environments. Since it uses Linux kernel functionality (mainly cgroups and namespaces), a Docker container, unlike a virtual machine, does not include a separate operating system. Instead, it uses kernel functionality and leverages resource isolation (CPU, memory, block I/O, network) and separate namespaces to isolate the application from the operating system. It uses the concept of an image, which includes the fundamentals of the operating system and is created by a Dockerfile script.

Kubernetes is an open source container orchestration and management system [23]. It is based on different components, distinguished into master and nodes. The master is the main element, and the other nodes refer to the master for coordination. A node, also called a worker, has the task of executing the workload following the operating modes defined by the master. A group of workers is called a cluster. The resource describing the elementary unit executable on a cluster node is called a Pod. Kubernetes guarantees reliability as it can automatically restart containers that fail during execution and terminate those that do not respond, always guaranteeing a certain number of containers in execution.

B. Vehicular traffic management with Sumo

Vehicular traffic is managed by the Sumo simulator, where a stretch of road of the SS107 Silana-Crotonese, in southern Italy, has been considered. Some XML files have been created containing all information regarding the road portion: the IDs of the two lanes of the roadway with their direction, the maximum speed in m/s, and so on; and the vehicles' travelling information: acceleration and deceleration values, ID, length, maximum speed in m/s, minimum gap from the preceding vehicle, and so on. Sumo allows importing the road network from OpenStreetMap, simplifying the process of developing the mobility model. This real-world road network can be easily downloaded, modified and imported into the simulator through the netconvert command. In this way, realistic vehicular traffic in the urban environment is generated. The Eclipse development environment is used to manage the simulator using code written in Java and the library called SumoTraciConnection [24]. In order to start the simulation, an XML configuration file with the .config extension has been created containing: the file related to the road route, the file related to vehicles and an additional file to display an alert on the map in case of a collision between two vehicles. Moreover, the configuration file contains a time section where the start and the end of the simulation, in steps, are specified. It is also possible to generate an output file with the simulation logs.

To carry out the communication between vehicles and the MEC server and between vehicles and the Cloud server, the CoAP protocol [25] was chosen, using the Californium library written in Java [26]. Each vehicle was created as a CoAP client and periodically makes requests to the MEC or Cloud server, which acts as the CoAP server.

C. Critical parameters design

Preliminary studies have been conducted to analyze the latencies and reaction times necessary to avoid collisions among vehicles in the considered scenario.

The vehicle that is ahead has ID = 0, while the vehicle that follows has ID = 1, as shown in Fig. 2. For simplicity, in the mathematical formulation, these IDs are written as superscripts. Two vehicles travelling a stretch of road at constant speed have the same deceleration capacity a and maintain a distance d from each other. At a given instant, the vehicle with ID = 0 brakes sharply. Let t_cr be the sum of the latency and the reaction time necessary for the vehicle with ID = 1 to start braking. The equation of motion with starting time t_0 = 0 for the vehicle with ID = 0 is as follows:

\[
x^0(t) =
\begin{cases}
x_0^0 - \frac{1}{2} a t^2 + v_0^0 t, & t < \frac{v}{a} \\
x_0^0 + \frac{1}{2} \frac{(v_0^0)^2}{a}, & t \ge \frac{v}{a}
\end{cases}
\tag{1}
\]

The equation of motion with starting time t_0 = 0 for the vehicle with ID = 1 is as follows:

\[
x^1(t) =
\begin{cases}
x_0^1 + v_0^1 t, & t < t_{cr} \\
x_0^1 + (v_0^1 + a t_{cr}) t - \frac{1}{2} a (t^2 + t_{cr}^2), & t \ge t_{cr}
\end{cases}
\tag{2}
\]

Suppose that, due to braking, the vehicle with ID = 0 has stopped. It is then possible to calculate the instant of collision between vehicle ID = 0 and the following vehicle ID = 1, under the assumption that the two paths traveled by the vehicles are equal:

\[
x_0^0 + \frac{1}{2} \frac{(v_0^0)^2}{a} = x_0^1 + (v_0^1 + a t_{cr}) t - \frac{1}{2} a (t^2 + t_{cr}^2)
\tag{3}
\]

By calculation, the following second-degree equation is obtained:
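\[
\frac{1}{2} a t^2 - (v_0^1 + a t_{cr}) t + \frac{1}{2} a t_{cr}^2 + d + \frac{1}{2} \frac{(v_0^0)^2}{a} = 0
\tag{4}
\]

that has real solutions for Δ > 0:

\[
\Delta = (v_0^1 + a t_{cr})^2 - 4 \cdot \frac{1}{2} a \left( \frac{1}{2} a t_{cr}^2 + d + \frac{1}{2} \frac{(v_0^0)^2}{a} \right) > 0
\implies t_{cr} > \frac{d}{v}
\tag{5}
\]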
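As a rough numerical check of condition (5), the sketch below evaluates the discriminant of (4) and the resulting threshold t_cr > d/v. It assumes that both vehicles start at the same speed v (that is, v_0^0 = v_0^1 = v), in which case Δ simplifies to 2a(v t_cr − d). The function names and the deceleration of 6 m/s² are illustrative assumptions, not values taken from the paper; the speed of 50 km/h and the gaps of 6 m and 7 m belong to the simulated scenario.

```python
import math


def discriminant(v_lead, v_follow, a, d, t_cr):
    """Discriminant of the quadratic (4) in the collision time t.

    v_lead, v_follow: initial speeds of vehicles 0 and 1 (m/s)
    a: common deceleration (m/s^2), d: initial gap (m)
    t_cr: latency plus reaction time of the following vehicle (s)
    """
    b = v_follow + a * t_cr
    c = 0.5 * a * t_cr ** 2 + d + 0.5 * v_lead ** 2 / a
    return b ** 2 - 4.0 * (0.5 * a) * c


def collision_predicted(v, d, t_cr):
    """Condition (5) for equal initial speeds: real roots iff t_cr > d / v."""
    return t_cr > d / v


V = 50 / 3.6  # 50 km/h expressed in m/s
A = 6.0       # assumed deceleration in m/s^2 (hypothetical value)

if __name__ == "__main__":
    for d in (6.0, 7.0):
        for t_cr in (0.45, 0.50):
            delta = discriminant(V, V, A, d, t_cr)
            print(d, t_cr, round(delta, 3), collision_predicted(V, d, t_cr))
```

With these numbers, d/v is about 0.43 s for d = 6 m and about 0.50 s for d = 7 m, so any t_cr of 0.45 s or more satisfies (5) at 6 m, while at 7 m the threshold lies just above 0.50 s.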
Fig. 6: Flow chart relating to dynamic switching

[...] happened, and then, it immediately notifies following vehicles to avoid potential collisions.

Everything is summarized in the handlePOST method of the MEC server, which manages every single communication with the client. The server stores and decodes the vehicles' message information in order to evaluate whether a vehicle is braking abruptly. In particular, a dangerous event is identified if the difference between two consecutive vehicle speeds is greater than a fixed threshold. In this case, the server identifies all the vehicles arriving within a certain range and alerts them in time to avoid collisions. The operating criteria are described briefly in Fig. 7.

Fig. 7: Flow chart relating to the behavior of the MEC server

Once the Java server code was realized, a Docker container could be created. Then, this container was published on a private repository created in the Docker Hub registry so that it can be managed with Kubernetes. In Kubernetes, a deployment [...]

V. PERFORMANCE ANALYSIS

In this section, simulation results are given in order to show the performance of the system. First of all, a description of the considered simulation scenario is presented. It is composed of the following elements: four vehicles at a distance d chosen among the values [6; 6.5; 7; 7.5; 8; 8.5; 9; 9.5] and a constant speed v of 50 km/h, with the same deceleration and a reaction time t random in the interval [450, 500] ms [27]. Twenty runs have been carried out for each configuration: MEC and Cloud server communication in 4G and 5G technologies. In particular, the specific latency times for each mobile technology have been used. The simulation parameters are summarized in Table I.

Firstly, the collision percentage has been assessed while varying the distance d between vehicles. Communication with the MEC and Cloud servers, considering 4G and 5G technologies, has been considered.

TABLE I: Simulation Parameters

Parameter | Value
vehicles number | 4
vehicle distance d | 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5
vehicle speed v | 50 km/h
vehicle deceleration a | same for each vehicle
reaction time t | [450,500] ms
server range action r | 15-50 m

The results obtained are shown in Fig. 8.

As can be observed, for the same distance between vehicles, the collision percentage using communication with the Edge is lower than with the Cloud, for both mobile technologies, 4G and 5G. It is possible to note that, in some cases, for the same distance between vehicles, the number of collisions decreases. For example, at a distance of 6.5 meters, using communication with the MEC server first in 4G and then in 5G, the percentage of collisions decreases from 33.3% to 18.3%. Next, the collision percentage was evaluated no longer by changing the distance between vehicles, but by varying the action range of the server from the point of the hazard event. The considered scenario is the same as in the previous experiments, with an action range of the server equal to r meters from the point of the hazard event. The first vehicle at a certain moment is forced to brake suddenly. The considered ranges are 15, 20 and 26 meters. These values have been chosen to ensure that the server warning is sent, respectively, to the one following vehicle, to the two following vehicles and finally to all three vehicles following the one braking suddenly. The results obtained are shown in Fig. 9. As can be seen, using both 4G and 5G technology, and considering communication with the Cloud and with the MEC server, as the action range of the server increases, the percentage of collisions decreases. This is because a vehicle warned in time of the danger will start braking earlier, avoiding a collision with the preceding vehicle. Therefore, the more vehicles are warned by the server, the lower the collision rate.

Fig. 8: Percentage of collisions with 4G and 5G networks, varying the distance between vehicles (panels (a) and (b)).

Fig. 9: Percentage of collisions with 4G and 5G networks, varying the server action range (panels (a) and (b)).

Another parameter that has been evaluated is the MEC and Cloud server utilization percentage under dynamic switching. This algorithm makes it possible, especially in certain vehicular traffic conditions, not to overload the Edge by communicating with the Cloud server. Three scenarios are considered in the evaluation, ranging from a non-congested one up to a congested one, as shown in Figs. 10a, 10b and 10c.

The results are represented in the graph of Fig. 11.

When the traffic is congested, vehicles communicate more with the MEC server than with the Cloud server, since low latency times are needed to avoid collisions. So, in this case, heavy usage of the Edge device is registered. If the traffic is lightly congested, the utilization percentage of the two servers is almost the same. With non-congested traffic, instead, each vehicle communicates only with the Cloud server, since latency constraints are not stringent.

VI. CONCLUSIONS

In this work, a collision avoidance mechanism for the automotive environment has been proposed, based on the Edge Computing paradigm. In particular, an assisted guidance system for collision prediction based on MEC technology has been developed. The use of the Edge is essential when latency constraints are required. With Edge technology the system functionalities are moved closer to end users, and thus to the edge of the network. The MEC and Cloud servers run in a Docker container managed by the Kubernetes orchestrator. In order to show the improvements of Edge Computing over the Cloud, a comparison evaluating the percentage of vehicle collisions has been conducted, showing a significant reduction in the number of collisions through the use of the MEC paradigm, due to lower latency values. It has also been demonstrated by experiments how in some cases the MEC-5G combination has the best performance in the considered system.

REFERENCES

[1] F. Giust, V. Sciancalepore, D. Sabella, M. C. Filippou, S. Mangiante, W. Featherstone, and D. Munaretto, “Multi-access edge computing: The driver behind the wheel of 5G-connected cars,” IEEE Communications Standards Magazine, vol. 2, no. 3, pp. 66–73, 2018.
Abstract: Cloud Computing has emerged as a foundation of smart environments by encapsulating and virtualizing the underlying design and implementation details. To address the inherent latency and deployment issues, Mobile Edge Computing seeks to migrate services to the vicinity of mobile users. However, current migration-based studies lack system-level consideration of migration cost, transaction cost, and energy consumption, together with a discussion of the impact of personalized user mobility. In this paper, we implement an enhanced service migration model to address user proximity issues. We formalize the migration cost, transaction cost, and energy consumption related to the migration process. We model the service migration issue as a complex optimization problem and adapt Deep Reinforcement Learning to approximate the optimal policy. We compare the performance of the proposed model with the recent Q-learning method and other baselines. The results demonstrate that the proposed model can estimate the optimal policy under complicated computation requirements.

Keywords: Mobile edge computing, service migration, deep reinforcement learning, energy consumption, migration cost

I. INTRODUCTION

Advanced mobile devices, considered one of the most innovative technologies, have brought significant convenience to everyday human life. Such smart devices tremendously boost the development of many high-tech concepts, such as Smart City, Autonomous Vehicle, and Internet of Things (IoT), which are constantly being introduced. Cloud Computing, as a fundamental infrastructure for implementing the aforementioned technologies, draws increasing attention from academia and industry due to its elastic provisioning of computational and network capabilities [1]. One step further, Mobile Edge Computing (MEC) has been proposed to provide cloud services in proximity to mobile user equipment (UE), tackling the potential overhead caused by the physical distance between UE and cloud instances. Academia and industry have presented several conceptual models and proposals in connection with MEC to resolve related issues. Numerous offloading and resource management models have been introduced [2,3], and various platforms for distributed simulation in a distributed environment have even been designed and implemented [4,5]. Despite the continuous advancements in MEC environments, numerous challenges still need to be resolved. The constraints of both the limited computation capability and the unpredictable user mobility can lead to the deterioration of Quality of Service (QoS). The service migration issue, which involves coordination among geographically distributed MEC servers, is one of the core issues of concern with regard to the quantity and mobility patterns of the users and the heterogeneity of edge servers [6].

Service migration in MEC is a sophisticated optimization problem, since the decision on whether, when, and where to migrate relies on many dynamic environmental variables, involving user mobility, communication channel characteristics, and resource availability [7]. In addition to the complexity of the input parameters, service migration in MEC inevitably introduces service delays: Transmission Delay, Processing Delay, and Backhaul Delay [8]. Several studies have been conducted to solve the migration problem between MEC servers, focusing on costs related to migrations. Although they yielded notable results, these solutions have not fully considered the impact of personalized user mobility, and may not function properly in such complex scenarios. For instance, if the user equipment follows some mobility pattern and moves along the boundary of two adjacent edge servers, the policy of some existing distance-based methods (always find the nearest server) can cause repeated unnecessary migrations, which degrade energy efficiency and QoS due to downtime or communication channel management overhead incurred by the migrations [9].

In this paper, to cope with the complex service migration environment in MEC, we propose an extensive service migration model based on Deep Reinforcement Learning (DRL). Compared to previous models, the extensive service migration model enables the controller to manage the migration process from a comprehensive perspective, which considers more environmental factors, including migration and transaction cost, and energy consumption. By using more variables and an increased number of MEC servers, we can implement a realistic simulation, although this brings computational complexity. Therefore, we apply DRL to enhance computational power as well as to increase the likelihood of adopting more determinants. Based on the model, we also formulate a novel optimization problem to find an optimal policy of task migrations between MEC servers, such that balanced service migration decisions are accomplished.
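The repeated-migration effect described above for distance-based policies can be illustrated with a small sketch. It assumes a hypothetical one-dimensional layout with two MEC servers and a user oscillating around their midpoint; the names `simulate`, `always_nearest` and `never_migrate`, and all positions, are illustrative and do not come from the paper.

```python
def always_nearest(pos, servers, current):
    """Distance-based policy: always pick the server closest to the user."""
    return min(range(len(servers)), key=lambda i: abs(pos - servers[i]))


def never_migrate(pos, servers, current):
    """Baseline policy: keep the service where it currently is."""
    return current


def simulate(path, servers, policy, start=0):
    """Return (number of migrations, summed user-to-server distance)."""
    current = start
    migrations = 0
    total_distance = 0.0
    for pos in path:
        target = policy(pos, servers, current)
        if target != current:
            migrations += 1
            current = target
        total_distance += abs(pos - servers[current])
    return migrations, total_distance


if __name__ == "__main__":
    servers = [0.0, 10.0]          # hypothetical server positions
    path = [4, 6, 4, 6, 4, 6]      # user oscillating around the boundary at 5
    print(simulate(path, servers, always_nearest))
    print(simulate(path, servers, never_migrate))
```

In this toy setting neither extreme is satisfactory: the nearest-server policy migrates at almost every step, while never migrating accumulates a larger total distance, which is the kind of trade-off a learned policy is meant to balance.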
The main contributions of this paper are as follows:

▪ We design an extensive service migration model that considers the migration and transaction cost, QoS and the energy efficiency of the mobile devices and the migration loads on the servers. The system model performs the service migration according to the optimization policy related to the migration and transition costs, and the energy consumption.
▪ We employ Deep Reinforcement Learning to approximate the complicated computation problem.
▪ We implement meaningful simulations to demonstrate that the proposed method surpasses the other models related to service migration.

The remainder of this paper is organized as follows: Section II presents the related research works in the MEC area, especially on service migration. In Section III, we describe our enhanced service model for the optimization problem of service migration. Section IV formulates the algorithm based on Deep Q-Networks. In Section V, we demonstrate the efficiency of the model by analyzing the simulation results. Lastly, Section VI concludes the research.

II. RELATED WORK

Distributing resources optimally over selected edges is the crucial task for resource management, which involves optimization problems such as minimizing processing latency, load balancing, and maximizing server computing efficiency. Besides resource placement issues, attention to preserving energy is increasing, in order to observe and appropriately control the burden of energy consumption [9]. Several research efforts have addressed the Mobile Edge Computing (MEC) area. As a result, diverse MEC concepts, such as Small Cell Cloud (SCC), Mobile Micro Cloud (MMC), Fast Moving Personal Cloud, Follow Me Cloud (FMC), and CONCERT, have been proposed [10]. There are some other studies related to the service migration model based on the Markov Decision Process (MDP). The authors in [11] not only demonstrated the existence of an optimal threshold policy for finding the optimal action of the MDP, but also devised a polynomial time-complexity algorithm to determine the optimal thresholds. Unlike previous works, which presented one-dimensional models, S. Wang et al. in [12,13] provided a way to use the distance-based MDP to approximate the solution for 2-D mobility models in order to efficiently compute a service migration policy. Besides, they also showed how to apply the algorithms to the real world by applying them to real mobility traces of taxis in San Francisco.

X. Sun and N. Ansari proposed a PRofIt Maximization Avatar pLacement (PRIMAL) strategy in order to optimize the trade-off between the migration gain and cost [14]. They used a heuristic algorithm based on the Mixed-Integer Quadratic Programming (MIQP) tool to solve the PRIMAL problem. A. Nadembega et al. in [15] proposed a mobility-based services migration prediction (MSMP) model to select an optimal micro data center (MDC). The model estimates the throughput of the user and the time for MDC service area hand-offs. They demonstrated that the proposed model surpasses other recent approaches in terms of data latency. The paper [7] presented a layered migration framework for active service applications using incremental file synchronization. The model is composed of three layers, namely the base layer, the application layer, and the instance layer, and the three-layered setup enables the model to reduce the service downtime. Ouyang et al. [16] devised an online service placement framework under a long-term time-average budget. The authors applied Lyapunov optimization to decompose the long-term budget into a real-time problem. They also utilized both Markov approximation and Best Response Update methods to achieve the best optimization for the problem.

Recently, some researchers have applied Artificial Intelligence techniques to enhance the capacity of their optimization models. T. G. Rodrigues et al. [8] provided an analytical model of service delay in MEC, utilizing a configuration phase to control the service delay elements. They analyzed the behavior of the fitness function using the Particle Swarm Optimization (PSO) algorithm. As a result, the model can attain the outcome in fewer iterations and with fewer particles. On the other hand, Reinforcement Learning is also used to find solutions for service migration issues. Z. Gao et al. in [17] not only designed a Q-learning based model to handle the complex environmental factors, but also utilized Deep Reinforcement Learning (DRL) to cope with the complicated computation of the action value Q. Similarly, responding to User Equipment with high mobility and changing mobility patterns, C. Zhang and Z. Zheng in [18] devised a method based on deep reinforcement learning. They made use of a Deep Q Network (DQN) to help the FMC controller generalize past experiences. The authors also defined the reward as a trade-off between QoS and the migration cost. Unlike MDP, the proposed DQN based task migration algorithm is conducted without transition probability and reward function.

Apart from research related to service migration costs, several studies address the energy efficiency problem in MEC. J. Hu et al. [19] introduced a dynamic service migration model using optimal stopping theory, which is an effective tool to solve optimization problems. They considered the energy consumption factor associated with the migration distance to solve the migration path selection problem. In the paper [20], the authors formulated a Green-Oriented Problem (GOP), an energy minimization problem, and attempted to solve it by implementing a Mixed Integer Linear Program (MINLP) with Q-learning-based Reinforcement Learning. The article [21] introduced the fine-grained migration based on generation algorithm (FGMBGA), a genetic based algorithm, to reduce the energy consumption of terminal mobile devices as well as to ensure the smooth execution of tasks during the task migration process. Although they devised an optimal migration strategy, they only consider migrations between the smart mobile terminal and the MEC, instead of those between MECs. In [22], a task-centric offloading model is devised to address the significant communication overhead and the offloading energy consumption, both incurred by a large number of offloading requests. The task offloading can be organized according to the priority of local task interest, measured by constantly tracing the task execution.

The purpose of our proposal is to present the optimal service migration model on MEC environment, which is considering [...]
Model | Technique | Target problem | Metrics
Three Layer Model [7] | Layering | Active Service Migration | Migration time, Down time
Reconfiguring cloudlets [8] | PSO | Cloudlet Activation | Processing Delay, Transmission Delay, Backhaul Delay
Threshold Policy-Based Mechanism [11] | MDP | Service Migration | Running Time, Discounted Sum Cost
Distance Based MDP [12,13] | MDP | Service Migration | Computation time, Discounted Sum Cost
PRIMAL [14] | MIQP | Avatar Migration | E2e Delay, Migration Overhead
MSMP [15] | DAMP | Service Migration | Data Latency
Mobility-Aware Online Framework [16] | Lyapunov Optimization, Markov Approximation, Best Response Update | Service Migration | User-perceived Latency, Migration Cost
DRL Based Model [17] | DQN | Service Migration | Migration Cost, Communication Cost
DQN Based Model [18] | DQN | Service Migration | QoS, Migration Cost
DSM [19] | Optimal Stopping Theory | Migration Path Selection | Energy Consumption, Migration Distance
MINLP [20] | Q-learning | Energy Minimization | Energy Consumption
FGMBGA [21] | Genetic Algorithm | Energy Minimization | Energy Consumption
Proposed Model | DQN | Service Migration | Migration Cost, Transaction Cost, Energy Consumption
The parameters and set the weight of each distance to the ▪ Reward: The agent obtains the reward according to the
costs, and their values can be influenced with the network action and the state. In this paper, we organize the
topology and routing mechanism of the network. Besides, the object function as the cost and energy consumption.
parameters and control the costs proportionally. Aside from this, we define the reward function as the
differentiation between an adequate large QoS value as
The expression of the one-timeslot cost is given by (3). I and the object function. With the equations (3) and
(4), it is designed as follows:
= | − | | − ′ | ′ 3 = − ∙ 5
A. Q-value function
⁄
∙ 2 −1 ∙ | − |
= 4 The Q-value function represents the maximum value of the
where the parameters denote, respectively, the service content size, the noise power, the transmission rate, the bandwidth, and the channel attenuation coefficient. We further employ the degree of service unexecuted, which takes values in [0%, 100%] and is presented in [24], to approximate the running state of the service to be migrated.

IV. PROBLEM SOLUTION
In this section, based on the aforementioned service migration model, we formulate the solution, which comprises both the Q-learning function and the DQN algorithm. The RL method allows the agent to deliver better estimations for the Q-value. The target problem of Q-learning is expressed as an MDP. In the RL algorithm, the agent takes a certain action when it encounters a specific situation in the environment. The action results in a reward and a new state, in which the agent can perform another action. Through this recursive process, the agent obtains the optimal policy by determining the action that maximizes the cumulative reward (shown in Fig. 2) [26].
We define the necessary factors (state, action, and reward) as follows:
▪ State: We define the state as the distance between the UE and the MEC server in which the service is located. The state at timeslot t is denoted as the norm of the difference between the position of the UE and the position of that MEC server.
▪ Action: The agent can move from the state s to a possible next state s′ by taking the action a. The action is drawn from the action set {0, 1}, which indicates whether to migrate or not.

The Q-value represents the discounted total future reward obtained when action a is executed in state s. According to the Bellman equation, the Q function is the sum of the optimal reward at the current state and the maximum future reward at the next state, which is given by (6).

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a') \tag{6}$$

With the Q-value function, the agent selects an optimal action for a certain state according to the policy, which is represented as follows:

$$\pi(s) = \arg\max_{a} Q(s, a) \tag{7}$$

Q-learning estimates the optimal policy by computing the Q-value function repeatedly. The Q-learning process is denoted as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{8}$$

where $\gamma$ is the discount factor and $\alpha$ is the learning rate.

B. Deep Q-network
In practice, it is unrealistic to apply the Q-value function directly because of the size of the input, i.e., the number of states. In other words, as the number of targeted states increases, more computation time and power are required to find the optimal value by tracking all the situations caused by possible actions in a specific state, and the search may even be impossible to complete. Hence, it is necessary to approximate the estimation of the Q-value [27]. We utilize the DQN, which includes a Deep Neural Network (DNN) stage, in this case a Convolutional Neural Network (CNN), as the approximator, and implement it with the experience replay method. We take the states as input parameters and conduct the feed-forward and the
back-propagation processes to obtain corresponding output Q-values (shown in Fig. 3).

Always Migration, and Q-learning algorithm (as discussed later in Performance of Model analysis). We conduct the simulation with Python version 3.6 and the TensorFlow API, and implement the extensive service migration model and the environment model.
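As a rough illustration of the Q-learning baseline, the update in (8) and the greedy policy in (7) can be sketched in tabular form. This is only a sketch under stated assumptions: the state is taken to be a discretized distance, the mapping of action 0 to "stay" and action 1 to "migrate" is assumed, and the values of `alpha` and `gamma` are illustrative rather than taken from the paper.

```python
from collections import defaultdict

N_ACTIONS = 2  # action set {0, 1}; assumed here: 0 = stay, 1 = migrate


def make_q_table():
    """Q-table mapping a (discretized) state to one value per action."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def greedy_action(q, state):
    """Greedy policy: pick the action with the largest Q-value, as in (7)."""
    values = q[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning step, as in (8); returns the temporal-difference error."""
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical transition: at distance bin 3, migrating yields reward 1.0
# and leads to distance bin 0.
q = make_q_table()
td = q_update(q, state=3, action=1, reward=1.0, next_state=0)
best = greedy_action(q, 3)
```

A DQN replaces the table `q` with a neural network that maps the state to one Q-value per action, which is what makes large state spaces tractable.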
B. Performance of Model analysis
In the experiments, we train the extensive service migration model (ESM) based on DQN with random user movement patterns to minimize the total sum of the cost and the energy consumption related to the migration process. The trained model can select the appropriate action for the environment state according to the optimal policy. Under the no [...] to the Always migration algorithm, the agent allows the service to move to the closest MEC server as the user moves around the area. As a result, it can reduce the transaction cost, but increases [...] related to distances while increasing the frequency of service migration. On the other hand, reducing the number of servers increases the distance between servers, but reduces the number of migration targets, which diminishes the migration cost and energy consumption. Because these tendencies offset each other, the reward sum values do not show any trend with respect to the number of servers. The results indicate that the policy generated by the ESM model can appropriately choose optimal actions in response to the states that occur.

C. Evaluation for Impact of variables
In the second experiment, we consider the variation of the total reward sum when the migration cost parameter changes from 0 to 2, with the number of MEC servers fixed at 30 and the content size fixed at 20. The migration cost function follows an exponential curve, since it has parameters that are constrained to sum to 1. Therefore, as the migration cost parameter increases from 0 to 2, the results gradually rise and eventually converge to specific values.
As shown in Fig. 5, the results from ESM, Q-learning, and Always Migration, which involve service migration, present a slight decrease as the parameter value increases, while No Migration displays no fluctuation because the parameter has no effect when no migration occurs. Although they also show only trivial shifts in the result according to the parameter value, ESM and Q-learning obtain large gaps over the other two algorithms in the reward sum value.

Fig. 5. Total reward sum regarding the migration cost parameter when the number of MEC servers is 30 and the content size is 20 (x-axis: migration cost parameter; series: ESM, Q-learning, Always Migration, No Migration).

A similar result is observed in the third experiment set, which examines the influence of the content size on the reward values (shown in Fig. 6). With a low content size, ESM, Q-learning, and Always Migration can achieve better rewards regarding migration energy consumption. In particular, Always Migration seems to be more affected by this parameter. Compared with its result in the second experiment, the algorithm has overall improved outcomes. Apart from that, ESM achieves the best reward sum regardless of the content size. This indicates that the model is properly trained to obtain the optimal policy under variations of the environmental factors.

D. Evaluation for Combination of DQN variables
In the last experiment, we verify the impact of the DQN composition by observing the results from networks that have different hidden layers or policy update methods. We consider three different hidden layer compositions, i.e., 1 hidden layer, 2 hidden layers, and 3 hidden layers. As shown in Fig. 7, both the total reward value and the loss value follow the same trend, even though some values show slight deviations. Therefore, the number of hidden layers has no observable impact in our experiments.
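The policy update methods compared in this subsection differ in how the bootstrap target is formed. The following minimal sketch contrasts the two, assuming that the on-policy control is SARSA-like (the paper does not name the on-policy algorithm); the function names and the value of `gamma` are illustrative.

```python
def off_policy_target(reward, next_q_values, gamma=0.9):
    """Q-learning style target: bootstrap from the best next action."""
    return reward + gamma * max(next_q_values)


def on_policy_target(reward, next_q_values, next_action, gamma=0.9):
    """SARSA style target (assumed): bootstrap from the action actually taken."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical Q-values of the next state for the actions {0, 1}.
next_q = [2.0, 5.0]
off = off_policy_target(1.0, next_q)
on = on_policy_target(1.0, next_q, next_action=0)
```

When the behaviour policy explores (here, by taking action 0 in the next state), the on-policy target follows the exploratory choice, while the off-policy target always uses the greedy value.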
Fig. 6. Total reward sum regarding the service content size for a fixed number of MEC servers and a fixed migration cost parameter (x-axis: service content size, 2 to 20; series: ESM, Q-learning, Always Migration, No Migration).

In terms of the policy update methods, we compare the off-policy method, which the proposed model uses, with the on-policy method as a control group. Fig. 8 describes the results of both the total reward value and the loss value for the two methods. There is a notable difference in loss between the two methods, while no particular difference is observed in the total reward values. The off-policy method stably minimizes the loss value because it divides the policy into the target and the environment.

Fig. 7. Results regarding hidden layers (1, 2, and 3 hidden layers) under fixed environment parameters: total reward sum per epoch and loss value per epoch.

Fig. 8. Results regarding policy update methods (off-policy and on-policy) under fixed environment parameters: total reward sum per epoch and loss value per epoch.

VI. CONCLUSION
In this paper, we proposed an extensive service migration model based on DRL to handle the service migration problem. In the proposed model, DQN is utilized to solve the inevitable computation problem that arises when designing a service migration model that considers more of the factors occurring in a real environment. By demonstrating that the high-complexity computations can be performed using DQN as an approximator, we suggest that a realistic service migration scenario can be realized. In other words, we could derive the optimal policy under various determinants. In addition, we formulated the reward function to balance the trade-off between service migration and energy consumption. We are convinced that the policy derived by this formulation can be applied in practice.

As a next step, we plan to design an agent to control MEC servers connected with multiple users by adopting
a sophisticated reward function. Furthermore, we will exploit various DQN algorithms for the proposed model to enhance the training step by comparing their effectiveness.

REFERENCES
[1] E. Ahmed and M. H. Rehmani, “Mobile Edge Computing: Opportunities, solutions, and challenges,” Future Generation Computer Systems, 2017.
[2] S. Guan and A. Boukerche, “Design and Implementation of Offloading and Resource Management Techniques in a Mobile Cloud Environment,” in Proceedings of the 17th ACM International Symposium on Mobility Management and Wireless Access (MobiWac '19), pp. 97–102, 2019.
[3] S. Guan and A. Boukerche, “A MEC-based Distributed Offloading Model for Ubiquitous and Time-constraint Offloading,” 2019 IEEE/ACM 23rd International Symposium on Distributed Simulation and Real Time Applications (DS-RT), pp. 1-8, 2019.
[4] S. Guan, R. E. De Grande, and A. Boukerche, “A Multi-Layered Scheme for Distributed Simulations on the Cloud Environment,” IEEE Transactions on Cloud Computing, vol. 7, no. 1, pp. 5-18, 2019.
[5] S. Guan, R. E. De Grande, and A. Boukerche, “An HLA-Based Cloud Simulator for Mobile Cloud Environments,” 2016 IEEE/ACM 20th International Symposium on Distributed Simulation and Real Time Applications (DS-RT), pp. 128-135, 2016.
[6] S. Wang, J. Xu, N. Zhang, and Y. Liu, “A Survey on Service Migration in Mobile Edge Computing,” IEEE Access, 2018.
[7] A. Machen, S. Wang, K. K. Leung, B. J. Ko, and T. Salonidis, “Live Service Migration in Mobile Edge Clouds,” IEEE Wireless Communications, 2018.
[8] T. G. Rodrigues, K. Suto, H. Nishiyama, N. Kato, and K. Temma, “Cloudlets Activation Scheme for Scalable Mobile Edge Computing with Transmission Power Control and Virtual Machine Migration,” IEEE Transactions on Computers, 2018.
[9] A. Boukerche, S. Guan, and R. E. De Grande, “Sustainable offloading in mobile cloud computing: Algorithmic design and implementation,” ACM Computing Surveys (CSUR), vol. 52, no. 1, p. 11, 2019.
[10] P. Mach and Z. Becvar, “Mobile Edge Computing: A Survey on Architecture and Computation Offloading,” IEEE Communications Surveys and Tutorials, 2017.
[11] S. Wang, R. Urgaonkar, T. He, M. Zafer, K. Chan, and K. K. Leung, “Mobility-induced service migration in mobile micro-clouds,” in Proceedings - IEEE Military Communications Conference MILCOM, 2014.
[12] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung, “Dynamic service migration in mobile edge-clouds,” in Proceedings of 14th IFIP Networking Conference, 2015.
[13] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung, “Supplementary Materials for Dynamic Service Migration in Mobile Edge-Clouds”.
[14] X. Sun and N. Ansari, “PRIMAL: PRofIt Maximization Avatar pLacement for mobile edge computing,” in 2016 IEEE International Conference on Communications, ICC 2016, 2016.
[15] A. Nadembega, A. S. Hafid, and R. Brisebois, “Mobility prediction model-based service migration procedure for follow me cloud to support QoS and QoE,” in 2016 IEEE International Conference on Communications, ICC 2016, 2016.
[16] T. Ouyang, Z. Zhou, and X. Chen, “Follow me at the edge: Mobility-aware dynamic service placement for mobile edge computing,” IEEE J. Sel. Areas Commun., vol. 36, no. 10, pp. 2333-2345, Oct. 2018.
[17] Z. Gao, Q. Jiao, K. Xiao, Q. Wang, Z. Mo, and Y. Yang, “Deep reinforcement learning based service migration strategy for edge computing,” in Proceedings - 13th IEEE International Conference on Service-Oriented System Engineering (SOSE), 2019.
[18] C. Zhang and Z. Zheng, “Task migration for mobile edge computing using deep reinforcement learning,” Future Generation Computer Systems, 2019.
[19] J. Hu, G. Wang, X. Xu, and Y. Lu, “Study on Dynamic Service Migration Strategy with Energy Optimization in Mobile Edge Computing,” Mobile Information Systems, vol. 2019, p. 5794870, 2019.
[20] Y. Yang, X. Chen, Y. Chen, and Z. Li, “Green-oriented offloading and resource allocation by reinforcement learning in MEC,” in Proceedings - 2019 IEEE International Conference on Smart Internet of Things, 2019.
[21] Y. Wang, H. Zhu, X. Hei, Y. Kong, W. Ji, and L. Zhu, “An energy saving based on task migration for mobile edge computing,” EURASIP Journal on Wireless Communications and Networking, 2019.
[22] A. Boukerche, S. Guan, and R. E. De Grande, “A Task-Centric Mobile Cloud-Based System to Enable Energy-Aware Efficient Offloading,” IEEE Transactions on Sustainable Computing, vol. 3, no. 4, pp. 248-261, Oct.-Dec. 2018.
[23] T. Taleb and A. Ksentini, “An analytical model for follow me cloud,” in GLOBECOM - IEEE Global Telecommunications Conference, 2013.
[24] A. Machen, S. Wang, K. K. Leung, B. J. Ko, and T. Salonidis, “Poster: Migrating running applications across mobile edge clouds,” in Proceedings of the Annual International Conference on Mobile Computing and Networking, MOBICOM, 2016.
[25] N. Xia, M. Tang, J. Jiang, D. Li, and H. Qian, “Energy Efficient Data Transmission Mechanism in Wireless Sensor Networks,” International Symposium on Computer Science and Computational Technology, 2008.
[26] R. S. Sutton and A. G. Barto, “Reinforcement Learning: an Introduction,” Cambridge, Mass: MIT Press, 2018.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, 2015.
Abstract—The continuous spread of the Internet-of-Things across application domains, aided by the continuous growth in the number of Internet-connected devices and systems, has both increased the complexity of these systems and exposed a lack of human resources with the expertise to design, develop, and maintain them. Recent works try to mitigate these issues by creating solutions that abstract the complexity of the systems, such as visual programming languages. Node-RED, one of the most common solutions for the visual development of IoT systems, still has several limitations, such as the lack of observability and inadequate debugging mechanisms. In this work, we address some of these limitations by enhancing Node-RED with new features that improve the user's system development, debugging, and understanding tasks. We proceed to empirically evaluate the impact of these enhancements, concluding that, overall, such enhancements reduce the development time and the number of failed attempts to deploy the system.
Index Terms—Internet-of-Things, Node-RED, Software Engineering, Monitoring, Debugging

I. INTRODUCTION
Internet-of-Things (IoT) systems permeate our daily lives by making everyday objects available everywhere and anytime. This pervasiveness of smart objects creates a foundation for a more interactive environment between things and humans, with the potential (and promise) of improving the quality of life. IoT's unique characteristics (communication, identification, and interactivity) are what make these systems so useful in applications such as home automation, transportation, manufacturing, healthcare, farming, and retail [1], [2]. The growing number of IoT systems and their increasing complexity (which can be observed in aspects such as the multitude and continuously growing diversity of communication protocols, architectures, and development solutions), together with their pervasiveness across application domains, have led to several shortcomings, including the lack of human resources with the technical knowledge needed to develop IoT systems [3], [4].
In an attempt to tackle both the growing complexity of developing IoT systems and the lack of specialized resources, several approaches have been proposed (by both industry and academia) empowering the so-called end-user development, allowing users with little to no technical knowledge to develop and configure their IoT systems [5], [6]. Among those, the most common are visual programming solutions, generally in the form of Visual Programming Languages (VPLs), as they were already used in tasks such as the development of Programmable Logic Controller (PLC) systems [7]. IoT systems are commonly created and managed using VPLs, either at the fog or cloud tiers [8], which allow users to define the system's behavior by manipulating visual elements rather than text. Among these solutions, we can highlight Node-RED as one of the most used [9]; it is an open-source tool that allows the mashup of hardware devices, APIs, and third-party services in a hybrid text-visual programming approach [10]. To compose the rules of the system, Node-RED allows the creation of flows connecting nodes that represent the various elements of the system (e.g., sensors and actuators). Node-RED provides not only a VPL but also a runtime environment that executes the constructed flows.
As the system complexity grows, understanding what is happening becomes harder, as Node-RED provides little feedback to the user during development [11]. This makes it difficult for a user to create and modify existing rules while ensuring that changes do not break the expected behavior [12]. Node-RED lacks mechanisms to inspect the inner workings of a node, to inject or modify messages during runtime, or even to verify that connections between nodes will not raise runtime errors. Its debug capabilities are also inadequate, relying on “log to console” strategies, which leads to the proliferation of non-essential debug nodes in the flows. Every change the user makes to the system, even to add debug nodes, requires a new deployment. Common mitigation strategies include external solutions that provide visualization and monitoring mechanisms that allow understanding how the system is behaving (making the system observable to a certain degree), mostly through log analysis [13], [14].
Even considering other less popular solutions for developing IoT systems, they share similar downsides, including lack of observability (feedback between the development environment and the system under development) and weak, or nonexistent, mechanisms to properly debug the system (most rely solely upon debug messages). While various research works approach

978-1-7281-7343-6/20/$31.00 ©2020 IEEE
or plots, as shown in Fig. 2. The socket's input and output values from the node are color-coded according to the data type (e.g., color, numeric, vector, or shader) it handles.
2) Unreal Engine: Unreal Engine's Blueprints feature allows the creation of gameplay elements through classes which are built by wiring function blocks and property references together [22], [23].

Fig. 3: Blueprints debugging mode where it is possible to watch the current path of the messages (i.e., the highlighted one) [22], [23].

Blueprints also provides a debugger capable of pausing the execution of the game and stepping through the graph nodes by using breakpoints. This debugger allows seeing the current flow of the messages (and their values), as well as other nodes' variables, as shown in Fig. 3. It also provides a Call Stack and an Execution Trace that shows a list of executed nodes and allows further runtime inspection.

C. Discussion
There is a considerable number of visual programming solutions for IoT [24]; however, they are typically limited in ways similar to Node-RED. For instance, none of them provides immediate feedback during development or at runtime [25]. Other known limitations include (but are not limited to):
Observability: There is no way to visualize the information that flows through the system in the development component. Both the Blender Nodes editor and Unreal Engine provide such features in other domains of application.
Structural Correctness: There is no verification that the connections between nodes will not raise runtime errors, which implies that many faults will only emerge after deployment. As some errors may only appear in specific conditions, it becomes harder (even impossible) to assert the system's correctness (one must let the system run for some time in a testbed or simulation setup to check for potential errors [26]). The type system of Blender Nodes reduces this need by checking for valid connections between nodes at design time.
Runtime Modification: They do not allow changing messages at runtime to check the system's behavior (e.g., when a temperature sensor emits a reading, it should be possible to change it). Thus, it is impossible to inject faulty messages to verify if the system reacts as expected.
Exploration: Every change in the system requires its deployment to production, including adding debugging nodes. Solutions such as Unreal Engine's debugging capabilities allow debugging without re-deploying the system.
Ancona et al. [20] and similar solutions attempt to address some of these challenges. Though they are capable of providing real-time information about the running system (which can be leveraged to provide self-healing capabilities [9]), they do not provide any real-time visual feedback in the Node-RED editor. To the best of our knowledge, no existing research provides any kind of structural correctness verification at design time, nor any feature regarding runtime modification and exploration, in visual development solutions for IoT.

III. ENHANCING NODE-RED
Inspired by the features that exist in VPLs from other domains of application, we consider that these issues could be addressed in visual programming solutions for IoT. We consider that this would improve the development of IoT systems by reducing the development time, the number of bugs created during development, and the overall system maintenance effort. We start by presenting a motivational scenario depicted in Fig. 4.

Fig. 4: Whenever the temperature falls below 22 °C, the heating system must turn on until the temperature reaches that value.

We modified Node-RED to augment the system's observability and improve the feedback loop between the development environment and its runtime, trying to improve the users' ability to build, evolve, and maintain IoT systems.
NODERED-CAULDRON focuses on addressing some of the missing features identified in Section II: (1) Observability, by providing the ability to show the information that flows through the nodes using different visual metaphors, (2) Runtime Modification, by allowing the injection of messages during runtime, and (3) Exploration, by enhancing the debug capabilities through breakpoints on each node without the need for re-deployments. With our approach, each node presents each input's messages; in nodes without any input, the output is shown. Thus, all the information flowing through all nodes is observable without the need to add new ones (cf. Fig. 5). Using a Switch node as an example, we can observe the added features in detail (cf. Fig. 6).
Leveraging the already existing communication mechanism (between the runtime and the UI), a new topic was added that allows showing the runtime data (i.e., messages between nodes) in the UI. Using this additional communication channel, we can visualize incoming messages through two different plots (i.e., if
the message's payload type is a number, it displays a line plot; otherwise, a scatter plot). These plots let the user perceive the values received and the pace at which they arrive. By making this communication bidirectional, and applying the same strategy, we can inject messages into the runtime to test specific system behavior.
We also added extra debugging capabilities, such as breakpoints. This allows the user to “pause” incoming messages for a given node by queuing them. The user can also step forward one message at a time and change its payload. It is also possible to clear all the queued messages. When the node is “unpaused”, the queued messages are released at the “same” frequency at which they were received (i.e., the time between the last two messages is calculated, and this value is used to set the pace). We further enhanced the Debug Node with the same message visualization capabilities as the other nodes.

Fig. 5: The flow from Fig. 4 with NODERED-CAULDRON's features, namely the message plots and extra debug options.

IV. EXPERIMENTS AND RESULTS
Our goal is to verify if these changes impact the development process. We carried out a controlled experiment to compare the performance and behavior of two developer groups [27], [28]. We hypothesized that these characteristics would improve the ability of users to successfully build, evolve, and maintain IoT systems faster, more easily, and with fewer errors. Specifically, we aim to answer the following research questions:
RQ1 Would users with increased exposure to real-time infor-
mation about the running system build and manage it
faster?
RQ2 Does providing users with real-time feedback increase
their ability to understand and change existing systems?
RQ3 Is an IoT visual programming environment able to reduce
human-induced errors during development by providing
real-time feedback?
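To make the breakpoint behaviour evaluated by these questions concrete, the following minimal sketch models the mechanism described in Section III: queue messages while paused, step through them one at a time (optionally editing the payload), clear the queue, and release the remaining messages at a pace derived from the last two arrivals. The class and method names are hypothetical and are not taken from NODERED-CAULDRON; the sketch also returns the release schedule instead of actually waiting between messages.

```python
from collections import deque


class BreakpointQueue:
    """Hypothetical per-node breakpoint: queue while paused, step, resume."""

    def __init__(self, handler):
        self.handler = handler      # callable that processes one payload
        self.paused = False
        self.queue = deque()        # (arrival_time_seconds, payload) pairs

    def receive(self, arrival_time, payload):
        """Process the message immediately, or queue it while paused."""
        if self.paused:
            self.queue.append((arrival_time, payload))
        else:
            self.handler(payload)

    def pause(self):
        self.paused = True

    def step(self, new_payload=None):
        """Process the oldest queued message, optionally with an edited payload."""
        if not self.queue:
            return False
        _, payload = self.queue.popleft()
        self.handler(payload if new_payload is None else new_payload)
        return True

    def clear(self):
        """Drop every queued message."""
        self.queue.clear()

    def resume(self):
        """Release queued messages; return the (delay, payload) release schedule."""
        self.paused = False
        pace = 0.0
        if len(self.queue) >= 2:
            # Pace = time between the last two queued messages.
            pace = self.queue[-1][0] - self.queue[-2][0]
        schedule = []
        index = 0
        while self.queue:
            _, payload = self.queue.popleft()
            schedule.append((index * pace, payload))
            self.handler(payload)   # a real runtime would wait `pace` between calls
            index += 1
        return schedule


# Usage with made-up temperature readings.
seen = []
node = BreakpointQueue(seen.append)
node.receive(0.0, 20)        # not paused: processed immediately
node.pause()
node.receive(1.0, 21)
node.receive(3.0, 22)
node.receive(5.0, 23)
node.step(new_payload=99)    # process the oldest message with an edited payload
schedule = node.resume()
```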
Fig. 6: An example with a Switch node in NODERED-CAULDRON. On the top right, there is a Debug Button (1) that allows expanding or collapsing the messages' plot (2) and the Show More button (3). The Show More button gives access to functionalities related to the messages and the breakpoint system. For messages, it shows the current message to process (5), and buttons (4) to access the input and output messages' history, clear this history, and inject messages in the current node. For the breakpoint system (6), it allows pausing and starting message processing (queuing the incoming messages) and processing one message at a time by using the step button. The step button also allows the modification of the current message. The trash button clears the breakpoint's queue.

A. Experimental Parameters
We started by doing a preliminary assessment of our procedure with two participants having distinct backgrounds: (1) a casual Node-RED user and (2) a user with no previous experience in Node-RED. Afterwards, we adopted the following parameters for the full study:
1) Experiments: They consisted of (a) debugging, (b) improving, and (c) creating an IoT system using Node-RED; hence, development experience and basic familiarity with IoT were required;
2) Participants: The sample size was twenty participants, all of them final-year computer science students with at least basic IoT knowledge, but with no Node-RED experience;
3) Duration: To avoid overloading participants while providing a reasonable time to finish all of the tasks, the duration of the experiment was set to 90 minutes, with a 25-minute timeout per task;
4) Procedure: We used a mix of quasi-experimental and ethnographic research. The population was split into two groups, GA and GB, with different treatments: GA used unmodified Node-RED, and GB used our tool. As there were no guarantees of equal technical knowledge among groups, two control tasks (CT) were performed to provide basic familiarity with the tool. Following these, three experimental tasks (ET) were given to each group, viz. (a) debug, (b) improve, and (c) create a system from scratch. In these three tasks, GB was provided with additional documentation regarding the available new features. All tasks were solved in the same order, with a short break between them;
5) Environment: All experiments were conducted in a remote environment. The needed tools were hosted in a private virtual server. Video call software was used to communicate and provide access to the participant's screen. With this procedure, it was possible to observe and take notes on the participant's behavior, clarify doubts related to the tasks, and verify if a certain outcome was correct;
6) Data: For both treatments we recorded: (a) the time taken to reach the solution; (b) the number of deployments made; and (c) the number of verification requests (i.e., every time the user thought the task was finished). For GB, the number of clicks on each new functionality was also recorded;
7) Post-test: A survey was carried out to assess the overall participant experience and to collect improvement suggestions. For this, we resorted to five statements evaluated using a Likert scale, three related to existing functionalities in NODERED-CAULDRON, and two regarding future improvements. We slightly adapted some questions to match the specificities of the different treatments.

B. Tasks
To make it possible to run the experiments under equal operating conditions, a sensor/actuator simulator was developed (having a deterministic behavior) to provide real-time data (a continuous flow of messages). This simulator implements mechanisms to validate the correctness of the experimental outcomes. The CTs were:
CT1. A preliminary task where Node-RED is introduced alongside the process of creating a simple flow. It shows how to manually inject messages in a flow (using the Inject node), parse them with custom JavaScript (using the Function node), and then display them in the sidebar (using the Debug node);
CT2. A task where data from seismometers must be used to activate an alarm, depending on the inferred earthquake's magnitude. This task introduced new nodes and logic (e.g., read data from sensors, add intermediate logic, send commands to the actuators) to be used in later tasks.
The first two ETs were both based on a smart farming scenario.
ET1. A debugging task with a set of rules. The system was capable of keeping the soil at a certain moisture and temperature level. For this, the user was able to control (a) a heating system, (b) an irrigation mechanism, and (c) automatic windows. These were controlled by a humidity/temperature sensor. These rules had some bugs related to (a) erroneous conditions, (b) wrong commands sent to the actuators, and (c) mismatched field accessors;
ET2. An improvement task, where the user is responsible for adding a new feature to the current system by using new devices (both sensors and actuators): (a) the status of the UV lamps should be adjusted according to weather forecasts, and (b) if the UV lamps are on, the window should be closed;
ET3. An implementation task, where the user must create a simple smart home system. Two different types of rules were given: (a) the lights should turn on when there is movement in the kitchen, and (b) every day at a given hour, the water heater and the coffee machine should be turned on (recurrent rule).

C. Results
We now provide an analysis of the results for both the Control and Experimental Tasks. We discarded CT1, as it was mostly used as a sanity check.
1) Control Task: We used CT2 to verify if there was a statistical difference between the two experimental groups by measuring the time spent and the number of deployments required, as presented in Table I.

TABLE I: Time spent and number of deployments in CT2 (columns: Grp, N, Mean, σ, Med, S-W (p), Levene (p), t-test (p)).

We start with Levene's test to verify whether both groups are from populations with equal variances. As the obtained p-value is 0.54 for time and 0.75 for the number of deployments, we cannot reject the null hypothesis (i.e., both groups present equal variances). A Shapiro-Wilk test verifies whether each of the groups was drawn from a population with a normal distribution. Since the resulting p-values are above the significance level (time: p(GA) = 0.69 and p(GB) = 0.61; deployments: p(GA) = 0.55 and p(GB) = 0.16), we also fail to reject the null hypothesis (i.e., both groups present a normal distribution in the results). Ergo, we assume that both samples come from normally distributed populations with equal variances.
We then use a Student's t-test to assess the following hypotheses related to time, viz. H0: both groups needed a similar amount of time to complete the task, and H1: there exists a significant difference in the average time for each group to complete the task. Concerning deployments, we assume H0:
both groups made a similar number of deployments to complete the task, and H1: there exists a difference in the average number of deployments made by each group to complete the task.
We observe that the time spent has a p-value of 0.997 and the number of deployments has a p-value of 0.866; we therefore fail to reject H0 and consider that there is no statistical difference between the two groups, as intended (cf. Fig. 7).

Fig. 7: Time (a) and number of deployments (b) in CT2.

Task  Grp  Mean (mm:ss)  σ (mm:ss)  Med (mm:ss)  t-test (p)
ET1   A    12:53         5:34       12:17        0.75
ET1   B    12:08         4:33       11:36
ET2   A    8:13          2:10       8:34         0.30 (0.03*)
ET2   B    6:57          3:05       5:47
ET3   A    8:34          2:32       8:12         0.47
ET3   B    7:49          1:59       8:05

singled out one in ET2, forcing us to discard it (cf. Fig. 8b), resulting in a p*-value of 0.03. This allows us to conclude that the experimental group does present a statistical difference when adding new features to an existing system concerning time. Regarding the other tasks, we believe that they might not have captured a sufficient degree of difficulty or complexity to evidence substantial differences, and/or the sample size was insufficient. We do consistently observe a lower mean and median for all tasks in the experimental group.

Fig. 9: Number of deployments in ET1–3: (a) ET1, (b) ET2, (c) ET3.

Deployments: All experimental tasks present p-values lower than the significance level (0.05). This allows us to reject the null hypothesis and accept that there is a significant difference in the average number of deployments made between the groups, with the experimental group performing fewer attempts.
Comparing the mean and median of the number of deploy-
ments to reach the solution (cf. Table III), there is a clear
tendency for the experimental group to need fewer deployments
— nearly half compared to the control group. This aligns with
our initial hypothesis since every time the user needs to add
new debug nodes in the control group, they are forced to deploy.
On the other hand, the experimental group was presented with
real-time feedback, thus decreasing such need.
Verification Requests: A verification request occurred every
time a participant regarded their task as completed. The
(a) ET1 (b) ET2 (c) ET3
statistical analysis allow us to reject the null hypothesis on both
Fig. 8: Time spent in ET1–3. ET2 and ET3 (cf. Table IV). Regarding the construction and
evolution tasks, we conclude that there is a significant difference
Time: Analyzing the time spent for the three tasks and between groups concerning their subjective perception of task
the results from the t-test (cf. Table II), we were initially completion, as the experimental group required fewer attempts.
unable to reject the null hypothesis for all tasks. We started Behavior: We observed that the experimental group, espe-
by concluding there are no relevant differences between the cially during ET1, changed their debugging strategy by focusing
two groups (cf. Fig. 8). However, a Grubb’s test for outliers on visualizing and understanding the messages in the system
97
2020 IEEE/ACM 24 ͭ ͪ International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
Q5 3 4 3 Q5 2 8
Q3 5 2 3 Q3 1 2 2 5
A 1.50 0.53 1.50 Strongly Disagree Disagree Neither Agree Strongly Agree
ET2 0.05
B 1.10 0.32 1.00 Fig. 11: Results of the survey post-test.
A 1.80 0.92 1.50
ET3 0.04
B 1.10 0.32 1.00
showing the input’s messages on each node, (Q2) to the plot
that shows the messages, (Q3) to the breakpoint system, (Q4) in
instead of attempting to understand the underlying logic of having typed connections between nodes, and (Q5) the highlight
each node. This was one of the most interesting observed of the node path of a message. Although only the experimental
phenomena because it represents a change in the participants’ group (GB) used some of the new features, we also asked the
behavior when approaching their tasks. This finding merits control group (GA) if they would like to have had such features.
further study before any major conclusions can be drawn. Interestingly, we have found a very close match between the
3) Experimental Group Feature Usage Analysis: After two groups (cf. Fig. 11). The highest divergence was found in
aggregating the results for each task (cf. Fig. 10), we conclude Q4 and Q5, which referred unavailable features on both groups
that the most used features in NODERED - CAULDRON were (i.e., these were not implemented in NODERED - CAULDRON).
those related to the visualization of the messages, i.e., (1) plot, This can be explained considering that the experimental group
(2) detailed message, and (3) history. In terms of usage by task was exposed to the experience of having real-time feedback
(cf. Table V), we observe an higher mean and median for ET1, during development, and not feeling the need of these extra
followed by ET3 and then ET2. These results were expected, features. In Q3, the results were similar, since in our tool
since on ET1 participants spent more time in understanding participants ended up not using breakpoints. We conclude that
the system, and consequently the messages that flow through it. most participants seem to want the functionalities described
In ET2, the extra features were not used as much because the in each question. Finally, the results of S1 suggest that the
participants already understood the system and did not feel the experimental group had a more enjoyable experiment.
need for a deeper exploration. ET3 was focused on constructing D. Discussion
a new system, which results in the observed higher values as
they attempted to understand the messages’ flow. Taking into account the experimental results presented in
Section IV-C, we now revisit our research questions:
RQ1. Would users with increased exposure to real-time
information about the running system build and manage it
faster? Both groups spend a similar amount of time in solving
the tasks, with a statistical significant difference observed
on improving systems. We also note that experimental group
presented consistently smaller mean and median values;
RQ2. Does providing users with real-time feedback increase
their ability to understand and change existing systems?
According to the number of deployments performed per task
together with the qualitative analysis, we can conclude that
in a system with higher feedback, users tend to perform less
Fig. 10: Clicks on NODERED - CAULDRON functionalities.
attempts of deployment thus pointing that these features make
the system easier to change;
4) Post-test Survey: To evaluate the participants’ experience, RQ3. Is an IoT visual programming environment, able to
we performed a post-test survey composed of six questions, one reduce human-induced errors during development by providing
about the general satisfaction (S1), and five concerning each real-time feedback? By analyzing the number of deployments
one of the functionalities (Q1–Q5), namely: (Q1) is related to and attempts, we see a substantial difference where users in
the experimental group have less need to deploy and more
confidence in their solution (i.e., they required less attempts to
TABLE V: Total clicks aggregated by ET1–3. achieve a successful task completion). This can be specially
Task Mean σ Med Min. Max.
useful in more sensible systems, where deployments should be
kept to a minimum.
ET1 54.60 34.36 43.00 21 130 In summary, there is significant evidence that an environment
ET2 17.50 12.64 12.50 1 35
with real-time feedback and improved debug capabilities
ET3 23.10 15.38 21.00 4 61
impacts the ability to build, maintain and improve IoT systems.
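The Student's t-test used throughout the evaluation can be sketched from the summary statistics alone. The following is a minimal illustration, not the authors' analysis script: it uses the Mean and σ columns of the time table (Table II), assumes equal variances as stated in the text, and assumes 10 participants per group. That group size is an assumption made here for illustration; the N column of Table I did not survive in this copy. The helper names (`to_seconds`, `pooled_t`, `two_sided_p`, `p_for`) are hypothetical.

```python
import math

# ASSUMPTION: group size is not recoverable from the text; 10 per group is
# used here only for illustration.
N_PER_GROUP = 10

# Mean and standard deviation (mm:ss) per task for groups A and B (Table II).
TABLE_II = {
    "ET1": (("12:53", "5:34"), ("12:08", "4:33")),
    "ET2": (("8:13", "2:10"), ("6:57", "3:05")),
    "ET3": (("8:34", "2:32"), ("7:49", "1:59")),
}


def to_seconds(mmss):
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t statistic with pooled variance; returns (t, df)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|],
    with the integral evaluated by Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def p_for(task, n=N_PER_GROUP):
    """p-value of the time comparison between groups A and B for one task."""
    (mean_a, sd_a), (mean_b, sd_b) = TABLE_II[task]
    t, df = pooled_t(to_seconds(mean_a), to_seconds(sd_a), n,
                     to_seconds(mean_b), to_seconds(sd_b), n)
    return two_sided_p(t, df)


if __name__ == "__main__":
    for task in TABLE_II:
        print(task, round(p_for(task), 2))
```

Under the assumed group size, the computed values should land close to the t-test column of Table II for the full samples; the value after outlier removal cannot be reproduced this way because the discarded observation is not reported.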
Abstract—In the modeling and simulation process, data plays an important role. Data is required to validate the model and to experiment with scenarios. It is also necessary for fitting and calibrating model parameters. In the case of online simulation, data assimilation approaches make it possible to inject data into simulations and to recalibrate simulations based on real-time data. This paper addresses the challenge of assimilating data into an agent-based simulation by promoting a novel architecture dedicated to data assimilation. Few improvements have been made to adapt Multi-Agent Simulations to real-time data assimilation. The architecture is designed to be generic enough to allow a wide diversity of case studies. We propose a meta-model of data assimilation and implement a toolkit based on the GAMA simulator. Finally, we use temperature data to test the implementation of a simple use case.
Keywords—Data Assimilation, Real-time System, agent-based simulation, Dynamic Data-Driven Simulation, GAMA simulator
I. INTRODUCTION
Simulation has been used to study, understand and predict the behavior of complex systems. Data play an important role in the modeling and simulation processes. Data is used in model design and validation, and to test "what if" scenarios. The simulation also uses the data to calibrate the model parameters in order to reduce the gap between simulation results and actual observations [1]. Nevertheless, combining system observations (the data) with a running simulation can increase the accuracy of these models. This combination can also improve the results of the simulation and allow it to react quickly to unexpected phenomena (in case of wildfire [2] or road traffic regulation [4]). For that, data assimilation seems to be an interesting way to inject data into the simulation and to manage the evolution of the system's state with real data.
Data assimilation (DA) is a collection of methods that seek to combine uncertain models with uncertain data to provide the best estimation of the system state at a given point in time for which observations are available [6]. The challenge of DA in real-time simulation has long been of interest to the scientific community. Therefore, many DA methods have been promoted and applied to reduce the uncertainty. They were associated with the application of real-time models, mainly in the related scientific fields (meteorology, hydrology, climatology, etc.) [7].
In the context of real-time systems, the multi-agent approach is the most preferred for modeling complex systems and for understanding emergence phenomena [9]. Indeed, with this approach, several agents are able to manage the time constraints of the system and participate in its dynamics at the same time [4] [5]. However, depending on the field, several assimilation methods have been developed in the literature. These methods can be split into 3 groups: DA with sequential methods, variational methods or machine learning methods.
The work that we present in this paper does not focus on a particular assimilation method. We propose a meta-model to ease data assimilation in real-time multi-agent simulation. The architecture of the proposed meta-model is designed to be generic enough to address various simulation domains. The main goal is to help modelers, whatever their case study and simulator (GAMA, NetLogo, ...), to implement the appropriate assimilation methods. In this paper we present:
• The architecture of the proposed Meta-model of Data Assimilation on real-time multi-Agent simulation named MEDART-MAS.
• The implementation of the proposed MEDART-MAS using the agent-oriented GAMA (GIS Agent-based Modeling Architecture) simulator.
• A simple use case to validate the implementation of MEDART-MAS on the simulator.
The rest of the paper is organised as follows. Section II gives an overview of work on assimilation and real-time simulation. Section III describes the data assimilation architecture and our proposal, and then presents the meta-model of data assimilation in real-time multi-agent simulation. Section IV introduces the implementation of the meta-model on the GAMA simulator. A simple use case is described in Section V. Finally, Section VI concludes our work.
II. OVERVIEW OF REAL-TIME SIMULATION AND DATA ASSIMILATION MODELS
A. Real-Time Systems and Multi-Agent Based Modeling
Literature shows that Multi-Agent Systems (MAS) are used to study various real systems [8]. When it comes to real-time systems, several contributions introduce a new type of agent, often called Real-Time Agent (RTAgent), which is more intelligent and autonomous than other agents in the system. They require real-time responses and must eliminate
the possibility of massive communication among agents [9]. These smart agents are responsible for DA into the simulation model. This approach has been widely used in the management of road traffic systems. TraSMAPI (Traffic Simulation Manager Application Programming Interface) is proposed in [10]. It is an interface to which multiple simulators can connect to use road traffic data. The authors use multi-agent based modeling and a stochastic module to integrate data into the simulation. In [4], the authors present an architecture to control road traffic using online simulation. In that paper, the authors present an architecture for the prediction and control of road traffic based on data collected in real time. The way data is injected into the simulation is not specified in the paper, but the authors define an integration interface that allows the controller to collect new data. Then probabilistic approaches are used to control traffic lights, make a forecast on the road or visualize traffic. A similar method is used in [5].
J. Soler et al. designed in [1] a simulation architecture using an agent-based approach for a real-time system. The role of the proposed SIMBA (SIstema Multiagente Basado en Artis, Artis-based Multi-agent System) is to introduce new types of autonomous agents. It consists of a multi-agent platform in which a real-time agent performs a real-time task and offers services with time constraints. The SIMBA approach makes it possible to apply the multi-agent paradigm to a real-time distributed problem for which the multi-agent approach seems more appropriate than a centralized approach. In [9], Julian et al. proposed a real-time multi-agent system based on SIMBA.
This modeling approach allows us to understand the dynamics of complex systems in real time. It has been used extensively in the community. The advantage of this approach is the decoupling between temporal constraints and the dynamics of the system. To manage the data injected into the simulation, several concepts have been developed, like Dynamic Data-Driven Simulation or Application Systems.
B. DDDS: Dynamic Data-Driven Simulation
Nowadays, we are witnessing the emergence of sensor technologies. They allow us to monitor systems using sensor networks and to obtain real-time data. This allowed the introduction of new concepts such as Dynamic Data-Driven Simulation (DDDS) [3], where the simulation is influenced by the system's real data. Dynamic Data Driven Application Systems (DDDAS) [11] concepts allow the possibility to inject data into a running simulation or application and, conversely, the ability of an application to dynamically manage the measurement process (retro-action). Dynamic Data Driven Multi-Agent Simulation (DDDMAS), proposed in [12], is the link between DDDAS and agent-based simulation. In this kind of application, data can be offered in real time (online) or be archival data (standalone). DDDAS improve modeling methods and increase the analysis and prediction capabilities of application simulations [1]. Suzuki and Osogami proposed in [13] a real-time DA method using sequential Monte Carlo. This approach consists of running multiple parallel simulations with random initial values and different parameter values. Each simulation is weighted with a weight recalculated using the model correction and real data of the system. This method provides much more accurate model parameters but is very computationally intensive.
All of these concepts use DA methods to couple models and observation data.
C. Data Assimilation Methods
Data assimilation can be described using two approaches: the sequential approach and the variational approach [14]. The sequential approach assumes that all observations come from the past in relation to the analysis (a priori data). It relies on statistical studies of the system's state to statistically determine the state that best suits the observations [7]. Furthermore, these methods make it possible to perform an analysis at each time step where data is available, to estimate the actual system's state. The variational approach assumes that future observations related to the analysis are also usable [14]. Among the most used assimilation methods, we have:
• Kalman filter
• Ensemble Kalman Filter
• Extended Kalman Filter
• Particle Filter or sequential Monte Carlo
• Variational methods (3D-VAR and 4D-VAR)
All of these assimilation methods contain two parts:
• Prediction: predict the system state at a given time t,
• Correction: based on the past prediction of the system state and the new observation data of the system, update the prediction and correct the prediction error.
From the description of the assimilation methods above, we noticed that the assimilation model is closely related to the simulation model, either to predict the system state or to correct prediction errors. However, for multi-agent systems, the dynamics of the system are represented by a set of interacting agents. With this type of system, DA becomes more complicated using the assimilation methods described above.
Wang M. and Hu X. propose in [15] the assimilation of sensor data for multi-agent simulation of smart environments in real time. They use a particle filter as the assimilation method. Using particle filters, they estimate the state of the system; the simulation is then dynamically restarted to take into account the newly estimated values as an initial condition. The same approach is presented in [16], but in that paper the authors optimized the sampling algorithm of particle filters.
III. ARCHITECTURE OF DATA ASSIMILATION ON MULTI-AGENT SIMULATION
In this section we present the position of the data assimilation model between the simulation and reality (Fig. 1).
The architecture in Fig. 1 can be subdivided into three parts: the real world, the virtual world and the assimilation model.
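The two-part structure (prediction, then correction) shared by the assimilation methods listed in Section II-C can be illustrated with the simplest member of that family, a scalar Kalman filter. This is a minimal sketch under stated assumptions, not the method used in this paper: the state is a single value following a random-walk model, and the function names and noise parameters (`q` for process noise variance, `r` for observation noise variance) are chosen here for illustration.

```python
def predict(x, p, q):
    """Prediction step: propagate the state estimate x and its variance p.

    ASSUMPTION: random-walk model, so the state is carried over unchanged
    and only the uncertainty grows by the process noise variance q.
    """
    return x, p + q


def correct(x_pred, p_pred, z, r):
    """Correction step: blend the prediction with a new observation z.

    r is the observation noise variance. The gain k weights the observation
    more heavily when the prediction is uncertain relative to the data.
    """
    k = p_pred / (p_pred + r)
    x = x_pred + k * (z - x_pred)
    p = (1.0 - k) * p_pred
    return x, p


def assimilate(observations, x0, p0, q, r):
    """Run predict/correct once per observation; return the estimates."""
    x, p = x0, p0
    estimates = []
    for z in observations:
        x, p = predict(x, p, q)
        x, p = correct(x, p, z, r)
        estimates.append((x, p))
    return estimates


if __name__ == "__main__":
    # Hypothetical readings, e.g. temperatures in degrees Celsius.
    for x, p in assimilate([30.0, 30.2, 29.9], x0=20.0, p0=4.0, q=0.01, r=1.0):
        print(round(x, 3), round(p, 3))
```

Ensemble, extended and particle filters replace these closed-form steps with sampled or linearized versions, but keep the same predict/correct alternation.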
run every Tdata. It takes into account new data D from the Data Adaptor and the previous estimated data at time tsimul in order to update the parameters of the estimation model E (e.g., it may be an error covariance matrix). This element E will be taken into account in the estimation model to enhance the prediction.
In GAML, you can use the concept of skills to construct species in a combined manner. Skills are bundles of attributes and actions that can be shared between different species and inherited by their children.
In order to implement our proposal, we created an Assimilation skill on GAMA.
• Data Adaptor: It is a class that allows connectors for different data sources to be defined and created. The nature of these data sources may be different, such as databases, sensors, etc. The data adaptor class defines how to receive, collect and integrate data.
In the implementation, we create two skills based on the meta-model defined in Fig. 3.
3) Network Skill: This skill is used to create a communication link between an agent and an IP-based network. It provides many ways to connect an agent with a network. This skill is designed to be independent of the assimilation skill. This skill uses the data part (data adaptor and data source) of our proposal. We designed an abstraction layer to enable GAMA to connect to the server through the network using many protocols (e.g., TCP, UDP, MQTT).
4) Assimilation skill: There are two important actions for the assimilation skill:
• Correction: This action is called every Tdata. We allow modelers to design their own correction action, so we don't provide an implementation of this function, but it will be executed once for each Tdata in the background.
• Estimation: This action is called every tsimul.
In the assimilation skill, the Correction action and the Estimation action must be implemented in GAML by the modeler. In this way, a modeler can choose and implement different estimation and correction models or combine many models for assimilation on a multi-agent system.
V. TEST WITH SIMPLE USE CASE: AMBIENT TEMPERATURE
The aim of our work in this section is to test the model and show that we can provide it with real-time data with a simple assimilation model and have consistent results. The idea is therefore to see what type of model would be easy to implement for test purposes. We chose temperature data because of the simplicity of the models used for the prediction and their implementation in our simulation platform [21]. In addition, access to data of this type is easier because we have deployed a real-time temperature data capture station. Every 15 minutes, data is collected from temperature sensors deployed in the Dakar (Senegal) region.
A. Assimilation model
To inject the temperature data into the simulation through our assimilation scheme, an appropriate assimilation model must be defined. According to the assimilation scheme introduced above (Fig. 2) we can choose an algorithm for each block.
• Data Adaptor: For the acquisition part, to simulate the real-time communication, we have written a Python script that sends the data to an MQTT broker, and the Data Adaptor connects to the broker in order to get the data. The Data Adaptor formats data by separating arrival time and temperature data, and injects it into the simulation through the estimation model.
• Estimation and correction Model: For the estimation model, we use a linear regression model. We define D as the data from the real sensor (the output of the data adaptor), and D' as the estimated data from the estimation model (the output of the estimation model). When we use the linear estimation model, the output of the model is:
$$D' = aD + b + \varepsilon \tag{1}$$
where a and b represent model parameters and $\varepsilon$ the estimation error. Since D and D' have different temporality (Tdata and tsimul), there is a temporality problem in this equation. To solve this problem, we introduce the variable λ, which represents the assimilation frequency and is defined as:
$$T_{data} = \lambda\, t_{simul} \tag{2}$$
• Simulation Model: The simulation model is represented by an agent, which plays the role of a virtual sensor on the GAMA simulator. This agent is connected to a real sensor (temperature sensor) to retrieve and display data through the assimilation model.
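The interplay between the Correction action (once per Tdata), the Estimation action (once per tsimul) and the relation Tdata = λ·tsimul can be sketched as follows. This is a hypothetical Python illustration, not the GAML implementation of the Assimilation skill. Several points are assumptions made here: the correction refits a and b by ordinary least squares on consecutive sensor readings, the estimation spreads the predicted change evenly over the λ simulation steps that separate two readings, and the class and function names (`LinearAssimilator`, `run`) are invented for the example.

```python
class LinearAssimilator:
    """Toy estimation/correction pair built around D' = a*D + b."""

    def __init__(self, lam, a=1.0, b=0.0):
        self.lam = lam        # lambda: simulation steps per data period
        self.a = a            # slope of the linear estimation model
        self.b = b            # intercept of the linear estimation model
        self.readings = []    # sensor data D received so far
        self.last = None      # most recent sensor reading
        self.state = None     # current estimate D' shown by the virtual sensor

    def correction(self, d):
        """Called every Tdata: take the new reading and refit a and b.

        ASSUMPTION: the fit regresses each reading on the previous one.
        """
        self.readings.append(d)
        self.last = d
        self.state = d
        if len(self.readings) >= 3:
            xs, ys = self.readings[:-1], self.readings[1:]
            mean_x = sum(xs) / len(xs)
            mean_y = sum(ys) / len(ys)
            sxx = sum((x - mean_x) ** 2 for x in xs)
            if sxx > 0.0:
                sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                self.a = sxy / sxx
                self.b = mean_y - self.a * mean_x

    def estimation(self):
        """Called every tsimul: advance the estimate by 1/lambda of the
        change that the linear model predicts for the next data period."""
        target = self.a * self.last + self.b
        self.state += (target - self.last) / self.lam
        return self.state


def run(readings, lam):
    """Feed readings one per data period; return all per-step estimates."""
    model = LinearAssimilator(lam)
    estimates = []
    for d in readings:
        model.correction(d)
        for _ in range(lam):
            estimates.append(model.estimation())
    return estimates


if __name__ == "__main__":
    # Hypothetical temperatures, one reading per data period, lambda = 3.
    print(run([20.0, 21.5, 23.0, 24.5], lam=3))
```

With a constant trend in the readings, the estimate reaches the model's prediction for the next reading exactly when that reading is due, at which point the correction replaces it with the observed value.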
[2] Mandel J, Bennethum LS, Chen M, Coen JL, Douglas CC, Franca LP, Johns CJ, Kim M, Knyazev AV, Kremens R, Kulkarni V. "Towards a dynamic data driven application system for wildfire simulation". In International Conference on Computational Science 2005 May 22 (pp. 632-639). Springer, Berlin, Heidelberg.
[3] Hu X. "Dynamic Data-Driven Simulation: Connecting Real-Time Data with Simulation". In Concepts and Methodologies for Modeling and Simulation 2015 (pp. 67-84). Springer, Cham.
[4] Wahle J, Schreckenberg M. "A multi-agent system for on-line simulations based on real-world traffic data". In System Sciences, 2001. Proceedings of the 34th Annual Hawaii International Conference on 2001 Jan 6 (pp. 9-pp). IEEE.
[10] Timóteo IJ, Araújo MR, Rossetti RJ, Oliveira EC. "TraSMAPI: An API oriented towards Multi-Agent Systems real-time interaction with multiple Traffic Simulators". In Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on 2010 Sep 19 (pp. 1183-1188). IEEE.
Smart Cities and Communities (SCCIC). IEEE, 2018. p. 1-6.
[19] Ferber, Jacques, and Gerhard Weiss. "Multi-agent systems: an introduction to distributed artificial intelligence". Vol. 1. Reading: Addison-Wesley, 1999.
[20] Taillandier, P., Gaudou, P. Grignard, A. Huynh, Q.N., Marilleau, N., Caillou, P., Philippon, D., Drogoul, A. "Building, Composing and Experimenting Complex Spatial Models with the GAMA Platform". GeoInformatica, Dec. 2018. https://doi.org/10.1007/s10707-018-00339-6.
[21] Zounemat-Kermani, Mohammad. "Hourly predictive Levenberg–Marquardt ANN and multi linear regression models for predicting of dew point temperature". Meteorology and Atmospheric Physics, 2012, vol. 117, no 3-4, p. 181-192.
Abstract—Cyber-physical systems (CPSs) integrate continuous behavior of a physical controlled plant with discrete behavior provided by a controlling cyber (software) part. The integration is challenging because continuous, Newtonian time of the physical part needs to be reconciled with discrete time of the cyber part. In this work, the event-based asynchronous actors of Theatre, extended with continuous modes, are used for modelling and analyzing CPSs. Continuous modes capture the dynamic laws (ODEs) of variation of physical/environmental variables. Theatre is control-based and distributed. It is implemented in Java, which is used both as the modelling language and as the target implementation language. Specific control forms were developed for simulating a distributed CPS and for assessing its functional/temporal behavior. Continuous modes exploit suitable ODE solvers to predict the future values of selected variables at specific time points. Although classical actors depend on non-deterministic message passing, a Theatre model can be designed to have a deterministic behavior. A hybrid Theatre model can be analyzed by exhaustive model checking by having, for instance, that the computations of the ODE solvers are, preliminarily, off-line collected and reused during verification. This paper describes Theatre, summarizes its operational semantics and illustrates a model reduction onto Uppaal timed automata. Then an automotive deterministic model based on both wired and Controller Area Network transmitted messages is presented and thoroughly analysed.
Keywords—Actors, asynchronous messages, continuous modes, cyber-physical systems, determinism, timing constraints, Controller Area Network, model checking, statistical model checking, Java.
I. INTRODUCTION
Concurrent and timed systems, including critical embedded real-time systems and cyber-physical systems (CPS) [1], must be correct from the functional and the temporal point of view. Failing to fulfil the timing constraints can have severe consequences in the practical case. Assuring properties of a CPS relies on preparing a formal model [2-3] of the system and possibly using model checking [4], that is assessing properties over all the state trajectories of the corresponding state transition system. However, designing a model for a CPS is very challenging because it must integrate multiple (sub)models respectively associated with the discrete-time cyber part and the continuous-time physical part. Moreover, CPS models tend to be analysable mostly by statistical model checking [5-7] because the integration of hybrid (ODE based) and discrete behaviours often makes the model undecidable, thus properties can be approximated by simulation experiments.
In this paper, the control-based Theatre actor-framework in Java [8-9], extended with continuous modes (with ODEs), is adopted for modelling and analysis of CPS. Theatre makes it possible to design CPS models which are closed (the external, physical environment is explicitly modelled and integrated with the models of the cyber components/actors) and have the "engineering nature" [3], that is a model is assumed to guide a faithful system synthesis in the physical world. However, CPS models have also the "science nature" [3] because the dynamic laws of the external variables must preliminarily be predicted and captured into continuous modes.
This paper improves the authors' previous work [7,10] which was mainly based on statistical model checking [5] of CPS models. The contribution is twofold. First, Java is proposed as the modelling language for Theatre-based CPS, together with a simulation control-layer which is capable of managing, as in [11], both wired messages and deterministic messages transmitted through a Controller Area Network [12]. CPS analysis techniques are then further enhanced by a model checking approach as advocated in [4] where a Lingua Franca synchronous model is first transformed into the terms of Timed Rebeca [13] actors and then verified by the associated Afra model checking tool [14]. In this work, a reduction of a Theatre model with continuous modes onto the Timed Automata (TA) of the more general and mature Uppaal model checker [15] is proposed, whose practical effectiveness is improved by implementing mechanisms which minimize the model partial order. The model checking approach assumes that the timing model of the physical part will not request interactions with the cyber part at arbitrary (aperiodic) unknown time points, which would make model checking practically impossible [4].
The rest of this paper is structured as follows. Section II first summarizes the Theatre actor system in Java, extended
978-1-7281-7343-6/20/$31.00 ©2020 IEEE
with continuous modes; then some informal arguments to the operational semantics of Theatre are furnished; finally, the realization of a control-layer for modelling and simulation is discussed. Section III describes an automotive CPS deterministic case study. The model is first developed and analysed in Java. Then a reduction onto the timed automata of Uppaal is detailed which enables model checking. Different techniques are presented to minimize the model partial order. Finally, conclusions are drawn with an indication of further work.
II. THEATRE CONCEPTS IN JAVA
Theatre [16,8-9,7] is a variant of the Actors computational model [17] that addresses all the development phases of distributed timed systems. Theatre rests on global time, lightweight (that is thread-less) actors, and asynchronous message passing governed by a reflective control-layer, which can be customized. The following considers an extension of Theatre with hybrid aspects, designed for modelling and analysis in Java of time-critical CPS. A CPS model is a federation of interconnected theatres (computing nodes). Each theatre hosts a collection of application actors. Actors' universal naming is directly based on Java object references. An actor is at rest until a message arrives. Processing (i.e., reacting to) a message is an atomic activity which can be neither suspended nor preempted (macro-step semantics [18]). Only at the termination of the current message reaction does the event-loop of the control layer resume its execution and select and deliver a next message for processing, and so forth. Message interleaving ensures a cooperative concurrency scheme among the actors of a theatre.
For modelling purposes, theatres are abstracted as processing units (PUs) each having a unique id. An actor is allocated to a PU using a move() operation (see Fig. 1). A PU can be free or busy. It is not possible to dispatch a message to an actor allocated on a currently busy PU. A PU becomes busy when one of its actors is delayed (see below).
the send operation is invoked) before after time units are elapsed. The message becomes invalid and is discarded should the current time become greater than the message deadline. When missing, an after evaluates to 0, and a deadline defaults to ∞.
A msgsrv can use a delay(d) operation to express the duration of a code fragment. Since Theatre actors are non-suspensive, if used, a delay should be the last operation of a msgsrv. The effect of delay(d) is to occupy, for d time units, the PU upon which the actor is logically executing.
Theatre actors prove effective for modelling the time-discrete cyber part of a CPS [7,10]. Challenging is the modelling of the time-continuous physical part, that is the external controlled environment. As in Hybrid Rebeca [11], extended Theatre admits continuous modes, which are special "physical actors" interfacing the cyber part with the external environment. A continuous mode (see Fig. 2), as in Hybrid Automata [19], logically consists of an initialization, an invariant, one or more flows (first order ODEs), a guard and a final action. A continuous mode is activated by a cyber actor (environment accessor), which at the mode termination (the guard evaluates to true) receives (final action) an instantaneous message with the computed values of continuous variables. Such final messages make it possible to integrate the behavior of the continuous-time physical part with that of the discrete-time cyber part. Each continuous mode owns an implicit and hidden PU.
Fig. 2. Continuous vs. discrete time integration
send( String msg[, Object…args] );
send( double after, String msg[, Object…args] ); Informal arguments to Theatre semantics
send( double after, double deadline, String msg[, Object…args] ); The semantics of a Theatre model can be given operationally
double now(); [8] by a Timed Transition System (TTS) (S,s0,→) where S is a
void delay( double duration ); set of states, s0 is an initial state and → is the transition
void move( theatre-id ); relation. A state is composed of:
Fig. 1. Basic Actor services • All the actor internal states (E).
• The current value of global time (now).
A programmer-defined actor class derives from the Actor • The set of sent but not yet dispatched messages (M).
base class which exposes some fundamental services (see Fig. • The set of activated but not yet expired delays (D).
1 and [9]) whose concretization depends on the adopted control The temporal information of scheduled messages or set
layer. An actor class (see for example Fig. 5) declares an
delays are assumed to be absolutized at the send/set time:
encapsulated data status which includes some acquaintances,
i.e., actor names to whom messages can be sent (for pro- after+now, deadline+now, delay+now.
activity, an actor can always send a message to itself), plus an A transition in the TTS can be a time advancement
interface of message servers (msgsrv) [13]. A msgsrv is a transition, or the occurrence of a most imminent event
method which always returns void and can have arguments. A (message dispatch or delay expiration). Time advancement
msgsrv processes a message with the same name. The basic increases now to the time of the (or one of the) most imminent
non-blocking send relies on Java reflection and specifies, event.
besides the msg name and its optional arguments, two time A message can be dispatched to its target actor provided its
attributes: an after and a deadline [13]. Both quantities are PU is free and its deadline is not exceeded. Message dispatch
relative to the current time (now()). The message can’t be causes a msgsrv to be atomically executed. A delay expiration
delivered to its destination (that is, the actor object upon which makes a PU again free. When multiple events are eligible to
2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
occur at the same time, one of them is chosen non-deterministically and occurs. Event non-determinism, though, can be controlled by attaching e.g. a Lamport logical clock (meta data) to messages, so as to dispatch simultaneous messages according to their logical clock, thus restoring (almost) the sending order. Activated continuous modes operate in parallel. However, their behaviour can be abstracted by final messages sent at current time to the accessor actors.
An automotive-based control layer
A CPSSimulator control layer was developed to enable modelling and analysis in Java of Theatre-based CPS models, e.g. belonging to the automotive domain. The control layer recognizes wired messages and Controller Area Network (CAN) [12] transmitted messages. By default, messages are assumed to be wired and can have the usual after and deadline timing attributes. In addition, wired messages are equipped with a Lamport logical clock (LC). This way, messages with the same timestamp are ordered by their LC. CAN transmitted messages must be explicitly declared through the method (see also Fig. 12):
[...] the brake pedal are estimated (sampled) through simple ODEs respectively by the Rolling and Braking continuous modes, and transmitted respectively to the wheels and then to their wheel controllers (WCtrl) and to the brake (Brake) and then to the brake controller (BrakeCtrl). Wheel controllers in turn transmit the wheel speeds to the brake controller. The brake controller acts as the main controller of the model. When it has both the wheel speeds and the new bprcnt, it applies the control actions of the current period. It estimates the longitudinal vehicle speed (vspd) and sends to the wheel controllers the vspd and the new value of bprcnt (assumed equal to the torque level to be applied to the wheels). From the vspd and the wheel speed, each wheel controller evaluates the slip rate (slprt) and, for safety reasons, immediately releases the braking should the slip rate be greater than 0.2. The BBW model evolves towards the vehicle coming to a complete stop (wheel speed <=0). The bprcnt is increased from an initial value of 0.60 to a maximum value of 0.85. The initial value of the wheel speed is 1.
Fig. 4 to Fig. 11. Some global parameters of the considered scenario are collected in the G class which is statically imported in the actor classes.
public class Wheel extends Actor{
   //acquaintances
   private WCtrl ctrl;
   private Rolling rolling;
   //local data variables
   private double trq, spd;
   private int id;
   @Msgsrv
   public void init( Integer id, WCtrl ctrl, Rolling rolling, Double spd ) {
      this.id=id; this.ctrl=ctrl; this.rolling=rolling; this.spd=spd;
      rolling.send( "activate", this, spd, trq );
   }//init
   @Msgsrv public void setTrq( Double tq ) { trq=tq; }//setTrq
   @Msgsrv public void sample( Double sp ) {
      spd=sp; //angular wheel speed
      if( spd<=0.0D && id==WL ) {
         //print current time, the spd and the maxEED estimated values
         end();
      }
      ctrl.send( "setWspd", spd );
      rolling.send( "activate", this, spd, trq );
   }//sample
}//Wheel
Fig. 5. The Wheel actor class
public class WCtrl extends Actor{
   private Wheel wheel;
   private BrakeCtrl bctrl;
   private int id;
   private double wspd, slprt;
   @Msgsrv
   public void init( Integer id, Wheel wheel, BrakeCtrl bctrl ) {
      this.id=id; this.wheel=wheel; this.bctrl=bctrl;
   }//init
   @Msgsrv public void setWspd( Double spd ) {
      wspd=spd; bctrl.send( "setWspd", id, wspd );
   }//setWspd
   @Msgsrv public void applyTrq( Double reqTrq, Double vspd ) {
      if( vspd<=0.0D ) slprt=0.0D;
      else slprt=(vspd-wspd*WRAD)/vspd;
      if( slprt>SRT ) wheel.send( "setTrq", 0.0D );
      else wheel.send( "setTrq", reqTrq );
   }//applyTrq
}//WCtrl
Fig. 6. The WCtrl actor class
public class Brake extends Actor{
   private BrakeCtrl bc;
   private double bprcnt, maxprcnt, r;
   private Braking braking;
   @Msgsrv
   public void init( BrakeCtrl bc, Braking brk, Double bprcnt, Double maxprc ) {
      this.bc=bc; this.braking=brk; this.bprcnt=bprcnt; this.maxprcnt=maxprc;
      r=1; braking.send( "activate", this, bprcnt, r );
   }//init
   @Msgsrv public void sample( Double bp ) {
      bprcnt=bp; bc.send( "setBprcnt", bprcnt );
      if( bprcnt>=maxprcnt ) r=0;
      braking.send( "activate", this, bprcnt, r );
   }//sample
}//Brake
Fig. 7. The Brake (pedal) actor class
public class BrakeCtrl extends Actor{
   private WCtrl wcR, wcL;
   private double wspdR, wspdL, bprcnt, vspd;
   private int c=0;
   @Msgsrv public void init( WCtrl wcR, WCtrl wcL ) {
      this.wcR=wcR; this.wcL=wcL;
   }//init
   @Msgsrv public void control() {
      //vehicle horizontal estimation speed
      vspd=((wspdR+wspdL)*WRAD)/2.0;
      wcR.send( "applyTrq", bprcnt, vspd );
      wcL.send( "applyTrq", bprcnt, vspd );
   }//control
   @Msgsrv public void setWspd( Integer id, Double wspd ) {
      if( id==WCR ) wspdR=wspd;
      else wspdL=wspd;
      c++;
      if( c==WN ) { send( "control" ); c=0; }
   }//setWspd
   @Msgsrv public void setBprcnt( Double bprcnt ) { this.bprcnt=bprcnt; }
}//BrakeCtrl
Fig. 8. The BrakeCtrl actor class
public class Rolling extends Mode{
   //estimates angular speed of a wheel
   double spd, trq;
   @Msgsrv
   public void activate( Wheel wheel, Double spd0, Double tq ) {
      trq=tq;
      //spd'=-0.1-trq
      spd=RollingODE.solve( PERIOD, spd0, trq );
      wheel.send( PERIOD, "sample", spd );
   }//activate
}//Rolling
Fig. 9. The Rolling mode class
public class Braking extends Mode{
   double bprcnt;
   @Msgsrv
   public void activate( Brake br, Double bprcnt0, Double r ) {
      //bprcnt'=r;
      bprcnt=BrakingODE.solve( PERIOD, bprcnt0, r );
      br.send( PERIOD, "sample", bprcnt );
   }//activate
}//Braking
Fig. 10. The Braking mode class
public class Main {
   public static void main( String... args ){
      CPSSimulator cm=new CPSSimulator( N, TEND );
      Wheel wr=new Wheel(), wl=new Wheel();
      WCtrl wcr=new WCtrl(), wcl=new WCtrl();
      Brake bp=new Brake(); BrakeCtrl bc=new BrakeCtrl();
      Rolling ror=new Rolling(), rol=new Rolling();
      Braking braking=new Braking();
      wr.send( "init", WR, wcr, ror, 1.0 ); wl.send( "init", WL, wcl, rol, 1.0 );
      wcr.send( "init", WCR, wr, bc ); wcl.send( "init", WCL, wl, bc );
      bp.send( "init", bc, braking, 0.60, 0.85 ); bc.send( "init", wcr, wcl );
      cm.can( bc, wcr, "applyTrq", 1, 0.01 ); //CAN declaration
      cm.can( bc, wcl, "applyTrq", 2, 0.01 );
      cm.can( wcr, bc, "setWspd", 3, 0.01 );
      cm.can( wcl, bc, "setWspd", 4, 0.01 );
      //actor partitioning – maximal parallelism as an example
      wr.move( 0 ); wl.move( 1 ); wcr.move( 2 ); wcl.move( 3 );
      bp.move( 4 ); bc.move( 5 );
      cm.controller(); //launches the control event-loop
   }//main
}//Main
Fig. 11. The Main class
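The RollingODE and BrakingODE helpers called in Fig. 9 and Fig. 10 are not listed in the paper. As a rough illustration only, the following sketch assumes that the torque (respectively the rate r) stays constant over one period, so that both flows have a closed-form solution. The class OdeSketch and its method names are hypothetical and are not the authors' implementation; a generic explicit Euler step is included for flows without a closed form.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for the RollingODE/BrakingODE classes, which the paper does not show.
final class OdeSketch {
    private OdeSketch() {}

    /** Rolling flow spd' = -0.1 - trq, assuming trq is constant over the period. */
    static double solveRolling(double period, double spd0, double trq) {
        return spd0 - (0.1 + trq) * period; // exact solution of a constant-slope ODE
    }

    /** Braking flow bprcnt' = r, assuming r is constant over the period. */
    static double solveBraking(double period, double bprcnt0, double r) {
        return bprcnt0 + r * period;
    }

    /** Explicit Euler integration of y' = rhs(y) over one period, split into `steps` sub-steps. */
    static double euler(DoubleUnaryOperator rhs, double y0, double period, int steps) {
        double h = period / steps;
        double y = y0;
        for (int i = 0; i < steps; i++) {
            y += h * rhs.applyAsDouble(y);
        }
        return y;
    }

    public static void main(String[] args) {
        double period = 0.05; // 50 ms period, as stated in the text
        System.out.println("spd after one period: " + solveRolling(period, 1.0, 0.60));
        System.out.println("bprcnt after one period: " + solveBraking(period, 0.60, 1.0));
    }
}
```

The closed form is enough here because both right-hand sides are independent of the state within a period; a library solver becomes useful once the flows depend on the continuous variables themselves.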
It is worth noting that the ODE classes of continuous modes (RollingODE and BrakingODE, not shown for brevity) were implemented by using the Apache Commons Math3 library.
For simulation purposes, some further variables were added to the BBW actor model so as to decorate it, e.g., to estimate the maximum end-to-end delay (maxEED) between the moment a new sample bprcnt value is generated and the time the bprcnt is applied to the wheels. Similarly, the speed values and torque values (bprcnt) were collected and stored to a file for offline analysis. Any simulation run terminates with the output: Vehicle stopped @1.20sec spd=-0.03 maxEED=0.04sec, that is, the vehicle is stopped after 24 periods and the maximum observed EED is 0.04sec, thus within the period. Fig. 12 shows the observed shape of the angular (wspd) and horizontal (vspd) speeds and torque level (trq) vs. time.
Fig. 12. Wheel and vehicle speeds and torque level vs. time
Model checking the BBW model
The simulation results of the BBW model give an important indication about its quantitative behavior, also considering that the model is deterministic. However, the possibility of making an exhaustive verification based on model checking [4] would allow a more in-depth analysis of the model properties.
A CPS model like the BBW could be analyzed by statistical [...] and by converting double values into integer values by a scale factor, with a corresponding approximated integer arithmetic. Most important, though, is a replacement of the continuous behavior. The ODE solvers’ results at each time (period) can be pre-computed from the Java program and collected into constant arrays of the Uppaal model. This way, at each period, a continuous mode simply accesses the corresponding ODE value and transmits it to the accessor actor by a message. The behavior closely mimics that of modes kept in the final system implementation that, at each period, access the value of selected external environment variables, by exploiting an abstraction like the envGateway proposed in [16,7].
First of all, all modeled entities (actors, messages, modes and delays) are mapped onto TA (template processes) whose instances have a unique identifier established by a corresponding sub-range type (typedef). The sets of pending scheduled messages, set delays and activated modes are represented by corresponding entity instances with their timing constraints for firing. The organization implicitly relies on Uppaal nondeterminism for choosing the entity which may fire next. Broadcast channels are used for scheduling a message, activating a mode, setting a delay, proposing a message for transmission over a CAN bus, and for dispatching a message to a receiver actor. The use of broadcast channels is a key for transforming a TA model for use also with the statistical model checker of Uppaal.
Actor automata. The template process of an actor directly corresponds to its high-level Java model. The automaton is organized into two main locations: Receive and Select. In the Receive (home) location, the next message to process is awaited. In the Select location, the particular received message ID is checked and the corresponding “msgsrv body” (reaction) executed. To comply with the macro-step semantics (see section II), each msgsrv body is realized as a cascade of committed locations, thus guaranteeing the atomic execution. At the end of a msgsrv, the Receive location is re-entered. Some actor TA of the BBW model are shown in the figures from 14 to 17 (the simple Main automaton is not reported for brevity). A 10^3 scale factor is used for double variables (wheel speeds, bprcnt/torque level, time granularity etc.). Now one period becomes 50ms. The wheel radius (see the WRAD constant in Fig. 4) is scaled, instead, by 10^2. The use of integer arithmetic can be observed in Fig. 14 and Fig. 16.
Fig. 13. The automaton of the Wheel actor [edge labels recovered from the figure: spd<=0; spd>0; activate[rolling]!; D=self,M=SAMPLE; send[mi]!; msg==SET_TRQ, trq=arg[0]; msg==SAMPLE]
A critical aspect in TA actor modelling is the dynamic message exchanges. The Uppaal model checker requires working with a statically defined number of entity instances. As a consequence, a pool of (timed or immediate) wired message instances, a pool of CAN message instances and a pool of delays etc. are used.
Sending a message is realized by first obtaining a message instance id through the nM() function, which checks if the message is wired or CAN-based, then filling some arguments in the global array arg[] and, finally, raising an output operation (!) on the send[.] channel indexed by the message instance id. In a similar way a mode instance can be activated etc.
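The statically bounded pools of message instances mentioned in the text can be pictured with a small Java analogue. This is only a sketch under stated assumptions: the class MessagePoolSketch and its fields are hypothetical, the method names newMessage and free merely echo the nM() and free() labels of the automata, and the wired/CAN distinction made by nM() is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical analogue of a fixed-size pool of message instances (not the Uppaal model itself).
final class MessagePoolSketch {
    private final Deque<Integer> freeIds = new ArrayDeque<>();
    private final int[] sender;
    private final int[] dest;
    private final int[] msg;

    MessagePoolSketch(int size) {
        sender = new int[size];
        dest = new int[size];
        msg = new int[size];
        for (int i = 0; i < size; i++) {
            freeIds.addLast(i); // all instances are initially available
        }
    }

    /** Reserves an instance id for (sender, dest, msg); returns -1 when the pool is exhausted. */
    int newMessage(int senderId, int destId, int msgId) {
        Integer id = freeIds.pollFirst();
        if (id == null) {
            return -1;
        }
        sender[id] = senderId;
        dest[id] = destId;
        msg[id] = msgId;
        return id;
    }

    /** Returns an instance to the pool, e.g. after dispatch or after a missed deadline. */
    void free(int id) {
        freeIds.addLast(id);
    }

    int available() {
        return freeIds.size();
    }

    public static void main(String[] args) {
        MessagePoolSketch pool = new MessagePoolSketch(2);
        int mi = pool.newMessage(0, 1, 7);
        System.out.println("reserved instance " + mi + ", still available: " + pool.available());
        pool.free(mi);
    }
}
```

A pool that is too small shows up as a -1 result here; in a model checker the corresponding situation would have to be ruled out by sizing the pool for the worst case.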
[Automaton figure without a recovered caption; edge labels: slprt>SLT, mi=nM(self,w,SET_TRQ), arg[0]=0, send[mi]!; slprt<=SLT, mi=nM(self,w,SET_TRQ), arg[0]=reqTrq, send[mi]!; msg==SET_WSPD, wspd=arg[0], arg[1]=id, mi=nM(self,bctrl,SET_WSPD), send[mi]!; vspd>0, slprt=((vspd-(wspd*WRAD)/100)*100)/vspd; vspd<=0, slprt=0]
Fig. 15. The automaton of the Brake actor [edge labels: msg==INIT, mxprcnt=arg[1], braking=arg[2], bctrl=arg[3], r=1; activate[braking]!, D=self,M=SAMPLE; send[mi]!]
Fig. 16. The automaton of the BrakeCtrl actor [locations Receive and Select; edge labels: msgsrv[self]?, msg=M; msg==INIT, wctlrR=arg[0], wctlrL=arg[1]; msg==SET_WSPD, wspdL=(arg[1]==wctlrL)?arg[0]:wspdL, wspdR=(arg[1]==wctlrR)?arg[0]:wspdR, c++; c<WN; c==WN, mi=nM(self,self,CONTROL), c=0, send[mi]!; msg==SET_BPRCNT, bprcnt=arg[0]; msg==CONTROL, espd=(((wspdR+wspdL)/2)*WRAD)/100; mi=nM(self,wctlrR,APPLY_TRQ), arg[0]=bprcnt, arg[1]=espd, send[mi]!; mi=nM(self,wctlrL,APPLY_TRQ), arg[0]=bprcnt, arg[1]=espd, send[mi]!]
Fig. 17. The automaton of ImmediateWiredMessage [locations idle, scheduled, dispatch; edge labels: send[wm]?, dest=D,msg=M, getParams(); avail[pu[dest]], check!; M=msg, putParams()]
The check broadcast and urgent channel is fictitiously sent to force the passage from, e.g., the scheduled to the dispatch location in Fig. 17, when eligible, without delay. The more general automaton in Fig. 18 checks that the message deadline is not exceeded. A discarded message instance is simply returned to its pool. A dispatchable timed message is eventually delivered through an immediate wired message (see later for a discussion).
[Automaton figure without a recovered caption; edge labels: free(mi), deadline_miss]
Fig. 19. The automaton of CanMessage [edge labels: send[cm]?, dest=D,msg=M, getParams(); cs=req(S,D,M), pdelay=pTime(cs); check!, canBusy=true, x=0; x<=pdelay; x>=pdelay, mi=nM(dest,dest,msg), putParams(), send[mi]!; canBusy=false, free(cm), rel(cs)]
Continuous mode automata. In a statistical model checking model, a continuous mode like the Rolling one in the BBW model could be naturally expressed as a hybrid automaton as in Fig. 20.
Obviously, the automaton in Fig. 20 can’t be used in a TA model because of the presence of the ODE flows and the use of double variables. Fig. 21 shows a corresponding automaton which simply reads the pre-computed ODE solver values. A similar automaton is used for the Braking mode.
Fig. 20. A hybrid automaton for the Rolling continuous mode [labels recovered from the figure: Flow; t<=PERIOD && t'==1 && spd'==-0.1-trq; t>=PERIOD; spd'==0; dest=D,msg=M, spd=darg[0], trq=darg[1], t=0; mi=nM(dest,dest,msg), darg[0]=spd; send[mi]!]
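The idea of pre-computing the ODE solver values in Java and exporting them as scaled integer constants for the Uppaal model can be sketched as follows. The sketch is a simplification under explicit assumptions: the torque is held constant for the whole run (in the BBW model it changes from period to period), and the names OdeTableSketch, rollingTable and toUppaal are hypothetical. Only the 10^3 scale factor and the flow spd' = -0.1 - trq come from the text.

```java
// Hypothetical helper that tabulates a flow per period and prints a Uppaal-style constant array.
final class OdeTableSketch {
    static final int SCALE = 1000; // 10^3 scale factor mentioned in the text

    private OdeTableSketch() {}

    /** Speed at the start and after each period for spd' = -0.1 - trq, as scaled integers. */
    static int[] rollingTable(double spd0, double trq, double period, int periods) {
        int[] table = new int[periods + 1];
        double spd = spd0;
        for (int k = 0; k <= periods; k++) {
            table[k] = (int) Math.round(spd * SCALE);
            spd -= (0.1 + trq) * period; // closed-form step, valid for constant trq
        }
        return table;
    }

    /** Renders the table as a constant integer array declaration. */
    static String toUppaal(String name, int[] values) {
        StringBuilder sb = new StringBuilder();
        sb.append("const int ").append(name).append('[').append(values.length).append("] = {");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.append("};").toString();
    }

    public static void main(String[] args) {
        int[] table = rollingTable(1.0, 0.60, 0.05, 4);
        System.out.println(toUppaal("SPD", table));
    }
}
```

Rounding to the nearest integer keeps the per-entry error below half a scaled unit; whether that is acceptable depends on how the values are compared in guards such as the slip-rate threshold.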
Performance Evaluation of
HLA RTI Implementations
Moritz Gütlein, Wojciech Baron, Christopher Renner, Anatoli Djanatliev
Computer Networks and Communication Systems, Dept. of Computer Science
University of Erlangen-Nürnberg
Erlangen, Germany
{moritz.guetlein,wojciech.baron,chris.renner,anatoli.djanatliev}@fau.de
Abstract—The High Level Architecture is an IEEE standard that enables distributed simulation. There are several implementations of the standard, which provide, among other things, the Run-Time Infrastructure component. This paper compares the four most known RTI implementations, namely MAK RTI, Pitch pRTI, Portico, and CERTI, with a focus on performance evaluation. In general, Pitch pRTI was the fastest implementation for most of our experiments. CERTI performed best for big payload sizes and Portico showed an interesting oscillation pattern.
Index Terms—High Level Architecture, HLA, Distributed Simulation, Performance, Middleware, RTI
I. INTRODUCTION
Distributed Simulation (DS) is associated with many advantages such as overcoming memory limits, performance gains, and fault tolerance. The techniques can be applied to build co-simulations, by interconnecting different simulators running on various systems, or to couple simulators with real-world components. The latter can be used for instance to perform Hardware-In-the-Loop (HIL) tests of electronic control units (ECUs) in the vehicular context. Naturally, every application comes with its own requirements. In the case of HIL, a logical simulation clock must be synchronized to the wall clock time. For other use cases, it might be desirable for the simulation to run faster than real time by orders of magnitude.
There are two main approaches to couple simulators (and other components). First, implement a direct connection between the tools. This typically comes with low overhead, but requires manual work for every new topology and module. If more than two components should be connected, the implementation effort increases and, depending on the synchronization concept, the performance could drop.
Second, a dedicated middleware could be used. Usually, the middleware takes care of the delivery of messages between tools and it manages a global simulation clock. The middleware can help to speed up the implementation of new simulation setups by reusing existing modules and providing interoperability. In addition, an established middleware has been thoroughly tested and is therefore more trustworthy than proprietary self-crafted and possibly error-prone solutions.
One of these middleware approaches for DS is the High Level Architecture (HLA). The development of HLA was started in the early 1990s by the US Department of Defense, which led to the major release of HLA 1.3 in 1998. In the late 1990s the steering wheel was handed over to IEEE, which resulted in the first international standard (IEEE 1516-2000) in 2000. In 2010, HLA Evolved (IEEE 1516-2010) followed and is currently the most recent version. At the moment, a subsequent release is being developed. The HLA defines a set of services to be provided, while the underlying communication layer is left up to the middleware implementation.
The performance of the middleware implementation is crucial for the performance of the entire distributed simulation setup. Given the freedom to use and implement different approaches for the lower communication layer, the comparison of different HLA middleware implementations is of interest. This freedom is not only about the wire protocol itself, but also the decision of what data needs to be exchanged, with whom and when [1]. This implies that the performance of different RTIs may be heterogeneous depending on the simulation setup. One implementation may be highly suitable for a simple use case due to its Data Distribution Management (DDM) tweaks, while another might perform better regarding scalability when the scenario gets huge.
However, in order to compare the performance between the different implementations, we constructed four small-scale test cases and executed them for each RTI implementation under test. Hence, this paper focuses on a performance comparison between the four leading HLA RTIs based on prototypical test cases. The rest of the paper is organized as follows: in Section II, more details about the HLA, the RTIs, and related work are given. The experimental setup and the different test cases are described in Section III, while the results are presented in Section IV. Finally, a conclusion and an outlook are given in Section V.
This work is part of the Virtual Mobility World (ViM) project and has been funded by the Bavarian Ministry of Economic Affairs, Regional Development and Energy (StMWi) through the Centre Digitisation.Bavaria, an initiative of the Bavarian State Government.
II. BACKGROUND
A. High Level Architecture
A priori, two important terms have to be clarified: federate and federation. A federate is a single participant in a sim-
the current version of the standard (IEEE 1516-2010). This work should help novices as well as experienced users to rank their own performance measures, choose an RTI, or extend our test suite. The provided code shall also lower the entry barrier into the world of HLA, which can be high, especially due to poor documentation or buggy implementations.
III. EXPERIMENTAL DESIGN
We designed four different experiments in order to compare base cases between the different RTI implementations that were described in Subsection II-B. Therefore, we were looking at the time advance progress, object attribute updates, and ownership transfer in conjunction with object attribute updates. In another experiment, we were interested in the impact of the message size. Consequently, we were sending HLA interactions with varying payload sizes.
The HLA EVOKED callback model was used for all conducted experiments. A minimum waiting time of 1 ms and a maximum waiting time of 10 ms for an evokeMultipleCallback() call was applied during all tests. Naturally, with the different implementations, there comes a variety of different parameters and possibilities for fine-tuning. We will stick to the default parameter set of each RTI for the sake of comparability. In case of MAK, the rtiexec connection was used. For Portico the communication via jgroups was chosen, since our test federates were run in separate processes. Because the data should be received in time stamp order, the reliable communication mode was used.
The first three experiments were run for every available binding, namely C++ and Java. These are available for every middleware except for CERTI, which only ships with a C++ binding. For each run, measurements per action cycle were captured. For each test, the two test federates were once executed on the same machine (Local Run) and another time on two different physical machines (Ethernet Run). The fourth experiment was a pure Ethernet Run. We chose the fastest bindings for each RTI and measured 10^5 cycles for every RTI and payload size.
In order to execute the different experiments, a virtual machine (VM) based on CentOS 7.4 was set up with all implementations. The VMs were hosted via VirtualBox 5.2 on Ubuntu 18.04 machines with an i7-2600 at 3.4 GHz and 8 GB RAM. 4 GB of the RAM and four of the cores were assigned to the virtual machine. The different VMs were connected over a Gigabit Ethernet switch, while the VirtualBox adapter was set to bridge mode.
Figure 2 shows the FOM that was used throughout all the test cases. It contains definitions for an object with a single integer attribute of 64 bit and an interaction with a variable payload (HLAopaqueData). The ordering of both is in Time Stamp Order.
FOM
   objects
      objectClass
         name: HLAobjectRoot
         objectClass
            name: TestcaseObject
            sharing: PublishSubscribe
            attribute
               name: TestcaseObjectAttribute
               dataType: TestInteger64BE
               updateType: Conditional
               ownership: DivestAcquire
               sharing: PublishSubscribe
               transportation: HLAreliable
               order: TimeStamp
   interactions
      interactionClass
         name: HLAinteractionRoot
         interactionClass
            name: TestcaseInteraction
            sharing: PublishSubscribe
            transportation: HLAreliable
            order: TimeStamp
            parameter
               name: Payload
               dataType: HLAopaqueData
The implementation of the test suite and of each test case can be found under [30]. Basically, there is a generic test case with three functions: init(role), step(iteration), and finish(). While the first and the last for instance allow registering or destroying objects, the step function is called iteratively until the test is finished. The duration of each call is measured (see Algorithm 1). The role parameter allows defining a different behavior for the involved federates (e.g., send an interaction before advancing time).
A. Experiments
As mentioned before, the comparison of different RTIs is not straightforward due to the freedom regarding wire protocols and DDM. Therefore, simple but reproducible experiments were designed to get an impression of the RTIs’ performance. Each experiment involves two federates. The roles 0 and 1 were assigned to the two components. Both of them are time regulating and time constrained.
1) Time Synchronization: The two federates are continuously advancing their time with a fixed and common step length (see Algorithm 2). Both federates act identically, thus the role parameter is ignored. The spent wall clock time is measured for each time advance cycle.
Function runTest(testcase,role):
   testcase.init(role);
   while notFinished do
      timer.start();
      testcase.step(iteration++);
      timer.stop();
   end
   testcase.finish();
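The measurement loop above can be sketched in Java as follows. This is a minimal illustration, not the published test suite: the interface TestCase, the class RunTestSketch, and the fixed cycle count that replaces the notFinished condition are assumptions made for the example, and the mean and median helpers are added only because those two statistics are the ones discussed in the results.

```java
import java.util.Arrays;

// Hypothetical test-case contract mirroring init(role), step(iteration) and finish().
interface TestCase {
    void init(int role);
    void step(long iteration);
    void finish();
}

final class RunTestSketch {
    private RunTestSketch() {}

    /** Runs `cycles` steps and returns the wall-clock duration of each step in nanoseconds. */
    static long[] runTest(TestCase testcase, int role, int cycles) {
        long[] durations = new long[cycles];
        testcase.init(role);
        for (int i = 0; i < cycles; i++) {
            long start = System.nanoTime(); // monotonic clock, suited to interval timing
            testcase.step(i);
            durations[i] = System.nanoTime() - start;
        }
        testcase.finish();
        return durations;
    }

    static double mean(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    static double median(long[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        TestCase noop = new TestCase() {
            public void init(int role) { }
            public void step(long iteration) { }
            public void finish() { }
        };
        long[] d = runTest(noop, 0, 1000);
        System.out.println("mean ns: " + mean(d) + ", median ns: " + median(d));
    }
}
```

Reporting the median next to the mean is useful when a few very long cycles occur, since such outliers shift the mean far more than the median.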
e = encodeValue(payload);
parameters.put(e);
rtiamb.sendInteraction(handle, parameters);
delete payload;
advanceTime();
return;
Algorithm 2: Logic of the four experiments.
are connected via a Gigabit-Ethernet switch.
Fig. 5. Round trip (experiment 3). [Plot; axis: Cycle Duration (Milliseconds), ticks 10^0 to 10^2; recovered series labels: MAK JL, Pitch CE, Pitch CL, Pitch JE, Pitch JL, Portico CE, Portico CL, Portico JE, Portico JL]
TABLE II
DURATION OF DIFFERENT EXPERIMENT CYCLES (MILLISECONDS)
The results for Portico are one order of magnitude higher. A time advance cycle took 30 ms on average. The difference to the other three RTIs can probably be explained by the bundling option of jgroups, which is enabled by default with a maximum waiting time of 30 ms. The decentralized approach of Portico 2.1.0 could be another reason for the slower speeds. The different cycle durations are additionally plotted in Figure 6. One can see that the Portico cycles mostly take around 2 ms, around 30 ms, or around 60 ms. Particularly interesting is a look into the time series. Mostly, there are oscillating patterns between 1 ms and 60 ms (Figure 6), which may again be explained by jgroups and the decentralized approach.
Another interesting point is the lower bound in the Portico cases. There are values lying below 1 ms with a minimum value of 0.04 ms. Therefore, the evokeMultipleCallback() minimum waiting time value seems to be ignored. This was also a problem for CERTI 1516-2010, but a patch that fixes this issue is available [31].
When looking at the difference between the local and the physically distributed runs, not much discrepancy and thus overhead can be observed.
[Plot residue: panel title "Time Advance"; axes: Duration (Milliseconds), Number of Measures; recovered series labels: CERTI CE, MAK CE]
B. Experiment 2
The next experiment builds on top of the first experiment. Prior to the time advance, an object attribute (64 bit integer) is updated and reflected in each round by one of the federates. Hence, higher cycle durations, compared to the first experiment, should be observed. This is the case for CERTI CL, but mostly the values remain stable. Even some smaller numbers can be seen (e.g., 42.9 ms in mean for Portico CL in experiment 1 and 41.8 ms in this experiment). In general, the medians are equal or higher (compared to experiment 1), except for the values of Pitch. While the mean values do not differ here, the medians decrease for the Ethernet runs.
C. Experiment 3
Naturally, the time effort increases if we introduce additional ownership management related tasks. The object’s attribute value is alternatingly reflected and updated by both federates. The measures show the time for one turn. For the median, the observed durations are in a range between 1.41 (Pitch JL) and 4.07 (Portico CL) times higher than in experiment 2. Overall, Pitch performs best. It is followed by CERTI and MAK. Even if more messages need to be exchanged in this experiment, the observed maximum values are not necessarily higher than in the previous experiment. However, this is the case for all Portico runs. An overall maximum duration of 2.3 s was observed for the Portico CL run.
D. Experiment 4
A typical payload size depends, among other things, on the application domain, the simulation models, the input parameters, and the distribution topology. The used sizes should cover most of them. Based on the previous experiments, the fastest bindings were used for Ethernet runs in this experiment: CERTI CE, MAK CE, Pitch JE, and Portico JE. The impact of an
the four most well-known HLA RTI implementations. The findings should provide assistance to decide which RTI is suitable for different purposes. Therefore, we provided general information about HLA, designed four typical test cases and discussed the test results. All implementations have their advantages and drawbacks.
Fig. 7. Interactions (experiment 4). [Plot; axes: Payload Size (Bytes), Duration (Milliseconds)]
For different reasons, it is not possible to draw a general conclusion about the single best RTI implementation from our presented experiments. First, the four test cases do not cover all aspects of HLA nor of the different RTIs’ features. We chose a subset of functions that tended to be comparable. Consequently, the RTIs’ default parameters were used. Second, there is more that should be considered than just [...] which might meet other certain requirements. While Pitch pRTI was in general the fastest RTI, in experiment 4 CERTI became slightly faster with an increasing payload size (>64 kB). A complete comparison is not feasible, but with the used and published test suite, the reproduction of results and the extension with additional test cases is easily possible.
Fig. 8. Mean duration of interactions (experiment 4). [Plot; panel title: Interactions; axes: Payload Size (Bytes), Duration (Milliseconds)]
Fig. 9. Portico: multi modal distribution (experiment 4). [Plot; panel title: Portico: Interactions; axes: Duration (Milliseconds), Number of Measures; series: 8 kB, 16 kB, 32 kB and 64 kB Payload]
When contrasting our findings with the existing work on HLA performance [29], it is necessary to take into account the hardware evolution, the different versions of the standard, and the evolved RTI implementations. However, some differences are noteworthy: in our case, CERTI performs best for an interaction with 1 MB payload, while CERTI performs worst in their measurements for a round trip with that payload size. In all other cases, Portico is the slowest RTI, which is
R EFERENCES
consistent with our numbers.
For future research, it would be interesting to involve more [1] L. Granowetter, “RTI interoperability issues–api standards, wire stan-
federates and observe the impact of multicast communication. dards, and RTI bridges,” in Proceedings of the 2003 European Simula-
tion Interoperability Workshop, no. 03S-SIW, 2003, p. 063.
To see the performance of region based DDM strategies in [2] “IEEE standard for modeling and simulation (m s) high level architecture
more complex test scenarios would be another relevant point, (hla)– framework and rules - redline,” IEEE Std 1516-2010 (Revision of
as well as using the HLA IMMEDIATE callback model. When IEEE Std 1516-2000) - Redline, pp. 1–38, 2010.
[3] B. Möller and L. Olsson, “Practical experiences from HLA 1.3 to HLA
version 2.2 of Portico is published, the performance impact of IEEE 1516 interoperability,” 04f-siw-045, www. sisostds. org, 2004.
the central component is another open question. [4] B. Möller, P.-P. Sollin, M. Karlsson, and F. Antelius, “Early experiences
Having base case benchmarks for HLA implementations, a from migrating to the HLA evolved c++ and java apis,” in Spring
Simulation Interoperability Workshop, 2009.
comparison to other distributed simulation middlewares such [5] B. Möller, K. L. Morse, M. Lightner, R. Little, and R. Lutz, “HLA
as DDS (with similar adapted test cases) sounds tempting. evolved–a summary of major technical improvements,” in Proceedings
121
2020 IEEE/ACM 24 ͭ ͪ International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
of 2008 Spring Simulation Interoperability Workshop, 08F-SIW-064, [30] M. Gütlein, “Implementation of experiments,” accessed 27.05.2020.
2008. [Online]. Available: https://github.com/cs7org/HLAPerformance
[6] CERTI, “CERTI project,” accessed 22.05.2020. [Online]. Available: [31] CERTI, “1516e timing patch,” accessed 22.05.2020. [Online]. Available:
https://savannah.nongnu.org/projects/certi/ https://savannah.nongnu.org/bugs/?56284
[7] MÄK, “MÄK RTI,” accessed 22.05.2020. [Online]. Available: https:
//www.mak.com/products/link/mak-rti
[8] Pitch technologies, “Pitch pRTI,” accessed 22.05.2020. [Online].
Available: http://pitchtechnologies.com/products/prti/
[9] Portico, “poRTIco project,” accessed 22.05.2020. [Online]. Available:
http://www.porticoproject.org
[10] E. Noulard, J.-Y. Rousselot, and P. Siron, “Certi, an open source rti,
why and how,” 2009.
[11] MÄK, “Lightweight mode,” accessed 22.05.2020.
[Online]. Available: https://www.mak.com/products/link/mak-rti#
lightweight-mode-supports-rapid-development-and-real-time-federations
[12] Pitch, “Pitch pRTI USER’S GUIDEv 5.4,” 2019.
[13] T. Roth, M. Burns, and T. Pokorny, “Extending portico HLA to feder-
ations of federations with transport layer security,” 2018.
[14] Portico, “Architectural overview of portico,” accessed 22.05.2020. [On-
line]. Available: http://portico.openlvc.org/index.php$?$title=Portico
Architectural Overview
[15] P. Ryan, P. Ross, and W. Oliver, “Distributed interactive simulation
revisited: Capabilities of the revised IEEE standard,” 1994.
[16] R. M. Fujimoto and R. M. Weatherly, “HLA time management and dis,”
in Proceedings of 14th Workshop on Distributed Interactive Simulation,
1996.
[17] Modelica, “Fmi 2.0.1 specification,” accessed 22.05.2020. [Online].
Available: https://github.com/modelica/fmi-standard/releases/download/
v2.0.1/FMI-Specification-2.0.1.pdf
[18] G. Pardo-Castellote, “OMG data-distribution service: architectural
overview,” in 23rd International Conference on Distributed Computing
Systems Workshops, 2003. Proceedings. IEEE, 2003, p. 200206.
[19] T. Nouidui, M. Wetter, and W. Zuo, “Functional mock-up unit for
co-simulation import in energyplus,” Journal of Building Performance
Simulation, vol. 7, no. 3, pp. 192–202, 2014.
[20] M. U. Awais, P. Palensky, A. Elsheikh, E. Widl, and S. Matthias,
“The high level architecture RTI as a master to the functional mock-up
interface components,” in 2013 International Conference on Computing,
Networking and Communications (ICNC), 2013, pp. 315–320.
[21] M. U. Awais, M. Cvetkovic, and P. Palensky, “Hybrid simulation using
implicit solver coupling with HLA and fmi,” International Journal
of Modeling, Simulation, and Scientific Computing, vol. 8, no. 04, p.
1750055, 2017.
[22] Y. Bouanan, S. Gorecki, J. Ribault, G. Zacharewicz, and N. Perry,
“Including in HLA federation functional mockup units for supporting
interoperability and reusability in distributed simulation,” in Proceedings
of the 50th Computer Simulation Conference, ser. SummerSim 18. San
Diego, CA, USA: Society for Computer Simulation International, 2018.
[23] N. Sievert, “Modelica models in a distributed environment using fmi
and hla,” 2016, Thesis. [Online]. Available: http://www.diva-portal.org/
smash/record.jsf?pid=diva2%3A971217&dswid=-6326
[24] L. I. Hatledal, H. Zhang, A. Styve, and G. Hovland, “FMU-proxy: A
Framework for Distributed Access to Functional Mock-up Units,” Feb
2019, p. 7986.
[25] M. Krammer, M. Benedikt, T. Blochwitz, K. Alekeish, N. Amringer,
C. Kater, S. Materne, R. Ruvalcaba, K. Schuch, J. Zehetner, M. Damm-
Norwig, V. Schreiber, N. Nagarajan, I. Corral, T. Sparber, S. Klein, and
J. Andert, “The Distributed Co-Simulation Protocol for the Integration
of Real-Time Systems and Simulation Environments,” in Proceedings
of the 50th Computer Simulation Conference, 2018. [Online]. Available:
https://dl.acm.org/citation.cfm?id=3275383
[26] Y. Park and D. Min, “Development of HLA-DDS wrapper API for
network-controllable distributed simulation,” in 2013 7th International
Conference on Application of Information and Communication Tech-
nologies. IEEE, 2013.
[27] W. Baron, C. Sippl, K.-S. Hielscher, and R. German, “Repeatable
Simulation for Highly Automated Driving Development and Testing,”
in 2020 IEEE 91st Vehicular Technology Conference. IEEE, 2020.
[28] M. Gütlein and A. Djanatliev, “Modeling and simulation as a service
using Apache Kafka,” in Proceedings of the 10th International Con-
ference on Simulation and Modeling Methodologies, Technologies and
Applications, ser. SIMULTECH 2020, 2020.
[29] L. Malinga and W. H. Le Roux, “HLA RTI performance evaluation,”
2009.
Abstract—A detailed computer simulation is an important tool for the managing of road traffic. Since it can be very time consuming, it is often performed in a distributed computing environment. The simulated road traffic network is then divided into sub-networks simulated by processes on the nodes of the distributed computer. The inter-process communication necessary for the vehicle transfer and the synchronization can then significantly influence the performance of the distributed road traffic simulation. In this paper, two efficient communication protocols for distributed road traffic simulation, which we developed during our previous research, are compared. These protocols – the Long Step (LS) protocol and the Long Step Binary (LSB) protocol – reduce the inter-process communication using an aggregate message transfer and/or a lossy data compression. Semi-centralized, centralized, and distributed variants of both protocols were thoroughly tested and compared to a reference communication protocol representing a common protocol of distributed road traffic simulators. The tests indicate significant savings of the number of transferred messages and, more importantly, of the total computation time.

Keywords—road traffic, distributed simulation, communication reduction, aggregate transfer, lossy data compression

I. INTRODUCTION
The road traffic density on highways and especially in cities is steadily increasing. A computer simulation is one of the important tools for the managing of road traffic. It can be used for analysis of existing road traffic networks and improvement of their performance, for prediction of consequences of a road closure, and so on. In order to model the real road traffic situations accurately, the simulation must be very detailed. Very often, multiple simulation runs of a single scenario are required in order to guarantee the fidelity of the results. Because of these two requirements, the road traffic simulation is very time-consuming, especially for large road traffic networks (e.g., large cities or entire states). Hence, some existing road traffic simulators (e.g., [1], [2], [3], [4], [5]) were adapted for a distributed computing environment where the combined computing power of multiple interconnected (single-core or multi-core) computers is used for a faster execution of the simulation. A common approach is that the road traffic network is divided into sub-networks, whose simulations are then performed as processes on individual computers (called nodes) of the distributed computer. Communication links are maintained among the processes to ensure their mutual synchronization and the transfer of vehicles between the neighboring sub-networks.

The communication protocols in many existing distributed road traffic simulators successfully ensure both the synchronization and the transfer of vehicles. Nevertheless, only in some cases was an effort invested into the minimization of the inter-process communication, although it is often the main bottleneck of distributed applications. Hence, we explored possible ways for the reduction of inter-process communication. The results of this research are two efficient communication protocols – the Long Step (LS) and the Long Step Binary (LSB) protocols, whose basic functioning is described in [6] and [7] in detail. There are three variants of both protocols. The comparison of these variants using thorough testing is the main contribution of this paper. The tests were performed using the Distributed Urban Traffic Simulator (DUTS) developed at the Department of Computer Science and Engineering of the University of West Bohemia (DCSE UWB). However, their principles are utilizable for other distributed road traffic simulators as well.

II. DISTRIBUTED ROAD TRAFFIC SIMULATION
In order to make the further reading clearer, the basic notions of road traffic simulation and its distributed version are briefly explained in the following subsections.

A. Important Features of Road Traffic Simulation
The time-flow mechanism determines the way the simulation time is advanced. Two mechanisms are commonly used – the time-stepped one and the event-driven one. Using the former mechanism, the entire simulation state is periodically recomputed [8]. The period (called time step) is usually one second (e.g., in [1], [5], [9]), but this is not the only possible value (e.g., in [10], 0.1 seconds is used). Using the latter mechanism, the simulation state changes by interpreting events. An event incorporates an action (i.e., an incremental change of the simulation state) and a time stamp indicating when the action should be performed. So, the simulation time advances from one time stamp to another [8].

The level of detail of the road traffic simulation determines its fidelity on one side and its speed on another. The macroscopic simulation deals only with aggregated traffic flows in individual roads. These models are very fast and also very old [11]. Both time-flow mechanisms are commonly used. The mesoscopic simulation adds some form of individual vehicles, but modeling of their mutual interactions is limited [12], which makes the mesoscopic simulations also very fast. Again, both time-flow mechanisms are commonly used. In the microscopic simulation, every single vehicle is modeled as an object with its own position, direction, speed, and acceleration.
This work was supported by Institutional support for long-term strategic
development of research organizations.
978-1-7281-7343-6/20/$31.00 ©2020 IEEE
The simulated vehicles drive in roads, change their directions and traffic lanes, and interact with each other. The time-stepped time-flow mechanism is most often used. There are two widely used microscopic traffic models – the cellular automaton model, which divides the traffic lanes into equally-sized traffic cells [13], and the car-following model, which allows a vehicle to be placed at any position in the traffic lane [14]. In both models, the vehicles tend to accelerate to the maximal allowed speed and decelerate when there is an obstacle in their way (e.g., a slower vehicle). Both models also often incorporate random deceleration in order to emulate natural fluctuations of the speed [13], [14]. Since the microscopic simulations model the real traffic at a high level of detail, they are also much more computation-consuming, which is why they are often performed in a distributed computing environment.

B. Important Features of Distributed Road Traffic Simulation
So, the primary reason for the adaptation of a road traffic simulation for a distributed computing environment is its speedup. As noted in Section I, a distributed computer consists of multiple computers (nodes) interconnected by a computer network (usually Ethernet). An example can be ordinary workstations at a university classroom. There is no shared memory among the nodes. So, the only means of communication is the message passing [8]. It should be noted that the nodes of the distributed computer can and often do incorporate multi-core processors. This distributed/parallel computing environment can be used for an additional speedup of the distributed road traffic simulation, for example by employing multi-threaded simulation processes (see [5] for details). However, this does not influence the message passing between the processes, so we will not go into further details in this paper.

The distributed road traffic simulation consists of (possibly multithreaded) processes running on the particular nodes of the distributed computer. So, there are two main issues, which must be solved prior to its execution – how to divide the simulation into processes (i.e., the decomposition) and how these processes will communicate (i.e., the inter-process communication). In the field of road traffic simulation, the spatial decomposition is most common. It is used for example in [1], [3], [4], [5], [15]. Most often, the simulated road traffic network is divided into sub-networks, which are then simulated by particular working processes on the nodes of the distributed computer. The inter-process communication is then necessary for the transfer of vehicles traveling from one sub-network to a neighboring one. A special case of spatial decomposition is the uniform division of vehicles among processes, not the division of road traffic network into sub-networks (see [9] or [16]). There are also some examples of utilization of the temporal decomposition, which divides the road traffic simulation run into time intervals [17], and the task parallelization, which divides the simulation program into modules [18]. However, these two decompositions are quite rare and we will consider only the road traffic network division further in the text.

The inter-process communication ensured by a communication protocol is necessary primarily for the transfer of vehicles and lane-blocks between the neighboring sub-networks in traffic lanes crossing the boundary between two sub-networks (so called divided lanes). The lane-blocks indicate that the traffic lane has become congested or passable again. These messages travel in the direction opposite to the vehicles and can significantly vary in their form (from a simple bit indication [19] to complex information about the traffic situation near the end of the traffic lane [1], [20]).

To ensure that the transferred vehicles and lane-blocks arrive at the neighboring sub-networks in a correct time step, all simulation processes must perform the same time step at the same moment. Usually, the processes are synchronized using a barrier. This barrier can be provided by a central (control) process [1], [5], [21], [22] or can be distributed among the working processes [15], [21], [23]. In both cases, the synchronization mechanism requires additional messages to be sent among the processes. In case of a centralized barrier, the synchronization messages are transferred between the control process and the working processes. In case of a distributed barrier, the synchronization messages are transferred only among the working processes. The vehicles and lane-blocks are often transferred directly between the working processes simulating the neighboring sub-networks [1], [15], [23]. An alternative is to transfer them via the control process [2], [6], [22].

III. RELATED WORK
Since our communication protocols are focused on the inter-process communication reduction, related papers at least partially dealing with the minimization of the inter-process communication are mentioned in the following subsections.

A. Reduction of Sent Messages Count
One way to reduce the inter-process communication is to reduce the sent messages count. This approach is used in [23]. There, a distributed synchronization is performed by exchanging synchronization messages between the processes simulating neighboring sub-networks (“neighboring processes” for short) only. This “semi-optimistic” approach requires a lower messages count than the “all-to-all” approach used for example in [15], where each process sends a synchronization message to all other processes. Also, the vehicles and lane-blocks are transferred via the synchronization messages [23].

A similar distributed synchronization between the neighboring processes only is described in [21] together with the classical master-slaves synchronization. In both synchronization types, the vehicles and lane-blocks are sent directly between the neighboring sub-networks [21].

The transfer of vehicles and lane-blocks via the synchronization messages is utilized also in [22]. However, the synchronization is a centralized one with a master process controlling the transfer of all data in the distributed simulation [22].

The messages count is also reduced in [10], where the edges of neighboring sub-networks have overlapping regions acting as buffers for vehicles. The contents of the buffers are exchanged between the neighbors only once per 300 time steps (1 time step is 0.1 seconds) and the sizes of the buffers are set based on this time and the maximal speed of 12 mps. The synchronization is performed during the exchange of buffers only [10]. This is possible, since when there is no transfer of vehicles, no simulation inconsistencies can arise and no synchronization is needed.
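As a rough illustration of how the buffer sizing described for [10] might follow from the quoted figures, the sketch below multiplies the exchange interval by the maximal speed. The function name and the exact formula are assumptions made here for illustration and are not taken from [10]; "mps" is read as metres per second.

```python
def overlap_buffer_length(steps_between_exchanges: int,
                          time_step_s: float,
                          max_speed_mps: float) -> float:
    """Distance a vehicle driving at the maximal speed covers between
    two successive buffer exchanges (assumed sizing rule, see lead-in)."""
    if steps_between_exchanges <= 0 or time_step_s <= 0 or max_speed_mps < 0:
        raise ValueError("interval and time step must be positive, speed non-negative")
    exchange_interval_s = steps_between_exchanges * time_step_s
    return exchange_interval_s * max_speed_mps


if __name__ == "__main__":
    # Figures quoted in the text: 300 time steps of 0.1 s, maximal speed 12 mps.
    length = overlap_buffer_length(300, 0.1, 12.0)
    print(f"exchange interval: {300 * 0.1:.0f} s, buffer length: {length:.0f} m")
```

Under this reading, a vehicle cannot leave the overlapping region within one exchange interval, which is consistent with the statement that no synchronization is needed between the exchanges.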
V. COMMUNICATION PROTOCOLS
The LS and LSB communication protocols, which we developed, reduce the inter-process communication using two approaches – the reduction of the number of sent messages (see Section III.A) and the reduction of the amount of transferred data [7]. The basic functioning of the protocols was described in [6] and [7], respectively. Each protocol has three variants, which are described in Section V.B and Section V.C, respectively. The reference protocol representing commonly used protocols is described in Section V.A.

A. Reference Communication Protocol
The reference communication protocol – the Semi-centralized single vehicle (SC-SV) protocol – does not employ any inter-process communication reduction. The protocol is lossless and does not introduce any error into the simulation.

The “semi-centralized” attribute of the protocol means that a centralized barrier provided by a control process is utilized for the synchronization and the vehicles and lane-blocks are transferred directly between the neighboring working processes. This is a quite common approach utilized for example in [1], [21], [22]. There are two synchronization messages per working process per time step – the notification sent from a working process to the control process to indicate the finish of the current time step computation and the permission sent from the control process to all working processes once all notifications were received, enabling them to continue with the next time step.

The “single vehicle” attribute of the protocol means that a single vehicle (or a lane-block) is transferred in a single message. This is used for example in [15]. So, the total number of transferred vehicle/lane-block messages corresponds to the number of transferred vehicles and lane-blocks and can vary in individual steps. The total number of messages sent per one time step by the SC-SV protocol can be expressed as:

$$M_{SC\text{-}SV} = 2P + \sum_{i=1}^{P} (V_i + L_i), \tag{1}$$

where $P$ is the number of working processes, $V_i$ is the number of vehicles sent by the $i$-th working process in the time step, and $L_i$ is the number of lane-blocks sent by the $i$-th working process in the time step.

The implementation of the SC-SV protocol in the DUTS system works as follows (see Fig. 3). The vehicles and lane-blocks are transferred using the terminator-generator pairs (see Section IV.C). When a vehicle reaches the terminator, it is removed from the lane (a) and its parameters (e.g., speed, length, etc.) are packed into a message (b). This message is sent to the neighboring working process using the established communication link where it is forwarded to the corresponding generator (c). The generator unpacks the message, creates a new vehicle with the received parameters, and inserts it into the traffic lane (d). When a generator cannot insert a new vehicle to its congested traffic lane (e), it creates a lane-block indicating congestion and packs it into a message (f). This message is sent to the neighboring working process and forwarded to the corresponding terminator (g). The terminator then becomes not passable, which means that the vehicles incoming to it are stopped (h). When the congestion in the generator’s traffic lane is over, the generator creates a lane-block indicating that the traffic lane is clear and sends it to the terminator, which then becomes passable again.

Fig. 3. The functioning of the SC-SV protocol

The described implementation of the SC-SV protocol is usable by all three traffic models. No modifications are needed.

B. Long Step Protocol
The Long Step (LS) protocol utilizes the aggregation of transferred vehicles and lane-blocks for a significant reduction of the number of transferred messages. The vehicles and lane-blocks are aggregated both spatially and temporally. The spatial aggregation means that the vehicles and lane-blocks from multiple traffic lanes are sent in one message. For example, the outgoing vehicles and lane-blocks from all lanes leading from a single sub-network to a single neighboring sub-network are transferred in one message instead of being transferred separately [6]. The specifics of the aggregation depend on the variant of the LS protocol (see below).

The temporal aggregation means that the vehicles and lane-blocks from multiple time steps are sent in one message regularly once per several time steps. The number of time steps between two successive transfers of vehicles and lane-blocks is designated as the long step. This is possible, because the movement of the vehicles in a single traffic lane is affected only by the vehicles themselves. Thus, the vehicles in a single lane, which shall be transferred to a neighboring sub-network throughout the long step, are stored in a buffer. After the long step period has elapsed, the entire content of the buffer is transferred at once to the neighboring sub-network. The lane-blocks are also sent once per long step and have a different form. Instead of simply indicating that a lane has become congested or passable, they carry information about the available space in the traffic lane. Since the synchronization of the working processes is necessary only because of the transfer of vehicles and lane-blocks, it is performed only once per long step together with the transfer of vehicles and lane-blocks itself [6].

As follows from the previous paragraphs, the protocol is lossless, since it transfers all information about all vehicles and lane-blocks. The combination of both the spatial and the temporal aggregation means that the number of messages sent per time step of the LS protocol is very low. Nevertheless, the exact number depends on the LS protocol variant – the semi-centralized (SC), centralized (C), and distributed (D) one (see Fig. 4).

The SC-LS variant utilizes a centralized barrier provided by the control process for the synchronization and the aggregated vehicles and lane-blocks are transferred directly between the neighboring working processes (see Fig. 4a). Similarly to the
$$M_{C\text{-}LS} = \frac{2P}{T_{LS}}, \tag{3}$$
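Equations (1) and (3) can be evaluated with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, and T_LS is assumed to denote the long step length in time steps, since its definition does not appear in the surviving text.

```python
from typing import Sequence


def messages_sc_sv(vehicles: Sequence[int], lane_blocks: Sequence[int]) -> int:
    """Equation (1): messages sent in one time step by the SC-SV protocol.

    vehicles[i] and lane_blocks[i] are V_i and L_i of the i-th working
    process; the number of working processes P is the sequence length.
    """
    if len(vehicles) != len(lane_blocks):
        raise ValueError("one V_i and one L_i are needed per working process")
    processes = len(vehicles)
    # 2P synchronization messages plus one message per vehicle or lane-block.
    return 2 * processes + sum(v + l for v, l in zip(vehicles, lane_blocks))


def messages_c_ls(processes: int, long_step: int) -> float:
    """Equation (3): average messages per time step for the C-LS variant,
    assuming T_LS is the long step length in time steps."""
    if processes <= 0 or long_step <= 0:
        raise ValueError("P and T_LS must be positive")
    return 2 * processes / long_step


if __name__ == "__main__":
    # Hypothetical time step: two working processes sending 3 and 1 vehicles
    # and 0 and 1 lane-blocks, respectively.
    print(messages_sc_sv([3, 1], [0, 1]))
    # Hypothetical long step of 10 time steps for the same two processes.
    print(messages_c_ls(2, 10))
```

Unlike Equation (1), Equation (3) contains no term depending on the number of transferred vehicles or lane-blocks, so under this reading the C-LS messages count depends only on P and T_LS.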
TABLE I. MESSAGES COUNT DEPENDENT ON DIVIDED LANES COUNT
            Number of transferred messages
Protocol    Lanes count:  2      4      8      16     32
SC-SV                     4353   4699   5083   6568   7954
SC-LS                     750    750    750    750    750
C-LS                      500    500    500    500    500
D-LS                      250    250    250    250    250
SC-LSB                    750    750    750    750    750
C-LSB                     500    500    500    500    500
D-LSB                     250    250    250    250    250

TABLE II. MESSAGES COUNT DEPENDENT ON THE VEHICLE DENSITY
            Number of transferred messages
Protocol    Vehicle density:  0.05   0.10   0.15   0.20   0.25
SC-SV                         4283   4440   4527   4653   4738
SC-LS                         750    750    750    750    750
C-LS                          500    500    500    500    500
D-LS                          250    250    250    250    250
SC-LSB                        750    750    750    750    750
C-LSB                         500    500    500    500    500
D-LSB                         250    250    250    250    250
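The relative savings implied by Table I can be checked directly. The following sketch recomputes the reduction of the messages count of the LS variants against the SC-SV reference; the helper names are introduced here for illustration only, and the LSB rows are omitted because their counts in Table I equal those of the corresponding LS variants.

```python
from typing import Dict, List

LANES = [2, 4, 8, 16, 32]

# Values copied from Table I.
TABLE_I: Dict[str, List[int]] = {
    "SC-SV": [4353, 4699, 5083, 6568, 7954],
    "SC-LS": [750] * 5,
    "C-LS": [500] * 5,
    "D-LS": [250] * 5,
}


def reduction_percent(reference: int, reduced: int) -> float:
    """Relative saving of `reduced` against `reference`, in percent."""
    return 100.0 * (1.0 - reduced / reference)


def reductions(table: Dict[str, List[int]],
               reference: str = "SC-SV") -> Dict[str, List[float]]:
    """Per-column reduction of every protocol against the reference row."""
    ref_row = table[reference]
    return {
        name: [reduction_percent(r, v) for r, v in zip(ref_row, row)]
        for name, row in table.items()
        if name != reference
    }


if __name__ == "__main__":
    print("lanes  " + " ".join(f"{n:>6d}" for n in LANES))
    for name, row in reductions(TABLE_I).items():
        print(f"{name:<6} " + " ".join(f"{x:5.1f}%" for x in row))
```

For 32 divided lanes, the D-LS row yields a reduction of roughly 97 % against SC-SV, and the saving grows with the lanes count because only the SC-SV count increases.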
Fig. 14. The division of road traffic network used for processes count testing
TABLE IV. MESSAGES COUNT DEPENDENT ON THE PROCESSES COUNT
            Number of transferred messages
Protocol    Working processes count:  2      4       8
SC-SV                                 6429   12928   26686
SC-LS                                 750    2000    4500
C-LS                                  500    1000    2000
D-LS                                  250    1000    2500
SC-LSB                                750    2000    4500
C-LSB                                 500    1000    2000
D-LSB                                 250    1000    2500

VII. CONCLUSION
In this paper, we described two efficient communication protocols for the distributed road traffic simulation – the Long Step protocol (LS) and the Long Step Binary protocol (LSB), each with three variants. The protocols were thoroughly tested and compared to a reference communication protocol (SC-SV), which represents protocols commonly used in existing distributed road traffic simulators. The results indicate that the developed protocols are able to reduce the number of transferred messages by up to 97 % in comparison to the reference communication protocol. The computation time is reduced by up to 58 %. Using these protocols for simulation of large road traffic networks, it is possible to achieve a noticeable speedup. For four working processes performed on four nodes of the distributed computer, we achieved speedup up to 3.64. For eight simulation processes, the speedup of 5.91 was achieved.

In our future work, we will focus on further improvements of the communication protocols. We will also investigate the possibility of the utilization of the developed protocols for other (i.e., non-road-traffic) distributed simulations.

REFERENCES
[1] K. Nagel and M. Rickert, “Parallel Implementation of the TRANSIMS Micro-Simulation,” Parallel Computing, vol. 27, No. 12, 2001, pp. 1611–1639.
[2] D. Igbe, N. Kalantery, S. Ijaha, and S. Winter, “An Open Interface for Parallelization of Traffic Simulation,” in Proceedings of the Seventh IEEE International Symposium on Distributed Simulation and Real-Time Applications (DS-RT’03), October 2003, pp. 158–163.
[3] D. Wei, W. Chen, and X. Sun, “An Improved Road Network Partition Algorithm for Parallel Microscopic Traffic Simulation,” in 2010 International Conference on Mechanic Automation and Control Engineering, June 2010, pp. 2777–2782.
[4] Y. Xu and G. Tan, “An Offline Road Network Partitioning Solution in Distributed Transportation Simulation,” in 2012 IEEE/ACM 16th International Symposium on Distributed Simulation and Real Time Applications – DS-RT 2012, October 2012, pp. 210–217.
[5] T. Potuzak, “Distributed-Parallel Road Traffic Simulator for Clusters of Multi-core Computers,” in 2012 IEEE/ACM 16th International Symposium on Distributed Simulation and Real Time Applications – DS-RT 2012, October 2012, pp. 195–201.
[8] R. M. Fujimoto, Parallel and Distributed Simulation Systems, John Wiley & Sons, New York, 2000.
[9] Z. Fu, J. Yu, and M. Sarwat, “Demonstrating GeoSparkSim: A Scalable Microscopic Road Network Traffic Simulator Based on Apache Spark,” in SSTD '19: Proceedings of the 16th International Symposium on Spatial and Temporal Databases, August 2019, pp. 186–189.
[10] B. Jiang and H. Zhang, “Realization of Distributed Traffic Simulation System with SCA and SDO,” in 2009 Second International Conference on Future Information Technology and Management Engineering, December 2009, pp. 222–225.
[11] M. J. Lighthill and G. B. Whitham, “On kinematic waves II: A theory of traffic flow on long crowded roads,” in Proceedings of the Royal Society of London, Series A. Mathematical and Physical Sciences, vol. 229, No. 1178, 1955, pp. 317–345.
[12] W. Burghout, Hybrid microscopic-mesoscopic traffic simulation, Doctoral thesis, Royal Institute of Technology, Stockholm, 2004.
[13] K. Nagel and M. Schreckenberg, “A Cellular Automaton Model for Freeway Traffic,” Journal de Physique I, 2, 1992, pp. 2221–2229.
[14] P. G. Gipps, “A behavioural car following model for computer simulation,” Transp. Res. Board, 15-B(2), 1981, pp. 403–414.
[15] R. Klefstad, Y. Zhang, M. Lai, R. Jayakrishnan, and R. Lavanya, “A Scalable, Synchronized, and Distributed Framework for Large-Scale Microscopic Traffic Simulation,” in The 8th International IEEE Conference on Intelligent Transportation Systems, 2005, pp. 813–818.
[16] M. Mastio, M. Zargayouna, G. Scemama, and O. Rana, “Two distribution methods for multiagent traffic simulations,” Simulation Modelling Practice and Theory, vol. 89, 2018, pp. 35–47.
[17] T. Kiesling and J. Lüthi, “Towards Time-Parallel Road Traffic Simulation,” in Proceedings of the Workshop on Principles of Advanced and Distributed Simulation (PADS’05), 2005, pp. 7–15.
[18] N. Cetin, A. Burri, and K. Nagel, “A Large-Scale Agent-Based Traffic Microsimulation Based on Queue Model,” in Proceedings of 3rd Swiss Transport Research Conference, 2003.
[19] T. Potuzak and P. Herout, “Use of Distributed Traffic Simulation in the JUTS Project,” in Proceedings of EUROCON 2007, September 2007, pp. 2250–2255.
[20] Y. Xu, V. Viswanathan, and W. Cai, “Reducing Synchronization Overhead with Computation Replication in Parallel Agent-Based Road Traffic Simulation,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, No. 11, 2017, pp. 3286–3297.
[21] K. Ramamohanarao, H. Xie, L. Kulik, S. Karunasekera, E. Tanin, R. Zhang, and E. B. Khunayn, “SMARTS: Scalable Microscopic Adaptive Road Traffic Simulator,” ACM Transactions on Intelligent Systems and Technology, vol. 8, No. 2, Article 26, 2016, pp. 1–22.
[22] M. S. Ahmed and M. A. Hoque, “Partitioning of Urban Transportation Networks Utilizing Real-World Traffic Parameters for Distributed Simulation in SUMO,” in 2016 IEEE Vehicular Networking Conference (VNC), December 2016.
[23] A. Ventresque, Q. Bragard, E. S. Liu, D. Nowak, L. Murphy, G. Theodoropoulos, and Q. Liu, “SParTSim: A Space Partitioning Guided by Road Network for Distributed Traffic Simulations,” in 2012 IEEE/ACM 16th International Symposium on Distributed Simulation and Real Time Applications – DS-RT 2012, October 2012, pp. 202–209.
[24] Y. Xu, H. Aydt, and M. Lees, “SEMSim: A Distributed Architecture for Multi-scale Traffic Simulation,” in 2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation, July 2012, pp. 178–180.
[25] D. Hartman, “Leading Head Algorithm for Urban Traffic Model,” in Proceedings of the 16th International European Simulation Symposium ESS, pp. 297–302, 2004.
[6] T. Potuzak and P. Herout, “An Efficient Communication Protocol for Distributed Traffic Simulation: Introduction of the Long Step Method,”
in Sofsem 2009: Theory and Practice of Computer Science, Proceedings, [26] P. T. R. Wang and W. P. Niedringhaus, “Distributed/Parallel Traffic
Volume II, January 2009, pp. 72-83. Simulation for IVHS Application,” in Proceedings of the 25th Winter
[7] T. Potuzak, “Distributed and Centralized Version of an Efficient Conference on Simulation, 1993, pp. 1225-1230.
Communication Protocol for Distributed Traffic Simulation,” in
International Conference on Computer Modelling and Simulation
(CSSim 2009), September 2009, pp. 259-264.
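The speedups reported in the conclusions above (3.64 with four processes, 5.91 with eight) can be put in perspective with parallel efficiency, commonly defined as the speedup divided by the number of processes. This metric is not used in the paper itself; the short Python sketch below is only an illustrative assumption, and the helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return speedup divided by the number of workers (1.0 would be ideal scaling)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Speedups quoted in the conclusions: number of processes -> measured speedup.
reported = {4: 3.64, 8: 5.91}

for workers, speedup in reported.items():
    eff = parallel_efficiency(speedup, workers)
    print(f"{workers} processes: speedup {speedup:.2f}, efficiency {eff:.3f}")
```

Under this definition the four-process run retains an efficiency of about 0.91 and the eight-process run about 0.74, which is one way to read the growing relative cost of communication as processes are added.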
2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
Simulating Heterogeneous Models on Multi-Core Platforms using Julia’s Computing Language Parallel Potential
Abstract—This paper addresses the questions of high-level system modelling using heterogeneous multi-tool modelling environment on parallel multi-core processing systems for simulation acceleration. The modelling technique has been applied for high-level validation of a high-precision Indoor Positioning System for Motion Analysis (IPSMA) developed in the Central Institute Electronic Systems (ZEA-2) of the Research Center Juelich GmbH. The heterogeneous modelling environment has been built using an implementation-level model designed in Matlab/Simulink, a verification model for describing the system environment using Modelica language, and Julia language for automatic generation of binding modelling environment and parallelizing the simulation of the overall model. The approach showed a good flexibility in system description and verification in the multi-instrument modelling environment and a good performance gain due to simulation parallelism.
Keywords—distributed simulation, high level system modelling, functional verification, Julia computing language, Modelica, parallelized simulation
I. INTRODUCTION
Modern computing and controlling systems are complex and require comprehensive multi-aspect modelling from the higher abstraction level down to implementation and manufacturing. This top-down approach allows preventing costly design error at the early stages of development.
Functional description of the system under design (SuD) is one of the key high-level aspects in development process. Describing the system functionality is normally carried out by computer-executable modelling using language-based or diagram-based tools, which can offer a high level of abstraction. This modelling mimics interaction of the system with the outer environment via the system’s interaction points (i.e., sensors and actuators) and the reaction processes in the system to the received outer stimuli. The processes at the interaction points are often of a physical nature. This is why they can be best described using the language of mathematics. This high-level mathematical modelling poses special requirements to the tools, demanding the elegance with which the mathematical abstraction can be expressed.
The complexity of the modern systems poses a special demand to the speed of model simulation. The higher speed can be achieved either by increasing the performance of a single processor or by parallelizing the execution of a model. Due to a natural limitation of the modern technology, the latter is often preferable nowadays if a good scalability of a model is achievable. However, due to the complexity in organizing parallel computation, the modelers often neglect this opportunity, thus posing a special demand to the modern modelling means in organizing the parallel computing as seamless as possible. On the other hand, system-level design often requires the cooperation of specialists from different scientific and application fields. Often these team members cannot use a unified toolset due to specific demands on their professional areas. Additionally, the variety in the toolset may be caused by proprietary reasons. These reasons include acquired third-party intellectual property cores, licensing policy or costs for commercial tools, legacy in previously developed reusable codebase within the team or a scientific society. Thus, system modelling requires special instruments for creating heterogeneous modelling environment and poses high demand on interoperability between the tools in the toolset.
The modelling procedure presented in this paper addresses these challenges by building a heterogeneous modelling environment that combines two popular modelling instruments, Mathworks Simulink and Modelica, by binding them with Julia computing language that specially addresses the problematics of tool interoperability and intensive exploiting of computing parallelism.
We applied this approach in the development of a high-precision radio-frequency-based Indoor Positioning System for human Movement Analysis (IPSMA) for verifying the specially designed integrated circuit (IPS-RF IC) as the most
critical part of the IPSMA. The IC development concentrated on the enhancement of the previously realized IPS concept [2] following a new method for position determination [1].
In the next sections we briefly give an overview of the application background and the IPS-RF IC design, followed by a short description of the used toolset and subsequent explanation of the heterogeneous modeling environment implementation based on these tools. The results and outlook are given in the last section.
II. APPLICATION BACKGROUND
Fig. 1. Illustration of the Indoor Positioning System for Movement Analysis (IPSMA) comprising the Base Station (BS) defining the Monitoring Space (MS), the Mobile Devices (MD) determining the TDoA information, and the Processing Unit for reconstructing the positions of the MD based on TDoA data.
The IPSMA (Fig. 1) tracks human movements in the Monitoring Space (MS) of … with spatial and temporal resolution of … mm and … ms in real time. For reconstruction of the limb movements, up to … miniature Mobile Devices (MD) attached to the limbs can be tracked simultaneously. Position estimation is based on the Time Difference of Arrival (TDoA) of virtual events coded in high-frequency radio-wave signals (… GHz) from … to … Base Stations (BS) to the MD. Binary Phase-Shift Keying (BPSK) with … MHz modulation as a bipolar clock signal is used for coding. An MD receives the signals from the BSs through the IPS-RF IC. The IC generates eight data streams associated with the BSs, from which seven TDoA streams are calculated. The IPS-RF IC within the MD acts as a signal preprocessor. The MD transmits the temporarily sorted TDoA packages to the stationary Processing Unit (PU) over Bluetooth (BT). The position reconstruction is performed in the PU.
The central element of the IPS-RF IC is a DSP hardware module for on-the-fly extraction of the TDoA information from the received superposed BS signals. The front-end preprocesses the analog data by amplifying and down-converting it to near-zero frequency, filtering out noise and blocker signals, and ADC sampling to supply the DSP (Fig. 2) with a digitized IQ data stream. The DSP includes a comb filter for separating the BS channels and two blocks for shifted undersampling of the data and extracting amplitude and phase information resulting in the TDoA information […]. The backend faces another digital hardware block for further data packaging and data streaming to a BT transmitter. As the most challenging data-processing element of the IPS-RF IC design, the DSP module requires special care in functional verification.
Fig. 2. Scheme of the DSP hardware module within the IPS-RF […].
III. APPLIED TOOLSET
A. Mathworks Simulink
Simulink is a graphical block-based modelling tool built on top of the MATLAB engine. It is broadly used for model-based design of automatic control and digital signal processing systems. The models are primarily built via a graphical interface by creating structural diagrams from a customizable set of functional block libraries. Simulink has a number of add-ons for automated transition from a virtual prototype to an implementation model by generating either production-level C code or a synthesizable HDL description used in the hardware chip design.
B. Modelica Modelling Language
Modelica is a component-oriented declarative multi-domain free modelling language developed by the Modelica Association [3]. The language is designed for describing complex systems’ dynamics in continuous and discrete time domain, having an intrinsic notion of modelling time and structure. Systems are represented as a structural hierarchical network of connected functional components, describing the element dynamics and interaction with the outer elements. The language and a vast number of domain-specific libraries enable cross-domain modelling in one complex model. It supports both textual and structural diagram entry, combining flexibility and speed of model creation. The language exploits the concepts of object-oriented programming by offering mechanisms of component class inheritance, parameterized component instantiation, conditional instantiation, and instance replication. The dynamics of a component is described by a system of acausal equations similar to pure functional languages, as well as by elements of imperative programming. For inter-component communication via connections, the language relies on the concept of quantity flow and conservation laws, making the bounds between the components bidirectional.
C. Julia Computing Language
Julia is a universal language for scientific computation relying on the principle of functional programming [4]. Julia adapts on-the-fly computation via a JIT mechanism and exploits hardware benefits seamlessly, relying on the Low-Level Virtual Machine (LLVM) for hardware abstraction. Its scripting capability enables
For the purpose of parallel simulation of the overall DSP model, a multi-core processing system is found to be sufficient. The processing system can be either a distributed cluster or a single multi-core processor: parallel processing code does not need any modification for both system types, except for a preliminary network configuration procedure for the cluster system. Meanwhile, the overhead of a big data transfer over a network connection in case of a cluster can be critical in comparison to organizing data channels in a unified memory access architecture in a single multi-core machine. Thus, the latter is found to be sufficient for the current validation procedure.
Creating the specific simulation environment can be seen in two aspects: developing the infrastructure for specific data type handling and analysis, and integrating the two models into the environment using scripting capability of Julia language. The scripting capability of the language is used to create a reusable code for calling the generated C implementation of the DSP block signal-processing chain. The script automatically generates a Julia function as a wrapper for a Simulink model and invokes the model behavior. The same is used for invoking the Verification Model functionality.
V. RESULTS AND OUTLOOK
The starting point of the presented work was the verification model completely implemented in Mathworks Simulink that delivered the simulation results of a … s trajectory in the range of dozens of hours computation time, thus hindering extensive and
REFERENCES
[1] Y. Yao, S. van Waasen, R. Xiong, M. Schiek, “Method and device for position determination,” U.S. Patent Application No. 16/330,768.
[2] R. Xiong, S. van Waasen, C. Rheinländer, N. Wehn, “Development of a novel indoor positioning system with mm-range precision based on RF sensors network,” IEEE Sensors Letters, …, pp. ….
[3] P. Fritzson, V. Engelson, “Modelica—A unified object-oriented language for system modelling and simulation,” in European Conference on Object-Oriented Programming, Springer, Berlin, Heidelberg, …, pp. ….
[4] J. Bezanson, S. Karpinski, V. B. Shah, A. Edelman, “Julia: A fast dynamic language for technical computing,” arXiv preprint arXiv:….
[5] The Julia Project, “The Julia Language Manual: Multi-processing and Distributed Computing.” [Online]. Available: https://docs.julialang.org/en/v…/manual/distributed-computing [Accessed: … Aug. …].
[6] T. Besard, C. Foket, and B. De Sutter, “Effective extensible programming: unleashing Julia on GPUs,” IEEE Transactions on Parallel and Distributed Systems, ….
[7] S. Danisch, “An Introduction to GPU Programming in Julia.” [Online]. Available: https://nextjournal.com/sdanisch/julia-gpu-programming [Accessed: … Aug. …].
[8] The Julia Project, “The Julia Language Manual: Metaprogramming.” [Online]. Available: https://docs.julialang.org/en/v…/manual/metaprogramming [Accessed: … Aug. …].
[9] Wikipedia, “Functional Mock-up Interface.” [Online]. Available: https://en.wikipedia.org/wiki/Functional_Mock-up_Interface [Accessed: … Aug. …].
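As an illustration of the signal chain described in Section II, where the IC produces eight data streams (one per base station) from which seven TDoA streams are calculated, the following Python sketch derives TDoA values by subtracting a reference stream from each of the others. The choice of the first stream as reference, the function name `tdoa_streams`, and the sample arrival times are assumptions made for this example only; the paper does not specify them.

```python
from typing import List, Sequence


def tdoa_streams(arrival_times: Sequence[Sequence[int]], ref: int = 0) -> List[List[int]]:
    """Subtract the reference stream from every other stream, sample by sample.

    arrival_times[k][j] is the (hypothetical) arrival time at the mobile device
    of event j sent by base station k. N input streams yield N - 1 TDoA streams.
    """
    if not arrival_times:
        raise ValueError("need at least one stream")
    n_samples = len(arrival_times[0])
    if any(len(stream) != n_samples for stream in arrival_times):
        raise ValueError("all streams must have the same number of samples")
    reference = arrival_times[ref]
    return [
        [t - t_ref for t, t_ref in zip(stream, reference)]
        for k, stream in enumerate(arrival_times)
        if k != ref
    ]


# Made-up arrival times (in nanoseconds) of two events from eight base stations.
arrivals_ns = [[100 + 3 * k, 1100 + 3 * k] for k in range(8)]
tdoa = tdoa_streams(arrivals_ns)
print(len(tdoa))  # number of TDoA streams
print(tdoa[0], tdoa[-1])
```

With eight input streams the sketch returns seven difference streams, matching the eight-to-seven relation stated in the text; any common clock offset of the mobile device cancels in the subtraction.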
Abstract—The ever-growing advances in science and technology have led to a rapid increase in the complexity of most engineered systems. Cyber-physical Systems (CPSs) are the result of this technology advancement that involves new paradigms, architectures and functionalities derived from different engineering domains. Due to the nature of CPSs, which are composed of many heterogeneous components that constantly interact with one another and with the environment, it is difficult to study, explain hypotheses and evaluate design alternatives without using Modeling and Simulation (M&S) approaches. M&S is increasingly used in the CPS domain with different objectives; however, its adoption is not easy and straightforward but can lead to pitfalls that need to be recognized and addressed. This paper identifies some important pitfalls deriving from the application of M&S approaches to the CPS study and presents remedies, which are already available in the literature, to prevent and face them.
Index Terms—Modeling and Simulation, Pitfalls, Cyber Physical Systems
from the general M&S lifecycle, the paper identifies some important pitfalls deriving from its application to CPS and presents remedies, which are already available in the literature, to prevent and face them.
The rest of the paper is structured as follows. Section II provides an introduction to the essential concepts of the research domain. Section III presents some important pitfalls deriving from the application of M&S to support the design, study, and development of CPSs. In Section IV, for each identified pitfall, a set of remedies already available in the literature for addressing it is presented. Finally, conclusions are presented in Section V.
II. MODELING AND SIMULATION OF CYBER PHYSICAL SYSTEMS
III. PITFALLS IN MODELING AND SIMULATION OF CPSs
Fig. 1. Modeling & Simulation lifecycle.
to achieve, also because it involves many stakeholders from different research domains with multiple perspectives on the system [9]. Throughout the “Requirement Elicitation” step, there are several pitfalls that can lead to incorrect requirements: (i) Objectives not well-defined: CPSs are designed to carry out their activities by continuously interacting with the environment in which they operate. When the CPS objectives are not well-defined or stakeholders lose sight of their range of action, the requirements will be too general, leaving out the essential functions in favour of the unnecessary ones; (ii) Inconsistent information: during the entire M&S lifecycle, researchers are often involved in collecting information on the CPSs under study. When the elicitation approaches used are unable to capture all the CPS details, it becomes difficult to classify, prioritize (i.e., by level of risk, difficulty, and cost), and harmonize the often conflicting stakeholders’ needs and objectives; and (iii) Excess information: eliciting requirements in long text-based documents leads to confusion about the CPS objectives and makes it difficult for researchers to identify missing components and environmental constraints.
c) Precision and Accuracy: Inaccuracy and imprecision arise due to the hybrid nature of CPSs, which involve both continuous and discrete dynamics [10]. In some cases, the mathematical equations used to describe the continuous behaviors are simple; thus, simulation results can be computed without any numerical approximation. However, in most cases, the continuous behaviors are complex and the mathematical equations also involve Partial Differential Equations (PDE) and/or Integral Equations (IE), which cannot be solved exactly but only through numerical approximations. Other sources of errors are related to the interactions between the continuous and discrete dynamics, which may lead to Zeno executions. The Zeno phenomenon occurs when the system undergoes an unbounded number of discrete transitions in a finite and bounded length of time [11]. This phenomenon can crash the simulation execution, makes the simulation results inaccurate, and leaves the system behaviors fundamentally ill-defined beyond the Zeno point.
d) Complexity: In the “Problem Definition” step, complexity pitfalls may arise in the identification of the system boundary and in the definition of the research questions. In the “Requirement Elicitation” step, complexity pitfalls may occur in the capture and management activities, such as ambiguity, multiple requirements, and undefined terms [12]. Upon delineating the system boundary, formulating the research questions, and capturing the requirements, the conceptual model needs to be formalized. Its formalization, in the “Conceptual Model” step, implies the simplification of the CPS parts and their relationships existing in reality so as to increase the model’s utility [13]. A proper simplification is very important for the success of the simulation study, but at the same time, the conceptual model has to represent reality with sufficient precision for the simulation to produce reliable results. An overly simple conceptual model cannot capture the fundamental characteristics of the real system, whereas shifting the focus too much to model precision risks losing essential components of the system and their relationships. This complexity could make the simulation project fail because implementation, verification, and validation activities of the model are compromised.
e) Implementation: Once the conceptual model is finalized, it is implemented with the support of Modeling and Simulation software. Before examining all the available M&S software, it is necessary to decide how to implement the CPS structure and behaviours (e.g., SysML, ODE, Control Graphs) and the kind of simulation to perform (e.g., Continuous Simulation, Discrete Event Simulation, Stochastic Simulation) so as to capture the CPS evolution and changes over time. Nowadays, there are different M&S software packages, each specialized to address a specific kind of problem (e.g., Modelica, Simulink, and Wolfram SystemModeler). Since the researchers involved in the CPS M&S are different and belong to different research domains, the risk is to choose an unsuitable one that does not offer the functionalities needed to manage the simulation model.
f) Data Quality: In the “Simulation Experiments” step, generally three types of data can be used for performing experiments on the CPS synthetic model: (i) Historical data: past performance data of the overall CPS and its individual components along with environmental conditions; (ii) Real data: data coming from the real CPS in operation, i.e., from sensors and actuators, and outputs of components, including information coming from supplementary business systems; and (iii) Synthetic data: data from engineers, machine learning, and artificial intelligence systems. Sometimes, data on how the system operated in the past, how it operates currently, and how the synthetic model relates to the real system is scarce or unusable. This lack of quality in the data makes it challenging to perform valuable simulation scenarios.
g) Result interpretation: After completing simulation experiments, it is necessary to interpret the produced results to provide more readable information and highlight critical aspects that deserve special attention. Each simulation experiment has a model configuration, fixed parameters, and initial conditions that make results different, and their correct interpretation is up to the researchers. Interpretation pitfalls can arise when researchers interpret the results partially, without taking into account aspects related to the CPS structure, behaviors, and environmental conditions, losing the critical distance from their work [14]. Moreover, it is important to favour the reproducibility of results, meaning that a simulation model should not provide different results for each execution with the same initial conditions [15].
IV. REMEDIES IN MODELING AND SIMULATION OF CPSs
This section presents some remedies that are already available in the literature to address the identified pitfalls.
a) System boundary: Without a clear identification of the system boundary, which separates what lies within the CPS to be studied and what is outside (not necessary to
be analyzed), the whole M&S process is likely to fail. Several research efforts have focused on the definition of suitable methods, models, and techniques to address this aspect. In [16], a set of practices to correctly identify the system boundary is delineated. According to these practices, the boundary identification is carried out by selecting the environment variables that the CPS monitors and controls. Monitored variables are quantities of the environment whose values impact the CPS behavior (e.g., altitude, inclination, and airspeed of an airplane), whereas controlled ones represent quantities of the environment that the CPS affects with its behaviours (e.g., wing position of the airplane). Wittmann et al. in [17] present a methodology to define functional system boundaries necessary for evaluating the risk of an automated driving system by observing functional system boundaries and system errors. The proposed methodology allows modeling a set of levels of detail that drive the definition of relevant scenarios and system boundaries to support the identification of functional system boundaries. In [18], the International Council on Systems Engineering (INCOSE) delineates the Systems Development Life Cycle (SDLC) process, which utilizes systems thinking principles to design, integrate, and manage complex systems over their life cycles. It provides guidance and rationale to establish the external and internal components of a system and define its boundaries, including the interfaces that reflect the operational scenarios and expected system behaviours.
b) Requirement: To address the pitfalls related to the definition and management of a system’s requirements, different research efforts propose methodologies and techniques to avoid them. Gillani et al. in [19] present a survey of requirement techniques for managing Safety Critical Systems (SCS). The authors analyzed activities and techniques that should be performed by RE during safety analysis. Moreover, specified tools to support, in an integrated way, the safety analysis between RE and SCS in Safety Engineering have been explored. In [20], the authors stress the importance of requirement management as the most acute knowledge-intensive activity for managing a complex system. The authors classify, for each requirements elicitation step, the main issues and explore how Artificial Intelligence (AI) techniques can be a viable technique to overcome them. The paper also delineates the connection between the identified issues and their potential AI explanations in many requirements elicitation techniques. Milani in [21] claims that in modern organizations the Business Process Model and Notation (BPMN) language is widely used to facilitate communication between engineers, stakeholders, and researchers to understand how a complex system works [22]. The paper presents a BPMN-based method that guides the elicitation of requirements with the domain experts in a collaborative manner.
c) Precision and Accuracy: Numerical computing is a key part of the traditional computer architecture, and almost all traditional computers implement the IEEE 754-1985 standard to represent and work with numbers. However, due to architectural limitations it is impossible to work with infinite and infinitesimal quantities numerically. To increase precision in the simulation calculations and results, it is necessary to adopt a new kind of computer. Falcone et al. in [23] present an innovative solution that allows one to use the Infinity Computer arithmetic within the Simulink environment. The Simulink-based solution allows one to perform numerical computations with finite, infinite, and infinitesimal numbers, increasing the precision of the computations. Regarding the accuracy issues, in [24], the authors present a set of methods to improve the accuracy of simulations involving CPS. Specifically, the authors identified three groups: (i) methods to prevent the occurrence of errors; (ii) methods to reduce current errors; and (iii) techniques for reducing methodical errors.
d) Complexity: Today’s CPSs are hard to design, develop, and maintain since they are composed of many interconnected components that make them so large and detailed that no one can understand their behaviours. Keeping complexity under control is fundamental, as overly complex systems lead to an increase in costs and risks. Lindemann et al. in [25] present three main dimensions of complexity that emerge in the context of CPS design and development: (i) Structural Complexity; (ii) Dynamic Complexity; and (iii) Organizational Complexity. For each of them, the main issues are presented along with possible solutions. To support the systematic and holistic analysis of an engineering design process, Kreimeyer et al. in [26] present a measurement system that adopts a set of complexity metrics to integrate the process’ entities (e.g., tasks, documents, and organizational units). Specifically, 52 metrics have been defined for the structural analysis of processes (e.g., timeliness and need for communication). The metrics are supported by a meta-model for process modeling.
e) Implementation: There are different highly specialized M&S software/environments, both commercial and non-commercial, that allow the design and implementation of CPSs. However, a single software/environment is not able to manage all the CPS aspects; rather, it is tailored to address a specific type of problem. Thus, a combination of several M&S software/environments is required. Xiao and Fan in [27] present a framework, based on the Model-Driven Architecture (MDA) and the IEEE 1516-2010 (HLA) standard [28], that allows designing and simulating heterogeneous CPSs, also by reusing simulation models already available. In [2], the authors highlight the benefits coming from the joint exploitation of Distributed Simulation (DS) and Co-Simulation approaches to study CPSs. The paper proposes a solution that relies on the integration of the Functional Mock-up Interface (FMI) and the HLA standard for addressing, in an integrated way, the issues of reusability, interoperability, and distribution of CPSs. To achieve this integration, the authors defined the Adapter-based Hybrid Federate (A-HF), which allows reusing a Functional Mock-up Unit (FMU) in co-simulation modality within an HLA simulation in a conservative time-stepped manner.
f) Data Quality: High-quality data is an important aspect to consider in order to successfully conduct simulation experiments involving CPSs. Different methodologies are available in the literature to support data collection and analyze
their quality. Kewei and Sherali in [29] claim that when data, mostly from the physical world, is collected in CPSs, one of the most important challenges is the detection and filtering of faulty data. To improve the quality of the collected data, the authors argue that it is necessary to define suitable algorithms to find and filter incorrect data efficiently and cost-effectively. In the proposed work, the authors present the challenges and techniques for incorrect data detection and filtering. In [30], the authors present empirical descriptions of simulation data quality problems, data production processes, and relations between these processes and simulation data quality problems by evaluating a multiple-case study within the automotive domain. The obtained results have been used to define guidelines to support manufacturing companies in improving data quality.
g) Result interpretation: The results deriving from CPS simulations are generally only numbers; therefore, their interpretation is up to the researchers in order to answer the research questions defined in the “Problem Definition” phase. One of the main threats is their partial interpretation with respect to the hypothesis with which the virtual model was built. In [31], the authors highlight that the extraction of knowledge from simulation results is becoming increasingly important in the design and management of complex systems, since simulation results tend to be dynamic, incomplete, and redundant. To address these issues and extract knowledge from simulation results, the authors present a framework along with data mining algorithms. The framework has been defined by using novel techniques based on Rough Sets Theory (RST) and Principal Component Analysis (PCA) for selecting the main attributes and their implicit relationships to create an object-oriented data model for the simulation results.
[6] R. G. Sargent, “Verification and validation of simulation models,” in Proceedings of the 2010 winter simulation conference, pp. 166–183, IEEE, 2010.
[7] T. Li, H. Zhang, Z. Liu, Q. Ke, and L. Alting, “A system boundary identification method for life cycle assessment,” The International Journal of Life Cycle Assessment, vol. 19, no. 3, pp. 646–660, 2014.
[8] S. Mittal, U. Durak, and T. Ören, Guide to simulation-based disciplines: Advancing our computational future. Springer, 2017.
[9] C. Wohlin et al., Engineering and managing software requirements. Springer Science & Business Media, 2005.
[10] C. Beisbart and N. J. Saam, Computer simulation validation. Springer, 2019.
[11] J. Zhang, K. H. Johansson, J. Lygeros, and S. Sastry, “Zeno hybrid systems,” International Journal of Robust and Nonlinear Control: IFAC-Affiliated Journal, vol. 11, no. 5, pp. 435–451, 2001.
[12] D. Zowghi and C. Coulin, “Requirements elicitation: A survey of techniques, approaches, and tools,” in Engineering and managing software requirements, pp. 19–46, Springer, 2005.
[13] D. van der Zee, “Approaches for simulation model simplification,” in 2017 Winter Simulation Conference (WSC), pp. 4197–4208, Dec 2017.
[14] R. Barth, M. Meyer, and J. Spitzner, “Typical pitfalls of simulation modeling: lessons learned from armed forces and business,” The journal of artificial societies and social simulation, vol. 15, no. 2, p. 5, 2012.
[15] O. Dalle, “On reproducibility and traceability of simulations,” in Proceedings of the 2012 winter simulation conference (WSC), pp. 1–12, IEEE, 2012.
[16] D. L. Lempia and S. P. Miller, “Requirements engineering management handbook,” National Technical Information Service (NTIS), vol. 1, 2009.
[17] D. Wittmann, C. Wang, and M. Lienkamp, “Definition and identification of system boundaries of highly automated driving,” in 7. Tagung Fahrerassistenz, 2015.
[18] C. Haskins, “INCOSE systems engineering handbook: A guide for system life cycle processes and activities,” INCOSE, 2007.
[19] M. Gillani, A. Ullah, and H. A. Niaz, “Survey of requirement management techniques for safety critical systems,” in 2018 12th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS), pp. 1–5, 2018.
[20] S. Sharma and S. Pandey, “Integrating AI techniques in requirements elicitation,” Available at SSRN 3462954, 2019.
[21] F. Milani, “Requirement elicitation using business process models,” in Digital Business Analysis, pp. 311–319, Springer, 2019.
[22] A. Falcone, A. Garro, A. D’Ambrogio, and A. Giglio, “Engineering
systems by combining BPMN and HLA-based distributed simulation,”
in 2017 IEEE International Conference on Systems Engineering Sympo-
V. C ONCLUSION sium, ISSE 2017, Vienna, Austria, October 11-13, 2017, pp. 1–6, 2017.
[23] A. Falcone, A. Garro, M. S. Mukhametzhanov, and Y. D. Sergeyev,
The contribution of the paper is twofold. On the one hand, “Representation of Grossone-based Arithmetic in Simulink for Scientific
it identified some important pitfalls deriving from the adoption Computing,” Soft Computing, pp. 1–15, 2020.
of M&S approaches to the CPS study, and links them to the [24] Y. Yatsuk and S. Yatsyshyn, “Metrological array of cyber-physical
systems. part 5. quality assurance in measuring instrument design,”
corresponding phase(s) of the M&S lifecycle. In this way, Sensors & Transducers, vol. 188, no. 5, p. 1, 2015.
researchers have a guide that supports them, according to the [25] U. Lindemann, M. Maurer, and T. Braun, Structural complexity man-
M&S phase in which the project is located, in identifying agement: an approach for the field of product design. Springer Science
& Business Media, 2008.
possible pitfalls. On the other hand it presents for each [26] M. Kreimeyer and U. Lindemann, Complexity metrics in engineering
identified pitfall some remedies that are currently available design: managing the structure of design processes. Springer Science
in the literature to overcome it. & Business Media, 2011.
[27] T. Xiao and W. Fan, “Modeling and simulation framework for cyber
physical systems,” in Advanced Methods, Techniques, and Applications
R EFERENCES in Modeling and Simulation, pp. 105–115, Springer, 2012.
[1] R. Alur, Principles of cyber-physical systems. MIT Press, 2015. [28] A. Falcone, A. Garro, A. Anagnostou, and S. J. E. Taylor, “An
[2] A. Falcone and A. Garro, “Distributed Co-Simulation of Complex introduction to developing federations with the High Level Architecture
Engineered Systems by Combining the High Level Architecture and (HLA),” in 2017 Winter Simulation Conference, WSC 2017, Las Vegas,
Functional Mock-up Interface,” Simulation Modelling Practice and NV, USA, December 3-6, 2017, pp. 617–631, 2017.
Theory, vol. 97, no. August, p. 101967, 2019. [29] K. Sha and S. Zeadally, “Data quality challenges in cyber-physical
[3] J. S. Carson, “Introduction to modeling and simulation,” in Proceedings systems,” Journal of Data and Information Quality (JDIQ), vol. 6, no. 2-
of the Winter Simulation Conference, 2005., pp. 8–pp, IEEE, 2005. 3, pp. 1–4, 2015.
[4] P. Bocciarelli, A. D’Ambrogio, A. Falcone, A. Garro, and A. Giglio, [30] J. Bokrantz, A. Skoogh, D. Lämkull, A. Hanna, and T. Perera, “Data
“A model-driven approach to enable the simulation of complex systems quality problems in discrete event simulation of manufacturing opera-
on distributed architectures,” SIMULATION: Transactions of the Society tions,” Simulation, vol. 94, no. 11, pp. 1009–1025, 2018.
for Modeling and Simulation International, vol. 95, no. 12, 2019. [31] X. Shi, J. Chen, H. Yang, Y. Peng, and X. Ruan, “A novel approach to
[5] M. L. Loper, “The modeling and simulation life cycle process,” in extract knowledge from simulation results,” The International Journal
Modeling and Simulation in the Systems Engineering Life Cycle, pp. 17– of Advanced Manufacturing Technology, vol. 20, no. 5, pp. 390–396,
27, Springer, 2015. 2002.
Abstract—Virtual reality is slowly transitioning from a specialized laboratory-only technology to a consumer electronics appliance. In this transition, two interesting research questions amount to how 2D-based content and applications may benefit from (or be hurt by) the adoption of 3D-based immersive environments and to how to proficiently support such integration. Acknowledging the relevance of the former, we here consider the latter question, focusing our attention on the diversified family of PC-based simulation tools and platforms. VR-based visualization is, in fact, widely understood and appreciated in the simulation arena, but mainly confined to high performance computing laboratories. Our contribution here aims at characterizing the simulation tools which could benefit from immersive interfaces, along with a general framework and a preliminary implementation which may be put to good use to support their transition from uniquely 2D to blended 2D/3D environments.
Index Terms—Virtual reality, OpenGL intercept, blended 2D/3D interfaces, simulation environments.

This work was supported by the University of Bologna’s AlmaAttrezzature 2017 grant.
978-1-7281-7343-6/20/$31.00 ©2020 IEEE

I. INTRODUCTION
Many works have so far envisioned a future where 2D and 3D interfaces will both be supported by computing systems to provide better performance and experiences [1]–[9]. The price drop of hardware components (e.g., Oculus Quest and Rift with prices below 500€ and Oculus GO below 200€) amounts to one of the factors that may make this happen. The integration of 2D/3D interface paradigms into software platforms proceeds slowly, though, as it is not possible to observe a steep increase in the number of applications adding immersive experiences to their traditional 2D ones. Such resistance may be determined also by the fact that, to the best of our knowledge, no general and simple approach has so far been developed to implement an easy transition from 2D to 2D/3D settings. The possible paradox, which may hence occur in the near future, is that hardware will be ready and cost-effective for mass consumption, while software solutions will instead be scarce. In this paper we focus on this problem, which has been considered to some extent in the literature in past years. Previous works, however, have mainly concentrated on providing immersive interface support either through the provision of software platform specific add-ons or with the exhibition of dedicated APIs inside existing visualization frameworks [10]–[13]. No framework instead exists, to the best of our knowledge, capable of supporting: (a) a wide set of applications to (b) simply select the 2D interface elements which should be exported to a 3D immersive environment, while, (c) not requiring the creation of custom software, specific for the intended task. We here aim at moving a step forward along the path set by (a), (b) and (c), considering an application domain that has long experimented with and appreciated the opportunities offered by 3D interfaces and Virtual Reality (VR): scientific computation and simulation platforms [14]–[16]. An important body of evidence has demonstrated, in fact, the benefits that can be attained when exploring scientific data using immersive interfaces [17]–[19]. This work contributes to the research path presented so far with an analysis of the graphical libraries utilized by a few of the most widely used desktop computer simulation platforms and a preliminary implementation demonstrating the feasibility of the proposed technical approach. The remainder of this work is organized as follows. In Section II we review the approaches taken, so far, in the development of VR applications. In Section III we delineate the scenario considered in this paper and explain the adopted architectural approach. Section IV describes the results obtained in extending the interface capabilities beyond 2D, for three different simulation platforms. To fully benefit the scientific community, the code related to the work presented in this paper is available at [20].

II. STATE OF THE ART
Scientific visualization, resorting to computer graphics, has been so far used to represent: (a) data sets, which may be the output of numerical simulations, (b) recorded data, or, (c) constructed shapes. VR, in particular, has aided in the display of 3D structures providing spatial and depth cues, as it allows a rapid and intuitive exploration of the volume containing the data. The authors of [21], for example, have analyzed the usefulness of VR in specific task performance with volume datasets, finding that such systems improve performance in spatial judgment tasks. More recently VR systems have been assessed in the visualization of complex weather-related information [22]: in that study, the Xbox One controller in combination with a VR display proved to be the most effective and usable configuration. Reski and Alissandrakis investigated
TABLE I
SCIENTIFIC COMPUTING AND SIMULATION PLATFORMS USING OPENGL.
rendered in our 3D environment. The SUMO example also illustrates a problem that may occur when rendering in a 3D environment. In Figure 4, the car in the 3D environment (the black triangle in the top picture) is not near but far from the track. This happens because, usually, 2D applications rely on the Z axis to sort the elements on the screen using values that, in a 3D context, may create an incorrect display of the objects. The Z value assigned to the car of Figure 4 is too large for a proper 3D render. Finally, we are also working on porting NetLogo3D, which we can confirm is based on OpenGL calls and is hence extensible according to the proposed approach.

V. CONCLUSION AND FUTURE WORKS
This work reconnects to a stream of works that have been published in the past and which have had the merit of indicating a pathway for the provision of 3D immersive experiences, also for all those applications which are not designed to be VR-ready. In particular, our contribution responds to the needs of a niche of users that have always demonstrated interest towards immersive technologies: scientific computing and simulation research professionals. The proposed approach may hence, at once, serve an interested group of users, while fostering the development of a set of technologies which may in the near future bloom also in the general consumer market. Future works will require the completion of the implementation and a thorough experimentation, which may also include performance evaluation and human-computer interaction approaches, with the 3D immersive scientific computing and simulation environments supported within the proposed framework.

REFERENCES
[1] K. Risden, M. P. Czerwinski, T. Munzner, and D. B. Cook, “An initial examination of ease of use for 2d and 3d information visualizations of web content,” International Journal of Human-Computer Studies, vol. 53, no. 5, pp. 695–714, 2000.
[2] A. G. Sutcliffe and K. D. Kaur, “Evaluating the usability of virtual reality user interfaces,” Behaviour & Information Technology, vol. 19, no. 6, pp. 415–426, 2000.
[3] J. J. LaViola Jr, “Bringing vr and spatial 3d interaction to the masses through video games,” IEEE Computer Graphics and Applications, vol. 28, no. 5, pp. 10–15, 2008.
[4] W. Cellary and K. Walczak, Interactive 3D multimedia content: models for creation, management, search and presentation. Springer, 2012.
[5] D. A. Bowman, R. P. McMahan, and E. D. Ragan, “Questioning naturalism in 3d user interfaces,” Communications of the ACM, vol. 55, no. 9, pp. 78–88, 2012.
[6] A. Cockburn and B. McKenzie, “3d or not 3d? evaluating the effect of the third dimension in a document management system,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 2001, pp. 434–441.
[7] R. Alkemade, F. J. Verbeek, and S. G. Lukosch, “On the efficiency of a vr hand gesture-based interface for 3d object manipulations in conceptual design,” International Journal of Human–Computer Interaction, vol. 33, no. 11, pp. 882–901, 2017.
[8] L. Donatiello, E. Morotti, G. Marfia, and S. Di Vaio, “Exploiting immersive virtual reality for fashion gamification,” in 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2018, pp. 17–21.
[9] E. Morotti, L. Donatiello, and G. Marfia, “Fostering fashion retail experiences through virtual reality and voice assistants,” in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 2020, pp. 338–342.
[10] Vr software for virtual reality design — autodesk. [Online]. Available: https://www.autodesk.com/solutions/virtual-reality
[11] D. J. Zielinski, R. Kopper, R. P. McMahan, W. Lu, and S. Ferrari, “Intercept tags: enhancing intercept-based systems,” in Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, 2013, pp. 263–266.
[12] D. J. Zielinski, R. P. McMahan, S. Shokur, E. Morya, and R. Kopper, “Enabling closed-source applications for virtual reality via opengl intercept-based techniques,” in 2014 IEEE 7th Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS). IEEE, 2014, pp. 59–64.
[13] P. O’Leary, S. Jhaveri, A. Chaudhary, W. Sherman, K. Martin, D. Lonie, E. Whiting, J. Money, and S. McKenzie, “Enhancements to vtk enabling scientific visualization in immersive environments,” in 2017 IEEE Virtual Reality (VR), 2017, pp. 186–194.
[14] C. Shaw, M. Green, J. Liang, and Y. Sun, “Decoupled simulation in virtual reality with the mr toolkit,” ACM Transactions on Information Systems (TOIS), vol. 11, no. 3, pp. 287–317, 1993.
[15] C. J. Turner, W. Hutabarat, J. Oyekan, and A. Tiwari, “Discrete event simulation and virtual reality use in industry: new opportunities and future trends,” IEEE Transactions on Human-Machine Systems, vol. 46, no. 6, pp. 882–894, 2016.
[16] I. J. Akpan, M. Shanker, and R. Razavi, “Improving the success of simulation projects using 3d visualization and virtual reality,” Journal of the Operational Research Society, pp. 1–27, 2019.
[17] K. Gruchalla, “Immersive well-path editing: investigating the added value of immersion,” in IEEE Virtual Reality 2004. IEEE, 2004, pp. 157–164.
[18] A. Forsberg, M. Katzourin, K. Wharton, M. Slater et al., “A comparative study of desktop, fishtank, and cave systems for the exploration of volume rendered confocal data sets,” IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 3, pp. 551–563, 2008.
[19] Y. Peng, Y. Ma, Y. Wang, and J. Shan, “The application of interactive dynamic virtual surgical simulation visualization method,” Multimedia Tools and Applications, vol. 76, no. 23, pp. 25 197–25 214, 2017.
[20] Varlab website. [Online]. Available: https://site.unibo.it/varlab/en/projects/code-and-demos
[21] B. Laha, D. A. Bowman, and J. J. Socha, “Effects of vr system fidelity on analyzing isosurface visualization of volume datasets,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 4, pp. 513–522, 2014.
[22] B. J. Andersen, A. T. Davis, G. Weber, and B. C. Wünsche, “Immersion or diversion: Does virtual reality make data visualisation more effective?” in 2019 International Conference on Electronics, Information, and Communication (ICEIC). IEEE, 2019, pp. 1–7.
[23] N. Reski and A. Alissandrakis, “Open data exploration in virtual reality: a comparative study of input technology,” Virtual Reality, vol. 24, no. 1, pp. 1–22, 2020.
[24] 3d design software — sketchup. [Online]. Available: https://www.sketchup.com
[25] G. Marino, D. Vercelli, F. Tecchia, P. S. Gasparello, and M. Bergamasco, “Description and performance analysis of a distributed rendering architecture for virtual environments,” in 17th International Conference on Artificial Reality and Telexistence (ICAT 2007). IEEE, 2007, pp. 234–241.
[26] Techviz website. [Online]. Available: https://www.techviz.net
[27] Moreviz website. [Online]. Available: http://www.more3d.com/
[28] L. Yu, P. Svetachov, P. Isenberg, M. H. Everts, and T. Isenberg, “Fi3d: Direct-touch interaction for the exploration of 3d scientific visualization spaces,” IEEE transactions on visualization and computer graphics, vol. 16, no. 6, pp. 1613–1622, 2010.
[29] Openvr sdk. [Online]. Available: https://github.com/ValveSoftware/openvr
[30] Omnet++ simulation manual. [Online]. Available: https://doc.omnetpp.org/omnetpp/manual//sec:graphics:overview
[31] Qualnet manual. [Online]. Available: https://www.scalable-networks.com/products/qualnet-network-simulation-software-tool/
[32] Arena installation notes. [Online]. Available: https://www.arenasimulation.com/
[33] Vissim faq. [Online]. Available: https://www.ptvgroup.com/en/solutions/products/ptv-vissim/knowledge-base/faq/visfaq/search/
[34] Sumo - simulation of urban mobility. [Online]. Available: http://sumo.sourceforge.net/
Abstract—Wireless networks are prone to jamming-type attacks due to their shared medium. An attacker node can send a radio frequency signal and if this signal interferes with the “normal” signals of two communicating nodes, the communication can be severely impacted. In this paper, we examine radio interference attacks from the jamming node perspective. In particular, we assume a “greedy” jamming node, whose twofold objective is to attack and interfere with the communication between a transmitter and a receiver node, while minimizing its energy consumption and maximizing the detection time. The two communication nodes are static during the attack window time, while the attacker node can adapt its distance from the transmitter in order to select the most suitable range for a successful interference. In order to take into account the distance factor for the effectiveness of the attack, we derive an optimization model for representing the attack and we study the key factors that allow effective and efficient implementation of a jamming attack, namely a) the energy, b) the detection time and c) the impact on the transmission in terms of lowering the PDR. Three different types of attacks are analyzed: 1) Constant Jamming, 2) Random Jamming and 3) Reactive Jamming. Simulation results show that the effectiveness of a jamming attack with respect to the others not only depends on the position of the jamming node but also on the distance between the transmitter and receiver nodes.
Index Terms—Placement jammer problem, Jamming attacks, Security, Wireless Networks.

I. INTRODUCTION
The inherent openness of the wireless transmission medium has made wireless communication systems particularly vulnerable to a multitude of attacks. One of the biggest threats to these communication systems is the jamming attack, in part because of its ease of implementation. This kind of attack consists in intentionally interfering with the communication medium to keep it occupied or to corrupt data in transit to cause a denial of service (DoS). Most research has been focused on creating new detection methods or countermeasures [1]–[4]. Nevertheless, little work has been oriented towards optimizing the impact of these attacks.
The effectiveness of a jamming attack is based on many parameters such as the transmission properties (e.g., modulation, power), the characteristics of the network (e.g., routing), or also the strategy of the jammer along with its position. The last point has been the subject of a few studies in recent years under the name of jammer placement problem. The goal of this problem is to find the optimal position of the jammer to minimize the throughput of the network. Studying this problem would make it possible to improve detection methods, such as the location of jamming nodes [5], [6].
In [7], the authors study the impact of several types of jammers as a function of their distance from the victim nodes and the size of packets. They deduce that the closer the attacker is to its victim, the more effective it is. However, this also leads to a high probability of detection. Panyim et al. investigated whether the random positioning of a jammer can be more effective than a strategic choice of the attacker’s position [8]. They conclude that the aggressor has more impact on the network when the jammer is situated next to a node through which a lot of data transits. The number of jamming devices (and their locations) required to suppress a given network was also investigated in [9], whose authors compare the impact of the jammer when it is placed at random and when it is placed on a uniform grid. This placement problem can be formulated as an optimization problem where the goal is to corrupt a maximum number of packets from the target network, while keeping a low detection probability [10].
This study is inspired by those previous works but takes into account the fact that the attacker is also a constrained node (e.g., energy, computation). By considering the attacker perspective, we show here that there exists a trade-off between the efficiency of a jammer, its distance from the communication and its energy consumption. We assume an attacking node which aims to interfere with the communication as much as possible, while maximizing its impact on the network and minimizing its energy consumption and its probability of being detected.
We use the simulator NS-3 [11] to compare the energy consumed by the three distinct jamming strategies, as a function of the jammer’s distance from the victim node and the distance between the transmitter and the receiver. Our analysis shows that for each strategy, the distance between the two communication nodes influences the jamming efficiency and the probability of being detected. We also show that for each scenario, there is a position of the attacker which makes it possible to reduce its energy consumption and its probability of being detected while
having a reasonable impact on the networks.
The main objective of this study is to prove that the choice of the optimal interference strategy does not only depend on the jammer’s position in the network but also on its energy consumption and its probability of being detected.
This article is organized as follows. In section II, the network model, the jamming attack strategies, and the detection issue are described. We introduce the problem formulation in section III and we provide details of simulations and results in section IV. We conclude the paper in section V.

II. SYSTEM MODEL
A. Network Model
We consider a wireless communication scenario with one transmitter, one receiver and one jammer. We assume that radios have equal transmit power and equal noise power. We assume that nodes are limited in energy. We define an amount of energy in the initial state $E_0$. At the end of each transmission or each change of state of a device, the consumed energy of a node is calculated as follows:
$E_{i+1} = E_i + V * (t_{i+1} - t_i) * I_i$, (1)
where $E_i$ is the energy consumption at time $t_i$, V is the supply voltage and $I_i$ is the total current draw at node i.

upon packet transmission. This strategy reduces attack time and increases its effectiveness because the attacker no longer blindly jams the network.
We have chosen to implement three jamming approaches inspired by those mentioned above. Our first strategy, the Constant Interval Jammer, consists in injecting packets on the channel for a certain period at regular time intervals. We have chosen here a very short time interval between two jammings in order to corrupt a maximum of packets.
The second is an implementation of a Random Jammer, which randomly draws the duration during which it will remain in an idle state after each sending of packets in a given interval. The aggressor, therefore, alternates the two states randomly. The last implementation corresponds to a Reactive Jammer.
Table I shows the send interval for each type of jammer during the simulation.

Parameters            Constant Interval Jammer   Random Jammer        Reactive Jammer
Send interval (ms)    1                          Between 100 and 1    Send interval of the legitimate node
Energy (J)            55                         55                   55
Supply voltage (V)    3                          3                    3
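As an illustration of how equation (1) and the Table I parameters fit together, the following minimal Python sketch accumulates a node's consumed energy over a trace of state changes and draws the next jamming delay for each strategy. It is only a sketch under stated assumptions: the function names, the trace format and the current values used in the usage note are hypothetical and are not taken from the paper or from NS-3; only the 3 V supply voltage, the 55 J budget and the send intervals come from Table I.

```python
import random

SUPPLY_VOLTAGE_V = 3.0   # supply voltage from Table I
ENERGY_BUDGET_J = 55.0   # initial energy from Table I


def update_energy(e_i, t_i, t_next, current_a, voltage_v=SUPPLY_VOLTAGE_V):
    """Equation (1): E_{i+1} = E_i + V * (t_{i+1} - t_i) * I_i."""
    return e_i + voltage_v * (t_next - t_i) * current_a


def total_energy(trace, voltage_v=SUPPLY_VOLTAGE_V):
    """Consumed energy for a sorted trace of (timestamp_s, current_a) state
    changes; the last entry only closes the final interval (assumed format)."""
    energy = 0.0
    for (t_i, current_a), (t_next, _) in zip(trace, trace[1:]):
        energy = update_energy(energy, t_i, t_next, current_a, voltage_v)
    return energy


def next_send_delay_ms(jammer, legit_interval_ms=100.0, rng=random):
    """Delay before the next jamming transmission, following Table I."""
    if jammer == "constant":
        return 1.0                       # fixed 1 ms interval
    if jammer == "random":
        return rng.uniform(1.0, 100.0)   # drawn between 1 and 100 ms
    if jammer == "reactive":
        return legit_interval_ms         # follows the legitimate sender
    raise ValueError(f"unknown jammer type: {jammer}")
```

For example, with a hypothetical current draw of 0.5 A while transmitting and 0 A while idle, `total_energy([(0.0, 0.5), (1.0, 0.0), (3.0, 0.5), (4.0, 0.0)])` sums two one-second transmit intervals at 3 V; the result can then be compared against `ENERGY_BUDGET_J` to decide whether the node is depleted.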
where W is the system bandwidth and $\frac{S}{N}$ is the signal to noise ratio between the transmitter and receiver. We assume that in the absence of any external interference (i.e., jamming attacker), the achieved capacity only depends on the reciprocal distance between the two communicating nodes.
Since a “greedy” jamming node is considered, its main objective is to decrease the effective Packet Delivery Ratio (PDR), while minimizing its energy expenditure (which depends on its distance from the transmitter) and increasing its detection time. Intuitively, if the attacker is close to the transmission, it will be more effective while spending less energy (which is adjusted with the distance), yet its attack can fail since it can be detected very quickly. Since we consider three different aspects that can conflict with each other, we formulate three different functions $F_1$, $F_2$ and $F_3$. $F_1$ characterises the goal of impacting the PDR of the communication. In particular, in time slot t, the achieved rate with respect to the distance i is:
$R_t(i) = x_t(i) * c_t(i)$, (4)
and the function $F_1$ can be defined as:
$F_1 = \sum_{t=1}^{T} \sum_{i=1}^{D} E[R_t(i)] = \sum_{t=1}^{T} \sum_{i=1}^{D} E[x_t(i) * c_t(i)]$, (5)
where T is the total number of time slots, D is the distance, and E is the expectation, taken with respect to the randomness of $c_t(i)$, computed as in (3). Hereafter, E[.] will indicate the average. $F_1$ accounts for the fact that if the transmissions of both the emitting and the jamming nodes happen in the same time slot, they will collide with high probability. This means that if the packet reaches the receiver, it will fail the CRC check and be discarded, with a negative effect on the PDR. The function $F_2$ accounts for the energy expenditure of the jamming node, depending on its distance to the transmission, and can be expressed as:
$F_2 = \sum_{i=1}^{D} E[i^2]$ (6)
The function $F_3$ accounts for the detection time, which is proportional to the distance of the jamming node. The greater the distance of the attacker, the longer it would take to detect the attack. However, if the attacker is too far away, the attack would have a smaller impact while requiring more energy from the attacker node. We thus compute $F_3$ as follows:
$F_3 = \sum_{i=1}^{D} E[E_n(i)]$, (7)
where $E_n(i)$ is a function proportional to the distance.
We then compute:
$\min(F_1 + \lambda_e * F_2 - \lambda_d * F_3)$ (8)
subject to
$\sum_{i=1}^{D} \mathbb{1}_{x_t(i)} < \delta$, (9)
$\sum_{i=1}^{D} \Pi_i x_t(i) = 0$, (10)
where δ is a threshold distance (beyond this distance the attack has no effect on the transmission), $\lambda_e$ is a variable weighting the importance of the energy consumption, while the variable $\lambda_d$ weights the detection factor. Equation (10) means that for each distance there is at least one slot where the transmitter and the attacker send data in the same slot. This optimization problem is non-linear and the different types of attacks considered will not be optimal. As an example, the reactive jamming tries to “intercept” the transmission, but in order to do that the energy consumption will be larger.
Hereafter, we evaluate the different types of attacks with respect to the impact on the PDR, the energy consumption of the attacker node and the detection time. In particular, we implement the different functions $F_1$, $F_2$ and $F_3$ and we evaluate them for the different types of attacks.

IV. PERFORMANCE EVALUATION
A. Simulation Details
The jamming attacks were simulated using the discrete event simulator NS-3 (Network Simulator-3). The parameters set during simulations are shown in Table II. The transmitter constantly transmits packets every 0.1 seconds and begins its transmission at the start of the simulation (t = 0). The jammer aims to jam the transmitter node.

Parameter Name                        Setting Used
Radio Propagation Model               Friis Propagation Loss Model
Routing protocol                      Ad-hoc routing
Energy Model                          EnergyBasicModel
Size of Legitimate Packet (octets)    1000
Send interval legitimate nodes (s)    0.1
TABLE II: Simulation and Node Parameters.

B. Results and Analysis
Our objective is to evaluate the impact of the different kinds of jamming attacks on the network as a function of the placement of the malicious node. A study of energy consumption as a function of the placement of the attacker is also carried out. In particular, we evaluate the three different types of jamming attacks, the constant interval jammer, the random jamming and the reactive jamming, by considering the three factors a) Detection Time; b) Energy Spent; c) Packet Delivery Ratio (PDR) in a synergistic way. Indeed, in order to be effective, an attack has to be detected as late as possible (high detection time), the attacker has to minimize its energy consumption and the PDR between transmitter and receiver has to be impacted as much as possible.
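The weighted objective in (8), built from the functions in (5), (6) and (7), can be evaluated numerically once sampled values are available. The sketch below is a hedged illustration only: the data layout (nested lists of samples), the function names, and the treatment of each distance i as deterministic (so that E[i^2] reduces to i^2) are assumptions made for the example, and the constraints (9) and (10) are not enforced.

```python
from statistics import mean


def f1(rate_samples):
    """Equation (5): sum over slots t and distances i of the average of
    x_t(i) * c_t(i). rate_samples[t][i] is a list of sampled products."""
    return sum(mean(samples) for slot in rate_samples for samples in slot)


def f2(distances):
    """Equation (6), assuming each distance i is deterministic, so that
    E[i^2] is simply i^2."""
    return sum(d ** 2 for d in distances)


def f3(detection_samples):
    """Equation (7): sum over distances of the average of E_n(i).
    detection_samples[i] is a list of sampled E_n(i) values."""
    return sum(mean(samples) for samples in detection_samples)


def objective(rate_samples, distances, detection_samples, lambda_e, lambda_d):
    """Equation (8): F1 + lambda_e * F2 - lambda_d * F3 (to be minimized)."""
    return (f1(rate_samples)
            + lambda_e * f2(distances)
            - lambda_d * f3(detection_samples))
```

A caller would evaluate `objective` once per candidate jammer configuration and keep the configuration with the smallest value; larger `lambda_e` penalizes energy more strongly, while larger `lambda_d` rewards longer detection times.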
The first set of simulations is based on a distance between the transmitter and the receiver of 20 meters. In Figure 1, we report a) the Detection Time as a function of distance, b) the Total Energy Spent for an attack by the jamming node and c) the Packet Delivery Ratio of the communication between the transmitter and the receiver.

[Figure 1, panels (a) and (b)]

than 2 Joules, the detection time increases and reaches 3 seconds at around 35 meters, but the PDR is noticeably impacted: up to an attacker distance of 30 meters, the PDR is smaller than 80%. It is worth recalling that we are considering an ideal scenario, where no other communication interferes with the channel of the transmitter and receiver, so we expect a PDR of 100%. The constant jamming impacts the channel by 20% in terms of PDR.
In order to evaluate how the distance between the transmitter and receiver affects the effectiveness of a jamming attack, when the same power level of the transmitter is considered, we increase the distance between the sending node and the receiver to 60 meters. This scenario confirms that the most effective attack is the reactive jamming.
Indeed, when the jamming node is positioned at around 50 meters from the transmitter, the detection time is around 2.4 seconds and reaches 3 seconds at 65 meters. The PDR is highly impacted since it reaches 70% at 50 meters and 90% at 65 meters. In practice, the optimal position of the reactive jammer in this scenario is around 50 meters with an energy consumption around 2 joules. The other two attacks have a low energy consumption, but they are not effective since the detection time is almost constant and equal to 1 second (i.e., the attack is soon detected) and only increases slightly around 80 meters.
C. Discussion
The analysis dealt in the different scenarios arises some
interesting observations. First of all, as already assessed in other
previous works, there is a strong relation between the position
(b) of an attacker and its effectiveness in a wireless context. As
the attacker considered in this work is a greedy node, aiming at
being effective in terms of impact (i.e. by lowering the PDR) but
with the minimum energy consumption, our evaluation allowed
to understand that different types of attacks can be more effec-
tive based on different distances between two communication
nodes. In the specific scenarios considered, the constant attack
is with more impact than the random and the reactive ones,
when the distance between the two communicating nodes is
small (e.g., 20 meters). On the other hand, the reactive jamming
is more effective when the distance between transmitter and
receiver increases. A jamming node can easily implements the
three different types of attacks by switching from one to the
(c)
other, based on the specific situation of the two nodes that are
Fig. 1: Distance between Transmitter and Receiver equal to communicating. It is sufficient for the attacker node to listen
20 meters (a) Detection Time; (b) Total Energy spent by the for a sufficient time in order to acquire the needed data and
jamming node. (c) Packet Delivery Ratio. infer information as the distance between the two nodes.
In this work we have considered an ”ideal” scenario where
Among the three types of attacks, the reactive jamming is only two nodes are exchanging data, so no external interference
less detectable than the constant and random ones. On the other is considered; the detection is also ideal, in the sense that it is
hand, the energy depleted by the reactive jamming node is much with a fixed threshold and we assume it is able to perfectly
higher than for the others two types of attacks. Moreover, the detect the jamming attack with no false alarm. This is not true
detection time increases for constant and random attacks when in a realistic scenario, where lower PDR can be caused for
the attacker is positioned around 25 − 35 meters. different reasons and the detection scheme needs to account
In particular, the constant jamming is more effective in this for all these situations. The main objective of this analysis was
distance interval, since the energy wasted for the attacks is less to highlight not only the dependence of the attacker position
Abstract—Thanks to the recent advancements in the Software-Defined Networking (SDN) and Network Function Virtualization research domains, telecom operators are encouraged to upgrade their optical transport networks towards programmable, energy-efficient, service-oriented, and interoperable architectures. The availability of a large set of open-source building blocks, supported by different standardization bodies, makes the selection and the integration of such technologies a very complex task. In this context, the INTENTO project has the objective to create an innovative simulation framework by selecting the best technologies and to use it to test applications, services, and advanced optimization algorithms in a real environment. In the initial phase, the project designed a large-scale, distributed, and hierarchical Transport SDN architecture, where optical switches and networking functionalities are monitored and dynamically configured through a two-level structure of SDN controllers. On top of that, Virtual Network Functions are optimally deployed and managed by a centralized orchestrator, based on network conditions, user requests, and application requirements. Based on this architecture, the project team started to develop a complex simulation environment that harmoniously integrates within the OpenStack cloud: optical node simulators composed of a simulation agent and a suitable hardware emulation layer; a proprietary SDN network controller designed to enable the innovative optical node characteristics; and the Open Network Operating System as the second-level controller, enabling the integration of third-party or standardized models (multivendor environment), based on standardized interfaces and communication protocols. After having described the main components and functionalities already implemented into the simulation framework, the paper concludes by highlighting future research and development activities.

Index Terms—Optical Transport Networks; Software-Defined Networking; Virtual Network Functions; Simulation Framework

978-1-7281-7343-6/20/$31.00 ©2020 IEEE

I. INTRODUCTION

Software-Defined Networking (SDN) is a cutting-edge technology for the deployment of programmable and virtualized service infrastructures. On the other hand, Network Function Virtualization (NFV) emerged as a new network architecture approach to virtually deploy network functions on generic hardware [1]. The state of the art demonstrates that the combination of SDN and NFV enables unprecedented levels of network control, dynamicity, and flexibility [2]–[4]. Telco operators are encouraged to take this opportunity by integrating SDN and NFV into their large-scale geographical networks, eventually known as Transport-SDN (T-SDN). However, this integration is not straightforward and poses numerous challenges due to the large-scale complexity of telecommunication networks and the selection of suitable technology from the available open-source projects backed by different standardization bodies.

The INTENTO (INTElligent NeTwork Orchestration Framework) project [5], recently funded by the Apulia Region (Italy), is going to address the aforementioned issues by developing an innovative simulation framework, selecting appropriate state-of-the-art technologies and integrating them to test applications, services, and advanced optimization algorithms in a real-time and complex T-SDN environment. In the initial stage of the project, a T-SDN architecture has been designed that incorporates distributed and hierarchical monitoring and deployment of large-scale optical switches and network functionalities (i.e., VNFs) by means of a two-level structure of SDN controllers. The level-1 SDN controller manages the optical nodes, whereas the role of the level-2 controller is to allow the integration with third-party and multi-vendor environments. The Virtual Network Functions (VNFs) are optimally deployed via a central orchestrator based on the network and user requirements.

Based on the proposed architecture, the project team has built a real-time and complex simulation environment within the OpenStack cloud, consisting of the following functionalities: (1) optical node simulators consisting of simulation agents with the emulated hardware layer, (2) the level-1 proprietary SDN controller developed as part of this project to manage the advanced optical node features, and (3) Open Network
Operating System (ONOS) has been adopted as the level-2 SDN controller in order to facilitate integration for third-party and multi-vendor environments on standardized communication protocols like Transport API (T-API), NETCONF, and RESTCONF, for both southbound and northbound interfaces. It is worth noting that in this architecture real equipment can be integrated into the simulation environment. An automated deployment procedure has been developed to effectively deploy the complete simulation environment. At the time of this writing, to certify the effectiveness of the overall simulation framework, the project is identifying innovative applications, which will be addressed in the final part of the INTENTO Project.

The rest of the paper is organized as follows: Section II presents the overview, goals, and high-level architecture of the INTENTO project. Section III describes the introduced simulation framework, implemented technologies, and the future goals of the project. Section IV draws the conclusions of this work. Finally, the acknowledgments are given in Section V.

II. THE INTENTO PROJECT

The INTENTO Project aims to implement a Telco-Cloud orchestration platform using open-source software modules and standardized interfaces. The telecommunication infrastructure includes all the relevant hardware and software components of a T-SDN architecture, ranging from optical nodes through a two-level network management system, to the overall infrastructure management based on a centralized orchestrator. On top of that, VNFs are optimally deployed and managed by a centralized orchestrator in order to implement and test innovative services and applications.

Fig. 1. The conceived high-level framework.

Figure 1 describes the reference architecture of the simulation framework, highlighting the driving factors: 1) development of the node models according to the Yet Another Next Generation (YANG) standard, 2) support of the T-API interface and the NETCONF/RESTCONF communication protocols to ensure compatibility with existing network standards, 3) support of L1-L3 network layers and related multi-layer management, and 4) implementation of an NFV infrastructure management enabling the development and testing of VNFs.

The extensive use of a virtualized environment allows the mix of real nodes and simulators and can be used to reach a very high node count, to test very complex networks.

The Project objectives are supported by an IT infrastructure based on standard servers. More precisely, we aim at demonstrating the overall framework capability of simulating a complex infrastructure management, including optical layer design and planning and multilayer (DWDM / OTN / Packet) management. The simulation framework can be used to carry out complex simulation scenarios, which are very useful to select the best solutions among alternative options and to assess the overall performance. In addition, thanks to the open framework architecture, advanced applications will be selected and tested, i.e., the effectiveness of a set of VNFs, which may be hosted in the optical nodes.

A. Targeted use cases

Although embedding multi-level service orchestration architectures in nodes of a telecommunication network is often seen as a way of serving traditional Telco applications as VNFs, this vision leaves out interesting target areas that are not purely Telco. Indeed, several ICT applications exist that either demand or greatly benefit from the availability of edge-based processing coupled with synchronous inter-node communication, in terms of low latency, availability, survivability, as well as scalability, especially when all is cleanly modeled as VNFs and orchestrated as such. For example: Content Delivery Network (CDN) for efficient video distribution, Blockchain processing VNF, Camera processing VNF for detecting infringements of social distancing and face mask-wearing rules as an anti-COVID-19 precaution, IoT data collection and first-level aggregation VNF for sensor arrays, vehicle traffic support systems, and smart grid applications.

B. High-level architecture

Referring to Figure 1, the items composing the overall architecture are:

1) Telecom Nodes: The simulators of the optical nodes are based on the SM Optics technology and are the base of the Telecom infrastructure, providing the connectivity for each architectural component.

2) Specialized SDN Controller (L1 Controller): The L1 Controller is in charge of managing the Telecom Nodes simulation instances and represents the actual NMS solution for SM Optics nodes.

3) Multi-vendor/Multi-domain SDN Controller (L2 Controller): The L2 controller is supposed to act as a generic SDN controller, based on standard interfaces, enabling the simulation to deal with multi-domain, multi-vendor environments.

4) VNF Orchestrator: Provides support for the whole lifecycle of VNF instances, from library management to the actual deployment, to the activation and monitoring functions.

5) Framework Orchestrator: It is based on OpenStack and should be intended as the general orchestrator framework needed to exploit all possible services conceived for the proposed infrastructure.
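To make the role of YANG-modeled nodes and RESTCONF among the driving factors of Figure 1 more concrete, the sketch below builds (without sending) a RESTCONF request that configures one interface on a node. It is an illustration under assumptions, not the INTENTO implementation: the host name, the function name `build_interface_put`, and the choice of the standard `ietf-interfaces` YANG module (rather than the project's own node models or the T-API models) are hypothetical. The URL layout and the `application/yang-data+json` media type follow RFC 8040, and `/restconf` is only the conventional API root.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_interface_put(base_url, if_name, enabled=True):
    """Build (but do not send) a RESTCONF PUT that creates or replaces one interface."""
    # RFC 8040 data resource layout: {api-root}/data/<module>:<container>/<list>=<key>
    # The list key is percent-encoded so that "/" or spaces in a name stay in one segment.
    path = (
        "/restconf/data/ietf-interfaces:interfaces/interface="
        + quote(if_name, safe="")
    )
    # JSON encoding of YANG data: top-level member names carry the module prefix.
    body = {
        "ietf-interfaces:interface": [
            {
                "name": if_name,
                "type": "iana-if-type:ethernetCsmacd",
                "enabled": enabled,
            }
        ]
    }
    return Request(
        url=base_url + path,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/yang-data+json",
            "Accept": "application/yang-data+json",
        },
    )


# Hypothetical node address, used only to show the resulting request line.
req = build_interface_put("https://node1.example", "port-1")
print(req.get_method(), req.full_url)
```

The request object could then be passed to `urllib.request.urlopen` against a real RESTCONF server; authentication and TLS settings are deliberately left out of this sketch.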
III. THE IMPLEMENTED TESTBED

This section focuses on the technical details of the implemented simulation framework, presenting the selected technologies, the integrated components, and the automated deployment. An example showing the current usage of the simulation framework is discussed as well.

A. Components of the simulation framework

The developed simulation framework consists of:
• Network Simulation Agent: The network simulation agents (also known as optical node simulators) are developed to model virtualized optical switches in the network with different characteristics, e.g., bus speed, number of connecting ports, and type of connectors. Each virtual switch can be connected to one or multiple optical switches in the simulation environment.
• Level-1 SDN controller: A proprietary SDN controller has been designed and developed as part of the INTENTO project to enable the management and control of simulated optical nodes. The core responsibility of this controller is the creation and management of the virtualized optical switches on the network simulation agents connected to it. Moreover, it is also responsible for communication with the multi-vendor SDN controller ONOS on level-2. In our proposed simulation environment, the level-2 SDN controller is connected with a set of level-1 SDN controllers, associated with a large number of Network Simulation Agents comprising several virtualized optical switches.
• Level-2 SDN controller: Among the several open-source controllers available in the industry, the most prominent ones are ONOS and OpenDaylight. Both allow communication with third-party controllers through well-known communication protocols (e.g., OpenFlow, NETCONF, and RESTCONF). The motivation behind the selection of ONOS in the INTENTO framework is its communication mechanism. ONOS provides support for the T-API protocol at the Southbound interface over the REST protocol. In contrast, OpenDaylight earlier provided support for T-API in its UniMgr project, but recent releases of ODL no longer support T-API, which the INTENTO project relies on for communication between the level-1 and level-2 controllers.
• Modeling language: YANG is a data modeling language for the definition of data sent over network management protocols such as NETCONF and RESTCONF. It is used in our project to model both configuration data and state data of elements in the network.
• Application deployment technology: Container technology provides an effective way for application deployment. Docker has been selected as the container engine, based on a qualitative cross-comparison of containerization technologies discussed in [6]. All the executables and the required dependencies for the deployment of the level-1 SDN controller and optical node simulator are packed inside the containers to make the application portable and easily deployable.
• Operating environment: The OpenStack cloud has been selected as the operating environment for the simulation framework. It can virtualize and control large pools of computing, storage, and networking resources. OpenStack has been chosen because of its open-source licensing, wide adoption in the industry, active community support, and frequent releases of new features as per industry demands [6].

B. Communication protocols and interaction

The network configuration and communication between the components of the simulation framework are carried out through the following protocols:
• T-API: The Transport API provides a flexible North-Bound Interface for integrating SDN controllers in the network by facilitating transport communication through a REST API following T-API models, written in YANG.
• RESTCONF/NETCONF: The purpose of these protocols is the communication between multiple controllers and network simulation agents. They provide mechanisms to install, manipulate, and delete the configuration of network devices through remote procedure calls and XML/JSON-based data encoding for the configuration data as well as the protocol messages.

C. Achieved implementation and the developed simulation framework

The simulation framework currently being developed integrates the deployment of a two-level structure of SDN controllers within the OpenStack cloud. The proprietary SDN controller developed in this project is deployed on level-1, and the open-source multi-vendor ONOS SDN controller has been placed on level-2 in the framework. The optical node simulator, which is also developed as part of this project, is connected to the level-1 and level-2 controllers via the RESTCONF/NETCONF interfaces. As shown in Figure 2, the optical node simulator is dynamically controlled by the level-1 SDN controller. Each optical node simulator represents telecom nodes that can create a large number of virtual interfaces for communication with other telecom nodes. The level-1 SDN controller is connected with the level-2 SDN controller through the T-API interface using the REST API. The level-2 SDN controller can retrieve the information related to simulated nodes either through the level-1 SDN controller or directly from the optical nodes. The topology-related information is retrieved through the level-1 SDN controller. The YANG language is used for the communication models between the level-1 and level-2 SDN controllers as well as the optical node simulators. Currently, the level-1 SDN controller can communicate and control the optical node simulators and perform tasks such as creating multiple interfaces on the
Abstract—In this work, we propose an IoT edge-based energy management system devoted to minimizing the energy cost for the daily use of in-home appliances. The proposed approach employs a load scheduling based on a load shifting technique, and it is designed to operate naturally in an edge-computing environment. The scheduling jointly considers time-variable profiles for energy cost, energy production, and energy consumption for each shiftable appliance. Deadlines for load termination can also be expressed. In order to address these goals, the scheduling problem is formulated as a Markov decision process and then processed through a reinforcement learning technique. The approach is validated by the development of an agent-based real-world test case deployed in an edge context.

Index Terms—Edge Computing, Reinforcement Learning, Energy Management Systems, Internet of Things, Multi-Agent Systems.

I. INTRODUCTION

Appliances and new technologies simplify everyday life, making it easier to carry out daily domestic activities and guaranteeing surprising results. There are several types of available appliances: for cooking (e.g., hoods, hobs, ovens, and microwaves), for making food (e.g., small appliances such as blenders, mixers, and food processors), for keeping food fresh (e.g., refrigerators, cellars), for cleaning (e.g., vacuum cleaners), for washing (e.g., washing machines, dishwashers), for entertainment (e.g., smart TVs, game consoles, home theaters), and for indoor wellness (e.g., air conditioners and purifiers). Inefficient use of these appliances causes a waste of energy and time. One simple way to deal with this is to provide consumers with some feedback. Feedback can be used to inform about waste and give suggestions or best practices to enhance user behavior regarding energy. In such a case, the user remains in charge of taking proper actions to deal with energy management. Another more effective way to reduce energy consumption is to apply Demand Side Management (DSM) techniques [1]. DSM in the smart grid allows customers to make autonomous decisions on their energy consumption, helping energy providers to reduce the energy peaks in load demand. The automated scheduling of smart devices in residential and commercial buildings plays a key role in DSM [2].

DSMs can be considered as a part of the so-called modern Cognitive Buildings [3]. Cognitive buildings, the natural evolution of smart buildings [4], are environments equipped with sensor and actuator capabilities that exploit the Internet of Things (IoT) paradigm, and cognitive abilities. Cognitive buildings differ from smart buildings because they can learn, reason, adapt, and cooperate with each other to make decisions in a time-constrained fashion. A key feature is the collection and analysis of environmental data and consumer habits in order to proactively and efficiently manage the building's resources (for example, its internal and external spaces, technological infrastructures and systems), with the goal of improving energy performance and raising the level of well-being and safety of the building inhabitants.

Applying DSM techniques is not a trivial issue because it requires considering several factors together in order to minimize energy costs. As an example, it is necessary to take into account the presence of PV (Photo-Voltaic) solar panels, which gives rise to the problem of using as much of the self-produced energy as possible so as to minimize the cost of buying energy from the electrical grid [5]. Another factor that increases the complexity of applying DSM is the variable price of energy [6].

An effective way to implement DSM techniques is to stimulate customers to shift loads from peak periods to off-peak periods or to decrease their electricity usage during peak times, always taking into account people's preferences in using appliances and the self-produced energy. In such a context, Reinforcement Learning (RL) algorithms can make the right decision in computing and operating an appropriate load scheduling [7].

DSM can significantly benefit from the exploitation of the Edge Computing paradigm [8], [9]. The key concept of edge computing is distributing the power of data processing to the edge of a system, giving devices, sensors, and gateways the capability to act or make decisions locally without relying on a distant cloud environment. In such a case, the advantages for DSM are tied to latency reduction, bandwidth saving, and privacy preservation [10]–[12]. An edge-based DSM can be replicated and disseminated on different computing nodes residing in different places to execute locally in every building, thus taking into account the specific context in which the system operates. Moreover, in order to foster high performance when using an RL approach for DSM, edge computing favors efficient data transfer between the RL algorithm and the physical devices that must be managed. As another side benefit, edge computing reduces costs as it avoids the need to buy cloud resources.

In this paper, we focus on the design and implementation
of an IoT-based energy management system for DSM that exploits reinforcement learning and is based on an edge computing infrastructure. The goal is to schedule in-home appliances so as to minimize the total cost spent on energy while taking into account some users' preferences. The scheduling process is modeled as a Markov decision process (MDP) [13], and RL is used to determine an effective policy that aims to minimize the energy cost.

The contribution of the paper jointly considers: (i) time-variable profiles for energy cost, energy production, and energy consumption for each appliance; (ii) the presence of both not-shiftable and always-on loads in a building; (iii) the definition of deadlines within which the appliances have to be executed. Moreover, (iv) the system has been designed to be naturally distributed and capable of being deployed on several edge nodes that can be spread in different parts of a building to fully exploit the edge-related advantages.

To prove the effectiveness of the approach, a real case study has been implemented in the context of the IoT Laboratory at ICAR-CNR (Rende, Italy). For realization purposes, the agent-based COGITO IoT platform [14] has been used. Developed at ICAR-CNR, COGITO proved to be effective for the realization of cognitive building applications.

The remainder of the paper is structured as follows: Section II introduces some background concepts useful to understand the rest of the paper and some related work; Section III describes the problem statement and how it is modeled. Moreover, it introduces the simulated environment and the reward function used. Section IV shows an implemented case study and gives some experimental results. Finally, some conclusions are drawn and future work is outlined.

II. BACKGROUND AND RELATED WORK

This section provides some fundamental information about the concepts exploited in the rest of the paper. It also provides a view on some other works in the literature having similar approaches with regard to the topics of this paper.

A. Reinforcement learning

RL is a technique useful for training decision-making agents. This technique is studied in different fields, e.g., game theory, optimization, and control theory. RL considers four basic components: agent, environment, reward, and action. An agent can observe a dynamic environment and interact with it by taking actions. A reward is given to the agent for each action it takes. Agents are trained by RL so as to make decisions that maximize the given (cumulative) rewards. RL algorithms are useful to train decision makers that operate with limited knowledge of both the environment and the expected quality of each decision they take [7].

RL algorithms include State-Action-Reward-State-Action (SARSA), Q-Learning, Deep Q-Learning, and Asynchronous Advantage Actor-Critic (A3C) [7], [15], [16]. Such algorithms are all based on learning by experience. The agent is trained by running a set of simulations in which the agent interacts with the environment. After each simulation, the agent incrementally updates its knowledge about the problem and eventually learns which actions to take to maximize the reward.

B. Markov decision processes

A Markov decision process (MDP) [13] is an extension of the Markov process that embeds the concepts of actions and rewards. An MDP is defined as a tuple $(S, A, P_a, R_a)$ where:
• $S$ defines the set of the states;
• $A$ defines the set of the actions;
• $P_a(s, s')$ defines the transition probabilities, i.e., given a pair of states $s, s' \in S$ and an action $a \in A$, $P_a(s, s')$ is the probability of moving from the state $s$ to the state $s'$ with the action $a$;
• $R_a(s)$ defines the reward obtained by taking the action $a \in A$ when in the state $s \in S$.

Given an MDP, it is possible to define a policy function $\pi(s)$ which gives, for each state $s \in S$, an action $a \in A$ to undertake. A policy function is optimal if it maximizes the expected cumulative reward of an MDP.

For MDPs having a finite state space, a finite action space, a fully defined probability transition function, and a fully defined reward function, it is possible to compute an optimal policy by exploiting dynamic programming. In the other cases, specific RL algorithms can be exploited to estimate effective policies.

C. COGITO

COGITO [14] is an agent-based IoT platform tailored to the design and implementation of cognitive environments. A cognitive environment extends the concept of smart environment [17], [18] by promoting the exploitation of cognitive-based technologies [19], which aim at realizing systems able to automatically adapt to changes in users' behavior and to anticipate and predict users' activities and needs.

COGITO, currently implemented in Java, relies on the agent metaphor [20] and naturally makes it possible to exploit the benefits of both edge and cloud computing. The agent paradigm has been chosen since it is well suited for the implementation of distributed and pervasive systems. In fact, agents can execute near the devices they need to control or manage, thus implementing the edge computation and enabling real-time analysis of the data gathered on single nodes. Furthermore, agents can execute on the cloud to implement out-of-the-edge functionalities (e.g., data mining or data storage) and take advantage of the features of the cloud. The COGITO platform offers the Virtual Objects abstraction, suited to hide the heterogeneity of physical devices and communication protocols. COGITO promotes modularity and separation of concerns and offers some built-in features which can be exploited to aggregate and filter information at the edge and to perform data fusion on data coming from the deployed sensors. Other primitives are made available to simplify the use of artificial intelligence libraries (e.g., for machine learning) in a distributed and heterogeneous cross-language environment.
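As a small illustration of the dynamic-programming case mentioned in Section II-B, the following Python sketch runs value iteration on a toy MDP given as the tuple (S, A, P_a, R_a). The two-state example, the function name `value_iteration`, and the discount factor `gamma` are assumptions introduced here for illustration. The text above speaks only of the expected cumulative reward; a discount below 1 is added so that the iteration converges. This is not the scheduling MDP used in the paper.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Return (V, policy) for a finite MDP.

    P[a][s][s2] is the transition probability P_a(s, s2);
    R[a][s] is the reward R_a(s) for taking action a in state s.
    gamma is an assumed discount factor (not part of the tuple in the text).
    """
    def q(s, a, V):
        # Immediate reward plus discounted expected value of the next state.
        return R[a][s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q(s, a, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once no state value changes noticeably
            break

    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    return V, policy


# Hypothetical toy MDP: "stay" keeps the state, "move" flips it;
# only staying in state 1 pays a reward.
states = [0, 1]
actions = ["stay", "move"]
P = {
    "stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    "move": {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},
}
R = {"stay": {0: 0.0, 1: 1.0}, "move": {0: 0.0, 1: 0.0}}

V, policy = value_iteration(states, actions, P, R)
print(policy)  # {0: 'move', 1: 'stay'}
```

When the transition probabilities or rewards are not fully known, this computation is not available, which is where the RL algorithms listed in Section II-A come in.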
D. Load scheduling and reinforcement learning

In the literature, other works have tackled the problem of appliance scheduling through RL. In [21], the authors compared variations of the SARSA and Q-learning algorithms by applying them to find a schedule for six loads. Anyway, [...] and did not consider self-produced energy. In [22], the authors propose a load scheduling algorithm that uses RL to minimize the electricity bill while considering a renewable energy source. The paper presents some interesting schedules of six loads based on a time granularity of one hour. Moreover, it does not consider real deployments of the implemented system in a distributed environment. The work in [23] introduces CAES, an energy management system for residential demand-response applications. CAES is based on Q-learning and has been developed to adapt to changes in a consumer’s preferences or in the energy market prices. Although very interesting, this work does not take into account time-variable profiles for the self-production of energy or for the energy consumption of the appliances.

III. PROBLEM STATEMENT AND MODELING

The goal of the proposed approach is to define a schedule of load activations that minimizes the energy costs and avoids exceeding consumption peaks.

The scheduling algorithm evolves on a step basis. At each step, the agent observes the environment and chooses which loads can be activated. An activated load cannot be suspended. The duration of the steps is configurable. As an example, with time-steps of 15 minutes, the model considers 96 scheduling time-steps in a day.

A. Parameters and goals

Specifically, the approach considers as inputs the following parameters:
• the number $N$ of shiftable loads to be scheduled;
• the interval $T$, in minutes, which is the duration of a single time-step in the considered scheduling;
• the maximum number of time-steps $D$ admitted for scheduling the loads;
• for each load $i$, the boolean parameter $\sigma_i$, which is its initial setup, i.e., on (1) or off (0);
• for each load $i$, the parameter $\beta_i$, which is the deadline of the load execution, expressed in time-steps;
• for each load $i$, the parameter $\alpha_i$ describing the remaining execution time, expressed in time-steps, once the load is activated;
• for each load $i$, the function $c_i(t_a)$, which is the maximum instantaneous consumption of the load $i$ at the $t_a$ time-step since its activation, expressed in kW. $c_i(t_a)$ is considered 0 before the activation of the load $i$ and after it completes its execution, that is, if $t_a < 0$ or $t_a \geq \alpha_i$;
• for each load $i$, the function $C_i(t_a)$, which is the cumulative consumption of the load $i$ at the $t_a$ time-step since its activation, expressed in kWh. $C_i(t_a)$ is considered 0 before the activation of the load $i$ and after it completes its execution, that is, if $t_a < 0$ or $t_a \geq \alpha_i$. Both $c_i$ and $C_i$ define the consumption profiles of the load $i$;
• the function $p(t)$ is the price per kWh of the energy at time-step $t$;
• the function $e(t)$ is the energy self-produced (e.g., by a PV solar panel) at time-step $t$. It is possible to consider both real and forecasted profiles for energy production. The use of self-produced energy is considered to be free of charge;
• the function $k_{peak}(t)$ is the maximum permitted instantaneous energy consumption, in kW, at time-step $t$. By reducing this function, it is possible to model the presence of not-shiftable or always-on loads.

The output of the approach is a vector $L = [l_1, l_i, \ldots, l_N]$ which gives, for each load $i$, the time-step $t$ at which the load has to be activated.

The approach pursues the goal of finding a schedule that minimizes the cost of executing all the loads:

$$\text{minimize}: C_{tot} \tag{1}$$

where $C_{tot}$ is defined as:

$$C_{tot} = \sum_{t=0}^{D-1} \left( p(t) * \max\left(0, \left(\sum_{i=1}^{N} C_i(t - l_i)\right) - e(t)\right) \right) \tag{2}$$

A schedule should also satisfy the following constraints:

$$D - l_i \geq \alpha_i, \quad \forall i \tag{3}$$

$$\sum_{i=1}^{N} v_{\beta_i} = 0 \tag{4}$$

$$\sum_{i=0}^{D-1} v_{k_i} = 0 \tag{5}$$

where:
• $V_\beta = [v_{\beta_1}, v_{\beta_2}, \ldots, v_{\beta_N}]$ is a vector representing, for each load, its deadline violations expressed in time-steps, i.e., the number of time-steps that exceed the related $\beta_i$ parameter;
• $V_k = [v_{k_1}, v_{k_t}, \ldots, v_{k_D}]$ is a vector of $D$ elements showing, for each time-step $t$, the amount of kW that exceeds the $k_{peak}(t)$ function.

The constraint in Equation (3) guarantees that the loads complete their execution within the $D$ time-steps of the scheduling.

Since we are interested in scheduling the loads by exploiting an anytime heuristic, we also consider as admissible a schedule in which the constraints (4) and (5) do not hold. Thus, the approach will try to obtain a schedule that minimizes the following:

$$\sum_{i=1}^{N} v_{\beta_i} \tag{6}$$
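As a rough illustration of Equations (2) and (3), the following Python sketch evaluates the cost of a candidate schedule. It assumes constant per-step consumption profiles, so that $C_i(t_a)$ is a single value while the load runs and 0 otherwise, and it represents $p(t)$ and $e(t)$ as plain callables. The function names `load_energy`, `total_cost`, and `completes_in_time` are illustrative and are not part of the original system.

```python
from typing import Callable, Sequence


def load_energy(C_i: float, alpha_i: int, ta: int) -> float:
    """C_i(ta) for a constant profile (assumption): 0 before activation and after completion."""
    return C_i if 0 <= ta < alpha_i else 0.0


def total_cost(L: Sequence[int], alpha: Sequence[int], C: Sequence[float],
               p: Callable[[int], float], e: Callable[[int], float], D: int) -> float:
    """Equation (2): pay p(t) only for the demand not covered by self-produced energy."""
    cost = 0.0
    for t in range(D):
        demand = sum(load_energy(C[i], alpha[i], t - L[i]) for i in range(len(L)))
        cost += p(t) * max(0.0, demand - e(t))
    return cost


def completes_in_time(L: Sequence[int], alpha: Sequence[int], D: int) -> bool:
    """Equation (3): every load must finish within the D time-steps."""
    return all(D - l_i >= a_i for l_i, a_i in zip(L, alpha))
```

For instance, `total_cost([1], [2], [1.0], lambda t: 0.1, lambda t: 0.5, 4)` charges one load for two steps, with half of its demand covered by self-production at each step.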
C. The simulated environment and the reward function

The simulated environment is characterized by its state $E_s = \langle H, t, \alpha = \{\alpha_1, \ldots, \alpha_N\}, \beta = \{\beta_1, \ldots, \beta_N\} \rangle$ and by the nextState transition function that, given an action $a$ and a state $E_s$, determines a new state $E_s'$ and a reward $r$. $H_s$ and $t_s$ are made available to the decisor agent. Besides this state information, the environment receives, as constants, the set of parameters specified in Section III-A.

The nextState function is shown in Algorithm 1. Given the action $a$, $H$ is modified so as to activate the loads not yet activated that are requested by the action $a$. Then, $\alpha_i$ is decreased for each activated load and $\beta_i$ is decreased for each not activated load. Finally, the time-step $t$ is increased by 1.

The reward function is computed as described in Algorithm 2. The reward is composed of five contributions. The contribution (A) is the termination reward, which gives a prize when a load completes its execution (i.e., $\alpha_i = 0$). The contribution (B) gives a penalty for each $\beta_i$ constraint that cannot be met, thus favoring schedules that minimize Equation (6). The contribution (C) gives a prize for each active load, thus favoring schedules in which all the loads complete their execution. The contribution (D) gives a penalty when the total instantaneous consumption exceeds the $k_{peak}$ threshold, thus favoring schedules that minimize Equation (7). Finally, the contribution (E) gives a penalty which is proportional to the energy cost at the current time-step, thus favoring schedules that minimize Equation (2). It is worth noting that the contribution (E) takes into account the self-produced energy.

Algorithm 2 The reward function
Const: c, C, p, e, k_peak
Input: H, t, α, β, a
r ← 0
maxPrice ← max(p(t))
maxCost ← k_peak ∗ T ∗ maxPrice
totCumulC ← 0 // current total cumulative consumption
totInstC ← 0 // current total instantaneous consumption
for i ← 1 to N do
    if (α_i = 0) then
        r ← r + 10 // contribution (A)
    end if
    if (α_i > 0 ∧ β_i ≤ 0) then
        r ← r − 1 // contribution (B)
    end if
    if (H_i = 1 ∧ α_i > 0) then
        r ← r + 1 // contribution (C)
        totCumulC ← totCumulC + C_i(t)
        totInstC ← totInstC + c_i(t)
    end if
end for
if (totInstC > k_peak) then
    r ← r − 1 // contribution (D)
end if
r ← r − p(t) ∗ (totCumulC − e(t)) / maxCost // contribution (E)
return r

IV. A SYSTEM FOR ENERGY MANAGEMENT AT THE EDGE

To validate the proposed approach, a prototype system was implemented in the IoT Laboratory at ICAR-CNR (Rende, Italy). The system permits a user to request a schedule of loads in a home environment, then it finds such a schedule by exploiting the proposed approach. The schedule found has to be approved by the user and, if accepted, it is applied to the real loads. Users’ interactions are mediated by a dedicated GUI.

A. Use Case Design

The architecture of the system is depicted in Figure 2. The components shown as ellipses correspond to agents, the rectangles to virtual objects (see Section II-C), and the hexagons to physical devices. Arrows between components highlight communications, and the grey label near an arrow specifies the exchanged data. Such data comprises the parameters explained in Section III-A. A description of the software agents follows.
• Scheduling Manager: it is in charge of gathering all the parameters required for the scheduling to be calculated. When all the parameters are available, it forwards them to the Scheduling Finder and waits to receive the scheduling. Once the scheduling is accepted by the user, a message is sent to the loads containing the time at which they have to be activated;
• Scheduling Finder: it performs the RL algorithm described in Section III-B, thus providing the computed scheduling;
• Energy Manager: it is responsible for providing the energy price per kWh by requesting such data from the smart power grid;
• Energy Production Manager: it is the component that provides the profile of the self-produced energy available in the system. For testing purposes, in order to consider realistic curves, the estimated production profile is derived through the use of a small PV solar panel;
• Mirror Load_i: it manages the Load_i through the related virtual object VO_i, monitors the energy consumption profile of the load, and provides such information to the Scheduling Manager when needed. It is also responsible for turning on the load at the correct time, as requested by the Scheduling Manager;
• Mirror PV panel: by using the virtual object VO_PV, it monitors the energy produced by the PV solar panel and forwards such information to the Energy Production Manager.

[Figure 2: architecture of the system. Recoverable labels: GUI, Energy Manager, Energy Production Manager, Mirror PV Panel, VO_PV, PV panel; exchanged data include p(t), e(t), N, T, D, σ_i, α, β, c, C, k.]

B. Use Case Realization

The software components described above are deployed on two edge nodes, each consisting of a Raspberry Pi 3 hosting the COGITO platform. More in detail, the first node is directly connected to all the physical devices, and hosts all the Mirror
agents, the related virtual objects, the Scheduling Manager, the Energy Production Manager, and the Energy Manager. The second node hosts only the Scheduling Finder, which is the most computationally demanding software component in the system. The virtual objects interact with the physical devices through the MQTT protocol.

For the case study, the loads are realized with six different bulbs, each characterized by a different energy profile. Each load is controlled by a POWR2 sonOff Smart Switch (https://sonoff.tech/product/wifi-diy-smart-switches/powr2) enhanced with the TASMOTA open source firmware (https://tasmota.github.io/docs/), which is also used to monitor the energy consumption of each load. The hardware components described so far have been deployed on a purpose-built demo panel (see Figure 3). The panel presents some plugs that can host not-shiftable or always-on loads. A further sonOff has been added to the panel to monitor all the loads together (both schedulable and always-on) and to take into account the presence of possible leakage current. This sonOff emulates a standard home electric meter.

A further hardware component realized for the case study is the PV solar panel used to estimate the energy production profile. The panel is installed on a window of the laboratory and is shown in Figure 3c.

C. Experimental Results

Some preliminary results are shown in Figure 4. The results refer to a scenario in which a scheduling has been requested at time-step $t = 60$ with the following parameters:
• $D = 96$
• $N = 6$
• $T = 15$ min
• $\alpha = \{5, 2, 7, 2, 2, 2\}$
• $\beta = \{10, 11, 12, 6, 6, 6\}$
• $C = \{C_1, C_2, C_3, C_4, C_5, C_6\} = \{0.5, 0.7, 1.8, 0.5, 0.8, 1.2\}$, in kW per time unit $T$. We assume that each load has a constant power consumption during its execution
• $c = \{c_1, c_2, c_3, c_4, c_5, c_6\} = \{0.5, 0.7, 1.8, 0.5, 0.8, 1.2\}$, in kW. We assume that each load has a constant maximum peak consumption during its execution
• $k_{peak} = 2$, in kW
• $p(t)$, price per kW per time unit $T$, defined as:

$$p(t) = \begin{cases} 0.093, & 32 < t < 76 \\ 0.053, & \text{otherwise} \end{cases} \tag{10}$$

• $e(t)$, self-produced energy in kW per time unit $T$, defined as:

$$e(t) = \begin{cases} 0.3, & 32 < t < 48 \lor 64 < t < 72 \\ 0.8, & 48 < t < 64 \\ 0.0, & \text{otherwise} \end{cases} \tag{11}$$

This function has been obtained by rounding the production profile observed from the PV solar panel.
• $\epsilon = 0.001 + (0.25 - 0.001) * e^{-0.008 * epnumber}$: the probability of choosing a random action following the $\epsilon$-greedy policy decreases exponentially as the episode number $epnumber$ increases.
• $\delta = 0.1 + (0.4 - 0.1) * e^{-0.008 * epnumber}$: the learning rate for the Q-learning function decreases similarly to the $\epsilon$ probability.
• $\gamma = 0.99$, the discount factor for the Q-learning function.

Figure 4 shows, respectively, how the rewards, the number of peak violations, the number of beta violations, and the total scheduling cost evolve during the training of the scheduler over 500 episodes. More in detail, Figure 4a shows, for each episode, the cumulative reward obtained. The reward is about 70 after 200 episodes. Later on, the peak violations are eliminated (see Figure 4b) and the algorithm continues to try to learn how to reduce the beta violations (see Figure 4c). The minimum of the beta violations, reached at about episode 430, causes an increment in the total cost (see Figure 4d), but a slight increment in the total reward.
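The decay schedules for $\epsilon$ and $\delta$ and the role of $\gamma$ can be sketched as follows. This is a minimal tabular Q-learning illustration using the standard update rule, not the authors' implementation: the state and action encodings, the `Q` table layout, and the helper names `choose_action` and `q_update` are assumptions made only for the example.

```python
import math
import random
from collections import defaultdict

GAMMA = 0.99  # discount factor reported in the parameter list


def epsilon(ep: int) -> float:
    """Exploration probability at episode ep."""
    return 0.001 + (0.25 - 0.001) * math.exp(-0.008 * ep)


def delta(ep: int) -> float:
    """Learning rate at episode ep."""
    return 0.1 + (0.4 - 0.1) * math.exp(-0.008 * ep)


def choose_action(Q, state, actions, ep, rng):
    """Epsilon-greedy choice over a hypothetical list of admissible actions."""
    if rng.random() < epsilon(ep):
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, s, a, r, s_next, next_actions, ep):
    """Standard Q-learning update with the episode-dependent learning rate."""
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    Q[(s, a)] += delta(ep) * (r + GAMMA * best_next - Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
```

Both schedules start at their maximum (0.25 and 0.4) at episode 0 and approach their floors (0.001 and 0.1) as training proceeds, so late episodes are almost entirely greedy.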
The convergence is guaranteed since both $\epsilon$ and $\delta$ decrease exponentially; thus, after a certain number of episodes, the algorithm stops exploring new solutions.

The schedule provided by the system corresponds to the following array: $L = \{l_1, l_2, l_3, l_4, l_5, l_6\} = \{64, 66, 75, 64, 61, 69\}$, which, as stated in Section III-A, represents for each load $i \in \{1, \ldots, 6\}$ its activation time-step.

V. CONCLUSIONS

The paper proposed an IoT edge-based energy management system devoted to minimizing the energy cost for the daily use of in-home appliances. For this purpose, a scheduling problem formulation was first given, then reduced to a reinforcement learning problem modeled through a Markov decision process in an agent-environment scenario. The given MDP, the environment, and the defined reward function were capable of taking into account users’ preferences in load execution deadlines along with time-variable profiles for energy cost, energy production, and energy consumption for each shiftable appliance. The realized test case confirmed the feasibility of the proposed approach in an edge-based environment. The preliminary experimental results have shown the effectiveness of the exploited scheduling algorithm.

Future work is geared at:
• improving the approach by considering energy storage facilities;
• trying other reinforcement learning approaches, such as Deep Reinforcement Learning, to cope with a broader number of loads;
• extending the formulation of the scheduling problem by considering other aspects like load priority and the possibility of selling self-produced energy and/or buying energy from different suppliers;
• adapting the approach to a more complex smart grid context, trying to coordinate agents in different houses to optimize the energy costs in a whole district;
• integrating the proposed system in a test case that considers synergies with other edge-based home applications regarding safety, security, and thermal comfort.

ACKNOWLEDGMENT

This work has been partially supported by the “COGITO” project, funded by the Italian Government (ARS01 00836), and by the “GLAMOUR” project, funded by POR Calabria FESR-FSE, Italy (CUP: J28C17000250006).

REFERENCES

[1] P. Palensky and D. Dietrich, “Demand side management: Demand response, intelligent energy systems, and smart loads,” IEEE Transactions on Industrial Informatics, vol. 7, no. 3, pp. 381–388, Aug 2011.
[2] F. Fioretto, W. Yeoh, and E. Pontelli, “A multiagent system approach to scheduling devices in smart homes,” in Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[3] J. Ploennigs, A. Ba, and M. Barry, “Materializing the promises of cognitive iot: How cognitive buildings are shaping the way,” IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2367–2374, 2017.
[4] F. Cicirelli, G. Fortino, A. Guerrieri, G. Spezzano, and A. Vinci, “Metamodeling of smart environments: from design to implementation,” Advanced Engineering Informatics, vol. 33, pp. 274–284, 2017.
[5] G. Belli, A. Giordano, C. Mastroianni, D. Menniti, A. Pinnarelli, L. Scarcello, N. Sorrentino, and M. Stillo, “A unified model for the optimal management of electrical and thermal equipment of a prosumer in a dr environment,” IEEE Transactions on Smart Grid, vol. 10, no. 2, pp. 1791–1800, 2017.
[6] N. Amjady and M. Hemmati, “Energy price forecasting - problems and proposals for such predictions,” IEEE Power and Energy Magazine, vol. 4, no. 2, pp. 20–29, 2006.
[7] D. Zhang, X. Han, and C. Deng, “Review on the research and practice of deep learning and reinforcement learning in smart grids,” CSEE Journal of Power and Energy Systems, vol. 4, no. 3, pp. 362–370, 2018.
[8] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE internet of things journal, vol. 3, no. 5, pp. 637–646, 2016.
[9] F. Cicirelli, A. Guerrieri, A. Mercuri, G. Spezzano, and A. Vinci, “Itema: A methodological approach for cognitive edge computing iot ecosystems,” Future Generation Computer Systems, vol. 92, pp. 189–197, 2019.
[10] K. Shahryari and A. Anvari-Moghaddam, “Demand side management using the internet of energy based on fog and cloud computing,” in 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData). IEEE, 2017, pp. 931–936.
[11] L. Y-H and H. Y-C., “Residential consumer-centric demand-side management based on energy disaggregation-piloting constrained swarm intelligence: Towards edge computing,” Sensors, vol. 18, no. 5, p. 1365, 2018.
[12] T. Li, Y. Xiao, and L. Song, “Deep reinforcement learning based residential demand side management with edge computing,” in 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm). IEEE, 2019, pp. 1–6.
[13] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[14] F. Cicirelli, A. Guerrieri, G. Spezzano, and A. Vinci, “A Cognitive Enabled, Edge-Computing Architecture for Future Generation IoT Environments,” in Proceeding of the IEEE 5th World Forum on Internet of Things, Limerick, Ireland, 2019.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
[16] F. Cicirelli, A. Guerrieri, C. Mastroianni, G. Spezzano, and A. Vinci, “Thermal comfort management leveraging deep reinforcement learning and human-in-the-loop,” in Accepted for the Proc. of the 1st IEEE International Conference on Human-Machine Systems (ICHMS2020), 2020.
[17] S. Das and D. Cook, “Designing and modeling smart environments,” in World of Wireless, Mobile and Multimedia Networks, 2006. WoWMoM 2006. International Symposium on a, Buffalo-Niagara Falls, NY, 2006, pp. 5 pp.–494.
[18] F. Cicirelli, A. Guerrieri, G. Spezzano, A. Vinci, O. Briante, A. Iera, and G. Ruggeri, “Edge computing and social internet of things for large-scale smart environments development,” IEEE Internet of Things Journal, no. 99, 2017.
[19] A. K. Noor, “Potential of cognitive computing and cognitive systems,” Open Engineering, vol. 5, no. 1, 2015.
[20] M. Wooldridge, An introduction to multiagent systems. John Wiley & Sons, 2009.
[21] N. Chauhan, N. Choudhary, and K. George, “A comparison of reinforcement learning based approaches to appliance scheduling,” in 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), 2016, pp. 253–258.
[22] T. Remani, E. Jasmin, and T. I. Ahamed, “Residential load scheduling with renewable generation in the smart grid: A reinforcement learning approach,” IEEE Systems Journal, vol. 13, no. 3, pp. 3283–3294, 2018.
[23] D. O’Neill, M. Levorato, A. Goldsmith, and U. Mitra, “Residential demand response using reinforcement learning,” in 2010 First IEEE international conference on smart grid communications. IEEE, 2010, pp. 409–414.
[24] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, MA, USA, 2011.
Abstract—In this paper, we propose a safety-driven cost-effective scheduling controller to arbitrate and operate the energy resources of a maritime hybrid energy application. The proposed control algorithm enables efficient energy management by dynamically scheduling the energy sources to supply the real-time power requests so that we 1) maintain the system safety by not overloading or heating up an energy source, and 2) reduce the operation cost by considering the cheapest energy source in a real-time manner. The efficiency and safety of our scheduling algorithm have been examined using the Uppaal model checker. The experiment outputs show that our controller maintains the system safety and guarantees the lowest operation cost.

Index Terms—Safety control, real-time scheduling, hybrid-energy systems, formal verification.

I. INTRODUCTION

The fuel cell (FC) is a green technology for generating energy from hydrogen. It is usually paired with reversible energy storage devices such as batteries and ultracapacitors to create hybrid electric power supply solutions, known for short as FC-hybrid power systems [11]. FC-based systems have very strict safety requirements due to the existence of hazardous phenomena such as the explosion of hydrogen tanks and batteries, melting of fuel cells, storage risks, etc. The complexity and criticality of such systems make the underlying real-time control complex, as it has to consider different constraints related to safety and reliability [1], [2], operation cost [24] and performance [6]. To comply with the safety requirements imposed by the international regulations and standards, FC-hybrid energy systems are subject to a rigorous and absolute guarantee verification and validation prior to deployment [24].

The safety and energy efficiency of FC-based systems have been studied for different transport applications [6], [16], [21] such as road vehicles [3], buses [18] and trams [25]; however, only few attempts consider both metrics (safety, operation cost) together when designing controllers for FC-hybrid applications. Making safety the main driving property in the design of safety-critical systems may lead to expensive operation cost and inefficient utilization of the energy resources. This paper introduces a safe controller for cost-effective real-time scheduling of the energy sources of a hybrid fuel cell-based power system (HFC).

The proposed real-time controller defines a compromise between safety and operation cost to supply the real-time energy requests from propulsion motors, so that the operation cost is reduced as low as the safety permits. Runtime scheduling decisions are made with respect to the actual system state, such as device temperature and available energy reserves. To deliver an absolute guarantee about the system safety, we examine our runtime scheduler using the Uppaal model checker.

II. SYSTEM ARCHITECTURE

In general, a zero-emission ferry power system with fuel cells (as a main source of the ship power), power electronic devices (as interfaces for renewable energy systems) and loads (like ship motor(s) and navigation system(s)) can be considered as a special mobile islanded DC microgrid [12]. The hybrid energy system we consider is formed by a composition of two subsystems: FC-based and battery-based, as depicted in Fig. 1. The fuel cell-based subsystem consists of three Proton-Exchange Membrane Fuel Cell units (FC for short) having different capacities (300 kWh, 300 kWh, 100 kWh). These units can operate collaboratively to inject energy into the power bus depending on the load request and the subsystem configuration, such as temperature, operation cost, etc. The battery-based system is formed by two lithium batteries (200 kWh each) operating in a similar way to the FC units; however, batteries can also extract energy from the power bus for self-recharging. The two energy subsystems have different safety and performance characteristics as well as operation cost.

The control architecture of our HFC system is given as a hierarchical scheduling system [8]. An individual controller is dedicated to operate each of the two energy subsystems locally (FC-Ctrl, B-Ctrl). Moreover, a top level controller (H-Ctrl) is considered as the main energy management system coordinating the different subsystems. Each subsystem can interact in two ways: send alarm events to the main controller and receive commands from the main controller through the command bus, or inject energy into the DC network. The battery-based system can also receive energy from the DC network.

The top level controller H-Ctrl receives power requests from the load (propulsion engines) as signals via port R, real-time cost for the operation of both battery and FC via port C, and the internal configuration of each subsystem such as the amount of energy left in each of the storage units and

978-1-7281-7343-6/20/$31.00 ©2020 IEEE
checks whether there is a battery unit that is able to provide the totality of the requested amount by comparing the state of charge and draining that the request would make. If neither of the two battery units is able to provide the requested energy individually, we consider collaborative scheduling [10], [19] where both battery units supply together two sub-requests, respectively $R_1$ and $R_2$ such that $e = e_1 + e_2$.

B. Safety Check

This process analyzes whether a candidate energy unit would violate its maximum temperature in case it is used to satisfy an energy request. To this end, we calculate what the temperature would be after satisfying a request. If the maximum temperature of a candidate energy source unit is not violated, then the device proceeds to the cost analysis; otherwise the energy source unit is discarded for supplying the totality of the requested energy and a collaborative scheduling has to be considered. Formally, we define a function $Safe(F_i, R)$ to check the safety of an FC unit as follows:

$$Safe(F_i, R) \doteq Temp(F_i, t') < Max\_Temp_i$$

Accordingly, the fuel cell based subsystem is safe if and only if:

$$Safe(FC\_Ctrl, R) \doteq \exists i \mid Safe(F_i, R) \;\; Or \;\; \forall i \; Safe(F_i, R_i)$$

such that $R = \sum_i R_i$. Similarly, we define the temperature safety property of a battery unit regarding a power request as follows:

$$Safe(B_j, R) \doteq Temp(B_j, t') < Max\_Temp_j$$

In case a battery unit is not able to safely satisfy a power request $R$, we consider a collaborative supply where both battery units contribute. We define accordingly the safety of the battery-based subsystem as follows:

$$Safe(B\_Ctrl, R) = \exists j \mid Safe(B_j, R) \;\; Or \;\; Safe(B_1, R_1) \land Safe(B_2, R_2)$$

where $R = R_1 + R_2$.

The contribution percentage of the two battery units, when operating under collaborative mode, is calculated as stated in Algorithm 1. The identified percentages must satisfy both the safety and the availability check. In case both subsystems succeed in the safety check, we consider the related operation cost.

C. Cost Calculation

The cost analysis step amounts to calculating the operation cost of each energy subsystem when satisfying a given energy request, using real-time energy rates. The operation cost is simply obtained by multiplying the total resource amount to be supplied by the respective energy unit price. We define the cost of the FC_Ctrl system to satisfy a request $R = \langle e, [t, t'] \rangle$ by the volume of hydrogen to consume multiplied by the unit price of hydrogen $U(H_2, t)$:

$$Cost(FC\_Ctrl, R) = v(R) * U(H_2, t)$$

In a similar way, we calculate the budget of using the battery-based system by the amount of energy to inject multiplied by the actual unit price of charging $U(E, t)$:

$$Cost(B\_Ctrl, R) = drain(B_1, R) * U(E, t)$$

Since our battery units are identical, we use the unit price of battery $B_1$ no matter which battery unit is actually selected by B_Ctrl. In case both energy subsystems have the same operation cost to satisfy a power request, we prioritize the FC-based subsystem due to its fast tanking operation.

D. Collaborative Scheduling

As mentioned earlier, if none of the energy subsystems successfully passes the availability and safety check to serve a power request individually, we consider a collaborative scheduling mode where both energy subsystems contribute with different rates. The individual contribution of each subsystem needs to be approved for availability and safety.

Collaborative scheduling [10] is achieved by a binary process carried out in an iterative way following the Dichotomy method [23]. For a given power request $R$, the start point of the iterative process consists in checking whether each subsystem is able to supply $R/2$. By ability we mean that a subsystem has enough reserve to serve a given request safely. In case one of the subsystems fails the ability check, we reduce its contribution by half, i.e., to $R/4$, and so on until we find a contribution rate $x$ the given subsystem is able to provide while the other subsystem is able to provide $R - x$ safely. Otherwise, when both subsystems are each able to provide $R/2$, the iterative process considers increasing the contribution of the resource having the cheaper cost. We keep increasing the contribution $x$ of the cheap subsystem on each iteration, by 50% of the value added on the last iteration, until it is no longer able to secure the contribution rate $x$. The final value of $x$ is most likely the contribution rate leading to optimal operation cost. An abstraction of the algorithm is depicted in Algorithm 1.

E. Safety and Operation Cost Analysis

To perform a formal verification of the safety properties, we mechanize the system model in Uppaal. In fact, we use symbolic model checking to examine safety properties, whereas statistical model checking (SMC) is used for quantitative analysis [9]. An example of a safety property is

$$S_1 \doteq \forall\, i, t \;\; Temp(F_i, t) \leq Max\_Temp_i$$

In fact, each safety property is examined using Uppaal as follows: ∀[] S_i. In a similar way, the following Uppaal SMC queries are used respectively to perform quantitative analysis of the batteries' SoC and of the operation cost:

E[time≤1e6; 1000] (max:SoC_j); E[time≤1e6; 1000] (max:cost).

Each query specifies that SMC runs 1000 simulations, each of which lasts for one million time units.

V. CONCLUSION

This paper presented a safety-driven cost-effective controller to schedule the energy sources of a hybrid energy solution. The solution considered is formed by a set of batteries and fuel cell units to power a maritime application. The proposed algorithm operates the energy supply units following the real-time energy demand, safety constraints and operation cost.

To deliver an absolute guarantee about the system safety, we examine our runtime scheduler using the Uppaal model checker. As future work, we plan to investigate optimal decision making strategies that are able to follow the HFC system dynamics in real-time.
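The halving procedure described in the Collaborative Scheduling subsection (D) can be sketched in Python as follows. This is only one possible reading of the prose: the ability checks are passed in as hypothetical callables combining availability and safety, the first increment for the cheaper subsystem is assumed to be $R/4$ (half of the initial $R/2$ share), and `max_iter` is an added safeguard that the text does not mention.

```python
from typing import Callable, Optional, Tuple

Able = Callable[[float], bool]  # hypothetical combined availability-and-safety check


def _shrink(R: float, failing: Able, partner: Able, max_iter: int) -> Optional[float]:
    """Halve the failing subsystem's share (R/4, R/8, ...) until both sides cope."""
    share = R / 2
    for _ in range(max_iter):
        share /= 2
        if failing(share) and partner(R - share):
            return share
    return None


def collaborative_split(R: float, cheap: Able, other: Able,
                        max_iter: int = 20) -> Optional[Tuple[float, float]]:
    """Return (share of the cheaper subsystem, share of the other), or None."""
    half = R / 2
    if cheap(half) and other(half):
        # Both can serve R/2: grow the cheaper share by halved increments.
        x, step = half, R / 4  # assumed first increment
        for _ in range(max_iter):
            if not (cheap(x + step) and other(R - x - step)):
                break
            x += step
            step /= 2
        return x, R - x
    if not cheap(half):
        share = _shrink(R, cheap, other, max_iter)
        return None if share is None else (share, R - share)
    share = _shrink(R, other, cheap, max_iter)
    return None if share is None else (R - share, share)
```

Stopping at the first rejected increment mirrors the description "until it is no longer able to secure the contribution rate"; a finer search between the last accepted and the first rejected value would be a different design choice.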
Algorithm 1 Sketch of the Scheduling Algorithm
foreach NewEvent X do
    Q⟨e_q, [t, t_q]⟩ = Left_Last_Request();
    if X = R(⟨e, [t, t′]⟩) then
        Scheduler(R);
    end
    if X = Max_Temp(F_i) then
        Cooling(F_i);
        Scheduler(⟨e_q + e_c, [t, t_q]⟩)
    end
    if X = Max_Temp(B_j) or X = Min_H_Volume() then
        Scheduler(Q);
    end
    if X = Min_SoC(B_j) then
        Charge(B_j, ⟨e_min, [t, t_q]⟩);
        Scheduler(⟨e_q + e_min, [t, t_q]⟩);
    end
    if X = Max_SoC(B_j) then
        Scheduler(⟨e_q − e_charge, [t, t_q]⟩);
    end
end
Scheduler(Z)
if Candidate(FC_Ctrl, Z) and Candidate(B_Ctrl, Z) then
    if Safe(FC_Ctrl, Z) and Safe(B_Ctrl, Z) then
        if Cost(FC_Ctrl, Z) ≤ Cost(B_Ctrl, Z) then
            Operate(FC_Ctrl);
            Return(True);
        end
        else
            Operate(B_Ctrl);
            Return(True);
        end
    end
    else
        Case Safe(FC_Ctrl, Z): Operate(FC_Ctrl);
        Case Safe(B_Ctrl, Z): Operate(B_Ctrl);
        Default: CollaborativeSched(Z);
    end
end
else
    Case Candidate(FC_Ctrl, Z) and Safe(FC_Ctrl, Z): Operate(FC_Ctrl);
    Case Candidate(B_Ctrl, Z) and Safe(B_Ctrl, Z): Operate(B_Ctrl);
    Default: CollaborativeSched(Z);
end
CollaborativeSched(Z)
while False do
    x = Z;
    if Sched(x/2) ∧ Sched(x/2) then
        True;
    end
    else
        Sched(x + x/2) ∧ Sched(x − x/2);
        x ↦ x/2;
    end
end

REFERENCES
[1] P. Aguiar, C. Adjiman, and N. Brandon. Anode-supported intermediate temperature direct internal reforming solid oxide fuel cell. I: model-based steady-state performance. Journal of Power Sources, 138(1):120–136, 2004.
[2] S. Alkaner and P. Zhou. A comparative study on life cycle analysis of molten carbon fuel cells and diesel engines for marine application. Journal of Power Sources, 158(1):188–199, 2006.
[3] W. Andari, S. Ghozzi, H. Allagui, and A. Mami. Optimization of hydrogen consumption for fuel cell hybrid vehicle. Indian Journal of Science and Technology, 11(2), 2018.
[4] N. Bigdeli. Optimal management of hybrid pv/fuel cell/battery power system: A comparison of optimal hybrid approaches. Renewable and Sustainable Energy Reviews, 42:377–393, 2015.
[5] N. Bizon. Energy optimization of fuel cell system by using global extremum seeking algorithm. Applied Energy, 206:458–474, 2017.
[6] N. Bizon. Real-time optimization strategy for fuel cell hybrid power sources with load-following control of the fuel or air flow. Energy Conversion and Management, 157:13–27, 2018.
[7] J.-P. Bodeveix, A. Boudjadar, and M. Filali. An alternative definition for timed automata composition. In Automated Technology for Verification and Analysis, pages 105–119, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[8] A. Boudjadar, A. David, J. H. Kim, K. G. Larsen, M. Mikucionis, U. Nyman, and A. Skou. Hierarchical scheduling framework based on compositional analysis using uppaal. In 10th International Symposium on Formal Aspects of Component Software - Volume 8348, FACS 2013, pages 61–78. Springer-Verlag, 2013.
[9] J. Boudjadar, A. David, J. Kim, K. Larsen, U. Nyman, and A. Skou. Schedulability and energy efficiency for multi-core hierarchical scheduling systems. In European Congress on Embedded Real Time Systems, 2014.
[10] J. Boudjadar, S. Ramanathan, A. Easwaran, and U. Nyman. Combining task-level and system-level scheduling modes for mixed criticality systems. In 23rd IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, DS-RT 2019, 2019.
[11] V. Das, S. Padmanaban, K. Venkitusamy, R. Selvamuthukumaran, F. Blaabjerg, and P. Siano. Recent advances and challenges of fuel cell based power system architectures and control - a review. Renewable and Sustainable Energy Reviews, 73:10–18, 2017.
[12] M. Gheisarnejad, H. Mohammadi-Moghadam, J. Boudjadar, and M. H. Khooban. Active power sharing and frequency recovery control in an islanded microgrid with nonlinear load and nondispatchable dg. IEEE Systems Journal, 14(1):1058–1068, 2020.
[13] N. Herr, J. Nicod, and C. Varnier. Prognostics-based scheduling in a distributed platform: Model, complexity and resolution. In 2014 IEEE International Conference on Automation Science and Engineering (CASE), pages 1054–1059, 2014.
[14] N. Herr, J.-M. Nicod, C. Varnier, L. Jardin, A. Sorrentino, D. Hissel, and M.-C. Pra. Decision process to manage useful life of multi-stacks fuel cell systems under service constraint. Renewable Energy, 105:590–600, 2017.
[15] A. Neffati, M. Guemri, S. Caux, and M. Fadel. Energy management strategies for multi source systems. Electric Power Systems Research, 102:42–49, 2013.
[16] F. Odeim, J. Roes, and A. Heinzel. Power management optimization of an experimental fuel cell/battery/supercapacitor hybrid system. Energies, 8(7):6302–6327, 2015.
[17] L. Olatomiwa, S. Mekhilef, M. Ismail, and M. Moghavvemi. Energy management strategies in hybrid renewable energy systems: A review. Renewable and Sustainable Energy Reviews, 62:821–835, 2016.
[18] J. Peng, H. He, and R. Xiong. Rule based energy management strategy for a series-parallel plug-in hybrid electric bus optimized by dynamic programming. Applied Energy, 185:1633–1643, 2017. Clean, Efficient and Affordable Energy for a Sustainable Future.
[19] A. Toor, S. ul Islam, N. Sohail, A. Akhunzada, J. Boudjadar, H. A. Khattak, I. U. Din, and J. J. Rodrigues. Energy and performance aware fog computing: A case of dvfs and green renewable energy. Future Generation Computer Systems, 101:1112–1121, 2019.
[20] T. Tronstad, H. H. Astrand, G. P. Haugom, and L. Langfeldt. Study on the use of fuel cells in shipping. EMSA European Maritime Safety Agency, pages 1–108, 2017.
[21] N. Vafamand, M. H. Khooban, T. Dragičević, J. Boudjadar, and M. H. Asemani. Time-delayed stabilizing secondary load frequency control of shipboard microgrids. IEEE Systems Journal, 13(3):3233–3241, 2019.
[22] L. van Biert, M. Godjevac, K. Visser, and P. Aravind. A review of fuel cell systems for maritime applications. Journal of Power Sources, 327:345–364, 2016.
[23] T. Villa, T. Kam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Synthesis of Finite State Machines: Logic Optimization. 2011.
[24] J. Wang. Barriers of scaling-up fuel cells: Cost, durability and reliability. Energy, 80:509–521, 2015.
[25] W. Zhang, J. Li, L. Xu, and M. Ouyang. Optimization for a fuel cell/battery/capacity tram with equivalent consumption minimization strategy. Energy Conversion and Management, 134:59–69, 2017.
Abstract: The increasing scale and complexity of scientific applications are rapidly transforming
the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC)
community. Big data analytics and deep learning are gaining traction as essential components in
this ecosystem in a variety of scenarios, such as steering of experimental instruments,
acceleration of high-fidelity simulations through surrogate computations, and guided ensemble
searches. In this context, the batch job model traditionally adopted by the supercomputing
infrastructures needs to be complemented with support to schedule opportunistic on-demand
analytics jobs, leading to the problem of efficient preemption of batch jobs with minimum loss of
progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly
analysis of the trade-offs arising between delaying the start of opportunistic on-demand jobs,
which leads to longer analytics latency, and loss of progress due to preemption of batch jobs,
which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on
dynamic programming with predictable performance and scalability that enables supercomputing
infrastructure schedulers to analyze the aforementioned trade-off and take decisions in near
real-time. Compared with other state-of-the-art approaches using traces of the Theta pre-Exascale
machine, our approach is capable of finding the optimal solution, while achieving high
performance and scalability.
Index Terms: High-performance computing, batch job preemption, job checkpointing

I. INTRODUCTION
Big data analytics and deep learning are rapidly gaining traction both in industry and in
scientific computing. A key driver for this trend has been the unprecedented accumulation of big
data, which exposes plentiful learning opportunities thanks to its massive size and variety.
Unsurprisingly, there has been significant interest in adopting deep learning at a very large
scale on supercomputing infrastructures in a wide range of scientific areas, e.g., fusion energy
science [1], computational fluid dynamics [2], lattice quantum chromodynamics [3], virtual drug
response prediction [4], and cancer research [5].
One of the main use cases of big data analytics and deep learning in scientific computing is to
use them as a tool to complement high-performance computing (HPC) simulations running on
supercomputing infrastructures in a variety of scenarios: steering of experimental instruments
(e.g., calibrate scientific instruments in real-time to correct anomalies in experimental data
and/or refocus dynamically on areas of interest), acceleration of high-fidelity simulations
through surrogate computations (e.g., under the right circumstances, expensive steps of an HPC
simulation can be replaced with faster deep learning predictions), and guided ensemble searches
(e.g., when running a set of simulations to find a molecule that docks to a protein, deep
learning can be used to predict the most promising simulations to try next).
These scenarios require running opportunistic on-demand jobs when certain conditions are
triggered, e.g., an analytics job that looks for anomalies in the experimental data collected by
the instrument, or a deep learning training and/or inference job. These jobs need to start within
a given deadline, often on the order of minutes. Failure to start them by the given deadline
leads to a missed opportunity (e.g., it’s too late to calibrate the instrument) and/or incurs a
performance penalty (e.g., idle simulations that wait for the next deep learning prediction or
otherwise take alternative suboptimal decisions). On the other hand, HPC datacenters
traditionally adopt a batch job scheduling model where users request compute and accelerator
(e.g., GPU) resources of the datacenters for the required amount of time (wall time), while the
scheduler decides when to run each job based on various trade-offs such as the need to maximize
the utilization of machines, and job priority. Popular HPC datacenter schedulers, e.g., SLURM
(Simple Linux Utility Resource Manager) [6], COBALT [7], and TORQUE (Terascale Open-source
Resource and QUEue manager) [8], cannot co-schedule batch jobs with opportunistic on-demand jobs.
A naive solution to address this problem could simply reserve a set of nodes for on-demand jobs
and use the rest of the nodes for batch jobs. Although applied in practice, such a solution is
not desirable as it is hard to predict how many nodes are needed by the on-demand jobs. Using too
few nodes for on-demand jobs leads to missed opportunities, whereas using too many nodes leads to
idle nodes and slow progress of the batch jobs. Furthermore, even if such predictions were
perfect, there may be significant fluctuations in datacenter utilization patterns that make it
hard to dynamically move the nodes back and forth between on-demand and batch queues. At the
other extreme, an alternative naive solution could simply use all nodes for batch jobs and start
killing batch jobs to make room for on-demand jobs when needed. This solution does not lead to
missed opportunities but may incur significant overhead on the batch jobs due to loss of
progress.
In this paper, we propose an alternative solution to address these challenges that relies on
checkpointing for suspending and resuming batch jobs, if required, to make room for
time-sensitive on-demand jobs, thereby minimizing the amount of progress lost by the batch jobs.
To this end, we introduce CoSim, a simulation framework that aims to identify the optimal
combination of jobs that should be either checkpointed or killed to free a fixed number of nodes
that are required to run the on-demand job. Unlike other approaches, CoSim simulates all outcomes
resulting from a variable deadline up to the given maximum in a single pass, thereby eliminating
the need to run separate simulations for each fixed deadline. Using this approach, the scheduler
can make more informed decisions by considering the various trade-offs arising from delaying the
start of the on-demand jobs and losing progress on the batch jobs. Specifically, we make the
following contributions in this paper:
• We formulate the problem statement, introducing a series of assumptions and general
  considerations for simulating the outcomes of checkpointing batch jobs to vacate nodes for
  running on-demand jobs (Section II).
• We introduce a series of design principles and an algorithm based on dynamic programming to
  find the optimal combination of batch jobs that incurs the minimum loss of progress while
  satisfying the deadline of the on-demand jobs. Our algorithm produces an optimal solution for
  every possible deadline up to a given maximum in a single pass (Section III).
• We evaluate our approach in a series of experiments using three scenarios extracted from the
  batch job traces of Argonne’s Theta pre-Exascale machine. We compare our approach with an
  exhaustive search based on backtracking and a greedy approach. The results show significant
  performance and scalability improvement as compared to backtracking, as well as a significant
  improvement in the quality of the solution as compared to the greedy approach (Section IV).

II. PROBLEM FORMULATION
The problem of co-scheduling batch jobs with opportunistic on-demand jobs in an HPC datacenter
can be formulated as follows. Let’s assume that N batch jobs are running at time t0, and each of
these jobs is characterized by the tuple ⟨id, jn, loss, tckpt⟩, where id is a unique identifier
of the job, jn is the number of compute nodes the batch job is running on, loss quantifies the
amount of lost progress if the job is killed (e.g., node-hours since the last checkpoint or since
the beginning if no checkpoint was taken), and tckpt is the time required to checkpoint the job
to successfully suspend its execution without loss of progress.
Given K nodes that need to be released not later than t0 + T (for the purpose of starting
opportunistic on-demand jobs), the goal is to find all optimal subsets of batch jobs S_i ⊂ N and
the corresponding killing or checkpointing strategy for every deadline t0 + i with 0 < i < T. A
subset S_i is optimal if it satisfies the following properties simultaneously: (1) at least K
nodes are released by the deadline t0 + i; (2) the accumulated loss of work due to killing jobs
is minimized; (3) if there are multiple subsets S_i for which the accumulated loss due to job
kills is minimized, then prefer the subset for which the checkpointing overhead is minimized. We
refer to this as the eviction problem.
Using the subsets S_i and corresponding strategy, an HPC datacenter scheduler can simulate the
outcome of multiple hypothetical scenarios with a variable deadline i corresponding to the
trade-off between maximizing the value of the on-demand jobs and minimizing the loss of the batch
jobs.
We note that while we formulate the problem for HPC datacenters, a similar formulation can be
done for opportunistic jobs in cloud computing architectures where there is an upper bound on
elasticity, e.g., the user cannot afford to run on more than a fixed number of virtual machines
(VMs) at a time and must evict existing jobs if necessary. Without loss of generality, CoSim can
be applied in such scenarios as well.

III. DESIGN PRINCIPLES AND APPROACH
This section introduces the high-level design principles of our proposed approach and explains
aspects related to the checkpointing model and exploration algorithm that implement these design
principles.

A. Design principles
CoSim is based on the following design principles:
1) Mix of system-level and application-level checkpointing: We differentiate between system-level
and application-level checkpointing, because they present an interesting trade-off: system-level
checkpointing techniques, such as DMTCP [9], are application-agnostic and can be performed at any
moment t0. However, they involve large checkpoint sizes because the entire memory space of all
application processes needs to be persisted to a stable storage, e.g., a parallel file system
(PFS). Therefore, system-level checkpointing may take a long time to complete. On the other hand,
application-level checkpointing is typically performed by HPC applications regularly using either
a custom solution or a checkpointing library, such as VELOC [10]. In this case, the checkpoint
size is smaller, as each application process needs to save only the critical data structures
needed for a restart, and is therefore faster to write to the stable storage. However, it is
necessary to wait for the application to reach a moment t1 > t0 when it is safe to checkpoint.
Depending on how far away t1 is from t0 and how much larger a system-level checkpoint is compared
with an application-level checkpoint, one or the other may be faster. Furthermore, it is
important to note that even if the system-level and application-level checkpointing overheads are
equal, it is still important to choose the application-level checkpoint over the system-level
checkpoint, because using an application-level checkpoint enables the batch job to make
additional progress during the interval (t0, t1). We incorporate such considerations in our
simulator.
2) Simultaneous exploration of the full on-demand deadline range: As discussed in Section II, our
goal is to solve the eviction problem for all deadlines in the range (t0, t0 + T), because the
scheduler needs to consider the trade-off between delaying the on-demand jobs, which may lead to
lower quality of the results due to slow reaction time, and losing progress
of the batch jobs. Thus, a naive strategy would be to iterate over all deadlines i in the range
(t0, t0 + T) and solve the problem independently for each i. However, such a strategy is
sub-optimal, because the problems resulting from fixing all deadlines i in the range (t0, t0 + T)
have identical inputs except for the deadline i; therefore, they may be decomposed into
sub-problems that are shared across several instances and thus need to be solved only once. Our
approach leverages this observation to construct an algorithm based on dynamic programming that
is capable of taking advantage of such decompositions to solve all deadlines in a single pass.
This algorithm is explained in Section III-C.
3) Polynomial response time: A key requirement for exploring the full on-demand deadline range is
to ensure fast response time so that the scheduler can decide quickly, preferably at moment t0,
about the jobs that must be checkpointed to the PFS to run the incoming on-demand jobs. Thus, an
algorithm that is not polynomial in any variable, such as the number of jobs N, maximum deadline
T, or the number of nodes to be released K, will lead to unacceptable response time, considering
that modern HPC datacenters routinely run several batch jobs simultaneously and may need to
release a large number of nodes for on-demand jobs to accommodate bursts of opportunistic events.
Therefore, our proposed solution is designed to satisfy such constraints, and delivers response
times on the order of milliseconds or less.

B. Loss and checkpointing model
We estimate the loss incurred by killing a batch job as the number of node-hours that have
elapsed since its last application-level checkpoint until t0, the moment when the nodes need to
be evicted to make room for the on-demand jobs. This is based on a configurable interval that can
be independently adjusted for each job in our simulator. In practice, the interval is fixed based
on empirical observations, e.g., every hour, because the checkpoints are used both to survive
failures and to record intermediate results. However, if checkpoints are only used for fault
tolerance, then an optimal checkpointing interval can be computed [11].
In order to simulate alternatives that mix application-level checkpointing with system-level
checkpointing, we consider the time to checkpoint each batch job:

    $$ tckpt = \max\left( \sum_{i=1}^{jn} \frac{sckpt(i)}{B_a},\ \max_{i=1}^{jn} \frac{sckpt(i)}{B_c} \right) \qquad (1) $$

where jn is the number of nodes occupied by the batch job, Ba is the aggregated I/O bandwidth of
the PFS, Bc is the maximum I/O bandwidth of a compute node, and sckpt(i) is the size of the
checkpoint on node i ∈ [1 ... jn]. The intuition behind this is that the checkpointing time is
bounded either by the maximum aggregated bandwidth of the PFS or the slowest node (if the nodes
do not consume the maximum aggregated bandwidth).
In the case of application-level checkpointing, we must wait for the next checkpoint to happen,
which introduces a delay in addition to tckpt. Therefore, the application-level checkpointing
duration is ta = tckpt + delay, where delay is the difference between the next scheduled
checkpoint and t0. Since system-level checkpointing can be performed instantly at t0, its
duration, ts, will be equal to tckpt, i.e., ts = tckpt. However, the two approaches will have a
different checkpoint size on each node, resulting in the trade-off that is explained in
Section III-A.
In a typical HPC datacenter, the size of each batch job is usually large enough to saturate the
aggregated I/O bandwidth of the PFS. Therefore, we consider a simple checkpointing model where
the jobs are checkpointed serially. In this case, the total time required for checkpointing a set
of batch jobs is the sum of their corresponding ta or ts. In fact, under such circumstances,
checkpointing multiple batch jobs in parallel would perform worse than checkpointing the batch
jobs serially, because of over-subscribing the aggregated I/O bandwidth of the PFS. Nevertheless,
we note that our model can be further refined to simulate concurrent checkpointing of the batch
jobs in the case of small jobs that do not saturate the aggregated I/O bandwidth of the PFS.

C. Exploration algorithm
In this section we propose a dynamic programming algorithm based on the aforementioned design
principles.
The key observation that inspires our algorithm is the fact that the eviction problem is related
to the discrete backpack (0/1 knapsack) problem: given N items, where W_i and V_i represent the
weight and value of the i-th item, fill a backpack that can carry a maximum weight K such that
the combined value of all items is maximized without overflowing K. By analogy, we can consider
the batch jobs as items and the backpack as the set of nodes where the jobs are running. This
problem has a simple dynamic programming decomposition: the maximum value for N items is the
greater of: (1) the maximum value for N − 1 items and capacity K (excludes item N); (2) V_N plus
the maximum value obtained for N − 1 items and capacity K − W_N (includes item N). By solving
this decomposition recursively and applying memoization techniques, a runtime complexity of
O(N · K) can be achieved. Note that this decomposition solves the problem not only for a backpack
of capacity K, but at the same time for all backpacks of capacity i such that 0 < i ≤ K.
Starting from this observation, we adopt a similar strategy but with two important differences.
First, we need to free at least K nodes, which means that the optimal solution may involve more
than K nodes. Therefore, we need to consider up to M nodes, where M is the number of nodes
occupied by all batch jobs at the moment t0. Second, the eviction problem introduces a new
dimension in the decomposition, i.e., the deadline T to start the on-demand jobs. Specifically,
it is not enough to release at least K nodes within a deadline T when considering N − 1 batch
jobs and then try for the N-th batch job all four alternatives, i.e., ignore, kill,
application-level checkpoint, and system-level checkpoint, because the optimal solution for N − 1
batch jobs may get close to the deadline
T, thereby limiting the set of valid choices for job N (e.g., no further checkpointing is
possible within T).
As a consequence, we propose a two-dimensional decomposition based on both the number of nodes
and the deadline. We denote with the tuple ⟨jn_N, loss_N, ts_N, ta_N⟩ the number of nodes, loss
of progress due to job killing, system-level checkpointing duration, and application-level
checkpointing duration for job N. Then, the minimum loss for N jobs, M nodes, and deadline T,
denoted as a[N, M, T], is the minimum of: (1) ignore job N, i.e., a[N − 1, M, T]; (2) kill job N,
i.e., loss_N + a[N − 1, M − jn_N, T]; (3) take an application-level checkpoint of job N, i.e.,
a[N − 1, M − jn_N, T − ta_N]; and (4) take a system-level checkpoint of job N, i.e.,
a[N − 1, M − jn_N, T − ts_N]. Algorithm 1 presents our approach to solve this decomposition with
a runtime of O(N · M · T). The output of this algorithm is a list S_i for 0 < i < T, where S_i is
the set of jobs to be evicted using an optimal strategy such that the compute loss is minimized,
and, in case of multiple solutions with minimal compute loss, the checkpointing overhead is
minimized too.

Algorithm 1: Dynamic programming algorithm to free K nodes within a range of deadlines [0 ... T]
with minimal loss of compute progress.
Input: List J of N batch jobs running at t0, K, T
Output: List of job eviction strategies S_i, 0 < i < T
a[0, 0] ← 0
u[0, 0] ← ∅
M ← 0
for (id, jn, loss, ts, ta) ∈ J do
    M ← M + jn
for (id, jn, loss, ts, ta) ∈ J do
    b ← a
    v ← u
    for (n, t) ∈ a do
        if b[n + jn, t] > a[n, t] + loss then
            b[n + jn, t] ← a[n, t] + loss
            v[n + jn, t] ← u[n, t] ∪ {(id, “kill”)}
        if t + ta ≤ T ∧ b[n + jn, t + ta] > a[n, t] then
            b[n + jn, t + ta] ← a[n, t]
            v[n + jn, t + ta] ← u[n, t] ∪ {(id, “app”)}
        if t + ts ≤ T ∧ b[n + jn, t + ts] > a[n, t] then
            b[n + jn, t + ts] ← a[n, t]
            v[n + jn, t + ts] ← u[n, t] ∪ {(id, “sys”)}
    a ← b
    u ← v
for i ∈ [0 ... T] do
    (x, y) ← argmin(a[x = K ... M, y = 0 ... i])
    Result[i] ← (a[x, y], u[x, y])
return Result

We note that Algorithm 1 uses a temporary minimum loss matrix b and a corresponding solution
matrix v to hold the updates resulting from considering all alternatives for job id. This is
needed in order to avoid repeatedly selecting the same id in subsequent decompositions.
Furthermore, the application-level checkpointing strategy takes precedence over the system-level
checkpointing during the updates, and thus becomes the preferred choice in the case of equal loss
and checkpointing time.
Another important observation is that Algorithm 1 solves the eviction problem not only for at
least K nodes, but for the entire node spectrum [0 ... M]. This enables the scheduler to consider
more advanced trade-offs for on-demand jobs, such as running the on-demand jobs with more or
fewer than the K requested nodes, which can be used to dynamically adjust the latency and/or the
quality of the on-demand results. Such trade-offs can be incorporated at no additional simulation
cost using our proposed approach.

IV. PERFORMANCE EVALUATION
To evaluate our proposal, we study the traces of Argonne’s Theta pre-Exascale machine and extract
three representative scenarios that create a challenging situation with respect to the eviction
problem: most of the nodes are occupied by a relatively large number of batch jobs, leading to
many possible combinations that need to be explored. For each scenario, we augment the traces
with additional data that enables us to apply our model in order to extract the parameters of
each batch job: compute loss and application-level/system-level checkpointing duration. We then
compare our dynamic programming algorithm with two other approaches: a greedy algorithm (linear
complexity) and a backtracking algorithm that performs an exhaustive search (exponential
complexity). In the rest of this section, we introduce the methodology of our proposal and
discuss the results of the comparison.

A. Batch job traces
In this paper, we consider the case of Argonne’s Theta supercomputer, an 11.69-petaflops
pre-Exascale Cray XC40 system based on the second-generation KNL Intel Xeon Phi 7230 SKU. The
system has 4392 nodes, each equipped with a 64-core processor (256 hardware threads), 16 GB of
high-bandwidth MCDRAM (300-450 GB/s), 192 GB of main memory (DDR4 RAM, 20 GB/s), and a 128 GB SSD
(700 MB/s). The interconnect topology is based on Dragonfly with a total bisection bandwidth of
7.2 TB/s. Durable storage is provided by a Lustre parallel file system that is accessible to the
compute nodes through a POSIX mount point. The total aggregated bandwidth is 250 GB/s.
First, we study the DIM_JOB_COMPOSITE trace (https://reports.alcf.anl.gov/data/theta.html) of
batch jobs executed on Theta between 2017 and 2019. Specifically, we extract for each job the
required fields pertaining to the runtime (start time, execution time) and the number of nodes.
Then, we aggregate this information to obtain the number of batch jobs and the number of nodes
utilized by the batch jobs per time unit. We focus in particular on the year 2019, which reflects
the most recent utilization pattern: a total of 91,217 batch jobs were executed during the entire
year.
We zoom in on the node utilization (Figure 1a) and the number of jobs (Figure 1b) per hour during
January 2019. A similar
pattern can be observed for the rest of the year. These figures reveal several interesting
observations:
• During the entire year, all 4392 nodes were occupied for only around 1.5 days. However, the
  most frequent number of occupied nodes is 4352, which is close to the maximum capacity and
  appears for a total of 34 days. Therefore, the likelihood of having to run on-demand jobs when
  the machine runs batch jobs close to full capacity is very high.
• When the machine is operating close to capacity (4352 occupied nodes), the number of jobs is
  relatively high, peaking at around 25 jobs.
• About 61% of the batch jobs reported an execution time of less than 30 minutes. We consider
  these batch jobs expendable, such that killing them incurs negligible loss.
[Figure 1, plots not recoverable: (a) Y axis "Num of nodes", 1000 to 4000; (b) Y axis "Num of jobs", 5 to 25.]
Based on these observations, we construct three representative scenarios, each of which occupies
4352 nodes at moment t0 and a variable number of jobs: 12, 16, 24. We deliberately avoid
expendable jobs in these scenarios (i.e., no expendable job is running at moment t0) in order to
create a challenging situation where all jobs may incur a significant loss of node-hours. The
scenarios are illustrated in Figure 2, Figure 3 and Figure 4. The regions of interest during
which all jobs are running are marked between two vertical timestamps relative to the beginning
of the earliest job. For example, Figure 2 captures a scenario of 12 jobs running for a total of
10 hours and 15 minutes, where all batch jobs overlap for about 2 hours, i.e., from 02:41 to
04:52. The moment t0 is chosen within these regions of interest.
Fig. 3: Scenario-2: 16 batch jobs running on 4352 nodes.
Fig. 4: Scenario-3: 24 batch jobs running on 4352 nodes.

B. Augmentation of the traces with checkpointing parameters
The DIM_JOB_COMPOSITE trace does not capture any information about the checkpointing behavior of
the batch jobs. Lacking such information, we augment the scenarios with synthetically generated
checkpointing information based on empirical observations. Specifically, we assume that all batch
jobs conduct application-level checkpoints at an hourly interval. Since the batch jobs have
different start times, the likelihood that their application-level checkpoints are written
concurrently to the PFS is very small. Furthermore, we assume that each batch job allocates
between 40% and 90% of the memory available on each node. In this case, the size of the
system-level checkpoint on each node coincides with the allocated memory. Out of this memory, we
assume 20% to 60% holds critical data structures that are written by application-level
checkpointing approaches. This is the size of the application-level checkpoints. We use a random
threshold for each batch job, both for the application-level and system-level checkpoints, which
is then used in Equation 1 to calculate the application-level and system-level checkpointing
duration.

C. Compared approaches
Throughout our evaluations, we compare three approaches that can be used to solve the eviction
problem:
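For orientation before the individual approaches are described, the dynamic programming approach of Algorithm 1 (Section III-C), together with the checkpoint time of Equation 1, can be sketched in Python as follows. This is a minimal sketch rather than the CoSim implementation: the names BatchJob, checkpoint_time and explore, the dictionary-based sparse matrices, the integer time steps, and the tie-breaking order of the final selection (least loss, then least checkpointing time, then fewest nodes) are assumptions made for illustration.

```python
from dataclasses import dataclass
from math import inf
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class BatchJob:
    id: str
    jn: int      # number of nodes occupied by the job
    loss: float  # progress lost (e.g., node-hours) if the job is killed
    ts: int      # system-level checkpoint duration, in integer time steps (assumed)
    ta: int      # application-level checkpoint duration (tckpt + delay)


def checkpoint_time(sizes: Sequence[float], b_a: float, b_c: float) -> float:
    """Equation 1: limited by the PFS aggregate bandwidth or by the slowest node."""
    return max(sum(sizes) / b_a, max(size / b_c for size in sizes))


Plan = Tuple[Tuple[str, str], ...]


def explore(jobs: List[BatchJob], k: int, t_max: int) -> Dict[int, Tuple[float, Plan]]:
    """For each deadline i in [0, t_max], return (minimal loss, eviction plan)."""
    a: Dict[Tuple[int, int], float] = {(0, 0): 0.0}  # (nodes freed, time used) -> loss
    u: Dict[Tuple[int, int], Plan] = {(0, 0): ()}    # (nodes freed, time used) -> plan
    for job in jobs:
        b, v = dict(a), dict(u)  # temporaries, so that each job is selected at most once
        for (n, t), cur in a.items():
            options = [(t, cur + job.loss, "kill")]
            if t + job.ta <= t_max:
                options.append((t + job.ta, cur, "app"))  # tried before "sys"
            if t + job.ts <= t_max:
                options.append((t + job.ts, cur, "sys"))
            for t_new, loss, action in options:
                key = (n + job.jn, t_new)
                if loss < b.get(key, inf):  # strict comparison: earlier option wins ties
                    b[key] = loss
                    v[key] = u[(n, t)] + ((job.id, action),)
        a, u = b, v
    result: Dict[int, Tuple[float, Plan]] = {}
    for i in range(t_max + 1):
        feasible = [(loss, t, n) for (n, t), loss in a.items() if n >= k and t <= i]
        if feasible:  # deadlines with no feasible plan are simply omitted
            loss, t, n = min(feasible)
            result[i] = (loss, u[(n, t)])
    return result
```

For example, with jobs A (4 nodes, loss 10, ts 5, ta 3) and B (4 nodes, loss 1, ts 4, ta 6), K = 4 and T = 5, the sketch kills B for deadlines 0 to 2 and switches to an application-level checkpoint of A from deadline 3 onward, which illustrates how a single pass yields one plan per deadline.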
[Figure: three plots; x-axis: Deadline time (m), 0 to 15; legend: Dynamic, Greedy, Back-tracking. Panels: (a) Evict at least K=512 nodes for on-demand jobs; (b) Evict at least K=1024 nodes for on-demand jobs; (c) Evict at least K=2048 nodes for on-demand jobs.]
Fig. 5: Response time for Scenario-1 consisting of 12 batch jobs. Note the log scale on the Y axis. Lower is better.
[Figure: three plots; x-axis: Deadline time (m), 0 to 15; legend: Dynamic, Greedy, Back-tracking. Panels: (a) Evict at least K=512 nodes for on-demand jobs; (b) Evict at least K=1024 nodes for on-demand jobs; (c) Evict at least K=2048 nodes for on-demand jobs.]
Fig. 6: Response time for Scenario-2 consisting of 16 batch jobs. Note the log scale on the Y axis. Lower is better.
1) Greedy: This algorithm implements a greedy strategy that tries to minimize the loss by checkpointing the most expensive jobs (high loss), while killing the least expensive jobs (low loss). To this end, it sorts the batch jobs in descending order of loss and tries to checkpoint them using the fastest available checkpointing method (application or system level). When the total checkpoint duration becomes larger than the deadline T, it iterates over the sorted jobs in reverse order starting from the end, killing them one by one until at least K nodes have been released. While it does not produce an optimal solution, this algorithm has linear complexity and therefore has a very fast response time.
2) Backtracking: This algorithm implements an exhaustive search of all possible choices for each batch job: keep running (exclude), kill, checkpoint at application-level, checkpoint at system-level. It optimizes the search by abandoning early all combinations that cannot achieve a lower loss than the best combination found so far. Unlike Greedy, this approach always produces an optimal solution; however, it has an exponential complexity and therefore may become intractable for large problem sizes.
3) CoSim: This is our proposal that implements Algorithm 1. It guarantees an optimal solution just like Backtracking, but at the same time it has a fast response time thanks to its polynomial complexity.

D. On-demand job configurations
For each of the three scenarios mentioned in Section IV-A, we consider the maximum deadline T = 15 minutes. We are interested in the optimal eviction strategy for all deadlines in the range [0 ... 15] with a granularity of one minute. Furthermore, for each of the three scenarios, we consider three different values for K, the minimum number of nodes that need to be evicted in order to make room for the on-demand jobs, i.e., 512, 1024, and 2048 for Scenario-1, Scenario-2, and Scenario-3, respectively. This roughly corresponds to 12.5%, 25% and 50% of the total capacity of Theta.

E. Results
First, we focus on the performance and scalability of the three approaches. To this end, we measure the response time taken by each approach in order to produce the optimal eviction strategy for all deadlines in the range [0 ... T]. As a consequence, in the case of Greedy and Backtracking, a separate run is executed for each i ∈ [0 ... T]. Therefore, for an increasing i, the response time measures the accumulated runtime of all i runs. In the case of CoSim, a single run is sufficient to obtain the full solution thanks to the memoization of overlapping sub-problems. This metric is important because it determines how soon the scheduler can take decisions, which in turn impacts the value that can be extracted from the on-demand jobs (i.e., faster response time leads to better on-demand job results).
The results for each of the three scenarios are depicted in Figure 5, Figure 6 and Figure 7 respectively. Note that due to the large differences in algorithmic complexity between the three approaches, the y-axis is represented in a log scale. As expected, CoSim keeps a constant response time regardless of the deadline T. Despite the accumulation of response time from an increasing number of runs, the Greedy approach is still at least 30x faster than the other two approaches thanks to its linear complexity. It is interesting to observe that for a small K and a small number of batch jobs (as illustrated by Scenario-1), Backtracking is faster than our approach. However, with increasing K and number of batch jobs (as illustrated by Scenario-2 and Scenario-3), the limitation of the exponential
search becomes clearly visible, despite the aggressive early pruning optimization. In this case, our approach is up to five orders of magnitude faster. As a general conclusion, we observe that our approach has the advantage of providing the optimal solution within a predictable constant time, which is well suited for real-time scheduling decisions.
Next, we focus on the quality of the results of the Greedy approach. Since both our approach and the Backtracking approach produce the optimal solution, we use it as a baseline that we subtract from the minimum loss found by the Greedy approach. We call this the relative compute loss. This metric is important, because it indicates what result quality degradation can be expected in order to benefit from faster response time.
As can be observed in Figure 8, the relative compute loss is very high, indicating that degradation in the quality of the result found by Greedy is unacceptable. In fact, with the exception of T > 13 for Scenario-3, the relative compute loss is increasing for an increasing T, which means Greedy suffers from an increasing degradation in the quality of the result. Also, it is important to note that in absolute terms, the minimum compute loss is decreasing with an increasing T for all three approaches, because more checkpointing opportunities become available. In fact, the optimal minimum loss found by CoSim and Backtracking is often 0 (meaning no job needs to be killed), especially for larger T. Therefore, even when the relative compute loss seems to decrease for an increasing T, it is still missing the optimal compute loss by a large margin.
Based on this observation, we conclude that sacrificing the result quality for faster response time is not beneficial, especially when considering that our approach runs in the order of milliseconds in the worst case.

[Figure: three plots; x-axis: Deadline time (m), 0 to 15; legend: Scenario-1, Scenario-2, Scenario-3. Panels: (a) Evict at least K=512 nodes for on-demand jobs; (b) Evict at least K=1024 nodes for on-demand jobs; (c) Evict at least K=2048 nodes for on-demand jobs.]
Fig. 8: Relative compute loss of the Greedy approach relative to the optimal solution produced by CoSim and Backtracking. Lower is better.

V. RELATED WORK
Scheduling of batch and on-demand jobs for concurrent execution where resource sharing is limited to each type of job has been widely studied [12]–[19] in the past. However, not much work has been done for collocating both batch and on-demand jobs on the same set of resources [20]. SPRUCE (Special Priority and Urgent Computing Environment) [21] supports on-demand jobs by considering a basic preemptive scheduling scheme with no checkpointing. However, this leads to a significant loss of progress for the batch jobs.
Checkpointing-based preemptive scheduling has been traditionally used at the operating system level for multi-tasking. However, recent checkpointing-based preemptive scheduling schemes [13], [22] focus on reducing their overheads and improving their effectiveness in reducing the average job turnaround time. Nevertheless, these techniques do not directly address the challenges of co-scheduling batch and on-demand jobs in HPC settings.
Large-scale datacenters operated by industry (e.g., Facebook [23] and Google [24]) leverage centralized job execution environments where the centralized system accumulates jobs from multiple datacenters, and then runs the computation [25]. However, it leads to increased network traffic and job completion time when the data volume grows exponentially [26], [27]. Furthermore, regulations may restrict moving data across continents due to security and privacy constraints, thus making such approaches impractical to adopt in production environments at large.

VI. CONCLUSIONS
In this paper, we present CoSim, a simulator that enables on-the-fly analysis of the trade-offs arising between delaying the
start of opportunistic on-demand jobs, which leads to longer analytics latency, and loss of progress due to preemption of batch jobs, which is necessary to make room for such on-demand jobs. The key idea of our proposal is to implement preemption through a combination of either killing or checkpointing (at application-level or system-level) a subset of batch jobs running on the compute nodes to free enough nodes by a given deadline. To this end, we introduce a checkpointing and loss model to develop a dynamic programming algorithm to minimize the loss for a variable deadline up to a given threshold, which gives the scheduler high flexibility in exploring a wide range of alternatives. CoSim finds the optimal solution up to 5 orders of magnitude faster than backtracking approaches and offers a predictable response time in the order of milliseconds, thereby eliminating the need for greedy approaches that are fast but find only approximate solutions.
In the future, we plan to investigate several avenues: (1) applicability of our proposal to cloud computing; (2) refinement of checkpointing model (interval, interactions with PFS); (3) integration with the workload schedulers at Argonne National Laboratory’s supercomputers to validate CoSim for real-life on-demand workloads.

ACKNOWLEDGMENTS
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research and Argonne National Laboratory. Results presented in this paper are obtained using the Chameleon and CloudLab testbeds supported by the National Science Foundation.

REFERENCES
[1] W. Tang, B. Wang, S. Ethier, and Z. Lin, “Performance portability of HPC discovery science software: Fusion energy turbulence simulations at extreme scale,” Supercomputing Frontiers and Innovations, vol. 4, no. 1, 2017.
[2] A. S. Kozelkov, V. V. Kurulin, S. V. Lashkin, R. M. Shagaliev, and A. V. Yalozo, “Investigation of supercomputer capabilities for the scalable numerical simulation of computational fluid dynamics problems in industrial applications,” Computational Mathematics and Mathematical Physics, vol. 56, no. 8, pp. 1506–1516, 2016.
[3] P. Vranas, G. Bhanot, M. Blumrich, D. Chen, A. Gara, P. Heidelberger, V. Salapura, and J. C. Sexton, “The BlueGene/L supercomputer and quantum chromodynamics,” in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Tampa, Florida, 2006, pp. 50–57.
[4] S. R. Ellingson, J. C. Smith, and J. Baudry, “Polypharmacology and supercomputer-based docking: opportunities and challenges,” Molecular Simulation, vol. 40, no. 10-11, pp. 848–854, 2014.
[5] D. AOCNP, “Watson will see you now: a supercomputer to help clinicians make informed treatment decisions,” Clinical Journal of Oncology Nursing, vol. 19, no. 1, p. 31, 2015.
[6] A. B. Yoo, M. A. Jette, and M. Grondona, “Slurm: Simple Linux utility for resource management,” in Job Scheduling Strategies for Parallel Processing. Berlin, Heidelberg: Springer, 2003, pp. 44–60.
[7] N. Desai, “Cobalt: an open source platform for HPC system software research,” in Edinburgh BG/L System Software Workshop, 2005, pp. 803–820.
[8] G. Staples, “Torque resource manager,” in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), New York, NY, USA, 2006, p. 8–es.
[9] J. Ansel, K. Arya, and G. Cooperman, “DMTCP: Transparent checkpointing for cluster computations and the desktop,” in IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, Italy, 2009, pp. 1–12.
[10] B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello, “Veloc: Towards high performance adaptive asynchronous checkpointing at large scale,” in IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 2019, pp. 911–920.
[11] J. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Generation Computer Systems, vol. 22, no. 3, pp. 303–312, 2006.
[12] R. Tyagi and S. K. Gupta, “A survey on scheduling algorithms for parallel and distributed systems,” in Silicon Photonics & High Performance Computing. Singapore: Springer, 2018, pp. 51–64.
[13] V. J. Leung, G. Sabin, and P. Sadayappan, “Parallel job scheduling policies to improve fairness: A case study,” in International Conference on Parallel Processing Workshops (ICPP), San Diego, USA, 2010, pp. 346–353.
[14] A. A. Chandio, K. Bilal, N. Tziritas, Z. Yu, Q. Jiang, S. U. Khan, and C.-Z. Xu, “A comparative study on resource allocation and energy efficient job scheduling strategies in large-scale parallel computing systems,” Cluster Computing, vol. 17, no. 4, pp. 1349–1367, 2014.
[15] A. W. Mu’alem and D. G. Feitelson, “Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 12, no. 6, pp. 529–543, 2001.
[16] C. Gómez-Martín, M. A. Vega-Rodríguez, and J.-L. González-Sánchez, “Fattened backfilling: An improved strategy for job scheduling in parallel systems,” Journal of Parallel and Distributed Computing (JPDC), vol. 97, pp. 69–77, 2016.
[17] B. Lawson and E. Smirni, “Multiple-queue backfilling scheduling with priorities and reservations for parallel systems,” ACM SIGMETRICS Performance Evaluation Review, vol. 29, pp. 72–87, 2002.
[18] A. Tousimojarad and W. Vanderbauwhede, “An efficient thread mapping strategy for multiprogramming on manycore processors,” Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing, vol. 25, pp. 63–71, 2014.
[19] S. G. Ahmad, C. S. Liew, M. M. Rafique, E. U. Munir, and S. U. Khan, “Data-intensive workflow optimization based on application task graph partitioning in heterogeneous computing systems,” in IEEE International Conference on Big Data and Cloud Computing (BdCloud), 2014, pp. 129–136.
[20] D. Wang, E.-S. Jung, R. Kettimuthu, I. Foster, D. J. Foran, and M. Parashar, “Supporting Real-Time Jobs on the IBM Blue Gene/Q: Simulation-Based Study,” in Job Scheduling Strategies for Parallel Processing, D. Klusáček, W. Cirne, and N. Desai, Eds. Orlando, USA: Springer International Publishing, 2018, pp. 83–102.
[21] N. Trebon, “Enabling urgent computing within the existing distributed computing infrastructure,” Ph.D. dissertation, University of Chicago, USA, 2011.
[22] Q. Snell, M. Clement, and D. Jackson, “Preemption based backfill,” in Job Scheduling Strategies for Parallel Processing. Berlin, Heidelberg: Springer, 2002, pp. 24–37.
[23] J. Meza, T. Xu, K. Veeraraghavan, and O. Mutlu, “A large scale study of data center network reliability,” in ACM Internet Measurement Conference (IMC), New York, USA, 2018, pp. 393–407.
[24] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat, “B4: Experience with a Globally Deployed Software Defined WAN,” in ACM SIGCOMM, Hong Kong, China, 2013.
[25] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford, “Spanner: Google’s globally distributed database,” ACM Transactions on Computer Systems (TOCS), vol. 31, no. 3, 2013.
[26] A. Vulimiri, C. Curino, P. B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese, “Global analytics in the face of bandwidth and regulatory constraints,” in USENIX Networked Systems Design and Implementation (NSDI), USA, 2015, pp. 323–336.
[27] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang et al., “f4: Facebook’s Warm BLOB Storage System,” in USENIX Operating Systems Design and Implementation (OSDI), 2014, pp. 383–398.
Abstract—Unmanned Aerial Vehicles (UAVs) have gained a lot of interest over the last years due to the many fields of potential application. Nowadays, researchers are becoming interested in groups of UAVs working together. The collaborations between UAVs open a wide field of opportunities, because they are typically able to do more sophisticated tasks than a single UAV. However, collaboration between multiple UAVs is still a complex task, and significant challenges need to be addressed before their mainstream adoption. For instance, the automatic reconfiguration of a swarm can be used to adapt the swarm to changing application demands to solve a task in a more efficient and effective manner. However, the chances of collision become high if reconfiguration is not carefully planned. In this work we propose an approach to allow changing the shape of a UAV formation during flight through a computationally inexpensive method that is able to decrease collision chances significantly. During the experiments we tested different reconfiguration events that are prone to collisions. Results have shown that our approach maintains a safe distance (greater than 5 meters) between the UAVs, while keeping the time overhead limited to a few tenths of a second. Furthermore, scalability tests have proven that our approach can handle the reconfiguration of at least 25 UAVs simultaneously.
Index Terms—UAV; swarm reconfiguration; swarm formations

I. INTRODUCTION
Over the last decade the field of Unmanned Aerial Vehicles (UAVs) has gained universal interest and novel applications keep emerging every year. Due to the ever decreasing price of technology, UAVs (also known as drones) are becoming mainstream for the general public and industry as well. This results in many civilian applications in aerial photography and video, topography, entertainment, etc. [1]. More professional applications such as precision agriculture, border surveillance, package delivery, and thermal inspections are also common in the industry [2], [3]. Nowadays, UAVs are starting to be used to assist in emergency situations such as search and rescue, or disaster scenarios [4], [5], where they can act as supporting nodes for communications being deployed on demand, and offering a wider communications range and better line-of-sight (LOS) features than ground infrastructures.
Over the last few years, research has shifted more towards groups of coordinated UAVs [6]. Multi-UAV applications have great benefits as they are generally able to perform more sophisticated tasks efficiently or with more redundancy. However, organizing a multi-UAV flight is not an easy task, with challenges in terms of (i) swarm formation definition, (ii) takeoff procedure, (iii) in-flight coordination, (iv) swarm layout reconfiguration, (v) handling the loss of swarm elements, (vi) communications and data relaying optimization, and (vii) controlled landing, among others. In this work we focus on the particular problem of swarm reconfiguration during a mission. Notice that the ability to automatically change the shape of a formation during a mission can become very useful in different kinds of applications to account for variable application requirements, the loss of swarm elements, temporary flight restrictions, etc. For instance, consider a search and rescue mission where at first a swarm has to cover a large area but, upon discovering the item of interest, the swarm needs to reconfigure to better monitor that area and provide different services.
The main issue that we face during a reconfiguration is the chance of collisions, especially when the number of UAVs becomes larger. In this work we focus on a computationally inexpensive technique to reduce the chances of collision that can be deployed easily under various conditions. Our solution combines two algorithms: the first determines the optimal assignment of UAVs in the new formation accounting for their current position, while the second one splits the UAVs into different mobility groups that are shifted to different altitudes during the reconfiguration process to minimize collision risks. Experimental results show that our solution is able to minimize collision risks compared to other alternatives, while introducing only a moderate reconfiguration delay.
The rest of this paper is organized as follows: in Section II we provide an overview of related works on this topic. In Section III we detail our implementation. This implementation is then tested through different experiments, which are presented and discussed in Section IV. This work finishes with a critical discussion and the obtained conclusions in Section V.

II. RELATED WORK
Research on swarms of UAVs has experienced a growing interest in recent years. The particular topic of flight configurations has been investigated by different authors.
The work by V.T. Hoang et al. [7] presents an algorithm to reconfigure a formation of multiple UAVs. This work is especially focused on the application of vision-based inspection of infrastructure. It presents a new algorithm for reconfiguration based on the angle-encoded Particle Swarm Optimization (PSO). They begin with a 3D representation of the surface to be inspected and a set of intermediate waypoints. Additionally, new constraints are proposed to decrease the chance of collision and increase task performance, based on the assumption that an optimal path is produced by using the θ-PSO path planning algorithm. Their work differs from ours as they use just a limited number of reconfigurations. They only focus on alignment, rotation and shrinkage, while our proposal is able to change the entire topology of the formation.
Other works use an approach which is called flocking. Flocking is a behaviour that is common in nature, for instance in a group of fish, birds or insects. It consists of a few basic rules that are applied to each entity of the group. When those rules are respected, the group will stay united without collisions between the group elements. There are various methods to achieve a flocking behaviour for a group of UAVs, as discussed below.
In the work by Ming Chen et al. [8] a flocking model for a UAV network based on swarm intelligence is presented. In their work they propose a set of rules to make sure that the slaves will follow the master while maintaining a certain safe distance from the master. They cannot get too close because this behaviour increases the chances of collisions; also, they cannot get too far away, because otherwise communication will be lost. Simulation results show that their model can guarantee connectivity between nodes, and it will also improve bandwidth usage.
Victor Casas et al. [9] developed a flocking model without the use of a master-slave model. The UAVs in the swarm regularly broadcast and receive movement information. That information is then used to calculate two forces: a flock goal force, which guides the flock towards the target location and aligns the swarm members, and a flock members force, which provides cohesion and separation to the flock. Those two forces are used to update a direction vector which points towards the target location, while at the same time avoiding collisions. Their model is tested in simulation and in real experiments which show that a collision-free flight is ensured. They tested the model under various speeds, although all of them were rather slow (a maximum of 3 m/s). Results also showed that, during real experiments, the minimum distance between UAVs is decreased; according to the authors, this is due to GPS inaccuracy.
Yazhe Tang et al. [10] presented a swarm flocking scheme that was able to work in a radio-silent environment. In contrast to many other works, their approach was not based on sending (GPS) information between the swarm elements. They used two types of vision sensors (standard and thermal cameras) to track their leader, and a LiDAR sensor to sense the surrounding environment for navigation and obstacle avoidance. Because they used various high-end sensors, their flocking mechanism can be used both during the day and during the night. Indoor and outdoor experiments performed in obstacle-rich environments have proven the effectiveness of the proposed method. Furthermore, their software is implemented in the Robot Operating System (ROS) [11], which promotes reusability through its modular design.
While flocking mechanisms are great to keep a swarm of UAVs organized, they do not provide the flexibility to completely define and change the formation itself. In many applications, it is useful to change the formation (for instance, from a line to a circle); however, it is difficult to encapsulate such behaviour using flocking mechanisms. Therefore, in our work, we specifically focus on changing between different flight formations. Hence, instead of using a flocking mechanism, we propose a master-slave model where the master instructs the slaves how to safely accomplish the reconfiguration.

III. PROPOSED MECHANISM
The aim of this work is to reconfigure a swarm of UAVs, seamlessly switching from one flight formation to another. In our approach we make use of a master-slave pattern. The master is elected before taking off, as described in our previous work [12]. The master is in charge of the main calculations, and keeps the swarm synchronized throughout the different stages of the reconfiguration. All the stages are described in Figure 1. The protocol starts with the UAVs taking off and following a mission. The reconfiguration will start upon a trigger event, which can be a user input or an event predefined in the ground control station. The reconfiguration itself is divided into two stages: an analysis step where the calculations are done, and a mobility step where the UAVs move to their target locations in an intelligent manner to avoid collisions. After the swarm has reconfigured itself, the mission can continue. The protocol finishes at the end of the mission by landing all the UAVs.

A. Phase 1: Analysis
In a previous work we developed an algorithm to determine who the master should be in the scope of a UAV swarm [12]. To understand our current proposal it is only relevant to know that a single master is assigned, and that it will always be located in a central position in the flight formation to minimize losses on the wireless channel. In this first step, the master decides the slaves' positions (later referred to as intelligent positioning). The idea is that the overall flight distance is minimised by choosing the UAV that is already closest to a new flight position to fly to it. This algorithm is also explained in more detail in [12]. Basically, it consists of the following four steps:
1) Find a central location with respect to the current location of the UAVs.
2) Calculate the Euclidean distances from that central location to the positions in the new flight formation.
3) Sort this list in descending order.
4) Assign each location in the flight formation to the closest UAV.
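As an illustration, the four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation from [12]: positions are taken to be 2D (x, y) tuples in meters, the central location is assumed to be the centroid of the current UAV positions, "closest UAV" is interpreted as the closest UAV that has not been assigned yet, and the function name assign_positions is hypothetical.

```python
import math


def assign_positions(uav_positions, target_positions):
    """Map each target index to a UAV index (illustrative sketch only).

    Assumptions not taken from the paper: 2D coordinates, centroid as the
    central location, and ties broken by the lowest UAV index.
    """
    if len(uav_positions) != len(target_positions):
        raise ValueError("need exactly one target position per UAV")
    n = len(uav_positions)
    if n == 0:
        return {}

    # Step 1: central location of the current swarm (assumed: centroid).
    cx = sum(x for x, _ in uav_positions) / n
    cy = sum(y for _, y in uav_positions) / n

    # Steps 2 and 3: distance from the centre to every new position,
    # processed in descending order (outermost positions first).
    order = sorted(
        range(n),
        key=lambda t: math.dist((cx, cy), target_positions[t]),
        reverse=True,
    )

    # Step 4: give each position to the closest UAV that is still free.
    free = set(range(n))
    assignment = {}
    for t in order:
        u = min(
            free,
            key=lambda i: (math.dist(uav_positions[i], target_positions[t]), i),
        )
        assignment[t] = u
        free.remove(u)
    return assignment
```

Calling assign_positions with the current UAV coordinates and the coordinates of the new layout returns a dictionary from target index to UAV index. For N UAVs the sketch performs N nearest-UAV searches over at most N candidates, so its cost grows quadratically with the swarm size.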
Fig. 1: Flowchart of the flight formation algorithm

While the algorithm was originally designed to ensure a safe and fast takeoff procedure, we were able to reuse it for our current swarm reconfiguration purposes.
In order for the master to execute this algorithm it needs to know where all the UAVs currently are, and what new locations are defined in the new swarm layout. The current locations are known by the master since it defines and maintains the swarm topology at all times. Regarding the new [...]

B. Phase 2: Mobility
The mobility step is split up into three states: first the UAVs will change altitude, depending on their sector as explained in the previous section (movement in the Z direction), then they will go towards their target location (X,Y movement), and finally they will return to their initial altitude (return to the default Z value). In each state the master will send messages to the slaves. When a slave receives the message it will perform the movement and reply with an acknowledgement once the movement is finished. The master receives the acknowledgements and, when all the slaves have sent an acknowledgement message (and the master has reached its position), the master will transition to the next state. At that moment, the master will start sending messages from its new state; slaves will receive those messages, and transition too. The messages sent by the master only contain an id which represents the current
state. They do not have to contain the location information because this was already sent in phase 1.
As a final remark, it is worth pointing out that our proposal is computationally efficient. Algorithm 1 is the only element with significant computational requirements, and it is limited to O(N²). Since in most practical applications the number of UAVs in a swarm will be low (below 100), this algorithm can be easily executed on the UAV’s onboard computer, such as a Raspberry Pi. Also, the network will not be overloaded since the message payloads are quite small.

IV. EXPERIMENTAL SETTINGS AND RESULTS
We performed a wide set of experiments in our own UAV emulator/simulator in order to assess the validity and robustness of our proposed mechanism. Before providing a detailed explanation about our experiments and the results obtained, we will briefly discuss our simulator environment called ArduSim.

A. ArduSim
ArduSim is a multi-UAV flight simulator/emulator; it is available online [13] under the Apache License 2.0. The simulator has many features, which are fully explained in our previous work [14]. Here, we will just highlight some of the key characteristics.
First of all, ArduSim makes it easy, fast and reliable to deploy a protocol that was developed in the simulator to real UAVs. It does this mainly by implementing the same open-source protocols and standards that are used by the majority of the UAVs. Besides that, ArduSim really is a multi-UAV flight simulator; it is able to scale up to 100 UAVs in real time, and up to 256 UAVs in soft real time on a high-end PC (Intel Core i7-7700, 32 GB RAM). Wireless communication models, based on real experiments, are implemented to support UAV-to-UAV communications; notice that this is a basic requirement for nearly all swarm applications. Furthermore, a lot of basic UAV functionality (such as taking off, moving to a GPS location, etc.) is provided by the Application Programming Interface (API). The user is provided with a functional GUI and extensive logging features.
Overall, ArduSim is a versatile tool that provides researchers the opportunity to quickly develop new applications and protocols, without losing accuracy and/or customization.

B. Safety analysis
Our approach combines an intelligent UAV assignment (see Section III-A) with a sectorization procedure that groups UAVs [...]

C. Intelligent positioning, no altitude change.
D. Intelligent positioning, different altitudes.
In our first set of experiments, 9 UAVs changed from a linear formation towards a compact mesh formation (see Figure 2 for an example). The minimum distance between UAVs in that formation was set to 10 meters, the number of sectors was equal to three and the altitude difference between sectors was 5 meters. These variables can be set by the user. In real experiments, and in some network models in our simulator, UAVs cannot communicate when the distance between them is greater than 1200 meters. Therefore, although possible in simulation, the distance between the UAVs must not be set too large. For that reason, we have chosen the above-mentioned values because they are realistic and provide enough clearance to prevent UAVs from colliding due to GPS errors, wind gusts, etc. During the experiments we measured the time that the UAVs spent in each state (Move Z, Move XY, Move Z Initial), the minimum distance between the UAVs during the Move XY state, and the potential number of collisions. In our experiments, a collision is counted when the Euclidean distance between two UAVs is smaller than 5 meters, a threshold chosen to account for the GPS offset error.

Fig. 2: Transition of 9 UAVs from a linear formation to a compact mesh

The results are shown in Table I and Table II. Our experiments have shown (as stated before) that merely changing the formation layout without adopting any type of strategy is very dangerous, and prone to cause collisions. We can also observe that just changing the altitude or the position assignment of the UAVs in an intelligent manner is not enough to avoid collisions in all cases. Only when both were used could collisions be entirely avoided. Furthermore, while changing the altitude does make the process safer, an additional time overhead is introduced. The time overhead depends on the number of sectors and the altitude difference between the
moving with similar directions so that their mobility takes sectors, the impact of both parameters are discussed in more
place at different heights (see Section III-B). To assess the detail in the following experiments. Implementing a intelligent
effectiveness of this combined approach, we will compare it to positioning system reduces the overall flight distance and,
other (simpler variants) where such mechanisms are not used, therefore, flight times are slightly shorter in experiments C
so that we can evaluate which part has the most influence and D.
and if our approach (as a whole) is effective. Therefore, we
propose three other (but similar) approaches: C. Scalability
A. Random position assignment, no altitude change. In our second experiment we want to evaluate the scalability
B. Random position assignment, different altitudes. of our protocol. We searched for the minimal number of
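The collision criterion used in the safety analysis (a Euclidean distance below 5 meters between two UAVs) and the minimum-distance measurement can be illustrated with a short sketch. This is not ArduSim code: the function names, the `(x, y, z)` tuple representation in meters, and the sample positions are assumptions made for illustration only.

```python
import math
from itertools import combinations

# Threshold stated in the text; it accounts for the GPS offset error.
COLLISION_DISTANCE_M = 5.0


def min_separation(positions):
    """Smallest pairwise Euclidean distance (meters) among UAV positions."""
    return min(math.dist(a, b) for a, b in combinations(positions, 2))


def count_collisions(positions, threshold=COLLISION_DISTANCE_M):
    """Number of UAV pairs closer than the threshold (O(N^2) pair checks)."""
    return sum(
        1 for a, b in combinations(positions, 2) if math.dist(a, b) < threshold
    )


# Hypothetical snapshot of three UAVs, coordinates (x, y, z) in meters.
snapshot = [(0.0, 0.0, 10.0), (3.0, 4.0, 10.0), (3.0, 4.0, 12.0)]
print(count_collisions(snapshot), min_separation(snapshot))
```

In this snapshot only the second and third UAVs (2 meters apart) violate the threshold. Two UAVs that are exactly 5 meters apart, for example in adjacent altitude sectors, are not counted, because the criterion is a strict inequality.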
E. Time overhead
Fig. 5: Minimum number of sectors required for a collision-free reconfiguration w.r.t. the type of transition.
Fig. 7: Estimated time overhead vs. real time overhead
[5] M. Aljehani and M. Inoue, "Multi-UAV tracking and scanning systems in M2M communication for disaster response," in 2016 IEEE 5th Global Conference on Consumer Electronics, pp. 1–2, Oct. 2016.
[6] A. Tahir, J. Böling, M.-H. Haghbayan, H. T. Toivonen, and J. Plosila, "Swarms of unmanned aerial vehicles — a survey," Journal of Industrial Information Integration, vol. 16, p. 100106, 2019.
[7] V. T. Hoang, M. D. Phung, T. H. Dinh, Q. Zhu, and Q. P. Ha, "Reconfigurable multi-UAV formation using angle-encoded PSO," in 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), pp. 1670–1675, 2019.
[8] M. Chen, F. Dai, H. Wang, and L. Lei, "DFM: A distributed flocking model for UAV swarm networks," IEEE Access, vol. 6, pp. 69141–69150, 2018.
[9] V. Casas and A. Mitschele-Thiel, "Implementable self-organized flocking algorithm for UAVs based on the emergence of virtual roads," in Proceedings of the 6th ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, DroNet '20, (New York, NY, USA), Association for Computing Machinery, 2020.
[10] Y. Tang, Y. Hu, J. Cui, F. Liao, M. Lao, F. Lin, and R. Teo, "Vision-aided multi-UAV autonomous flocking in GPS-denied environment," IEEE Transactions on Industrial Electronics, vol. PP, pp. 1–1, Apr. 2018.
[11] Stanford Artificial Intelligence Laboratory et al., "Robotic operating system."
[12] F. Fabra, J. Wubben, C. Calafate, J. Cano, and P. Manzoni, "Efficient and coordinated vertical takeoff of UAV swarms," in IEEE 91st Vehicular Technology Conference (VTC2020-Spring), May 2020.
[13] "ArduSim. Accurate and real-time multi-UAV simulation." https://bitbucket.org/frafabco/ardusim/src/master/, 2017. Accessed: 2020-05-11.
[14] F. Fabra, C. T. Calafate, J.-C. Cano, and P. Manzoni, "ArduSim: Accurate and real-time multicopter simulation," Simulation Modelling Practice and Theory, vol. 87, pp. 170–190, Sep. 2018.
Abstract—The deployment of swarms of cooperative UAVs for the execution of distributed tasks has attracted increasing attention from both academia and industry. The use of a group of UAVs instead of a single UAV offers many advantages, such as extending the mission coverage, providing reliable ad hoc network services, and enhancing service performance, to name a few. However, due to the highly dynamic nature of the swarm topology, the coordination of a large number of UAVs poses new challenges to traditional inter-UAV communication protocols. Therefore, there is a need for new networking protocols that can efficiently support the fast-paced and real-time requirements of coordinated swarm navigation in various environments. In this paper, we propose SEMRP, a Swarm Energy-efficient Multicast Routing Protocol for UAVs flying in group formations. The main purpose of SEMRP is to facilitate control and information delivery between UAVs while minimizing inter-UAV packet loss, packet retransmission, and end-to-end delay. In this study we show how SEMRP achieves these objectives by taking into account various Quality-of-Service parameters, such as network throughput, UAV mobility, and energy efficiency, to ensure timely and accurate information delivery to all members of a UAV swarm. The results of the simulations conducted using NS-2 demonstrate the efficiency of our proposal through its two presented versions (SEMRP-v1 and SEMRP-v2) in terms of reducing the total emission energy (by at least 10 dBm), reducing the end-to-end delay by 44%, and increasing the packet delivery ratio by more than 22% compared to the SP-GMRF protocol.
Index Terms—UAVs, Swarm of UAVs, Multicast Routing Protocol, SEMRP.
978-1-7281-7343-6/20/$31.00 ©2020 IEEE
I. INTRODUCTION
Unmanned Aerial Vehicles (UAVs) have recently attracted significant interest in civilian and military applications, such as search and rescue operations, wildfire management, agricultural applications, patrolling, delivery of goods, monitoring and surveillance [1, 2]. Swarms of UAVs may further increase the effectiveness of these tasks, for instance by enabling larger mission coverage and improving operational performance through multi-UAV cooperation.
As the technology of cooperative UAVs matures and their cost decreases, they become an interesting way to undertake several difficult applications, especially when the drones form a swarm. Swarming many UAVs to perform complex tasks is attractive because it overcomes the limitations of single-UAV systems, such as limited payload and flight time. It also brings additional advantages, including time savings, reduced manpower, and lower operational expenses. In a single-UAV system, if the UAV or a sensor/hardware component fails, the UAV must return to the base. In swarm-based systems, however, the other UAVs can share tasks among themselves, which increases the fault tolerance of the system. For example, in search missions, a swarm of UAVs can parallelize the individual tasks, thus decreasing the completion time of the mission, extending the coverage range, and also providing real-time images and videos, which may improve the quality of the operation.
Despite the attractive advantages of UAV swarms, their deployment still poses several challenging issues that may affect their reliability and stability. To support their various applications, maintain stable operation, and fully exploit their features, it is necessary to design efficient routing protocols adapted to the targeted missions. To this end, many swarm routing protocols have been proposed in the context of Flying Ad hoc NETworks (FANETs). These works can be classified into three main classes: (i) bioinspired-based [3]–[6], (ii) geographical location-based [7], and (iii) multicast-based [8, 9]. In this work, we focus on the last category (i.e., multicast-based routing protocols), which offers advantages such as reduced bandwidth utilization when distributing data from a source to its group members. Existing solutions in this category do not simultaneously address power consumption, reliability, and network scalability. Therefore, we design a new efficient multicast routing protocol for swarm-based systems that distributes data from one source node to a specific group of mobile drones, while minimizing the number of connections in the network, ensuring both reliability and scalability, and optimizing the global energy consumption, which directly affects the system's lifetime.
The proposed solution can be applied in COVID-19 applications such as surveillance for purposes like social distancing violation detection, in addition to various other applications
like spraying areas with disinfectants. The use of swarms of UAVs is highly beneficial for these types of applications, due to low operational costs and the high spatial resolution of imagery, especially in large and hard-to-reach zones.
The rest of this paper is organized as follows. We present the related works in Section II. In Section III, we provide the details of our proposed multicast routing algorithm. Then, the performance of our proposed algorithm is evaluated and discussed in Section IV. Finally, Section V concludes the paper.
II. RELATED WORKS
Efficient routing protocols are required for successful communication among the cooperating UAVs in a swarm. Many routing protocols are used in this class of networks, and they can be classified into three main categories: (i) bioinspired-based, (ii) geographical location-based, and (iii) multicast-based approaches.
A. Bioinspired-based solutions
Various algorithms based on swarm intelligence can be successfully adapted for cooperative UAVs. They fall into the category of bioinspired algorithms [10], and they are usually classified into two categories: Ant Colony Optimization-based approaches and Bee Colony Optimization-based approaches.
1) Ant Colony Optimization-based approaches: The ACO algorithm is based on the social behavior of ants when finding the shortest path to a food source [3, 4]. For instance, in [3], the authors proposed a bio-inspired algorithm named "APAR" to solve the communication problems in multi-UAV systems. APAR integrates the ACO algorithm with the well-known DSR protocol. APAR aims to avoid congestion and link breakage by establishing criteria for choosing routes based on the distance, the stability and the congestion level of a route. However, its main drawbacks are that it introduces overhead and delays and prevents high-mobility nodes from participating in route discovery.
AntHocNet is an ACO-based routing protocol proposed in [4] to solve the problem of high mobility in FANETs. AntHocNet is more suitable for large networks with high mobility. However, it is less effective because of the high cost of transferring routing service information.
2) Bee Colony Optimization-based approaches: BCO is a sub-class of bio-inspired approaches that take their models from the behavior of bees in their natural habitat [11]. The operating principle of the beehive is based on a clear distribution of responsibilities among the bees. All bees of the hive can be divided into three groups [12, 13]: employee bees, onlookers, and scouts. BeeAdHoc [5], which is one of the BCO algorithms, has two different stages: (i) a scouting stage, during which forward and backward scouts carrying the source ID, the number of hops, and the minimal residual energy are flooded across the network to establish multiple paths between the communicating nodes; and (ii) a resource foraging stage, during which the data packets are delivered from the source to the destination using the forager bees. Despite its increased performance over traditional FANET routing algorithms in most cases, BeeAdHoc is characterized by complex behavior modeling.
Pan et al. [6] proposed CA-BCO as an improved version of the BCO algorithm to solve the UAV route planning problem. CA-BCO uses a probabilistic representation of the population to replace the design variables of the solution search space of the BCO algorithm. CA-BCO provides better performance within the category of memory-saving algorithms [7]. However, it has the same drawbacks as BeeAdHoc because it uses the same BCO algorithm [7].
B. Geographic-based
In this class of methods, routing is based on the geographical location of the members of the swarm (geocast routing protocols). In [7], the authors propose GeoUAVs, designed for managing wildfires, especially in hard-to-reach zones, which aims at delivering data to a specific group of mobile UAVs identified by their geographical location in order to manage an active fire. It takes into account the mobility of nodes with 3D movement and manages to reduce delays and maximize throughput. However, it does not take power consumption into consideration.
C. Multicast-based
In this class of routing methods, a source UAV may need to send data to a specific group of UAVs, hence the use of group communication or multicast. The main advantage of multicast routing is the reduction in transmission overhead, control message overhead, power consumption, and network partitioning [14]. In [8], the authors proposed SP-GMRF, which offers a mechanism to predict the nodes' current positions, allowing it to rule out the nodes that are within the communication range as possible next hops, and then adds to the multicast tree the one-hop neighbors that provide the shortest distance to each of the destination nodes. However, SP-GMRF suffers from the absence of power management. In addition, data delivery from the source starts only after the tree construction is complete, which can cause link breakages, especially when the tree construction time increases. Another setback is the computation of the best next hop when the number of one-hop neighbors or the number of destinations increases, since the selection procedure requires finding the shortest distance between each neighbor-destination pair, which can lead to numerous issues, including power consumption and disconnections.
DPTR [9] is another protocol for FANETs, designed to handle transmission in collaborative ad hoc networks. By adding certain rules to the formation of Red-Black (R-B) trees, a distributed priority tree is formulated. This tree forms a priority network that allows the selection of an appropriate node and channel for relaying, to avoid network fragmentation. Although DPTR is scalable and overcomes network fragmentation, it does not support mobility and requires considerable effort for management and control [5]. In addition, it does not take energy consumption into consideration.
The majority of the methods described above lack reliability and scalability [7]. One major issue with these methods is that they are power-agnostic, which ignores the fact that UAVs are power-hungry devices, especially with the additional constraint of flying in swarm formation. Therefore, with SEMRP, we propose a new multicast routing protocol that addresses the aforementioned issues. We describe SEMRP in detail in the following section.
III. SEMRP: ENERGY-EFFICIENT MULTICAST ROUTING PROTOCOL FOR UAV SWARMS
In this section, we explain the basic functioning of our solution through two versions (SEMRP-v1 and SEMRP-v2), which take different perspectives yet share the same objectives, discussed in the system model.
A. System Model
To overcome the above-mentioned challenges, we propose a new multicast approach for a swarm of UAVs in order to choose the shortest distance between UAVs, to optimize the overall energy consumption in the network, and to extend the operating time of the swarm.
Our protocol aims at constructing a multicast tree to deliver data from one source to the swarm member UAVs ($U$). We consider the following assumptions:
• Each UAV is aware of its position in the 3D environment with the help of a positioning service. The UAVs are equipped with cameras, sensors and other necessary equipment according to the application.
• The transmission range of a drone can be changed flexibly by adjusting its transmission power. The maximum transmission range of each drone is limited to MTR, which is the communication range achieved using the Maximum Transmission Power (MTP).
• To facilitate the movement of the drones and the delivery of data, we assume that the infected city area has a uniform shape, for example rectangular or square.
In our model, we adopt the Friis power transmission formula, which is expressed as follows:
$$\frac{P_j}{P_i} = G_i G_j \left( \frac{\lambda}{4\pi D_{(u_i,u_j)}} \right)^{\alpha} \tag{1}$$
where $P_i$ is the emission power of the transmitting drone $u_i$, $P_j$ is the power received at drone $u_j$, $G_i$ and $G_j$ are the antenna gains of the transmitter and the receiver respectively, and $\lambda$ is the wavelength. The parameter $\alpha$ is typically in the range of 2 to 4, depending on the characteristics of the communication medium [15]. In FANET applications $\alpha$ is set to 2. $D_{(u_i,u_j)}$ is the distance between drones $u_i$ and $u_j$.
When constructing a multicast tree from the source drone $u_s$ to all destination nodes, it is necessary to consider the Signal-to-Noise-and-Interference Ratio (SNIR) requirement for wireless communications. In the free-space propagation model, the path loss is generally a function of the distance between UAVs. The SNIR requirement should be satisfied for each UAV to receive multicast messages. Thus, the received signal power level at each drone should be above the minimum sensitivity level, denoted $P_{MinS}$. As specified in IEEE 802.11p, the minimum sensitivity level equals -68 dBm for a throughput of 27 Mb/s, and the 5.9 GHz band is used for U2U communications. Thus, the transmitting power should satisfy:
$$P_i \geq \frac{P_{MinS}}{G_i G_j} \left( \frac{4\pi \sqrt{\Delta x^2 + \Delta y^2 + \Delta z^2}}{\lambda} \right)^{2} \tag{2}$$
where $\Delta x = x_i - x_j$, $\Delta y = y_i - y_j$, and $\Delta z = z_i - z_j$.
Our objective is to design an optimal tree for delivering multicast messages such that the total power consumption is minimized. The objective function can be expressed as:
$$\min \; \frac{P_{MinS}}{G_i G_j} \left( \frac{4\pi}{\lambda} \right)^{2} \cdot \sum_{u_f \in T} \; \max_{k \in N(u_f)} D^{2}_{(u_f,u_k)} \tag{3}$$
Our solution is based on designing an optimal multicast tree using two different methods, SEMRP-v1 and SEMRP-v2. Both aim at choosing the nearest route to the multicast destination nodes and at switching to a closer route if one is found. This process helps to minimize the transmission energy of forwarders and to reduce the number of hops in the tree.
In the following sub-sections, we present the steps and working logic used by both versions to create a multicast routing tree that spans the UAV members of the multicast group.
B. SEMRP-v1 Protocol
The construction of the multicast routing tree starts on demand, when a source has data to send to multiple multicast destinations. The formal description of the tree construction process is presented in Algorithm 1.
The SEMRP-v1 process is carried out in the following phases:
(i) Initially, the multicast tree only contains the root, which is the source $u_s$.
(ii) Then, $u_s$ calculates the distance as highlighted in step 2 of Algorithm 1.
(iii) The source $u_s$ initiates the construction of the tree by adding the node with the shortest distance to the tree and adjusting the transmission power to reach it; the necessary steps are steps 3, 4, 5, and 6 of Algorithm 1.
(iv) Next, $u_s$ checks its other one-hop neighbors to potentially increase its transmission power, by testing the condition shown in step 7 of Algorithm 1 and following its sub-steps.
(v) The source proceeds by notifying the nodes it added to the tree so that they can continue the construction of the tree, as shown in step 8 of Algorithm 1. The notified nodes that are in the tree act as the new source and repeat the steps of the algorithm starting from step 2 of Algorithm 1, along with updating their
2020 IEEE/ACM 24 ͭ ͪ International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
185
2020 IEEE/ACM 24 ͭ ͪ International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
[Figure 1, panels (a), (b), (c): graph of nodes U_S, U_A, U_B, U_C, U_D, U_E, U_F, U_G, U_H, U_R and U_V with weighted links]
Procedure 2: SearchForRemainingNodes()
Output: void
Steps
1. if CurrentNode ∈ Tree T then
     SearchForRemainingNodes()
   end
   else
2.   if CurrentNode ∉ Tree T then
       if CurrentNode is isolated then
         2.1a Add CurrentNode to the tree
         2.2a Make Sender node the parent of node CurrentNode
         2.3a Adjust transmission power of the parent node
         2.4a for each node i ∈ N(Parent Node 'P') do
         2.5a   if D(i,p) < D(CurrentNode,p) then
                  Make node P the parent of node i
                end
              end
       end
     end
     else
       if CurrentNode is ¬ isolated then
         2.1b Add CurrentNode to the tree
         2.2b Make shortest distance node the parent of node CurrentNode
         2.3b Repeat steps 2.3a, 2.4a, 2.5a
       end
     end
   end
figure 1.c. The next remaining node is U_V; the closest node is U_H (U_H → U_V), with D(U_V,U_H) = 22. Therefore, U_H increases its transmission power to reach U_V directly as a function of its one-hop distance, which equals 10. Neighbors closer to U_H than U_V, such as U_R with a one-hop distance of 9, can then be reached directly from U_H after the increase in transmission power; therefore U_R's new parent is U_H, as shown in Figure 1(c). This step has the advantage of minimizing the number of messages and the number of hops. After all nodes have been added to our multicast tree, the delivery of packets, which started in parallel with tree construction, continues until the multicast destinations are reached.
C. SEMRP-v2 Protocol
This algorithm is an improved version of SEMRP-v1, which is based on two-hop discovery. In the following, we explain the main phases of the SEMRP-v2 algorithm.
Algorithm 2: Multicast Tree Construction of SEMRP-v2
Input: u_s, N_u;
Output: MRT, Multicast Routing Tree rooted at u_s;
Steps
1. u_s discovers its one-hop neighbors;
2. for each node u_i ∈ N(u_s) do
     D(u_s,u_i) = sqrt((X_i − X_s)^2 + (Y_i − Y_s)^2 + (Z_i − Z_s)^2);
   end
3. for each node u_i ∈ N(u_s) do
     if u_i is not notified yet then
       1. Notify(u_i) // to launch their one-hop discovery process;
       2. Mark as notified;
     end
   end
4. if all u_i ∈ N(u_s) finish their one-hop neighboring process then
     1. if myID == u_s then
          1. u_s: insert myself in T;
          2. BuildTree(myID);
        else
          1. Set my status ready;
          2. if I am marked as selected to be a forwarder then
               BuildTree(myID);
             end
        end
   end
5. if I receive a selection alert then
     BuildTree(myID);
   end
6. if I receive a forwarder finished alert then
     if N(u_s) != N_T(u_s) then
       BuildTree(myID);
     end
   end
(i) In the first phase, the source u_s broadcasts a hello message to discover its one-hop neighbors N(u_s), which reply with their 3D positions to allow it to calculate the
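The re-parenting rule in steps 2.3a to 2.5a of Procedure 2 can be sketched as follows. This is only an illustration under stated assumptions: the tree is stored as a child-to-parent dictionary, the parent's one-hop distances are given as a dictionary, and the guard that skips the parent's own ancestors is an addition of this sketch (Procedure 2 does not state it) to keep the structure a tree. The distances 10 (U_V) and 9 (U_R) come from the worked example in the text; the entry for U_C and the previous parent of U_R are hypothetical.

```python
def ancestors(node, tree_parent):
    """Return the set of nodes on the path from `node` up to the root."""
    seen = set()
    while tree_parent.get(node) is not None:
        node = tree_parent[node]
        if node in seen:  # defensive stop in case of a malformed tree
            break
        seen.add(node)
    return seen


def attach_and_reparent(parent, current, one_hop_dist, tree_parent):
    """Attach `current` under `parent`, then adopt every neighbor of
    `parent` that lies closer than `current` (steps 2.3a to 2.5a).

    one_hop_dist: neighbor id -> distance from `parent`, i.e. N(P)
    tree_parent:  child id -> parent id (root maps to None), updated in place
    Returns the new transmission range of `parent`.
    """
    reach = one_hop_dist[current]       # 2.3a: power is raised to cover this range
    tree_parent[current] = parent       # 2.1a / 2.2a
    protected = ancestors(parent, tree_parent)  # assumption: never adopt ancestors
    for node, dist in one_hop_dist.items():     # 2.4a
        if node != current and node not in protected and dist < reach:  # 2.5a
            tree_parent[node] = parent
    return reach


# Example loosely following the text: U_H raises its power to reach U_V.
tree = {"US": None, "UC": "US", "UH": "UC", "UR": "UG"}
neighbors_of_UH = {"UV": 10, "UR": 9, "UC": 3}
new_range = attach_and_reparent("UH", "UV", neighbors_of_UH, tree)
print(new_range, tree["UV"], tree["UR"], tree["UC"])
```

After the call, U_V and U_R both have U_H as parent, while U_C keeps its own parent because it lies on the path from U_H to the root.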
[Figure: number of operations (Operations [#], logarithmic axis) versus number of base stations (BSs [#], 1 to 20); series: A priori Sums/Subs, A priori Mults/Divs, ISAAC Sums/Subs, ISAAC Mults/Divs]
[Figure: two panels plotted against the number of base stations (BSs [#], 1 to 7); y-axes: Uploaded data [GB] and Acquired data [GB]]
[Figure: two panels plotted against the number of base stations (BSs [#], 1 to 7)]
Abstract—As drone technology passes one milestone after another, drones are used in an ever-increasing number of applications and are now considered an integral part of the future smart city infrastructure. At the same time, the inherent safety and privacy risks associated with drone-based applications call for appropriate testing and monitoring tools. In this paper, we present a simulation environment and digital twin support for a platform that allows the managed execution of drone-based applications on top of a shared drone infrastructure. On the one hand, the simulation environment makes it possible to perform a wide range of tests regarding the operation of both the platform itself and the applications that run on top of it, before deploying them in the real world. On the other hand, after deployment, a digital twin of the drone is used to detect deviations of the application from the expected behavior, which, in turn, can serve as an indication of bugs that remained undetected during the simulation tests or of malfunctions that occur at runtime. We discuss the most important elements of our approach and the simulation and digital twin components of the proposed system. We also provide a functional evaluation of our work by presenting its capabilities regarding both offline testing and runtime checking through indicative use cases.
Keywords—drones, simulation environment, digital twin, PaaS
I. INTRODUCTION
Drones are evolving into a major component of next-generation smart city initiatives, which aim at enhancing the life of their residents through efficient infrastructures and services [1]. For example, delivery drones expedite the transportation of goods and medical products with minimal human involvement while avoiding traffic-related delays. Furthermore, drones are a valuable asset for private and public property surveillance and security operations, offering better coverage and rapid response to critical situations. The construction industry has also started using drones for the aerial inspection of buildings and infrastructure in order to improve worker safety and increase the efficiency of scheduling operations.
Even though the prices of commercial off-the-shelf drones are steadily decreasing, the cost of specialized drones built with durable materials or carrying high-end equipment, combined with licensing and insurance fees, remains significant. As a consequence, such drones typically remain out of reach for most small and medium-sized companies. An alternative is to use drones and drone-related resources on demand, in the spirit of the Platform-as-a-Service (PaaS) paradigm, which is already very popular in Cloud computing. While such efforts are still in their infancy, this approach has significant advantages compared to private drone ownership and operation.
However, the upcoming coexistence of multiple drones flying over citizens and private properties raises several safety and privacy issues due to possible crashes, collisions, and uncontrolled usage of the drones' onboard equipment [2], [3]. While most countries have formal processes for submitting flight plans and getting approval, in many cases there are no mechanisms for monitoring drone operation and, more crucially, for ensuring that the approved flight plan is followed. Also, most efforts towards low-altitude airspace management [4], [5] focus on where drones are allowed to fly, rather than on how their onboard equipment is used during flight.
Inevitably, this leads to skepticism and limited public acceptance of drone-based systems, and even to extreme reactions by people opposed to drone usage [6]. To gain citizens' trust, systems have to be engineered to address safety and privacy issues by design, through suitable mechanisms that can be easily integrated and used to detect bugs and malfunctions.
In this paper, we propose a holistic approach towards supporting a more reliable managed operation of third-party applications on a shared drone infrastructure, which can contribute to building trust between the various stakeholders and making drones more acceptable to the wider public. The main contributions of the paper are: (i) we present a modular architecture combining a PaaS system for drone applications, which offers automated deployment and restriction enforcement, with corresponding simulation and digital twin support that can be used to detect bugs before deployment and to indicate possible malfunctions during operation in the real world, respectively; (ii) we discuss key aspects of a prototype implementation; (iii) we showcase how the proposed work can be used in practice through representative case studies.
The rest of the paper is organized as follows. Section II describes the PaaS system that constitutes the baseline for this work. Section III presents the design of the simulation environment and digital twin support we propose for this PaaS system, while Section IV discusses the main aspects of our implementation. Section V illustrates the simulation and digital twin functionality for an indicative test application under different execution scenarios. Section VI gives an overview of related work. Finally, Section VII concludes the paper.
networking/storage stacks. For the HITL configuration of the v-drone, we support the Raspberry Pi 3
platform, which is the most popular companion board used in real drones.
The Test Orchestrator and Results Analyzer are implemented as Python libraries. The test scenarios
are Python scripts that simply invoke these libraries. The agents are also implemented as Python
programs, running as standalone processes with the respective interface being invoked via RPCs
through the zerorpc library [13]. For logging, we utilize the standard Python logging facility, and
event logs are processed by the Results Analyzer using standard Unix tools.
The Test Orchestrator creates and configures all simulation entities through the respective agents.
It is also responsible for performing all the network configuration to enable the communication
between the entities involved in a given test. In simulation-based configurations, wireless
networking is implemented using ns-3 [14]. For Wi-Fi channels, each simulated ns-3 node, called
ghost node, utilizes the ns-3 TapBridge device, which is connected to each v-drone through a
combination of network bridges and virtual network devices (see [12]). We also provide support for
simulated LTE interfaces. This is achieved by introducing a high-bandwidth, low-latency CSMA link
between a ghost node and the simulated LTE's UE node (see [15]). Finally, the v-drone agents
continuously update the position of the respective ns-3 ghost nodes through a pub/sub scheme that
is implemented using the ZeroMQ library [16]. In the DT/SITL configuration, the Test Orchestrator
instructs the agents of the real and the v-drone to set up the physical connection that will be
used to transfer the required information from the real drone to its digital twin.
Inside v-drones, the application is executed as a Docker container (this is how applications are
packaged in the PaaS system). To this end, for v-drones running as LXD containers we exploit the
nested container functionality. Applications may perform different navigation operations, access
sensors and issue actuation commands via the autopilot subsystem, using the MAVLink messaging
protocol [17]. Various APIs offer MAVLink support, from low-level C and Python (pymavlink)
libraries to higher-level ones like DroneKit [18] and ROS [19] through the MAVROS communication
node.
The autopilot used in v-drones is the latest stable version of ArduPilot [20], which is one of the
most widely adopted autopilot stacks supporting a wide variety of aerial vehicles. In the HITL and
SITL configurations, we use the pre-built binaries for the ARM and x86_64 architectures,
respectively. The autopilot proxy in the hybrid HITL/SITL v-drone configuration is implemented
using the MAVProxy [21] multiplexing tool, which forwards all messages of the Drone Runtime to the
remote autopilot (running in SITL), and vice versa. Note that any autopilot that supports MAVLink
and provides support for HITL or SITL simulation, such as PX4, could be easily integrated in our
framework. The replay engine used in the DT/SITL configuration is implemented as a Python module,
while the autopilot mockup utilizes the pymavlink library for receiving and sending MAVLink
messages.
As mentioned in Section III, the autopilot of a v-drone can be configured to use an integrated
flight dynamics model or be connected with an external, higher-fidelity simulator. ArduPilot has
built-in support for the former option, while for the latter option we currently support Gazebo
[22] through the corresponding plugin. Note that any other simulator that supports ArduPilot (e.g.,
AirSim [23]) could be used instead.
Finally, the simulation environment provides two options for simulated camera support to the
application. When running the autopilot together with Gazebo, the ROS camera plugin can be utilized
to publish the virtual camera stream of Gazebo to a ROS topic. Alternatively, we provide a custom
virtual camera module for taking snapshots, which mimics the API of the Raspberry Pi camera module
and returns images automatically extracted from a pre-configured database (see [12]).
V. FUNCTIONAL EVALUATION
We illustrate the main aspects of the provided functionality through indicative scenarios focusing
on flight-related behavior. The application we use for this purpose is intentionally kept simple.
It is a Python program that arms the drone, takes off to WP1 (at a height of 10 meters), goes to
waypoint WP2, next moves to waypoint WP3, returns to initial location WP1, and lands. This is done
by issuing corresponding commands to the autopilot via DroneKit.
A. Offline Validation
Listing 1 shows the setup sequence for an offline test using a v-drone and a v-Controller that
communicate via Wi-Fi.
     1 controller = vController()
     2 drone = vDrone("vDrone-1", SITL,
     3                autopilot=Ardupilot,
     4                setup=config.min_resources)
     5 # drone = vDrone("vDrone-1", SITL, fdm=gazebo)
     6 drone.set_app("waypoint_app",
     7               params=config.app_params)
     8 drone.set_plan(config.flight_plan)
     9 drone.set_pos(config.home_pos)
    10 network = NetSim(config.net_wifi)
    11 network.add_participants(controller, drone)
    12 drone.set_logs(app, runtime, autopilot)
    13 controller.set_logs(runtime)
    14 drone.start_app()
Listing 1: Setup and start offline test.
In a nutshell the steps are to: create a v-Controller entity (line 1); instantiate a v-drone with
identifier "vDrone-1", which is set in SITL mode using ArduPilot, with system resources (in memory
and disk) as indicated in a configuration file (lines 2-4); set the waypoint application as a
pre-installed application at the v-drone (as opposed to deploying it dynamically via the
v-Controller) and specify its parameters (takeoff altitude and waypoints) (lines 6-7); set the
approved flight plan, which also specifies restrictions and the respective corrective actions
(line 8); set the start location / home position of the v-drone (line 9); create a Wi-Fi network
(lines 10-11); activate logging at all available levels (lines 12-13); and finally start the
application (line 14).
The option of specifying the usage of an external flight dynamics simulator in the SITL setup (in
this case, Gazebo)
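For orientation, the control flow of the waypoint application described at the start of Section V (arm, take off to 10 m, visit WP2 and WP3, return to WP1, land) might look roughly as follows. This is only a sketch under stated assumptions: the paper's application talks to the autopilot via DroneKit, whereas `RecordingAutopilot`, `run_waypoint_mission` and the method names used here are hypothetical stand-ins, chosen so that the sequence can be run without a simulator. A real application would also wait for each step to complete before issuing the next command.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (latitude, longitude) in degrees


@dataclass
class RecordingAutopilot:
    """Hypothetical stand-in for an autopilot link: it only records commands."""
    log: List[tuple] = field(default_factory=list)

    def arm(self) -> None:
        self.log.append(("arm",))

    def takeoff(self, altitude_m: float) -> None:
        self.log.append(("takeoff", altitude_m))

    def goto(self, wp: Waypoint, altitude_m: float) -> None:
        self.log.append(("goto", wp, altitude_m))

    def land(self) -> None:
        self.log.append(("land",))


def run_waypoint_mission(autopilot, waypoints: List[Waypoint],
                         altitude_m: float = 10.0) -> None:
    """Arm, take off above the first waypoint, visit the rest, return, land."""
    home = waypoints[0]                 # WP1 is the start location
    autopilot.arm()
    autopilot.takeoff(altitude_m)       # climb to the mission altitude above WP1
    for wp in waypoints[1:]:            # WP2, WP3, ...
        autopilot.goto(wp, altitude_m)
    autopilot.goto(home, altitude_m)    # return to WP1
    autopilot.land()
```

Calling `run_waypoint_mission(RecordingAutopilot(), [wp1, wp2, wp3])` leaves the issued command sequence in the `log` attribute, which is the kind of application-level trace a test scenario could compare against the expected behavior.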
proposes the runtime monitoring of software components through the execution of their digital twins
(which are considered abstract specifications) in a simulated environment in order to detect and
mitigate malicious behaviors.
The authors in [31] argue that in software systems offline verification before deployment must be
accompanied by quantitative online verification of the key requirements at runtime in order to
achieve software dependability and adaptiveness, through the identification, and sometimes
prediction, of requirement violations. Along the same lines, in this work, we adopt such a holistic
checking approach through a framework that provides the means to test the various software entities
of a PaaS system both in an offline and online fashion.
VII. CONCLUSION
We have presented our approach for supporting offline validation and runtime checking in the
context of a PaaS system for drone-based applications, through suitable simulation and digital twin
mechanisms. Also, we have discussed key aspects of our implementation and have illustrated its
functionality through indicative simulation and real-world scenarios.
The proposed framework has been successfully used in research projects, focused on the
pre-deployment testing of experiments with unmanned vehicles and the validation of an automated
inspection system of photovoltaic parks using drones. At the same time, we are continuously working
on different improvements and extensions. On the one hand, we wish to explore ways of enriching the
digital twin setup in order to have the ability to run predictive simulations at runtime. On the
other hand, we are in the process of integrating yet another form of runtime testing, through the
support of suitable drills that imitate specific problematic situations in the PaaS in order to
check the successful triggering of the respective compensating actions.
ACKNOWLEDGMENT
This research has been co-financed by the European Union and Greek national funds through the
Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH –
CREATE – INNOVATE, project PV-Auto-Scout, code T1EDK-02435.
REFERENCES
[1] A. R. Singh, “How Drones are crucial for Smart Cities?”
    https://www.geospatialworld.net/blogs/how-drones-are-crucial-for-smart-cities/ (2018-09-04).
[2] E. Vattapparamban, I. Guvenc, A. I. Yurekli, K. Akkaya, and S. Uluagac, “Drones for smart
    cities: Issues in cybersecurity, privacy, and public safety,” in Proc. International Wireless
    Communications and Mobile Computing Conference, 2016, pp. 216–221.
[3] D. Wright and R. Finn, “Making drones more acceptable with privacy impact assessments,” in
    Information Technology and Law Series. T.M.C. Asser Press, 2016, vol. 27, pp. 325–351.
[4] NASA, “Unmanned Aircraft System (UAS) Traffic Management (UTM),”
    https://utm.arc.nasa.gov/index.shtml.
[5] SESAR, “U-Space,” https://www.sesarju.eu/U-space.
[6] M. Murisonon, “Drones Will Be Shot Down Until These Misconceptions Are Tackled,”
    https://dronelife.com/2019/03/04/drones-will-be-shot-down-until-these-misconceptions-are-tackled/
    (2019-03-04).
[7] J. Yapp, R. Seker, and R. Babiceanu, “UAV as a service: Enabling on-demand access and on-the-fly
    re-tasking of multi-tenant UAVs using cloud services,” in Proc. IEEE/AIAA Digital Avionics
    Systems Conference, 2016.
[8] A. Koubâa, B. Qureshi, M.-F. Sriti, A. Allouch, Y. Javed, M. Alajlan, O. Cheikhrouhou,
    M. Khalgui, and E. Tovar, “Dronemap Planner: A service-oriented cloud-based management system
    for the Internet-of-Drones,” Ad Hoc Networks, vol. 86, pp. 46–62, 2019.
[9] A. Van’t Hof and J. Nieh, “AnDrone: Virtual drone computing in the cloud,” in Proc. EuroSys
    Conference, 2019, pp. 6:1–6:16.
[10] N. Grigoropoulos and S. Lalis, “Flexible deployment and enforcement of flight and privacy
    restrictions for drone applications,” in Proc. IEEE/IFIP International Conference on Dependable
    Systems and Networks Workshops (DSN-W), 2020, pp. 110–117.
[11] F. Bonomi, R. Milito, P. Natarajan, and J. Zhu, “Fog computing: A platform for internet of
    things and analytics,” in Big Data and Internet of Things: A Roadmap for Smart Environments.
    Springer International Publishing, 2014, pp. 169–186.
[12] M. Koutsoubelias, N. Grigoropoulos, and S. Lalis, “A modular simulation environment for
    multiple UAVs with virtual WiFi and sensing capability,” in Proc. IEEE Sensors Applications
    Symposium, 2018.
[13] J. Petazzoni, “Build reliable, traceable, distributed systems with ZeroMQ,”
    https://us.pycon.org/2012/schedule/presentation/260/.
[14] G. F. Riley and T. R. Henderson, “The ns-3 network simulator,” in Modeling and Tools for
    Network Simulation. Springer Berlin Heidelberg, 2010, pp. 15–34.
[15] A. R. Portabales and M. L. Nores, “Dockemu: Extension of a scalable network simulation
    framework based on docker and NS3 to cover IoT scenarios,” in Proc. International Conference on
    Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH), 2018,
    pp. 175–182.
[16] ZeroMQ, “Open-source messaging library,” https://zeromq.org/.
[17] MAVLink, “Drone communication protocol,” https://mavlink.io/en.
[18] DroneKit, “Developer tools for drones,” http://dronekit.io/.
[19] ROS, “Robot Operating System,” https://www.ros.org.
[20] ArduPilot, “Open source autopilot,” http://ardupilot.org.
[21] “MAVProxy,” http://ardupilot.github.io/MAVProxy/html/index.html.
[22] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an open-source multi-robot
    simulator,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
    2004, pp. 2149–2154.
[23] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical
    simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available:
    https://arxiv.org/abs/1705.05065
[24] S. Baidya, Z. Shaikh, and M. Levorato, “FlyNetSim: An open source synchronized uav network
    simulator based on ns-3 and ardupilot,” in Proc. ACM International Conference on Modeling,
    Analysis and Simulation of Wireless and Mobile Systems, 2018, pp. 37–45.
[25] ArduPilot, “SITL Simulator,”
    http://ardupilot.org/dev/docs/sitl-simulator-software-in-the-loop.html.
[26] A. Al-Mousa, B. H. Sababha, N. Al-Madi, A. Barghouthi, and R. Younisse, “UTSim: A framework and
    simulator for UAV air traffic integration, control, and communication,” International Journal
    of Advanced Robotic Systems, vol. 16, no. 5, 2019.
[27] S. Hallerbach, Y. Xia, U. Eberle, and F. Koester, “Simulation-based identification of critical
    scenarios for cooperative and automated vehicles,” SAE International Journal of Connected and
    Automated Vehicles, vol. 1, no. 2, pp. 93–106, 2018.
[28] C. Blum, A. F. T. Winfield, and V. V. Hafner, “Simulation-based internal models for safer
    robots,” Frontiers in Robotics and AI, vol. 4, 2018.
[29] R. Vaughan, “Massively multi-robot simulation in stage,” Swarm Intelligence, vol. 2, no. 2-4,
    pp. 189–208, 2008.
[30] E. Cioroaica, F. D. Giandomenico, T. Kuhn, F. Lonetti, E. Marchetti, J. Jahic, and
    F. Schnicke, “Towards runtime monitoring for malicious behaviors detection in smart ecosystems,”
    in Proc. IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW),
    2019, pp. 200–203.
[31] R. Calinescu, C. Ghezzi, M. Kwiatkowska, and R. Mirandola, “Self-adaptive software needs
    quantitative verification at runtime,” Communications of the ACM, vol. 55, no. 9, pp. 69–77,
    2012.
Abstract—With the shift toward a Mobility-as-a-Service paradigm, electric scooter sharing systems
are becoming a popular means of transportation in cities. Given their novelty, we lack consolidated
approaches to study and compare different system design options. In this work, we propose a
simulation approach that leverages open data to create a demand model that captures and generalises
the usage of this means of transportation in a city. This calls for ingenuity to deal with coarse
open data granularity. In particular, we create a flexible, data-driven demand model by using
modulated Poisson processes for temporal estimation, and Kernel Density Estimation (KDE) for spatial
estimation. We next use this demand model alongside a configurable e-scooter sharing simulator to
compare the performance of different electric scooter sharing design options, such as the impact of
the number of scooters and the cost of managing their charging. We focus on the municipalities of
Minneapolis and Louisville, which provide large-scale open data about e-scooter sharing rides. Our
approach lets researchers, municipalities and scooter sharing providers follow a data-driven
approach to compare and improve the design of e-scooter sharing systems in smart cities.
Index Terms—open data, demand model, scooter sharing, electric vehicle, data driven optimization.
I. INTRODUCTION
Urban mobility presents a number of non-trivial challenges both for researchers and regulators.
Some of these challenges are related to sustainability and pollution: in the EU, for example, urban
mobility accounts for 40% of all CO2 emissions of road transport and for up to 70% of other
pollutants from transport systems.¹ The need to reduce emissions and congestion, along with the
rise of the sharing economy, has moved several policy makers to promote micro-mobility services in
cities. These services refer to lightweight, often electric-powered vehicles rented for short trips
and typically operating at low speeds.
¹ https://ec.europa.eu/transport/themes/urban/urban_mobility_en
In this context electric scooters (e-scooters) represent a sustainable and cheap alternative to
reduce the number of private vehicle trips [1] and consequently traffic congestion [2] and land use
[3]. Indeed, e-scooters are among the fastest growing electric micro-mobility means. The number of
companies offering e-scooters to rent and the number of cities where the service is available keep
growing. Indeed, the e-scooter usage is nowadays comparable with popular car ride-sharing services
like Uber and Lyft [4].
Since 2017 several e-scooter companies have started their services in many cities in North America
and Europe. Internet-of-Things technologies, paired with accurate GPS tracking, allow the providers
to track the position of the e-scooters and monitor users' trips. These data can be used to
understand the impact and the utilization of e-scooters in the smart city mobility ecosystem. In
this direction, municipalities started offering open data to let other players study alternative
solutions.
In this work we are, to the best of our knowledge, the first to study the service sustainability of
e-scooter systems from the point of view of a provider. Notice that the peculiarities of this novel
scenario call for new approaches (see Section II for a discussion). In this work, we consider the
municipalities of Minneapolis and Louisville as use cases.
First, we need to understand how, when and where e-scooters are used by the users, i.e., the
mobility demand. For this purpose we rely on open data. Open data typically shares coarsely
aggregated data for privacy reasons. This challenges its usage, and calls for ingenuity to
appropriately pre-process data with spatio-temporal disaggregation techniques to increase
resolution and derive a flexible, albeit realistic, demand model. For this, we combine Poisson
processes for customers' arrivals, and Kernel Density Estimate to model the spatial demand [5]. To
allow other researchers to reproduce and extend our results, we make our demand models available
upon request.
Afterwards, we leverage the constructed demand model to run simulation studies to compare different
fleet management policies, with a focus on battery charging strategies. For this we extend our
simulator implemented in [6] to support e-scooter scenarios. The simulator allows us to model
system parameters such as the operative area granularity, vehicle characteristics, fleet size,
users' preferences or fleet management policies. It simulates the search, rental, and return of
e-scooters by customers, and the battery consumption and charging operations needed to maintain the
fleet. As performance metrics, we mainly focus on satisfied trips, i.e., the fraction of customers'
requests that the system can accommodate; and the fleet management cost, proportional to the time
workers have to spend to reach and charge the e-scooter battery, assuming a battery swap policy.
The results show that with a spatio-temporal disaggregation coupled with Poisson process and the
Kernel Density Estimate
we can create a reliable demand model to perform accurate simulations. Our findings show how
e-scooter operators should carefully evaluate the best trade-off to balance the users' satisfaction
and the fleet management costs. In particular, we show (i) the impact of the size of the fleet,
(ii) the impact of the choice of when to swap/charge the batteries, (iii) the implications of using
workers or asking for the users' cooperation for charging operations.
Results show that the very heterogeneous demand calls for a large number of e-scooters. Similarly
the fleet management operations have a high cost due to many battery swap operations. Furthermore,
reducing the time for workers to reach the e-scooters and change their batteries has a fundamental
impact on reducing cost. Alternatively, directly involving the users in the charging process would
further reduce costs, becoming a key design decision.
The paper is organized as follows. In Sec. II we discuss existing works about e-mobility and
charging solutions. In Sec. III we describe and characterize the used open datasets. In Sec. IV we
introduce the spatio-temporal disaggregation techniques to create our demand models, as well as the
simulation assumptions and performance metrics. In Sec. V we show results of our methodology for
the cities of Louisville and Minneapolis. Finally in Sec. VI we summarize the paper and present
future directions.
II. RELATED WORK
The impact of e-scooters on urban mobility is an emerging research topic. The seminal work [7]
tested in 2011 the benefits of e-scooters on commuters. Since then, a few other studies have tried
to gauge the impact of e-scooters on mobility. For instance, authors of [8] present an extensive
market analysis emphasizing the possible growth in the usage of e-scooters and raising the problem
of how to handle the charging process in the presence of large fleets. As a possible solution,
authors of [9] propose a model where a MILP formulation clusters together the e-scooters that need
to be charged. Similarly, authors of [10] study the benefits of electric fleets (of e-scooters and
e-bikes) in last mile delivery for big players in Milan. They are among the first to exploit real
data, albeit collected from a very limited deployment (fewer than 75 vehicles). Authors of [11]
offer a first characterization of users' habits, collecting the daily trips of 38 users and
pointing out how the leisure component is relevant for e-scooters. More recent works ([12], [13])
compare micro-mobility services (dockless bike, e-bike and e-scooters) using data exposed by
providers. The results confirm that users prefer e-scooters to cover trips shorter than 1.6 km.
Moreover the e-scooter daily patterns do not match the commuting patterns. In [14] the authors show
that the number of bookings per hour is higher in good weather conditions. These characteristics
reinforce the need for specific models and tools to study this new type of mobility.
To the best of our knowledge, our work is the first to present a holistic approach to study and
compare different system design options, leveraging large open data. We follow a similar approach
as in our previous work [15], [16] where we analyze customers' mobility demand patterns in a free
floating car sharing system. Here we revisit our methodology in the context of micro-mobility.
Considering shared electric vehicle systems in general, the major challenge is the battery charging
process. The battery swap appears to be the most suitable approach for e-scooters but its study has
never been explicitly targeted so far in scientific literature. Early studies about e-buses [17]
focus their attention on the management of possible battery switch stations and their placement.
Other models focus on optimizing the charging process for large vehicles taking into account
electric network constraints and system degradation [18], or considering the distance travelled to
reach a battery switch station [19]. In a recent work [20], authors optimize battery switch
stations considering costs of energy, equipment degradation and energy demand variability.
Few studies analyze the battery swap process applied to shared vehicles. Authors of [21] propose a
mixed integer programming formulation to maximise the satisfied trips in an electric station-based
car sharing system, minimizing at the same time the number of battery swaps. Authors of [22]
propose an optimal schedule for EV battery swap at stations minimizing travel distance and
electrical usage. Unlike our work, none of these models fits the e-scooter scenario, because they
do not consider small vehicles and small batteries and hence do not allow local swapping of the
batteries.
III. DATA COLLECTION AND CHARACTERIZATION
In this section we describe the datasets and characterize the system usage focusing on the most
important metrics that would impact the design of an e-scooter sharing system.
A. Dataset description
We focus our study on two cities in the US, namely Louisville and Minneapolis, whose municipalities
make available data about all the e-scooter rides performed by the customers using any of the
e-scooter sharing providers present in each city.² To protect the riders' privacy and avoid leaking
any company-specific strategy, data do not contain any identifier of the company, or vehicle, or
customer. Furthermore, data is aggregated and/or fuzzed following NACTO guidelines in order to make
user tracking impossible.³ This challenges the direct usage of the open data, and calls for
ingenuity to derive suitable models.
² Datasets are available at: https://data.louisvilleky.gov/dataset/dockless-vehicles, and
http://opendata.minneapolismn.gov/search
³ National Association of City Transportation Officials https://nacto.org/
In our case, each trip exposes information describing the trip duration, distance, starting and
ending position, and the time when the trip started. Different quantisation applies. For
Louisville, starting and ending positions are encoded with GPS coordinates rounded to 3 decimals
(approximately 80 m bins); trip duration is given with a precision of one minute, and the starting
timestamp is rounded to the closest 15-minute period. Minneapolis data expose similar information
but even more aggregated. Origin and destination positions are defined
by street IDs so that each trip refers to an entire street length rather than precise coordinates.
Timestamps are rounded to the closest 30-minute period. These roundings are essential to protect
the users' privacy, but they complicate the extraction of useful insights from the data. The
granularity of ride duration, distance, day and hour of the day still allows us to extract useful
patterns about e-scooter usage over time. However, the absence of e-scooter identifiers, precise
coordinates and timestamps makes it impossible to track how each e-scooter moves in the city. Thus
we cannot simply replay the same trace in a simulator as done for car sharing services (e.g., in
[15]).
B. Dataset characterization
First we provide a data characterization to help understand the scenarios we are facing. In Fig. 1
we report the number of total recorded trips (i.e., rentals) for each day over the months of July,
August and September 2019. More than half a million and 180 k trips have been recorded in
Minneapolis and Louisville, respectively. Interestingly, while Louisville shows a repetitive weekly
pattern with peaks over the weekend but without any specific trend, Minneapolis exhibits an
increasing trend. The different number of daily trips is consistent with the difference in fleet
size between the cities, with Minneapolis having more than twice as many e-scooters as Louisville
(see Table I).⁴ Some sudden falls are related to bad weather conditions that affect the willingness
of customers to rent an e-scooter [14].
⁴ As no vehicle ID is present the maximum number of vehicles is extracted from Louisville service
description² and Minneapolis official website
http://www.minneapolismn.gov/publicworks/trans/WCMSP-212816
[Fig. 1: Time series of trips per day. Horizontal axis: Date, from 08 Jul '19 to 30 Sep '19.]
To analyze how the demand is distributed during the hours of the day, in Fig. 2 we report the
average number of trips per hour per weekday (solid line) and per weekend (dashed line). As
expected, Minneapolis exhibits more trips per hour than Louisville. At night we observe a
negligible number of trips, with Louisville showing slightly higher figures probably due to a
livelier nightlife. During weekdays we observe a high utilization during central hours of the day
(12:00 to 17:00) rather than during commuting hours. This drastically differs from what is commonly
observed for other shared transportation means like car sharing [15] where utilization peaks during
commuting time. Regarding weekends, Louisville confirms the higher utilization with about 30% more
trips than during weekdays. This result highlights the importance of a correct characterization of
the usage of different transportation means, which is fundamental to study system design
alternatives.
[Fig. 2: Average trips per hour in weekends (WE) and working days (WD). Series: Minneapolis WD,
Minneapolis WE, Louisville WD, Louisville WE. Horizontal axis: Hour of day (0 to 23). Vertical
axis: Average trips per hour.]
We now focus on the characterization of two important metrics: (i) trip duration and (ii) trip
distance covered. These metrics are fundamental to understand the e-scooter availability and
battery discharge properties. Fig. 3a reports the Empirical Cumulative Distribution Functions
(ECDFs) of the trip duration for each city during weekdays and weekends. The similarity in the
duration is striking, with both Minneapolis and Louisville trips lasting longer during the weekdays
than the weekends. Recall that the Louisville dataset exposes time duration with a minute
granularity, which causes the quantisation seen in the ECDF. Overall, trip duration is very short,
with the majority of the trips lasting less than 13 minutes. This is reflected in the trip
distance, as seen in Fig. 3b. Observe that almost 90% of the trips are shorter than 4 km, and more
than 60% are shorter than 2 km. These results confirm the typical usage of e-scooters [12], [13].
Notice also the different service area size of Minneapolis and Louisville, which allows for longer
trips in the former. Table I provides a summary of the data.
Considering the spatial characterization of the demand, we observe that most trips are confined to
a few relatively small neighborhoods. Fig. 4 shows heatmaps to intuitively gauge this effect. Here,
we divide the service areas of each city into 200 m x 200 m cells. Then we count the number of
trips originating in each cell during the three months. We use a decimal logarithmic scale. The
heatmap shows how concentrated trips are, with a few hotter cells (in red) that account for 4
orders of magnitude more trips than those cells with few trips (in blue).
Overall, this information is fundamental to generate a demand model to compare different system
designs.
IV. SYSTEM MODEL AND SIMULATOR
In this section, we first describe the spatio-temporal disaggregation methodology that we employ to
generalize the trips present in the open data. Second we detail how we use them
[Fig. 3: ECDFs for trips duration and distance in weekends (WE) and working days (WD). Panels:
(a) Trip duration, Duration [min]; (b) Trip distance, Distance [km]. Series: Minneapolis WE,
Minneapolis WD, Louisville WE, Louisville WD.]
possibility to smooth the point process of a trace over a multi-dimensional space while maintaining
the origin/destination correlation.
1) Time Modeling: We assume that the inter-arrival time of trips follows an exponential
distribution with a rate that depends on the type of day and the hour of the day. To account for
the highly periodical rate as seen in Fig. 1, here we distinguish between weekdays and weekends. We
consider 24 time bins of 1 h each (48 periods in total), where the Poisson arrival rate reflects
the average rate of requests in the original dataset. This allows us to scale the overall demand by
introducing a global scaling factor. Not reported here for brevity, we compare the number of trips
in the simulated and the disaggregated trace. As expected, there is a very good match (relative
percentage residuals for total trips between 0.6% and 1.3% for Louisville, 0.8% and 3.4% for
Minneapolis).
to generate our demand model. Finally we use the demand 2) Spatial Modeling: Given an hour and a day, we want to
model to determine the occurring trips and feed our mobility generate origin and destination of a request according to the
simulator. specific demand model as exhibited in the disaggregated trace.
For this, we leverage KDE to estimate the joint probability
A. Spatio-temporal disaggregation distribution of the origin and destination positions of a trip.
Assume we have a dataset D of trips recorded during Given our scenario, this is fundamental to further smooth our
a given period of time. Each trip i ∈ D is defined by discrete events.
a discrete start time as (i), i.e., with time rounded with a To ease the KDE computation and the simulation process,
granularity ∆T (of 15 or 30 minutes in our case). To provide we divide the whole city area into contiguous squared zones
an estimation of the time instant in which the trip started, of side 200 m and map the trips to this grid.
we assume a local stationary process, and simply extract a Then, for each of the 48 time bins we fit a separate
new timestamp ts (i) from a uniform distribution in range
KDE based on the origin-destination zone grid, obtaining a
∆T ∆T
as (i) − , as (i) + . This allows to get back to a four dimensional problem (2 coordinates for origin and 2
2 2
continuous-time trace of events. coordinates for destination). In this way, we obtain 48 models
Considering the spatial information, origin o(i) and des- summarising the spatial mobility habits of the users in time.
tination d(i) positions may be already associated to spatial Here, we consider a Gaussian kernel [5] and set the bandwidth
coordinates, albeit rounded. First, we compute the distribution matrix of the KDE to the 4 x 4 identity matrix. Given the 200 m
of distance between o(i) and d(i) which will be useful to x 200 m zoning, this corresponds to a bandwidth selection
generate trip distances later. Second, we obtain, for each of 200 m for each coordinate. On the one hand a smaller
(o(i), d(i)) pairs, the trip duration from the open data. bandwidth would not help us to generalize the demand. On the
Origin and destination information might be aggregated into other hand, a bigger bandwidth would reduce the granularity
different geometries oid (i) and did (i). We have to employ a of city zoning, leading to a reduced precision in incorporating
spatial disaggregation methodology to derive possible coor- spatial patterns.
dinates. In Minneapolis case, oid (i) and did (i) are segments In a nutshell, we use KDE as a spatial data smoothing
representing streets and we randomly select two coordinates tool, able to capture mobility patterns from the trips in the
along the entire street (with a uniform probability). We obtain disaggregated trace while reducing the impact of the original
thus a possible origin o(i) and destination d(i) coordinates for open data aggregation. This is also very effective to cope with
each trip i. the fine grained spatial quantisation that is needed to model the
At the end of this pre-processing step, we have a new dis- demand of e-scooter sharing systems. To show how effective
aggregated trace where each trip in the dataset is characterized this is, in Fig. 4 we report the demand in each zone before and
by its start time, and initial and final coordinates. after applying the smoothing procedure for Louisville (Fig. 4a)
and Minneapolis (Fig. 4b). To ease the readability we report
B. Demand model only the demand in the peak hour. Looking at the demand
The goal of the demand model is to generalize the trace before the smoothing, most of the trips are concentrated in a
generated from the original open data. For this, we model the few areas with large differences also between nearby cells -
demand in time by using modulated Poisson processes - a resulting in a very noisy picture. Most popular zones do not
common accepted model for i.i.d. service requests of a very change with the smoothing, but we observe a redistribution of
large population [23]. For space, we generalize the demand the requests among neighboring zones. In a nutshell, trips are
using Kernel Density Estimation (KDE) [5]. KDE gives us the no more concentrated in single cells but rather in larger areas.
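The temporal part of the pipeline (the timestamp disaggregation of Section IV-A and the time model of Section IV-B) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the way the number of observed days is passed in, and the fallback used when a bin has no recorded trips are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta


def disaggregate_start(rounded_start, delta_t, rng):
    """Replace a rounded start time a_s by a uniform draw in [a_s - dT/2, a_s + dT/2]."""
    half = delta_t.total_seconds() / 2.0
    return rounded_start + timedelta(seconds=rng.uniform(-half, half))


def bin_of(ts):
    """Return one of the 48 bins as (is_weekend, hour_of_day)."""
    return (ts.weekday() >= 5, ts.hour)


def hourly_rates(start_times, n_weekdays, n_weekend_days):
    """Average number of requests per hour in each (day type, hour) bin.

    n_weekdays / n_weekend_days: number of days of each type in the
    observation period (passed explicitly; an assumption of this sketch).
    """
    counts = defaultdict(int)
    for ts in start_times:
        counts[bin_of(ts)] += 1
    return {
        b: c / (n_weekend_days if b[0] else n_weekdays)
        for b, c in counts.items()
    }


def next_request_time(now, rates, rng, scale=1.0):
    """Schedule the next request at now + negexp(lambda(now)).

    scale is the global demand scaling factor mentioned in the text.
    """
    lam = scale * rates.get(bin_of(now), 0.0)  # requests per hour
    if lam <= 0.0:
        # Assumption: with no demand in this bin, jump to the next hour.
        return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return now + timedelta(hours=rng.expovariate(lam))
```

Because the rate is looked up at the current time only, the sketch mirrors the rule used by the simulator, which draws the next inter-arrival time from the rate in force when the previous request fires.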
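For the spatial model, drawing an origin/destination pair from a Gaussian KDE with identity bandwidth can be illustrated with the short sketch below. It relies on a standard property of kernel density estimates: sampling from the estimate is equivalent to picking one recorded trip uniformly at random and perturbing it with the kernel. The function names are hypothetical, and rounding the perturbed point back to the grid is an assumption of this example; one such model would be kept for each of the 48 time bins.

```python
import math
import random

ZONE_SIDE_M = 200.0  # side of a grid zone, as in the text


def to_zone(x_m, y_m):
    """Map metric coordinates to integer (column, row) zone indices."""
    return (math.floor(x_m / ZONE_SIDE_M), math.floor(y_m / ZONE_SIDE_M))


def sample_od(od_zones, rng):
    """Draw one (ox, oy, dx, dy) zone tuple from a Gaussian KDE fitted on od_zones.

    od_zones: list of recorded trips as (ox, oy, dx, dy) in zone units.
    With an identity bandwidth matrix, sampling from the KDE amounts to
    picking a recorded trip at random and adding independent standard
    normal noise (one zone, i.e. 200 m, of standard deviation) to each
    of its four coordinates. Perturbing the four coordinates of the same
    trip keeps the origin/destination correlation.
    """
    trip = rng.choice(od_zones)
    noisy = (rng.gauss(c, 1.0) for c in trip)
    # Assumption of this sketch: snap the sample back to the zone grid.
    return tuple(round(v) for v in noisy)
```

A real simulator would also need to reject or clip samples that fall outside the city grid, which is omitted here.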
[Fig. 4: Heatmap of the number of trips starting from each zone in a decimal logarithmic scale (the legend reports the exponent). The warmer the color, the higher the value. (a) Louisville: demand in open data (left) and in the peak hour model (right). (b) Minneapolis: demand in open data (left) and in the peak hour model (right). Map scale bar: 2 km.]
C. Mobility and charging simulator
Our goal is to simulate a fleet of e-scooters that move within the city. The simulator uses the demand model to generate mobility requests. During the simulation we track each e-scooter over time, saving information about its location and battery state.
We use an event-based simulator. The simulator has a set S of e-scooters. At any time t, each e-scooter s ∈ S is characterised by its location P(s, t) and state of charge c(s, t) ∈ [0, C], where C is the maximum battery capacity. As before, we use a 200 m x 200 m grid. At t = 0, e-scooters are placed at random proportionally to the spatial demand, with uniform random charge c(s, 0) ∈ [C/2, C].
The model generates the trip-request event i at time t_i according to the Poisson model. It extracts the origin and destination coordinates ô(i) and d̂(i) from the KDE, and associates the trip duration f̂(i) and distance l̂(i) according to the CDF extracted from the original open data. The latter allows us to compute the resulting energy consumption assuming simple proportionality, i.e., e(i) = k · l̂(i). We obtain k from the e-scooter characteristics. When the i-th trip-request event fires, the simulator checks if there is any available e-scooter s with enough battery, c(s, t_i) ≥ e(i), in the same zone or in its 1-hop neighbors (the 8 adjacent zones in the grid). This is equivalent to assuming that customers are willing to rent an e-scooter that is within the same zone or a neighboring zone, walking at most approximately 300 m to get it.
If at least one such e-scooter exists, the simulator picks s∗, the one having the highest c(s, t_i). It then schedules a trip-end event at time t_i + f̂(i). Otherwise, it marks the request as unsatisfied. In both cases, it schedules the next trip-request event at time t_i + negexp(λ(t_i)), λ(t_i) being the current request rate. When the j-th trip-end event fires at time t_j, the simulator picks the e-scooter s∗ used for this trip, updates its battery charge c(s∗, t_j) = c(s∗, t_j) − e(j), makes s∗ available in position d̂(j), and checks if a charging process is required. That is, it checks if c(s∗, t_j) < α · C, α ∈ [0, 1] being a threshold. If so, it triggers a charging event.
The charging operation can be performed either by the e-scooter provider through a battery swap operation, or by volunteers through a battery charging operation.
System battery swap: the e-scooter provider manages the charge events by means of a workforce of N worker-equivalents. Battery charge requests are modeled with a FIFO queue with N parallel servers, as follows:
• Charge request arrival: If there is a free server, the request gets service immediately. Otherwise, the request gets queued and waits to be processed by a worker.
• Service time: the battery swap entails two service operations: the Reach time, i.e., the time it takes the worker to physically reach the e-scooter; and the Swap time, i.e., the time it takes the worker to complete the battery swap operation.
We model the reach time and swap time as negative exponential distributions with averages T_reach and T_swap.
Volunteer charging: We model the possibility that volunteers may contribute to fleet energy management, as done by some companies that remunerate people to handle the charging of e-scooters. When a charge is needed, a volunteer may be found with probability w ∈ [0, 1], where w models people's willingness to contribute to the system. If found, we assume the volunteer brings the e-scooter home and plugs it in for charging. We assume the charging time to be a Gaussian random variable with average T_charge and standard deviation σ_charge. The charging time is a random variable as it includes the whole process of taking the e-scooter home, charging it, and bringing it back to the streets (to the same location as before, for simplicity).
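The trip-request and trip-end rules of the simulator can be illustrated with the following sketch. The `Scooter` class, the `busy` flag, and the function names are hypothetical and introduced only for this example; event scheduling itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Scooter:
    zone: tuple         # (column, row) on the 200 m grid
    charge: float       # state of charge in Wh
    busy: bool = False  # True while the scooter is on a trip or being charged


def pick_scooter(scooters, origin_zone, energy_needed):
    """Return the available scooter with the highest charge located in the
    origin zone or in one of its 8 neighbours, or None if the request
    cannot be satisfied."""
    ox, oy = origin_zone
    candidates = [
        s for s in scooters
        if not s.busy
        and s.charge >= energy_needed
        and abs(s.zone[0] - ox) <= 1
        and abs(s.zone[1] - oy) <= 1
    ]
    return max(candidates, key=lambda s: s.charge, default=None)


def end_trip(scooter, dest_zone, energy_used, alpha, capacity):
    """Apply the trip-end update; return True if a charging event is triggered."""
    scooter.charge -= energy_used
    scooter.zone = dest_zone
    scooter.busy = False
    return scooter.charge < alpha * capacity
```

For example, with `capacity = 425.0` and `alpha = 0.3`, a scooter ending a trip with less than 127.5 Wh left would trigger a charging event.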
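The battery swap service, a FIFO queue with N parallel workers and a service time made of an exponential reach time plus an exponential swap time, can be sketched as below. The function names are hypothetical, and treating a zero mean as a zero service time is a convenience of this example that matches the ideal scenario with T_reach = T_swap = 0.

```python
import heapq
import random


def exp_time(mean, rng):
    """Negative-exponential time with the given mean (0 if the mean is 0)."""
    return rng.expovariate(1.0 / mean) if mean > 0 else 0.0


def simulate_swaps(arrivals, n_workers, t_reach, t_swap, rng):
    """Return the completion time of each charge request.

    arrivals: request times in ascending order, in the same time unit
    as the mean reach time t_reach and the mean swap time t_swap.
    Requests are served in FIFO order by n_workers parallel workers.
    """
    free_at = [0.0] * n_workers  # min-heap: when each worker becomes free
    heapq.heapify(free_at)
    done = []
    for t in arrivals:
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(t, worker_free)           # the request waits if all are busy
        finish = start + exp_time(t_reach, rng) + exp_time(t_swap, rng)
        heapq.heappush(free_at, finish)
        done.append(finish)
    return done
```

Summing `finish - start` over all requests would give the total worker time spent on swaps, which corresponds to the Swap Time metric used to compare configurations.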
D. Performance metrics
To compare system performance and gauge the impact of parameters, we consider two fundamental metrics:
i) The Satisfied Demand measures the percentage of trip requests that can be satisfied due to the presence of e-scooters with enough energy in the trip origin zone.
ii) The Swap Time measures the total man-time needed to handle the battery swap operations.
The simulator also breaks down the unsatisfied demand to distinguish between i) no e-scooter being available, and ii) e-scooters not having enough energy to complete the trip request. Similarly, it maps events to the city map to observe the city areas where most of these events occur.
V. RESULTS
Here we present simulation results obtained starting from the original open data, from which we first generate a disaggregated trace, and then extract the trip request model as described above. We use the model to run simulations to gauge the impact of system design choices. In particular we study the impact of:
• |S|: the e-scooter fleet size;
• α: the battery threshold that triggers a charging operation;
• N: the provider workforce size;
• T_reach: the average time to reach the e-scooter;
• w: volunteers' willingness to handle charging.
We assume a homogeneous fleet of e-scooters having a C = 425 Wh battery capacity and k = 11 Wh/km energy efficiency, based on average characteristics present on the market.
A. Impact of fleet size
We first evaluate the impact of the fleet size on the satisfied demand. We consider w = 0 and N = |S|, i.e., the system takes care of the charging, with enough workers to immediately perform the battery swap. To consider an ideal scenario, we fix T_reach = T_swap = 0. We choose α = 0.2 for Louisville and α = 0.4 for Minneapolis, so as to guarantee their maximum distance trips.
We report results in Fig. 5a and Fig. 5b for Minneapolis and Louisville, respectively. They show the percentage of satisfied demand (left y-axis, blue curves) and the average monthly number of trips performed by each e-scooter (right y-axis, red curve). Fleet size varies around the currently available number of e-scooters: 2 000 in Minneapolis and 850 in Louisville.
The average number of monthly trips per e-scooter decreases with |S|. At the same time, the bigger the fleet size, the higher the probability of finding an e-scooter in the desired origin zone, and hence the higher the percentage of satisfied demand. For Minneapolis, the currently available 2 000 e-scooters can satisfy less than 50% of the demand. Notice the sub-linear growth, hinting that spatial heterogeneity calls for possible relocation policies. For Louisville, results are better, with 60% of satisfied trips with 850 e-scooters. Doubling the fleet size would increase the satisfied demand by only about 15%.
[Fig. 5: Percentage of satisfied demand and average number of trips per e-scooter per month. (a) Minneapolis. (b) Louisville.]
B. Impact of charging threshold
Next, we evaluate the impact of the battery threshold α that triggers charging events. On the one hand, the lower the α, the less frequently e-scooters need to be charged. On the other hand, if α is too low, we may cause users' discomfort and lose revenues, as the probability of finding an e-scooter with not enough energy would increase. If eventually taken, that e-scooter would run out of battery before reaching the desired destination. Here we set |S| = 2 000 for Minneapolis and |S| = 850 for Louisville. Again, we assume the ideal charging policy with T_reach = T_swap = w = 0 and N = |S|.
Fig. 6 reports the percentage of trips in which the user would run out of battery (left y-axis) and the percentage of trips that require charging at the end of the trip (right y-axis). The latter represents the charging cost for the system. Starting from this (red curve), observe how the cost increases linearly up to α around 0.5, after which it quickly grows to 100%. Indeed, when α approaches 1, every e-scooter needs to be charged at the end of each trip. Looking at the fraction of trips that would not have enough energy to be completed (blue curves), we observe a sudden growth for values of α approaching 0. That is, if we allow the e-scooter battery to reach a very low level, the probability of not completing the trips increases. Minneapolis shows the strongest impact, with 10% of the trips being impossible (for α = 0). Instead, Louisville exhibits a negligible fraction even for very low α. This is due to the shorter trip distances in Louisville than in Minneapolis (see Fig. 3b). These results clearly highlight a trade-off between impossible trips (and loss of revenues) and number of charging events (and costs). Our model and simulator allow one to explore this in detail.
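As a quick check of the thresholds used above, the distance that a scooter is still guaranteed to cover when its charge sits exactly at the threshold α · C follows from the proportional energy model e = k · l. The sketch below uses the values stated in the text (C = 425 Wh, k = 11 Wh/km); the function name is hypothetical.

```python
CAPACITY_WH = 425.0    # battery capacity C from the text
K_WH_PER_KM = 11.0     # energy efficiency k from the text


def guaranteed_range_km(alpha, capacity_wh=CAPACITY_WH, k_wh_per_km=K_WH_PER_KM):
    """Distance covered with the energy left at the charging threshold alpha * C."""
    return alpha * capacity_wh / k_wh_per_km


for alpha in (0.2, 0.4, 1.0):
    print(f"alpha = {alpha:.1f}: {guaranteed_range_km(alpha):.1f} km")
```

Under these values, α = 0.2 leaves about 7.7 km of range and α = 0.4 about 15.5 km, while a full battery gives about 38.6 km. This is consistent with choosing a larger α for Minneapolis, where trips are longer.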
[Fig. 6: Percentage of impossible trips caused by insufficient battery level (left scale) and number of needed battery swaps (right scale) by changing battery swap threshold α. (a) Minneapolis. (b) Louisville.]
C. Impact of charging policy
We evaluate the cost that the provider faces for the charge operations based on two different charging scenarios: with (w > 0) and without (w = 0) the users' cooperation. Here we fix T_swap = 5 min for the operator, and T_charge = 4 h and σ_charge = 30 min for the volunteers, the latter two based on the average time needed to charge an e-scooter with similar characteristics. Given the ease of the Louisville case with respect to Minneapolis seen in the previous sections, here we just report the case of Minneapolis, with α = 0.3. First, we evaluate the cost when the charging operations are performed only by the workers (w = 0). For this we run simulations with 2 000 e-scooters and evaluate how many workers are needed to satisfy as much demand as possible. We define a worker as an always available resource (24 hours a day) that performs only one battery swap operation at a time. Since we model the time to reach the e-scooter as a stochastic variable with average T_reach, we also evaluate its impact on the charging cost [footnote 5]. Intuitively, when few workers are present, or when T_reach is too high, the charging FIFO queue grows, causing e-scooters to be unavailable and decreasing the satisfied demand.
In Fig. 7 we evaluate the percentage of satisfied demand while increasing the number of workers simultaneously available in the system, with different values of T_reach. With a small T_reach (15 minutes), we can see how with 8 workers we reach the highest satisfied demand, as in the best case scenario (Fig. 5a). The increase of the reach time causes a drop in the satisfied demand down to 30% when T_reach = 60 minutes, even when 14 workers are present. This strong dependence of satisfied demand with respect to number of workers suggests employing strategies to reduce T_reach as much as possible. For example, each worker could be assigned to service a limited area of the city.
[Fig. 7: Minneapolis - percentage of satisfied demand, varying number of workers and average reach time.]
Finally, we evaluate how the users' help reduces the charging cost for the operator. For this, we consider the same scenario as before, choosing T_reach = 30 minutes and evaluating different users' willingness (w). Intuitively, the more volunteers help, the fewer workers are needed to perform battery swap operations. At the same time, due to the longer time for the charge operation by the user, i.e., 4 hours, other effects may appear, like a decrease in the satisfied demand due to several e-scooters being under charge at the same time.
In Fig. 8 we show the impact on the satisfied demand of changing users' willingness with different numbers of workers. As a reference we also include the curve with w = 0 (same as in Fig. 7). Although users' recharges are generally longer, there is a limited impact on the availability of scooters, and therefore on the satisfied demand. With a willingness w = 0.5 we can see how the number of workers needed to reach the maximum possible feasible trips halves from 12 in Fig. 7 to 6 in Fig. 8. With w = 1, the management of the batteries is completely taken care of by volunteers. Interestingly, the longer unavailability due to longer charging time has negligible impact on the satisfied demand.
[Fig. 8: Minneapolis - percentage of satisfied demand, varying number of workers and users' willingness (T_reach = 30 minutes).]
Footnote 5: Given our policy that only 1 battery swap operation is allowed per event, if two discarded e-scooters are close to each other we consider two reach time
In Fig. 9 we show the total time employed by workers to perform the battery swap on a daily basis. When w = 1, workers are not needed, hence the total average daily time is 0 hours. When w = 0, there are no volunteers, and the system needs up to 250 hours of cumulative daily work to reach the
Abstract: Vehicular Cloud computing is a new paradigm where vehicles collaboratively exchange data and resources to support services and problem-solving in urban environments. Characteristically, such Clouds undergo severely challenging conditions due to the high mobility of vehicles and, in essence, they are rather dynamic and complex. Many works have explored the assembling and management of Vehicular Clouds with designs that heavily focus on mobility. However, a mobility-based strategy relies on the geographical position of vehicles, and its feasibility has been questioned in some recent works. Therefore, we present a more relaxed Vehicular Cloud management scheme that relies on connectivity. This work models uncertainty and considers every possible chance a vehicle may be available through accessible communication means, such as V2X communications and the vehicle being in the range of RSUs for data transmissions. We utilize the MDP model to track the state of vehicles and when they are connected and available for transmission of the data.
Index Terms: VANETs, Connectivity, Mobility, VCC
I. INTRODUCTION
Vehicular Cloud computing (VCC) has been defined as a new paradigm where smart and connected vehicles are put together to form a mobile Cloud [1]. Recent technological advancements have drawn attention to creating VCCs. Such advancements allowed smart vehicles to contain high processing and storage capacity; these vehicles can also connect and interact among themselves and with the Internet through the use of vehicular networks. By creating a Cloud, these vehicles can collaboratively build a distributed system that extends their own individual capabilities.
The volume of vehicles in urban centers and their highly dynamic mobility position them as valuable resource providers for Smart Cities [1], [2]. Consequently, in the context of intelligent transportation systems, VCCs can potentially support urban computing to a wide extent, turning into a truly attractive but rather complex resource, data, and service management approach to explore. Numerous works have proposed and designed practical services and applications based on Vehicular Clouds [2]. Thus, enhancing how such Clouds are assembled and managed is highly rewarding and significantly impacts the community.
Coping with the high mobility of vehicles is the great challenge in VCC. Works in Vehicular Cloud Computing and Vehicular Networks rely heavily on precisely modeling the movement and position of vehicles and devices to provide proper resource management. The current management methods for Vehicular Clouds duly depend on mobility, proximity, and position of vehicles [2]. These methods require constant, and sometimes exhaustive, monitoring. This communication management scheme is very costly and prone to errors, impacting the assembling of Vehicular Clouds.
Besides high monitoring and control costs, these approaches face situations where there is minimal chance of communication because vehicles stay in range very briefly due to their high speed. Consequently, the short amount of time vehicles are near each other makes the delivery of Cloud services almost impractical. Scientific works [1] have extensively questioned the feasibility of Vehicular Clouds in highly mobile environments, such as urban centers. These works have demonstrated that the contact time of vehicles is much shorter than needed for providing any services and resources to a requester.
Besides, several approaches have explored connectivity in vehicular networks. Delay Tolerant Networks [3], for instance, have coped with low-density vehicle scenarios to guarantee some level of packet delivery. Many technologies also make use of vertical and horizontal handoffs [4], [5] to achieve better networking conditions. These methods explore connectivity opportunities, showing prospects that favorably support vehicles to be reachable.
Therefore, due to the drawbacks and challenges of existing mobility-based approaches, we propose an uncertainty-oriented connectivity model. The proposed model aims at discovering and mapping resources and content independently from the communication endpoints, acting similarly to Content-Centric Networks [6], and making use of mobility as a secondary parameter that just enables predictions. The approach provides a more relaxed and flexible method of resource discovery and indexing, increasing the opportunities for forming Clouds and finding content in a rather dynamic distributed environment.
The remainder of the paper is organized as follows. Section II provides an overview of works about configuration and management of VCCs. Section III describes the problem tackled in this work. Section IV presents our approach to enhance Vehicular Cloud management based on connectivity uncertainty. Section V describes the experimental evaluations and discusses obtained
This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).
978-1-7281-7343-6/20/$31.00 ©2020 IEEE
results. Finally, Section VI presents the conclusion and future work directions.
II. RELATED WORKS
In Intelligent Transportation, approaches have tackled resource management to a great extent by exploring challenges originating from the high mobility of vehicles [1]. In the majority of works, the core aspects consist of dealing with the unpredictability and highly dynamic communication topology changes. As an exception, there are approaches that do not emphasize mobility, namely the ones based on static parked vehicles [7]. However, the most common scenarios are dynamic, and works have already shown the non-feasibility of assembling Vehicular Clouds and cooperation while vehicles are moving on the traffic network [1]; the fast-changing scenario makes it unviable for vehicles to sustain plausible connections long enough to fulfill minimum sharing requirements. Even enabling platoon-oriented Vehicular Micro-Clouds that compose a distributed larger-scale Cloud cannot guarantee long-standing proximity and contact. Besides, these approaches rely on overwhelming control to assess the movement of nodes continually.
The design of methods to properly form and sustain Vehicular Clouds involves a range of areas, including the underlying networking protocols. Several works have already explored novel Content-centric Networks (CCN) to cope with dynamic vehicular environments [8], making use of an information-oriented paradigm to target the discovery of data more precisely. However, the support of CCN requires frequent updates and mobility prediction models to handle communication and connection inconsistencies. Predicting the position of vehicles, an extensively explored approach in vehicular environments, supported the definition of stability of the communication link between vehicles [9]. Equivalently, the link stability metric has been defined in other works [10] according to a quantification of wireless link stability based on the movement of vehicles in a rectilinear form. The relative distance, or speed, among vehicles characterizes stability. This particular work extends its definition by adding multi-hop estimation of stability, tackling the V2V communication scenarios, which is employed in a fuzzy-based selection system to match QoS service requests. In summary, mobility models support estimates that benefit communication in VANETs, even enabling approaches focused on the content instead of vehicles to better balance information supply and demand [11].
Undoubtedly, Vehicular Networks are prone to connection losses. Formal modeling and analysis of delay allow the handling of intermittent connectivity in Vehicular Networks in sparse scenarios [12]. Low RSU coverage results in connection losses, which are also influenced by the speed and density of vehicles in road segments. The cost of implementing an infrastructure to support V2I communication has motivated works to search for optimal placement of RSUs on a traffic network so that delay-sensitive applications are not compromised [13]. Effective handling of connection loss facilitates the implementation of many Cloud applications. With the proliferation of computing-capable devices and vehicles, mobile Crowdsourcing has been explored to enable pervasive Cloud services through outsourcing tasks [14]. The massive number of devices allows for offloading and environment sensing.
There is a firm assumption that Vehicular Clouds will inevitably benefit cooperation and sharing through significantly adaptable Cloud management. Recent works have promoted such Clouds by tackling diverse aspects, such as mobility, heterogeneity, and scheduling. Cluster-based Vehicular Cloud creation and coordination in highly dynamic urban scenarios aim at following platooning-inspired strategies [15], [16]. Based on this approach, the MDP-based modeling of vehicle mobility attempted to define enhanced resource allocation in Vehicular Clouds [17]. In the same perspective, a mobility model based on Artificial Neural Networks reduced the impact of sudden movement changes for more efficient resource allocation in Vehicular Clouds [18]. Cluster-based approaches support multi-edge computing [19] where vehicles take the role of service providers. Thus, vehicular gateways relay data through the network and play a fundamental role in spreading data; this approach then heavily relies on mobility to select gateways accurately, directly impacting the efficiency in accessing and disseminating data.
Formal models based on mobility traces undertake a more precise selection of Vehicular Cloud hosts to receive offloaded tasks, given their required completion times [20]. The diversity in urban environments has motivated SMDP-based modeling to represent the heterogeneous spectrum of computing capabilities of vehicles and RSUs to better assess the availability and offloading strategies [21]. Eventually, Cloud management culminates in resource allocation, so efficient task scheduling is capable of leveraging Fog Computing and providing additional resources to build up distributed Data Centers [22]. Similar to Mobile Cloud computing, heuristic-based placement and scheduling algorithms have been defined to offload tasks from Vehicular Clouds to the Cloud to relieve the burden of running numerous applications and sensory services in vehicles [23]. This work advocates the prospect that Vehicular Clouds will be vital for backing up IoT.
Several works have demonstrated the importance of mapping and modeling link stability and network connectivity for Vehicular Networks. Such works have attempted to enhance connectivity through heavily mobility-oriented approaches, where the position, speed, and acceleration of vehicles and the traffic network topology are substantial elements in their models. Complementarily, we have observed steep advancements in communication technologies; for instance, cognitive radio has become a key enabling technology of dynamic spectrum access to achieve better exploitation of radio spectrum [24], [25]. As a result, even though mobility is an important influencing factor for connectivity, urban environments are more comprehensive and contain many more opportunities for devices and vehicles to connect over multiple media, following V2V, V2I, and V2X communication methods. Given these existing and growing alternative and redundant network connections, we propose an approach that more closely matches the reality of urban centers and can take better advantage
1) State Space: Each vehicle may be connected through different means, including multiple means simultaneously. For instance, a vehicle may be connected to the "Internet" through other vehicles or through an RSU. Each communication means represents a connectivity link that enhances the reachability and network throughput of a vehicle. Each of these connectivity opportunities can be mapped into states, including the overlapping connection situations. In other words, the connection of a vehicle can transit among states, which allows us to produce all possible connection combinations in the most optimistic scenario over a given number of communication means, $n_c$. The number of combined states can then be cumulatively inferred as $2^{n_c} - 1$ ($\sum_{1 \le k \le n_c} \binom{n_c}{k}$); if we include the state where there is no connection at all, the model then has $2^{n_c}$ states. We can also assume that this set of connectivity states composes a fully connected directed graph, where there is always a transition leading from state $s_i$ to state $s_j$. In such a graph at its simplest, there are $(2^{n_c})^2$ edges.
For instance, in a real scenario, we envision that vehicles may be simultaneously connected with RSUs, other vehicles, and LTE towers ($n_c = 3$). Moreover, in the peculiar case of V2V connections, a vehicle observes a multihop connection of one or more hops. However, mapping all possible multihop combination scenarios in the model leads to exponential growth in this discrete State Space. Thus, for simplicity, we summarize the multihop V2V connection as a single state factor, in which the number of hops directly impacts connection stability and bandwidth; this impact can be either measured through end-to-end checks or estimated over time periods. As a result, this scenario gives us a set of 8 states with 64 possible distinct transitions.
Since we assume that the transition probabilities from any state sum to one, $p(X|x, a) = 1$, we can use equal transition probabilities as a starting model when no probabilities have been defined, i.e., as initial values when starting the model. This starting setup is described in Equation 1. Then, for instance, each individual vehicle can update these probabilities independently to express its current behavior accurately.
$$P_{x_i,x_j}(a) = \begin{bmatrix} 1/2^{n_c} & \cdots & 1/2^{n_c} \\ \vdots & \ddots & \vdots \\ 1/2^{n_c} & \cdots & 1/2^{n_c} \end{bmatrix} \qquad (1)$$
4) Rewards: In each model state, there is a reward that represents the connectivity quality of a vehicle. Thus, a one-step reward is defined as $r(x_i, a_i)$, denoting the reward of using action $a$ in state $x$. Quality can be a cumulative value originating from several factors, such as link stability and end-to-end bandwidth, where both might be projected over time. In the proposed approach, rewards directly indicate the best possible opportunities in terms of network reachability.
5) Policies: The Markov policy $\pi$ of this model is, by definition, stationary and randomized since $\pi_{n_c}$ does not depend on $n_{n_c}$, and the selection of the actions follows a probability distribution that generally produces a $p(\cdot|x_i, a_i)$. By definition, a policy corresponds to a sequence of transition probabilities $\pi_n(a_n|h_n)$ from an $n$-step history $H_n$ to $A$, in a given number of iterations $n \in \mathbb{N}$. Again, policy $\pi$ is characterized by a transition probability $\pi$ that relates $x \in X$ to $a \in A$, which we summarize as $\pi(A(x)|x) = 1$ for all $x \in X$.
From the model, we can obtain an optimal policy that corresponds to the "best" actions to maximize rewards in light of the conditioning transition probabilities. Therefore, this
2) Action Space: The actions related to gaining or loosing policy is employed as an indicator that optimistically estimates
connectivity as vehicles move around an urban area. As a the communication conditions with vehicles. All vehicles show
result, the actions are grouped in two sets, “connecting” or these optimistic estimates as a baseline that is used to classify
“disconnecting” from certain media. For instance, in a scenario them against service requirements.
where all possible communication media is through another 6) Transition Discounts: Discounts are applied to weigh
vehicle or an RSU, we have as actions the following: connect- down future steps/iterations in relation to the current state xi
ing to vehicle, connecting to RSU, disconnecting from vehicle, where the connection status is. To deal with expected total
and disconnecting from RSU. The uniqueness of the vehicular discounted reward, the discount is defined as a fixed value in
network scenario grants the representation of transition from this model as γ ∈ [0, 1[. We express the expected total reward
one state to another through a single action, such as the action over the first n steps, n ∈ N as in Equation 2.
“disconnecting from vehicle” leads the transition from the
"N #
X
π n
state “connected to RSU and vehicle” to “connected to RSU”. vN (x, π, γ) = Ex γ r(xn , an ) (2)
Assuming an nc number of media available, we will have an n=0
Action Space Ai of 2 ∗ nc possible actions. In other words, we define the value function v(X) as the
The action space is reduced when considering the restrictive sum of all predicted future rewards, implementing temporal
transition of communication conditions. Thus, given an state discounting with a gamma parameter where rewards are at k
xi of vehicle i, there is a set Ai (x) ⊆ Ai of available actions steps in the future are weighted by an exponential discount
that lead to states where a single transited state is xi ∈ Xi ; factor γ k . The value function turns in a weighted sum,
3) Transition Probabilities: Employing a randomized described in Equation 2.
model, for each each action Ai (x), there is a transition prob-
ability on x. These probabilities are assumed to represent the B. Connectivity Estimation
chances the connection status may change, being conditioned Assuming an infinite horizon when utilizing the reward
through a recent past. In the model, the transition probabilities function, described in Equation 2, we employ Bellman Equa-
may be assigned or adjusted through statistical analysis, such tion [26] to define an state value function in state x ∈ X for
as time series, following recorded past transitions. a stationary policy π, as described in Equation 3.
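As a rough illustration of the model components defined in this section, the sketch below enumerates the $2^{n_c}$ connectivity states and $2 n_c$ actions for $n_c = 3$, fills the uniform starting probabilities of Equation 1, and runs a greedy value-iteration search of the kind this subsection formalizes in Algorithm 1. It is a minimal Python sketch under stated assumptions, not the authors' implementation: the media names, the helper functions, and the reward (the number of connected media) are hypothetical, and the loop stops once the largest value update falls below a tolerance `eps`.

```python
from itertools import combinations

# Hypothetical media names; the example scenario uses RSUs, other vehicles, LTE towers.
MEDIA = ("RSU", "V2V", "LTE")


def build_states(media):
    """Every subset of media, including the empty 'no connection' state."""
    return [frozenset(c) for k in range(len(media) + 1)
            for c in combinations(media, k)]


def build_actions(media):
    """One 'connect' and one 'disconnect' action per medium: 2 * nc actions."""
    return [(kind, m) for kind in ("connect", "disconnect") for m in media]


def next_state(state, action):
    """State reached by an action, or None when the action is unavailable."""
    kind, medium = action
    if kind == "connect" and medium not in state:
        return state | {medium}
    if kind == "disconnect" and medium in state:
        return state - {medium}
    return None


def uniform_transitions(states):
    """Equation 1: every entry of the matrix starts at 1 / 2**nc."""
    p = 1.0 / len(states)
    return {(x, y): p for x in states for y in states}


def value_iteration(states, actions, prob, reward, gamma=0.9, eps=1e-6):
    """Greedy search; stops once the largest value update drops below eps."""
    value = {x: 0.0 for x in states}
    policy = {}
    while True:
        delta = 0.0
        for x in states:
            scores = {}
            for a in actions:
                xn = next_state(x, a)
                if xn is not None:  # only actions available in state x
                    scores[a] = prob[(x, xn)] * (reward[xn] + gamma * value[xn])
            best = max(scores, key=scores.get)
            delta = max(delta, abs(scores[best] - value[x]))
            value[x], policy[x] = scores[best], best
        if delta < eps:
            return value, policy


states = build_states(MEDIA)
actions = build_actions(MEDIA)
prob = uniform_transitions(states)
# Assumed reward: number of simultaneously connected media (not from the paper).
reward = {x: float(len(x)) for x in states}
value, policy = value_iteration(states, actions, prob, reward)
```

With three media this yields 8 states and 64 state pairs, matching the counts given for the example scenario; in practice each vehicle would replace the uniform probabilities and the assumed reward with values derived from its own observed connections.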
\[ V^{\pi}(x) = r(x, \pi(x)) + \gamma \sum_{y} p(y|x, \pi(x)) V^{\pi}(y) \tag{3} \]
Following the Principle of Optimality of Bellman [26], we establish an inductive greedy search process that takes an initial state and decision as its basis and identifies an optimal policy in the subsequent decisions/actions. Thus, the principle of optimality applied over the value function gives the optimal value function $V^* = \max_{\pi} V^{\pi}$, represented in Equation 4.
\[ V^*(x) = \max_{a \in A} \left[ r(x, a) + \gamma \sum_{y} p(y|x, a) V^*(y) \right] \tag{4} \]
In Equation 5, the optimal policy from a given state $x$ follows a similar representation of the optimal value function.
\[ \pi^*(x) = \arg\max_{a \in A} \left[ r(x, a) + \gamma \sum_{y} p(y|x, a) V^*(y) \right] \tag{5} \]
Equations 4 and 5 can then be translated to a value-iteration algorithm that iteratively searches for the best policy given transition and reward matrices. The value-iteration search is summarized in Algorithm 1. According to the algorithm, the search ends when an error $\epsilon$ is satisfied. Both the discount $\gamma$ and the error $\epsilon$ condition the convergence and the number of iterations $k$ for which the algorithm runs, $k = \frac{\log(r_{max}/\epsilon)}{\log(1/\gamma)}$. The processing to identify the current suitable pseudo-optimal MDP policy of our model is conducted for each vehicle.
Algorithm 1: Connectivity Status Estimation
Data: X_i; P_{a_i}; R; A_i; γ; ε
Result: π_i; v*
V = 0; π_i = 0;
do
    Δ = 0;
    for x ∈ X do
        Av = 0;
        for a ∈ A do
            x_n = A(x);
            Av[a] = P_a[x][x_n] * (R[x_n] + γ * V[x_n]);
        av_best = max(Av);
        Δ = max(Δ, |av_best − V[x]|);
        V[x] = av_best;
        π_i[x] = argmax(Av);
while Δ ≥ ε;
C. VC Management
The Vehicular Cloud Management then fundamentally employs the proposed model in order to differentiate vehicles based on the quality and stability of their point of attachment to the network. The management can follow many architectures, in which centralized, hierarchical, or fully distributed service orchestrators allocate resources dynamically. In our envisioned scenario, a set of VC Management Units serves as reference points to which nodes (mobile devices) pose service requests that are matched with surrounding (in terms of network) vehicles. The whole scheme works in 3 phases: monitoring, calculation, and advertisement. In a fully distributed fashion, vehicles monitor their own current connections and update their MDP transition and reward matrices individually. The optimal policy search algorithm can be executed on an event basis or periodically. In this implementation, we employed event-oriented triggering of the search whenever an update occurs on the matrices. The connectivity usually relates to a reference point, which is the point of attachment in this work. Thus, due to this assumption, this point of attachment, which can be an RSU, tower, or VC Management Unit, also becomes a reference for devices and other vehicles to contact service providers. Vehicles notify their respective management unit, which then classifies them. The unit finally matches incoming service requesters based on the ranking it updated.
V. PERFORMANCE ANALYSIS
We have conducted a simulation-based experimental analysis to evaluate the performance of the proposed connectivity-oriented model. The simulations aimed to represent realistic urban center scenarios, requiring simulation of the communication, as well as the mobility, of vehicles in a targeted intelligent transportation environment.
A. Scenario
The experimental scenario is completely built using Veins [27], with the support of OMNeT++ [28] and SUMO [29]. The simulator Veins contains general protocols and modeling capabilities of networking or wireless protocols for communication of the nodes. SUMO brings a microscopic mobility traffic simulation of the nodes in OMNeT++. Veins facilitates and keeps the consistency between these two latter simulators, providing the basis for modeling and simulating Vehicular Networks.
The whole ITS scenario stands on WAVE as the supporting V2V communication means. Vehicles can “connect” to the Internet by communicating through both V2V and V2I. Thus, RSUs are also present in the simulation scenario. Relying on the IEEE 802.11p standard, the V2V communication mode follows WSMP, in which OFDM guarantees different data rates. We thus assume proper, organized one-hop communication over multiple channels through the Control Channel (CCH).
1) Traffic Network Topology: For the sake of realism, we use a real-world urban scenario where vehicles present high-mobility displacement patterns. We use a slice of the Cologne metropolitan area as our urban center, which adopts a traffic simulation data set scenario for bringing more realistic mobility. The region represents a dense urban area of 1x1 km², as depicted in Figure 2. Most of the map follows a standard grid layout, but some segment clusters form non-conventional layouts. The urban map also shares parts with highways, allowing a wider range of speeds and mobility patterns.
2) Parameters: To delimit and constrain our experimental scenarios, we defined combinations of parameter settings. Table I summarizes all parameters used in our simulations. In simulations, the speed of vehicles ranges between 5 and 15 m/s to resemble usual urban centres. Such speeds thus condition a high dynamicity in our observed scenarios, where vehicles might change their connectivity statuses quite frequently in a short span of time. Also, our scenarios adopted
[Figures omitted: plots of Reachability (# of hops) and Success Rate]
as the number of vehicles could vary, possibly close to the RSU. As more vehicles are present in the scenario, their state could move from V2V to V2I, resulting in a higher ranking.
Lastly, the Success Rate of the connections is summarized in Figures 5a, 5b, and 5c. It is expected that the top set shows a higher success ratio than the bottom set. However, the bottom vehicles showed higher ratios. The reason behind a much lower success rate of the top group is our scenario’s high mobility. By the time contact messages are sent, vehicles have already left reach completely. Vehicles do not linger around the RSU, and they tend to leave; they either leave the simulated area (map), the range of the RSU, or the range of other vehicles. On the other hand, vehicles that were ranked low but moving towards the RSU were within its range when contacted.
VI. CONCLUSION
In this paper, we have proposed a new MDP-based model for estimating and representing the connectivity level in vehicular networks. The connectivity model optimistically identifies the “best” vehicle candidate in a scenario where the access to vehicles permeates the delivery of services and resources in a Vehicular Cloud fashion. We have evaluated the proposed model through simulations, which showed that the ranking properly represents the connectivity conditions of vehicles.
As future work, we will study connection intermittency in VANET scenarios where the quality of links is measured throughout the delivery of services. Frequency and availability time will be incorporated into the devised MDP-based uncertainty model so that they can more accurately characterize the connectivity patterns vehicles may run into while they move around urban centres or are in stationary scenarios. This extended model is intended to introduce a general representation that may be specified and self-adapt according to the recent behavior of VC nodes.
REFERENCES
[1] R. Florin and S. Olariu, “Toward approximating job completion time in vehicular clouds,” IEEE Transactions on ITS, vol. PP, pp. 1–10, 11 2018.
[2] A. Boukerche and R. E. De Grande, “Vehicular cloud computing: Architectures, applications, and mobility,” Elsevier Computer Networks, vol. 135, pp. 171–189, 2018.
[3] J. A. F. F. Dias, J. J. P. C. Rodrigues, F. Xia, and C. X. Mavromoustakis, “A cooperative watchdog system to detect misbehavior nodes in vehicular delay-tolerant networks,” IEEE Transactions on Industrial Electronics, vol. 62, no. 12, pp. 7929–7937, 2015.
[4] Y. Bi, H. Zhou, W. Xu, X. S. Shen, and H. Zhao, “An efficient pmipv6-based handoff scheme for urban vehicular networks,” IEEE Transactions on ITS, vol. 17, no. 12, pp. 3613–3628, 2016.
[5] P. Dhingra and P. C. Jain, “Cost-effective vertical handoff strategies in heterogeneous vehicular networks,” in Proc. of the Springer Int. Conf. on Advanced Computational and Communication Paradigms, 2018, pp. 369–377.
[6] A. K. Niari, R. Berangi, and M. Fathy, “Eccn: an extended ccn architecture to improve data access in vehicular content-centric network,” The Journal of Supercomputing, vol. 74, no. 1, pp. 205–221, 2018.
[7] F. H. Rahman, A. Y. M. Iqbal, S. H. S. Newaz, A. T. Wan, and M. S. Ahsan, “Street parked vehicles based vehicular fog computing: Tcp throughput evaluation and future research direction,” in 2019 21st Int. Conf. on Advanced Communication Technology, 2019, pp. 26–31.
[8] R. W. L. Coutinho, A. Boukerche, and X. Yu, “Information-centric strategies for content delivery in intelligent vehicular networks,” in Proc. of the ACM Symp. on Design and Analysis of Intelligent Vehicular Networks and Applications, 2018, pp. 21–26.
[9] A. Boukerche, R. W. L. Coutinho, and X. Yu, “Lisic: A link stability-based protocol for vehicular information-centric networks,” in Proc. of the IEEE Int. Conf. on Mobile Ad Hoc and Sensor Systems, 2017, pp. 233–240.
[10] N. Tamani, B. Brik, N. Lagraa, and Y. Ghamri-Doudane, “On link stability metric and fuzzy quantification for service selection in mobile vehicular cloud,” IEEE Transactions on ITS, pp. 1–13, 2019.
[11] C. Xu, W. Quan, A. V. Vasilakos, H. Zhang, and G.-M. Muntean, “Information-centric cost-efficient optimization for multimedia content delivery in mobile vehicular networks,” Elsevier Computer Communications, vol. 99, pp. 93–106, 2017.
[12] Y. Wang, J. Zheng, and N. Mitton, “Delivery delay analysis for roadside unit deployment in vehicular ad hoc networks with intermittent connectivity,” IEEE Transactions on Vehicular Technology, vol. 65, no. 10, pp. 8591–8602, 2016.
[13] S. Mehar, S. M. Senouci, A. Kies, and M. M. Zoulikha, “An optimized roadside units (rsu) placement for delay-sensitive applications in vehicular networks,” in Proc. of the Annual IEEE Consumer Communications and Networking Conference, 2015, pp. 121–127.
[14] J. Ren, Y. Zhang, K. Zhang, and X. Shen, “Exploiting mobile crowdsourcing for pervasive cloud services: challenges and solutions,” IEEE Comm. Magazine, vol. 53, no. 3, pp. 98–105, 2015.
[15] R. I. Meneguette, A. Boukerche, and R. E. De Grande, “SMART: an efficient resource search and management scheme for vehicular cloud-connected system,” in Proc. of IEEE GLOBECOM, 2016, pp. 1–6.
[16] R. I. Meneguette and A. Boukerche, “Servites: An efficient search and allocation resource protocol based on V2V communication for vehicular cloud,” Elsevier Computer Networks, vol. 123, pp. 104–118, 2017.
[17] R. I. Meneguette, A. Boukerche, A. H. M. Pimenta, and M. Meneguette, “A resource allocation scheme based on semi-markov decision process for dynamic vehicular clouds,” in Proc. of the IEEE Int. Conf. on Communications, 2017, pp. 1–6.
[18] A. M. Mustafa, O. M. Abubakr, O. Ahmadien, A. Ahmedin, and B. Mokhtar, “Mobility prediction for efficient resources management in vehicular cloud computing,” in Proc. of the IEEE Int. Conf. on Mobile Cloud Computing, Services, and Engineering, 2017, pp. 53–59.
[19] I. Jabri, T. Mekki, A. Rachedi, and M. B. Jemaa, “Vehicular fog gateways selection on the internet of vehicles: A fuzzy logic with ant colony optimization based approach,” Ad Hoc Networks, p. 101879, 2019.
[20] F. Zhang, R. E. De Grande, and A. Boukerche, “Macroscopic interval-split free-flow model for vehicular cloud computing,” in Proc. of the IEEE/ACM Int. Symp. on Distributed Simulation and Real Time Applications, 2017, pp. 1–8.
[21] C. Lin, D. Deng, and C. Yao, “Resource allocation in vehicular cloud computing systems with heterogeneous vehicles and roadside units,” IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3692–3700, 2018.
[22] M. Sookhak, F. R. Yu, Y. He, H. Talebian, N. S. Safa, N. Zhao, M. K. Khan, and N. Kumar, “Fog vehicular computing: Augmentation of fog computing using vehicular cloud computing,” IEEE Vehicular Technology Magazine, vol. 12, no. 3, pp. 55–64, 2017.
[23] A. Ashok, P. Steenkiste, and F. Bai, “Vehicular cloud computing through dynamic computation offloading,” Elsevier Computer Communications, vol. 120, pp. 125–137, 2018.
[24] K. K. Ghanshala, S. Sharma, S. Mohan, L. Nautiyal, P. Mishra, and R. C. Joshi, “Self-organizing sustainable spectrum management methodology in cognitive radio vehicular adhoc network (cravenet) environment: A reinforcement learning approach,” in 2018 First Int. Conf. on Secure Cyber Computing and Communication, 2018, pp. 168–172.
[25] Y. He, F. R. Yu, Z. Wei, and V. Leung, “Trust management for secure cognitive radio vehicular ad hoc networks,” Ad Hoc Networks, vol. 86, pp. 154–165, 2019.
[26] C. Sammut and G. I. Webb, Eds., Bellman Equation, Boston, MA, 2010, pp. 97–97.
[27] C. Sommer, R. German, and F. Dressler, “Bidirectionally Coupled Network and Road Traffic Simulation for Improved IVC Analysis,” IEEE Transactions on Mobile Computing, vol. 10, no. 1, pp. 3–15, 1 2011.
[28] A. Varga and R. Hornig, “An overview of the omnet++ simulation environment,” in Proc. of the Int. Conf. on Simulation Tools and Techniques for Communications, Networks and Systems & Workshops, 2008, pp. 60:1–60:10.
[29] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using sumo,” in Proc. of the 21st IEEE Int. Conf. on Intelligent Transportation Systems, 2018.
978-1-7281-7343-6/20/$31.00 ©2020 IEEE
Abstract—In modern telecommunication systems, mobility is one of the key advantages of wireless communications, given that it is possible to transmit/receive data without needing a static position in the network. Of course, mobility poses special issues such as degradations, channel quality fluctuations, fast topology changes, and so on. Modern research focuses on predicting the future positions of mobile nodes, in order to know a priori, for example, what the evolution of the network topology will be or which level of stability each node will reach. Each prediction scheme is based on the storage and analysis of several historical mobility trajectories, in order to train the proper prediction algorithm. In this paper, we focus our attention on the optimization of the space needed to store historical mobility samples, encoding their values and evaluating the conversion error, comparing different encoding functions. Several simulation campaigns have been carried out in order to evaluate the goodness and feasibility of our proposal.
Index Terms—Mobile Networking, Mobility, Prediction, Training, Pairing functions, Sampling.
I. INTRODUCTION
In recent years, data analysis for prediction purposes has been one of the main research activities carried out in a very wide variety of scientific communities [1–5, 12, 13]: the general term used to indicate this kind of activity is data analytics. To this aim, it is very important to collect data from real-world processes (node mobility, financial tendencies, network performance, urban planning, epidemic control, location-based services, intelligent transportation management, etc.), to analyse them from a statistical/stochastic point of view, and to implement a predictive algorithm able to forecast future values of the observed process. One of the main issues in this kind of approach is the creation of historical log-files and their storage in digital formats. In this paper we are interested in studying and analysing the behavior of mobile nodes and, in particular, in proposing a new way of storing their sampled values, in order to encode them and save storage space. In fact, mobility is generally studied by considering the processes related to the single coordinates (expressed in Cartesian terms or GPS values) separately [15, 16]: this implies an uncorrelated study of the trajectories and a waste of storage space, due to the need to create historical trace-files (or log-files).
We base our approach on Pairing Functions (PFs) [6–10], applying them to the set of collected coordinates. In fact, if we assume that the location of a moving node is represented by its spatial coordinates (such as a couple of values in a 2D space, or a triplet in a 3D space), then a new way to represent them with a single value can be proposed: each prediction can be made by considering only one evaluation of the next position, without taking into consideration the variations of the single coordinates separately.
We evaluate different pairing functions and, for all of them, we consider the magnitude of the encoded samples (codomain) and the error committed in decoding back (unpairing) the original values. PFs have been widely used in information security systems [11, 14]; in this work, instead, we apply them to a completely different research topic. Without loss of generality, we carry out our analysis on mobility records composed only of GPS coordinates (or some representations of them), without considering any additional feature (such as social relations, points of interest, location-based social networks, etc.). In this way, the proposed approach is completely general and can be enhanced, if needed, by considering the relevant issue, based on the considered scenario.
The paper is organized as follows: Section II introduces some recent works about mobility and trajectory analysis, while Section III gives a deeper description of the proposed idea from a theoretical point of view. Section IV gives a detailed description of the main results, and Section V concludes the paper.
II. STATE OF THE ART
Mobility analysis has been an extensive research activity in the last decades. The a priori knowledge of the future positions of mobile hosts attracts the attention of many researchers, who are interested in planning their systems on the basis of the future trend of the considered process. The procedure is always the same: a) historical data collection; b) data analysis; c) implementation of a predictive model; d) training and prediction. For example, in [18] a novel prediction scheme is proposed, based on the management of smartphone data (location, schedule, e-mail information, etc.); the authors, after illustrating the importance of big data management, show the way the data over more than
one year has been collected and, based on it, they demonstrate that the proposed scheme can predict user location precisely, giving mobile users some enhanced services (about location, torrential rain, train delays, traffic jams, etc.). The work in [19] is strictly related to location prediction, Points of Social Interactions (PSIs) and Points of Interest (POIs). The authors compared the last two aspects together with a two-step PSI model and a two-stage POI clustering approach, to reduce the effects of randomness and to improve the overall performance of the prediction scheme. The paper illustrates several results, from which it can be understood how the PSI approach outperforms other predictive algorithms. In [20], the authors give a recent overview of different methods and approaches for predicting mobile trajectories, basing the choice of next places on mobility data. The paper, after an interesting introduction, describes the basic concepts of location prediction, including the different sources of trajectory data, the general prediction framework, challenges in location prediction, and common trajectory data preprocessing methods. The authors of [22] underline the importance of analysing human mobility, as well as implementing a predictive approach. The work underlines, at the same time, the heterogeneity of mobile nodes, because nowadays they consist of handheld terminals, GPS, vehicular nodes, sensors, social media based nodes, etc. The authors survey several approaches for characterizing human mobility patterns at the individual, collective, and hybrid levels. In [21], the authors face the issue of sparse individual trajectory data, which often results in a high prediction error. The proposed scheme is called Individual Trajectory-Group Trajectory (ITGT), and it is based on the pattern created by group travels. Different stages are considered, starting from a stay point extraction with spatial clustering, and different Markov models (PPM and PST) are then exploited to predict the clustering link. A massive amount of real data points has been used, and the obtained results confirmed the authors' expectations, with an accuracy of almost 90%.
To the best of our knowledge, and from the reading of the most recent papers on mobility prediction (such as the ones described before), no works focus on pairing and unpairing mobility coordinates in order to simplify the implementation of a predictive approach. So, our main contribution consists in proposing a novel approach able to simplify the representation of mobility samples and to reduce the time complexity of the predictive algorithms. In the next section the main idea is illustrated.
III. PAIRING FUNCTIONS, MOBILITY SAMPLES PAIRING AND UNPAIRING
In this section, the proposed idea is described. First of all, the concept of pairing function is described; then it is applied to mobility in dynamic networks. We focus our attention on 2D mobility samples (but the proposal can be easily extended to a 3D environment), proposing a way to encode a sequence of m samples (each sample is represented by a couple of coordinates x and y) into only one value: in this way, only one trace needs to be analyzed, only one prediction needs to be made and, once the next value is obtained, it can be decoded back into the m original components. To do this, we base our approach on Pairing Functions (PFs) [6, 7]. Now the concept of PF is briefly explained and, then, some PFs are introduced for encoding the content of the trace files.
A Pairing Function (PF) is defined on a particular domain Dom and it encodes each couple (pair) of elements from Dom to a single element of Dom: any two distinct couples will be represented with two distinct elements (in this way it is possible to decode back the original values when needed). A PF is generally indicated as a function $pf : Dom^m \to Dom$, and PFs are used in a wide variety of applications (renderers, shaders, theoretical computer science, etc.). We will indicate with $pf^{-1} : Dom \to Dom^m$ the inverse PF function to decode back the m values (it is also called unpairing function). Many PFs have been defined in literature [8]: their study and evaluation are out of the scope of this paper, while the main aim of this sub-section consists in the application of some PFs to encode mobility traces. Cantor’s PF is the most known [8], defined as a bijection $\mathbb{N}^2 \to \mathbb{N}$:
\[ pf_{Cantor}(x, y) = \frac{(x + y) \cdot (x + y + 1)}{2} + y \tag{1} \]
but it has been demonstrated that it has some limitations in terms of value packing efficiency. For example, if we set x = 9 and y = 7 we would expect to obtain a maximum of 80 as a result (given that two digits 0-9 and 0-7 can create only 80 combinations), but $pf_{Cantor}(9, 7) = 143$, with an efficiency of only 56%. This result can be drastically improved by Szudzik’s PF, also defined as a bijection $\mathbb{N}^2 \to \mathbb{N}$ (it is also known as the Elegant Pairing Function):
\[ pf_{Szudzik}(x, y) = \begin{cases} x + y^2 & x < y \\ x^2 + x + y & x \ge y \end{cases} \tag{2} \]
with $pf_{Szudzik}(9, 7) = 97$ (efficiency is increased to 82.4%).
For the PFs in eq. 1 and eq. 2 the unpairing functions are defined as follows:
\[ pf^{-1}_{Cantor}(a) = \begin{cases} x = \frac{i \cdot (3 + i)}{2} - a \\ y = a - \frac{i \cdot (i + 1)}{2} \end{cases} \tag{3} \]
where
\[ i = \left\lfloor \frac{-1 + \sqrt{1 + 8a}}{2} \right\rfloor \tag{4} \]
and
\[ pf^{-1}_{Szudzik}(a) = \begin{cases} x = a - \lfloor \sqrt{a} \rfloor^2 \\ y = \lfloor \sqrt{a} \rfloor \end{cases} \tag{5} \]
if $a - \lfloor \sqrt{a} \rfloor^2 < \lfloor \sqrt{a} \rfloor$ (that is, if x < y), or
\[ pf^{-1}_{Szudzik}(a) = \begin{cases} x = \lfloor \sqrt{a} \rfloor \\ y = a - \lfloor \sqrt{a} \rfloor^2 - \lfloor \sqrt{a} \rfloor \end{cases} \tag{6} \]
otherwise.
Once the PFs are defined, we have to verify if and how they can be adapted to our scopes (mainly the encoding of mobility samples).
To this aim, let us consider a generic trace-file tf, containing mobility values belonging to $\mathbb{R}^2$ (the approach can be easily generalized if mobility coordinates belong to $\mathbb{R}^3$ or, in general, to $\mathbb{R}^m$). That is to say, we consider historical mobility values stored as couples (in the case of a 3D space they are stored as triplets). Let us define $x_n(t)$ as the value of the x coordinate of user n at time t. The same definition can be given for the y coordinate and for $y_n(t)$. They are continuous functions of time. We assume that the mobile network (or, directly, the mobile node n) is able to store $x_n(t)$ and $y_n(t)$ every T seconds (sampling period): we indicate with $X_n(kT)$ and $Y_n(kT)$ the discretized versions of $x_n(t)$ and $y_n(t)$, where k is a non-negative integer (for k = 0 the sampling operation is started). In this sense, the terms $X_n$ and $Y_n$ can be considered as random variables, both defined on the space $\Omega \equiv \mathbb{R}$. After the collection of mobility samples, the vectors $\vec{X}_n(T)$ and $\vec{Y}_n(T)$ are obtained, with $||\vec{X}_n(T)|| = ||\vec{Y}_n(T)|| = N$. Clearly, as said before, it is also possible to extend the analysis to the third variable z. Most of the existing works do not account for the intrinsic correlation between the spatial components of a 2D space (we consider the dimensions up to $\mathbb{R}^2$). In this paper, instead, we propose to study the mobility coordinates by also considering their intrinsic relationship, thus encoding them in one value at each sampling period. In fact, a node moves by respecting the environmental constraints; so, analyzing each individual coordinate process independently from the other ones leads to the definition of models which may lose some precious information. Clearly, all the equations defined before are still valid for all the other mobility components. In the next section, some numerical results are obtained, showing the results which can be reached by applying PFs to mobility samples.
IV. SIMULATION RESULTS AND ANALYSIS
To test and verify the concepts illustrated before, a MATLAB testbed has been set up. Different functions have been defined for analyzing and characterizing the downloaded data in terms of mobility samples. In particular, we considered the datasets in [26], consisting of human mobility traces in GPS format from five different sites: two university campuses (NCSU and KAIST), New York City, Disney World (Orlando), and North Carolina Raleigh (during the state fair event). We referred to the previous traces, for which an observation window of 30s has been considered, that is to say, each sample collection activity has a global duration of 30s. For
• Mobility trace values can be quantized: given a geographical region in which nodes are moving, it is easy to derive the minimum and maximum extension of $x_n(t)$ and $y_n(t)$ and, after the sampling operation, values can be quantized, after setting a proper resolution and assigning integer values;
• Samples approximation: depending on the used mobility format (GPS, planar coordinates, etc.), the decimal part can be neglected, or any value can be transformed into an integer one;
Mobility traces often contain negative values, which do not belong to $\mathbb{N}$; they can be converted into integer ones, but no operations can be made on the sign. Also in this case, we have some solutions:
• x and y values can be translated to move the origin of the reference system;
• PF functions can be transformed to account for negative integers.
In particular, negative numbers are taken into account by applying the following transformation before the evaluation of the chosen pf:
\[ c = \begin{cases} -2x - 1 & x < 0 \\ 2x & x \ge 0 \end{cases} \tag{7} \]
\[ d = \begin{cases} -2y - 1 & y < 0 \\ 2y & y \ge 0 \end{cases} \tag{8} \]
and evaluating $pf(c, d)$.
In this paper we are considering Cantor’s function because it is the most known in literature, while Szudzik’s function is one of the most efficient PFs. As said in the previous section, 3D mobility environments [17] can also be considered: given x, y and z coordinates, we can evaluate $x' = pf(x, y)$ and $y' = pf(x', z)$, so the stored value will be only $y'$ (in literature, this is also defined recursively by writing $pf[x, pf(y, z)]$). In order to apply the concepts related to PFs as in the previous section, the values of the trace-files should belong to $\mathbb{N}$ (or to $\mathbb{Z}$ if negative values are taken into account). In general, the downloaded files contain real values, so we decided to transform them into integer values by finding a proper multiplying factor; in this way, only integer values have been considered, while negative numbers have been handled through equations 7 and 8.
In the case of pedestrian traces [27], each row of the trace files is simply formatted as ID, time (24h format), latitude, longitude and $T_{pedestrian} = 50$ ms. Figures 1, 2 and 3 illustrate some examples of pedestrian patterns for the NCSU, Disney and KAIST scenarios respectively (a total of
the KAIST traces, the average number of samples N ∗ is 1608, 100 samples for each trace): in the upper part, the trends of x
for the NCSU traces N ∗ =1431, for the NY traces N ∗ =1600, and y in function of the discrete time k are shown (T = 30s).
for the Orlando, traces N ∗ =1284 and for the North Carolina The complete pattern in a 2D space can be observed at the
traces N ∗ =415. Given that PFs consider only the N as domain bottom of the figures. If we apply Cantor’s and Szudzik’s
and codomain, mobility samples need to be transformed (PFs pairing functions to the above illustrated mobility patterns we
are polynomial functions, and no continuous bijections are obtain the trends illustrated in figures 4, 5 and 6 respectively.
possible for R2 and R [10]). In particular:
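The transformations described in this section (integer scaling, the sign mapping of equations (7) and (8), pairing, and the inverse unpairing step) can be sketched as follows. This is only an illustrative Python fragment, not the MATLAB testbed used for the experiments. The function names and the `scale` parameter are assumptions introduced here, and the Cantor and Szudzik formulas are the standard textbook forms, which may differ in argument order from the definitions given earlier in the paper.

```python
from math import isqrt


def fold(v: int) -> int:
    """Sign mapping of equations (7)/(8): signed integer -> natural number."""
    return -2 * v - 1 if v < 0 else 2 * v


def unfold(n: int) -> int:
    """Inverse of fold: odd values came from negatives, even ones from non-negatives."""
    return -((n + 1) // 2) if n % 2 else n // 2


def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing of two naturals (standard form, assumed here)."""
    return (a + b) * (a + b + 1) // 2 + b


def cantor_unpair(p: int) -> tuple[int, int]:
    """Recover (a, b) from a Cantor-paired value."""
    w = (isqrt(8 * p + 1) - 1) // 2  # index of the diagonal that contains p
    b = p - w * (w + 1) // 2
    return w - b, b


def szudzik_pair(a: int, b: int) -> int:
    """Szudzik pairing of two naturals (standard form, assumed here)."""
    return a * a + a + b if a >= b else b * b + a


def szudzik_unpair(p: int) -> tuple[int, int]:
    """Recover (a, b) from a Szudzik-paired value."""
    s = isqrt(p)
    r = p - s * s
    return (r, s) if r < s else (s, r - s)


def encode(x: float, y: float, scale: int, pair=cantor_pair) -> int:
    """Scale real coordinates to integers, fold the sign, then pair.

    `scale` stands in for the multiplying factor; its value is hypothetical.
    """
    return pair(fold(round(x * scale)), fold(round(y * scale)))


def decode(p: int, scale: int, unpair=cantor_unpair) -> tuple[float, float]:
    """Unpair, undo the sign mapping, and rescale back to real coordinates."""
    c, d = unpair(p)
    return unfold(c) / scale, unfold(d) / scale
```

For a 3D sample, the same functions can be nested as described in the text, for example `cantor_pair(cantor_pair(c, d), e)` with `e` the folded z value. Note that the paired value grows roughly quadratically with the folded coordinates, so a large multiplying factor quickly produces very large integers.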
Figure 10. Unpaired predicted y samples for the NCSU trace (last 70 samples).
Figure 11. Unpaired predicted x samples for the Disney trace (last 70 samples).

how the predicted samples are very close to the original paired curve. At this point, the predicted
values have been unpaired and the obtained results have been compared with the original values.
Figures 9, 10, 11 and 12 show the goodness of the prediction and unpairing operations for the last
70 samples of each trace. The result mostly depends on the predictor accuracy, but it can be seen
that the obtained values are very close to the original trend (solid curve).

V. CONCLUSION AND FUTURE WORKS

This paper discusses the possibility of analyzing node mobility from a different point of view: in
particular, when prediction operations need to be made, mobility can be represented in a different
way through pairing functions, which are able to represent the different coordinates (x, y and z)
with only one value. Two pairing functions have been described and exploited to see how mobility
can be represented. Paired values have been obtained and predicted, and the approximation in
reconstructing the original traces has also been illustrated, showing the feasibility of the
proposed idea.

ACKNOWLEDGMENT

This work was supported by the Czech Ministry of Education, Youth and Sports from the Large
Infrastructures for Research, Experimental Development and Innovations project, IT4Innovations
National Supercomputing Center LM2015070, and partly by the institutional grant SGS reg. no.
SP2020/65 conducted at VSB - Technical University of Ostrava.
List of Authors
Author Page(s) Author Page(s)
Aznar, Pablo 175 Horiguchi, Tatsuya 49
Azumi, Takuya 49 Iacovelli, Giovanni 190
Baron, Wojciech 115 Iaffaldano, Giuseppe 151
Boccadoro, Pietro 190 Ianni, Mauro 59
Boudjadar, Jalil 163 Igarashi, Shingo 49
Boukerche, Azzedine 84 Ishigooka, Tasuku 49
Bousbaa, Fatima 182 Jiménez-Bravo, Diego 37
Bout, Emilie 146 Kerrache, Chaker Abdelaziz 182
Braem, Bart 37 Khooban, Mohammad Hassan 163
Buchholz, Peter 41 Koike, Ryotaro 49
Cakmak, Hueseyin 16 Kuehnapfel, Uwe 16
Calafate, Carlos 175 Kyesswa, Michael 16
Campolo, Claudia 1 Lagraa, Nasreddine 182
Cano, Juan-Carlos 175 Lakas, Abderrahmane 182
Cheriguene, Youssra 182 Lalis, Spyros 198
Cicirelli, Franco 107 155 Loscrí, Valeria 146
Ciociola, Alessandro 206 Manzoni, Pietro 175
Cocca, Michele 206 Marcellan, Anna 25
De Grande, Robson 214 Marfia, Gustavo 142
De Rango, Floriano 77 Marilleau, Nicolas 100
Diallo, Moussa 100 Marotta, Romolo 59
Dias, João Pedro 92 Marquez-Barja, Johann 37
Djanatliev, Anatoli 115 Masala Mutombo, Pierre 37
Djellikh, Soumia 182 Maurya, Avinash 167
Donatiello, Lorenzo 142 Mehic, Miralem 222
Drasar, Martin 7 Mellia, Marco 206
Erdmann, Anselm 25 Molinaro, Antonella 1
Fabra, Francisco 175 Moskal, Stephen 7
Falcone, Alberto 137 Müller, Dirk 25
Fazio, Peppino 222 Mussini, Marco 151
Ferreira, Hugo 92 Nevigato, Nicolas 77
Gallais, Antoine 146 Ngom, Bassirou 100
Garro, Alfredo 137 Nicassio, Francesco 151
Gasparini, Lorenzo 142 Nicolae, Bogdan 167
Genovese, Giacomo 1 Nigro, Libero 107
Gentile, Antonio 155 Park, Sung woon 84
Giordano, Danilo 206 Parladori, Giorgio 151
Greco, Emilio 155 Partila, Pavol 222
Grewing, Christian 133 Pellegrini, Alessandro 59 68
Grieco, Giovanni 151 Piccione, Andrea 68
Grieco, Luigi Alfredo 190 Piro, Giuseppe 151
Grigoropoulos, Nasos 198 Pizzimenti, Bruno 1
Guan, Shichao 84 Potuzak, Tomas 123
Guerrieri, Antonio 155 Puzicha, Alexander 41
Guliani, Ishan 167 Quaglia, Francesco 59
Gütlein, Moritz 115 Rab, Maryan 59
Hagenmeyer, Veit 16 25 Rafique, M. Mustafa 167
Henke, Martin 25 Renner, Christopher 115
Hering, Dominik 25 Restivo, André 92
Robens, Markus 133 Tovarek, Jaromir 222
Saad, Abubakar 214 Triggiani, Francesco 151
Serianni, Abdon 33 Tropea, Mauro 33 77
Shah, Awais 151 Ulbrich, Carolin 25
Schiek, Michael 133 van Waasen, Stefan 133
Schlatmann, Rutger 25 Vassio, Luca 206
Schmurr, Philipp 16 Vinci, Andrea 155
Spezzano, Giandomenico 155 Voznak, Miroslav 222
Suriyah, Michael 25 Wubben, Jamie 175
Suslov, Sergey 133 Xhonneux, André 25
Tahari, Abdou El Karim 182 Yang, Shanchieh 7
Torres, Diogo 92 Zaťko, Pavol 7