Simulation of A Big Number of Microservices in A Highly Distributed Vast Network
Simulation of A Big Number of Microservices in A Highly Distributed Vast Network
Simulation of A Big Number of Microservices in A Highly Distributed Vast Network
vast network
1
execution environment where managing them allows for an in-
depth control and their single function can be reused. The
complexity of the system based on microservices is not removed,
on the contrary – it can be even bigger as the inter-connection
problems may arise, but the profit related to more elastic and
automated environment is usually bigger than the complexity
costs. From the deployment point of view microservices are often
realized as virtual machines or linux containers. Figure 1 Model of service with internal logic and external
Authors of this paper authors propose an approach in which the stimuli: timeouts, incoming messages and manual
intervention
algorithms in numerous and vast networks of services are verified.
The construction of a large-scale network without proper analysis algorithms and architecture solutions are capable of handling such
is fraught with the risk that known algorithms, used in the large-scale systems and therefore if there is a need to develop any
environment with hundreds of thousands of services, have new ones.
unexpected and unwanted side-effects. Testing such systems in a
simulator is much more cost-effective than building adequate real The exiting solutions can be verified for the presented challenges
ones. With an accurate simulation one may show possible (e.g. service discovery) theoretically or by experiment. The
problems with verified algorithms and can help to eliminate them theoretical verification of a numerous and complex networks is
or greatly reduce by fine tuning its parameters. difficult. Experiments can be performed on a real system or in a
simulator. Building of a real system can be too expensive,
In the second section authors address some issues in the large- therefore the simulation of such a system was realized through the
scale microservice solutions and present an approach to resolve selection and testing of the mentioned algorithms.
the problems. An idea of a simulator with its features is
introduced in the third section. The results of two experiments The existing, suitable simulators can be divided into two groups:
with large scale traffic scenarios are presented in the fourth network simulators and cloud simulators. Network simulators,
section. The fifth section concludes the work and shows potential surveyed in [5], are focused mainly on the network behavior like
fields for improvements. detailed network protocols simulations, finding best
communication path and routing scenarios. They can be combined
2 MOTIVATION with the real software and hardware parts to make the simulation
Microservices require an efficient communication between them process more like real behavior. However, they are usually
in terms of speed and latency. The communication is more capable of simulating a limited number of nodes which is not
complex than in the monolithic application as it must consider enough for the planned work. The most popular simulator is NS2
various aspects, e.g.: physical network design, limited bandwidth, [6] which can simulate a broad range of protocols and can be
local or global addressing schemes. Also, especially in the expanded with external plugins. Its capacity is reported to be up to
telecommunication network, the size of the network and physical 3 000 simulated nodes and cannot be parallelized. More nodes –
distance between services needs to be considered when designing up to 100 000 – can be simulated with an SSFNet [7]. This
the solution architecture. simulator is focused on scalability and provides tools to create a
fine-grained network simulation, too detailed for microservices
Other challenges crucial for the communication are service simulation because of the configuration complexity.
discovery and load balancing. Service discovery is a mechanism
thanks to which an address of the needed service can be found in a The second group contains cloud simulators which are focused
set of many different ones, which is usually connected with the mostly on simulating the computing center infrastructure in terms
monitoring of the provider of that service for failure or of efficient hardware allocation for virtual machines, optimization
malfunction. Thanks to the quick detection of the failed service of power consumption and security concerns. An example of such
the traffic can be redirected to the similar, working ones. a simulator is Cloudsim [8] which can simulate message-passing
Advantageously the repair procedure can be provided to the applications residing in a data center. It can handle up to
broken service without a full system outage. That kind of behavior 1 000 000 hosts in several data centers in a simple scenario [8]. It
requires efficient load balancing, an algorithm that analyses is however uneasy to simulate a system with many independent
workload of services and keeps all the system participants equally nodes as required in this research. Other simulators like
occupied. That prevents overload on some services with dry-runs CloudLightning [9] or GridSim [10] can simulate distributed
on others. Both algorithms have an important role to keep the networks but both are focused on the allocation and management
whole system with a big number of services in the working state. of cloud instances (like Virtual Machines) into the data center.
As already said, telecommunications networks can be very large The experiment with service discovery in large-scale, distributed
in terms of the number of connected services and also networks requires a dedicated simulator that connects both
geographical distribution. Both factors introduce the complexity mentioned types. It is somewhat simplified and can simulate the
for service discovery and load balancing algorithms. The main network with its characteristics like latency, throughput or
goal of this work is to provide a tool for checking if the existing
2
Figure 2 Building blocks of the service simulator
3
𝛥𝑡 = 𝑡0 + 𝐶 ∙ √(𝑥1 − 𝑥2 )2 + (𝑦1 − 𝑦2 )2
+ 𝑟𝑛𝑑(time)(1)
where constant t0 represents of message processing
operations (e.g. encoding, decoding) not related to the distance
between services, C is a constant that makes latency proportional
Figure 5 Message generating and processing
to the geographical distance (in [12] authors claim it is about
0.024ms per mile of straight distance) and rnd() is responsible for reasons, e.g. for control and observability. The simulator allows
modelling the random part of network jitter (preferably follows a for creating traffic with a predefined, static or dynamic pace, of
normal distribution [13]). one type of message or as a mix of different message types as
Another type of a backbone network can be represented by the presented in the Figure 5. If needed, the messages can be
graph of connections between services as presented in Figure 4. generated according specific traffic distribution taken from the
That model may be more accurate for the specific networks. The theoretical model or real-life recording. Generated traffic can be
latency for the graph-based backbone model can be calculated routed to a specified service, any service of given type, predefined
or random services set.
with weighted length of the shortest path between graph nodes.
The simulation of the backbone network facilitates the adjustment Simulation of network phenomena
of the network latency to match the needs of the experiment. This Every communication between services is passed (by default)
can be done by modifying Formula 1 with factors depending on through the backbone layer. Its main functions are delivering the
the service and message type, local and global network load, messages and allowing to observe the exchanged messages.
simulation phase, etc. It is possible to drop all or random Delivery of messages is a function of time. Transport time can be
messages matching the predefined filters. The overall view on all calculated, e.g. according to Formula 1. This time may be also
the messages exchanged between services can be the basis for modified by coefficients related to the message type, message size
counters and statistics of the traffic. and total number of current messages (e.g. higher latency when
the current number of messages in the network is high). Apart of
For optimizing the execution, the backbone can be split into a
the main functions the backbone layer allows the traffic to be
number of sub-backbones so the communication between services
modified by dropping or delaying messages with predefined or
is distributed to parallel processes.
Figure 4 Model of the backbone network graph where one Figure 6 Functional scheme of the generic service. Service
or more services occupy a given location and a can generate messages with predefined pace, handle and
communication latency depends on the shortest path process messages and send them with or without
between them confirmation
4
Generators have no incoming messages and depending on the
internal timer emit messages to other services. The generator may
produce messages with a constant pace, variable but predefined
pace or dynamically adjusted according to the current state of the
whole system. Processors handle one type of messages and after
some time of processing send out new messages to the next
Figure 7 Traffic flow schema with single generator (G), service. Terminators are used to close message chain. They only
several processors (Px) and single terminator (T) service receive messages and process them to produce counters and
statistics.
calculated rate, duplicating or corrupting messages. All effects can
be applied to all messages, specific subsets or specific messages Using those three major types, some complex structures can be
only. It is also possible to turn on and off the network built and some real traffic flows can be modeled, as shown in
modification in specific simulation phases. The backbone layer Figure 7.
may observe every packet that is transferred through and thus
various measurements can be realized. The service simulator allows for implementing and analyzing very
complex services, full implementation of the revised algorithm
Service limitations and communication protocol with a set of messages. Such services
can be finely calibrated and connected with other, simpler services
Service should have some limitations: processing power, amount
to create complex scenarios that mimic the existing solutions or
of memory, network bandwidth which prevents it from serving an
products.
infinite number of messages. Those limitations can be added to
the simulated model inside services and backbone layer. Service Observability
limitations may be simple (e.g. limit for the number of
concurrently processed messages, additional time for processing For evaluating the experiment, specific observations should be
messages depending on their size, etc.) or more complex (e.g. performed during and after the simulation. Service simulator
abstract processor with scheduling of tasks, shared memory or allows for gathering, pre-processing and storing various types of
critical resources that are necessary to message processing). Every measurements, depending on the user’s needs. The whole
limitation finally creates a delay in the message processing. It is environment is under developer’s control so necessary traces can
possible to add such a delay as constant or make it depend on an be placed into any part of the simulator. The following parameters
actual state of a given service or the whole simulation. are typically measured: number of messages created and
transferred during a given period, average time of processing the
Service logic messages, average and maximal latency observed, utilization of
given type of services (calculated as average processor load), time
Different experiments require a different level of service logic
of recovery after an introduced error and others. The
complexity. For some of them, focused on the communication
measurements do not affect the traffic and service behavior (only
characteristics, it is enough to have only a simple logic
real time of execution can increase), they are stored in memory,
implemented. For others the internal logic of a service must be
compiled and written into text files to be processed after
very complex including message analysis, internal state handling
simulation execution.
and dependencies between services.
Service simulator allows for simulating both simple and complex 4 SIMULATION RESULTS
services. The simplest service could be a proxy-type service that A service simulator was created, and the first results are already
receives only a single type of messages and sends them to another, available. If the simulator is to be used for the verification of the
predefined service. The output message can be a copy of the hypotheses, it must be verified for its correctness and capabilities.
incoming one or a new message from a predefined set. Another In next section two experiments are presented and a detailed
type of simple services may produce a predefined output of description provided.
messages with a specified statistical distribution of message types
Experiment 1 – correctness of the simulator
and sizes.
The simulator was checked for the correctness of data it provides
Figure 6 contains a schematic view of a generic, moderate
in the following experiment. The real system was compared with
complex service that can: initiate its internal state, send messages
its digital twin simulated model. The simulator can be claimed as
after predefined timeout, keep track of messages that require
valid if it can replicate the real system behavior. To perform such
confirmation, receive acknowledgments from the recipients,
an experiment a small-scale system consisting of 28 swim agents
handle timeouts, resending messages if the previous one was
[14] was created in a single linux container. All of them were
lost, process any kind of incoming messages, analyze them and
exchanging information about their addresses and statuses (which
act based on their type and content. Based on the generic schemes
took about 12s). After 15s when the system was stabilized, one of
presented in Figure 6 it is possible to distinguish between three
the agents was blocked from communication. Then, according to
major types of services: generators, processors and terminators.
the swim algorithm, another agent did not receive any
5
Figure 8 Timestamps of the first 'alive' message for agents Figure 9 Timestamps of the first 'alive' message for agents
in the real case execution in the simulation
acknowledgement for its query and started to suspect failure of the detection after one agent was disabled (in tick number 2000) and
receiver. It has asked three other agents to confirm that and the other agent found it suspected and later dead (took 859 ticks =
faulty agent was correctly identified, and his ‘dead’ state was 8.59 s), distribution of state change of the dead agent, another
distributed to all agents (after about 8s). After another stabilization period and finally another state change after enabling
stabilization period the dead agent was unblocked and able to previously disabled agent and detection of its state change to
handle messages again. After an about 14s the agent was alive.
recognized as alive by the system and information of that change
was distributed to all agents. After 0.42s all agents were notified For making a detailed comparison, the agent restoring procedure
for the first time about the agent’s reappearance. A similar set-up was analyzed. The timestamp of the message informing that
previously dead agent is now alive was measured for every other
has been modelled in the simulator. The same number of 28
agent. Figure 8 presents the observed moments in the real system
services implementing swim algorithm were placed in the
and Figure 9 analogous data for simulation. Two charts are similar
network. With the same traffic scenario executed in real and in shape and time range. Shorter and longer times in the simulated
simulated environment we expected that: a) overall behavior scenario are present because the broadcast timeframe cannot be
during scenario phases would be similar, b) time to notify other synchronized with the real case due its random nature.
agents would be similar if the parameters of simulation are Experimental results show that the simulator can mimic a real
adjusted to the real case. The settings of the parameters can be system.
found in Table 1.
Experiment 2 – capacity of simulator
The behavior of the simulated algorithm was similar to the real
system. We could observe distinct phases of: initiation with In the second experiment the limit of capacity of the simulator
exchanging information about agents states and location (that took was tested. To achieve that goal a set of simulations was planned
about 1240 ticks = 12.4s), calm phase of stabilization, failure – similar in content and increasing in size. The basic simulation
case consisted of a simple chain of services passing a message
Table 1 Comparison of parameters of the real system and from one to another. A simple database of services (SDS) was
simulated one added, where every service registered itself and could find an
address of the counterpart. The traffic flow and the arrangement
Parameter Real Simulated of services are presented in Figure 10
Protocol period (T’) 1.0 s 100 ticks =
1.0 s In a series of experiments, a given number of services was
Ping timeout 0,5 s 50 ticks = randomly placed on the backbone grid of size 10 000x10 000. The
0.5 s numbers of services are shown in Table 2. They were chosen to
Repetition number 8 8 provide a roughly equal distribution of size.
Number of random 3 3
The SDS is placed in the center of the network grid. The messages
members to make
are created in Service 1 with a random pace (between 10 and 99
indirect ping (k)
ticks, where 1 tick = 0.01 s). The addresses of Service 2 are
Communication latency 0.36 ms 1 tick = 10
returned after querying them from SDS. From the nearest three
(average) ms
Service 2 instances one is randomly selected as a target. The
Members to choose for 3 3 message is transported through the backbone network and the
ping (K) latency is calculated according to Formula 1 with configured
6
Figure 10 Scalability traffic model. Multiple copies of five service types executes presented traffic scenario.
constants: t0 = 1 tick, C = 0.01, rnd() is in range 0-5 ticks in messages to process – the processing is postponed and, in
discrete uniform distribution. That means that the shortest summary, takes longer. Service 5 is only receiving messages; it
possible latency is equal to 1 tick, the longest can be 147 ticks. calculates the total time of processing and produces information
Service 2, Service 3 and Service 4 process message for 3 ticks about message transport time from Service 1 to Service 5. The
with a simple resource limitation model in which no more than 10 distribution of this counter should be independent of the
messages can be handled at the same time. If there are more simulation size.
For every experiment the total time for transferring messages
Table 2 Number of services of a given type in the from Service 1 to Service 5 has a similar distribution. At the
experiment beginning of the scenario, it is about 50-100 ticks needed for the
first messages to be transferred between nodes close to each other
Experi Number of Executio Peak Average and placed in the center of the network grid. Then total transfer
ment services n Time memory memory time grows rapidly, up to 800 ticks, because a lot of messages are
number (Te) [s] (MEMp) (MEMa) sent in the same time to SDS. Until reception of SDS response
[MB] [MB] other messages are queued in the services. Due to resource
limitation it takes some time to process all messages in the queue.
1 101 1.5 1695 208 After the queue is unloaded the situation stabilizes and the transfer
2 1010 9.9 1769 290 time is between 21 and 80 ticks with mode equal to 53 ticks.
3 10010 130.9 2985 1112 Figure 11 presents the shapes of that distribution for different
4 50010 949.1 8204 4539 experiment sizes.
5 100010 2492.5 14340 8415
The capacity of the simulator is established by measuring the time
6 200010 8568.5 27224 16237
of execution of the scenario (Te), average and peak memory usage
7 400010 28489.4 48687 28812
(MEMp, MEMa) for an increasing size of the scenario.
8 750010 111888.1 72670 54316
Experiments were conducted for a hundred up to a million
9 1000010 205798.9 116853 72739
services.
7
can simulate systems with a huge number of communicating
services which are also geographically dispersed. Such systems
are planned to be used in 5G radio access network in mobile
communication.
Figure 12 Execution time of simulation grows Figure 13 Both maximum and average memory usages
quadratically with the increasing number of simulated grow linearly with the number of simulated services
services
8
[6] The Network Simulator - ns-2. Website
http://www.isi.edu/nsnam/ns/