Experimental Evaluation of the Cloud-Native Application Design
Sandro Brunner, Martin Blöchlinger, Giovanni Toffetti, Josef Spillner, Thomas Michael Bohnert
Zurich University of Applied Sciences, School of Engineering
Service Prototyping Lab (blog.zhaw.ch/icclab/)
8401 Winterthur, Switzerland
Email: {brnr,bloe,toff,josef.spillner,thomas.bohnert}@zhaw.ch
Abstract: Cloud-Native Applications (CNA) are designed to run on top of cloud computing infrastructure services with inherent support for self-management, scalability and resilience across clustered units of application logic. Their systematic design is promising especially for recent hybrid virtual machine and container environments for which no dominant application development model exists. In this paper, we present a case study on a business application running as CNA and demonstrate the advantages of the design experimentally. We also present Dynamite, an application auto-scaler designed for containerised CNA. Our experiments on a Vagrant host, on a private OpenStack installation and on a public Amazon EC2 testbed show that CNA require little additional engineering.

... publication [5]. It is also not a step-by-step guide to CNA, for which we refer to our detailed posts on this topic¹. Instead, it briefly repeats the CNA characteristics and then focuses on a thorough evaluation.
In the next sections, we will first recapitulate design principles for CNA. Then, we will introduce an evaluation scenario, and subsequently evaluate the scenario application's scalability and resilience in a private and in a public cloud environment. Finally, we will wrap up our findings and contributions, which include a novel scaling engine called Dynamite.
D. Implementation and Migration Details
1) Virtual Machines and Containers: We chose the operating system image CoreOS as the underlying virtual machine because it is designed for clustered deployments and comes with system services for health management (Fleet²) and service/configuration management (etcd). Its software processes typically run inside Docker containers. Different components of an application are put into dedicated containers which are then network-connected or directory-connected with each other. In a first step, we containerized the components of Zurmo: we created one container for the web server, one for the load balancer, one for the database and one for the caching system. All the CNA support components added later (monitoring system, scaling engine) were also containerized. Fleet is a distributed init system responsible for system boot sequences. It runs on top of systemd, a recent parallelized implementation and conceptual enhancement of init. Fleet allows services to be described and deployed on machines in a cluster, ensures they are kept in the desired state, and re-deploys them in case of machine failure. In the migrated application, the services typically start containers. By containerizing the application and running it in a CoreOS cluster with Fleet, we can ensure a resilient system as shown in Fig. 3.
² For more details on Fleet, the interested reader can refer to https://github.com/coreos/fleet
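For illustration, the following is a minimal sketch, not the project's actual tooling, of how the containerized components could be registered and started in the cluster through the fleetctl command-line client; the unit file names are assumptions, except apache@.service, which also appears in the Dynamite configuration excerpt below.

import subprocess

# Hypothetical unit files describing the containerized Zurmo components; each
# unit's ExecStart typically wraps a "docker run" of the corresponding image.
UNITS = ["haproxy.service", "apache@.service", "mysql.service", "memcached.service"]

def fleetctl(*args):
    # Thin wrapper around the fleetctl client shipped with CoreOS.
    subprocess.check_call(["fleetctl"] + list(args))

# Register the unit files with the cluster.
for unit in UNITS:
    fleetctl("submit", unit)

# Start one load balancer, database and cache, plus two instances of the
# apache@.service template; Fleet schedules them onto cluster machines and
# re-deploys them elsewhere if a machine fails.
for unit in ["haproxy.service", "mysql.service", "memcached.service",
             "apache@1.service", "apache@2.service"]:
    fleetctl("start", unit)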
2) Service Autodiscovery: In order for the application to be truly self-managing, its components need to be aware of one another and be able to adapt to a changing environment, e.g. components being added or removed. In a self-managing system, the web server should announce itself and the load balancer should reconfigure itself accordingly, for instance. To achieve this, we use Etcd, essentially a distributed key-value store, as the service-discovery component in combination with confd. Confd watches certain keys in Etcd, gets notified when an update occurs, and updates configuration files based on the values of those keys (Fig. 4). To announce services, we use the sidekick pattern: alongside every service which needs to announce itself (e.g. the web server), a complementary, lifecycle-bound sidekick service is deployed whose only job is to announce the former service in Etcd. Confd is only installed in containers of services which need to update themselves when components are added or removed.

Fig. 4. Dynamic reconfiguration of services through the auto-discovery mechanism of etcd (diagram: Apache containers register with and unregister from etcd; the HAProxy load balancer container listens for change notifications and reloads its configuration).
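As a sketch of the sidekick pattern, the following hypothetical announcement loop, which is not the actual sidekick unit used in the migration, periodically publishes a web server's address under an etcd key with a TTL; confd, watching that key, would regenerate and reload the HAProxy configuration, and the key simply expires if the web server and its sidekick die. The endpoint, key path and address are assumptions.

import time
import requests

ETCD = "http://127.0.0.1:2379"                    # etcd client endpoint (assumed)
KEY = "/v2/keys/services/webserver/apache-1"      # hypothetical key watched by confd
VALUE = "10.0.0.12:80"                            # address of the announced web server

def announce_loop(interval=30, ttl=60):
    # Re-publish more often than the TTL so the key only expires when the
    # web server, and therefore this sidekick, stops announcing.
    while True:
        requests.put(ETCD + KEY, data={"value": VALUE, "ttl": ttl})
        time.sleep(interval)

def unregister():
    # Executed when the sidekick is stopped together with its web server.
    requests.delete(ETCD + KEY)

if __name__ == "__main__":
    try:
        announce_loop()
    finally:
        unregister()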
3) Monitoring: The monitoring system, re-usable across CNA applications, consists of the so-called ELK stack, Log-Courier and collectd. The ELK stack in turn consists of Elasticsearch, Logstash and Kibana. Logstash collects log lines, transforms them into a unified format and sends them to a pre-defined output. Collectd collects system metrics and stores them in a file. We use Log-Courier to send the application and system-metric log files from the container in which a service runs to Logstash. The output lines of Logstash are transmitted to Elasticsearch, a full-text search server. Kibana is a dashboard and visualization web application which gets its input data from Elasticsearch and is able to display the gathered metrics in a meaningful way for human administrators. To provide the generated metrics to the scaling engine, we developed a new output adapter for Logstash which sends the processed data directly to Etcd. The overall implementation is depicted in Fig. 5. The monitoring component is essential to our experimental evaluation.
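The adapter itself is a Logstash output plugin; purely to illustrate the resulting data flow, the following sketch, with assumed key names, writes a processed metric into etcd and reads it back the way the scaling engine would.

import requests

ETCD = "http://127.0.0.1:2379"
# Hypothetical key under which the Logstash output adapter publishes a metric.
METRIC_KEY = "/v2/keys/metrics/webserver/cpu_utilization"

def publish_metric(value):
    # Role of the Logstash-to-etcd output adapter for each processed event.
    requests.put(ETCD + METRIC_KEY, data={"value": str(value)})

def read_metric():
    # Role of the scaling engine when it evaluates its rules.
    node = requests.get(ETCD + METRIC_KEY).json()["node"]
    return float(node["value"])

publish_metric(73.5)
print(read_metric())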
4) Auto-Scaling Engine: The new scaling engine Dynamite is an open-source Python application which uses Etcd and Fleet and therefore integrates natively with CNA. Dynamite takes care of automatic horizontal scaling, but also of the initial deployment of the application. It consists of several components shown in Fig. 6. An auto-scaled application is assumed to be composed of Fleet services. Dynamite uses system metrics and application-related information to decide when a service should be scaled out or in. If a service should be scaled out, Dynamite creates a new service instance and submits it to Fleet; if a scale-in is requested, it instructs Fleet to destroy a specific service instance.
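The two actions map onto Fleet template units in a straightforward way; the following is a sketch with hypothetical helper names rather than Dynamite's actual code.

import subprocess

def fleetctl(*args):
    subprocess.check_call(["fleetctl"] + list(args))

def scale_out(unit_template, instance_id):
    # e.g. unit_template="apache@.service", instance_id=3 -> "apache@3.service";
    # Fleet instantiates the template and schedules it onto a cluster machine.
    fleetctl("start", unit_template.replace("@", "@%d" % instance_id))

def scale_in(unit_template, instance_id):
    # Destroying the instance stops its container and removes the unit.
    fleetctl("destroy", unit_template.replace("@", "@%d" % instance_id))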
To calculate scaling decisions, Dynamite currently uses a rule-based approach, but it can easily be extended to support more advanced scaling logic (e.g., model-based). A rule consists of the name of the metric to be observed, a threshold value, the type of the service the metric belongs to, and a comparative operator which tells the scaling engine how to compare reported monitoring values with the threshold. Additionally, a period can be defined during which the value must stay above or below the threshold, respectively, before a scaling action is executed. This avoids single peaks generating useless actions. A cool-down period can also be configured per rule: Dynamite will wait for the defined amount of time after the scaling rule was executed before issuing the next scaling action from the same rule. The following example shows an excerpt of the configuration file as used in the Zurmo scenario application.

Service:
  apache:
    name_of_unit_file: apache@.service
    type: webserver
    min_instance: 2
    max_instance: 5
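To make the rule semantics concrete, the following minimal sketch, which is not Dynamite's implementation, evaluates one rule with a threshold, a comparison operator, a hold period and a cool-down; the metric source is left abstract.

import time

class Rule(object):
    def __init__(self, service_type, metric, comparator, threshold,
                 period, cooldown):
        self.service_type = service_type    # e.g. "webserver"
        self.metric = metric                # e.g. "cpu_utilization"
        self.comparator = comparator        # ">" for scale-out, "<" for scale-in
        self.threshold = threshold
        self.period = period                # seconds the condition must hold
        self.cooldown = cooldown            # seconds to wait after an action
        self._breach_since = None
        self._last_action = 0.0

    def evaluate(self, value, now=None):
        # Return True when a scaling action should be triggered.
        now = time.time() if now is None else now
        breached = value > self.threshold if self.comparator == ">" \
            else value < self.threshold
        if not breached:
            self._breach_since = None       # a single peak resets the window
            return False
        if self._breach_since is None:
            self._breach_since = now
        if now - self._breach_since < self.period:
            return False                    # condition has not held long enough
        if now - self._last_action < self.cooldown:
            return False                    # still cooling down from last action
        self._last_action = now
        self._breach_since = None
        return True

# Example: trigger web server scale-out when CPU stays above 80% for 60 s,
# at most once every 120 s.
scale_out_rule = Rule("webserver", "cpu_utilization", ">", 80.0,
                      period=60, cooldown=120)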
Fig. 7. Comparison of two unavailability models for a single service with 90% average availability [plot]
Fig. 8. Sustained request rate and measured response time [plot]
[Plot: number of web servers (WS) over experiment time (sec).]
... chosen Tsung, an open-source multi-protocol tool which runs distributed across multiple nodes⁵. In our experiment, the tool first records the interaction with Zurmo from a single web browser, acting as an intermediate proxy. Then, it replays the interaction from multiple nodes in a cluster to achieve a high load. For ...
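As a rough illustration of the record-and-replay idea, and not of the Tsung setup actually used, the following sketch replays a hypothetical recorded request sequence against the load balancer from several concurrent workers; the host name and paths are assumptions.

import threading
import requests

# Hypothetical stand-in for a recorded Tsung session: the URLs one browser
# interaction with Zurmo produced, replayed here against the load balancer.
BASE = "http://loadbalancer.example.org"
RECORDED_PATHS = ["/zurmo/index.php", "/zurmo/index.php/accounts/default/list"]

def replay(iterations=100):
    session = requests.Session()
    for _ in range(iterations):
        for path in RECORDED_PATHS:
            session.get(BASE + path)

# Run several concurrent "virtual users"; Tsung distributes this across
# multiple cluster nodes instead of threads to reach a much higher load.
workers = [threading.Thread(target=replay) for _ in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()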