
Experimental Evaluation of the Cloud-Native Application Design
Sandro Brunner, Martin Blöchlinger, Giovanni Toffetti, Josef Spillner, Thomas Michael Bohnert
Zurich University of Applied Sciences, School of Engineering
Service Prototyping Lab (blog.zhaw.ch/icclab/)
8401 Winterthur, Switzerland
Email: {brnr,bloe,toff,josef.spillner,thomas.bohnert}@zhaw.ch

Abstract: Cloud-Native Applications (CNA) are designed to run on top of cloud computing infrastructure services with inherent support for self-management, scalability and resilience across clustered units of application logic. Their systematic design is promising especially for recent hybrid virtual machine and container environments for which no dominant application development model exists. In this paper, we present a case study on a business application running as CNA and demonstrate the advantages of the design experimentally. We also present Dynamite, an application auto-scaler designed for containerised CNA. Our experiments on a Vagrant host, on a private OpenStack installation and on a public Amazon EC2 testbed show that CNA require little additional engineering.

I. BACKGROUND

Delivering software applications from the cloud requires a new way of thinking about software architectures and software engineering processes [1]. Typical cloud service environments with raw infrastructure and featureful platform services (IaaS and PaaS, respectively) offer multiple benefits over other software hosting and delivery paradigms. Their full exploitation is however only possible if the software is made aware, to a certain extent, of the cloud hosting characteristics, including elastic horizontal scalability, on-demand payment and an orchestrated application lifecycle. Often, applications are migrated into the cloud as monolithic virtual machines, which makes them vulnerable to availability failures and demand spikes. This paper reports on how to mitigate these two risks.

There are a number of approaches to tackle this problem. Model-based software deployment in the cloud matches software capabilities with the hosting environment [2]. CodeCloud is a concrete multi-IaaS platform for executing scientific applications with specified infrastructure requirements, including SLAs, minimum replication factors and elasticity [3]. Dispersed Computing is an availability-increasing concept for storing and processing fragments of data across cloud providers, either in the clear or with encryption for higher confidentiality, resulting in Stealth Computing [4].

Cloud-Native Applications (CNA) [5] are our approach of decomposing software functionality into smaller service units, connecting them with an orchestration authority and a scaling engine, and running them with high resilience and scalability. So far, CNA has been a mostly theoretic design with a distinct lack of an evaluation study. This paper therefore does not attempt to argue for CNA, for which we refer to our previous publication [5]. It is also not a step-by-step guide to CNA, for which we refer to our detailed posts on this topic¹. Instead, it briefly repeats the CNA characteristics and then focuses on a thorough evaluation.

In the next sections, we will first recapitulate design principles for CNA. Then, we will introduce an evaluation scenario and subsequently evaluate the scenario application's scalability and resilience in a private and in a public cloud environment. Finally, we will wrap up our findings and contributions, which include a novel scaling engine called Dynamite.

¹ http://blog.zhaw.ch/icclab/category/research-approach/themes/cloud-native-applications/

II. CLOUD NATIVE APPLICATIONS

The essential properties of CNA are scalability and resilience. For scalability, a CNA has to be capable of taking advantage of the cloud characteristics on-demand self-service, rapid elasticity and measured service, and of adjusting its capacity by adding or removing resources. For resilience, a CNA has to tolerate failures of commodity hardware, virtualised resources or services. In order to achieve both properties, CNA are structured into core functionality and supporting functionality. The core is subdivided into fine-grained microservices so that each service can be scaled and governed individually depending on the load within the corresponding part of the application. The support encompasses monitoring and management systems [5].

On the implementation level, cloud containers are an appropriate mechanism to realise CNA. Containers, as opposed to virtual machines, can be spawned and terminated quickly, similar to native system processes. Furthermore, in order to make use of existing infrastructure management tooling, the handling of containers can be harmonised with the handling of virtual machines by having groups of containers across nodes.

III. EVALUATION SCENARIO

We present the CNA evaluation scenario in four steps. First, we select a suitable target application. Second, we analyse the application and plan the conversion to a CNA. Third, we conduct the cloud deployment to achieve a running CNA.
A. Selection of a Software Application

Our aim is to show and evaluate how to migrate an application into the cloud as a CNA and, by doing so, make it resilient and scalable. This implies that the application is self-managing: once it has been deployed, it should automatically recover from failures and automatically scale up or down should the need arise. For the evaluation, we had to settle on one particular existing software application to show the applicability of the research. One of the main requirements for the application to migrate was that it be open-source. This allows for changes to some of the core functionality of the application at some point. After filtering popular business domains on the cloud and a subsequent comparative evaluation involving five customer-relationship management (CRM) business applications, we selected Zurmo, an open-source CRM, as the target to be migrated into the cloud. There are two reasons Zurmo is a suitable example. First, its code-base size is moderate, which allows us to make changes to the source code without too much effort. Second, the application is developed using a test-driven process, which ensures a certain code quality and therefore allows for safer changes to the application.

Next, we describe what the original architecture of the application looked like and which changes we introduced to adapt it for a cloud environment. We explain what the resulting CNA looks like and how it is technically implemented.

Fig. 1. Original monolithic architecture of Zurmo

Fig. 2. Evolved architecture of Zurmo with CNA support functions

B. Analysis and Planning

From an architectural point of view, Zurmo is very straightforward and representative of many SaaS offerings. It is a web application connected to a database with an optional cache. An Apache web server is used to run the PHP code, MySQL is the standard database used for the backend and Memcached is used for the caching system. Session information is saved locally on the web server by the application process. Fig. 1 shows Zurmo's original architecture. In contrast, Fig. 2 shows the architecture after the process of migrating it to a proper CNA architecture. The CNA support functions are prominently visible.

The application is scaled by placing a (possibly distributed) load balancer in front of the web servers. Memcached already allows horizontal scaling of its service by adding additional servers to a cluster. The database runs in a master/slave setup. It can handle a bigger write load than the usage of a CRM will typically require, and therefore, we do not consider the single master a bottleneck in practice.

To enable the application's ability to be self-managing, a monitoring as well as a management system has been added. The monitoring system monitors the application and the systems the application runs on. This information is needed to know the status of the system and to provide input for the auto-scaling decision. All information gathered from those systems is collected and saved in an external database for analysis. The management system is responsible for the discovery, the reliability and the automatic scaling of the application. The discovery allows services to find other services. The health management component keeps track of the health of the various parts of the system and restarts failed components. The configuration/service discovery is used for the various parts of the application to register themselves and store information about the current status of the system. The auto-scaling system takes scaling decisions based on input it gets from the monitoring systems.

C. Conversion to CNA Architecture

Zurmo originally saved the session state from clients locally on the web server nodes. Scaling the application horizontally by running multiple web server instances behind a load balancer would work, but presents some significant drawbacks. Because of the local sessions, clients would always have to be forwarded to the web server they originally connected to, and a crash of that particular web server would have resulted in all clients connected to it losing their session information. The former issue could easily have been resolved by using sticky sessions. The latter issue could only be resolved by changing the application to save session state remotely instead of locally. In the migrated version of the application, Zurmo saves the client session state to both the cache and the database, making the Apache layer stateless and the application both scalable and more resilient. The failure of a single web server will no longer negatively influence the clients. A minimal sketch of this session externalisation idea is shown below.
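Zurmo itself is a PHP application and the paper does not reproduce its session-handling code; the following Python sketch only illustrates the general pattern of externalising session state to a shared store (here Memcached via the pymemcache client) so that any web server instance can serve any client. The host name, key layout and TTL are assumptions made for the example, and the real migration additionally persists sessions to the database.

    import json
    from pymemcache.client.base import Client

    # Shared session store reachable from every web server instance
    # (the hostname "memcached" is an assumption for illustration).
    sessions = Client(("memcached", 11211))

    def save_session(session_id: str, data: dict, ttl: int = 1800) -> None:
        # Any web server instance can write the session; no local state remains.
        sessions.set(f"session:{session_id}", json.dumps(data), expire=ttl)

    def load_session(session_id: str) -> dict:
        # Any web server instance (e.g. after a failover) can read it back.
        raw = sessions.get(f"session:{session_id}")
        return json.loads(raw) if raw else {}

    if __name__ == "__main__":
        save_session("abc123", {"user_id": 42, "cart": ["item-1"]})
        print(load_session("abc123"))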
Fig. 3. Deployment scheme for containerised applications

D. Implementation and Migration Details

1) Virtual Machines and Containers: We chose the operating system image CoreOS as the underlying virtual machine because it is designed for clustered deployments and comes with system services for health management (Fleet²) and service/configuration management (etcd). Its software processes typically run inside Docker containers. Different components of an application are put into dedicated containers which are then network-connected or directory-connected with each other. In a first step, we containerized the components of Zurmo. Thus, we created one container for the web server, one for the load balancer, one for the database and one for the caching system. All the CNA support components added later (monitoring systems, scaling engine) were also containerized. Fleet is actually a distributed init system responsible for system boot sequences. It runs on top of systemd, a recent parallelized implementation and conceptual enhancement of init. Fleet allows services to be described and deployed on machines in a cluster, ensures they are kept in a certain state, and re-deploys them in case of machine failure. In the migrated application, the services typically start containers. By containerizing the application and running it in a CoreOS cluster with Fleet, we can ensure a resilient system as shown in Fig. 3.

² For more details on Fleet the interested reader can refer to https://github.com/coreos/fleet

2) Service Autodiscovery: In order for the application to be truly self-managing, its components need to be aware of one another and be able to adapt to a changing environment, e.g. components being added or removed. In a self-managing system, the web server should announce itself and the load balancer should reconfigure itself accordingly, for instance. In order to achieve this, we use Etcd, essentially a distributed key-value store, as the service-discovery component in combination with confd. Confd watches certain keys in Etcd, gets notified when an update occurs, and updates configuration files based on the values of those keys (Fig. 4). To announce services, we use the sidekick pattern: alongside every service which needs to announce itself (e.g. the web server), a complementary, lifecycle-bound sidekick service is deployed whose only job is to announce the former service in Etcd. Confd is only installed in containers of services which need to update themselves in case components are added or removed. A minimal sketch of such a sidekick announcement follows.
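In the deployed system the announcements are made by sidekick services managed as Fleet units, whose code is not listed in the paper. As a rough sketch only, the following Python loop shows the idea against the etcd v2 HTTP API; the endpoint, key layout and TTL are chosen purely for illustration. The key disappears automatically once the sidekick stops refreshing it, which is the change confd reacts to.

    import time
    import requests

    ETCD = "http://127.0.0.1:2379"            # etcd endpoint (assumed default port)
    KEY = "/v2/keys/services/webserver/web-1" # hypothetical key layout
    VALUE = "10.0.0.5:80"                     # address of the announced web server
    TTL = 30                                  # entry expires if not refreshed

    def announce_forever():
        # Register the sibling service in etcd and keep the entry alive.
        # If this sidekick stops (e.g. because the web server container died and
        # the lifecycle-bound sidekick was stopped with it), the TTL expires, the
        # key vanishes, and confd regenerates the load balancer configuration.
        while True:
            requests.put(ETCD + KEY, data={"value": VALUE, "ttl": TTL}, timeout=5)
            time.sleep(TTL // 2)  # refresh well before the TTL runs out

    if __name__ == "__main__":
        announce_forever()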
3) Monitoring: The monitoring system, re-usable across CNA applications, consists of the so-called ELK stack, Log-Courier and collectd. The ELK stack in turn consists of Elasticsearch, Logstash and Kibana. Logstash collects log lines, transforms them into a unified format and sends them to a pre-defined output. Collectd collects system metrics and stores them in a file. We use Log-Courier to send the application and system-metric log files from the container in which a service runs to Logstash. The output lines of Logstash are transmitted to Elasticsearch, which is a full-text search server. Kibana is a dashboard and visualization web application which gets its input data from Elasticsearch. It is able to display the gathered metrics in a meaningful way for human administrators. To provide the generated metrics to the scaling engine, we developed a new output adapter for Logstash which sends the processed data directly to Etcd. The overall implementation is depicted in Fig. 5. The monitoring component is essential to our experimental evaluation. The effect of the Etcd output adapter is sketched below.
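The actual adapter is a Logstash output plugin and is not reproduced here; purely to make the hand-over point concrete, this Python sketch shows the equivalent effect of publishing one aggregated metric value into etcd under a hypothetical key layout that the scaling engine could poll.

    import requests

    ETCD = "http://127.0.0.1:2379"  # assumed default etcd endpoint

    def publish_metric(service_type: str, name: str, value: float) -> None:
        # e.g. /metrics/webserver/cpu_utilization = "12.5" (hypothetical layout)
        key = f"/v2/keys/metrics/{service_type}/{name}"
        requests.put(ETCD + key, data={"value": str(value)}, timeout=5)

    if __name__ == "__main__":
        # A value as it might arrive from collectd via Logstash.
        publish_metric("webserver", "cpu_utilization", 12.5)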
Fig. 4. Dynamic reconfiguration of services through the auto-discovery mechanism of etcd

Fig. 5. Monitoring and Logging

4) Auto-Scaling Engine: The new scaling engine Dynamite is an open-source Python application which uses Etcd and Fleet and therefore integrates natively with CNA. Dynamite takes care of automatic horizontal scaling, but also of the initial deployment of the application. It consists of several components shown in Fig. 6. An auto-scaled application is assumed to be composed of Fleet services. Dynamite uses system metrics and application-related information to decide when a service should be scaled out or in. If a service should be scaled out, Dynamite creates a new service instance and submits it to Fleet. Otherwise, if a scale-in is requested, it instructs Fleet to destroy a specific service instance.

To calculate scaling decisions, Dynamite uses at the moment a rule-based approach, but it can be easily extended to support more advanced scaling logic (e.g., model based). A rule consists of the name of the metric to be observed and a threshold value, the type of the service the metric belongs to and a comparative operator which tells Fleet how to compare reported monitoring values with the threshold. Additionally, a period can be defined in which the value must always be over, respectively under, the threshold before executing a scaling action. This avoids single peaks generating useless actions. A cool-down period can also be configured per rule: Dynamite will wait for the defined amount of time after the scaling rule was executed before issuing the next scaling action from the same rule. The following example shows an excerpt of the configuration file as used in the Zurmo scenario application.

Service:
  apache:
    name_of_unit_file: apache@.service
    type: webserver
    min_instance: 2
    max_instance: 5
    scale_down_policy:
      ScalingPolicy: apache_scale_down
    scale_up_policy:
      ScalingPolicy: apache_scale_up
[...]
ScalingPolicy:
  apache_scale_down:
    service_type: webserver
    metric: cpu_utilization
    comparative_operator: lt
    threshold: 15
    threshold_unit: percent
    period: 30
    period_unit: second
    cooldown_period: 1
    cooldown_period_unit: minute


The file declares an Apache web service which has at least two running instances. If the scale-out (instance up) policy were triggered, at most five instances would be running. The policy also defines under what circumstances the service should be scaled in (instance down): if the CPU utilization of a service of the type webserver falls below 15% for at least 30 seconds, the instance which triggered the scaling policy is removed. A simplified sketch of these rule semantics is given below.
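Dynamite's real rule engine is part of the open-source implementation referenced below; the following simplified Python sketch only mirrors the semantics just described (comparative operator, sustained period, cool-down) using invented class and attribute names rather than Dynamite's actual API.

    import time
    import operator

    OPS = {"lt": operator.lt, "gt": operator.gt}

    class ScalingRule:
        # Simplified rule: trigger only if the metric stays beyond the threshold
        # for a full period, then honour a cool-down interval between actions.
        def __init__(self, metric, op, threshold, period_s, cooldown_s):
            self.metric = metric
            self.op = OPS[op]                # e.g. "lt" -> operator.lt
            self.threshold = threshold       # e.g. 15 (percent)
            self.period_s = period_s         # e.g. 30 seconds
            self.cooldown_s = cooldown_s     # e.g. 60 seconds
            self.breach_since = None         # when the threshold was first crossed
            self.last_action = 0.0           # time of the last scaling action

        def update(self, value, now=None):
            # Feed one monitoring sample; return True if a scaling action fires.
            now = time.time() if now is None else now
            if not self.op(value, self.threshold):
                self.breach_since = None     # back within bounds, reset the period
                return False
            if self.breach_since is None:
                self.breach_since = now
            sustained = now - self.breach_since >= self.period_s
            cooled_down = now - self.last_action >= self.cooldown_s
            if sustained and cooled_down:
                self.last_action = now
                self.breach_since = None
                return True
            return False

    # Example mirroring the apache_scale_down rule from the configuration excerpt:
    rule = ScalingRule("cpu_utilization", "lt", 15, period_s=30, cooldown_s=60)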
Figure 6 depicts the components of Dynamite and the workflow between them. The configuration file is read by the INIT component, which initializes Dynamite and writes the information from the configuration file to Etcd. Dynamite is itself designed according to CNA principles. If it crashes, it is restarted and re-initialized using the information stored in Etcd. This way, Dynamite can be run in a CoreOS cluster resiliently. If the node Dynamite is running on crashes, Fleet will re-schedule the service to another machine and start Dynamite there, where it can restore the state from Etcd. The INIT component also takes care of starting the other components of Dynamite. The METRICS component is used to collect the metrics stored in Etcd. It regularly requests them and forwards them to the SCALING component. This component compares the received metric values with the scaling policy constraints and manages the state of the scaling policies. If a scaling policy triggers a scaling action, it will be sent to the EXECUTOR component. This component creates or destroys service instances with the help of Fleet, as sketched below. For more details, we refer to the documentation of the Dynamite implementation³.
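Inside Dynamite the EXECUTOR talks to Fleet programmatically; as an illustration of the effect only, this Python sketch drives the same templated unit (apache@.service from the configuration excerpt) through the fleetctl command line. The instance numbering is an assumption for the example.

    import subprocess

    UNIT_TEMPLATE = "apache@{}.service"  # matches name_of_unit_file: apache@.service

    def scale_out(instance_no: int) -> None:
        # Submit and start one more instance of the templated unit via Fleet.
        unit = UNIT_TEMPLATE.format(instance_no)
        subprocess.run(["fleetctl", "start", unit], check=True)

    def scale_in(instance_no: int) -> None:
        # Stop and remove a specific instance again.
        unit = UNIT_TEMPLATE.format(instance_no)
        subprocess.run(["fleetctl", "destroy", unit], check=True)

    if __name__ == "__main__":
        scale_out(3)   # e.g. reaction to an apache_scale_up trigger
        scale_in(3)    # e.g. reaction to an apache_scale_down trigger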
³ Dynamite scaling engine: https://github.com/icclab/dynamite/blob/master/readme.md

Fig. 6. Dynamite scaling engine components

IV. EVALUATION EXPERIMENTS AND RESULTS

The evaluation of the CNA design involves stress-testing the scenario application Zurmo in several cloud infrastructure environments. The goal is to confirm the scalability and resilience properties under the influence of availability failures and demand spikes. Emulation within the actual system, as opposed to pure simulation, has been chosen as the scientific technique for validating the application behaviour due to its ability to cover all side effects, including service dependencies [6]. We describe our approach of emulating cloud failures and provoking demand spikes, followed by a presentation of the results.

A. Resilience: Failure Emulation

For the emulation experiments, we have chosen the Multi-Cloud Simulation and Emulation Tool (MCS-SIM)⁴, which is an extensible open-source tool for the dynamic selection of multiple resource services according to their availability, price and capacity. Subsequently, we have extended MCS-SIM with an additional unavailability model and hooks for enforcing container service unavailability. Fig. 7 compares the two models. By executing them, the state transition from unavailable to available causes the corresponding container service to be killed immediately.

The container service hook connects to a Docker interface per VM to retrieve available container images and running instances. Following the model's determination of unavailability, the respective containers are forcibly stopped remotely, as illustrated by the sketch below. It is the task of the CNA framework to ensure that in such cases, the desired number of instances per image is only briefly underbid and that replacement instances are launched quickly. Therefore, the overall application's availability should be close to 100% even if the container instances are emulated with 90% estimated availability.
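The hook itself belongs to our extended MCS-SIM and is not listed in the paper; the Python sketch below merely illustrates the kind of forced stop it performs, using the Docker SDK for Python against a per-VM Docker endpoint. The endpoint address and image name are assumptions for the example.

    import docker

    # Remote Docker daemon of one VM under test (address is an assumption).
    client = docker.DockerClient(base_url="tcp://10.0.0.5:2375")

    def kill_service_containers(image_prefix: str) -> None:
        # Forcibly stop all running containers of a given service image,
        # emulating the transition into the unavailable state.
        for container in client.containers.list():
            tags = container.image.tags or [""]
            if any(tag.startswith(image_prefix) for tag in tags):
                container.stop(timeout=0)  # immediate stop, no graceful period

    if __name__ == "__main__":
        kill_service_containers("zurmo/apache")  # hypothetical image name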
⁴ MCS-SIM Tool: http://nubisave.org/cgi-bin/gitweb.cgi?p=mcssimulation

Fig. 7. Comparison of two unavailability models (anyservice/0.9/convergence vs. anyservice/0.9/incident) for a single service with 90% average availability

Fig. 8. Sustained request rate and measured response time

Fig. 9. Resulting service availability

B. Scalability: Demand Spike Emulation

The simulation of heavy user influx on the web application requires a configurable HTTP stress-testing tool. We have chosen Tsung, an open-source multi-protocol tool which runs distributed across multiple nodes⁵. In our experiment, the tool first records the interaction with Zurmo from a single web browser, acting as an intermediate proxy. Then, it replays the interaction from multiple nodes in a cluster to achieve a high load. For our experiment we used a simple step model increasing the demand from 10 to 50 users over 25 minutes, as sketched below.
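Tsung is configured through an XML load definition which is not reproduced here; purely to make the workload shape explicit, the following Python sketch approximates the step model (10 to 50 users over 25 minutes). The number of steps is an assumption, since only the start value, end value and duration are stated above.

    def users_at(minute: float, start=10, end=50, duration_min=25, steps=5) -> int:
        # Piecewise-constant step model: 10, 20, 30, 40, 50 users over five
        # equal 5-minute intervals (the step count is assumed for illustration).
        if minute >= duration_min:
            return end
        step_len = duration_min / steps          # 5 minutes per step
        increment = (end - start) / (steps - 1)  # 10 additional users per step
        step_index = int(minute // step_len)     # 0 .. steps-1
        return int(start + step_index * increment)

    # e.g. users_at(0) == 10, users_at(12) == 30, users_at(24) == 50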
C. Resilience and Scalability Results
We have run the CNA scenario on Vagrant during development, on OpenStack with Heat orchestration in a private cloud, and on Amazon AWS with CloudFormation orchestration in a public cloud. The experiments reported here were performed on the AWS deployment. The input models for demand spikes and service unavailability have been executed in configurations ranging from 3 to 10 virtual machines⁶ over which the application containers were deployed.

Fig. 8 shows the response time for the configuration of 10 VMs as a result of the increasing number of users. The bottom graph shows the number of active users over time. The trace-driven workload run by Tsung for each of our simulated users is configured to adopt an average think time of 5 seconds between requests. As a result, the expected request rate is supposed to vary between 2 and 10 requests per second. In practice, Tsung waits for a response before starting the think time for submitting the next request, so the request rate obtained experimentally is slightly lower.

As can be seen in the upper graph of Fig. 8, while the request rate increases five-fold, the application continues to handle the load while keeping the response time constrained. As a reference, running the same experiment using only 4 VMs results in average response times of over 4 seconds.

Fig. 9 shows the effect of the health management component, which ensures a rapid recovery (within seconds) of the Apache containers after failures are induced in the system through our extended version of MCS-SIM.

V. DISCUSSION AND FUTURE WORK

Our work has shown that with moderate effort, resulting in mostly re-usable components, applications can be migrated into cloud-native, resilient and scalable services.

We continue to work on CNA and evaluate conceptual extensions such as clustered database backends, new tools such as fluentd, and additional scenario applications.

REFERENCES

[1] R. Bahsoon, I. Mistrík, N. Ali, T. S. Mohan, and N. Medvidovic, "The future of software engineering IN and FOR the cloud," Journal of Systems and Software, vol. 86, no. 9, pp. 2221-2224, September 2013, Editorial.
[2] F. M. R. Junior and T. da Rocha, "Model-based Approach to Automatic Software Deployment in Cloud," in Proceedings of the 4th International Conference on Cloud Computing and Services Science (CLOSER), Barcelona, Spain, April 2014, pp. 151-157.
[3] M. Caballer, C. de Alfonso, G. Moltó, E. Romero, I. Blanquer, and A. García, "CodeCloud: A platform to enable execution of programming models on the Clouds," Journal of Systems and Software, vol. 93, pp. 187-198, July 2014.
[4] J. Spillner, "Secure Distributed Data Stream Analytics in Stealth Applications," in 3rd IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Constanța, Romania, May 2015.
[5] G. Toffetti, S. Brunner, M. Blöchlinger, F. Dudouet, and A. Edmonds, "An architecture for self-managing microservices," in Proceedings of the 1st International Workshop on Automated Incident Management in Cloud (AIMC), Bordeaux, France, April 2015, pp. 19-24.
[6] R. Lübke, D. Schuster, and A. Schill, "Experiences Virtualizing a Large-Scale Test Platform for Multimedia Applications," in 10th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (TridentCom), Vancouver, Canada, June 2015.
⁵ Tsung Tool: http://tsung.erlang-projects.org/

⁶ We used AWS t2.small machines to test a larger distributed scenario with constrained resources requiring autoscaling.
