Reliability of Cloud Computing Services: Anju Mishra & Dr. Viresh Sharma & Dr. Ashish Pandey
www.iosrjen.org
1 Anju Mishra, 2 Dr. Viresh Sharma, 3 Dr. Ashish Pandey
1 Department of Computer Application, IEC-CET, Greater Noida, India
2 Department of Mathematics, N.A.S. (P.G.) College, Meerut, India
3 Sapient Consulting, Gurgaon, India
Abstract: - Cloud computing is a recently developed technology for complex systems with massive-scale service sharing, which differs from the resource sharing of grid computing systems. Despite the profound technical challenges involved, reliability is not, at its root, a technical problem, nor will merely technical solutions be sufficient. Instead, deep economic, political, and cultural adjustments will ultimately be required, along with a major, long-term commitment in each sphere to deploy the requisite technical solutions at scale. Nevertheless, technological advances and enablers have a clear role in supporting such change, and information technology (IT) is a natural bridge between technical and social solutions because it can offer improved communication and transparency for fostering the necessary economic, political, and cultural adjustments. Various types of failures are interleaved in the cloud computing environment, such as overflow failure, timeout failure, resource missing failure, network failure, hardware failure, software failure, and database failure. This paper systematically analyzes cloud computing and models the reliability of cloud services. It is a holistic approach that stretches from power to waste to purchasing to education, and a lifecycle management approach to the deployment of IT across an organization, using Markov models, queuing theory and graph theory.
I. INTRODUCTION
Cloud computing as a technology is difficult to define because it is evolving without a clear starting point and with no clear prediction of its future course. Cloud technology seems to be in flux, and it may be one of the foundations of the next generation of computing. It is built on a solid array of fundamental and proven technologies: virtualization, grid computing, service-oriented architectures, distributed computing, broadband networks, the browser as a platform, free and open source software, autonomic systems, web application frameworks, and service level agreements. Cloud computing infrastructures can host a variety of different workloads, including batch-style back-end jobs and interactive, user-facing applications; allow workloads to be deployed and scaled out quickly through the rapid provisioning of virtual machines or physical machines; support redundant, self-recovering, highly scalable programming models that allow workloads to recover from many unavoidable hardware/software failures; and monitor resource use in real time to enable rebalancing of allocations when needed. Cloud computing environments support grid computing by quickly providing physical and virtual servers on which the grid applications can run.
Cloud computing is different from, but related to, grid computing, utility computing and transparent computing. Grid computing [1] is a form of distributed computing whereby a "super and virtual computer" composed of a cluster of networked, loosely coupled computers acts in concert to perform very large tasks. Grid computing involves dividing a large task into many smaller tasks that run in parallel on separate servers. Grids require many computers, typically in the thousands, and commonly use servers, desktops, and laptops. Utility computing [2] is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility such as electricity.
Transparent computing [3] means that complex back-end services are transparent to users, who only see a simple and easy-to-use front-end interface. Cloud computing deployments today are powered by grids, have transparent characteristics and are billed like utilities; cloud computing is thus a natural next step from the grid-utility-transparent model. Based on this model, cloud computing can realize service sharing rather than only the resource sharing introduced by grid computing.
II.
Cloud computing is a term used to describe both a platform and a type of application. A cloud computing platform dynamically provisions, configures, reconfigures, and deprovisions servers as needed. Servers in the cloud can be physical machines or virtual machines.
2.1. Cloud Computing Model
Cloud computing is an umbrella term used to refer to Internet-based development and services. Cloud characteristics are given below:
Remotely hosted: Services or data are hosted on remote infrastructure.
Ubiquitous: Services or data are available from anywhere.
Commoditized: The result is a utility computing model similar to that of traditional utilities, like gas and electricity; you pay for what you want.
Essential characteristics: on-demand self-service, broad network access, resource pooling, location independence, rapid elasticity, and measured service.
2.2. Cloud Service Model
Software as a Service (SaaS): In the SaaS model you are provided with access to application software, often referred to as an on-demand service. SaaS delivers one application to many users, regardless of their location, rather than the traditional model of one application per desktop. With SaaS there is no need to install, set up and run the application; these activities are managed centrally by the service provider.
Platform as a Service (PaaS): PaaS brings the benefits that SaaS brought for applications over to the software development world. PaaS can be defined as a computing platform that allows web applications to be created quickly and easily, without the complexity of buying and maintaining the software and infrastructure underneath it.
2.3. Infrastructure as a Service (IaaS): IaaS is a way of delivering cloud computing infrastructure (servers, storage, network and operating systems) as an on-demand service.
Rather than purchasing servers, software, datacenter space or network equipment, clients instead buy those resources as a fully outsourced service on demand [7].
2. Community cloud: shared infrastructure for a specific community. This cloud computing environment is outside the boundaries of the organization, though it is not necessarily a public cloud. Some external clouds make their cloud infrastructure available to specific other organizations, but not to the general public.
3. Public cloud: sold to the public, mega-scale infrastructure. This environment can be used by the general public, including individuals, corporations and other types of organizations. Typically, public clouds are administered by third parties or vendors over the Internet, and services are offered on a pay-per-use basis. These are also called provider clouds. Business models like SaaS (Software-as-a-Service) and public clouds complement each other and enable companies to leverage shared IT resources and services.
Advantages: Public clouds are widely used in the development, deployment and management of enterprise applications at affordable costs. They allow organizations to deliver highly scalable and reliable applications rapidly and at more affordable costs.
Limitations: Security is a significant concern in public clouds.
4. Hybrid cloud: composition of two or more clouds. This is a combination of both private (internal) and public (external) cloud computing environments.
3. Cloud Management System
There is a cloud management system (CMS), which is composed of a set of servers (either centralized or distributed). The CMS mainly fulfills four different functions:
1. To manage a request queue that receives job requests from different users for cloud services.
2. To manage computing resources (such as PCs, clusters, supercomputers, etc.) all over the Internet.
3. To manage data resources (such as databases, publicized information, URL contents, etc.) all over the Internet.
4. To schedule a request, divide it into different subtasks, and assign the subtasks to different computing resources that may access different data resources over the Internet.
When a user requests a given cloud service, we apply a workflow to describe and manage the cloud service [8]. Fig. 1 depicts a workflow template of a service that includes four different subtasks (S1, S2, S3, S4) and their interrelationship (data dependency), e.g. S3 needs the inputs that result from S1 and S2. It also shows the data resources that the subtasks need to access, e.g. S1 needs to access data resource D1 when running, S2 needs D2 and D3, and S4 needs D4, but S3 needs nothing. With the given workflow of a cloud service, the scheduler in the CMS can assign these subtasks to different computing resources while allocating the data resources, as shown in Fig. 1, e.g. the computing resource C1 is assigned two subtasks, S1 and S3, to run; C5 is a data resource offering data D2, D3 and D4; and C3 is both a computing resource and a data resource, running subtask S2 while offering data D1 and D3. After the computing resources and data resources receive the commands/subtasks from the CMS, they form a network according to the connectivity or accessibility, e.g.
C3 is directly connected with C5, but cannot directly communicate with C4 due to the connectivity (e.g. computers C3 and C4 may be both behind routers that
Fig. 1. Workflow of a Cloud Service and Scheduling
The cloud network shown in Fig. 1 can be very large, and each link in Fig. 1 is actually a virtual link that may go through many components (routers/cables/optical fibers/machines) over a long distance. Thus, the computing resources work together via the network to run the subtasks while accessing the necessary data from the data resources. When the job is finished, the results are returned to the user who requested the service.
1. Failure Analysis of Cloud Service
There are a variety of types of failures that may affect the success/reliability of a cloud service, including Overflow, Timeout, Data resource missing, Computing resource missing, Software failure, Database failure, Hardware failure, and Network failure. We analyze these failures in more detail:
Overflow: The request queue should have a limit on the maximal number of requests waiting in the queue. Otherwise, new requests would have to wait too long in the queue, which could make Timeout failures much more dominant. Therefore, if the queue is full when a new job request arrives, the request is simply dropped and the user is unable to get service, which is called an overflow failure.
Timeout: The cloud service usually has a due time set by the user or the service monitor. If the waiting time of the request in the queue exceeds the due time, a Timeout failure occurs, see e.g. [10]. As a result, timed-out requests are dropped from the queue so as not to affect the requests that follow.
Data resource missing: In the CMS, the data resource manager (DRM) registers all data resources. However, it is possible that some previously registered data are removed but the DRM is not updated. As a result, if those data resources are assigned to a certain job request, they will cause a data resource missing failure.
Computing resource missing: Similarly to the data resource missing failure above, a computing resource may also go missing, for example when a PC is turned off without notifying the CMS.
Software failure: The subtasks are actually software programs running on different computing resources, and these programs may contain software faults, see e.g. [11].
Database failure: The database that stores the required data resources may also fail, so that the running subtasks cannot access the required data.
Hardware failure: The computing resources and data resources in general have hardware (such as computers or servers), which may also encounter hardware failures.
Network failure: When subtasks access remote data, the communication channels may be broken either physically or logically, which causes a network failure, especially for long transmissions of large datasets, see e.g. [12].
The model for cloud computing reliability has to consider all of these types of failures, which is very complicated, and existing reliability models cannot address all of these concerns in
IV.
In this section, we develop a holistic model for Cloud Service Reliability, which is defined as the probability that a cloud service under consideration can be successfully completed for a user in a specified period of time. In particular, this requires that the job request be successfully served by the schedulers in time, that the set of subtasks contained in the service be completed, that the computing/data resources required by the subtasks be available, and that the network be operational during the communications. From the definition of cloud service reliability, it is clear that all types of failures discussed in section 2 will more or less affect the probability of providing a successful service. We classify the above failures in two groups:
1. Request Stage Failures: Overflow and Timeout.
2. Execution Stage Failures: Data resource missing, Computing resource missing, Software failure, Database failure, Hardware failure, and Network failure.
The failures in Group 1 may occur before the job request is successfully assigned to computing/data resources; the failures in Group 2, on the other hand, may occur after the job request has been successfully assigned and during the execution of subtasks. Therefore, the two groups of failures can be deemed independent. Nevertheless, failures within each group are strongly correlated. In summary, the modeling of cloud service reliability can be separated into two parts: modeling of Request Stage Reliability and modeling of Execution Stage Reliability.
4.1. Request Stage Reliability
The request stage contains two types of failures: overflow and timeout. The due time for a specific service is the time allowed from the submission of the job request to the completion of the job. The due time can be set by the user or by the service monitor. If a job request is not served by a scheduler before the due time, it will be dropped. The dropping rate is denoted by $\mu_d$.
Suppose the capacity of the request queue is N (the maximal number of requests in the queue). We assume that the submissions of job requests follow a Poisson process with arrival rate $\lambda_a$. Usually, there are multiple schedule servers to serve the requests. These schedule servers are usually homogeneous, with similar structures, schemes and equipment. Here, we assume that a total of S homogeneous schedule servers are running simultaneously to serve the requests. The service time to complete one request by each schedule server is assumed to be exponentially distributed with parameter $\mu_r$. Thus, this process can be modeled by a Markov process as depicted in Fig. 2, in which state n (n = 0, 1, ..., N) represents the number of requests in the queue.
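The request queue described above can be explored numerically. The following Python sketch is illustrative only. It assumes a birth-death chain in which state n moves to n+1 at the arrival rate (for n < N) and to n-1 at a rate of min(n, S) times the service rate plus max(n - S, 0) times the dropping rate. These downward rates, the function name `request_queue_steady_state` and the numeric values are assumptions made for illustration, not the paper's own balance equations.

```python
def request_queue_steady_state(lam, mu_r, mu_d, servers, capacity):
    """Steady-state probabilities q[0..N] of a finite birth-death request queue.

    Assumed rates (hypothetical, for illustration):
      n -> n+1 : lam                                   (arrival, n < N)
      n -> n-1 : min(n, S)*mu_r + max(n - S, 0)*mu_d   (service or drop, n > 0)
    """
    weights = [1.0]  # unnormalised probability of state 0
    for n in range(1, capacity + 1):
        down = min(n, servers) * mu_r + max(n - servers, 0) * mu_d
        # Detailed balance of a birth-death chain: q[n-1] * lam = q[n] * down
        weights.append(weights[-1] * lam / down)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters: 8 requests per unit time, 2 schedule servers.
q = request_queue_steady_state(lam=8.0, mu_r=3.0, mu_d=0.5, servers=2, capacity=10)

# Probability that an arriving request finds the queue below capacity,
# i.e. that no overflow failure occurs.
r_overflow = sum(q[:-1])
print(f"P(queue full) = {q[-1]:.4f}, R_overflow = {r_overflow:.4f}")
```

Because the chain is one-dimensional, the balance equations reduce to a simple recurrence, so no linear solver is needed for this sketch.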
Fig. 2. Markov model for the request queue.
In Fig. 2, the transition rate from state n to state n+1 is $\lambda_a$. At state N, the arrival of a new request will make the request queue overflow, so the request is dropped and the queue still stays at state N. The service rate
(4)
(n = S, ..., N-1) (5)
$$\sum_{n=0}^{N} q_n = 1 \qquad (6)$$
The probability for the overflow failure NOT to occur is thus
$$R_{overflow} = \sum_{n=0}^{N-1} q_n \qquad (7)$$
where $q_n$ (n = 0, 1, ..., N) can be obtained by solving equations (2)-(6).
4.2. Execution Stage
4.2.1. A New Model
To address the various types of failures during the execution of a cloud service, we propose a new model here. All types of execution stage failures are integrated in this new model, as illustrated by the graph model in Fig. 3.
Fig. 3. A graph model integrating different types of failures at the execution stage.
In this model, hardware (such as a computer) is represented by a solid-line node, so the characteristics of the hardware (such as hardware failures, processing speed, etc.) can be associated with the node.
where $T_w(element_n)$ denotes the length of working time of the n:th element in a cloud service, which can be derived, respectively, as follows. The time that the k:th software program is running on the i:th machine is
$$T_w(Software) = \frac{\text{Software Workload}}{\text{Processing Speed}} = \frac{wp_i}{ps_i} \qquad (12)$$
The time that the m:th communication link is transmitting data is
$$T_w(Communication) = \frac{\text{Amount of Data}}{\text{Bandwidth}} = \frac{\sum d_{ij}}{bw_m} \qquad (13)$$
which means the summation of the execution time of all software programs running on this hardware and the communication time of all channels going through this hardware. The working time for a data source can be calculated as the summation of all communication times that access the data on the data source:
$$T_w(DataSource) = \sum_{Data} T_w(Communication) \qquad (15)$$
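The working-time rules above can be illustrated with a small Python sketch. It uses the ratios of (12) and (13), the sums described for hardware and data sources, and the exponential form exp(-failure rate x working time) that appears in (16) and (18). All function names and numeric values are hypothetical and chosen only for illustration.

```python
import math


def software_time(workload, processing_speed):
    # Eq. (12): working time of a program = workload / processing speed
    return workload / processing_speed


def communication_time(data_amount, bandwidth):
    # Eq. (13): working time of a link = amount of data / bandwidth
    return data_amount / bandwidth


def element_reliability(failure_rate, working_time):
    # Exponential form: probability that the element survives its working time
    return math.exp(-failure_rate * working_time)


# Hypothetical machine running two programs and carrying one transfer.
program_times = [software_time(600.0, 200.0), software_time(100.0, 200.0)]
link_times = [communication_time(50.0, 25.0)]

# Hardware works for the sum of its program and communication times.
hardware_time = sum(program_times) + sum(link_times)
# Eq. (15): a data source works for the sum of the transfers that access it.
data_source_time = sum(link_times)

print(hardware_time, data_source_time)
print(element_reliability(0.01, hardware_time))
```

The point of the sketch is that reliability here depends on how long an element works, rather than being a fixed constant per node or link.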
With the working time derived by equations (12)-(15), the reliability of an individual element can be obtained from (11). This is more realistic and practical than conventional methods [13], which assume that the reliabilities of elements (nodes and links) are constant (e.g. a node is always 90% reliable, regardless of how long it works). In fact, the reliability of an individual element is affected by various conditions such as failure rate, amount of data, bandwidth, operation time, etc.
4.2.3. New Evaluation Algorithm
Though the new graph model and the parameters of its elements are more realistic and practical, they also make the evaluation of overall reliability much more complicated, so that the existing algorithms [13] cannot be directly applied here. For instance, those conventional algorithms make one or more of the following assumptions, which do not hold for the above new model: 1) the network topology is made up of physical nodes/links, without considering virtual nodes/links; 2) the operational probabilities (reliabilities) of nodes or links are constant; 3) only hardware failures of links and processors are considered, without taking into account software, data and resource failures. Therefore, we present a new algorithm for evaluating the overall cloud service reliability that considers all the different factors during the execution stage, given the new graph model and the above parameters. The new evaluation algorithm, based on graph theory and the Bayesian theorem, derives the reliability as follows.
A. Minimal Subtask Spanning Tree (MSST)
The set of all nodes and links involved in completing a specific subtask forms a Subtask Spanning Tree (SST). This SST can be considered a combination of several minimal subtask spanning trees (MSSTs), where each MSST represents a minimal possible combination of available elements (nodes and links) that guarantees the successful execution of this specific subtask (i.e., the failure of any element in an MSST leads to the failure of the subtask). By this definition, each MSST contains exactly one set of data resources without any duplication, because any duplication could be reduced to another, smaller SST. Therefore, for any MSST, the data resources and the precedent subtasks that provide input for the subtask are also determined. One can also obtain the working times of the different elements by (12)-(15). Some elements inside one MSST can still belong to several paths if they are involved in different communication tasks, such as data transmission or data resource access.
Note that all elements in the execution stage are hot-standby, although some elements/subtasks may be waiting for the output of other subtasks. During the waiting period, those elements may therefore also fail. Thus, we suppose that an MSST completes the entire service if none of its elements fails during the maximal time allowed to complete all the subtasks in whose execution they are involved. Therefore, when calculating the element reliability in a given MSST, one has to use the corresponding record with the maximal time. Assume there are a total of K elements in an MSST, and $element_i$ (i = 1, 2, ..., K) denotes the i:th element in the MSST. Accordingly, the working time of the i:th element is denoted by $T_w(element_i)$, and $\lambda(element_i)$ represents its failure rate. The reliability of this single MSST can be simply expressed as
$$R_{MSST} = \prod_{i=1}^{K} \exp\{-\lambda(element_i) \cdot T_w(element_i)\} \qquad (16)$$
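As a minimal illustration of the product form in (16), the Python sketch below multiplies the exponential reliabilities of the elements of one MSST, using the maximal working-time record of each element as described above. The element names, failure rates and time records are hypothetical.

```python
import math


def msst_reliability(elements):
    """Reliability of one MSST.

    elements maps an element name to (failure_rate, working_time_records).
    The maximal record is used because an element must survive the longest
    period during which it is needed.
    """
    exponent = 0.0
    for failure_rate, records in elements.values():
        exponent += failure_rate * max(records)
    # A product of exponentials equals the exponential of the summed exponents.
    return math.exp(-exponent)


# Hypothetical MSST with one machine, one link and one program.
msst = {
    "C1": (0.002, [4.0, 6.5]),       # two records; the larger one is used
    "link C1-C3": (0.010, [1.5]),
    "program S1": (0.005, [3.0]),
}
print(msst_reliability(msst))
```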
With this equation, the reliability of an MSST can be computed once the working times of all its elements are obtained. Hence, finding all the MSSTs and determining the working times of their elements is the first step in deriving the execution reliability of a cloud service. To solve the graph traversal problem, several classical algorithms have been suggested, such as depth-first search, breadth-first search, etc. These algorithms can find all MSSTs in an arbitrary graph. Here, we propose a depth-first search algorithm, which is briefly described as follows:
(17)
In practice, all MESTs can be generated in the following steps:
Step 1: Select an MSST from each set of MSSTs $S_m$ (m = 1, 2, ..., M).
Step 2: Put the M selected MSSTs together to generate an MEST. For each element that the merged trees have in common, record the greater working time as the final working time of this element in the MEST.
Step 3: Repeat Steps 1-2 until all combinations have been tried, generating all N MESTs.
Similar to (16), the reliability of a single MEST can be calculated by
$$R_{MEST} = \prod_{i \in MEST} \exp\{-\lambda(element_i) \cdot T_w(element_i)\} \qquad (18)$$
C. Execution Reliability
Having the list of N MESTs and the corresponding task completion time, one can determine the reliability of cloud service at the execution stage, as follows.
$$R_{execute} = \Pr\left(\bigcup_{i=1}^{N} MEST_i\right) \qquad (19)$$
which means that the success of any one MEST out of the total N MESTs makes the cloud service succeed at the execution stage. Denote by $E_j$ the successful operation of MEST j and by $\bar{E}_j$ the failure of MEST j. Using the Bayesian theorem on conditional probability, we can expand (19) into a summation of conditional probabilities:
$$R_{execute} = \Pr\left(\bigcup_{i=1}^{N} MEST_i\right) = \sum_{j=1}^{N} \Pr(E_j) \cdot \Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j) \qquad (20)$$
The probability $\Pr(E_j)$ can be directly obtained from (18) as $R_{MEST_j}$, and the probability $\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j)$ can be computed by the following two-step algorithm. Step 1 identifies the failures of all the critical elements, in a period of time, that lead to the failure of any one of the previous j-1 MESTs but do not affect MEST j. Step 2 generates, by a binary search, all the possible combinations of the identified critical elements that lead to the event $\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j$, and computes the probabilities of those combinations. Their summation is $\Pr(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j)$.
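The MEST generation steps and the union probability above can be sketched in Python as follows. This is not the two-step algorithm of the paper: the sketch swaps in plain inclusion-exclusion over subsets of MESTs, which is exact if elements fail independently with exponential lifetimes counted from the start of execution (the hot-standby assumption), but grows exponentially with the number of MESTs. All names and numbers are hypothetical.

```python
import math
from itertools import combinations, product


def merge_trees(trees):
    """Merge trees, keeping the greater working time of each shared element."""
    merged = {}
    for tree in trees:
        for element, t in tree.items():
            merged[element] = max(t, merged.get(element, 0.0))
    return merged


def generate_mests(msst_sets):
    """Pick one MSST per subtask, over all combinations, and merge each pick."""
    return [merge_trees(choice) for choice in product(*msst_sets)]


def tree_reliability(tree, failure_rate):
    """Product of exp(-lambda_i * Tw_i) over the elements of a tree, as in (18)."""
    return math.exp(-sum(failure_rate[e] * t for e, t in tree.items()))


def execution_reliability(mests, failure_rate):
    """Pr(at least one MEST succeeds), by inclusion-exclusion.

    A subset of MESTs succeeds jointly when every element survives the
    longest time any of them requires, which is the merged tree's reliability.
    """
    total = 0.0
    for size in range(1, len(mests) + 1):
        sign = 1.0 if size % 2 else -1.0
        for subset in combinations(mests, size):
            total += sign * tree_reliability(merge_trees(subset), failure_rate)
    return total


# Hypothetical service with two subtasks; each dict maps element -> working time.
msst_sets = [
    [{"C1": 3.0, "L1": 1.0}, {"C2": 4.0, "L2": 1.0}],  # alternatives for subtask 1
    [{"C1": 2.0, "L3": 0.5}],                          # only option for subtask 2
]
rates = {"C1": 0.01, "C2": 0.02, "L1": 0.05, "L2": 0.05, "L3": 0.03}

mests = generate_mests(msst_sets)
r_execute = execution_reliability(mests, rates)
print(len(mests), r_execute)
```

For services with many MESTs, a decomposition such as the conditional-probability expansion in (20) avoids enumerating every subset.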
where $R_{request}$ can be derived from the reliability of the request stage by (10), and $R_{execute}$ can be derived from the reliability of the execution stage by (20) [15].
V.
In this paper, reliability modeling and analysis of cloud services are conducted. We first elaborate the various types of possible failures in a cloud service, based on which a holistic reliability model is developed. A new algorithm is proposed to evaluate cloud service reliability based on the developed model. The developed cloud service reliability model and evaluation algorithm, however, are yet to be validated by simulation and real-life data.
REFERENCES
[1] I. Foster, C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Los Altos, Morgan Kaufmann, 2003.
[2] C.S. Yeo, R. Buyya, M.D. de Assunção, et al. Utility Computing on Global Grids. Technical Report, GRIDS-TR-2006-7, Grid Computing and Distributed Systems Laboratory, The University of Melbourne, Australia, 2006.
[3] Y. Zhang, Y. Zhou. Transparent computing: A new paradigm for pervasive computing. Proceedings of the 3rd International Conference on Ubiquitous Intelligence and Computing (UIC-06), LNCS 4145, 1-11, 2006.
[4] Y.S. Dai, Y. Pan, X.K. Zou. A hierarchical modeling and analysis for grid service reliability. IEEE Transactions on Computers, 56(5), 681-691, 2007.
[5] M.L. Shooman. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis and Design. New York: John Wiley & Sons, Inc., 2002.
[6] M. Xie, Y.S. Dai, K.L. Poh. Computing System Reliability: Models and Analysis. New York: Kluwer Academic Publishers, 2004.
[7] http://diversity.net.nz/wp-content/uploads/2011/01/Moving-to-the-Clouds.pdf
[8] L. Xing, Y.S. Dai. A new decision diagram model for efficient analysis on multi-state systems. IEEE Transactions on Dependable and Secure Computing, accepted for publication, 2008, IEEE Press.
[9] X. Zou, Y.S. Dai, Y. Pan. Trust and Security in Collaborative Computing. World Scientific, Hackensack, NJ, U.S.A., 2008, ISBN: 981-270-368-3.
[10] D. Abramson, R. Buyya, J. Giddy. A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems, 18(8), 1061-1074, 2002.
[11] Y.S. Dai, M. Xie, K.L. Poh. Reliability of grid service systems. Computers & Industrial Engineering, 50(1-2), 130-147, 2006.
[12] Y.S. Dai, M. Xie, K.L. Poh. Reliability Analysis of Grid Computing Systems. The 9th IEEE Pacific Rim Symposium on Dependable Computing (PRDC2002), IEEE Computer Press, 2002, pp. 97-103.
[13] M. Xie, Y.S. Dai, K.L. Poh. Computing Systems Reliability: Models and Analysis (330 pages). Springer: New York, U.S.A., 2004. ISBN: 0-306-48496-X.
[14] B. Yang, M. Xie. A study of operational and testing reliability in software reliability analysis. Reliability Engineering & System Safety, 70(3), 323-329, 2000.
[15] Yuan-Shun Dai, Bo Yang, Jack Dongarra, Gewei Zhang. Cloud Service Reliability: Modeling and Analysis.