Chapter 5: Availability: © Len Bass, Paul Clements, Rick Kazman, Distributed Under Creative Commons Attribution License
Chapter Outline
What is Availability?
Availability General Scenario
Tactics for Availability
A Design Checklist for Availability
Summary
What is Availability?
Availability refers to a property of software
that it is there and ready to carry out its task
when you need it to be.
This is a broad perspective and
encompasses what is normally called
reliability.
Availability builds on reliability by adding the
notion of recovery (repair).
Fundamentally, availability is about
minimizing service outage time by
mitigating faults.
Principal properties
Availability
The probability that the system will be up and
running and able to deliver useful services to users.
Reliability
The probability that the system will correctly deliver
services as expected by users.
Safety
A judgment of how likely it is that the system will
cause damage to people or its environment.
Security
A judgment of how likely it is that the system can
resist accidental or deliberate intrusions.
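Since availability is defined above as a probability, a small worked calculation may help. The sketch below assumes the commonly used steady-state formula, MTBF / (MTBF + MTTR), which is not stated on these slides; the function names and the example figures are illustrative only.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming the steady-state formula
    availability = MTBF / (MTBF + MTTR).

    mtbf_hours: mean time between failures (must be positive)
    mttr_hours: mean time to repair (must be non-negative)
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(availability: float) -> float:
    """Expected outage time over a 365-day year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    a = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability = {a:.4f}")
    print(f"downtime/year = {expected_downtime_hours_per_year(a):.2f} h")
```

The formula shows why recovery matters as much as reliability: shortening repair time raises availability even when the failure rate is unchanged.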
Chapter 11 Security and Dependability
Causes of failure
Hardware failure
Hardware fails because of design and
manufacturing errors or because
components have reached the end of their
natural life.
Software failure
Software fails due to errors in its
specification, design or implementation.
Operational failure
Human operators make mistakes. Now
perhaps the largest single cause of system
failures in socio-technical systems.
Dependability attribute
Safe system operation depends on
the system being available and
operating reliably.
A system may be unreliable because
its data has been corrupted by an
external attack.
Denial of service attacks on a system
are intended to make it unavailable.
If a system is infected with a virus,
you cannot be confident in its
reliability.
Availability General Scenario
[Table: Portion of Scenario / Possible Values; only the row label "Source" survives]
Availability Tactics
Detect Faults
Ping/echo: asynchronous request/response
message pair exchanged between nodes, used
to determine reachability and the round-trip
delay through the associated network path.
Monitor: a component used to monitor the state
of health of other parts of the system. A system
monitor can detect failure or congestion in the
network or other shared resources, such as from
a denial-of-service attack.
Heartbeat: a periodic message exchange
between a system monitor and a process being
monitored.
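The heartbeat tactic can be sketched as a monitor that records when each process last reported in and flags any process that has been silent for too long. This is a minimal single-process illustration, not a networked implementation; the class name, the timeout parameter, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each monitored process (hypothetical API)."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock                      # injectable so tests can control time
        self._last_seen: Dict[str, float] = {}

    def beat(self, process_id: str) -> None:
        """Called whenever a heartbeat message arrives from process_id."""
        self._last_seen[process_id] = self._clock()

    def suspected_failed(self) -> List[str]:
        """Processes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(pid for pid, seen in self._last_seen.items()
                      if now - seen > self.timeout_s)


if __name__ == "__main__":
    fake_time = [0.0]
    monitor = HeartbeatMonitor(timeout_s=2.0, clock=lambda: fake_time[0])
    monitor.beat("worker-a")
    monitor.beat("worker-b")
    fake_time[0] = 1.5
    monitor.beat("worker-a")          # worker-b stays silent
    fake_time[0] = 3.0
    print(monitor.suspected_failed())
```

A missed heartbeat only makes a failure suspected: the process may be slow or the network congested, so the timeout has to be chosen against expected delays.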
Detect Faults
Timestamp: used to detect incorrect
sequences of events, primarily in distributed
message-passing systems.
Sanity Checking: checks the validity or
reasonableness of a component's operations
or outputs; typically based on a knowledge of
the internal design, the state of the system, or
the nature of the information under scrutiny.
Condition Monitoring: checking conditions in a
process or device, or validating assumptions
made during the design.
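As an illustration of the timestamp tactic, the sketch below scans messages in arrival order and reports any message whose timestamp does not advance past the previous one from the same sender. The (sender, timestamp) message shape and the function name are assumptions for the example; real systems often use logical clocks or sequence numbers rather than wall-clock time.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Event = Tuple[Hashable, int]   # (sender id, timestamp or sequence number)


def find_out_of_order(events: Iterable[Event]) -> List[Event]:
    """Return events that arrive with a timestamp not greater than the
    latest timestamp already seen from the same sender."""
    latest: Dict[Hashable, int] = {}
    out_of_order: List[Event] = []
    for sender, stamp in events:
        if sender in latest and stamp <= latest[sender]:
            out_of_order.append((sender, stamp))   # stale or duplicated message
        else:
            latest[sender] = stamp
    return out_of_order


if __name__ == "__main__":
    arrivals = [("a", 1), ("b", 1), ("a", 3), ("a", 2), ("b", 2)]
    print(find_out_of_order(arrivals))
```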
Detect Faults
Voting: to check that replicated components
are producing the same results. Comes in
various flavors: replication, functional
redundancy, analytic redundancy.
Exception Detection: detection of a system
condition that alters the normal flow of
execution, e.g. system exception, parameter
fence, parameter typing, timeout.
Self-test: procedure for a component to test
itself for correct operation.
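The voting tactic can be sketched as a majority voter over the outputs of replicated components. The sketch assumes outputs are directly comparable, hashable values and that a strict majority is required; the function and exception names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


class NoMajorityError(Exception):
    """Raised when no single output is produced by more than half the replicas."""


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise NoMajorityError(f"no majority among {list(outputs)!r}")
    return value


if __name__ == "__main__":
    # Three replicas; one has produced a faulty result.
    print(majority_vote([42, 42, 41]))
```

With three replicas the voter masks one faulty output; if all replicas disagree, the fault is detected but cannot be masked.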
Prevent Faults
Removal From Service: temporarily placing a
system component in an out-of-service state for
the purpose of mitigating potential system failures.
Transactions: bundling state updates so that
asynchronous messages exchanged between
distributed components are atomic, consistent,
isolated, and durable.
Predictive Model: monitor the state of health of a
process to ensure that the system is operating
within nominal parameters; take corrective action
when conditions are detected that are predictive of
likely future faults.
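One very simple form of the predictive model tactic is to smooth a health metric and raise a warning before it reaches a hard limit. The sketch below uses an exponentially weighted moving average, which is an assumption chosen for brevity rather than a method prescribed by the slides; the class name, the threshold, and the smoothing factor are illustrative.

```python
from typing import Optional


class PredictiveHealthModel:
    """Smooths a health metric and warns when the trend leaves nominal range."""

    def __init__(self, warn_threshold: float, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.warn_threshold = warn_threshold
        self.alpha = alpha                       # weight given to the newest sample
        self.smoothed: Optional[float] = None

    def observe(self, value: float) -> bool:
        """Record a sample; return True if corrective action is advisable."""
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
        return self.smoothed > self.warn_threshold


if __name__ == "__main__":
    model = PredictiveHealthModel(warn_threshold=80.0)   # e.g. percent memory used
    for sample in (60.0, 90.0, 95.0):
        print(sample, model.observe(sample))
```

Smoothing keeps a single spike from triggering corrective action, while a sustained rise does; the corrective action itself could be the removal-from-service tactic described above.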
Prevent Faults
Exception Prevention: preventing
system exceptions from occurring by
masking a fault, or preventing it via
smart pointers, abstract data types,
wrappers.
Increase Competence Set: designing
a component to handle more cases
(faults) as part of its normal
operation.
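To illustrate increasing the competence set, the sketch below shows a parser that treats missing, malformed, and out-of-range input as ordinary cases with a defined result instead of raising an exception. The function name, the default port, and the choice to fall back to a default are assumptions for the example.

```python
from typing import Optional


def parse_port(raw: Optional[str], default: int = 8080) -> int:
    """Parse a TCP port, handling bad input as part of normal operation."""
    if raw is None:
        return default                 # missing value: a normal case, not a fault
    text = raw.strip()
    if not text:
        return default                 # empty string
    try:
        port = int(text)
    except ValueError:
        return default                 # non-numeric input
    if not 1 <= port <= 65535:
        return default                 # outside the valid port range
    return port


if __name__ == "__main__":
    for candidate in ("443", None, "abc", "70000", " 22 "):
        print(repr(candidate), "->", parse_port(candidate))
```

Callers no longer need their own error handling for these inputs, at the cost of silently accepting them; whether that trade-off is acceptable depends on the application.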
Allocation of Responsibilities
Summary
Availability refers to the ability of the
system to be available for use when a
fault occurs.
The fault must be recognized (or
prevented) and then the system must
respond.
The response will depend on the criticality
of the application and the type of fault, and
can range from "ignore it" to "keep on going
as if it didn't occur".
Summary
Tactics for availability are categorized into detect
faults, recover from faults and prevent faults.
Detection tactics depend on detecting signs of
life from various components.
Recovery tactics are retrying an operation or
maintaining redundant data or computations.
Prevention tactics depend on removing elements
from service or limiting the scope of faults.
All availability tactics involve the coordination
model.
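The retry recovery tactic mentioned in the summary can be sketched as a bounded retry wrapper. It assumes the fault is transient and the operation is safe to repeat; the function name, the attempt count, and the fixed delay are illustrative choices.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, delay_s: float = 0.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or the attempt budget is spent.

    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise                  # budget exhausted: report the failure
            sleep(delay_s)             # wait before the next attempt
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient fault")
        return "ok"

    print(retry(flaky, attempts=3))
```

Bounding the number of attempts matters: retrying without limit against a permanent fault turns one failure into a sustained outage.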