Definition 2.1: Reliability is the probability of success or the probability that the
system will perform its intended function under specified design limits.
More specifically, reliability is the probability that a product or part will operate
properly for a specified period of time (the design life) under the design operating
conditions (such as temperature, voltage, etc.) without failure. In other words,
reliability may be used as a measure of the system’s success in providing its
function properly. Reliability is one of the quality characteristics that consumers
require from the manufacturer of products.
Mathematically, reliability R(t) is the probability that a system will be
successful in the interval from time 0 to time t:
R(t) = P(T > t),   t ≥ 0        (2.1)
where T is a random variable denoting the time-to-failure or failure time.
Unreliability F(t), a measure of failure, is defined as the probability that the sys-
tem will fail by time t:
F(t) = P(T ≤ t),   t ≥ 0
In other words, F(t) is the failure distribution function. If the time-to-failure
random variable T has a density function f(t), then
R(t) = ∫_t^∞ f(s) ds
or, equivalently,
f(t) = −(d/dt) R(t)
The density function can be mathematically described in terms of T:
f(t) = lim_{Δt→0} P(t < T ≤ t + Δt) / Δt

Thus f(t)Δt can be interpreted as the approximate probability that the failure time T will fall between
the operating time t and the next interval of operation, t + Δt.
Consider a new and successfully tested system that operates well when put
into service at time t = 0. The system becomes less likely to remain successful as
the time interval increases. The probability of success for an infinite time interval,
of course, is zero.
Thus, the reliability starts at one when the system is put into service and eventually decreases to
zero. Clearly, reliability is a function of mission time. For example,
one can say that the reliability of the system is 0.995 for a mission time of 24
hours. However, a statement such as the reliability of the system is 0.995 is
meaningless because the time interval is unknown.
Example 2.1: A computer system has an exponential failure time density function
f(t) = (1/9000) e^(−t/9000),   t ≥ 0
What is the probability that the system will fail after the warranty (six months or
4380 hours) and before the end of the first year (one year or 8760 hours)?
Solution: The desired probability is

P(4380 < T ≤ 8760) = R(4380) − R(8760) = e^(−4380/9000) − e^(−8760/9000) = 0.615 − 0.378 = 0.237
This indicates that the probability of failure during the interval from six months to
one year is 23.7%.
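As a quick numerical check of Example 2.1, the following minimal Python sketch (standard library only; it simply evaluates R(t) = e^(−t/θ) at the two endpoints and takes the difference) reproduces the 0.237 figure:

```python
import math

def exp_reliability(t, theta):
    """R(t) = exp(-t/theta) for an exponential failure-time model."""
    return math.exp(-t / theta)

theta = 9000.0        # mean time to failure in hours (Example 2.1)
t_warranty = 4380.0   # six months, in hours
t_year = 8760.0       # one year, in hours

# P(fail after warranty but before one year) = R(4380) - R(8760)
p = exp_reliability(t_warranty, theta) - exp_reliability(t_year, theta)
print(f"P(4380 < T <= 8760) = {p:.3f}")   # ~0.237
```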
If the time to failure is described by an exponential failure time density
function, then
f(t) = (1/θ) e^(−t/θ),   t ≥ 0, θ > 0        (2.2)
and this will lead to the reliability function
R(t) = ∫_t^∞ (1/θ) e^(−s/θ) ds = e^(−t/θ),   t ≥ 0        (2.3)
Consider the Weibull distribution where the failure time density function is
given by
f(t) = (β t^(β−1) / θ^β) e^(−(t/θ)^β),   t ≥ 0, θ > 0, β > 0
Then the reliability function is
R(t) = e^(−(t/θ)^β),   t ≥ 0
Thus, given a particular failure time density function or failure time distribution
function, the reliability function can be obtained directly. Section 2.2 provides
further insight for specific distributions.
Substituting

f(t) = −(d/dt) R(t)

into equation (2.4), the definition of the mean time to failure MTTF = ∫_0^∞ t f(t) dt, and performing integration by parts, we obtain
MTTF = −∫_0^∞ t d[R(t)]
     = [−t R(t)] |_0^∞ + ∫_0^∞ R(t) dt        (2.5)
The first term on the right-hand side of equation (2.5) equals zero at both
limits, since the system must fail after a finite amount of operating time. Therefore,
we must have tR(t) → 0 as t → ∞ . This leaves the second term, which equals
MTTF = ∫_0^∞ R(t) dt        (2.6)
Suppose, for example, that a component MTTF of 3660 hours has been estimated by simply averaging
observed failure times, yet one of the components failed after only 920 hours. It is therefore
important to note that the system MTTF denotes only the average time to failure. It is
neither the failure time that could be expected 50% of the time, nor is it the guaranteed
minimum time of system failure.
A careful examination of equation (2.6) will show that two failure distributions
can have the same MTTF and yet produce different reliability levels. This is
illustrated in a case where the MTTFs are equal, but with normal and exponential
failure distributions. The normal failure distribution is symmetrical about its mean,
thus
R( MTTF ) = P ( Z ≥ 0) = 0.5
where Z is a standard normal random variable. When we compute for the
exponential failure distribution using equation (2.3), recognizing that θ = MTTF,
the reliability at the MTTF is
R(MTTF) = e^(−MTTF/MTTF) = e^(−1) = 0.368
Clearly, the reliability in the case of the exponential distribution is about 74% of
that for the normal failure distribution with the same MTTF.
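A short sketch of this comparison (Python standard library; the common MTTF of 1000 hours is an arbitrary illustration, not a value from the text). It evaluates both reliabilities at t = MTTF:

```python
import math

def normal_reliability(t, mu, sigma):
    """R(t) = P(T > t) for a normal failure-time model (standard normal cdf via erf)."""
    z = (t - mu) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exp_reliability(t, theta):
    return math.exp(-t / theta)

mttf = 1000.0  # assumed common MTTF
print(normal_reliability(mttf, mu=mttf, sigma=200.0))  # 0.5, regardless of sigma
print(exp_reliability(mttf, theta=mttf))               # exp(-1) ~ 0.368
```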
The probability that a failure occurs in a given interval [t1, t2] can be expressed in terms of the reliability function as

∫_{t1}^{t2} f(t) dt = ∫_{t1}^∞ f(t) dt − ∫_{t2}^∞ f(t) dt = R(t1) − R(t2)
or in terms of the failure distribution function (or the unreliability function) as
∫_{t1}^{t2} f(t) dt = ∫_{−∞}^{t2} f(t) dt − ∫_{−∞}^{t1} f(t) dt = F(t2) − F(t1)
The rate at which failures occur in a certain time interval [t1, t2] is called the
failure rate. It is defined as the probability that a failure per unit time occurs in the
interval, given that a failure has not occurred prior to t1, the beginning of the
interval. Thus, the failure rate is
[R(t1) − R(t2)] / [(t2 − t1) R(t1)]
Note that the failure rate is a function of time. If we redefine the interval as
[t , t + Δt ], the above expression becomes
[R(t) − R(t + Δt)] / [Δt R(t)]
The rate in the above definitions is expressed as failures per unit time, when in
reality, the time units might be in terms of miles, hours, etc. The hazard function is
defined as the limit of the failure rate as the interval approaches zero. Thus, the
hazard function h(t) is the instantaneous failure rate, and is defined by
h(t) = lim_{Δt→0} [R(t) − R(t + Δt)] / [Δt R(t)]
     = (1/R(t)) (−d/dt R(t))        (2.7)
     = f(t) / R(t)
The quantity h(t)dt represents the probability that a device of age t will fail in the
small interval of time t to (t + dt). The importance of the hazard function is that it
indicates how the failure rate changes over the life of a population of components;
plotting the hazard functions of competing designs on a single axis makes such changes easy to compare.
For example, two designs may provide the same reliability at a specific point in time, yet their
failure rates up to that point can differ. In statistical theory, the death rate is analogous to the
failure rate, and the force of mortality is analogous to the hazard function. Thus, the hazard
function (also called the hazard rate or failure rate function) is the ratio of the probability
density function (pdf) to the reliability function.
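To make the ratio h(t) = f(t)/R(t) concrete, here is a minimal sketch (the Weibull parameters are assumed illustration values, not from the text) that evaluates the hazard rate for an exponential model (β = 1, constant hazard) and a Weibull model (β = 2, increasing hazard):

```python
import math

def weibull_pdf(t, theta, beta):
    return (beta * t**(beta - 1) / theta**beta) * math.exp(-(t / theta)**beta)

def weibull_reliability(t, theta, beta):
    return math.exp(-(t / theta)**beta)

def hazard(t, theta, beta):
    """h(t) = f(t)/R(t); for the Weibull this equals beta * t**(beta-1) / theta**beta."""
    return weibull_pdf(t, theta, beta) / weibull_reliability(t, theta, beta)

for t in (10.0, 100.0, 1000.0):
    print(f"t={t:7.1f}  h_exponential={hazard(t, 500.0, 1.0):.6f}"
          f"  h_weibull(beta=2)={hazard(t, 500.0, 2.0):.6f}")
```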
Maintainability
When a system fails to perform satisfactorily, repair is normally carried out to
locate and correct the fault. The system is restored to operational effectiveness by
making an adjustment or by replacing a component.
Maintainability is defined as the probability that a failed system will be
restored to specified conditions within a given period of time when maintenance is
performed according to prescribed procedures and resources. In other words,
maintainability is the probability of isolating and repairing a fault in a system
within a given time. Maintainability engineers must work with system designers to
ensure that the system product can be maintained by the customer efficiently and
cost effectively. This function requires the analysis of part removal, replacement,
tear-down, and build-up of the product in order to determine the required time to
carry out the operation, the necessary skill, the type of support equipment and the
documentation.
Let T denote the random variable of the time to repair or the total downtime. If
the repair time T has a repair time density function g(t), then the maintainability,
V(t), is defined as the probability that the failed system will be back in service by
time t, i.e.,
V(t) = P(T ≤ t) = ∫_0^t g(s) ds

If, for example, the repair time T has an exponential density with a constant repair rate μ, then

V(t) = 1 − e^(−μt)

which represents the exponential form of the maintainability function.
An important measure often used in maintenance studies is the mean time to
repair (MTTR) or the mean downtime. MTTR is the expected value of the random
variable repair time, not failure time, and is given by
MTTR = ∫_0^∞ t g(t) dt
When the distribution has a repair time density given by g (t ) = μ e− μ t , then, from
the above equation, MTTR = 1/ μ . When the repair time T has the log normal
density function g(t), and the density function is given by
g(t) = (1/(σ t √(2π))) e^(−(ln t − μ)²/(2σ²)),   t > 0
then it can be shown that
MTTR = m e^(σ²/2)
where m denotes the median of the log normal distribution.
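A one-line numerical sketch of this MTTR relation (the median of 2 hours and σ = 0.8 are assumed illustration values, not from the text):

```python
import math

def lognormal_mttr(median, sigma):
    """MTTR = m * exp(sigma**2 / 2) for a log normal repair-time distribution."""
    return median * math.exp(sigma**2 / 2.0)

print(f"{lognormal_mttr(median=2.0, sigma=0.8):.2f} hours")  # ~2.75 hours
```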
In order to design and manufacture a maintainable system, it is necessary to
predict the MTTR for various fault conditions that could occur in the system. This
is generally based on past experiences of designers and the expertise available to
handle repair work.
The system repair time consists of two separate intervals: passive repair time
and active repair time. Passive repair time is mainly determined by the time taken
by service engineers to travel to the customer site. In many cases, the cost of travel
time exceeds the cost of the actual repair. Active repair time is directly affected by
the system design and is listed as follows:
1. The time between the occurrence of a failure and the system user
becoming aware that it has occurred.
2. The time needed to detect a fault and isolate the replaceable
component(s).
3. The time needed to replace the faulty component(s).
4. The time needed to verify that the fault has been corrected and the system
is fully operational.
The active repair time can be improved significantly by designing the system in
such a way that faults may be quickly detected and isolated. As more complex
systems are designed, it becomes more difficult to isolate the faults.
Availability
Reliability is a measure that requires system success for an entire mission time. No
failures or repairs are allowed. Space missions and aircraft flights are examples of
systems where failures or repairs are not allowed. Availability is a measure that
allows for a system to repair when failure occurs.
The availability of a system is defined as the probability that the system is
successful at time t. Mathematically,
Availability = System up time / (System up time + System down time) = MTTF / (MTTF + MTTR)
Availability is a measure of success used primarily for repairable systems. For
non-repairable systems, availability, A(t), equals reliability, R(t). In repairable
systems, A(t) will be equal to or greater than R(t).
The mean time between failures (MTBF) is an important measure in repairable
systems. This implies that the system has failed and has been repaired. Like MTTF
and MTTR, MTBF is an expected value of the random variable time between
failures. Mathematically,
MTBF = MTTF + MTTR
The term MTBF has been widely misused. In practice, MTTR is much smaller than MTTF, so MTBF is
approximately equal to MTTF; as a result, MTBF is often incorrectly substituted for MTTF, which
applies to both repairable and non-repairable systems. If the MTTR can be reduced, availability
will increase, and the system will be more economical.
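A small sketch of the availability formula (the MTTF and MTTR values are arbitrary illustrations, not from the text), showing how reducing MTTR raises availability:

```python
def availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

print(f"{availability(2000.0, 100.0):.4f}")  # ~0.9524
print(f"{availability(2000.0, 10.0):.4f}")   # ~0.9950
```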
A system where faults are rapidly diagnosed is more desirable than a system
that has a lower failure rate but where the cause of a failure takes longer to detect,
resulting in lengthy system downtime. When the system being tested is renewed through maintenance
and repairs, the expected value E(T) of the time between failures is also known as the MTBF.
Binomial Distribution
The binomial distribution is one of the most widely used discrete random variable
distributions in reliability and quality inspection. It has applications in reliability
engineering, e.g., when one is dealing with a situation in which an event is either a
success or a failure.
The pdf of the distribution is given by

P(X = x) = C(n, x) p^x (1 − p)^(n−x),   x = 0, 1, 2, ..., n

where the binomial coefficient is

C(n, x) = n! / [x! (n − x)!]

and n = number of trials, x = number of successes, and p = single-trial probability of success.

The reliability function R(k) (i.e., the probability that at least k out of n items are good) is given by

R(k) = Σ_{x=k}^{n} C(n, x) p^x (1 − p)^(n−x)
Example 2.2: Suppose in the production of lightbulbs, 90% are good. In a random
sample of 20 lightbulbs, what is the probability of obtaining at least 18 good
lightbulbs?
R(18) = Σ_{x=18}^{20} C(20, x) (0.9)^x (0.1)^(20−x) = 0.677
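A quick check of Example 2.2 with the Python standard library (math.comb), summing the binomial pmf from x = 18 to 20:

```python
from math import comb

def binomial_reliability(k, n, p):
    """R(k) = P(at least k good items out of n), each good with probability p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

print(f"{binomial_reliability(18, 20, 0.9):.3f}")  # ~0.677
```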
Poisson Distribution
Although the Poisson distribution can be used in a manner similar to the binomial
distribution, it is used to deal with events in which the sample size is unknown.
This is also a discrete random variable distribution whose pdf is given by
P(X = x) = (λt)^x e^(−λt) / x!   for x = 0, 1, 2, ...

where λ = constant failure rate and x = number of events. In other words, P(X = x) is the
probability that exactly x failures occur in time t. Therefore, the Poisson reliability function
R(k) (the probability of k or fewer failures in time t) is given by

R(k) = Σ_{x=0}^{k} (λt)^x e^(−λt) / x!
This distribution can be used to determine the number of spares required for the
reliability of standby redundant systems during a given mission.
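A minimal sketch of this spares calculation (the failure rate of 0.001 failures/hour and the 1000-hour mission are assumed illustration values, not from the text): it increases k until the probability of k or fewer failures reaches a 95% target.

```python
import math

def poisson_reliability(k, lam, t):
    """R(k) = P(k or fewer failures in time t) for a constant failure rate lam."""
    return sum((lam * t)**x * math.exp(-lam * t) / math.factorial(x)
               for x in range(k + 1))

lam, t = 0.001, 1000.0   # assumed failure rate and mission time
k = 0
while poisson_reliability(k, lam, t) < 0.95:
    k += 1
print(k, round(poisson_reliability(k, lam, t), 3))   # 3 spares, ~0.981
```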
Exponential Distribution
Exponential distribution plays an essential role in reliability engineering because it
has a constant failure rate. This distribution has been used to model the lifetime of
electronic and electrical components and systems. This distribution is appropriate
when a used component that has not failed is as good as a new component - a
rather restrictive assumption. Therefore, it must be used diplomatically since
numerous applications exist where the restriction of the memoryless property may
not apply. For this distribution, we have reproduced equations (2.2) and (2.3),
respectively:
f(t) = (1/θ) e^(−t/θ) = λ e^(−λt),   t ≥ 0

R(t) = e^(−t/θ) = e^(−λt),   t ≥ 0

where θ = 1/λ > 0 is the MTTF parameter and λ ≥ 0 is the constant failure rate.
The hazard function or failure rate for the exponential density function is
constant, i.e.,
h(t) = f(t)/R(t) = [(1/θ) e^(−t/θ)] / e^(−t/θ) = 1/θ = λ
The failure rate for this distribution is the constant λ, which is the main reason for the
widespread use of this distribution. Because of its constant failure rate property, the
exponential is an excellent model for the long flat “intrinsic failure” portion of the
bathtub curve. Since most parts and systems spend most of their lifetimes in this
portion of the bathtub curve, this justifies frequent use of the exponential (when
early failures or wear out is not a concern). The exponential model works well for
inter-arrival times. When these events trigger failures, the exponential lifetime
model can be used.
We will now discuss some properties of the exponential distribution that are
useful in understanding its characteristics, when and where it can be applied.
Property 2.2: If T1, T2, ..., Tn, are independently and identically distributed
exponential random variables (RVs) with a constant failure rate λ, then
2λ Σ_{i=1}^{n} Ti ~ χ²(2n)

where χ²(2n) is a chi-squared distribution with 2n degrees of freedom. This result is useful, for
example, for constructing confidence intervals for λ.
Normal Distribution
Normal distribution plays an important role in classical statistics owing to the
Central Limit Theorem. In reliability engineering, the normal distribution primarily
applies to measurements of product susceptibility and external stress. This two-
parameter distribution is used to describe systems in which failure results from some wearout
effect, as is the case for many mechanical systems.
The normal distribution takes the well-known bell shape. This distribution is
symmetrical about the mean, and its spread is measured by the variance: the larger the variance,
the flatter the distribution. The pdf is given by
f(t) = (1/(σ√(2π))) e^(−(1/2)((t − μ)/σ)²),   −∞ < t < ∞
where μ is the mean value and σ is the standard deviation. The cumulative distri-
bution function (cdf) is
F(t) = ∫_{−∞}^{t} (1/(σ√(2π))) e^(−(1/2)((s − μ)/σ)²) ds
The reliability function is
R(t) = ∫_{t}^{∞} (1/(σ√(2π))) e^(−(1/2)((s − μ)/σ)²) ds
There is no closed form solution for the above equation. However, tables for the
standard normal density function are readily available (see Table A1.1 in Appendix
1) and can be used to find probabilities for any normal distribution. If
Z = (T − μ)/σ
is substituted into the normal pdf, we obtain
f(z) = (1/√(2π)) e^(−z²/2),   −∞ < z < ∞
This is a so-called standard normal pdf, with a mean value of 0 and a standard
deviation of 1. The standardized cdf is given by
Φ(t) = ∫_{−∞}^{t} (1/√(2π)) e^(−s²/2) ds        (2.9)
where Φ is the standard normal cdf, the function tabulated in standard normal tables.
The hazard function for a normal distribution is a monotonically increasing
function of t. This can be easily shown by proving that h’(t) ≥ 0 for all t. Since
h(t) = f(t)/R(t)

then (see Problem 15)

h′(t) = [R(t) f′(t) + f²(t)] / R²(t) ≥ 0        (2.10)
One can try this proof by employing the basic definition of a normal density
function f.
Example 2.5: A part has a normal distribution of failure times with μ = 40,000 cycles and
σ = 2000 cycles. Find the reliability of the part at 38,000 cycles.

Solution: The reliability at 38,000 cycles is

R(38,000) = P(Z ≥ (38,000 − 40,000)/2000) = P(Z ≥ −1) = Φ(1) = 0.8413
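The same calculation can be checked with the Python standard library, using math.erf to evaluate the standard normal cdf (a minimal sketch for Example 2.5):

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_reliability(t, mu, sigma):
    """R(t) = P(T > t) for a normal failure-time distribution."""
    return 1.0 - std_normal_cdf((t - mu) / sigma)

print(f"{normal_reliability(38_000, mu=40_000, sigma=2_000):.4f}")  # ~0.8413
```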
Log Normal Distribution

The log normal distribution is a two-parameter lifetime distribution whose pdf is

f(t) = (1/(σ t √(2π))) e^(−(1/2)((ln t − μ)/σ)²),   t ≥ 0        (2.11)

where μ and σ are parameters such that −∞ < μ < ∞ and σ > 0. Note that μ and σ are not the
mean and standard deviation of the distribution.
Its relationship to the normal distribution (just take natural logarithms of all the data and time
points and you have “normal” data) makes it easy to work with the many good software analysis
programs available for treating normal data.
Mathematically, if a random variable X is defined as X = lnT, then X is
normally distributed with a mean of μ and a variance of σ 2. That is,
E(X) = E(lnT) = μ
and
V(X) = V(lnT) = σ 2.
Since T = e^X, the mean of the log normal distribution can be found by using the normal
distribution. Consider

E(T) = E(e^X) = ∫_{−∞}^{∞} (1/(σ√(2π))) e^(x − (1/2)((x − μ)/σ)²) dx
and by rearrangement of the exponent, this integral becomes

E(T) = e^(μ + σ²/2) ∫_{−∞}^{∞} (1/(σ√(2π))) e^(−[x − (μ + σ²)]²/(2σ²)) dx = e^(μ + σ²/2)

since the remaining integrand is a normal density (mean μ + σ², variance σ²) and therefore
integrates to one.
The hazard function of the log normal distribution is

h(t) = f(t)/R(t) = φ((ln t − μ)/σ) / (σ t R(t))

where φ is the pdf of the standard normal distribution.
Example 2.6: The failure time of a certain component is log normal distributed with μ = 5 and
σ = 1. Find the reliability of the component and the hazard rate for a life of 50 time units.

Solution: The reliability at t = 50 is

R(50) = P(Z ≥ (ln 50 − 5)/1) = P(Z ≥ −1.09) = 0.8621

and the hazard rate is

h(50) = φ((ln 50 − 5)/1) / [50(1)(0.8621)] = 0.2203/43.1 ≈ 0.0051 failures/unit.
Thus, values for the log normal distribution are easily computed by using the
standard normal tables.
Example 2.7: The failure time of a part is log normal distributed with μ = 6 and σ = 2. Find the
part reliability for a life of 200 time units.

Solution: R(200) = P(Z ≥ (ln 200 − 6)/2) = P(Z ≥ −0.35) = Φ(0.35) ≈ 0.637.
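A short sketch of these log normal calculations (Python standard library; the helper functions below are the standard normal pdf and cdf, and the hazard follows directly from h(t) = f(t)/R(t)):

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_reliability(t, mu, sigma):
    """R(t) = P(T > t) = P(Z >= (ln t - mu)/sigma)."""
    return 1.0 - std_normal_cdf((math.log(t) - mu) / sigma)

def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t)/R(t), with f(t) = phi((ln t - mu)/sigma) / (sigma * t)."""
    z = (math.log(t) - mu) / sigma
    return std_normal_pdf(z) / (sigma * t * lognormal_reliability(t, mu, sigma))

print(f"{lognormal_reliability(50, 5, 1):.4f}")   # Example 2.6: R(50) ~ 0.862
print(f"{lognormal_hazard(50, 5, 1):.4f}")        # Example 2.6: h(50) ~ 0.005
print(f"{lognormal_reliability(200, 6, 2):.3f}")  # Example 2.7: R(200) ~ 0.637
```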
Figure 2.3. Log normal reliability function vs. time
The log normal lifetime model, like the normal, is flexible enough to make it a
very useful empirical model. Figure 2.3 shows the reliability of the log normal vs
time. It can be theoretically derived under assumptions matching many failure
mechanisms. Some of these are: corrosion and crack growth, and in general,
failures resulting from chemical reactions or processes.
Weibull Distribution
The exponential distribution is often limited in applicability owing to the
memoryless property. The Weibull distribution (Weibull 1951) is a generalization
of the exponential distribution and is commonly used to represent fatigue life, ball
bearing life, and vacuum tube life. The Weibull distribution is extremely flexible
and appropriate for modeling component lifetimes with fluctuating hazard rate
functions and for representing various types of engineering applications. The
three-parameter probability density function is

f(t) = (β(t − γ)^(β−1) / θ^β) e^(−((t − γ)/θ)^β),   t ≥ γ ≥ 0
where θ and β are known as the scale and shape parameters, respectively, and γ
is known as the location parameter. These parameters are always positive. By using
different parameters, this distribution can follow the exponential distribution, the
normal distribution, etc. It is clear that, for t ≥ γ, the reliability function R(t) is
R(t) = e^(−((t − γ)/θ)^β)   for t > γ > 0, β > 0, θ > 0        (2.13)
hence,
h(t) = β(t − γ)^(β−1) / θ^β,   t > γ > 0, β > 0, θ > 0        (2.14)
It can be shown that the hazard function is decreasing for β < 1, increasing for
β > 1, and constant when β = 1.
Example 2.8: The failure time of a certain component has a Weibull distribution with β = 4,
θ = 2000, and γ = 1000. Find the reliability of the component and the hazard rate for an
operating time of 1500 hours.

Solution: From equations (2.13) and (2.14),

R(1500) = e^(−((1500 − 1000)/2000)^4) = e^(−0.0039) = 0.996

h(1500) = 4(1500 − 1000)³ / 2000⁴ = 3.125 × 10⁻⁵ failures/hour
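A quick numerical check of Example 2.8, using the reliability and hazard expressions of equations (2.13) and (2.14) (Python standard library):

```python
import math

def weibull3_reliability(t, theta, beta, gamma):
    """R(t) = exp(-((t - gamma)/theta)**beta), three-parameter Weibull."""
    return math.exp(-((t - gamma) / theta) ** beta)

def weibull3_hazard(t, theta, beta, gamma):
    """h(t) = beta * (t - gamma)**(beta - 1) / theta**beta."""
    return beta * (t - gamma) ** (beta - 1) / theta ** beta

t, theta, beta, gamma = 1500.0, 2000.0, 4.0, 1000.0
print(f"R(1500) = {weibull3_reliability(t, theta, beta, gamma):.4f}")  # ~0.9961
print(f"h(1500) = {weibull3_hazard(t, theta, beta, gamma):.2e}")       # ~3.13e-05 per hour
```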
When γ = 0 and β = 1, the Weibull distribution reduces to the exponential distribution,
and the hazard function given in equation (2.14) reduces to 1/θ, a constant. Thus,
the exponential is a special case of the Weibull distribution. Similarly, when γ = 0
and β = 2, the Weibull probability density function becomes the Rayleigh density
function. That is
f(t) = (2/θ) t e^(−t²/θ)   for θ > 0, t ≥ 0
When the Weibull distribution is written in the alternative form with parameters λ and γ, so that
the reliability function is R(t) = e^(−λt^γ), the mean, variance, and reliability are, respectively,

Mean = λ^(−1/γ) Γ(1 + 1/γ)

Variance = λ^(−2/γ) [ Γ(1 + 2/γ) − ( Γ(1 + 1/γ) )² ]        (2.16)

Reliability = e^(−λ t^γ)
Example 2.9: The time to failure of a part has a Weibull distribution with 1/λ = 250 (measured
in 10⁵ cycles) and γ = 2. Find the part reliability at 10⁶ cycles.

Solution: At 10⁶ cycles, t = 10 in units of 10⁵ cycles, so

R(10) = e^(−(10)²/250) = e^(−0.4) = 0.67
Gamma Distribution
Gamma distribution can be used as a failure probability function for components
whose distribution is skewed. The failure density function for a gamma distribution
is
f(t) = (t^(α−1) / (β^α Γ(α))) e^(−t/β),   t ≥ 0, α, β > 0        (2.17)
where α is the shape parameter and β is the scale parameter. Hence,
R(t) = ∫_t^∞ (1/(β^α Γ(α))) s^(α−1) e^(−s/β) ds
If α is an integer, it can be shown by successive integration by parts that
R(t) = e^(−t/β) Σ_{i=0}^{α−1} (t/β)^i / i!        (2.18)

and

h(t) = f(t)/R(t) = [ t^(α−1) e^(−t/β) / (β^α Γ(α)) ] / [ e^(−t/β) Σ_{i=0}^{α−1} (t/β)^i / i! ]
The gamma density function has shapes that are very similar to the Weibull
distribution. At α = 1, the gamma distribution becomes the exponential distribu-
tion with the constant failure rate 1/β. The gamma distribution can also be used to
model the time to the nth failure of a system if the underlying failure distribution is
exponential. Thus, if Xi is exponentially distributed with parameter θ = 1/β, then
T = X1 + X2 +…+Xn, is gamma distributed with parameters β and n.
Example 2.10: The time to failure of a component has a gamma distribution with α
= 3 and β = 5. Determine the reliability of the component and the hazard rate at 10
time-units.
R(10) = e^(−10/5) Σ_{i=0}^{2} (10/5)^i / i! = e^(−2)(1 + 2 + 2) = 0.6767

From equation (2.17), f(10) = 0.054, and therefore

h(10) = f(10)/R(10) = 0.054/0.6767 = 0.0798 failures/unit time
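A numerical check of Example 2.10 with the Python standard library (math.gamma for Γ(α) and math.factorial for the partial sum):

```python
import math

def gamma_pdf(t, alpha, beta):
    return t**(alpha - 1) * math.exp(-t / beta) / (beta**alpha * math.gamma(alpha))

def gamma_reliability(t, alpha, beta):
    """For integer alpha: R(t) = exp(-t/beta) * sum_{i=0}^{alpha-1} (t/beta)**i / i!."""
    return math.exp(-t / beta) * sum((t / beta)**i / math.factorial(i)
                                     for i in range(alpha))

t, alpha, beta = 10.0, 3, 5.0
r = gamma_reliability(t, alpha, beta)
print(f"R(10) = {r:.4f}")                              # ~0.6767
print(f"h(10) = {gamma_pdf(t, alpha, beta) / r:.4f}")  # ~0.08 failures/unit time
```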
The other form of the gamma probability density function can be written as
follows:
f(t) = (β^α t^(α−1) / Γ(α)) e^(−βt)   for t > 0        (2.19)

This pdf is characterized by two parameters: shape parameter α and scale parameter β. When
0 < α < 1, the failure rate monotonically decreases; when α > 1, the failure rate monotonically
increases; when α = 1, the failure rate is constant.
The mean, variance and reliability of the density function in equation (2.19)
are, respectively,
Mean (MTTF) = α/β

Variance = α/β²

Reliability = ∫_t^∞ (β^α x^(α−1) / Γ(α)) e^(−βx) dx
Beta Distribution
The two-parameter Beta density function, f(t), is given by
f(t) = (Γ(α + β) / (Γ(α)Γ(β))) t^(α−1) (1 − t)^(β−1),   0 < t < 1, α > 0, β > 0
where α and β are the distribution parameters. This two-parameter distribution is
commonly used in many reliability engineering applications.
Pareto Distribution
The Pareto distribution was originally developed to model income in a population.
Phenomena such as city population size, stock price fluctuations, and personal
incomes have distributions with very long right tails. The probability density
function of the Pareto distribution is given by
α kα
f (t ) = k ≤t≤∞
t α +1
The mean, variance, and reliability of the Pareto distribution are, respectively,

Mean = k/(α − 1)   for α > 1

Variance = α k² / [(α − 1)²(α − 2)]   for α > 2

Reliability = (k/t)^α
The Pareto and log normal distributions are commonly used together to model population sizes and
incomes: the Pareto is used to fit the tail of the distribution, and the log normal is used to fit
the rest of the distribution.
Rayleigh Distribution
The Rayleigh distribution is a flexible lifetime distribution that can apply to many
degradation-process failure modes. The Rayleigh probability density function is

f(t) = (t/σ²) e^(−t²/(2σ²))        (2.20)
The mean, variance, and reliability of the Rayleigh distribution are, respectively,

Mean = σ (π/2)^(1/2)

Variance = (2 − π/2) σ²

Reliability = e^(−t²/(2σ²))
Example 2.12: Rolling resistance is a measure of the energy lost by a tire under
load when it resists the force opposing its direction of travel. In a typical car,
traveling at 60 miles per hour, about 20% of the engine power is used to overcome
the rolling resistance of the tires.
A tire manufacturer introduces a new material that, when added to the tire
rubber compound, significantly improves the tire rolling resistance but increases
the wear rate of the tire tread. Analysis of a laboratory test of 150 tires shows that
the failure rate of the new tire linearly increases with time (hours). It is expressed
as
h(t ) = 0.5 × 10−8 t
Find the reliability of the tire at one year.
Solution: The reliability of the tire after one year (8760 hours) of use is
R(1 year) = e^(−(0.5/2) × 10⁻⁸ × (8760)²) = 0.8254
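A one-line check of Example 2.12 (Python standard library), using R(t) = exp(−k t²/2) for the linearly increasing hazard h(t) = k t:

```python
import math

def rayleigh_reliability(t, k):
    """R(t) = exp(-k * t**2 / 2) when the hazard rate is h(t) = k * t."""
    return math.exp(-k * t**2 / 2.0)

print(f"{rayleigh_reliability(8760.0, 0.5e-8):.4f}")  # one year of use: ~0.8254
```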
Figure 2.6. Rayleigh reliability function vs time
and

R(t) = e^(1 − a^(t^α))        (2.22)

respectively. The corresponding failure rate of the Pham distribution is given by

h(t) = α ln(a) t^(α−1) a^(t^α)        (2.23)
Figures 2.7 and 2.8 describe the density function and failure rate function for
various values of a and α .
Figure 2.7. Probability density function for various values α with a=2
Figure 2.8. Probability density function for various values a with α = 1.5
R(t) = e^(−ln(λt^n + 1))        (2.25)

and

f(t) = (nλt^(n−1) / (λt^n + 1)) e^(−ln(λt^n + 1)),   n ≥ 1, λ > 0, t ≥ 0        (2.26)

where n is the shape parameter and λ is the scale parameter.
where f(s) and h(s) are, respectively, the failure time density and failure rate
function.
The operating environments are, however, often unknown and can differ from the design environment
because of the uncertainties of environments in the field (Pham and Xie 2003). A new look at how
reliability researchers can incorporate the randomness of field environments into mathematical
reliability models of system failure in the field is therefore of great interest.
Pham (2005a) recently developed a new mathematical function called system-
ability, considering the uncertainty of the operational environments in the function
for predicting the reliability of systems.
Notation
hi (t ) ith component hazard rate function
Ri (t ) ith component reliability function
λi Intensity parameter of Weibull distribution for ith component
λ λ = ( λ1 , λ2 , λ3 ..., λn ) .
γi Shape parameter of Weibull distribution for ith component
γ γ = ( γ 1 , γ 2 , γ 3 ..., γ n ) .
η A common environment factor
G (η ) Cumulative distribution function of η
α Shape parameter of Gamma distribution
β Scale parameter of Gamma distribution
Definition 2.2 (Pham 2005a): Systemability is defined as the probability that the
system will perform its intended function for a specified mission time under the
random operational environments.
In mathematical form, the systemability function is given by

Rs(t) = ∫_η e^(−η ∫_0^t h(s) ds) dG(η)        (2.30)
R_Parallel(t | η, λ, γ) = Σ_{i=1}^{n} exp(−η λ_i t^(γ_i)) − Σ_{i1,i2=1; i1≠i2}^{n} exp(−η(λ_{i1} t^(γ_{i1}) + λ_{i2} t^(γ_{i2})))
   + Σ_{i1,i2,i3=1; i1≠i2≠i3}^{n} exp(−η(λ_{i1} t^(γ_{i1}) + λ_{i2} t^(γ_{i2}) + λ_{i3} t^(γ_{i3}))) − ···
   + (−1)^(n−1) exp(−η Σ_{i=1}^{n} λ_i t^(γ_i))        (2.40)
Hence, the parallel systemability is given by
R_parallel(t | λ, γ) = Σ_{i=1}^{n} [β/(β + λ_i t^(γ_i))]^α − Σ_{i1,i2=1; i1≠i2}^{n} [β/(β + λ_{i1} t^(γ_{i1}) + λ_{i2} t^(γ_{i2}))]^α
   + Σ_{i1,i2,i3=1; i1≠i2≠i3}^{n} [β/(β + λ_{i1} t^(γ_{i1}) + λ_{i2} t^(γ_{i2}) + λ_{i3} t^(γ_{i3}))]^α − ···
   + (−1)^(n−1) [β/(β + Σ_{i=1}^{n} λ_i t^(γ_i))]^α        (2.41)
or
R_parallel(t | λ, γ) = Σ_{k=1}^{n} (−1)^(k−1) Σ_{i1,i2,...,ik=1; i1≠i2≠···≠ik}^{n} [ β / (β + Σ_{j=i1,...,ik} λ_j t^(γ_j)) ]^α        (2.42)
To simplify the calculation for a general n-component parallel system, we consider here a parallel
system consisting of only two components. It is easy to see that the second-order moment of the
systemability function can be written as

E[R²_Parallel(t | λ, γ)] = ∫_η ( e^(−2η λ1 t^(γ1)) + e^(−2η λ2 t^(γ2)) + e^(−2η(λ1 t^(γ1) + λ2 t^(γ2)))
   + 2 e^(−η(λ1 t^(γ1) + λ2 t^(γ2))) − 2 e^(−η(2λ1 t^(γ1) + λ2 t^(γ2))) − 2 e^(−η(λ1 t^(γ1) + 2λ2 t^(γ2))) ) dG(η)
Note that

(1 − e^(−ηλt^γ))^(n−j) = Σ_{l=0}^{n−j} C(n−j, l) (−e^(−ηλt^γ))^l
The conditional reliability function of k-out-of-n systems, from equation (2.45), can be rewritten as

R_{k-out-of-n}(t | η, λ, γ) = Σ_{j=k}^{n} C(n, j) Σ_{l=0}^{n−j} C(n−j, l) (−1)^l e^(−η(j+l)λt^γ)        (2.46)
If η ~ gamma(α, β), then the k-out-of-n systemability is given by

R_{(T1,...,Tn)}(t | λ, γ) = Σ_{j=k}^{n} C(n, j) Σ_{l=0}^{n−j} C(n−j, l) (−1)^l [ β / (β + λ(j+l) t^γ) ]^α        (2.47)
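Equation (2.47) is easy to evaluate directly; the sketch below is a straightforward transcription (Python standard library), with all numerical parameter values chosen only for illustration:

```python
from math import comb

def k_out_of_n_systemability(t, k, n, lam, gam, alpha, beta):
    """Equation (2.47): k-out-of-n systemability for identical components with
    Weibull-type hazard (parameters lam, gam) and a gamma(alpha, beta) environment."""
    total = 0.0
    for j in range(k, n + 1):
        for l in range(n - j + 1):
            total += (comb(n, j) * comb(n - j, l) * (-1) ** l
                      * (beta / (beta + lam * (j + l) * t ** gam)) ** alpha)
    return total

# Illustrative values only: a 3-out-of-5 system at t = 100 hours
print(k_out_of_n_systemability(t=100.0, k=3, n=5, lam=1e-3, gam=1.0, alpha=2.0, beta=3.0))
```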
Since

(1 − e^(−ηλt^γ))^(2n−i−j) = Σ_{l=0}^{2n−i−j} C(2n−i−j, l) (−e^(−ηλt^γ))^l        (2.49)

we can rewrite equation (2.48), after several simplifications, as follows:

R²_{k-out-of-n}(t | η, λ, γ) = Σ_{i=k}^{n} C(n, i) Σ_{j=k}^{n} C(n, j) Σ_{l=0}^{2n−i−j} C(2n−i−j, l) (−1)^l e^(−η(i+j+l)λt^γ)        (2.50)
Figures 2.9 and 2.10 show the reliability function (conventional reliability
function) and systemability function (equation 2.52) of a series system (here k=5)
for α = 2, β = 3 and for α = 2, β = 1 , respectively.
Figures 2.11 and 2.12 show the reliability and systemability functions of a
parallel system (here k=1) for α = 2, β = 3 and for α = 2, β = 1 , respectively.
Similarly, Figures 2.13 and 2.14 show the reliability and systemability functions of
a 3-out-of-5 system for α = 2, β = 3 and for α = 2, β = 1 , respectively.
Figure 2.9. Comparisons of series system reliability vs systemability functions for α = 2 and
β =3
Figure 2.10. Comparisons of series system reliability vs. systemability functions for α = 2
and β =1
Figure 2.11. Comparisons of parallel system reliability vs. systemability functions for α = 2 and β = 3
Figure 2.12. Comparisons of parallel system reliability vs. systemability functions for α = 2 and β = 1
Figure 2.13. Comparisons of k-out-of-n system reliability vs. systemability functions for
α = 2 and β =3
Figure 2.14. Comparisons of k-out-of-n system reliability vs. systemability functions for
α = 2 and β =1
Figure 2.15. A 2-out-of-3 systemability and its 95% confidence interval where α = 2, β = 1
Figure 2.16. A 2-out-of-3 systemability and its 95% confidence interval (α = 2, β = 2)
When the open- and closed-mode failure structures are dual of one another, i.e.
Pr{system fails in both modes} = 0, then the system reliability given by equation
(2.53) becomes
Notation
q0 The open-mode failure probability of each component (p0 = 1 - q0)
qs The short-mode failure probability of each component (ps = 1 – qs)
⌊x⌋   The largest integer not exceeding x
*     Implies an optimal value
Proof: The proof is left as an exercise for the reader (see Problem 2.17).
Example 2.14: A switch has two failure modes: fail-open and fail-short. The
probability of switch open-circuit failure and short-circuit failure are 0.1 and 0.2
respectively. A system consists of n switches wired in series. That is, given q0 =
0.1 and qs = 0.2. Then
n0 = log(0.1 / (1 − 0.2)) / log(0.2 / (1 − 0.1)) = log(0.125) / log(0.222) = 1.4
Thus, n* = ⌊1.4⌋ + 1 = 2. Therefore, when n* = 2 the system reliability Rs(n) = 0.77 is maximized.
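A short sketch of this optimization (assuming the usual two-failure-mode series-system reliability Rs(n) = (1 − q0)^n − qs^n, which is consistent with the Rs(2) = 0.77 figure above):

```python
import math

def series_reliability(n, q0, qs):
    """Rs(n) = (1 - q0)**n - qs**n for n switches in series with open/short modes."""
    return (1 - q0) ** n - qs ** n

def optimal_n(q0, qs):
    n0 = math.log(q0 / (1 - qs)) / math.log(qs / (1 - q0))
    return math.floor(n0) + 1

q0, qs = 0.1, 0.2
n_star = optimal_n(q0, qs)
print(n_star, round(series_reliability(n_star, q0, qs), 2))  # 2, 0.77
```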
Proof: The proof is left as an exercise for the reader (see Problem 2.18).
It is observed that, for any range of q0 and qs, the optimal number of parallel
components that maximizes the system reliability is one, if qs > q0 (see Problem
2.19). For most other practical values of q0 and qs the optimal number turns out to
be two. In general, the optimal value of parallel components can be easily obtained
using equation (2.58).
The probabilities of a system failing in open mode and failing in short mode are given by

F0(m) = [1 − (1 − q0)^n]^m        (2.60)
For given n, take m = m_n; then one can rewrite equation (2.62) as

R_ps(n, m_n) = (1 − qs^n)^(m_n) − [1 − (1 − q0)^n]^(m_n)
For fixed n, q0, and qs, one can determine the value of m that maximizes Rps.
Theorem 2.3 (Barlow and Proschan 1965): Let n, q0, and qs be fixed. The
maximum value of Rps(m) is attained at m* = ⌊m0⌋ + 1, where
m0 = n(log p0 − log qs) / [log(1 − qs^n) − log(1 − p0^n)]        (2.63)
Proof: The proof is left as an exercise for the reader (see Problem 20).
Theorem 2.4 (Barlow and Proschan 1965): Let n, q0, and qs be fixed. The
maximum value of R(m) is attained at m* = ⌊m0⌋ + 1, where
m0 = n(log ps − log q0) / [log(1 − q0^n) − log(1 − ps^n)]        (2.67)
If mo is an integer, then mo and mo + 1 both maximize R(m).
The system fails in open mode if and only if at least (n - k +1) of its n components
fail in open mode, that is:
F0(k, n) = Σ_{i=n−k+1}^{n} C(n, i) q0^i p0^(n−i) = Σ_{i=0}^{k−1} C(n, i) p0^i q0^(n−i)        (2.69)
For a given k, we can find the optimum value of n, say n*, that maximizes the
system reliability.
Theorem 2.5 (Pham 1989a): For fixed k, q0, and qs, the maximum value of R(k,
n) is attained at n* = ⌊n0⌋, where

n0 = k [ 1 + log((1 − q0)/qs) / log((1 − qs)/q0) ]        (2.71)
If n0 is an integer, both n0 and n0 + 1 maximize R(k, n).
Proof: The proof is left as an exercise for the reader (see Problem 22).
This result shows that when n0 is an integer, both n*-1 and n* maximize the system
reliability R(k, n). In such cases, the lower value will provide the more economical
optimal configuration for the system. If q0 = qs the system reliability R(k, n) is
maximized when n = 2k or 2k-1. In this case, the optimum value of n does not
depend on the value of q0 and qs and the best choice for a decision voter is a
majority voter; this system is also called a majority system (Pham,1989a).
From Theorem 2.5, we understand that the optimal system size n* depends on
the various parameters q0 and qs. It can be shown the optimal value n* is an
increasing function of q0 and a decreasing function of qs (see Problem 23).
Intuitively, these results state that when qs increases it is desirable to reduce the
number of components in the system as close to the value of threshold level k as
possible. On the other hand, when q0 increases, the system reliability will be
improved if the number of components increases.
Theorem 2.6 (Ben-Dov 1980): For fixed n, q0, and qs, it is straightforward to see
that the maximum value of R(k, n) is attained at k* = ⌊k0⌋ + 1, where

k0 = n log(q0/ps) / log(qs q0 / (ps p0))        (2.72)
If k0 is an integer, both k0 and k0 + 1 maximize R(k, n).
Proof: The proof is left as an exercise for the reader (see Problem 24).
We now discuss how these two values, k* and n*, are related to one another.
Define α by
⎛q ⎞
log ⎜ 0 ⎟
α= ⎝ ps ⎠ (2.73)
⎛ qs q0 ⎞
log ⎜ ⎟
⎝ ps p0 ⎠
then, for a given n, the optimal threshold k is given by k* = ⌈nα⌉, and for a given k the optimal
n is n* = ⌊k/α⌋. For any given q0 and qs, we can easily show that (see Problem 25)
Notation
p_i^c    Pr{converter i works}
p_i^s    Pr{switch i is connected to converter (i + 1) when a signal arrives}
p_i^m1   Pr{monitor i works when converter i works} = Pr{not sending a signal to the switch when converter i works}
p_i^m2   Pr{monitor i works when converter i has failed} = Pr{sending a signal to the switch when converter i has failed}
R_{n−k}^k   Reliability of the remaining system of size (n − k), given that the first k switches work
R_n      Reliability of the system consisting of n converters
If X1, X2, …, Xn are independent exponential random variables, where Xi has a constant failure
rate λi and a mean equal to 1/λi, then min{X1, X2, …, Xn} has an exponential density with mean
(Σ_i λi)^(−1).
The significance of the property is as follows:
1. The failure behavior of the simultaneous operation of components can be
characterized by an exponential density with a mean equal to the reciprocal of
the sum of the failure rates.
2. The joint failure/repair behavior of a system where components are operating
and/or undergoing repair can be characterized by an exponential density with a
mean equal to the reciprocal of the sum of the failure and repair rates.
3. The failure/repair behavior of a system such as 2 above, but further
complicated by active and dormant operating states and sensing and switching,
can be characterized by an exponential density.
The above property means that almost all reliability and availability models can be
characterized by a time homogeneous Markov process if the various failure times
and repair times are exponential. The notation for the Markov process is {X(t),
t>0}, where X(t) is discrete (state space) and t is continuous (parameter space). By
convention, this type of Markov process is called a continuous parameter Markov
chain.
From a reliability/availability viewpoint, there are two types of Markov processes.
These are defined as follows:
1. Absorbing Process: Contains what is called an "absorbing state" which is a
state from which the system can never leave once it has entered, e.g., a failure
which aborts a flight or a mission.
2. Ergodic Process: Contains no absorbing states, so that X(t) can move around
indefinitely, e.g., the operation of a ground power plant where failure only
temporarily disrupts the operation.
Pham (2000a, page 265) presents a summary of the processes to be considered,
broken down by absorbing and ergodic categories. Both reliability and availability
can be described in terms of the probability of the process or system being in
defined "up" states, e.g., states 1 and 2 in the initial example. Likewise, the mean
time between failures (MTBF) can be described as the total time in the “up” states
before proceeding to the absorbing state or failure state.
The zeros on Pij, i > j, denote that the process cannot go backwards, i.e., this is not
a repair process. The zero on P13 denotes that in a process of this type, the
probability of more than one event (e.g., failure, repair, etc.) in the incremental
time period dt approaches zero as dt approaches zero.
Except for the initial conditions of the process, i.e., the state in which the
process starts, the process is completely specified by the incremental transition
probabilities. The reason for the latter is that the assumption of exponential event
(failure or repair) times allows the process to be characterized at any time t since it
depends only on what happens between t and (t+dt). The incremental transition
probabilities can be arranged into a matrix in a way which depicts all possible state-wise
movements. Thus, for the parallel configuration,

               [ 1 − 2λ dt    2λ dt       0    ]
[p_ij(dt)] =   [ 0            1 − λ dt    λ dt ]
               [ 0            0           1    ]
for i, j = 1, 2, or 3. The matrix [Pij(dt)] is called the incremental, one-step transition
matrix. It is a stochastic matrix, i.e., the rows sum to 1.0. As mentioned earlier, this
matrix along with the initial conditions completely describes the process.
Now, [Pij(dt)] gives the probabilities for either remaining or moving to all the
various states during the interval t to t + dt, hence,
P1(t + dt) = (1 − 2λ dt) P1(t)
P2(t + dt) = 2λ dt P1(t) + (1 − λ dt) P2(t)
P3(t + dt) = λ dt P2(t) + P3(t)
Rearranging and dividing through by dt, we obtain

[P1(t + dt) − P1(t)] / dt = −2λ P1(t)

[P2(t + dt) − P2(t)] / dt = 2λ P1(t) − λ P2(t)

[P3(t + dt) − P3(t)] / dt = λ P2(t)
Taking limits of both sides as dt → 0 , we obtain
P1′(t) = −2λ P1(t)
P2′(t) = 2λ P1(t) − λ P2(t)
P3′(t) = λ P2(t)
The above system of linear first-order differential equations can be easily solved
for P1(t) and P2(t), and therefore, the reliability of the configuration can be
obtained:
R(t) = Σ_{i=1}^{2} Pi(t)
Actually, there is no need to solve all three equations, but only the first two as P3(t)
does not appear and also P3(t) = 1 – P1(t) – P2(t). The system of linear, first-order
differential equations can be solved by various means including both manual and
machine methods. For purposes here, the manual methods employing the Laplace
transform (see Appendix 2) will be used.
L[Pi(t)] = ∫_0^∞ e^(−st) Pi(t) dt = f_i(s)

L[Pi′(t)] = ∫_0^∞ e^(−st) Pi′(t) dt = s f_i(s) − Pi(0)
The use of the Laplace transform will allow transformation of the system of linear,
first-order differential equations into a system of linear algebraic equations which
can easily be solved, and by means of the inverse transforms, solutions of Pi(t) can
be determined.
Returning to the example, the initial condition of the parallel configuration is
assumed to be “full-up” such that
P1(t = 0) = 1, P2(t = 0) = 0, P3(t = 0) = 0
transforming the equations for P’1(t) and P’2(t) gives
sf1 ( s ) − P1 (t ) |t = 0 = −2λ f1 ( s )
sf 2 ( s ) − P2 (t ) |t = 0 = 2λ f1 ( s ) − λ f 2 ( s )
Evaluating P1(t) and P2(t) at t = 0 gives
sf1 ( s ) − 1 = −2λ f1 ( s )
sf 2 ( s ) − 0 = 2λ f1 ( s ) − λ f 2 ( s )
from which we obtain
( s + 2λ ) f1 ( s ) = 1
−2λ f1 ( s ) + ( s + λ ) f 2 ( s ) = 0
∫_0^∞ P1(t) dt = mean time spent in state 1, and, similarly,

∫_0^∞ P2(t) dt = mean time spent in state 2
For the maintained version of this parallel configuration, in which repair from state 2 back to
state 1 occurs at a constant rate μ, integrating the corresponding state equations gives

−1 = −2λ T1 + μ T2
 0 = 2λ T1 − (λ + μ) T2

or, equivalently,

[ −1 ]   [ −2λ       μ      ] [ T1 ]
[  0 ] = [  2λ    −(λ + μ)  ] [ T2 ]

Therefore,

T1 = (λ + μ)/(2λ²),    T2 = 1/λ

MTBF = T1 + T2 = (λ + μ)/(2λ²) + 1/λ = (3λ + μ)/(2λ²)
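As a numerical cross-check of this MTBF result, the sketch below (numpy; the λ and μ values are arbitrary illustrations) solves the two linear equations for T1 and T2 and compares their sum with (3λ + μ)/(2λ²):

```python
import numpy as np

lam, mu = 0.01, 0.5   # assumed failure and repair rates (per hour)

# Solve:  -1 = -2*lam*T1 + mu*T2   and   0 = 2*lam*T1 - (lam + mu)*T2
A = np.array([[-2 * lam, mu],
              [2 * lam, -(lam + mu)]])
T1, T2 = np.linalg.solve(A, np.array([-1.0, 0.0]))

print(T1 + T2)                          # ~2650 hours
print((3 * lam + mu) / (2 * lam**2))    # ~2650 hours (closed form)
```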
The MTBF for non-maintenance processes is developed exactly the same way
as just shown. What remains under absorbing processes is the case for availability
for maintained systems. The difference between reliability and availability for
absorbing processes is somewhat subtle. A good example is that of a communica-
tion system where, if such a system failed temporarily, the mission would continue,
but, if it failed permanently, the mission would be aborted. Consider the following
cold-standby configuration consisting of two units: one main unit and one spare
unit (Pham 2000a):
State 1: Main unit operating - spare OK
State 2: Main unit out - restoration underway
State 3: Spare unit installed and operating
State 4: Permanent failure (no spare available)
The incremental transition matrix is given by (see Figure B.8 in Pham 2000a, for a
detailed state transition diagram)
              [ 1 − λ dt    λ dt        0           0    ]
[P_ij(dt)] =  [ 0           1 − μ dt    μ dt        0    ]
              [ 0           0           1 − λ dt    λ dt ]
              [ 0           0           0           1    ]
We obtain
P1’(t) = - λP1(t)
P2’(t) = λP1(t) - μP2(t)
P3’(t) = μP2(t) - λP3(t)
Using the Laplace transform, we obtain
sf1 ( s) − 1 = −λ f1 ( s)
sf 2 ( s) = λ f1 ( s) − μ f 2 ( s )
sf3 ( s) = μ f 2 ( s) − λ f 3 ( s )
After simplifications,
f1(s) = 1/(s + λ)

f2(s) = λ / [(s + λ)(s + μ)]

f3(s) = λμ / [(s + λ)²(s + μ)]
Therefore, the probability of full-up performance, P1(t), is given by
P1 (t ) = e − λt
Similarly, the probability of the system being down and under repair, P2(t), is
P2(t) = [λ/(λ − μ)] (e^(−μt) − e^(−λt))
and the probability of the system being full-up but no spare available, P3(t), is
P3(t) = [λμ/(λ − μ)²] [e^(−μt) − e^(−λt) − (λ − μ) t e^(−λt)]
Hence, the point availability, A(t), is given by
A(t ) = P1 (t ) + P3 (t )
If average or interval availability is required, this is achieved by
(1/T) ∫_0^T A(t) dt = (1/T) ∫_0^T [P1(t) + P3(t)] dt        (2.79)
       [ 1 − 2λ       μ          ]   [ 1   0 ]
[Q] =  [ 2λ           1 − (λ+μ)  ] − [ 0   1 ]

    =  [ −2λ        μ        ]
       [  2λ     −(λ + μ)    ]
Further define [P(t)] and [P′(t)] as column vectors such that

[P(t)] = [ P1(t) ],    [P′(t)] = [ P1′(t) ]
         [ P2(t) ]               [ P2′(t) ]
then
[ P '(t )] = [Q ][ P(t )]
At the above point, solution of the system of differential equations will produce
solutions to P1(t) and P2(t). If the MTBF is desired, integration of both sides of the
system produces
[ −1 ]        [ T1 ]
[  0 ]  = [Q] [ T2 ]

[ −1 ]   [ −2λ       μ      ] [ T1 ]
[  0 ] = [  2λ    −(λ + μ)  ] [ T2 ]     or

[Q]⁻¹ [ −1 ]   [ T1 ]
      [  0 ] = [ T2 ]

where [Q]⁻¹ is the inverse of [Q] and the MTBF is given by

MTBF = T1 + T2 = (3λ + μ)/(2λ²)
In the more general MTBF case,

      [ −1 ]   [ T1   ]
      [  0 ]   [ T2   ]
[Q]⁻¹ [  ⋮ ] = [  ⋮   ]        where Σ_{i=1}^{n−1} Ti = MTBF
      [  0 ]   [ Tn−1 ]

and (n − 1) is the number of non-absorbing states.
For the reliability/availability case, utilizing the Laplace transform, the system of linear,
first-order differential equations is transformed to

s [ f1(s) ]   [ 1 ]        [ f1(s) ]
  [ f2(s) ] − [ 0 ]  = [Q] [ f2(s) ]

[sI − Q] [ f1(s) ]   [ 1 ]
         [ f2(s) ] = [ 0 ]

[ f1(s) ]              [ 1 ]
[ f2(s) ] = [sI − Q]⁻¹ [ 0 ]

and, applying the inverse transform,

[ P1(t) ]        {             [ 1 ] }
[ P2(t) ] = L⁻¹ {  [sI − Q]⁻¹  [ 0 ] }
Generalization of the latter to the case of (n-1) non-absorbing states is
straightforward.
Ergodic processes, as opposed to absorbing processes, do not have any absorbing states, and hence
movement between states can go on indefinitely. For the latter reason, availability (point,
steady-state, or interval) is the only meaningful measure. As an example of an ergodic process, a
ground-based power unit configured in parallel will be considered.
The parallel units are identical, each with exponential failure and repair times
with means 1/λ and 1/μ, respectively (Pham 2000a). Assume a two-repairmen
capability if required (both units down), then
State 1: Full-up (both units operating)
State 2: One unit down and under repair (other unit up)
State 3: Both units down and under repair
It should be noted that, as in the case of failure events, two or more repairs
cannot be made in the dt interval.
              [ 1 − 2λ dt     2λ dt              0         ]
[P_ij(dt)] =  [ μ dt           1 − (λ + μ) dt    λ dt      ]
              [ 0              2μ dt             1 − 2μ dt ]

[ P1′(t) ]   [ −2λ        μ          0   ] [ P1(t) ]
[ P2′(t) ] = [  2λ     −(λ + μ)     2μ   ] [ P2(t) ]
[ P3′(t) ]   [  0         λ        −2μ   ] [ P3(t) ]
Similar to the absorbing case, the method of the Laplace transform can be used to
solve for P1(t), P2(t), and P3(t), with the point availability, A(t), given by
A(t ) = P1 (t ) + P2 (t )
Case II: Interval Availability - Ergodic Process. This is the same as the absorbing
case with integration over time period T of interest. The interval availability, A(T), is
A(T) = (1/T) ∫_0^T A(t) dt        (2.80)
Case III: Steady State Availability - Ergodic Process. Here the process is
examined as t → ∞ with complete “washout” of the initial conditions. Letting
t → ∞ the system of differential equations can be transformed to linear algebraic
equations. Thus,
      [ P1′(t) ]         [ −2λ        μ          0   ] [ P1(t) ]
lim   [ P2′(t) ]  = lim  [  2λ     −(λ + μ)     2μ   ] [ P2(t) ]
t→∞   [ P3′(t) ]    t→∞  [  0         λ        −2μ   ] [ P3(t) ]

together with the normalizing condition

Σ_{i=1}^{3} Pi(t) = 1
With the introduction of the new equation, one of the original equations is deleted
and a new system is formed:
[ 1 ]   [   1          1          1   ] [ P1(t) ]
[ 0 ] = [ −2λ          μ          0   ] [ P2(t) ]
[ 0 ]   [  2λ      −(λ + μ)      2μ   ] [ P3(t) ]

or, equivalently,

[ P1(t) ]   [   1          1          1   ]⁻¹ [ 1 ]
[ P2(t) ] = [ −2λ          μ          0   ]   [ 0 ]
[ P3(t) ]   [  2λ      −(λ + μ)      2μ   ]   [ 0 ]
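A brief numerical sketch of this steady-state solve (numpy; the λ and μ values are arbitrary illustrations). The steady-state availability is P1 + P2, since state 3 (both units down) is the only down state:

```python
import numpy as np

lam, mu = 0.01, 0.5   # assumed failure and repair rates

# One balance equation is replaced by the normalizing condition P1 + P2 + P3 = 1
A = np.array([[1.0, 1.0, 1.0],
              [-2 * lam, mu, 0.0],
              [2 * lam, -(lam + mu), 2 * mu]])
P1, P2, P3 = np.linalg.solve(A, np.array([1.0, 0.0, 0.0]))

print(P1, P2, P3)            # steady-state probabilities
print("A =", P1 + P2)        # steady-state availability (~0.9996)
```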
For example, if N(t) equals the number of persons who have entered a restaurant at
or prior to time t, then N(t) is a counting process in which an event occurs
whenever a person enters the restaurant.
Property 2.3: The sum of independent Poisson processes N1(t), N2(t), …, Nk(t), with mean values
λ1t, λ2t, …, λkt respectively, is also a Poisson process with mean (Σ_{i=1}^{k} λi) t. In other
words, the sum of independent Poisson processes is a Poisson process with a mean equal to the sum
of the individual processes' means.
Proof: The proof is left as an exercise for the reader (see Problem 26).
Property 2.4: The difference of two independent Poisson processes, N1(t), and
N2(t), with mean λ1t and λ2t, respectively, is not a Poisson process. Instead, it has
the probability mass function
P[N1(t) − N2(t) = k] = e^(−(λ1 + λ2)t) (λ1/λ2)^(k/2) I_k(2√(λ1 λ2) t)        (2.82)
where Ik(.) is a modified Bessel function of order k (Handbook 1980).
Proof: The proof is left as an exercise for the reader (see Problem 27).
Property 2.5: If the Poisson process N(t) with mean λt is filtered such that each occurrence of
the event is counted only with a constant probability p, then the resulting process is also a
Poisson process, with mean λpt.
Property 2.6: Let N(t) be a Poisson process and Yn a family of independent and
identically distributed random variables which are also independent of N(t). A
stochastic process X(t) is said to be a compound Poisson process if it can be
represented as
X(t) = Σ_{i=1}^{N(t)} Yi
A renewal process is a more general case of the Poisson process in which the
inter-arrival times of the process or the time between failures do not necessarily
follow the exponential distribution. For convenience, we will call the occurrence of
an event a renewal, the inter-arrival time the renewal period, and the waiting time
the renewal time.
Definition 2.5: A counting process N(t) that represents the total number of
occurrences of an event in the time interval (0, t] is called a renewal process, if the
time between failures are independent and identically distributed random variables.
The probability that there are exactly n failures occurring by time t can be written
as
P{N (t ) = n} = P{N (t ) ≥ n} − P{N (t ) > n} (2.83)
Note that the times between failures are T1, T2, …, Tn, so the time of the kth failure is

Wk = Σ_{i=1}^{k} Ti

and

Tk = Wk − Wk−1
Thus,
P{N (t ) = n} = P{N (t ) ≥ n} − P{N (t ) > n}
= P{Wn ≤ t} − P{Wn +1 ≤ t}
= Fn (t ) − Fn +1 (t )
where Fn(t) is the cumulative distribution function for the time of the nth failure
and n = 0,1,2, ... .
Example 2.16: Consider a software testing model for which the time to find an error during the
testing phase has an exponential distribution with a failure rate of λ. It can be shown that the
time of the nth failure follows the gamma distribution with parameters λ and n. From equation
(2.83) we obtain

P{N(t) = n} = P{N(t) ≥ n} − P{N(t) > n} = (λt)^n e^(−λt) / n!   for n = 0, 1, 2, …
Several important properties of the renewal function are given below.
Property 2.7: The mean value function of the renewal process, denoted by m(t), is
equal to the sum of the distribution function of all renewal times, that is,
m(t) = E[N(t)] = Σ_{n=1}^{∞} Fn(t)
The mean value function of the renewal process is also called the renewal function.
Property 2.8: The renewal function, m(t), satisfies the following equation:
m(t) = Fa(t) + ∫_0^t m(t − s) dFa(s)        (2.84)

where Fa(t) is the distribution function of the inter-arrival time or the renewal period. The
proof is left as an exercise for the reader (see Problem 28).

Property 2.9: Let y(t) satisfy the renewal-type equation

y(t) = x(t) + ∫_0^t y(t − s) dFa(s)        (2.85)

Then its solution is given by

y(t) = x(t) + ∫_0^t x(t − s) dm(s)

where m(t) is the renewal function defined above.
Example 2.17: Let x(t) = a, a constant. Then, from Property 2.9, the solution y(t) is given by

y(t) = x(t) + ∫_0^t x(t − s) dm(s) = a + a ∫_0^t dm(s) = a(1 + E[N(t)])
Definition 2.6 (Wang and Pham 1996a): If the sequence of non-negative random variables
{X1, X2, ...} is independent and

X_i = α X_{i−1}        (2.86)

for i ≥ 2, where α > 0 is a constant, then the counting process {N(t), t ≥ 0} is said to be a
quasi-renewal process with parameter α and first inter-arrival time X1.
Proposition 2.1 (Wang and Pham 1996a): The shape parameters of Xn are the
same for n = 1, 2, 3, ... for a quasi-renewal process if X1 follows the gamma,
Weibull, or log normal distribution.
This means that after "renewal", the shape parameters of the inter-arrival time
will not change. In software reliability, the assumption that the software debugging
process does not change the error-free distribution type seems reasonable. Thus,
the error-free times of software during the debugging phase modeled by a
quasi-renewal process will have the same shape parameters. In this sense, a
quasi-renewal process is suitable to model the software reliability growth. It is
worthwhile to note that
lim_{n→∞} E(X1 + X2 + ... + Xn)/n = lim_{n→∞} μ1(1 − α^n) / [(1 − α) n]
   = 0   if α < 1
   = ∞   if α > 1

where μ1 = E[X1].
Distribution of N(t)
Consider a quasi-renewal process with parameter α and the first inter-arrival time
X1. Clearly, the total number of renewals, N(t), that has occurred up to time t and
the arrival time of the nth renewal, SSn, has the following relationship:
N(t) ≥ n if and only if SSn ≤ t
that is, N(t) is at least n if and only if the nth renewal occurs prior to time t. It is
easily seen that
SS_n = Σ_{i=1}^{n} X_i = Σ_{i=1}^{n} α^(i−1) X1   for n ≥ 1        (2.87)
Therefore,

L F_n(s) = ∫_0^∞ e^(−α^(n−1) s t) dF1(t) = L F_1(α^(n−1) s)

and

L m(s) = Σ_{n=1}^{∞} L G_n(s) = Σ_{n=1}^{∞} L F_1(s) L F_1(α s) ··· L F_1(α^(n−1) s)
Proposition 2.2 (Wang and Pham 1996a): The first inter-arrival distribution of a
quasi-renewal process uniquely determines its renewal function.
If the inter-arrival time represents the error-free time (time to first failure), a
quasi-renewal process can be used to model reliability growth for both software
and hardware.
Suppose that all faults of software have the same chance of being detected. If
the inter-arrival time of a quasi-renewal process represents the error-free time of a
software system, then the expected number of software faults in the time interval
[0, t] can be defined by the renewal function m(t) with parameter α > 1. Denoting by mr(t) the
number of remaining software faults at time t, it follows that
where m(Tc) is the number of faults that will eventually be detected through a
software lifecycle Tc.
The non-homogeneous Poisson process (NHPP) model represents the number of failures experienced up
to time t by a non-homogeneous Poisson process {N(t), t ≥ 0}. The main issue in the NHPP model is
to determine an appropriate mean value function to denote the expected number of failures
experienced up to a certain time.
With different assumptions, the model will end up with different functional
forms of the mean value function. Note that in a renewal process, the exponential
assumption for the inter-arrival time between failures is relaxed, and in the NHPP,
the stationary assumption is relaxed.
The NHPP model is based on the following assumptions:
• The failure process has independent increments, i.e., the number of failures
during the time interval (t, t + s) depends on the current time t and the length
of the time interval s, and does not depend on the past history of the process.
• The failure rate of the process is given by
P{exactly one failure in (t , t + Δt )} = P{N (t + Δt ) − N (t ) = 1}
= λ (t )Δt + o(Δt )
where λ(t) is the intensity function.
• During a small interval Δt, the probability of more than one failure is negligible, that is,

P{two or more failures in (t, t + Δt)} = o(Δt)
Reliability Function
The reliability R(t), defined as the probability that there are no failures in the time
interval (0, t), is given by
R(t ) = P{N (t ) = 0}
= e− m(t )
In general, the reliability R(x|t), the probability that there are no failures in the
interval (t, t + x), is given by
R( x | t ) = P{N (t + x) − N (t ) = 0}
= e −[ m (t + x ) − m (t )]
and its density is given by
f ( x) = λ (t + x)e−[ m (t + x ) − m (t )]
where

λ(x) = (∂/∂x) m(x)

The variance of the NHPP can be obtained as follows:

Var[N(t)] = ∫_0^t λ(s) ds
Example 2.18: Assume that the intensity λ is a random variable with the pdf f(λ).
Then the probability of exactly n failures occurring during the time interval (0, t) is
given by
P{N(t) = n} = ∫_0^∞ ((λt)^n / n!) e^(−λt) f(λ) dλ
It can be shown that if the pdf f(λ) is given as the following gamma density
function with parameters k and m,
f(λ) = (1/Γ(m)) k^m λ^(m−1) e^(−kλ)   for λ ≥ 0
then
P{N(t) = n} = C(n + m − 1, n) p^m q^n,   n = 0, 1, 2, ...

which is a negative binomial density function, where

p = k/(t + k)   and   q = t/(t + k) = 1 − p
2.8 Problems
1. Assume that the hazard rate, h(t), has a positive derivative. Show that the
hazard distribution
H(t) = ∫_0^t h(x) dx
is strictly convex.
Assume that units on standby cannot fail and the lifetime of each unit follows
the exponential distribution with failure rate λ .
(a) What is the distribution of the system lifetime?
(b) Determine the reliability of the standby system for a mission of 100
hours when λ = 0.0001 per hour and n = 5.
(a) Show that the lifetime of a series system has the Weibull distribution
with pdf
f_s(t) = { (Σ_{i=1}^{n} λ_i^β) β t^(β−1) e^(−(Σ_{i=1}^{n} λ_i^β) t^β)   for t ≥ 0
         { 0                                                             otherwise
(b) Find the reliability of this series system.
5. Consider the pdf of a random variable that is equally likely to take on any
value only in the interval from a to b.
(a) Show that this pdf is given by

f(t) = { 1/(b − a)   for a < t < b
       { 0           otherwise
(b) Derive the corresponding reliability function R(t) and failure rate h(t).
(c) Think of an example where such a distribution function would be of
interest in reliability application.
7. One thousand new streetlights are installed in Saigon city. Assume that the
lifetimes of these streetlights follow the normal distribution. The average life
of these lamps is estimated at 980 burning-hours with a standard deviation of
100 hours.
(a) What is the expected number of lights that will fail during the first 800
burning-hours?
(b) What is the expected number of lights that will fail between 900 and
1100 burning-hours?
(c) After how many burning-hours would 10% of the lamps be expected to
fail?
8. A fax machine with constant failure rate λ will survive for a period of 720
hours without failure, with probability 0.80.
(a) Determine the failure rate λ .
(b) Determine the probability that the machine, which is functioning after
600 hours, will still function after 800 hours.
(c) Find the probability that the machine will fail within 900 hours, given
that the machine was functioning at 720 hours.
10. A diode may fail due to either open or short failure modes. Assume that the
time to failure T0 caused by open mode is exponentially distributed with pdf
f₀(t) = λ₀ e^(−λ₀ t),   t ≥ 0

and the time to failure Ts caused by short mode has the pdf

f_s(t) = λ_s e^(−λ_s t),   t ≥ 0

The pdf for the time to failure T of the diode is given by

f(t) = p f₀(t) + (1 − p) f_s(t),   t ≥ 0
(a) Explain the meaning of p in the above pdf function.
(b) Derive the reliability function R(t) and failure rate function h(t) for the
time to failure T of the diode.
(c) Show that the diode with pdf f(t) has a decreasing failure rate (DFR).
11. A diesel is known to have an operating life (in hours) that fits the following
pdf:
f(t) = 2a / (t + b)²,   t ≥ 0
The average operating life of the diesel has been estimated to be 8760 hours.
(a) Determine a and b.
(b) Determine the probability that the diesel will not fail during the first
6000 operating-hours.
(c) If the manufacturer wants no more than 10% of the diesels returned for
warranty service, how long should the warranty be?
16. Show that the reliability function of Pham distribution (see equation 2.21) is
given as in equation (2.22).
19. Show that for any range of q0 and qs, if qs > q0, the optimal number of parallel
components that maximizes the system reliability is one.
23. Show that the optimal value n* in Theorem 2.5 is an increasing function of q0
and a decreasing function of qs.
25. For any given q0 and qs, show that qs < α < p0 where α is given in equation
(2.73).
29. Events occur according to an NHPP in which the mean value function is
m(t) = t³ + 3t² + 6t,   t > 0.
What is the probability that n events occur between times t = 10 and t = 15?