Random Variables
Discrete Random Variables
A discrete random variable has a discrete sample space. For example, the outcomes of a
discrete random variable may be damage states 1, 2, 3, and 4. Consider a random variable
denoted by uppercase X, with outcomes, i.e., realizations, denoted by lowercase x. The
probability of occurrence of each outcome of the discrete random variable is given by the
probability mass function (PMF):
p_X(x) ≡ P(X = x)   (1)
The PMF has the following property:
∑_{i=1}^{N} p_X(x_i) = 1   (2)
The cumulative distribution function (CDF) gives the probability that the random variable is less than or equal to a given value:
F_X(x) ≡ P(X ≤ x) = ∑_{x_i ≤ x} p_X(x_i)   (3)
The CDF has the two properties F(−∞)=0 and F(∞)=1. Yet another representation of the probability distribution is the complementary CDF (CCDF):
G_X(x) = 1 − F_X(x)   (4)
which has the properties G(−∞)=1 and G(∞)=0.
Partial Descriptors
A random variable is completely defined by its probability distribution. However, “partial
descriptors” are useful in lieu of having the complete distribution. The partial descriptors
are equal to or related to the parameters of the probability distributions that are listed later
in this document. The partial descriptors are also related to the statistical moments of the
probability distribution. The first moment of the distribution is the mean of the random
variable:
µ_X = E[X] = ∑_{i=1}^{N} x_i ⋅ p_X(x_i)   (5)
The second moment is called the mean square of the random variable:
E[X²] = ∑_{i=1}^{N} x_i² ⋅ p_X(x_i)   (6)
Conversely, central moments are taken about the mean of the random variable. As a
result, the first central moment is zero. The second central moment is the variance of the
random variable, which is the square of the standard deviation:
σ_X² = Var[X] = E[(X − µ_X)²] = ∑_{i=1}^{N} (x_i − µ_X)² ⋅ p_X(x_i)   (7)
Several concepts for discrete random variables, such as the coefficient of variation and the coefficient of skewness, are defined in the same way as for continuous random variables. Therefore, further details are provided in the document on continuous random variables.
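As an illustration, the following short Python sketch evaluates the moments in Eqs. (5) to (7) directly from a PMF. The outcome values and probabilities are hypothetical example numbers, not taken from any particular application.

    import numpy as np

    # Hypothetical damage states and assumed PMF values; Eq. (2) requires the probabilities to sum to one
    x = np.array([1, 2, 3, 4])
    p = np.array([0.4, 0.3, 0.2, 0.1])
    assert np.isclose(p.sum(), 1.0)

    mean = np.sum(x * p)                   # first moment, Eq. (5)
    mean_square = np.sum(x**2 * p)         # second moment, Eq. (6)
    variance = np.sum((x - mean)**2 * p)   # second central moment, Eq. (7)
    print(mean, variance, mean_square - mean**2)   # the last two values agree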
The Bernoulli Distribution
Consider a discrete random variable, X, with two possible outcomes: failure and success,
i.e., 0 and 1, respectively. The probability of success is denoted p. Consequently, the
probability of failure is 1-p and the Bernoulli PMF is thus defined:
p(x) = 1 − p  for x = 0,  and  p(x) = p  for x = 1   (8)
Using earlier formulas, the mean is p and the variance is p(1-p). To specify that a random
variable, X, has the Bernoulli distribution, one writes: X~Bernoulli(p).
The Binomial Distribution
Consider a sequence of mutually independent Bernoulli trials with constant success
probability, p. Let the random variable X denote the number of successes in n trials. The
PMF for this random variable is the binomial distribution
p(x) = n!/(x!⋅(n − x)!) ⋅ p^x ⋅ (1 − p)^(n−x)   (9)
The mean of X is np and its variance is np(1-p). To specify that a random variable, X, has
the binomial distribution, one writes: X~Binomial(p,n).
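As a quick numerical check, the stated mean and variance can be reproduced with scipy; the values of n and p below are arbitrary example numbers.

    from scipy.stats import binom

    n, p = 10, 0.3                      # assumed example values
    dist = binom(n, p)
    print(dist.pmf(2))                  # P(X = 2), Eq. (9)
    print(dist.mean(), n * p)           # both equal n*p
    print(dist.var(), n * p * (1 - p))  # both equal n*p*(1-p)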
The Geometric Distribution
The number of trials, S, until the first success, and the number of trials between successes, in a Bernoulli sequence is given by the geometric distribution:
p(s) = p ⋅ (1 − p)^(s−1)   (10)
The mean recurrence time, sometimes called the return period, is 1/p. The variance is (1 − p)/p². To specify that a random variable, S, has the geometric distribution, one writes:
S~Geometric(p).
The Negative Binomial Distribution
The number of Bernoulli trials, W, until k occurrences of success is
W = S_1 + S_2 + ⋯ + S_k   (11)
where S_i is the number of trials between success number i−1 and success number i. The distribution of each S_i is geometric. Combined with the fact that Eq. (11) is a sum of independent random variables, the mean and variance of W are
µ_W = k ⋅ µ_S = k ⋅ 1/p   (12)
σ_W² = k ⋅ σ_S² = k ⋅ (1 − p)/p²   (13)
The distribution type for W is the negative binomial distribution:
p(w) = (w − 1)!/((k − 1)!⋅(w − k)!) ⋅ p^k ⋅ (1 − p)^(w−k)   (14)
To specify that a random variable, W, has the negative binomial distribution, one writes:
W~NegativeBinomial(p,k).
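The PMF in Eq. (14) can be evaluated directly, or via scipy, whose nbinom distribution counts the number of failures, w−k, rather than the total number of trials, w. The sketch below uses arbitrary example values of k, p, and w.

    from scipy.special import comb
    from scipy.stats import nbinom

    k, p, w = 3, 0.25, 8                                          # assumed example values
    pmf_direct = comb(w - 1, k - 1) * p**k * (1 - p)**(w - k)     # Eq. (14)
    pmf_scipy = nbinom(k, p).pmf(w - k)                           # scipy counts failures, w - k
    print(pmf_direct, pmf_scipy)                                  # identical values
    print(nbinom(k, p).mean() + k, k / p)                         # mean of W, Eq. (12)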
The Poisson Distribution
In situations where the number of Bernoulli trials is infinite, such as when every time
instant is considered a trial, the Poisson distribution gives the number of successes, x:
p(x) = (λ⋅T)^x / x! ⋅ e^(−λ⋅T)   (15)
where λ is the rate of occurrence of success per unit time and T is the time period under consideration. The mean number of occurrences is λT, which is also equal to the variance. To specify that a random variable, X, has the Poisson distribution, one writes: X~Poisson(λ,T).
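A minimal sketch with assumed values of λ and T illustrates the parameterization; note that scipy's poisson takes the product λT as its single parameter.

    from scipy.stats import poisson

    rate, T = 0.2, 10.0                 # assumed rate per unit time and time period
    dist = poisson(rate * T)            # the Poisson parameter is lambda*T
    print(dist.pmf(3))                  # probability of exactly 3 occurrences, Eq. (15)
    print(dist.mean(), dist.var())      # both equal lambda*T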
Continuous Random Variables
A continuous random variable has a continuous sample space, and its probability distribution is described by the probability density function (PDF), f_X(x). The PDF is non-negative and integrates to unity:
∫_{−∞}^{∞} f_X(x) dx = 1   (17)
The CDF is obtained by integration of the PDF:
F_X(x) ≡ P(X ≤ x) = ∫_{−∞}^{x} f_X(u) du   (18)
which has the properties F(-∞)=0 and F(∞)=1. The PDF can be computed from the CDF
by differentiation:
f(x) = dF(x)/dx   (19)
Another representation of the probability distribution of X is the complementary CDF (CCDF):
G_X(x) = 1 − F_X(x)   (20)
Partial Descriptors
As for discrete random variables, the partial descriptors of a continuous random variable are equal to or related to the parameters of the probability distributions listed later in this document, and they are related to the statistical moments of the probability distribution. The first moment of the distribution is the mean of the random variable:
µ_X = E[X] = ∫_{−∞}^{∞} x ⋅ f_X(x) dx   (21)
In passing, it is noted that the mean can also be calculated as the area underneath the
CCDF. Because the PDF is obtained by differentiation of the CCDF, with a negative sign
in front, Eq. (21) turns into:
E[X] = −∫_{0}^{∞} x ⋅ (dG(x)/dx) dx
Integration by parts yields:
E[X] = −∫_{0}^{∞} x ⋅ (dG(x)/dx) dx = −[x ⋅ G(x)]_{0}^{∞} + ∫_{0}^{∞} 1 ⋅ G(x) dx
Because the boundary term vanishes, this simplifies to:
E[X] = ∫_{0}^{∞} G(x) dx
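This identity, which holds for non-negative random variables, is easily verified numerically. The sketch below uses an exponential distribution with an assumed mean of 2 and integrates its CCDF, i.e., the survival function, from zero to infinity.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import expon

    dist = expon(scale=2.0)                          # assumed non-negative random variable, mean 2
    area_under_ccdf, _ = quad(dist.sf, 0, np.inf)    # sf(x) is the CCDF, G(x)
    print(area_under_ccdf, dist.mean())              # both equal 2.0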
The second moment is called the mean square of the random variable:
E[X²] = ∫_{−∞}^{∞} x² ⋅ f_X(x) dx   (22)
Conversely, central moments are taken about the mean of the random variable. As a
result, the first central moment is zero. The second central moment is the variance of the
random variable, which is the square of the standard deviation:
σ_X² = Var[X] = E[(X − µ_X)²] = ∫_{−∞}^{∞} (x − µ_X)² ⋅ f_X(x) dx   (23)
By expanding Eq. (23) one finds that “the variance is equal to the mean square minus the
square of the means:”
σ_X² = E[X²] − µ_X²   (24)
The coefficient of variation of a random variable is defined as:
δ_X = σ_X / µ_X   (25)
The coefficient of skewness of a random variable is related to the third central moment as follows:
γ_X = E[(X − µ_X)³] / σ_X³   (26)
The coefficient of kurtosis provides a measure of the flatness of the distribution and is related to the fourth central moment as follows:
κ_X = E[(X − µ_X)⁴] / σ_X⁴   (27)
Figure 1: Plots of selected continuous PDFs from the GNU Scientific Library Reference: uniform, normal, lognormal, gamma, Rayleigh, exponential, beta, chi-squared, t, Weibull, logistic, and Type I (Gumbel).
The Normal Distribution
The implementation in Rt is:
Rt
PDF:  f(x, µ, σ) = 1/√(2π⋅σ²) ⋅ exp(−½ ⋅ ((x − µ)/σ)²)
mean = µ
stdv = σ
µ = mean
σ = stdv
The Lognormal Distribution
A lognormal random variable, X, is one whose natural logarithm, Y=ln(X), is normally distributed, so that F(x) = F_Y(ln(x)) and f(x) = f_Y(ln(x))/x, where f_Y and F_Y are the normal PDF and CDF, respectively. By employing the standard normal distribution, these expressions turn into
f(x) = 1/(x ⋅ σ_Y) ⋅ φ((ln(x) − µ_Y)/σ_Y)   (33)
F(x) = Φ((ln(x) − µ_Y)/σ_Y)   (34)
and solving for x in Eq. (34) yields the inverse lognormal CDF in terms of the inverse
normal CDF:
x = exp(Φ⁻¹(p) ⋅ σ_Y + µ_Y)   (35)
The remaining question is how to evaluate the parameters µ_Y and σ_Y, which are the mean and standard deviation of the normal random variable Y=ln(X), not the lognormal random variable X. Several options are possible. One is having µ_X and σ_X. It can then be shown that the sought parameters are
µ_Y = ln(µ_X) − ½ ⋅ ln(1 + (σ_X/µ_X)²)   (36)
σ_Y = √(ln((σ_X/µ_X)² + 1))   (37)
Another option is having the median and “dispersion” of X. To understand this, consider
first the median of the normal random variable Y, denoted mY, which for the normal
distribution equals the mean, µY. Because of Eq. (30), it is clear that the sought parameter
µY is
µY = ln(m X ) (38)
As an aside note, this implies that the term ln(x)-µY in the argument of the distributions
above can be written as
ln(x) − µ_Y = ln(x) − ln(m_X) = ln(x/m_X)   (39)
The so-called dispersion is merely another name for σ_Y. In Rt the following symbols are used for µ_Y and σ_Y:
µ_Y = ζ  and  σ_Y = σ   (40)
As a result, to specify that a random variable, X, has the lognormal distribution, one writes: X~LN(ζ,σ). The implementation in Rt is:
Rt
PDF:  f(x, ζ, σ) = 1/(x ⋅ √(2π⋅σ²)) ⋅ exp(−½ ⋅ ((ln(x) − ζ)/σ)²)
mean = µ_X = exp(ζ + σ²/2)
stdv = σ_X = √(exp(σ²) − 1) ⋅ exp(ζ + σ²/2)
ζ = µ_Y = ln(m_X) = ln(mean) − ½ ⋅ ln(1 + (stdv/mean)²)
σ = σ_Y = dispersion = √(ln((stdv/mean)² + 1))
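The conversion from a second-moment description to the parameters ζ and σ is illustrated below with assumed values of the mean and standard deviation; scipy's lognorm is parameterized by s = σ_Y and scale = exp(ζ), i.e., the median.

    import numpy as np
    from scipy.stats import lognorm

    mean, stdv = 10.0, 3.0                                       # assumed second-moment description of X
    zeta = np.log(mean) - 0.5 * np.log(1 + (stdv / mean)**2)     # Eq. (36)
    sigma = np.sqrt(np.log((stdv / mean)**2 + 1))                # Eq. (37)

    dist = lognorm(s=sigma, scale=np.exp(zeta))                  # scipy parameterization
    print(dist.mean(), dist.std())                               # recovers 10.0 and 3.0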
The Exponential Distribution
In Rt, the exponential distribution is implemented with a shift parameter and a different notation, as shown in the table below. Therefore, another way to specify that a random variable, X, has the exponential distribution is X~Exp(µ,x0).
Rt
PDF:  f(x, µ, x0) = 1/µ ⋅ exp(−(x − x0)/µ)
mean = µ + x0
stdv = µ
µ = stdv
x0 = mean − µ
For the gamma distribution, the parameters of the Rt implementation are obtained from the mean and standard deviation as
a = (mean/stdv)²
b = stdv²/mean
Viewed as a waiting time, x, in a Poisson process, and in several other situations, the
gamma distribution is often written as
f(x) = ν ⋅ (ν⋅x)^(k−1) ⋅ e^(−ν⋅x) / Γ(k)   (42)
where a and b are related to the parameters ν and k as follows: a=k and b=1/ν. Then the mean is k/ν and the standard deviation is √k/ν. Using the distribution with ν and k, the postulation that a random variable, X, has the gamma distribution can also be written X~Gamma(ν,k).
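The correspondence between (a, b) and (ν, k) can be checked with scipy's gamma distribution, which uses a shape parameter a and a scale parameter; the values below are arbitrary.

    from scipy.stats import gamma

    nu, k = 2.0, 3.0                  # assumed rate and shape in Eq. (42)
    a, b = k, 1.0 / nu                # relations stated above
    dist = gamma(a, scale=b)
    print(dist.mean(), k / nu)        # both equal k/nu
    print(dist.std(), k**0.5 / nu)    # both equal sqrt(k)/nu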
A more versatile beta distribution is obtained by letting the user specify the interval.
Instead of being defined in the interval 0 to 1, the beta distribution that is available in Rt
is defined in the interval min to max. This yields a particularly handy distribution, but the
versatility comes at the cost of having to specify four distribution parameters. To specify
that a random variable, X, has this full beta distribution, one writes:
X~Beta(a,b,min,max). Because there are four distribution parameters, the distribution
parameters cannot be determined uniquely from the mean and standard deviation. The
implementation in Rt is:
Rt
PDF:  f(x, a, b, min, max) = 1/(max − min) ⋅ Γ(a+b)/(Γ(a)⋅Γ(b)) ⋅ ((x − min)/(max − min))^(a−1) ⋅ (1 − (x − min)/(max − min))^(b−1)
mean = min + a/(a + b) ⋅ (max − min)
stdv = 1/(a + b) ⋅ √(a⋅b/(a + b + 1)) ⋅ (max − min)
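In scipy, this four-parameter beta distribution is obtained by shifting and scaling the standard beta with loc=min and scale=max−min; the parameter values below are arbitrary.

    from scipy.stats import beta

    a, b, lo, hi = 2.0, 5.0, 1.0, 4.0                   # assumed a, b, min, max
    dist = beta(a, b, loc=lo, scale=hi - lo)
    print(dist.mean(), lo + a / (a + b) * (hi - lo))    # matches the mean row of the table
    print(dist.std())                                   # matches the stdv row of the table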
The sum of ν squared standard normal random variables is a random variable that has the chi-squared distribution with ν degrees of freedom. The implementation in Rt is:
Rt
PDF:  f(x, ν, x0) = 1/(2 ⋅ Γ(ν/2)) ⋅ ((x − x0)/2)^(ν/2 − 1) ⋅ exp(−(x − x0)/2)
mean = ν + x0
stdv = √(2ν)
ν = stdv²/2
x0 = mean − ν
The Weibull distribution is implemented in Rt as:
Rt
PDF:  f(x, a, b) = b/a^b ⋅ x^(b−1) ⋅ exp(−(x/a)^b)
mean = a ⋅ Γ(1 + 1/b)
stdv = a ⋅ √(Γ(1 + 2/b) − (Γ(1 + 1/b))²)
When the number of experiments is large then the distribution for the extreme value, X, is
mostly dependent on the tail behaviour of the underlying probability distribution for Z. It
is rather insensitive to the overall behaviour of the actual underlying probability
distribution. For these situations, several asymptotic extreme-value distributions have been developed. They cannot be synthesized into one distribution because the result is
different for minimum and maximum values. Furthermore, the result is different for
different types of tail-behaviour in the underlying probability distribution.
Type I Distributions (Gumbel)
This distribution addresses the maximum value of many experiments. The “Type I”
assumption is that the tail of the underlying distribution varies exponentially:
F_Z(z) = 1 − exp(−h(z))   (50)
The tails are unbounded. This type of tail is found in the normal, exponential, and gamma
distributions. Application of this underlying tail distribution in extreme value theory
yields the Type I Largest and Type I Smallest distributions, for the maximum and
minimum of many realizations, respectively. The resulting distributions are named after
Gumbel:
F(x) = exp(−exp(−(x − µ)/σ))   (51)
F(x) = 1 − exp(−(x/λ)^k)  for x ≥ 0   (56)
f(x) = k/λ ⋅ (x/λ)^(k−1) ⋅ exp(−(x/λ)^k)  for x ≥ 0   (57)
where k>0 is the shape parameter and λ>0 is the scale parameter.
Generalized Extreme Value Distribution
F(x) = exp(−(1 + ξ ⋅ (x − µ)/σ)^(−1/ξ))
where µ is the location parameter, σ is the scale parameter, and ξ is the shape parameter.
The rule of total probability to obtain a continuous probability distribution, having the
distribution conditioned on some events is:
f(x) = ∑_{i=1}^{N} f(x | E_i) ⋅ P(E_i)   (60)
where N is the number of mutually exclusive and collectively exhaustive events, which could be the outcomes of a discrete random variable. The rule of total probability to
obtain a probability distribution, having the distribution conditioned on the outcomes of
another continuous random variable is:
f(x) = ∫_{−∞}^{∞} f(x | y) ⋅ f_Y(y) dy   (61)
For discrete random variables, the rule of total probability to obtain the probability of an event, having probability values conditioned upon the outcomes of the random variable, is:
P(A) = ∑_{i=1}^{N} P(A | x_i) ⋅ p(x_i)   (62)
The rule of total probability to obtain a probability distribution, having the distribution
conditioned on the outcomes of another random variable or some other discrete events:
p_X(x) = ∑_{i=1}^{N} p_X(x | y_i) ⋅ p_Y(y_i)   (63)
The marginal PDFs of two jointly distributed continuous random variables, X and Y, are obtained from the joint PDF, f_XY(x,y), by integrating over the other variable:
f_Y(y) = ∫_{−∞}^{∞} f_XY(x, y) dx
f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy   (65)
The joint PDF integrates to unity:
∫_{−∞}^{∞} ∫_{−∞}^{∞} f_XY(x, y) dx dy = 1
The relationship between the joint PDF and the joint CDF is
F_XY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_XY(u, v) dv du
Partial Descriptors
In the context of joint distributions, partial descriptors include the mean product, E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x ⋅ y ⋅ f_XY(x, y) dx dy, and the covariance:
Cov[X,Y] = E[(X − µ_X)(Y − µ_Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µ_X)(y − µ_Y) ⋅ f_XY(x, y) dx dy   (73)
Expansion of the integrand in Eq. (73) reveals that the covariance is equal to the mean
product minus the product of the means:
Cov[X,Y] = E[XY] − µ_X ⋅ µ_Y   (74)
This echoes the fact that the variance of a marginal distribution equals the mean square
minus the square of the means. As discussed later in this document, the covariance
between two random variables is a measure of linear dependence between them. A
normalized, i.e., dimensionless measure of this linear dependence is the correlation
coefficient, which is defined as:
ρ_XY = Cov[X,Y]/(σ_X ⋅ σ_Y) = E[(X − µ_X)(Y − µ_Y)]/(σ_X ⋅ σ_Y) = E[((X − µ_X)/σ_X) ⋅ ((Y − µ_Y)/σ_Y)]   (75)
In the same way as relative frequency diagrams are instructive visualizations of the
realizations of a single random variable, a scatter diagram is valuable when two random
variables are observed simultaneously. The scatter diagram visualizes the outcomes of
one variable along one axis versus the outcomes of the other variable along the other axis.
The plot gives a sense of the dependence between the two variables. Statistical
dependence between random variables may take different forms. For example, one form
of dependence is that one variable varies exponentially with the other. Yet another
example is linear dependence, in which the realizations of one random variable tend to be
proportional to the outcomes of another random variable. Correlation, defined in Eq. (75),
measures linear dependence. In other words, two random variables can be uncorrelated
but statistically dependent. It is also emphasized that when statistical dependence is
specified by means of correlation then the possibility of a non-positive definite
correlation matrix is present. In reliability analysis, this prevents the transformation into
standard normal random variables. As a result, some correlation structures are impractical
and/or unphysical. Importantly, the range of possible correlation depends upon the
marginal probability distributions of the random variables. Hence, in reliability analysis
applications, the specification of correlation must be made with care and with knowledge
about the marginal probability distributions.
Matrix Notation
When dealing with second-moment information, i.e., mean, variance, and correlation of
multiple random variables it is convenient to use matrix notation. As an illustration, let X denote a vector of random variables with mean vector M_X and covariance matrix Σ_XX, which contains the variances on the diagonal and the covariances off the diagonal.
By defining the matrix D_XX to be a square matrix with the standard deviations on the diagonal, the covariance matrix is written as the decomposition
Σ_XX = D_XX ⋅ R_XX ⋅ D_XX   (78)
where RXX is the correlation matrix, which is also symmetric:
R_XX = [ 1      ρ_12   ρ_13
         ρ_12   1      ρ_23
         ρ_13   ρ_23   1 ]   (79)
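The decomposition in Eq. (78) is convenient in computations; the sketch below assembles a covariance matrix from assumed standard deviations and an assumed correlation matrix and checks that the correlation matrix is positive definite.

    import numpy as np

    stdv = np.array([2.0, 3.0, 1.5])             # assumed standard deviations
    R = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.5],
                  [0.1, 0.5, 1.0]])              # assumed correlation matrix, Eq. (79)
    D = np.diag(stdv)                            # D_XX
    Sigma = D @ R @ D                            # Eq. (78)
    print(Sigma)
    print(np.all(np.linalg.eigvalsh(R) > 0))     # positive definiteness check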
The Joint Normal Distribution
Unlike the situation for univariate distributions, only a few standard multivariate
distribution types are encountered. By far the most common is the joint Normal
distribution. The joint normal PDF is
f(x) = 1/√((2π)^n ⋅ det(Σ_XX)) ⋅ exp(−½ ⋅ (x − M_X)^T ⋅ Σ_XX⁻¹ ⋅ (x − M_X))   (80)
For the special case of two random variables, the joint normal PDF reads
f(x_1, x_2) = 1/(2π ⋅ σ_1 ⋅ σ_2 ⋅ √(1 − ρ²)) ⋅ exp(−z/(2 ⋅ (1 − ρ²)))   (81)
where
z = ((x_1 − µ_1)/σ_1)² + ((x_2 − µ_2)/σ_2)² − 2ρ ⋅ (x_1 − µ_1)(x_2 − µ_2)/(σ_1 ⋅ σ_2)   (82)
A special case is the standard normal distribution, which is characterized by zero means, unit variances, and zero covariances. This PDF is denoted by the symbol φ and takes the form
φ(y) = 1/√((2π)^n) ⋅ exp(−½ ⋅ y^T y)   (83)
This multivariate distribution has several properties that are important in reliability
analysis:
1. The multivariate standard normal PDF is rotationally symmetric and it decays exponentially in the radial and tangential directions
2. The probability content outside a hyper-plane at distance β from the point y=0 is:
p = Φ(−β)   (84)
which is employed in the document on FORM.
3. The probability content outside a hyper-paraboloid with apex at distance β from the point y=0 is also available, as described in the document on SORM.
Copulas
Copulas represent an alternative technique for specifying statistical dependence between
random variables. Currently, their use is more widespread in economics than in engineering,
but that may change. Copulas extend the options for prescribing statistical dependence
beyond the use of the correlation coefficient, which only provides linear statistical
dependence. The correlation coefficient is convenient and popular for a few reasons.
First, it appears prominently in second-moment theory, together with means and standard
deviations. Second, the correlation coefficient appears as a parameter in the powerful
joint normal probability distribution, as described earlier in this document. However, the
convenience of the correlation coefficient diminishes in more general circumstances.
Consider the example when the joint distribution is sought for a set of random variables
with mixed marginal distributions and perhaps nonlinear dependence tendencies. This
problem is important in reliability analysis where the Nataf or Rosenblatt transformations
are usually applied. Under such circumstances copulas represent an alternative, although they have yet to become popular in reliability analysis. The key feature of the copula
technique is that a variety of dependence structures are possible. One example is stronger
dependence in the distribution tails. An interesting class of copulas is the generalized
elliptical distributions that are generalizations of the joint normal distribution. The joint
normal distribution is also elliptical, but it is a special case of the “infinite” possibilities
provided by copulas.
From a philosophical viewpoint, the need to specify statistical dependence between
random variables is, in some sense, a symptom of imperfect models. Correlation typically stems from hidden phenomena behind the random variables. If the underlying
phenomena were modelled then the need to prescribe statistical dependence might vanish.
Consider the example of prescribing correlation between the earthquake intensity at two
nearby sites. The need to estimate this correlation disappears if the modelling is expanded
to include the hypocentre location, the earthquake magnitude, and the attenuation of the
intensity to each site. It is those underlying phenomena that cause correlation in intensity
between sites. This philosophical discussion is somewhat akin to the discussion on
whether aleatory uncertainty exists. It does, unless all models are perfect, which they are
not. However, this paragraph is intended to foster a strong focus on modelling and careful
examination of the need to prescribe statistical dependence.
Sklar’s Theorem
Sklar’s theorem is the foundation for the use of copulas. It states that the joint CDF of
some random variables, X, can be written in terms of a copula, C, which is a function of
the marginal CDFs of the random variables:
F(x_1, x_2, …, x_n) = C(F_1(x_1), F_2(x_2), …, F_n(x_n))   (85)
That is, the joint distribution is composed of the marginal distributions and the copula
function. In other words, the copula is a function that couples the marginal distribution
functions. This is the means by which dependence is introduced. It is also observed that
copulas express dependence on a “quantile scale,” namely along the random variable
axes. In this manner, the dependence at 10% probability of exceedance can be different
from the dependence at 90% probability of exceedance. Several other interpretations of
Eq. (85) are possible. First, it is observed that a copula is what remains of a joint
cumulative distribution once the action of the marginal cumulative distribution functions
has been removed. In other words, the marginals provide the probability distributions,
while the sole purpose of the copula is to provide statistical dependence. Furthermore,
Sklar’s theorem can be written
C(p_1, p_2, …, p_n) = F(F_1⁻¹(p_1), F_2⁻¹(p_2), …, F_n⁻¹(p_n))   (86)
where pi are probabilities. This form of Eq. (85) is used to “extract” copulas from existing
joint distributions, as described shortly. It is noted that a copula is invariant with respect
to strictly increasing transformations of the random variables, such as that of
transforming random variables from normal to standard normal.
Explicit and Implicit Copulas
The simplest example of a copula is the one that yields no dependence at all. That is, the
copula for independent random variables is:
F(x_1, x_2, …, x_n) = F_1(x_1) ⋅ F_2(x_2) ⋯ F_n(x_n)   (87)
This expression, which corresponds to the definition of statistical independence between
random variables, is an example of an explicit copula. Copulas are either implicit or
explicit. Implicit copulas are extracted from known joint distributions. For example, the
Gauss copula is extracted from the joint normal probability distribution. Specifically,
from Sklar’s theorem in Eq. (85) it is understood that when the random variables have the
joint CDF F, then the copula, C, is that joint CDF expressed as a function of the marginal CDF values. This is what is
emphasized in Eq. (86). Consider two correlated normal random variables, here standard
normal for simplicity:
F(x_1, x_2) = Φ(x_1, x_2) = ∫_{−∞}^{x_2} ∫_{−∞}^{x_1} 1/(2π ⋅ √(1 − ρ²)) ⋅ exp(−(s_1² + s_2² − 2ρ⋅s_1⋅s_2)/(2 ⋅ (1 − ρ²))) ds_1 ds_2   (88)
The Gauss copula is extracted from this joint CDF by substituting the random variables in the original distribution with inverse marginal CDFs evaluated at the probabilities:
C(p_1, p_2) = ∫_{−∞}^{Φ⁻¹(p_2)} ∫_{−∞}^{Φ⁻¹(p_1)} 1/(2π ⋅ √(1 − ρ²)) ⋅ exp(−(s_1² + s_2² − 2ρ⋅s_1⋅s_2)/(2 ⋅ (1 − ρ²))) ds_1 ds_2   (89)
Student:  C(p_1, p_2) = ∫_{−∞}^{T_ν⁻¹(p_2)} ∫_{−∞}^{T_ν⁻¹(p_1)} 1/(2π ⋅ √(1 − ρ²)) ⋅ (1 + (s_1² + s_2² − 2ρ⋅s_1⋅s_2)/(ν ⋅ (1 − ρ²)))^(−(ν+2)/2) ds_1 ds_2   (92)
Frank:  C(p_1, p_2) = −1/θ ⋅ ln(1 + (e^(−θ⋅p_1) − 1) ⋅ (e^(−θ⋅p_2) − 1)/(e^(−θ) − 1))   (93)
Clayton:  C(p_1, p_2) = (p_1^(−θ) + p_2^(−θ) − 1)^(−1/θ)   (94)
Meta Distributions
A potentially interesting aspect of the use of copulas is the possibility of creating entirely
new “meta” distributions. This is achieved by first employing Eq. (86) to extract an
implicit copula, followed by utilization of Eq. (85) to substitute “arbitrary” CDFs into the
copula. Clearly, a great number of possible joint probability distributions—perhaps more
or less useful—then become available. To generate realizations of random variables from
meta distributions the following sampling procedure may be helpful:
1. Generate outcomes of some random variables, x, from the fundamental distribution, say the normal
2. Obtain the marginal CDF value for each random variable, i.e., p=F(x)
3. Transform according to the desired marginal distribution, x=F⁻¹(p), as illustrated in the sketch below
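The three steps are illustrated below for a bivariate example with the Gauss copula as the fundamental dependence structure and assumed exponential and Gumbel marginals; all numerical values are arbitrary.

    import numpy as np
    from scipy.stats import norm, expon, gumbel_r

    rng = np.random.default_rng(1)
    rho = 0.7                                              # assumed correlation of the underlying normals
    cov = [[1.0, rho], [rho, 1.0]]

    # Step 1: outcomes from the fundamental distribution (correlated standard normals)
    z = rng.multivariate_normal([0.0, 0.0], cov, size=10000)
    # Step 2: marginal CDF values, p = F(x)
    p = norm.cdf(z)
    # Step 3: transform to the desired marginals, x = F^{-1}(p)
    x1 = expon(scale=2.0).ppf(p[:, 0])
    x2 = gumbel_r(loc=5.0, scale=1.0).ppf(p[:, 1])
    print(np.corrcoef(x1, x2)[0, 1])                       # dependence inherited from the Gauss copula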
Copula Densities
Differentiation of Sklar's theorem yields the joint PDF in terms of the copula density, c, and the marginal PDFs:
f(x_1, x_2, …, x_n) = c(F_1(x_1), F_2(x_2), …, F_n(x_n)) ⋅ ∏_{i=1}^{n} f_i(x_i)   (96)
Measures of tail dependence quantify the dependence between random variables far out in the tails of their distributions. The coefficient of upper tail dependence is
λ_upper = lim_{q→1⁻} P(X_i > F_i⁻¹(q) | X_j > F_j⁻¹(q)) = lim_{q→1⁻} (1 − 2q + C(q, q))/(1 − q)   (98)
Similarly, the coefficient of lower tail dependence is
λ_lower = lim_{q→0⁺} P(X_i < F_i⁻¹(q) | X_j < F_j⁻¹(q)) = lim_{q→0⁺} C(q, q)/q   (99)
It is noted that for the normal copula, λ_upper = λ_lower = 0. Hence, it is not possible to take into account tail dependence with this copula, contrary to, say, the Student copula. When using copulas, there are also measures of dependence other than the measures of tail dependence in Eqs. (98) and (99). These include rank correlation coefficients, such as Kendall's tau and Spearman's rho. For Archimedean copulas, there is a strong connection between Kendall's tau and the parameter of the copula function.
Classical Inference
Classical statistical inference for random variables attempts to determine point estimates
for the distribution parameters. In other words, values are sought for the mean and
standard deviation of the random variable, and perhaps other distribution parameters. In
the classical approach, such point estimates are sometimes complemented by confidence
intervals to gauge the uncertainty in the point estimates. This document provides the most basic formulas, but starts with an exposition of the diagrams that should always be plotted
before computations are made.
Diagrams
Certain diagrams are helpful to visualize the characteristics of a probability distribution.
Three plots are particularly popular:
• Histogram: In these plots the abscissa axis shows the outcome space for the random
variable. To generate a histogram, this axis is divided into “bins” and the number of
observed realizations within each bin is plotted on the ordinate axis.
• Frequency diagram: This diagram is a normalized version of the histogram. In
particular, the area underneath the frequency diagram is unity, which means that it
can be visually compared with standard PDFs. The frequency diagram is normalized
by dividing the ordinate values of the histogram by the total area of the histogram.
The total area of the histogram equals the total number of observations multiplied by
the bin size.
• Cumulative frequency diagram: While the frequency diagram is comparable to a
PDF, the cumulative frequency diagram is comparable with the CDF. This plot also
has the random variable along the abscissa axis. Ordinate values are computed at
every observed realization of the random variable. Each ordinate value equals the
number of realizations at and below that abscissa value, divided by the total number
of observations.
Second-moment Statistics
Given n observations xi of the random variable X the sample mean is:
x̄ = 1/n ⋅ ∑_{i=1}^{n} x_i   (100)
The sample variance, i.e., the sample standard deviation squared, is:
s² = 1/(n − 1) ⋅ ∑_{i=1}^{n} (x_i − x̄)²   (101)
In situations with many observations Eq. (101) is somewhat cumbersome because the sample mean must be pre-computed before looping through the data again to compute s². This is remedied by the following manipulations:
s² = 1/(n − 1) ⋅ ∑_{i=1}^{n} (x_i − x̄)²
   = 1/(n − 1) ⋅ (∑_{i=1}^{n} x_i² + ∑_{i=1}^{n} x̄² − ∑_{i=1}^{n} 2⋅x_i⋅x̄)
   = 1/(n − 1) ⋅ ((∑_{i=1}^{n} x_i²) + n⋅x̄² − 2⋅n⋅x̄²)
   = 1/(n − 1) ⋅ ((∑_{i=1}^{n} x_i²) − n⋅x̄²)   (102)
This expression is more computationally convenient because the data can be looped over
only once, to compute the sum of x_i and the sum of x_i². In passing, it is noted that the
reason for the denominator (n−1) instead of simply n is as follows: Consider the sample mean and the sample variance to be random variables in their own right. Then, the expectation
of the sample mean is
E[x̄] = 1/n ⋅ ∑_{i=1}^{n} E[x_i] = 1/n ⋅ ∑_{i=1}^{n} µ_x = µ_x   (103)
where the second-last term recognizes that E[xi] is the mean of the random variable. This
provides comfort that the expectation of the sample mean equals the mean of the
underlying random variable. Next, consider the expectation of the sample variance:
E[s²] = 1/(n − 1) ⋅ E[(∑_{i=1}^{n} x_i²) − n⋅x̄²]
      = 1/(n − 1) ⋅ ((∑_{i=1}^{n} E[x_i²]) − n⋅E[x̄²])   (104)
To proceed, it is made use of the fact that “the variance is equal to the mean square minus
the square of the means,” so that:
∑_{i=1}^{n} E[x_i²] = ∑_{i=1}^{n} (E[x_i]² + Var[x_i]) = ∑_{i=1}^{n} (µ_x² + σ_x²)   (105)
and
E[x̄²] = E[x̄]² + Var[x̄]   (106)
where the mean of the sample mean is provided in Eq. (103) and the variance of the
sample mean is:
Var[x̄] = 1/n² ⋅ ∑_{i=1}^{n} Var[x_i] = 1/n² ⋅ ∑_{i=1}^{n} σ_x² = σ_x²/n   (107)
Substitution of Eqs. (103) and (107) into Eq. (106) and substitution of Eqs. (105) and
(106) into Eq. (104) yields
E[s²] = 1/(n − 1) ⋅ ((∑_{i=1}^{n} (µ_x² + σ_x²)) − n ⋅ (µ_x² + σ_x²/n))
      = 1/(n − 1) ⋅ (n ⋅ (µ_x² + σ_x²) − n ⋅ (µ_x² + σ_x²/n))
      = 1/(n − 1) ⋅ (n⋅σ_x² − σ_x²)   (108)
      = σ_x²/(n − 1) ⋅ (n − 1)
      = σ_x²
which shows that the denominator (n−1) is necessary for the expectation of the sample variance to match the variance of the underlying random variable.
Correlation
The formulas for sample mean and sample variance of individual random variables are
valid for the inference on joint random variables. In addition, the sample correlation
coefficient is:
ρ = 1/(n − 1) ⋅ (∑_{i=1}^{n} x_i⋅y_i − n⋅x̄⋅ȳ)/(s_x ⋅ s_y)   (109)
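The sketch below gathers these estimators in the single-pass form of Eq. (102); the observations are synthetic, generated from assumed distributions purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(10.0, 2.0, size=500)            # assumed observations of X
    y = 0.5 * x + rng.normal(0.0, 1.0, size=500)   # assumed observations of Y, correlated with X
    n = len(x)

    # Single pass over the data: accumulate sums only
    sx, sxx = x.sum(), (x**2).sum()
    sy, syy = y.sum(), (y**2).sum()
    sxy = (x * y).sum()

    xbar, ybar = sx / n, sy / n                                       # Eq. (100)
    s2x = (sxx - n * xbar**2) / (n - 1)                               # Eq. (102)
    s2y = (syy - n * ybar**2) / (n - 1)
    rho = (sxy - n * xbar * ybar) / ((n - 1) * np.sqrt(s2x * s2y))    # Eq. (109)
    print(xbar, np.sqrt(s2x), rho)
    print(np.corrcoef(x, y)[0, 1])                                    # comparison value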
Bayesian Inference
Contrary to classical statistics, where point estimates are provided for distribution
parameters, the Bayesian approach provides probability distributions. For example, for a
Normal random variable, the Bayesian analysis provides the probability distribution for
the mean and standard deviation. All other inference statements are made from these
distributions. The availability of these distributions is also advantageous because they can
be included in subsequent reliability analysis. In the following, let X denote a random
variable and let θ denote a generic parameter in the probability distribution for X. The key objective is to determine the probability distribution of θ given observations of X,
collected in the vector x. The following formula synthesizes the essence of the Bayesian
approach (Box and Tiao 1992; Carlin and Louis 2009):
f''(θ) = L(θ)/c ⋅ f'(θ)   (111)
where f''(θ) is the posterior PDF, c is a constant explained shortly, L(θ) is the likelihood function, and f'(θ) is the prior PDF. The constant, c, serves the purpose of normalizing the posterior, which implies the following definition:
c = ∫_{−∞}^{∞} L(θ) ⋅ f'(θ) dθ   (112)
To understand the workings of Eq. (111) it is helpful to first relate it to Bayes’ Rule for
events, which reads:
P(E_1 | E_2) = P(E_2 | E_1)/P(E_2) ⋅ P(E_1)   (113)
where E1 is the event for which the probability is sought and E2 is the event that has
occurred. It is noted that the probability of the occurred event, conditioned upon E1,
serves the role as likelihood, and that the unconditional probability of the observed event
serves as normalizing factor in the denominator. This pattern also emerges when Eq.
(111) is written in the following more complete form:
f''(θ | x) = L(x | θ)/c(x) ⋅ f'(θ)   (114)
where
c(x) = ∫_{−∞}^{∞} L(x | θ) ⋅ f'(θ) dθ   (115)
which is sometimes called the marginal density of the data. This formulation clarifies that
the posterior distribution for θ is directly linked to the observed data via the likelihood
function. The formulation of the likelihood function, as well as the prior distribution, is a
central topic in this document.
Discrete Problems
This document primarily addresses the case where θ and X are continuous random
variables. However, in passing, a few other cases are noted. First, if one or both of the
variables are discrete, then the PDFs are simply replaced by PMFs. In fact, if the
observed random variable is discrete then the normalization constant takes on a direct
meaning, as shown here for the case where both variables are discrete:
p(x | y) = p(y | x)/p(y) ⋅ p(x)   (116)
Bayes' theorem for updating the probability distribution of a continuous random variable, θ, given the occurrence of an event, E, is:
f(θ | E) = P(E | θ)/P(E) ⋅ f(θ)   (117)
Similarly, Bayes' theorem for updating the probability distribution of a discrete random variable, θ, given the occurrence of an event, E, is:
p(θ | E) = P(E | θ)/P(E) ⋅ p(θ)   (118)
Bayes' rule to update the probability of an event, A, given the outcome of a random variable, θ, is:
P(A | θ) = f(θ | A)/f(θ) ⋅ P(A)   (if θ is continuous)
P(A | θ) = p(θ | A)/p(θ) ⋅ P(A)   (if θ is discrete)   (119)
Prior Distribution
Eq. (111) shows that the prior distribution is one of two key ingredients in the Bayesian
approach, with the likelihood function being the other one. The prior distribution is
sometimes a point of contention, because it allows subjective information to enter the
calculations. On one hand, this is an advantage because it gives more flexibility to the
analyst; on the other hand it may seem problematic because the prior assumptions may
seem arbitrary. To expose the matter, the following subsections list the options that are
available as prior distributions.
Previous Posterior
When a probability distribution for q is already available, for example from earlier
applications of Eq. (111), then it is natural to employ it as prior. In circumstances where
this choice leads to an unusually complicated expression for the posterior, then the use of
a conjugate prior may be explored, as explained shortly.
Uniform and Non-informative Priors
When little or no prior information is available about θ it is desirable to select a prior that is uniform over the "range of interest." This is either the exact uniform distribution, or a distribution that is approximately uniform over the important range of θ-values. This is intended to express complete a priori uncertainty about θ.
Conjugate Prior
A prior is called conjugate if the distribution type of the posterior is the same as that of
the prior. The selection of a conjugate prior is convenient, because it often leads to simple
updating rules for the parameters of the distribution for θ.
Likelihood Function
The likelihood function is the crucial means by which the observed data affects the
posterior. Eq. (111) illustrates that the likelihood function is a function of θ, and Eq.
(114) clarifies that it takes as input the vector of observed realizations, x. To further
understand the meaning of the likelihood function, a strict comparison between Eq. (111)
and Bayes' Rule in Eq. (113) could lead to the impression that L(θ) is the "probability of x given θ." However, this is misleading, both because the probability of any realization x is zero, and because L(θ) does not have to be interpreted as a probability. Instead,
because of the normalizing constant, c, in Eq. (111), it is only necessary that the
likelihood function is proportional to the probability of observing x. In short, the likelihood function expresses how likely the observed data are under each candidate value of θ. Once the posterior distribution is established, its partial descriptors are obtained by expectation integrals; the posterior mean is:
µ_θ = ∫_{−∞}^{∞} θ ⋅ f''(θ) dθ   (120)
For problems with more than one model parameter, the mean vector is:
M_θ = ∫_{−∞}^{∞} ∫_{−∞}^{∞} θ ⋅ f''(θ) dθ   (122)
and the covariance matrix is:
Σ_θθ = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (θ − M_θ) ⋅ (θ − M_θ)^T ⋅ f''(θ) dθ   (123)
Predictive Distribution
While the posterior in Eq. (111) provides the probability distribution for the model
parameter(s), the so-called predictive distribution addresses the original random variable,
X. Specifically, the predictive distribution is a distribution for X that incorporates the
uncertainty in the model parameter(s), θ, by using the expectation integral in the
following way:
f(x) = ∫_{−∞}^{∞} f(x | θ) ⋅ f''(θ) dθ   (124)
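For a simple case, Eqs. (111), (112), and (124) can be evaluated on a grid over θ. The sketch below assumes X is normal with unknown mean θ and known standard deviation, a uniform prior, and a handful of made-up observations.

    import numpy as np
    from scipy.stats import norm

    sigma = 2.0                                        # assumed known standard deviation of X
    data = np.array([9.1, 10.4, 11.2, 9.8])            # assumed observations, x

    theta = np.linspace(0.0, 20.0, 2001)               # grid over the parameter
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)                        # uniform prior, f'(theta)
    like = np.prod(norm.pdf(data[:, None], loc=theta, scale=sigma), axis=0)   # L(theta)

    c = np.sum(like * prior) * dtheta                  # normalizing constant, Eq. (112)
    posterior = like * prior / c                       # posterior PDF, Eq. (111)

    x_new = np.array([8.0, 10.0, 12.0])
    predictive = np.sum(norm.pdf(x_new[:, None], loc=theta, scale=sigma) * posterior * dtheta, axis=1)   # Eq. (124)
    print(np.sum(posterior * dtheta), predictive)      # posterior integrates to one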
Computational Methods
The Bayesian approach is philosophically appealing, but it has been hindered by
computational challenges for all but simple problems, e.g., problems with conjugate
priors. This has changed with the advent of high computer power and new sampling
algorithms. It is the evaluation of two integrals that are of particular importance in
Bayesian analysis:
1. The integral to obtain the normalizing constant, c, in Eq. (112).
References
Ang, A. H.-S., and Tang, W. H. (2007). Probability concepts in engineering: emphasis on
applications in civil & environmental engineering. Wiley.
Box, G. E. P., and Tiao, G. C. (1992). Bayesian inference in statistical analysis. Wiley.
Carlin, B. P., and Louis, T. A. (2009). Bayesian methods for data analysis. CRC Press.