Statistics
E.g. (1) In the experiment of throwing a die, getting any of the numbers 1, 2, 3, 4, 5, 6 is an event.
Exhaustive Events:
The total number of possible outcomes in any trial is called the number of exhaustive events.
E.g.: (1) In tossing a coin there are two exhaustive events.
(2) In throwing n dice there are $6^n$ exhaustive events.
Favorable event:
The number of cases favorable to an event in a trial is the number of outcomes which entail the
happening of the event.
E.g. (1) In tossing a coin, there is one and only one case favorable to getting a head, and likewise
one and only one case favorable to getting a tail.
Mutually exclusive Events: Events are called mutually exclusive if no two or more of them can
happen simultaneously in the same trial.
E.g. In throwing a die, the events 1, 2, 3, ..., 6 are mutually exclusive (M.E.) events.
Equally likely Events: Outcomes are said to be equally likely if there is no reason for
one to be preferred over another. E.g. in throwing a fair die, the outcomes 1, 2, 3, 4, 5, 6 are equally likely.
Independent Events:
Several events are said to be independent if the happening or non-happening of any one of them is
not affected by the occurrence of any of the remaining events.
E.g. In tossing a coin and throwing a die, getting a head or a tail is independent of getting the
number 1, 2, 3, 4, 5 or 6.
If a trial results in n exhaustive, mutually exclusive and equally likely cases, and m of them are
favorable to the happening of an event E, then the probability of the event E is denoted by P(E)
and is defined as
$P(E) = \frac{m}{n}$
Sample Space:
The set of all possible outcomes of a random experiment is called the Sample Space. The elements
of this set are called sample points. The sample space is denoted by S.
E.g. (1) In the experiment of throwing two dice, the sample space S contains 36 sample points.
A sample space is called discrete if it contains only finitely many points, or infinitely many points
which can be arranged into a simple sequence w1, w2, …, while a sample space containing a
non-denumerable number of points is called a continuous sample space.
If a trial is repeated a number of times under essentially homogeneous and identical conditions, then
the limiting value of the ratio of the number of times the event happens to the total number of trials, as
the number of trials becomes indefinitely large, is called the probability of happening of the
event. (It is assumed that the limit is finite and unique.)
Symbolically, if in n trials an event E happens m times, then the probability p of the
happening of E is given by $p = P(E) = \lim_{n \to \infty} \frac{m}{n}$.
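The limiting-ratio definition can be illustrated by simulation. The following is a minimal sketch, assuming a fair six-sided die and the event "a 6 is thrown"; the helper names `relative_frequency`, `roll_die` and `is_six` are illustrative and not part of the notes.

```python
import random

def relative_frequency(event, trial, n, seed=0):
    """Return m/n, where m is the number of the n trials in which the event happens."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n) if event(trial(rng)))
    return hits / n

def roll_die(rng):
    """One trial: throw a fair six-sided die."""
    return rng.randint(1, 6)

def is_six(outcome):
    """The event E: the die shows 6."""
    return outcome == 6

# The ratio m/n should settle near 1/6 as n grows.
for n in (100, 1_000, 100_000):
    print(n, relative_frequency(is_six, roll_die, n))
```

For small n the ratio fluctuates noticeably; for large n it stays close to the classical value 1/6.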
Random Variables
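(ii) $\sum_x f(x) = 1$
The distribution function of a discrete r.v. is a step function:
$F(x) = f(x_1)$ for $x_1 \le x < x_2$
$F(x) = f(x_1) + f(x_2)$ for $x_2 \le x < x_3$
………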
For a continuous r.v. X, the function f(x) satisfying the following is known as the
probability density function(p.d.f.) or simply density function:
(i) $f(x) \ge 0$
(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
(iii) $P(a < X < b) = \int_a^b f(x)\,dx$ = area under f(x) between the ordinates x = a and x = b
$P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)$
i.e., in the continuous case it does not matter whether we include the end
points of the interval from a to b. This result is in general not true for a
discrete r.v.
Cumulative distribution: For a continuous r.v. X with p.d.f. f(x), the cumulative
distribution F(x) is defined as
$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
In case of a discrete r.v. the probability at a point, i.e., P(X = c), is not zero for some fixed c;
however, in case of a continuous random variable the probability at a point is always zero,
i.e., P(X = c) = 0 for all possible values of c.
Hence P(E) = 0 does not imply that the event E is a null or impossible event.
If X and Y are two discrete random variables, the joint probability function of X and Y is
given by P(X = x, Y = y) = f(x, y) and satisfies
(i) $f(x, y) \ge 0$ (ii) $\sum_x \sum_y f(x, y) = 1$
The joint probability function for X and Y can be represented by a joint probability
table.
Table
X Y y1 y2 …… yn Totals
f1(x) and f2(y) are called marginal probability functions of X and Y respectively.
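If X and Y are two continuous r.v.'s, the joint probability density function f(x, y) of X and Y
satisfies
(i) $f(x, y) \ge 0$ (ii) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$
$P(a < X < b,\ c < Y < d) = \int_{x=a}^{b}\int_{y=c}^{d} f(x, y)\,dy\,dx$
The joint distribution function is
$F(x, y) = P(X \le x,\ Y \le y) = \int_{u=-\infty}^{x}\int_{v=-\infty}^{y} f(u, v)\,dv\,du$
so that $\frac{\partial^2 F}{\partial x\,\partial y} = f(x, y)$.
The marginal distribution functions of X and Y are given by
$P(X \le x) = F_1(x) = \int_{u=-\infty}^{x}\int_{v=-\infty}^{\infty} f(u, v)\,dv\,du$ and
$P(Y \le y) = F_2(y) = \int_{u=-\infty}^{\infty}\int_{v=-\infty}^{y} f(u, v)\,dv\,du$
and the marginal density functions by
$f_1(x) = \int_{v=-\infty}^{\infty} f(x, v)\,dv$ and $f_2(y) = \int_{u=-\infty}^{\infty} f(u, y)\,du$
X and Y are independent if and only if
$f(x, y) = f_1(x) f_2(y) \ \forall\ x, y$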
If X and Y are two discrete r.v.'s with joint probability function f(x, y) then
$P(Y = y \mid X = x) = \frac{f(x, y)}{f_1(x)} = f(y \mid x)$
Similarly, $P(X = x \mid Y = y) = \frac{f(x, y)}{f_2(y)} = f(x \mid y)$
If X and Y are continuous r.v.'s with joint density function f(x, y) then
$\frac{f(x, y)}{f_1(x)} = f(y \mid x)$ and $\frac{f(x, y)}{f_2(y)} = f(x \mid y)$
$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ when X is continuous
$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$ for continuous X
If X and Y are two continuous r.v.'s with joint density function f(x, y), the conditional
expectation or the conditional mean of Y given X is $E(Y \mid X = x) = \int_{-\infty}^{\infty} y f(y \mid x)\,dy$
Similarly, the conditional mean of X given Y is $E(X \mid Y = y) = \int_{-\infty}^{\infty} x f(x \mid y)\,dx$
Median is the point which divides the entire distribution into two equal parts. In case of a
continuous distribution the median is the point which divides the total area into two equal
parts. Thus, if M is the median then $\int_{-\infty}^{M} f(x)\,dx = \int_{M}^{\infty} f(x)\,dx = 1/2$.
Thus, solving either of these equations for M gives the median.
Mode: Mode is the value of x at which f(x) or P(x_i) attains its maximum.
For a continuous r.v. X the mode is the solution of $f'(x) = 0$ and $f''(x) < 0$,
provided it lies in the given interval. The mode may or may not be unique.
Variance: Variance characterizes the variability in a distribution; distributions with the same mean can
still have different dispersion of data about their means.
$Var(X) = E[(X - \mu)^2] = \sum_x (x - \mu)^2 f(x)$ for discrete X
$Var(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$ for continuous X
where $\mu = E(X)$
If $X_1, X_2, \dots, X_n$ are independent r.v.'s then
$E\left(\prod_{i=1}^{n} X_i\right) = \prod_{i=1}^{n} E(X_i)$ if all the expectations exist.
A random variable X follows the binomial distribution if
$P(X = x) = \binom{n}{x} p^x q^{n-x}$ for x = 0, 1, 2, …, n, where q = 1 - p, and P(X = x) = 0 otherwise.
The notation X ~ B(n, p) means that the random variable X follows the binomial distribution with
parameters n and p.
If n trials constitute an experiment and the experiment is repeated N times, the frequency
function of the binomial distribution is given by f(x) = N P(x). The expected frequencies of
0, 1, 2, …, n successes are the successive terms of the binomial expansion $N(q + p)^n$.
Mode of the Binomial distribution: The mode of the B.D. depends upon the value of (n+1)p.
(i) If (n+1)p is not an integer then there exists a unique modal value for the binomial distribution
and it is m = integral part of (n+1)p.
(ii) If (n+1)p is an integer, say m, then the distribution is bimodal and the two modal values are
m and m-1.
In general, the sum of two independent binomial variates with different success probabilities is not a
binomial variate. In other words, the binomial distribution does not in general possess the additive
or reproductive property.
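For the B.D., $\gamma_1 = \sqrt{\beta_1} = \frac{1 - 2p}{\sqrt{npq}}$ and $\gamma_2 = \beta_2 - 3 = \frac{1 - 6pq}{npq}$
If X1 ~ B(n1, p) and X2 ~ B(n2, p) are independent then X1 + X2 ~ B(n1 + n2, p). Thus the B.D. possesses
the additive or reproductive property if p1 = p2.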
Poisson distribution
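Poisson Distribution is a limiting case of the Binomial distribution under the following
conditions:
(i) n, the number of trials, is indefinitely large, i.e., n → ∞.
(ii) p, the constant probability of success for each trial, is indefinitely small, i.e., p → 0.
(iii) np = λ is finite, where λ is a positive real number.
A r.v. X is said to follow the Poisson distribution if
$P(x, \lambda) = P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$ for x = 0, 1, 2, 3, …, with λ > 0, and 0 otherwise.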
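We shall use the notation X ~ P(λ) to denote that X is a Poisson variate with parameter λ.
The coefficients of skewness and kurtosis of the Poisson distribution are $\gamma_1 = \sqrt{\beta_1} = 1/\sqrt{\lambda}$ and
$\gamma_2 = \beta_2 - 3 = 1/\lambda$. Hence the Poisson distribution is always a skewed distribution. Proceeding to the limit
as λ tends to infinity we get $\beta_1 = 0$ and $\beta_2 = 3$.
Mode of the Poisson distribution:
(i) When λ is not an integer the distribution is unimodal and the integral part of λ is the
unique modal value.
(ii) When λ = k is an integer the distribution is bimodal and the two modes are k-1 and k.
Recurrence relation for the Poisson distribution: $P(x+1) = \frac{\lambda}{x+1}\,P(x)$
Recurrence relation for the binomial distribution: $P(x+1) = \frac{n-x}{x+1}\cdot\frac{p}{q}\,P(x)$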
Normal Distribution
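A random variable X is said to have a normal distribution with parameters μ, called the mean, and
σ², called the variance, if its density function is given by the probability law
$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$, $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$
A r.v. X with mean μ and variance σ² which follows the normal distribution is denoted by
X ~ N(μ, σ²)
If X ~ N(μ, σ²) then $Z = \frac{X - \mu}{\sigma}$ is a standard normal variate with E(Z) = 0 and var(Z) = 1, and
we write Z ~ N(0, 1)
The p.d.f. of the standard normal variate Z is given by $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$, $-\infty < z < \infty$
The distribution function is $F(z) = P(Z \le z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\,dt$
F(-z) = 1 – F(z)
$P(a < Z \le b) = P(a \le Z < b) = P(a < Z < b) = P(a \le Z \le b) = F(b) - F(a)$
If X ~ N(μ, σ²) then $Z = \frac{X - \mu}{\sigma}$ and $P(a \le X \le b) = F\left(\frac{b-\mu}{\sigma}\right) - F\left(\frac{a-\mu}{\sigma}\right)$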
N.D. is another limiting form of the B.D. under the following conditions:
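v) $\beta_1 = 0$ and $\beta_2 = 3$
vii) Since f(x), being a probability density, can never be negative, no portion of the curve lies below
the x-axis.
x) The points of inflexion of the curve are given by $x = \mu \pm \sigma$, $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-1/2}$
xi) Q.D. : M.D. : S.D. :: $\frac{2}{3}\sigma : \frac{4}{5}\sigma : \sigma$ :: $\frac{2}{3} : \frac{4}{5} : 1$, or Q.D. : M.D. : S.D. :: 10 : 12 : 15
xii) Area property: $P(\mu - \sigma < X < \mu + \sigma) = 0.6826 = P(-1 < Z < 1)$
If Z ~ N(0,1) then its moment generating function is $M_Z(t) = e^{t^2/2}$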
Continuity Correction:
The N.D. applies to continuous random variables. It is often used to approximate distributions
of discrete r.v.'s, provided that we make the continuity correction.
If we want to approximate the distribution of a discrete r.v. with a N.D., we must spread its values over a
continuous scale. We do this by representing each integer k by the interval from k-1/2 to k+1/2;
"at least k" is represented by the interval to the right of k-1/2, and "at most k" is represented by
the interval to the left of k+1/2.
If X ~ B(n, p) and $Z = \frac{X - np}{\sqrt{np(1-p)}}$ then Z ~ N(0,1) as n tends to infinity, with distribution function
$F(z) = P(Z \le z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\,dt$, $-\infty < z < \infty$
Use the normal approximation to the B.D. only when (i) np and n(1-p) are both greater than 15
(ii) n is small and p is close to ½
Poisson process: A Poisson process is a random process in which the number of events
(successes) x occurring in a time interval of length T is counted. It is a continuous parameter,
discrete state process. Dividing T into n equal parts of length Δt, we have T = n·Δt.
Assume that (i) the probability P of a success in an interval of length Δt is proportional to Δt,
i.e. P ∝ Δt or P = λΔt, (ii) the occurrences of events are independent, (iii)
the probability of more than one occurrence during a small time interval Δt is negligible.
As n → ∞, the probability of x successes during a time interval T follows the P.D. with
parameter np = λT, where λ is the average (mean) number of successes per unit time.
PROBLEMS:
Solution:
=1+2+2+3=8
2. Let X denotes the minimum of the two numbers that appear when a pair of fair dice is thrown
once. Determine (i) Discrete probability distribution (ii) Expectation (iii) Variance
Solution:
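The sample space S has 36 equally likely points (i, j), i, j = 1, 2, …, 6. The random variable X assigns
to each point the minimum of its two numbers; tabulating X over S (rows i, columns j) gives
1 1 1 1 1 1
1 2 2 2 2 2
1 2 3 3 3 3
1 2 3 4 4 4
1 2 3 4 5 5
1 2 3 4 5 6
(i) Counting the entries, P(X = 1) = 11/36, and similarly for the other values:
X     1      2     3     4     5     6
P(x)  11/36  9/36  7/36  5/36  3/36  1/36
(ii) Expectation (mean) = $\sum p_i x_i$
$E(X) = 1\cdot\frac{11}{36} + 2\cdot\frac{9}{36} + 3\cdot\frac{7}{36} + 4\cdot\frac{5}{36} + 5\cdot\frac{3}{36} + 6\cdot\frac{1}{36}$
$= \frac{1}{36}(11 + 18 + 21 + 20 + 15 + 6) = \frac{91}{36} = 2.5278$
(iii) Variance = $\sum p_i x_i^2 - \mu^2$
$= \frac{1}{36}(11\cdot 1 + 9\cdot 4 + 7\cdot 9 + 5\cdot 16 + 3\cdot 25 + 1\cdot 36) - \left(\frac{91}{36}\right)^2$
$= \frac{301}{36} - \frac{8281}{1296} = \frac{2555}{1296} \approx 1.9715$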
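The distribution, mean and variance of the minimum of two dice can be checked by brute-force enumeration. This is a small sketch using exact fractions; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes and record the minimum of each pair.
counts = Counter(min(i, j) for i, j in product(range(1, 7), repeat=2))

# Probability mass function of X = minimum of the two numbers.
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())
variance = sum(x * x * p for x, p in pmf.items()) - mean ** 2

for x, p in pmf.items():
    print(x, p)
print("E(X) =", mean, "Var(X) =", variance)
```

Working with `Fraction` avoids the rounding that creeps in when the mean 2.5278 is squared by hand.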
Solution:
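Since the total probability is 1,
$\int_{-\infty}^{0} 0\,dx + \int_{0}^{\infty} k x e^{-\lambda x}\,dx = 1$
i.e., $\int_{0}^{\infty} k x e^{-\lambda x}\,dx = 1$
$k\left[x\left(\frac{e^{-\lambda x}}{-\lambda}\right) - 1\cdot\left(\frac{e^{-\lambda x}}{\lambda^2}\right)\right]_0^{\infty} = 1$, or $k = \lambda^2$
Mean: $\mu = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{0} 0\,dx + \int_{0}^{\infty} k x^2 e^{-\lambda x}\,dx$
$= \lambda^2\left[x^2\left(\frac{e^{-\lambda x}}{-\lambda}\right) - 2x\left(\frac{e^{-\lambda x}}{\lambda^2}\right) + 2\left(\frac{e^{-\lambda x}}{-\lambda^3}\right)\right]_0^{\infty} = \frac{2}{\lambda}$
Variance: $\sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$
$= \lambda^2\left[x^3\left(\frac{e^{-\lambda x}}{-\lambda}\right) - 3x^2\left(\frac{e^{-\lambda x}}{\lambda^2}\right) + 6x\left(\frac{e^{-\lambda x}}{-\lambda^3}\right) - 6\left(\frac{e^{-\lambda x}}{\lambda^4}\right)\right]_0^{\infty} - \frac{4}{\lambda^2}$
$= \frac{6}{\lambda^2} - \frac{4}{\lambda^2} = \frac{2}{\lambda^2}$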
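4: Out of 800 families with 5 children each, how many would you expect to have (i) 3
boys (ii) 5 girls (iii) either 2 or 3 boys? Assume equal probabilities for boys and girls.
Solution: Let r be the number of boys in a family, so that n = 5 and p = q = 1/2.
(i) $P(3 \text{ boys}) = P(r = 3) = \binom{5}{3}\left(\frac{1}{2}\right)^5 = \frac{10}{32} = \frac{5}{16}$ per family
Thus for 800 families the expected number of families having 3 boys =
$800 \times \frac{5}{16} = 250$ families
(ii) $P(5 \text{ girls}) = P(\text{no boys}) = P(r = 0) = \binom{5}{0}\left(\frac{1}{2}\right)^5 = \frac{1}{32}$ per family
Thus for 800 families the expected number of families having 5 girls =
$800 \times \frac{1}{32} = 25$ families
(iii) $P(2 \text{ or } 3 \text{ boys}) = P(r = 2) + P(r = 3) = \frac{10}{32} + \frac{10}{32} = \frac{5}{8}$ per family
Thus for 800 families the expected number of families having either 2 or 3 boys =
$800 \times \frac{5}{8} = 500$ families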
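The expected family counts follow directly from the binomial probability function. A minimal sketch, assuming boys and girls are equally likely as the problem states; `binomial_pmf` is an illustrative helper name.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

families, n, p = 800, 5, 0.5  # r counts the boys among the 5 children

expected_3_boys = families * binomial_pmf(3, n, p)
expected_5_girls = families * binomial_pmf(0, n, p)  # 5 girls means 0 boys
expected_2_or_3_boys = families * (binomial_pmf(2, n, p) + binomial_pmf(3, n, p))

print(expected_3_boys, expected_5_girls, expected_2_or_3_boys)
```

Multiplying the per-family probability by the number of families gives an expected frequency, not a probability.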
5: Average number of accidents on any day on a national highway is 1.8. Determine the
probability that the number of accidents is (i) at least one (ii) at most one
Solution:
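Mean λ = 1.8
We have $P(X = x) = p(x) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{e^{-1.8}(1.8)^x}{x!}$
(i) P(at least one) = P(x ≥ 1) = 1 - P(x = 0)
= 1 - 0.1653
= 0.8347
(ii) P(at most one) = P(x ≤ 1) = P(x = 0) + P(x = 1)
= 0.1653 + 0.2975
= 0.4628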
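The two Poisson probabilities can be reproduced with a few lines of code. This is a sketch using only the standard library; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variate with parameter lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 1.8  # average number of accidents per day

p_at_least_one = 1 - poisson_pmf(0, lam)
p_at_most_one = poisson_pmf(0, lam) + poisson_pmf(1, lam)

print(round(p_at_least_one, 4), round(p_at_most_one, 4))
```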
6: The mean weight of 800 male students at a certain college is 140 kg and the standard deviation
is 10 kg. Assuming that the weights are normally distributed, find how many students weigh (i)
between 138 and 148 kg (ii) more than 152 kg.
Solution:
Let μ be the mean and σ be the standard deviation. Then μ = 140 kg and σ = 10 kg.
(i) When x = 138, $z = \frac{x - \mu}{\sigma} = \frac{138 - 140}{10} = -0.2 = z_1$
When x = 148, $z = \frac{x - \mu}{\sigma} = \frac{148 - 140}{10} = 0.8 = z_2$
P(138 ≤ x ≤ 148) = P(-0.2 ≤ z ≤ 0.8)
= A(z2) + A(z1), where A(z) is the area under the standard normal curve between 0 and |z|
= A(0.8) + A(0.2) = 0.2881 + 0.0793 = 0.3674
Hence the number of students whose weights are between 138 kg and 148 kg
= 0.3674 × 800 ≈ 294
(ii) When x = 152, $z = \frac{x - \mu}{\sigma} = \frac{152 - 140}{10} = 1.2 = z_1$
Therefore P(x > 152) = P(z > z1) = 0.5 - A(z1)
= 0.5 - 0.3849 = 0.1151
Therefore the number of students whose weights are more than 152 kg = 800 × 0.1151 ≈ 92.
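Instead of reading areas from a table, the standard normal distribution function can be evaluated with the error function. A minimal sketch, assuming the limits 138 kg and 148 kg used in the solution; `phi` is an illustrative helper name.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal distribution function F(z) = P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, students = 140, 10, 800

p_between = phi((148 - mu) / sigma) - phi((138 - mu) / sigma)
p_above = 1 - phi((152 - mu) / sigma)

print(round(p_between, 4), round(students * p_between))
print(round(p_above, 4), round(students * p_above))
```

The identity F(b) - F(a) replaces the table step A(z2) + A(z1) used above.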
Exercise Problems:
1. Two coins are tossed simultaneously. Let X denote the number of heads. Find i)
E(X) ii) E(X²) iii) E(X³) iv) V(X)
2. If $f(x) = k e^{-|x|}$ is a probability density function in the interval $-\infty < x < \infty$, then find i) k
ii) Mean iii) Variance iv) P(0 < x < 4)
3. Out of 20 tape recorders 5 are defective. Find the standard deviation of the number of defectives in
a sample of 10 randomly chosen tape recorders. Find (i) P(X=0) (ii) P(X=1) (iii) P(X=2)
(iv) P(1<X<4).
4. In 1000 sets of trials for an event of small probability, the frequencies f of the number
x of successes are
x  0    1    2    3   4   5  6  7  Total
f  305  365  210  80  28  9  2  1  1000
Fit the expected frequencies.
5. If X is a normal variate with mean 30 and standard deviation 5, find the probabilities
i) P(26 ≤ X ≤ 40) ii) P(X ≥ 45)
6. The marks obtained in Statistics in a certain examination are found to be normally
distributed. If 15% of the students scored greater than or equal to 60 marks and 40% scored less than 30
marks, find the mean and standard deviation.
7. If a Poisson distribution is such that $\frac{3}{2} P(X = 1) = P(X = 3)$, then find (i) $P(X \ge 1)$ (ii)
$P(X \le 3)$ (iii) $P(2 \le X \le 5)$.
8. A random variable X has the following probability function:
X     -2   -1   0    1    2    3
P(x)  0.1  K    0.2  2K   0.3  K
Then find (i) K (ii) mean (iii) variance (iv) P(0 < x < 3)
UNIT-II
MULTIPLE RANDOM VARIABLES
Joint Distributions: Two Random Variables
In real life, we are often interested in several random variables that are related to each other. For
example, suppose that we choose a random family, and we would like to study the number of
people in the family, the household income, the ages of the family members, etc. Each of these is
a random variable, and we suspect that they are dependent. In this chapter, we develop tools to
study joint distributions of random variables. The concepts are similar to what we have seen so
far. The only difference is that instead of one random variable, we consider two or more. In this
chapter, we will focus on two random variables, but once you understand the theory for two
random variables, the extension to n
random variables is straightforward. We will first discuss joint distributions of discrete random
variables and then extend the results to continuous random variables.
Marginal PMFs
The joint PMF contains all the information regarding the distributions of X and Y. This means
that, for example, we can obtain the PMF of X from its joint PMF with Y. Indeed, we can write
$P_X(x) = P(X = x) = \sum_{y_j \in R_Y} P_{XY}(x, y_j)$
Here, we call $P_X(x)$ the marginal PMF of X. Similarly, we can find the marginal PMF of Y as
$P_Y(y) = \sum_{x_i \in R_X} P_{XY}(x_i, y)$.
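Summing a joint PMF over the other variable is easy to express in code. The sketch below uses a small made-up joint PMF (the numbers are hypothetical, chosen only so that they sum to 1) and an illustrative helper `marginal`.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF P_XY(x, y) with X in {0, 1} and Y in {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 6), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 6), (1, 2): Fraction(1, 6),
}

def marginal(joint_pmf, axis):
    """Sum the joint PMF over the other variable (axis 0 gives X, axis 1 gives Y)."""
    result = defaultdict(Fraction)
    for pair, prob in joint_pmf.items():
        result[pair[axis]] += prob
    return dict(result)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_x)
print(p_y)
```

Each marginal sums to 1, which is a quick sanity check on any joint probability table.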
Computational formula for r(X, Y):
$r(X, Y) = \frac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sqrt{\left(\frac{1}{n}\sum x^2 - \bar{x}^2\right)\left(\frac{1}{n}\sum y^2 - \bar{y}^2\right)}}$
$-1 \le r \le 1$
If r = 0 then X, Y are uncorrelated.
If r = -1 then the correlation is perfect and negative.
If r = 1 then the correlation is perfect and positive.
r is independent of change of origin and scale.
Two independent variables are uncorrelated. The converse need not be true.
The correlation coefficient for a bivariate frequency distribution:
The bivariate data on X and Y are presented in a two-way correlation table with n
classes of Y placed along the horizontal lines and m classes of X along the vertical lines,
and $f_{ij}$ is the frequency of the individuals lying in the (i, j)th cell.
$\sum_x f(x, y) = g(y)$ is the sum of the frequencies along any row, $\sum_y f(x, y) = f(x)$ is the sum of the
frequencies along any column, and
$\sum_x \sum_y f(x, y) = \sum_y \sum_x f(x, y) = \sum_x f(x) = \sum_y g(y) = N$
$\bar{x} = \frac{1}{N}\sum_x x f(x)$, $\bar{y} = \frac{1}{N}\sum_y y g(y)$
$\sigma_X^2 = \frac{1}{N}\sum_x x^2 f(x) - \bar{x}^2$ and $\sigma_Y^2 = \frac{1}{N}\sum_y y^2 g(y) - \bar{y}^2$
$Cov(X, Y) = \frac{1}{N}\sum_x \sum_y xy f(x, y) - \bar{x}\,\bar{y}$
$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
Rank Correlation: Let $(x_i, y_i)$, i = 1, 2, …, n, be the ranks of the ith individual in the
characteristics A and B respectively. The Pearsonian coefficient of correlation between $x_i$
and $y_i$ is called the rank correlation coefficient between A and B for that group of
individuals.
Spearman's rank correlation coefficient between two variables X and Y which take the rank
values 1, 2, …, n is denoted by ρ and is defined as
$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i = x_i - y_i$
In case common ranks are given to repeated items, the common rank is the average of
the ranks which these items would have assumed if they were slightly different from each
other, and the next item gets the rank next to the ranks already assumed. An adjustment or
correction is then made in the rank correlation formula: we add the factor $\frac{m(m^2 - 1)}{12}$ to
$\sum d^2$, where m is the number of times an item is repeated. This correction factor is to be added
for each repeated value in both the X-series and the Y-series.
$-1 \le \rho \le 1$
Regression analysis is a mathematical measure of the average relationship between
two or more variables in terms of the original units of the data.
The variable whose value is influenced or is to be predicted is called dependent
variable and the variable, which influences the values or is used for the prediction is
called independent variable. Independent variable is also known as regressor or
predictor or explanatory variable while the dependent variable is also known as
regressed or explained variable.
If the variables in bivariate distributions are related we will find that the points in the
scatter diagram will cluster round some curve called the “ curve of regression”. If the
curve is a straight line, it is called line of regression and
there is said to be linear Regression between the variables, otherwise the regression
is said to be curvilinear. The line of regression is line of best fit and is obtained by
principle of least squares.
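The line of regression of Y on X is Y = a + bX
i.e., $Y - \bar{y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X - \bar{x})$
Similarly the line of regression of X on Y is X = a + bY
i.e., $X - \bar{x} = r\,\frac{\sigma_X}{\sigma_Y}\,(Y - \bar{y})$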
If X and Y are any random variables the two regression lines are
$Y - E(Y) = \frac{Cov(X, Y)}{\sigma_X^2}\,[X - E(X)]$
$X - E(X) = \frac{Cov(X, Y)}{\sigma_Y^2}\,[Y - E(Y)]$
Both lines of regression pass through the point $(\bar{x}, \bar{y})$, i.e., the mean values $\bar{x}, \bar{y}$
can be obtained as the point of intersection of the regression lines.
The slope of the regression line of Y on X is also called the regression coefficient of Y on X. It
represents the increment in the value of the dependent variable Y corresponding to a unit
change in the value of the independent variable X. We write $b_{YX}$ = regression
coefficient of Y on X = $r\,\frac{\sigma_Y}{\sigma_X}$
If r = 0, the two lines of regression become $Y = \bar{y}$
and $X = \bar{x}$, which are perpendicular to each other and are parallel to the x-axis and y-axis
respectively.
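The regression plane of X1 on X2 and X3 is $X_1 = a + bX_2 + cX_3$, where a, b, c are obtained from
the normal equations
$\sum X_1 = an + b\sum X_2 + c\sum X_3$
$\sum X_1 X_2 = a\sum X_2 + b\sum X_2^2 + c\sum X_2 X_3$
$\sum X_1 X_3 = a\sum X_3 + b\sum X_2 X_3 + c\sum X_3^2$
Similarly for the regression plane of X2 on X1 and X3 and the regression plane of X3
on X1 and X2.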
PROBLEMS:
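1. Let x and y be two random variables with joint probability density function
$f(x, y) = e^{-y}$ for $0 < x < y$, and $f(x, y) = 0$ otherwise.
Find the marginal probability density functions of x and y.
Solution: Given that $f(x, y) = e^{-y}$ for $0 < x < y$, and 0 otherwise.
$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_{x}^{\infty} e^{-y}\,dy = \left[-e^{-y}\right]_x^{\infty} = e^{-x}$, x > 0
$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \int_{0}^{y} e^{-y}\,dx = \left[x e^{-y}\right]_0^{y} = y e^{-y}$, y > 0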
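Solution: Given $f(x, y) = b e^{-(x+y)}$ for $0 < x < a$, $0 < y < \infty$, and $f(x, y) = 0$ otherwise.
Since the total probability is 1,
$\int\int f(x, y)\,dx\,dy = 1$
$\int_{y=0}^{\infty}\int_{x=0}^{a} b e^{-(x+y)}\,dx\,dy = 1$
$b\int_{y=0}^{\infty} e^{-y}\left[-e^{-x}\right]_0^{a}\,dy = 1$
$b(1 - e^{-a})\int_{y=0}^{\infty} e^{-y}\,dy = 1$
$b(1 - e^{-a})\left[-e^{-y}\right]_0^{\infty} = 1$
$b(1 - e^{-a}) = 1$
$b = \frac{1}{1 - e^{-a}}$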
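x 12 9 8 10 11 13 7
y 14 8 6 9 11 12 13
$\bar{x} = \frac{\sum x_i}{n} = \frac{12 + 9 + 8 + 10 + 11 + 13 + 7}{7} = 10$
$\bar{y} = \frac{\sum y_i}{n} = \frac{14 + 8 + 6 + 9 + 11 + 12 + 13}{7} = 10.4$
x    y    X = x - x̄   Y = y - ȳ   XY     X²   Y²
12   14    2           3.6         7.2    4    12.96
9    8    -1          -2.4         2.4    1    5.76
8    6    -2          -4.4         8.8    4    19.36
10   9     0          -1.4         0      0    1.96
11   11    1           0.6         0.6    1    0.36
13   12    3           1.6         4.8    9    2.56
7    13   -3           2.6        -7.8    9    6.76
$\sum XY = 16$, $\sum X^2 = 28$, $\sum Y^2 = 49.72$
Correlation coefficient $r = \frac{\sum XY}{\sqrt{\sum X^2 \cdot \sum Y^2}} = \frac{16}{\sqrt{28 \times 49.72}}$
r = 0.43
r is positive.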
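The deviation-based formula for r translates directly into code. A minimal sketch with the data of this problem; `pearson_r` is an illustrative helper name, and the exact mean of y (73/7) is used rather than the rounded 10.4.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [12, 9, 8, 10, 11, 13, 7]
y = [14, 8, 6, 9, 11, 12, 13]
r = pearson_r(x, y)
print(round(r, 2))
```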
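Rank in Mathematics (X)   Rank in Statistics (Y)   D = X - Y   D²
1                         1                         0          0
2                         10                       -8          64
3                         3                         0          0
4                         4                         0          0
5                         5                         0          0
6                         7                        -1          1
7                         2                         5          25
8                         6                         2          4
9                         8                         1          1
10                        11                       -1          1
11                        15                       -4          16
12                        9                         3          9
13                        14                       -1          1
14                        12                        2          4
15                        16                       -1          1
16                        13                        3          9
$\sum D^2 = 136$
$\rho = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 136}{16 \times 255} = 1 - 0.2 = 0.8$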
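The rank correlation above can be recomputed from the two rank lists. This sketch covers only the case without tied ranks, so the correction factor for repeated items is not applied; `spearman_rho` is an illustrative helper name.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

maths = list(range(1, 17))
stats = [1, 10, 3, 4, 5, 7, 2, 6, 8, 11, 15, 9, 14, 12, 16, 13]

sum_d2 = sum((a - b) ** 2 for a, b in zip(maths, stats))
rho = spearman_rho(maths, stats)
print(sum_d2, rho)
```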
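5. Determine the regression equation which best fits the following data:
x 10 12 13 16 17 20 25
y 10 22 24 27 29 33 37
Solution: Let the line of best fit be y = a + bx. The normal equations are
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
x      y      x²     xy
10     10     100    100
12     22     144    264
13     24     169    312
16     27     256    432
17     29     289    493
20     33     400    660
25     37     625    925
Total  113    182    1983   3186   (Σx, Σy, Σx², Σxy)
$\sum y = na + b\sum x \Rightarrow 182 = 7a + 113b$ (1)
$\sum xy = a\sum x + b\sum x^2 \Rightarrow 3186 = 113a + 1983b$ (2)
Solving (1) and (2), b = 1736/1112 ≈ 1.56 and a ≈ 0.80, so the regression equation is y = 0.80 + 1.56x.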
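The two normal equations can be solved in closed form for a and b. A minimal sketch with the data of this problem; `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x obtained by solving the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

x = [10, 12, 13, 16, 17, 20, 25]
y = [10, 22, 24, 27, 29, 33, 37]
a, b = fit_line(x, y)
print(round(a, 4), round(b, 4))
```

The fitted line passes through the point of means (x̄, ȳ), as stated for regression lines in general.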
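X1 3 5 6 8 12 14
X2 16 10 7 4 3 2
X3 90 72 54 42 30 12
Solution: Here n = 6, $\bar{X}_1 = \frac{48}{6} = 8$, $\bar{X}_2 = \frac{42}{6} = 7$, $\bar{X}_3 = \frac{300}{6} = 50$
Let $x_1 = X_1 - \bar{X}_1$, $x_2 = X_2 - \bar{X}_2$, $x_3 = X_3 - \bar{X}_3$
S.No   X1   x1   x1²   X2   x2   x2²   X3    x3    x3²    x1x2   x2x3   x3x1
1      3    -5   25    16   9    81    90    40    1600   -45    360    -200
2      5    -3   9     10   3    9     72    22    484    -9     66     -66
3      6    -2   4     7    0    0     54    4     16     0      0      -8
4      8    0    0     4    -3   9     42    -8    64     0      24     0
5      12   4    16    3    -4   16    30    -20   400    -16    80     -80
6      14   6    36    2    -5   25    12    -38   1444   -30    190    -228
Total  48   0    90    42   0    140   300   0     4008   -100   720    -582
$r_{12} = \frac{\sum x_1 x_2}{\sqrt{\sum x_1^2 \sum x_2^2}} = \frac{-100}{\sqrt{90 \times 140}} = -0.89$
$r_{13} = \frac{\sum x_1 x_3}{\sqrt{\sum x_1^2 \sum x_3^2}} = \frac{-582}{\sqrt{90 \times 4008}} = -0.97$
$r_{23} = \frac{\sum x_2 x_3}{\sqrt{\sum x_2^2 \sum x_3^2}} = \frac{720}{\sqrt{140 \times 4008}} = 0.96$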
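Multiple correlation coefficient: $R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}} \approx 0.99$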
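The three simple correlations and the multiple correlation coefficient can be recomputed from the raw data. This is a sketch that assumes the standard formula for R of X3 on X1 and X2 in terms of r12, r13 and r23; `corr` is an illustrative helper name.

```python
from math import sqrt

def corr(u, v):
    """Simple correlation coefficient computed from deviations about the means."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

x1 = [3, 5, 6, 8, 12, 14]
x2 = [16, 10, 7, 4, 3, 2]
x3 = [90, 72, 54, 42, 30, 12]

r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)

# Multiple correlation coefficient of X3 on X1 and X2.
R3_12 = sqrt((r13**2 + r23**2 - 2 * r12 * r13 * r23) / (1 - r12**2))

print(round(r12, 2), round(r13, 2), round(r23, 2), round(R3_12, 4))
```

Carrying the unrounded correlations into the last step avoids the loss of accuracy that comes from squaring two-decimal values.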
Exercise Problems:
1. Calculate Karl Pearson's coefficient of correlation from the following data
x 15 18 20 24 30 35 40 50
y 85 93 95 105 120 130 150 160
2. A sample of 12 fathers and their elder sons gave the following data.
Calculate the coefficient of rank correlation.
Fathers 65 63 67 64 68 62 70 66 68 67 69 71
Sons    68 66 68 65 69 66 68 65 71 67 68 70
3. Find the most likely production corresponding to a rainfall 40 from the following data:
Determine A.
6. Determine the regression equation which best fit to the following data:
x 10 12 13 16 17 20 25
y 10 22 24 27 29 33 37
UNIT-III
SAMPLING DISTRIBUTION AND
TESTING OF HYPOTHESIS
Sampling Distribution
Population is the set or collection or totality of the objects, animate or inanimate, actual
or hypothetical, under study. Thus a population mainly consists of a set of numbers,
measurements or observations which are of interest.
Size of the population N is the number of objects or observations in the population.
A population may be finite or infinite.
A finite subset of the population is known as a Sample. The size of the sample is denoted by
n.
Sampling is the process of drawing samples from a given population.
If n ≥ 30 the sampling is said to be large sampling.
If n < 30 the sampling is said to be small sampling.
Statistical inference deals with the methods of arriving at valid generalizations and
predictions about the population using the information contained in the sample.
Parameters: Statistical measures or constants obtained from the population are known as
population parameters or simply parameters.
Population f(x) is a population whose probability distribution is f(x). If f(x) is binomial,
Poisson or normal then the corresponding population is known as a Binomial population,
Poisson population or Normal population.
Samples must be representative of the population, so sampling should be random.
Random Sampling is one in which each member of the population has an equal chance or
probability of being included in the sample.
Sampling where each member of a population may be chosen more than once is called
sampling with replacement. A finite population which is sampled with replacement
can theoretically be considered infinite, since samples of any size can be drawn without
exhausting the population. For most practical purposes, sampling from a finite population
which is very large can be considered as sampling from an infinite population.
If each member cannot be chosen more than once it is called sampling without
replacement.
Any quantity obtained from a sample for the purpose of estimating a population
parameter is called a sample statistic or briefly a statistic. Mathematically, a sample
statistic for a sample of size n can be defined as a function of the random variables X1,
X2, …, Xn, i.e., g(X1, X2, …, Xn). The function g(X1, X2, …, Xn) is another random
variable whose values can be represented by g(x1, x2, …, xn). The word statistic is often
used for the r.v. or for its values.
Random sample (Finite population): A set of observations X1, X2, …, Xn constitutes a
random sample of size n from a finite population of size N if its values are chosen so that
each subset of n of the N elements of the population has the same probability of being
selected.
Random sample (Infinite population): A set of observations X1, X2, …, Xn constitutes a
random sample of size n from an infinite population f(x) if:
(i) Each Xi is a r.v. whose distribution is given by f(x)
(ii) These n r.v.'s are independent
Sample Mean: If X1, X2, …, Xn is a random sample of size n, the sample mean is a r.v.
defined by $\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$
Sample Variance: If X1, X2, …, Xn is a random sample of size n, the sample variance is a r.v.
defined by $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$ and is a measure of variability of the data about the mean.
Sample Standard deviation is the positive square root of the sample variance.
Degrees of freedom (d.o.f.) of a statistic is the positive integer denoted by ν, equal to n - k,
where n is the number of independent observations of the random sample and k is the
number of population parameters which are calculated using sample data. Thus d.o.f. ν =
n – k is the difference between n, the sample size, and k, the number of independent
constraints imposed on the observations in the sample.
Sampling Distributions: The probability distribution of a sample statistic is often called
the sampling distribution of the statistic.
The standard deviation of the sampling distribution of a statistic is called the Standard
Error (S.E.).
The mean of the sampling distribution of means, denoted by $\mu_{\bar{x}}$, is given by $E(\bar{X}) = \mu_{\bar{x}} = \mu$,
where μ is the mean of the population.
If a population is infinite or if sampling is with replacement, then the variance of the
sampling distribution of means, denoted by $\sigma_{\bar{x}}^2$, is given by $E[(\bar{X} - \mu)^2] = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}$,
where σ² is the variance of the population.
If the population is of size N, if sampling is without replacement, and if the sample size is
n ≤ N, then $\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}\cdot\frac{N - n}{N - 1}$
The factor $\frac{N - n}{N - 1}$ is called the finite population correction factor; it is close to 1 (and can
be omitted for most practical purposes) unless the sample constitutes a substantial
portion of the population.
(Central limit theorem) If $\bar{X}$ is the mean of a sample of size n taken from a population
having the mean μ and the finite variance σ², then $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a r.v. whose
distribution function approaches that of the standard normal distribution as n → ∞.
Estimation
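The most widely used values for 1 - α are 0.95 and 0.99, and the corresponding values
of $Z_{\alpha/2}$ are $Z_{0.025} = 1.96$ and $Z_{0.005} = 2.575$
Sample size $n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2$
Confidence interval for μ (for large samples, n ≥ 30), σ known:
$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
If the sampling is without replacement from a population of finite size N then the
confidence interval for μ with σ known is
$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}} < \mu < \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}}$
The end points of the confidence interval are called Confidence Limits.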
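In Bayesian estimation, prior feelings about the possible values of μ are combined with
the direct sample evidence, which gives the posterior distribution of μ, approximately
normally distributed with
mean $\mu_1 = \frac{n\bar{x}\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}$ and standard deviation $\sigma_1 = \sqrt{\frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2}}$,
where $\mu_0$ and $\sigma_0$ are the mean and standard deviation of the prior distribution. In the computation
of $\mu_1$ and $\sigma_1$, σ² is assumed to be known. When σ² is unknown, which is generally the
case, σ² is replaced by the sample variance s² provided n ≥ 30 (large sample).
Bayesian interval for μ: $\mu_1 - Z_{\alpha/2}\,\sigma_1 < \mu < \mu_1 + Z_{\alpha/2}\,\sigma_1$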
Statistical decisions are decisions or conclusions about the population parameters on the
basis of a random sample from the population.
Statistical hypothesis is an assumption or conjecture or guess about the parameters of the
population distribution
Null Hypothesis (N.H.), denoted by H0, is the statistical hypothesis which is to be actually
tested for acceptance or rejection. The N.H. is the hypothesis which is tested for possible
rejection under the assumption that it is true.
Any hypothesis which is complementary to the N.H. is called an Alternative Hypothesis,
denoted by H1.
Simple Hypothesis is a statistical hypothesis which completely specifies an exact
parameter. The N.H. is always a simple hypothesis stated as an equality specifying an exact value
of the parameter. E.g. N.H.: H0: μ = μ0; N.H.: H0: μ1 - μ2 = δ
Composite Hypothesis is stated in terms of several possible values.
Alternative Hypothesis (A.H.) is a composite hypothesis involving statements expressed as
inequalities such as <, > or ≠
i) A.H.: H1: μ > μ0 (right tailed) ii) A.H.: H1: μ < μ0 (left tailed) iii) A.H.: H1: μ ≠ μ0 (two tailed)
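Errors in sampling
Type I error: Reject H0 when it is true
Type II error: Accept H0 when it is false
              Accept H0          Reject H0
H0 is true    Correct decision   Type I error
H0 is false   Type II error      Correct decision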
(Z) 1% 5% 10%
If, when the size of the sample is increased, the probabilities α and β of committing errors of
type I and type II are both small, the test procedure is a good one, giving a good chance of
making the correct decision.
P-value is the lowest level ( of significance) at which observed value of the test statistic is
significant.
A test of Hypothesis (T. O.H) consists of
1. Null Hypothesis (NH) : H0
2. Alternative Hypothesis (AH) : H1
3. Level of significance: α
4. Critical region predetermined by α
5. Calculation of test statistic based on the sample data.
6. Decision to reject NH or to accept it.
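PROBLEMS: 1. A population consists of five numbers 2, 3, 6, 8 and 11. Consider all possible
samples of size two which can be drawn with replacement from this population. Find the
variance and standard deviation of the population and of the sampling distribution of means.
N = 5, $\mu = \frac{2 + 3 + 6 + 8 + 11}{5} = 6$
$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{16 + 9 + 0 + 4 + 25}{5} = 10.8$
σ = 3.29
Sampling with replacement (infinite population):
The total number of samples with replacement is $N^n = 5^2 = 25$
$\sigma_{\bar{x}}^2 = \frac{(2 - 6)^2 + (2.5 - 6)^2 + \dots + (11 - 6)^2}{25} = 5.40$
$\sigma_{\bar{x}} = 2.32$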
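2. A population consists of six numbers 4, 8, 12, 16, 20, 24. Consider all possible samples of
size two which can be drawn without replacement from this population. Find the
variance and standard deviation of the population and of the sampling distribution of means.
N = 6, $\mu = \frac{4 + 8 + 12 + 16 + 20 + 24}{6} = 14$
$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{100 + 36 + 4 + 4 + 36 + 100}{6} = 46.67$
σ = 6.83
Sampling without replacement (finite population):
The total number of samples without replacement is $^{N}C_n = {}^{6}C_2 = 15$
$\sigma_{\bar{x}}^2 = \frac{(6 - 14)^2 + (8 - 14)^2 + \dots + (22 - 14)^2}{15} = 18.67$
$\sigma_{\bar{x}} = 4.32$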
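Both sampling problems can be verified by listing every sample and computing the variance of the sample means. A minimal sketch using exact fractions; `pop_variance` and `variance_of_sample_means` are illustrative helper names.

```python
from fractions import Fraction
from itertools import combinations, product

def pop_variance(values):
    """Population variance (divisor N), computed exactly."""
    vals = [Fraction(v) for v in values]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def variance_of_sample_means(population, n, replacement):
    """Variance of the sampling distribution of means, by enumerating all samples."""
    if replacement:
        samples = product(population, repeat=n)   # N**n ordered samples
    else:
        samples = combinations(population, n)     # NCn unordered samples
    means = [Fraction(sum(s), n) for s in samples]
    return pop_variance(means)

pop1 = [2, 3, 6, 8, 11]
pop2 = [4, 8, 12, 16, 20, 24]

v1 = variance_of_sample_means(pop1, 2, replacement=True)
v2 = variance_of_sample_means(pop2, 2, replacement=False)
print(float(pop_variance(pop1)), float(v1))
print(float(pop_variance(pop2)), float(v2))
```

The enumerated values agree with σ²/n for sampling with replacement and with (σ²/n)(N - n)/(N - 1) for sampling without replacement.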
3. The mean of certain normal population is equal to the standard error of the mean of the
samples of 64 from that distribution. Find the probability that the mean of the sample size
36 will be negative.
Now $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{157 - 155}{15/\sqrt{36}} = 0.8$
$P(\bar{x} \le 157) = P(Z \le 0.8)$
$= 0.5 + P(0 \le Z \le 0.8)$
= 0.5 + 0.2881
$P(\bar{x} \le 157) = 0.7881$
mean of Rs. 472.36 and a standard deviation of Rs. 62.35. If $\bar{x}$ is used as a point estimate
of the true average repair cost, with what confidence can we assert that the maximum
error does not exceed Rs. 10?
Maximum error E = 10, σ = 62.35, n = 80
$E = Z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}$
$Z_{\alpha/2} = \frac{E\sqrt{n}}{\sigma} = \frac{10\sqrt{80}}{62.35} = 1.4345$
From the normal table, $\frac{1 - \alpha}{2} = 0.4236$, so $1 - \alpha = 0.8472$
Confidence = (1 - α)100% = 84.72%
Hence we are 84.72% confident that the maximum error does not exceed Rs. 10.
5. Determine a 95% confidence interval for the mean of normal distribution with variance
0.25, using a sample of size 100 values with mean 212.3.
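Confidence interval = $\left(\bar{x} - Z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}},\ \bar{x} + Z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\right)$
= $\left(212.3 - 1.96\cdot\frac{0.5}{\sqrt{100}},\ 212.3 + 1.96\cdot\frac{0.5}{\sqrt{100}}\right)$
= (212.202, 212.398)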
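The large-sample confidence interval for a mean with known σ is a one-line computation. A minimal sketch with the figures of this problem; `mean_confidence_interval` is an illustrative helper name and z = 1.96 is the 95% value quoted earlier.

```python
from math import sqrt

def mean_confidence_interval(xbar, sigma, n, z):
    """Return (lower, upper) = xbar -/+ z * sigma / sqrt(n)."""
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Variance 0.25 gives sigma = 0.5; the sample has 100 values with mean 212.3.
low, high = mean_confidence_interval(xbar=212.3, sigma=sqrt(0.25), n=100, z=1.96)
print(round(low, 3), round(high, 3))
```

The half-width z·σ/√n is exactly the maximum error E used in the preceding problem.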
Exercise Problems:
1. Samples of size 2 are drawn without replacement from the population 1, 2, 3, 4, 5, 6.
Find
2. If a 1-gallon can of paint covers on average 513 square feet with a standard deviation of
31.5 square feet, what is the probability that the mean area covered by a sample of 40 of these
1-gallon cans will be anywhere from 510 to 520 square feet?
3. What is the size of the smallest sample required to estimate an unknown proportion to within a
maximum error of 0.06 with at least 95% confidence?
4. A random sample of 400 items is found to have mean 82 and standard deviation 18. Find
the maximum error of estimation at the 95% confidence level. Find the confidence limits for the
mean if $\bar{x}$ = 82.
5. A sample of size 300 was taken whose variance is 225 and mean is 54. Construct a 95%
confidence interval for the mean.
UNIT - IV
LARGE SAMPLE TESTS
Test statistic for T.O.H. in several cases are
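1. Statistic for test concerning mean, σ known:
$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
3. Statistic for large samples concerning the difference between two means (σ1 and σ2 known):
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \delta}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ under the N.H. H0: μ1 - μ2 = δ against H1: μ1 - μ2 > δ or
μ1 - μ2 < δ or H1: μ1 - μ2 ≠ δ
4. Statistic for large samples concerning the difference between two means (σ1 and σ2 are
unknown)
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - \delta}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$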
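$Z = \frac{X - np_0}{\sqrt{np_0(1 - p_0)}}$ under the N.H. H0: p = p0 against H1: p ≠ p0 or p > p0 or p < p0
$Z = \frac{\frac{X_1}{n_1} - \frac{X_2}{n_2}}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ with $\hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$
under the N.H. H0: p1 = p2 against the A.H. H1: p1 < p2 or p1 > p2 or p1 ≠ p2
Maximum error of estimate $E = Z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$; with the observed value x/n substituted for p
we obtain an estimate of E
Sample size $n = p(1 - p)\left(\frac{Z_{\alpha/2}}{E}\right)^2$ when p is known
$n = \frac{1}{4}\left(\frac{Z_{\alpha/2}}{E}\right)^2$ when p is unknown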
One sided confidence interval is of the form p < (1/2n)2 with (2n+1) degrees of
freedom.
Problems:
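4. Test statistic: $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
$Z = \frac{40 - 38}{10/\sqrt{400}} = 4$
|Z| = 4
5. Conclusion:
|Z| > $Z_\alpha$, so the null hypothesis is rejected.
95% confidence limits: $\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$ = (39.02, 40.98)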
2. Samples of students were drawn from two universities, and from their weights in
kilograms the mean and S.D. were calculated as shown below. Make a large sample test of the
significance of the difference between the means.
1. Null hypothesis (H0): μ1 = μ2
2. Alternative hypothesis (H1): μ1 ≠ μ2
4. Test statistic: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} = \frac{55 - 57}{\sqrt{\frac{100}{400} + \frac{225}{100}}} = -1.26$
|Z| = 1.26
5. Conclusion:
|Z| < $Z_\alpha$, so the null hypothesis is accepted.
$P = \frac{1}{2} = 0.5$, Q = 0.5
4. Test statistic: $Z = \frac{p - P}{\sqrt{\frac{PQ}{n}}}$
$Z = \frac{0.54 - 0.5}{\sqrt{\frac{0.5 \times 0.5}{1000}}} = 2.53$
|Z| = 2.53
5. Conclusion:
|Z| < $Z_\alpha$
4. A random sample of 400 men and 600 women were asked whether they would like to
have a flyover near their residence. 200 men and 325 women were in favour of the
proposal. Test the hypothesis that the proportions of men and women in favour of the
proposal are the same at the 5% level.
Solution: Given n1 = 400, n2 = 600, x1 = 200 and x2 = 325
$p_1 = \frac{200}{400} = 0.5$
$p_2 = \frac{325}{600} = 0.5417$
$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{200 + 325}{400 + 600} = 0.525$
q = 1 - p = 1 - 0.525 = 0.475
1. Null hypothesis (H0): p1 = p2
2. Alternative hypothesis (H1): p1 ≠ p2
4. Test statistic: $Z = \frac{p_1 - p_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.5 - 0.5417}{\sqrt{0.525 \times 0.475\left(\frac{1}{400} + \frac{1}{600}\right)}} = -1.29$
|Z| = 1.29
5. Conclusion:
|Z| < $Z_\alpha$ = 1.96, so the null hypothesis is accepted at the 5% level.
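The test for equality of two proportions can be scripted in a few lines. A minimal sketch with the survey figures of this problem; `two_proportion_z` is an illustrative helper name and 1.96 is the two-tailed 5% critical value.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2, using the pooled estimate of the common proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # standard error under H0
    return (p1 - p2) / se

z = two_proportion_z(200, 400, 325, 600)
print(round(z, 2), "accept H0" if abs(z) < 1.96 else "reject H0")
```

Using the unrounded value 325/600 for p2 is what moves the statistic from about -1.27 to -1.29.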
Exercise Problems:
1. An ambulance service claims that it takes on average 8.9 minutes to reach its destination in
emergency calls. To check this claim, the agency which issues licenses to ambulance services
timed fifty emergency calls, getting a mean of 9.2 minutes with a standard deviation of 1.6 minutes. What
can they conclude at the 5% level of significance?
2. According to norms established for a mechanical aptitude test, persons who are 18 years old have
an average weight of 73.2 with S.D. 8.6. If 40 randomly selected persons have an average of 76.7, test
the hypothesis H0: μ = 73.2 against the alternative hypothesis H1: μ > 73.2.
3. A cigarette manufacturing firm claims that its brand A line of cigarettes outsells its brand B by
8%. If it is found that 42 out of a sample of 200 smokers prefer brand A and 18 out of another
sample of 100 smokers prefer brand B, test whether the 8% difference is a valid claim.