Class 03 04 Confidence Interval, Hypothesis Testing


Sampling Distributions,

Confidence interval,
Hypothesis testing
Sampling Distribution of the means
• Central Limit Theorem: if $\bar{X}$ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then the limiting form of the distribution of

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

is a standard normal distribution N(0,1) as n → ∞.

10/16/2022 @TKMISHRA ML NITRKL 2


Sampling Distribution of S²: χ²

Theorem 2:



Sampling Distribution of S²
Example:



Sampling Distribution of S²: t
Theorem 3:



Sampling Distribution of S²: F
Theorem 4:



Confidence Intervals



Interval Estimate and Confidence Level

• An interval estimate of a population parameter, such as the mean or standard deviation, is an interval or range of values within which the true parameter value is likely to lie with a certain probability.

• The confidence level, usually written as (1 − α)100%, of an interval estimate of a population parameter is the probability that the interval estimate will contain the population parameter. When α = 0.05, the confidence level is 95% and 0.95 is the probability that the interval estimate will contain the population parameter.



Significance and Confidence Level

• The value of α is called the significance level.

• 95% confidence implies that in about 19 out of 20 cases, the interval estimate will contain the true population mean.

• A confidence interval is the interval estimate of the population parameter, computed from a sample using a specified confidence level.



Confidence Interval for Population Mean

• Let $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$ be the sample means of samples S1, S2, …, Sk, each of size n, drawn from an independent and identically distributed population with mean μ and standard deviation σ. From the central limit theorem we know that the sample means $\bar{X}_i$ follow a normal distribution with mean μ and standard deviation $\sigma / \sqrt{n}$. The variable $Z = \frac{\bar{X}_i - \mu}{\sigma / \sqrt{n}}$ follows a standard normal distribution.



Assume that we are interested in finding the (1 − α)100% confidence interval for the population mean. We can distribute α (the probability of not observing the true population mean in the interval) equally (α/2) on either side of the distribution, as shown in the figure.



CI for the population mean when population standard deviation is
known

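• In general, the (1 − α)100% confidence interval for the population mean when the population standard deviation is known can be written as

$$\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

• The above equation is valid for large sample sizes, irrespective of the distribution of the population. It is equivalent to

$$P\left(\bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$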



CI for Different Significance Values

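• That is, the probability that the population mean takes a value between $\bar{X} - Z_{\alpha/2}\,\sigma/\sqrt{n}$ and $\bar{X} + Z_{\alpha/2}\,\sigma/\sqrt{n}$ is 1 − α.
• The absolute values of $Z_{\alpha/2}$ for various values of α are shown below:

α       |Z_{α/2}|   Confidence interval for population mean when population standard deviation is known
0.1     1.64        $\bar{X} \pm 1.64\,\sigma/\sqrt{n}$
0.05    1.96        $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$
0.02    2.33        $\bar{X} \pm 2.33\,\sigma/\sqrt{n}$
0.01    2.58        $\bar{X} \pm 2.58\,\sigma/\sqrt{n}$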
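The tabulated critical values can be reproduced numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for a Z-table; the helper name `z_critical` is illustrative and does not come from the slides.

```python
from statistics import NormalDist


def z_critical(alpha: float) -> float:
    """Return |Z_{alpha/2}|, the two-sided critical value of N(0, 1)."""
    # The upper alpha/2 tail starts at the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.1, 0.05, 0.02, 0.01):
    print(f"alpha = {alpha:<5} |Z| = {z_critical(alpha):.2f}")
```

The values agree with the table to two decimal places.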



Example

A sample of 100 patients was chosen to estimate the length of stay (LoS) at a hospital. The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than 4.73
days?



Solution

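(a) 95% confidence interval for the population mean: We know that $\bar{X} = 4.5$, σ = 1.2, and n = 100, and thus

$$\frac{\sigma}{\sqrt{n}} = \frac{1.2}{\sqrt{100}} = 0.12$$

The 95% confidence interval is given by

$$\bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 4.5 \pm 1.96 \times 0.12 = (4.2648,\ 4.7352)$$

(b) Note that 4.73 is approximately the upper limit of the 95% confidence interval from part (a); thus the probability that the population mean is greater than 4.73 is approximately 0.025.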
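The length-of-stay interval can be checked with a few lines of Python. This is a sketch under the example's stated inputs (sample mean 4.5, σ = 1.2, n = 100); the function name `z_confidence_interval` is an illustrative choice, and `statistics.NormalDist` is used in place of a Z-table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, alpha=0.05):
    """(1 - alpha) confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z * sigma / sqrt(n)             # Z times the standard error
    return mean - margin, mean + margin


lower, upper = z_confidence_interval(mean=4.5, sigma=1.2, n=100)
print(f"95% CI for mean LoS: ({lower:.4f}, {upper:.4f})")
```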



Confidence Interval for Population Mean when Standard Deviation is Unknown

• William Gosset (Student, 1908) proved that if the population follows a normal distribution and the standard deviation is calculated from the sample, then the statistic

$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$

follows a t-distribution with (n − 1) degrees of freedom.

• Here S is the standard deviation estimated from the sample, so $S / \sqrt{n}$ is the estimated standard error. The t-distribution is very similar to the standard normal distribution: it is bell shaped and its mean, median, and mode are equal to zero. The major difference is that the t-distribution has heavier tails than the standard normal distribution. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.



Confidence Interval for Population Mean when Standard Deviation is Unknown

• The (1 − α)100% confidence interval for the mean of a population that follows a normal distribution, when the population standard deviation is unknown, is given by

$$\bar{X} \pm t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$

• In the above equation, $t_{\alpha/2,\,n-1}$ is the value of t under the t-distribution with (n − 1) degrees of freedom for which the cumulative probability F(t) = α/2.



• The absolute values of $t_{\alpha/2,\,n-1}$ for different values of α are shown in the table below, along with the corresponding $Z_{\alpha/2}$ values.

α       |t_{α/2,10}|   |t_{α/2,50}|   |t_{α/2,500}|   |Z_{α/2}|
0.1     1.812          1.675          1.647           1.64
0.05    2.228          2.008          1.964           1.96
0.02    2.763          2.403          2.333           2.33
0.01    3.169          2.677          2.585           2.58

It is evident from the table that the values of $t_{\alpha/2,\,n-1}$ and $Z_{\alpha/2}$ converge for higher degrees of freedom. In fact, as the sample size nears 100, the t-distribution gets very close to a normal distribution.



Example

• An online grocery store is interested in estimating the basket size (number of items
ordered by the customer) of its customers so that it can optimize its size of crates used for
delivering the grocery items. From a sample of 70 customers, the average basket size was
estimated as 24 and the standard deviation estimated from the sample was 3.8. Calculate
the 95% confidence interval for the basket size of the customer order.

Solution
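We know that $\bar{X} = 24$, n = 70, S = 3.8, and $t_{0.025,\,69} = 1.995$.

The confidence interval for the size of the basket is given by

$$\bar{X} \pm t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} = 24 \pm 1.995 \times \frac{3.8}{\sqrt{70}} = 24 \pm 0.91$$

Thus the 95% confidence interval for the size of the basket is (23.09, 24.91).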
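The basket-size interval can be reproduced as follows. This is a sketch only: the critical value 1.995 is taken from the example rather than computed, because Python's standard library has no t-distribution, and the function name `t_confidence_interval` is hypothetical.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_crit):
    """Confidence interval for the mean when sigma is estimated by S.

    t_crit is |t_{alpha/2, n-1}|, looked up from a t-table by the caller.
    """
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin


# Values from the example: n = 70, sample mean 24, S = 3.8, t_{0.025,69} = 1.995
lower, upper = t_confidence_interval(mean=24, s=3.8, n=70, t_crit=1.995)
print(f"95% CI for basket size: ({lower:.2f}, {upper:.2f})")
```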
HYPOTHESIS TESTING
INTRODUCTION TO HYPOTHESIS TESTING

Hypothesis testing is a statistical process of either rejecting or retaining a claim, belief, or association related to a business context, product, service, process, etc.



INTRODUCTION TO HYPOTHESIS TESTING

• A hypothesis test consists of two complementary statements, called the null hypothesis and the alternative hypothesis, and only one of them is true.

• Hypothesis testing is an integral part of many predictive analytics techniques such as multiple linear regression and logistic regression.



HYPOTHESIS TESTING
STEPS
1. Describe the hypothesis in words. A hypothesis is described using a population parameter (such as mean, standard deviation, proportion, etc.) about which a claim (hypothesis) is made. A few sample claims (hypotheses) are:

• Average time spent by women using social media is more than that spent by men.
• Customers with more than one mobile handset are more likely to die early.



HYPOTHESIS TESTING STEPS
2. Based on the claim made in step 1, define the null and alternative hypotheses. Initially we believe that the null hypothesis is true. In general, the null hypothesis means that there is no relationship between the two variables under consideration (for example, the null hypothesis for the claim ‘women use social media more than men’ will be ‘there is no relationship between gender and the average time spent on social media’). Null and alternative hypotheses are defined using a population parameter.

3. Identify the test statistic to be used for testing the validity of the null hypothesis. The test statistic enables us to calculate the evidence in support of the null hypothesis. It depends on the probability distribution of the sampling distribution; for example, if the test is for a mean value, the mean is calculated from a large sample, and the population standard deviation is known, then the sampling distribution will be a normal distribution and the test statistic will be a Z-statistic (standard normal statistic).
HYPOTHESIS TESTING STEPS
4. Decide the criteria for rejection and retention of the null hypothesis. This is called the significance value, traditionally denoted by the symbol α. The value of α will depend on the context; usually 0.1, 0.05, and 0.01 are used.

5. Calculate the p-value (probability value), which is the conditional probability of observing a test statistic value at least as extreme as the one obtained from the sample when the null hypothesis is true. In simple terms, the p-value is the evidence in support of the null hypothesis.

6. Take the decision to reject or retain the null hypothesis based on the p-value and the significance value α. The null hypothesis is rejected when the p-value is less than α and retained when the p-value is greater than or equal to α.



Null and Alternative Hypothesis

• Null hypothesis, usually denoted as H0 (H zero or H naught), refers to the statement that there is no relationship or no difference between different groups with respect to the value of a population parameter.

• Alternative hypothesis, usually denoted as HA (or H1), is the complement of the null hypothesis.



Hypothesis statement to definition of null and alternative
hypothesis
S. No. 1
Hypothesis description: Average annual salary of machine learning experts is different for males and females. (In this case, the null hypothesis is that there is no difference in male and female salary of machine learning experts.)
Null and alternative hypothesis:
H0: μm = μf
HA: μm ≠ μf
μm and μf are the average annual salaries of male and female machine learning experts, respectively.

S. No. 2
Hypothesis description: On average, people with a Ph.D. in analytics earn more than people with a Ph.D. in engineering.
Null and alternative hypothesis:
H0: μa ≤ μe
HA: μa > μe
μa = Average annual salary of people with a Ph.D. in analytics.
μe = Average annual salary of people with a Ph.D. in engineering.

It is essential to have the equal sign in the null hypothesis statement.



Test Statistic
• The test statistic is the standardized difference between the value of the parameter estimated from the sample(s) and the hypothesized value (that is, the standardized difference between $\bar{X}$ and μ), used to establish the evidence in support of the null hypothesis.

• It measures the standardized distance (in number of standard deviations) between the value of the parameter estimated from the sample(s) and the value stated in the null hypothesis.



P - Value
• The p-value is the conditional probability of observing a statistic value at least as extreme as the one obtained from the sample when the null hypothesis is true.

• For example, consider the following hypothesis: Average annual salary of machine learning experts is at least 100,000. The corresponding null hypothesis is H0: μm ≤ 100,000. Assume that the estimated value of the salary from a sample is 110,000 (that is, $\bar{X} = 110{,}000$), that the standard deviation of the population is known, and that the standard error of the sampling distribution is 5000 (that is, $\sigma / \sqrt{n} = 5000$, where n is the size of the sample from which $\bar{X}$ was calculated).
Hypothesis Testing
• The standardized distance between the estimated salary and the hypothesized salary is (110,000 − 100,000)/5000 = 2.

• That is, the standardized distance between the estimated value and the hypothesized value is 2, and we can now find the probability of observing this statistic value from the sample if the null hypothesis is true (that is, if μm ≤ 100,000).

• A large standardized distance between the estimated value and the hypothesized value will result in a low p-value.

• Note that the value 2 is actually a value under the standard normal distribution, since it is calculated from $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$.



Standard normal distribution and the p-value
corresponding to Z = 2 are shown below:



Hypothesis Testing
• The probability of observing a value of 2 or higher from a standard normal distribution is 0.02275.

• That is, if the population mean is 100,000 and the standard error of the sampling distribution is 5000, then the probability of observing a sample mean greater than or equal to 110,000 is 0.02275.

• The value 0.02275 is the p-value, which is the evidence in support of the statement in the null hypothesis.

p-value = P(Observing test statistic value | null hypothesis is true)
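The salary example can be traced numerically. The sketch below assumes the values quoted in the example (sample mean 110,000, hypothesized mean 100,000, standard error 5000) and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

x_bar = 110_000      # salary estimated from the sample
mu_0 = 100_000       # hypothesized salary in the null hypothesis
std_error = 5_000    # sigma / sqrt(n), as given in the example

# Standardized distance between the estimate and the hypothesized value.
z = (x_bar - mu_0) / std_error

# Right-tail area: probability of a value of z or higher under N(0, 1).
p_value = 1 - NormalDist().cdf(z)

print(f"Z = {z:.1f}, p-value = {p_value:.5f}")
```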
Decision Criteria – Significance Value

• The significance level, usually denoted by α, is the criterion used for taking the decision regarding the null hypothesis (reject or retain) based on the calculated p-value.

• The significance value α is the maximum threshold for the p-value.

• The decision to reject or retain will depend on whether the calculated p-value crosses the threshold value α or not.



Decision making under hypothesis testing

Criteria         Decision
p-value < α      Reject the null hypothesis
p-value ≥ α      Retain (or fail to reject) the null hypothesis



Example 1
Statement 1: Salary of machine learning experts on average is at least US $100,000.
The null and alternative hypotheses in this case are given by

H0: μm ≤ 100,000
HA: μm > 100,000

where μm is the average annual salary of machine learning experts.

Note that the equality symbol is always part of the null hypothesis since we have to measure the difference between the value estimated from the sample and the hypothesized value. In this case, the reject or retain decision will depend on the direction of deviation of the parameter estimated from the sample from the hypothesized value.



Solution
The figure below shows the rejection region on the right side of the distribution. Since the rejection region is only on one side, this is a one-tailed test. Specifically, since the alternative hypothesis in this case is μm > 100,000, this is called a right-tailed test.



Example 2
• Statement 2: Average waiting time at the London Heathrow airport security check is less than 30 minutes.
The null and alternative hypotheses in this case are given by
H0: μw ≥ 30
HA: μw < 30

where μw is the average waiting time at the London Heathrow security check. In this case, the rejection region will be on the left side of the distribution (known as a left-tailed test), as shown in the figure.



Solution

Rejection region in case of left-sided test



Example 3
Statement 3: Average salary of male and female MBA students at graduation is different.

The null and alternative hypotheses in this case are given by

H0: μm = μf
HA: μm ≠ μf

where μm and μf are the average salaries of male and female MBA students, respectively, at the time of graduation.
In this case, the rejection region will be on either side of the distribution, and if the significance level is α then the rejection region will be α/2 on either side of the distribution. Since the rejection region is on either side of the distribution, it will be a two-tailed test.
Solution

Rejection region in case of two-tailed test



Hypothesis Testing for Population Mean with
known Variance: Z-Test
• The Z-test (also known as the one-sample Z-test) is used when a claim (hypothesis) is made about a population parameter, such as the population mean or proportion, when the population variance is known.
• Since the hypothesis test is carried out with just one sample, this test is also known as the one-sample Z-test.
• For a hypothesis test on the population mean when the population variance is known, the test statistic for the Z-test is given by

$$\text{Z-statistic} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

• The critical value in this case will depend on the significance value α and whether it is a one-tailed or two-tailed test.



Critical value for different values of α

        Approximate Critical Values
α       Left-tailed test   Right-tailed test   Two-tailed test
0.1     −1.28              1.28                −1.64 and 1.64
0.05    −1.64              1.64                −1.96 and 1.96
0.01    −2.33              2.33                −2.58 and 2.58



Condition for rejection of null hypothesis H0

Type of Test        Condition                           Decision
Left-tailed test    Z-statistic < Critical value        Reject H0
                    Z-statistic ≥ Critical value        Retain H0
Right-tailed test   Z-statistic > Critical value        Reject H0
                    Z-statistic ≤ Critical value        Retain H0
Two-tailed test     |Z-statistic| > |Critical value|    Reject H0
                    |Z-statistic| ≤ |Critical value|    Retain H0
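The rejection conditions in the table can be written as one small function. This is a sketch rather than a library routine: the name `z_test_decision` and the tail labels "left", "right", and "two" are assumptions made for illustration, and the critical values come from `statistics.NormalDist` instead of a table.

```python
from statistics import NormalDist


def z_test_decision(z_stat: float, alpha: float, tail: str) -> str:
    """Apply the reject/retain rule for a left-, right-, or two-tailed Z-test."""
    std_normal = NormalDist()
    if tail == "left":
        reject = z_stat < std_normal.inv_cdf(alpha)          # negative critical value
    elif tail == "right":
        reject = z_stat > std_normal.inv_cdf(1 - alpha)      # positive critical value
    elif tail == "two":
        reject = abs(z_stat) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return "reject H0" if reject else "retain H0"


print(z_test_decision(2.0, alpha=0.05, tail="right"))   # 2.0 exceeds about 1.64
print(z_test_decision(-1.0, alpha=0.05, tail="left"))   # -1.0 is above about -1.64
```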



Example
• An agency based out of Bangalore claimed that the average monthly disposable income of families living in Bangalore is greater than INR 4200 with a standard deviation of INR 3200. From a random sample of 40,000 families, the average disposable income was estimated as INR 4250. Assume that the population standard deviation is INR 3200. Conduct an appropriate hypothesis test at the 95% confidence level (α = 0.05) to check the validity of the claim by the agency.



Solution
Claim: Average disposable income is more than INR 4200.
Let μ and σ denote the mean and standard deviation in the population. The corresponding null and alternative hypotheses are
H0: μ ≤ 4200
HA: μ > 4200
Since we know the population standard deviation, we can use the Z-test. The corresponding Z-statistic is given by

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{4250 - 4200}{3200 / \sqrt{40000}} = 3.125$$



Solution Continued…
This is a right-tailed test.
The corresponding Z-critical value at α = 0.05 for a right-tailed test is approximately 1.64.
Since the calculated Z-statistic value is greater than the Z-critical value, we reject the null hypothesis.

The corresponding p-value = 0.00088.
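The disposable-income test can be replayed in code. This is a minimal sketch using only the figures stated in the example (sample mean 4250, hypothesized mean 4200, σ = 3200, n = 40,000); the variable names are illustrative, and the p-value is computed with `statistics.NormalDist`, so its last digit may differ slightly from a table lookup.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, sigma, n = 4250, 4200, 3200, 40_000
alpha = 0.05

z_stat = (x_bar - mu_0) / (sigma / sqrt(n))   # standardized distance
z_crit = NormalDist().inv_cdf(1 - alpha)      # right-tailed critical value
p_value = 1 - NormalDist().cdf(z_stat)        # right-tail area beyond z_stat

decision = "reject H0" if z_stat > z_crit else "retain H0"
print(f"Z = {z_stat:.3f}, critical = {z_crit:.2f}, p = {p_value:.5f}: {decision}")
```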



Critical value, Z-statistic value, and corresponding p-value.



Example
A passport office claims that passport applications are processed within 30 days of submitting the application form and all necessary documents. Table 6.6 shows the processing time of 40 passport applicants. The population standard deviation of the processing time is 12.5 days.
Conduct a hypothesis test at significance level α = 0.05 to verify the claim made by the passport office.

16 16 30 37 25 22 19 35 27 32

34 28 24 35 24 21 32 29 24 35

28 29 18 31 28 33 32 24 25 22

21 27 41 23 23 16 24 38 26 28



Solution
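The null and alternative hypotheses in this case are given by
H0: μ ≥ 30
HA: μ < 30
From the data in Table 6.6, the estimated sample mean is 27.05 days.
The standard deviation of the sampling distribution is

$$\frac{\sigma}{\sqrt{n}} = \frac{12.5}{\sqrt{40}} = 1.9764$$

The value of the Z-statistic is given by

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{27.05 - 30}{1.9764} = -1.4926$$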



Solution Continued…
• The critical value of the left-tailed test for α = 0.05 is −1.644.
• Since the critical value is less than the Z-statistic value, we fail to reject the null hypothesis. The p-value for Z = −1.4926 is 0.06777, which is greater than the value of α.

• That is, there is no strong evidence against the null hypothesis, so we retain the null hypothesis, which is μ ≥ 30. Figure 6.6 shows the calculated Z-statistic value and the rejection region.
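The passport example can be recomputed directly from the 40 processing times listed in the table. The sketch below assumes the known population standard deviation of 12.5 days from the problem statement and uses only the Python standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean

# Processing times (days) of the 40 applicants, as listed in the example table.
days = [
    16, 16, 30, 37, 25, 22, 19, 35, 27, 32,
    34, 28, 24, 35, 24, 21, 32, 29, 24, 35,
    28, 29, 18, 31, 28, 33, 32, 24, 25, 22,
    21, 27, 41, 23, 23, 16, 24, 38, 26, 28,
]
mu_0, sigma, alpha = 30, 12.5, 0.05

x_bar = mean(days)
z_stat = (x_bar - mu_0) / (sigma / sqrt(len(days)))
z_crit = NormalDist().inv_cdf(alpha)      # left-tailed critical value (negative)
p_value = NormalDist().cdf(z_stat)        # left-tail area below z_stat

decision = "reject H0" if z_stat < z_crit else "retain H0"
print(f"mean = {x_bar:.2f}, Z = {z_stat:.4f}, p = {p_value:.4f}: {decision}")
```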



Left-tailed test



Example
According to the company IQ Research, the average Intelligence Quotient (IQ) of Indians is 82, derived from research carried out by Professor Richard Lynn, a British Professor of Psychology, using data collected from 2002 to 2006 (Source: IQ Research). The population standard deviation of IQ is estimated as 11.03. Based on a sample of 100 people from India, the sample IQ was estimated as 84.
(a) Conduct an appropriate hypothesis test at α = 0.05 to validate the claim of IQ Research (that the average IQ of Indians is 82).



Solution
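(a) Hypothesis test: It is given that μ = 82, σ = 11.03, n = 100, and $\bar{X} = 84$.
The null and alternative hypotheses in this case are:
H0: μ = 82
HA: μ ≠ 82
Since the direction of the alternative hypothesis is both ways and the population standard deviation is given, we have a two-tailed Z-test. The test statistic is given by

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{84 - 82}{11.03 / \sqrt{100}} = 1.8132$$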



Solution Continued…
• For a two-tailed test, the critical values at α/2 = 0.025 are −1.96 and 1.96.

• Since the absolute value of the calculated Z-statistic is less than the critical value of 1.96, we fail to reject the null hypothesis (retain the null hypothesis).

• Since the Z-statistic value is 1.8132 and falls on the right tail, we first calculate the area under the normal distribution beyond 1.8132, which is equal to 0.0349.

• Since this is a two-tailed test, the p-value is twice the area to the right of the Z-statistic value; that is, the p-value in this case is 0.0698.
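The two-tailed calculation can be sketched as follows, assuming the inputs given in the example (hypothesized mean 82, σ = 11.03, n = 100, sample mean 84). The doubling of the one-sided tail area is the step described in the last bullet; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar = 82, 11.03, 100, 84
alpha = 0.05

z_stat = (x_bar - mu_0) / (sigma / sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96

one_tail = 1 - NormalDist().cdf(abs(z_stat))       # area beyond |Z| on one side
p_value = 2 * one_tail                             # two-tailed p-value

decision = "reject H0" if abs(z_stat) > z_crit else "retain H0"
print(f"Z = {z_stat:.4f}, p = {p_value:.4f}: {decision}")
```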
Statistic, critical values, and the rejection region



Hypothesis Test for Population Mean Under Unknown Population
Variance: t-Test

• We use the fact that the standardized mean of a sample drawn from a population that follows a normal distribution with unknown variance follows a t-distribution with (n − 1) degrees of freedom.

• In many cases the population variance (and thus the standard deviation) will not be known. In such cases we have to estimate the variance using the sample itself.

• Let S be the standard deviation estimated from the sample of size n.



t-test continued…
• Then the statistic $\frac{\bar{X} - \mu}{S / \sqrt{n}}$ will follow a t-distribution with (n − 1) degrees of freedom if the sample is drawn from a population that follows a normal distribution.

• Here 1 degree of freedom is lost since the standard deviation is estimated from the sample.

• Thus, we use the t-statistic (hence the test is called the t-test) to test the hypothesis when the population standard deviation is unknown:

$$\text{t-statistic} = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
Example
Aravind Productions (AP) is a newly formed movie production
house based out of Mumbai, India. AP was interested in
understanding the production cost required for producing a
Bollywood movie. The industry believes that the production
house will require at least INR 500 million (50 crore) on
average. It is assumed that the Bollywood movie production
cost follows a normal distribution. Production cost of 40
Bollywood movies in millions of rupees are shown in Table 6.7.
Conduct an appropriate hypothesis test at α = 0.05 to check whether the belief about average production cost is correct.



Production cost of Bollywood movies

601 627 330 364 562 353 583 254 528 470

125 60 101 110 60 252 281 227 484 402

408 601 593 729 402 530 708 599 439 762

292 636 444 286 636 667 252 335 457 632



Solution
It is given that the production cost of Bollywood movies follows a
normal distribution; however, the standard deviation of the
population is not known and we need to estimate the standard
deviation value from the sample. Thus, we have to use the t-test
for testing the hypothesis. From the sample data in Table we get
the following values:
n = 40, $\bar{X} = 429.55$, and S = 195.0337

The null and alternative hypotheses are

• H0: μ ≤ 500
• HA: μ > 500



Solution Continued…
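The corresponding test statistic is

$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{429.55 - 500}{195.0337 / \sqrt{40}} = -2.2845$$

Note that this is a one-tailed test (right-tailed) and the critical t-value at α = 0.05 under the right-tailed test is t_critical = 1.6848.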



Solution Continued…
Since t-statistic value is less than the critical t-value, we
retain the null hypothesis. The t-statistic value and critical
value for the t-test are shown in Figure
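The production-cost test can be retraced from the summary statistics. This is a sketch under the example's stated values (n = 40, sample mean 429.55, S = 195.0337); the critical value 1.6848 is copied from the solution because the Python standard library provides no t-distribution, and the helper name `one_sample_t` is hypothetical.

```python
from math import sqrt


def one_sample_t(x_bar: float, mu_0: float, s: float, n: int) -> float:
    """t-statistic for a mean when sigma is estimated by the sample S."""
    return (x_bar - mu_0) / (s / sqrt(n))


t_stat = one_sample_t(x_bar=429.55, mu_0=500, s=195.0337, n=40)
t_crit = 1.6848   # right-tailed critical value for alpha = 0.05, df = 39 (from the solution)

# Right-tailed rule: reject H0 only when the statistic exceeds the critical value.
decision = "reject H0" if t_stat > t_crit else "retain H0"
print(f"t = {t_stat:.4f}, critical = {t_crit}: {decision}")
```

The statistic is negative, so it lies far from the right-hand rejection region.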



Example
According to statistics released by the Department of Civil Aviation, the average delay of flights is equal to 16.8 minutes. Flight delays are assumed to follow a normal distribution. However, from a sample of 50 flights, the average delay was estimated to be 19.5 minutes and the sample standard deviation was 6.6 minutes.
Conduct a hypothesis test to disprove the claim that the average delay is equal to 16.8 minutes at α = 0.01.



Solution
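Given n = 50, $\bar{X} = 19.5$, S = 6.6.
Null and alternative hypotheses are
H0: μ = 16.8
HA: μ ≠ 16.8
The corresponding t-statistic value is

$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{19.5 - 16.8}{6.6 / \sqrt{50}} = 2.8927$$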



Solution Continued…
• The critical t-value for the two-tailed t-test when α = 0.01 and degrees of freedom = 49 is 2.67.
• Since the calculated t-statistic value is greater than the t-critical value, we reject the null hypothesis. The corresponding p-value is 0.0057. The values of the t-statistic, the t-critical value, and the rejection and retention regions are shown in the figure.
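A short check of the flight-delay test, assuming the values from the example (n = 50, sample mean 19.5, S = 6.6, hypothesized mean 16.8). The critical value 2.67 is taken from the solution, since the standard library has no t-distribution; variable names are illustrative.

```python
from math import sqrt

n, x_bar, s, mu_0 = 50, 19.5, 6.6, 16.8
t_crit = 2.67   # two-tailed critical value for alpha = 0.01, df = 49 (from the solution)

t_stat = (x_bar - mu_0) / (s / sqrt(n))

# Two-tailed rule: compare the absolute value of the statistic with the critical value.
decision = "reject H0" if abs(t_stat) > t_crit else "retain H0"
print(f"t = {t_stat:.4f}, critical = ±{t_crit}: {decision}")
```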



Paired Sample t-Test
• In a paired t-test, the data related to the parameter is captured twice from the same subject, once before the intervention and once after the intervention.

• Alternatively, the paired t-test can be used for comparing two different interventions, such as two different promotion strategies applied to the same subject.



Examples of paired t-test
• Body weight of subjects before and after attending a yoga training program.
• Cholesterol levels of subjects before and after attending meditation training.
• Amount of time spent by subjects on the internet before and after marriage.
• Quantity of alcohol consumed by people before and after breakup.
• Level of cortisol among students during and after exam.



Paired t-Test
Assume that the mean difference in the parameter value before and after the treatment is $\bar{d}$ and the corresponding standard deviation of the differences is $S_d$. Let $\mu_d$ be the hypothesized mean difference. Then the statistic

$$t = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}}$$

follows a t-distribution with (n − 1) degrees of freedom.

Here we assume that the differences follow a normal distribution.



Example
The table shows data on alcohol consumption before and after breakup. Conduct a paired t-test to check whether the alcohol consumption is more after the breakup (that is, μd > 0) at 95% confidence (α = 0.05).

Average weekly consumption of alcohol (in ml) before and after breakup

S. No.   Before Breakup (X1)   After Breakup (X2)   Difference (X2 − X1)
1        470                   408                  −62
2        354                   439                  85
3        496                   321                  −175
4        351                   437                  86
5        349                   335                  −14
6        449                   344                  −105
7        378                   318                  −60
8        359                   492                  133
9        469                   531                  62
10       329                   417                  88
11       389                   358                  −31
12       497                   391                  −106
13       493                   398                  −95
14       268                   394                  126
15       445                   508                  63
16       287                   399                  112
17       338                   345                  7
18       271                   341                  70
19       412                   326                  −86
20       335                   467                  132



Solution
The mean difference, that is, the mean of (X2 − X1), is 11.5 and the corresponding sample standard deviation is 95.67.
The null and alternative hypotheses are (when the claim is that the difference is greater than zero):
H0: μd ≤ 0
HA: μd > 0
The value of the test statistic is

$$t = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}} = \frac{11.5 - 0}{95.67 / \sqrt{20}} \approx 0.5375$$



Solution Continued…
• The critical t-value for a one-tailed test when α = 0.05 and df = 19 is 1.7291.
• Since the t-statistic value is 0.5375, which is less than the critical value, we retain the null hypothesis: there is not enough evidence to conclude that alcohol consumption is greater after the breakup than before.
• The corresponding p-value for the right-tailed test is approximately 0.30.
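The paired test can be recomputed from the before/after columns of the table. The sketch below forms the differences X2 − X1 itself, then applies the paired t-statistic with a hypothesized mean difference of 0; the critical value 1.7291 is taken from the solution (the standard library has no t-distribution), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

before = [470, 354, 496, 351, 349, 449, 378, 359, 469, 329,
          389, 497, 493, 268, 445, 287, 338, 271, 412, 335]
after = [408, 439, 321, 437, 335, 344, 318, 492, 531, 417,
         358, 391, 398, 394, 508, 399, 345, 341, 326, 467]

# Paired differences X2 - X1 for each subject.
diffs = [x2 - x1 for x1, x2 in zip(before, after)]

d_bar = mean(diffs)    # mean difference
s_d = stdev(diffs)     # sample standard deviation (n - 1 in the denominator)
t_stat = (d_bar - 0) / (s_d / sqrt(len(diffs)))

t_crit = 1.7291        # one-tailed critical value, alpha = 0.05, df = 19 (from the solution)
decision = "reject H0" if t_stat > t_crit else "retain H0"
print(f"d_bar = {d_bar}, S_d = {s_d:.2f}, t = {t_stat:.4f}: {decision}")
```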



Two Sample Z test
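Assume that μ1 and μ2 are the population means. Our interest is to check a hypothesis on the difference between μ1 and μ2, that is, (μ1 − μ2). If $\bar{X}_1$ and $\bar{X}_2$ are the estimated mean values from two samples drawn from the two populations, the statistic $\bar{X}_1 - \bar{X}_2$ follows a normal distribution with mean (μ1 − μ2) and standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, where n1 and n2 are the sizes of the two samples. The corresponding Z-statistic is given by

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$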



Example
The Dean of St Peter School of Management Education
(SPSME) believes that the graduating students with
specialization in Marketing earn at least INR 5000 more per
month than the students with specialization in Operations
Management. To verify his belief, the Dean collected sample data from his graduating students, given in the table below.
Conduct an appropriate hypothesis test at α = 0.05 to check whether the monthly salary of students with marketing specialization is at least INR 5000 more than that of students with operations specialization. Assume that the salaries of students with marketing specialization and operations specialization follow normal distributions.



Sample values on Marketing and Operations Students

Specialization   Sample Size   Estimated Mean Salary (in Rupees) per Month   Population Standard Deviation
Marketing        120           67,500.00                                     7,200
Operations       45            58,950.00                                     4,600



Solution
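We have n1 = 120, n2 = 45, $\bar{X}_1 = 67{,}500$, $\bar{X}_2 = 58{,}950$, σ1 = 7,200, and σ2 = 4,600.
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 5000
HA: μ1 − μ2 > 5000
The corresponding test statistic value is

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(67500 - 58950) - 5000}{\sqrt{\frac{7200^2}{120} + \frac{4600^2}{45}}} = 3.7374$$

The critical value of Z at α = 0.05 is 1.64 [= NORMSINV(1 − 0.05)]. Since the Z-statistic value is higher than the Z-critical value, we reject the null hypothesis. The corresponding p-value is 9.29 × 10⁻⁵.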
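The two-sample Z-test can be verified with the figures from the salary table. In this sketch `NormalDist().inv_cdf` plays the role of the spreadsheet function NORMSINV quoted in the solution; the function name `two_sample_z` is an illustrative assumption.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, hypothesized_diff=0.0):
    """Z-statistic for a difference in means with known population sigmas."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((x1 - x2) - hypothesized_diff) / std_error


z_stat = two_sample_z(x1=67_500, x2=58_950, sigma1=7_200, sigma2=4_600,
                      n1=120, n2=45, hypothesized_diff=5_000)
z_crit = NormalDist().inv_cdf(1 - 0.05)     # right-tailed critical value
p_value = 1 - NormalDist().cdf(z_stat)      # right-tail area beyond z_stat

decision = "reject H0" if z_stat > z_crit else "retain H0"
print(f"Z = {z_stat:.4f}, p = {p_value:.2e}: {decision}")
```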


Two-Sample t-Test

Difference in Two Population Means when Population Standard Deviations are Unknown and Believed to be Equal: Two-Sample t-Test
• In this section we discuss the hypothesis test for the difference in two population means when the standard deviations of the populations are unknown.
• Hence we need to estimate them from the samples drawn from these two populations.
• An additional assumption we make here is that the standard deviations of the two populations are equal (although unknown).



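Then the sampling distribution of the difference in estimated means, $\bar{X}_1 - \bar{X}_2$, follows a t-distribution with (n1 + n2 − 2) degrees of freedom, with mean (μ1 − μ2) and standard deviation

$$\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $S_p^2$ is the pooled variance of the two samples, given by

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$

The corresponding t-statistic is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$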



Example
A company makes a claim that children (aged between 7 and 12) who drink their health drink will grow taller than children who do not drink it. The data in the table shows the average increase in height over a one-year period for two groups: one drinking the health drink and the other not drinking it. At α = 0.05, test whether the increase in height for the children who drink the health drink exceeds that of the children who do not by at least 1.2 cm.

Group                       Sample Size   Increase in Height (in cm) during the Test Period   Standard Deviation Estimated from Sample
Drink health drink          80            7.6 cm                                              1.1 cm
Do not drink health drink   80            6.3 cm                                              1.3 cm



Solution
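We have n1 = 80, n2 = 80, $\bar{X}_1 = 7.6$, $\bar{X}_2 = 6.3$, S1 = 1.1, and S2 = 1.3.
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 1.2
HA: μ1 − μ2 > 1.2

Pooled variance is

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = \frac{(80 - 1)(1.1)^2 + (80 - 1)(1.3)^2}{80 + 80 - 2} = 1.45$$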



Solution Continued…
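The t-statistic is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(7.6 - 6.3) - 1.2}{\sqrt{1.45 \left(\frac{1}{80} + \frac{1}{80}\right)}} = 0.5252$$

The t-critical value for a one-tailed t-test when α = 0.05 and degrees of freedom = 158 (80 + 80 − 2) is 1.6546. Since the calculated t-statistic value is less than the t-critical value, we retain the null hypothesis. That is, there is not enough evidence to conclude that the difference between the two groups exceeds 1.2 cm; the corresponding right-tailed test has a p-value of 0.3.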
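The pooled-variance calculation can be sketched in code using the summary values from the health-drink table. The function names `pooled_variance` and `pooled_t` are hypothetical, and the critical value 1.6546 is copied from the solution because the Python standard library has no t-distribution.

```python
from math import sqrt


def pooled_variance(s1, s2, n1, n2):
    """Pooled variance of two samples assumed to share one population variance."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)


def pooled_t(x1, x2, s1, s2, n1, n2, hypothesized_diff=0.0):
    """Two-sample t-statistic under the equal-variance assumption."""
    sp2 = pooled_variance(s1, s2, n1, n2)
    return ((x1 - x2) - hypothesized_diff) / sqrt(sp2 * (1 / n1 + 1 / n2))


sp2 = pooled_variance(s1=1.1, s2=1.3, n1=80, n2=80)
t_stat = pooled_t(x1=7.6, x2=6.3, s1=1.1, s2=1.3, n1=80, n2=80,
                  hypothesized_diff=1.2)
t_crit = 1.6546   # one-tailed critical value, alpha = 0.05, df = 158 (from the solution)

decision = "reject H0" if t_stat > t_crit else "retain H0"
print(f"Sp^2 = {sp2:.2f}, t = {t_stat:.4f}: {decision}")
```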



Two Sample t test with unequal Variance

Difference in Two Population Means when Population Standard Deviations are Unknown and Not Equal – Two Sample t test with unequal Variance

We need to estimate the standard deviations from the samples drawn from these two populations. Then the sampling distribution of the difference in estimated means, $\bar{X}_1 - \bar{X}_2$, follows a t-distribution with mean (μ1 − μ2) and standard deviation

$$\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$



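The corresponding degrees of freedom is given by

$$df = \left\lfloor \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2 / n_2\right)^2}{n_2 - 1}} \right\rfloor$$

where the symbol ⌊ ⌋ implies rounding down to the nearest integer. The t-statistic for testing two populations with unequal variance is given by

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$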



Example
A researcher is interested in finding the average duration of marriage based on educational qualifications. Two groups were considered for the study: Group 1 consisted of couples with no Bachelor’s degree (both partners) and Group 2 consisted of couples who both have a Bachelor’s degree or higher. The data in the table shows the average duration of marriage in years. At α = 0.05, test whether the average duration of marriage is more for couples with no Bachelor’s degree compared to couples with a Bachelor’s degree.

Group                    Sample Size   Duration of Marriage in Years   Standard Deviation Estimated from Sample
Couples with no Degree   120           10.1 years                      2.4 years
Couples with Degree      100           9.5 years                       3.1 years



Solution
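We have n1 = 120, n2 = 100, $\bar{X}_1 = 10.1$, $\bar{X}_2 = 9.5$, S1 = 2.4, and S2 = 3.1.
The null and alternative hypotheses are
H0: μ1 − μ2 ≤ 0
HA: μ1 − μ2 > 0
The t-statistic value is

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} = \frac{(10.1 - 9.5) - 0}{\sqrt{\frac{2.4^2}{120} + \frac{3.1^2}{100}}} = 1.5806$$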



Solution Continued…
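The corresponding degrees of freedom is

$$df = \left\lfloor \frac{\left(\frac{2.4^2}{120} + \frac{3.1^2}{100}\right)^2}{\frac{\left(2.4^2 / 120\right)^2}{119} + \frac{\left(3.1^2 / 100\right)^2}{99}} \right\rfloor = \lfloor 184.34 \rfloor = 184$$

The critical value of t for α = 0.05 and df = 184 is 1.6531.

Since the t-statistic is less than the critical value of t, we retain the null hypothesis.
That is, there is not enough evidence to conclude that the duration of marriage is longer for couples with no Bachelor’s degree.
The corresponding p-value is 0.05785.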
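The unequal-variance test and its degrees of freedom can be checked as follows. This is a sketch using the summary values from the marriage-duration table; the function name `welch_t_and_df` is an illustrative choice (this form of the test is commonly called Welch's t-test), and the critical value 1.6531 is copied from the solution.

```python
from math import floor, sqrt


def welch_t_and_df(x1, x2, s1, s2, n1, n2, hypothesized_diff=0.0):
    """t-statistic and rounded-down degrees of freedom for unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2            # per-sample variance of the mean
    t_stat = ((x1 - x2) - hypothesized_diff) / sqrt(v1 + v2)
    df = floor((v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1)))
    return t_stat, df


t_stat, df = welch_t_and_df(x1=10.1, x2=9.5, s1=2.4, s2=3.1, n1=120, n2=100)
t_crit = 1.6531   # one-tailed critical value, alpha = 0.05, df = 184 (from the solution)

decision = "reject H0" if t_stat > t_crit else "retain H0"
print(f"t = {t_stat:.4f}, df = {df}: {decision}")
```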
