Inference Statistics Terminology


APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Statistics for Data Science(23CSH-233)


Faculty: Prof. (Dr.) Madan Lal Saini(E13485)

Inferential Statistics DISCOVER . LEARN . EMPOWER

Statistics for Data Science : Course Objectives

COURSE OBJECTIVES
The course aims to:
1. Equip students with the skills to summarize and interpret data using descriptive
statistics and visualization techniques.
2. Develop a foundational understanding of probability and its applications in data
science.
3. Enable students to perform hypothesis testing and construct confidence intervals
for statistical inference.
4. Teach students how to build and assess linear and logistic regression models for
predictive analysis.
5. Provide hands-on experience with statistical software for data manipulation,
analysis, and visualization.

COURSE OUTCOMES
On completion of this course, the students shall be able to:

CO1  Summarize and describe the main features of a dataset using measures such as mean,
     median, mode, variance, and standard deviation, as well as graphical representations
     like histograms, box plots, and scatter plots.
CO2  Understand probability theory, including concepts such as random variables,
     probability distributions, and the law of large numbers, in order to model and
     reason about uncertainty in data.
CO3  Perform statistical inference, including hypothesis testing, confidence interval
     estimation, and p-value computation, to draw valid conclusions from sample data about
     larger populations.
CO4  Apply linear and logistic regression techniques to identify relationships between
     variables, make predictions, and evaluate model performance.
CO5  Utilize statistical software tools to perform data analysis, including data cleaning,
     transformation, visualization, and implementing various statistical methods.

Unit-3 Syllabus

Unit-3 Inferential Statistics

Inferential Statistics & Hypothesis Testing: Statistical Inference Terminology,
Hypothesis Testing, Parametric Tests, Non-parametric Tests

Industry Application: Hypothesis Testing using Excel, Industry Practices &
Applications of Statistics

SUGGESTIVE READINGS

TEXT BOOKS:
• T1. Hastie, Trevor, et al., The elements of statistical learning. Vol. 2. No. 1. New York:
Publisher: Springer, Edition: Second Edition (2009), ISBN: 978-0387848570
• T2. Montgomery, Douglas C., and George C. Runger. Applied statistics and probability for
engineers. John Wiley & Sons, 2010.
• T3. Probability and Statistics The Science of Uncertainty Second Ed., Michael J. Evans and
Jeffrey S. Rosenthal.

REFERENCE BOOKS:
• R1. Practical Statistics for Data Scientists: 50 Essential Concepts, Authors: Peter Bruce, et al,
Publisher: O'Reilly Media, Edition: Second Edition (2020), ISBN: 978-1492072942
• R2. An Introduction to Statistical Learning: with Applications in R, Authors: Gareth James, et
al, Publisher: Springer, Edition: Second Edition (2021), ISBN: 978-1071614174
• R3. Think Stats: Exploratory Data Analysis in Python, Author: Allen B. Downey, Publisher:
O'Reilly Media, Publication Year: 2014 (2nd Edition), ISBN: 978-1491907337

• Overview of Inference
• Statistical confidence
• Confidence intervals
• Confidence interval for a population mean
• How confidence intervals behave
• Choosing the sample size
• Some Cautions

Statistical Inference
After we have selected a sample, we know the responses of the
individuals in the sample. However, the reason for taking the sample is
to infer from that data some conclusion about the wider population
represented by the sample.

Statistical inference provides methods for drawing conclusions about a
population from sample data.

[Figure: collect data from a representative sample of the population, then
make an inference about the population from that sample.]
Confidence Interval
A level C confidence interval for a parameter has two parts:
• An interval calculated from the data, which has the form
estimate ± margin of error
• A confidence level C, where C is the probability that the
interval will capture the true parameter value in repeated
samples. In other words, the confidence level is the success
rate for the method.

We usually choose a confidence level of 90% or higher because we
want to be quite sure of our conclusions. The most common confidence
level is 95%.
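
The "success rate" reading of the confidence level can be checked by simulation. The sketch below is only an illustration: the population values (mean 500, standard deviation 100), the sample size and the number of trials are assumed for the demo and do not come from the slides. It builds a known-σ interval of the form x̄ ± z*·σ/√n many times and counts how often the interval captures the true mean.

```python
import random
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, level, trials, seed=0):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    # Critical value z*: central area `level` under the standard Normal curve.
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    margin = z_star * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw one sample from the (assumed) Normal population.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / trials

# Hypothetical population and settings, chosen only for this demo.
rate = coverage_rate(mu=500, sigma=100, n=50, level=0.95, trials=2000)
print(f"share of intervals capturing mu: {rate:.3f}")
```

With a 95% level the printed share should land close to 0.95, though any single run will differ slightly because of sampling variation.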

Statistical Estimation

Note: Assume we know the standard deviation σ of the population,
σ = 100.

• We know that the sample mean x̄ is an unbiased estimator of
the (unknown) population mean µ.
• So we can take x̄ = 495 as a good estimate.
• But how reliable is this estimate?

If we take repeated samples, the sample means will vary.

Note: The numbers in these figures are different from those in the
previous example.
Statistical Confidence

• Because of the Central Limit Theorem, the sample mean x̄ is
approximately normally distributed.
• From the 68-95-99.7 rule, we know that about 95% of the values of x̄ lie
within ±2 standard deviations of µ, where the standard deviation of x̄
is σ/√n.

So we say that the true population mean µ lies somewhere in the
interval 495 ± 9 = [486, 504] with 95% confidence. (The margin of 9 is
two standard deviations of x̄, so σ/√n is about 4.5 here.)

This is the 95% confidence interval for the population mean.

Confidence Interval for a
Population Mean
To calculate a confidence interval for µ, we use the formula:

estimate ± (critical value) • (standard deviation of statistic)

Confidence level C    z*
80%                   1.282
85%                   1.440
90%                   1.645
95%                   1.960
99%                   2.576
99.5%                 2.807

Choose an SRS of size n from a population having unknown mean µ and
known standard deviation σ. A level C confidence interval for µ is

$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$

The critical value z* is found from the standard Normal distribution.
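
As a small sketch of where the z* values come from, the critical value for level C is the point that leaves area (1 − C)/2 in each tail of the standard Normal curve. The helper name `critical_value` is introduced here for illustration only; it uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def critical_value(level):
    """z* such that the central area under the standard Normal curve is `level`."""
    # The central area C leaves (1 - C) / 2 in each tail, so z* is the
    # quantile at cumulative probability (1 + C) / 2.
    return NormalDist().inv_cdf((1 + level) / 2)

for level in (0.80, 0.85, 0.90, 0.95, 0.99, 0.995):
    print(f"{level:.1%}  z* = {critical_value(level):.3f}")
```

The printed values can be compared against the table of z* values for the same confidence levels.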


The Margin of Error
The confidence level C determines the value of z* (in Table D).

The margin of error also depends on z*:

$$m = z^* \frac{\sigma}{\sqrt{n}}$$

Higher confidence C implies a larger margin of error m (thus less precision
in our estimates).

A lower confidence level C produces a smaller margin of error m (thus better
precision in our estimates).

[Figure: standard Normal curve with central area C between −z* and z*; the
margin of error m extends on either side of the estimate.]
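
The trade-off between confidence and precision can be seen numerically. The sketch below computes the margin of error m = z*·σ/√n and the resulting interval for several confidence levels. The values x̄ = 495 and σ = 100 are taken from the running example, while n = 500 is an assumed sample size used only for illustration (the slides do not state it). The function names are likewise hypothetical.

```python
from statistics import NormalDist

def margin_of_error(sigma, n, level):
    """Margin of error m = z* * sigma / sqrt(n) for a known-sigma z-interval."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    return z_star * sigma / n ** 0.5

def z_interval(xbar, sigma, n, level):
    """Level-C confidence interval for the mean: xbar - m to xbar + m."""
    m = margin_of_error(sigma, n, level)
    return xbar - m, xbar + m

# n = 500 is an assumption for this demo, not a value given in the slides.
for level in (0.90, 0.95, 0.99):
    low, high = z_interval(xbar=495, sigma=100, n=500, level=level)
    m = margin_of_error(100, 500, level)
    print(f"C = {level:.0%}: m = {m:.2f}, interval = [{low:.1f}, {high:.1f}]")
```

The margin grows as the confidence level rises, and because n sits under a square root, quadrupling the sample size only halves the margin.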

Choosing the Sample Size
You may need a certain margin of error (e.g., in drug trials or
manufacturing specs). In most cases, we have no control over the
population variability (σ), but we can choose the number of
measurements (n).

The confidence interval for a population mean will have a specified
margin of error m when the sample size is

$$m = z^* \frac{\sigma}{\sqrt{n}} \iff n = \left(\frac{z^* \sigma}{m}\right)^2$$

Remember, though, that sample size is not always stretchable at will. There are
typically costs and constraints associated with large samples. The best approach is to
use the smallest sample size that can give you useful results.

Sample Size Example
How many undergraduates should we survey?
Suppose we are planning a survey about college savings programs.
We want the margin of error of the amount contributed to be $30 with
95% confidence. Let us assume the population standard deviation, σ,
equals $1483.
How many measurements should you take?
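For a 95% confidence interval, z* = 1.96.

$$n = \left(\frac{z^* \sigma}{m}\right)^2 = \left(\frac{1.96 \times 1483}{30}\right)^2 \approx 9387.5$$

Using only 9387 measurements will not be enough to ensure that m is
no more than $30. Therefore, we need at least 9388 measurements.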
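
The same calculation can be sketched in a few lines of Python. The function name `required_sample_size` is introduced here for illustration; the key step is rounding up, since a fractional measurement is not possible and rounding down would leave the margin slightly too large.

```python
import math

def required_sample_size(z_star, sigma, margin):
    """Smallest n whose margin of error z* * sigma / sqrt(n) is at most `margin`."""
    n_exact = (z_star * sigma / margin) ** 2
    return math.ceil(n_exact)  # always round up

n_needed = required_sample_size(z_star=1.96, sigma=1483, margin=30)
print(n_needed)  # 9388
```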

6.2 Tests of Significance

• The reasoning of tests of significance
• Stating hypotheses
• Test statistics
• P-values
• Statistical significance
• Tests for a population mean
• Two-sided significance tests and confidence intervals

Statistical Inference 2
The second common type of statistical inference, called tests of
significance, assesses the evidence in the data about some claim
concerning a population.

A test of significance is a formal procedure for comparing observed
data with a claim (also called a hypothesis) whose truth we want to
assess.
• The claim is a statement about a parameter such as the population
proportion p or the population mean µ.
• We express the results of a significance test in terms of a
probability, called the P-value, which measures how well the data
and the claim agree.

Four Steps of Tests of Significance
Tests of Significance: Four Steps
1. State the null and alternative hypotheses.
2. Calculate the value of the test statistic.
3. Find the P-value for the observed data.
4. State a conclusion.

We will learn the details of many tests of significance in the following
chapters. The proper test statistic is determined by the hypotheses
and the data collection design.

1. Stating Hypotheses
A significance test starts with a careful statement of the claims we want to
compare.

The claim tested by a statistical test is called the null hypothesis (H0).
The test is designed to assess the strength of the evidence against the
null hypothesis. Often, the null hypothesis is a statement of “no effect”
or “no difference in the true means.”

The claim about the population for which we’re trying to find evidence
is the alternative hypothesis (Ha).

2. Test Statistic
A test of significance is based on a statistic that estimates the parameter that
appears in the hypotheses. When H0 is true, we expect the estimate to be
near the parameter value specified in H0.

Values of the estimate far from the parameter value specified by H0 give
evidence against H0.

A test statistic calculated from the sample data measures how far
the data diverge from what we would expect if the null hypothesis
H0 were true.

Large values of the statistic show that the data are not consistent
with H0.

3. P-Value
The probability, computed assuming H0 is true, that the
statistic would take a value as or more extreme than the one
actually observed is called the P-value of the test. The smaller
the P-value, the stronger the evidence against H0.
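
The definition above can be turned directly into a simulation. The sketch below is an illustration under assumed values (null mean 100, σ = 15, n = 50, observed x̄ = 103), none of which come from the slides. It generates sample means as they would behave if H0 were true and counts how often they land at least as far from the null value as the observed mean, which gives a two-sided P-value. A closed-form Normal calculation is included for comparison.

```python
import random
from statistics import NormalDist

def simulated_p_value(xbar_obs, mu0, sigma, n, trials=20000, seed=1):
    """Estimate a two-sided P-value by simulating sample means under H0."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5              # standard deviation of the sample mean
    observed_gap = abs(xbar_obs - mu0)
    extreme = 0
    for _ in range(trials):
        # One simulated sample mean, assuming H0 is true and x-bar is Normal.
        xbar = rng.gauss(mu0, se)
        if abs(xbar - mu0) >= observed_gap:
            extreme += 1
    return extreme / trials

def exact_p_value(xbar_obs, mu0, sigma, n):
    """Two-sided P-value from the standard Normal distribution."""
    z = (xbar_obs - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers, chosen only to illustrate the definition.
simulated = simulated_p_value(xbar_obs=103, mu0=100, sigma=15, n=50)
exact = exact_p_value(xbar_obs=103, mu0=100, sigma=15, n=50)
print(f"simulated P-value: {simulated:.3f}, exact P-value: {exact:.3f}")
```

The two numbers should agree closely; the simulation simply counts the "as or more extreme" outcomes that the exact formula integrates.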

4. Conclusion
We make one of two decisions based on the strength of the evidence against
the null hypothesis: reject H0 or fail to reject H0.

P-value small → reject H0 → conclude Ha (in context),
P-value large → fail to reject H0 → cannot conclude Ha (in context).

If the P-value is smaller than α, we say that the data are
statistically significant at level α. The quantity α is called the
significance level or the level of significance.

When we use a fixed level of significance to draw a conclusion in a
significance test,
P-value < α → reject H0 → conclude Ha (in context)
P-value ≥ α → fail to reject H0 → cannot conclude Ha (in context)

Tests for a Population Mean

One-sided, upper-tail test

One-sided, lower-tail test

Two-sided test – count both sides
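
The three cases differ only in which tail area is counted. The sketch below assumes the known-σ setting used for the confidence intervals and the usual standardized statistic z = (x̄ − µ0)/(σ/√n), where µ0 denotes the value of µ stated in H0. The function name, the `alternative` labels and the sample numbers are illustrative choices, not part of the slides.

```python
from statistics import NormalDist

def z_test_p_value(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, P-value) for a one-sample z test with known sigma."""
    z = (xbar - mu0) / (sigma / n ** 0.5)
    phi = NormalDist().cdf
    if alternative == "greater":        # one-sided, upper tail: Ha is mu > mu0
        p = 1 - phi(z)
    elif alternative == "less":         # one-sided, lower tail: Ha is mu < mu0
        p = phi(z)
    elif alternative == "two-sided":    # count both tails: Ha is mu != mu0
        p = 2 * (1 - phi(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p

# Hypothetical data for illustration.
for alt in ("greater", "less", "two-sided"):
    z, p = z_test_p_value(xbar=103, mu0=100, sigma=15, n=50, alternative=alt)
    print(f"{alt:>9}: z = {z:.3f}, P-value = {p:.4f}")
```

For the same data, the two-sided P-value is twice the smaller of the two one-sided P-values.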


Two-Sided Significance Tests
and Confidence Intervals
Because a two-sided test is symmetrical, we can also use a 1 − α
confidence interval to test a two-sided hypothesis at level α.

Confidence level C and α for a two-sided test are related as follows:

C = 1 − α

[Figure: Normal curve with central area C and area α/2 in each tail.]

More About P-Values

6.3 Use and Abuse of Tests

• Choosing a significance level
• What statistical significance does not mean
• Do not ignore lack of significance
• Beware of searching for significance

Cautions About Significance Tests 1
Choosing the significance level α
Factors often considered:
• What are the consequences of rejecting the null hypothesis
when it is actually true?
   ◦ What might happen if we concluded that global warming was real when
   it really wasn’t?
   ◦ Suppose an innocent person was convicted of a crime.
• Are you conducting a preliminary study? If so, you may want a
larger α so that you will be less likely to miss an interesting
result.

Choosing Significance Level
Some conventions:
• Typically, the standards of our field of work are used.
• There are no sharp cutoffs for P-values: for example, there is
no practical difference between 4.9% and 5.1%.
• It is the order of magnitude of the P-value that matters:
“somewhat significant,” “significant,” or “very significant.”

Cautions About Significance Tests 2
Do not ignore lack of significance

• Consider this provocative title from the British Medical Journal: “Absence of
evidence is not evidence of absence.”
• Having no proof that a particular suspect committed a murder does not imply
that the suspect did not commit the murder.

Indeed, failing to find statistical significance in results means that “the null
hypothesis is not rejected.” This is very different from actually accepting the
null hypothesis. The sample size, for instance, could be too small to overcome
large variability in the population.

Cautions About Significance Tests 3
Statistical inference not valid for all sets of data

6.4 Power and Inference as a Decision
• Power
• Increasing the power
• The common practice of testing hypotheses

Power of Test

Type I and Type II Errors
When we draw a conclusion from a significance test, we hope our
conclusion will be correct. But sometimes it will be wrong. There are two
types of mistakes we can make.

If we reject H0 when H0 is true, we have committed a Type I error.

If we fail to reject H0 when H0 is false, we have committed a Type II
error.

Conclusion based on sample    Truth about the population
                              H0 true               H0 false (Ha true)
Reject H0                     Type I error          Correct conclusion
Fail to reject H0             Correct conclusion    Type II error
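
Both error rates can be estimated by simulation. The sketch below repeatedly runs a two-sided z test at level α on samples drawn from a Normal population whose true mean we control. All numbers (null mean 100, alternative mean 105, σ = 15, n = 25) are assumed for the demo and do not come from the slides. When the true mean equals the null value, the rejection rate estimates the Type I error probability; when it differs, the share of non-rejections estimates the Type II error probability.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mu, mu0, sigma, n, alpha, trials, seed=2):
    """Share of simulated samples in which the two-sided z test rejects H0: mu = mu0."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_star:
            rejections += 1
    return rejections / trials

# Hypothetical settings for illustration only.
type_i = rejection_rate(true_mu=100, mu0=100, sigma=15, n=25, alpha=0.05, trials=4000)
type_ii = 1 - rejection_rate(true_mu=105, mu0=100, sigma=15, n=25, alpha=0.05, trials=4000)
print(f"estimated Type I error rate:  {type_i:.3f}")
print(f"estimated Type II error rate: {type_ii:.3f}")
```

The Type I estimate should sit near α, since α is the probability of rejecting a true H0. The Type II rate depends on how far the true mean is from the null value.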
Increasing the Power
Suppose we have performed a power calculation and found that the power is
too small. Four ways to increase power are

1. Increase the significance level α. It is easier to
reject a null hypothesis with a larger α level.
2. Consider a particular alternate value for μ that is
farther from the null value. Values of μ that are farther
from the hypothesized value are easier to detect.
3. Increase the sample size. More data will provide better
information about the sample average, so we have a better
chance of distinguishing values of μ.
4. Decrease σ. Improving the measuring process and
restricting attention to a subpopulation are possible ways to
decrease σ.
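
The four levers can be checked with a closed-form power calculation. The sketch below assumes a one-sided (upper-tail) z test of H0: µ = µ0 with known σ, a setting chosen here for simplicity. Under a true mean µ1 the test statistic is Normal with mean (µ1 − µ0)/(σ/√n) and standard deviation 1, so the power is the area of that curve beyond the critical value. The function name and all numbers are hypothetical.

```python
from statistics import NormalDist

def power_upper_tail_z(mu0, mu_alt, sigma, n, alpha):
    """Power of the upper-tail z test of H0: mu = mu0 when the true mean is mu_alt."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)        # reject H0 when z > z_alpha
    shift = (mu_alt - mu0) / (sigma / n ** 0.5)    # mean of z under the alternative
    return 1 - std_normal.cdf(z_alpha - shift)

# Hypothetical baseline, then one change at a time.
baseline = power_upper_tail_z(mu0=100, mu_alt=105, sigma=15, n=25, alpha=0.05)
larger_alpha = power_upper_tail_z(100, 105, 15, 25, alpha=0.10)
farther_alt = power_upper_tail_z(100, 110, 15, 25, alpha=0.05)
larger_n = power_upper_tail_z(100, 105, 15, 100, alpha=0.05)
smaller_sigma = power_upper_tail_z(100, 105, 10, 25, alpha=0.05)

print(f"baseline power:           {baseline:.3f}")
print(f"larger alpha (0.10):      {larger_alpha:.3f}")
print(f"farther alternative (110): {farther_alt:.3f}")
print(f"larger sample (n = 100):  {larger_n:.3f}")
print(f"smaller sigma (10):       {smaller_sigma:.3f}")
```

Each of the four changes raises the power above the baseline, in line with the list above.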

References
Books:
• Hastie, Trevor, et al., The elements of statistical learning. Vol. 2. No. 1. New York: Publisher: Springer, Edition:
Second Edition (2009), ISBN: 978-0387848570
• Practical Statistics for Data Scientists: 50 Essential Concepts, Authors: Peter Bruce, et al, Publisher: O'Reilly
Media, Edition: Second Edition (2020), ISBN: 978-1492072942

Research Papers:
• Garg, Ram and Goyal, Ruchi, Inferential Statistics As a Measure of Judging the Short-Term Solvency An Empirical Study of Three Steel
Companies in India (February 5, 2019). International Journal of Advanced Studies of Scientific Research, Vol. 4, No. 1, 2019, Available at
SSRN: https://ssrn.com/abstract=3329388.
• Alacaci, C. (2004). Inferential Statistics: Understanding Expert Knowledge and its Implications for Statistics Education. Journal of Statistics
Education, 12(2). https://doi.org/10.1080/10691898.2004.11910737
Websites:
• https://www.simplilearn.com/inferential-statistics-article/
• https://builtin.com/data-science/inferential-statistics#:~:text=Inferential%20statistics%20is%20the%20practice,
sample%20data%20sample%20or%20population./

Videos:
• https://www.youtube.com/watch?v=cjTgyRUaD1s&list=PLbRMhDVUMngeD_vOeveVE-3b7wu_AZph9
• https://www.youtube.com/watch?v=ZmCBF5JXOPM&list=PLFW6lRTa1g80s2MWqXNg2o0haq1k14v2I
THANK YOU

For queries
Email: madan.e13485@cumail.in
