Statistics Notes
Contents
Poor ways to sample
Confidence interval
Choosing sample size
Significance tests
Errors
Group comparisons
Comparing dependent samples
Independence and association
Formatting excel sheet:
Confidence interval
1. CI for population proportion:
For the 95% confidence interval for a population proportion p to be valid, you should have at
least 15 successes and 15 failures.
95% CI: p_hat ± 1.96(se), where se = sqrt(p_hat(1 - p_hat)/n)
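As a quick sketch (Python, not part of the original notes; the 50-out-of-100 data are invented for illustration), the interval can be computed directly from the success count and sample size:

```python
import math

def prop_ci(successes, n, z=1.96):
    """95% CI for a population proportion: p_hat +/- z * se.
    Valid when there are at least 15 successes and 15 failures."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

lo, hi = prop_ci(50, 100)  # hypothetical: 50 successes out of 100
```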
3. The margin of error measures how accurate the point estimate is likely to be in estimating
a parameter. It is a multiple of the standard error of the sampling distribution of the
estimate, such as 1.96 × (standard error) when the sampling distribution is a normal
distribution.
A 95% confidence interval for a population mean is (L, U) hours. We can be 95%
confident that the population mean falls between L and U hours.
Check whether 0 falls in the confidence interval. If so, it is plausible (but not certain)
that the population proportions are equal.
If all values in the confidence interval for (p1 - p2) are positive, you can infer that (p1
- p2) > 0, or p1 > p2. The interval shows just how much larger p1 might be. If all values
in the confidence interval are negative, you can infer that (p1 - p2) < 0, or p1 < p2.
The magnitude of values in the confidence interval tells you how large any true
difference is. If all values in the confidence interval are near 0, the true difference may
be relatively small in practical terms.
In addition, if the confidence interval contains 0, then it is plausible that (p1 - p2) = 0,
that is, p1 = p2. The population proportions might be equal. In such a case insufficient
evidence exists to infer which of p1 or p2 is larger.
Check whether or not 0 falls in the interval. When it does, 0 is a plausible value for (mu1 -
mu2), meaning that possibly mu1 = mu2.
A confidence interval for (mu1 - mu2) that contains only positive numbers suggests that (mu1 -
mu2) is positive. We then infer that mu1 is larger than mu2.
A confidence interval for (mu1 - mu2) that contains only negative numbers suggests that (mu1 -
mu2) is negative. We then infer that mu1 is smaller than mu2.
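The same check (does the interval for a difference contain 0, or lie entirely on one side of it?) can be sketched for two proportions; the counts below are invented for illustration:

```python
import math

def two_prop_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for (p1 - p2) from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# hypothetical data: 60/100 in group 1, 40/100 in group 2
lo, hi = two_prop_ci(60, 100, 40, 100)
# interval entirely positive -> infer p1 > p2; interval containing 0 -> plausibly p1 = p2
```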
In Excel, Data Analysis > Descriptive Statistics: set the input range, output range, labels,
summary statistics and confidence level for the mean; or use the CONFIDENCE.T function.
Choosing sample size

Significance tests
A significance test is conducted and the probability value (p-value) reflects the strength of
the evidence against the null hypothesis. If the probability is below 0.01, the data provide
strong evidence that the null hypothesis is false. If the probability value is below 0.05 but
larger than 0.01, then the null hypothesis is typically rejected, but not with as much
confidence as it would be if the probability value were below 0.01. Probability values
between 0.05 and 0.10 provide weak evidence against the null hypothesis and, by
convention, are not considered low enough to justify rejecting it. Higher probabilities
provide less evidence that the null hypothesis is false.
When a probability value is below the significance level, the effect is statistically significant
and the null hypothesis is rejected. If the null hypothesis is rejected, then the alternative to
the null hypothesis (called the alternative hypothesis) is accepted.
When a significance test results in a high probability value, it means that the data provide
little or no evidence that the null hypothesis is false. However, the high probability value
is not evidence that the null hypothesis is true.
The test statistic measures how far the sample proportion falls from the null hypothesis
value, p0, relative to what we'd expect if H0 were true: z = (p_hat - p0)/se0, with
se0 = sqrt(p0(1 - p0)/n)
1. Significance test for proportions:
For a proportion, the p-value is calculated from the z-score with NORM.S.DIST
(or NORM.DIST with mean 0 and standard deviation 1)
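A minimal Python sketch of this one-proportion z-test (the data, 60 successes in 100 trials against p0 = 0.5, are invented; the two-sided p-value uses the standard normal CDF, the same quantity Excel's NORM.S.DIST returns):

```python
import math

def one_prop_z(successes, n, p0):
    """z statistic for H0: p = p0; the se uses the null value p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se0

def two_sided_p(z):
    """Two-tail probability under the standard normal distribution."""
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

z = one_prop_z(60, 100, 0.5)  # hypothetical data
p_value = two_sided_p(z)
```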
Errors

Group comparisons

Requirements:
o Independent random samples from two groups, either from random sampling or a
randomized experiment
o An approximately normal population distribution for each group. This is mainly
important for small sample sizes, and even then the method is robust to
violations of this assumption
Comparing means for two groups in Excel:
o Data Analysis: t-Test: Two-Sample Assuming Equal Variances (individual
observations)
o The T.TEST function
o For paired groups: Data analysis: t-Test: Paired Two Sample for Means
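What Excel's "t-Test: Two-Sample Assuming Equal Variances" computes can be sketched in Python like this (the sample values are invented for illustration):

```python
import math

def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx2 = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    sy2 = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)  # pooled variance
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))
    return (mx - my) / se, nx + ny - 2

t, df = pooled_t([5.1, 4.9, 5.6, 5.2], [4.3, 4.0, 4.5, 4.8])
```

Compare |t| to the t distribution with the returned df to get the p-value (T.DIST.2T in Excel).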
Comparing dependent samples:
o To compare proportions with dependent samples, construct confidence intervals and
significance tests using the single sample of the difference scores: pi = xi1 - xi2
o The 95% confidence interval pd ± 1.96(se_pd) and the test statistic z = (pd - 0)/se_pd
are the same as for a single sample. The assumptions are also the same: a random sample
or a randomized experiment and at least 15 successes and 15 failures in the sample of
difference scores
o To compare means with dependent samples, construct confidence intervals and
significance tests using the single sample of the difference scores: di = xi1 - xi2
o The 95% confidence interval d_bar ± t.025(se_d) and the test statistic t = (d_bar - 0)/se_d
are the same as for a single sample. The assumptions are also the same:
a random sample or a randomized experiment and a normal population distribution of
the difference scores
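The difference-score approach for dependent means can be sketched as follows (the before/after measurements are made up for illustration):

```python
import math

def paired_t(x1, x2):
    """t statistic for dependent samples via difference scores di = xi1 - xi2."""
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    d_bar = sum(d) / n
    sd = math.sqrt(sum((v - d_bar) ** 2 for v in d) / (n - 1))
    se = sd / math.sqrt(n)  # standard error of the mean difference
    return d_bar / se, n - 1

t, df = paired_t([10, 12, 9, 11], [8, 11, 9, 10])  # hypothetical pairs
```

This matches Excel's "t-Test: Paired Two Sample for Means" t statistic.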
Independence and association:
Two categorical variables are independent if the population conditional distribution for
one of them is identical at each category of the other; the variables are dependent (or
associated) if the conditional distributions are not identical; even if variables are
independent, we would not expect the sample conditional distributions to be identical.
Because of sampling variability, each sample percentage typically differs somewhat from
the true population percentage.
Construct a contingency table (which displays two categorical variables: the rows list the
categories of one variable and the columns list the categories of the other variable); entries
in the table are frequencies; the percentages in a particular row of a table are called
conditional percentages and they form the conditional distribution.
To test for independence conduct a CHI SQUARED TEST
2. Hypotheses:
H0: The two variables are independent
Ha: The two variables are dependent
What to expect under H0: the count in any particular cell is a random variable;
the mean of its distribution is called an expected cell count, computed as
(row total × column total)/total sample size
3. The chi-squared statistic summarizes how far the observed cell counts in a contingency
table fall from the expected cell counts under the null hypothesis. The formula is
X2 = sum over all cells of (observed count - expected count)^2 / expected count
For each cell, square the difference between the observed count and the expected
count and then divide that square by the expected count. After calculating this
term for every cell, sum the terms to find X2.
4. P-value:
To convert the chi-squared test statistic to a p-value, use the sampling distribution of
the chi-squared statistic (for large sample sizes, this sampling distribution is well
approximated by the chi-squared distribution)
Main properties of chi-squared distribution:
o It falls on the positive part of the real number line
o The precise shape of the distribution depends on the degrees of freedom
o The mean of the distribution equals the df value
o It is skewed to the right
o The larger the X2 value, the greater the evidence against H0
The p-value is the right-tail probability for the observed X2 value for the chi-
squared distribution with df=(r-1)(c-1), r=number of rows, c=number of columns
in the contingency table
5. Conclusion: report and interpret the p-value in context; reject the null hypothesis if the
p-value <= significance level. If the null hypothesis is rejected, there is evidence that the
two variables are associated.
6. In Excel:
CHIDIST(X2, df) gives the p-value directly
CHITEST(actual_range, expected_range) returns the p-value from the observed and
expected counts
Measure of strength:
A large chi-squared value provides strong evidence that the variables are associated, but
it doesn't indicate how strong the association is; it merely indicates, through its p-value,
how certain we can be that the variables are associated.
The strength of an association can be measured by difference in proportions p1-p2 or by
the ratio p1/p2
Pattern of association:
A standardized residual for each cell is like a z-score; values below -3 or above 3 are
unlikely and hence indicate dependence.
When the expected frequencies are small (any of them < 5), small-sample tests are more
appropriate, since the chi-squared test of independence is a large-sample test.
Fisher's exact test is a small-sample test of independence; it involves complex
calculations.
The smaller the p-value, the stronger the evidence that the variables are associated.
For two such variables X and Y, with m and n observed states, form an m x n matrix in
which the entries aij represent the number of observations in which x = i and y = j. Calculate
the row and column sums Ri and Cj and the total sum N.
Significance
If the p-value is less than 0.10, we can reject the null hypothesis at a 90% confidence level.
If the p-value is less than 0.05, we can reject the null hypothesis at a 95% confidence level.
If the p-value is less than 0.01, we can reject the null hypothesis at a 99% confidence level.
Construct confidence intervals and significance tests using the single sample of the
difference scores: add a new column with the difference between the first two columns
(A - B) and calculate x_bar, n, st.dev, se and the margin of error for that column; the
assumptions are the same as for a confidence interval for one mean.
Characteristics of the least squares line: it has some positive residuals and some
negative residuals, but the sum of the residuals equals 0. The line passes through
(x_bar, y_bar).
Regression analysis finds out whether there is any association between two quantitative
variables.
Correlation finds out how strong the connection between the variables is.
Another way to describe the strength of association refers to how close predictions of y
tend to be to the observed y values.
The variables are strongly associated if you can predict y much better by substituting x
values into the prediction equation than by merely using the sample mean y_bar and
ignoring x.
The prediction error is the difference between the observed and the predicted values of y;
each error is (y - y_hat).
When a strong linear association exists, the regression equation predictions tend to be
much better than the prediction using y_bar.
Measure the proportional reduction in error; call it R2:
R2 = (TSS - RSS)/TSS, where TSS = sum of (y - y_bar)^2 and RSS = sum of (y - y_hat)^2
Properties of R2:
o It summarizes the reduction in sum of squared errors in predicting y using the
regression line instead of using the mean of y.
o It falls between 0 and 1.
o R2 = 1 when RSS = 0: this happens only when all the data points fall exactly on the
regression line.
o R2 = 0 when RSS = TSS: this happens when the
slope b = 0, in which case y_hat = y_bar.
o The closer R2 is to 1, the stronger the linear association: the more effective the
regression equation is compared to y_bar in predicting y.
R2 vs rxy:
rxy falls between -1 and 1; it represents the slope of the regression line when x and y have
been standardized
R2 falls between 0 and 1; it summarizes the reduction in sum of squared errors in
predicting y using the regression line instead of using y bar
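A small Python sketch tying these pieces together: fit the least squares line, then compute R2 as the proportional reduction in error (the data points are invented for illustration):

```python
def least_squares(x, y):
    """Least squares fit y_hat = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = y_bar - b * x_bar         # intercept (line passes through (x_bar, y_bar))
    tss = sum((yi - y_bar) ** 2 for yi in y)                      # error using y_bar
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # error using the line
    return a, b, (tss - rss) / tss

a, b, r2 = least_squares([1, 2, 3, 4], [2, 3, 5, 6])  # hypothetical points
```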
The correlation rxy and R2 are descriptive parts of a regression analysis.
The inferential parts of regression use the tools of confidence intervals and significance tests to
provide inference about the regression equation, the correlation and R2 in the population of
interest.
Assumptions for regression line for description:
Population means of y at different values of x have a straight line relationship with x; this
assumption states that a straight line regression model is valid and can be verified with a
scatterplot
Suppose the slope beta of the regression line = 0; then the mean of y is identical at each x
value; the two variables, x and y, are statistically independent: the outcome for y does not
depend on the value of x, and it doesn't help us to know the value of x if we want to predict
the value of y.
Conducting a significance test about a population slope beta:
1. Assumptions:
a) Population satisfies regression line: mean of y = alpha + beta(x)
b) Data gathered using randomization
c) Population y values at each x value have normal distribution with same
standard deviation at each x value
2. Hypotheses:
a) H0: beta = 0
b) Ha: beta ≠ 0
3. Test statistic: t = b/se(b), where b is the sample slope
4. P-value: Two tail probability of t test statistic value more extreme than
observed, using t distribution with df =n-2
5. Conclusion: Interpret p-value in context. Reject H0 if p-value<= significance
level
Small p value in significance test of beta=0 suggests that the population regression line
has a nonzero slope
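Under the assumptions above, the test statistic can be sketched in Python, using se(b) = s/sqrt(sum of (x - x_bar)^2) with s the residual standard deviation (the data are invented for illustration):

```python
import math

def slope_t_test(x, y):
    """t statistic and df for H0: beta = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))   # residual standard deviation
    se_b = s / math.sqrt(sxx)      # standard error of the slope
    return b / se_b, n - 2

t, df = slope_t_test([1, 2, 3, 4], [2, 3, 5, 6])  # hypothetical points
```

The two-tail p-value then comes from the t distribution with df = n - 2 (T.DIST.2T in Excel).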
To learn how far the slope beta falls from 0, we construct a confidence interval: b ± t.025 (se)
Prediction interval for y: the estimate y_hat = a + bx for the mean of y at a fixed value of x is
also a prediction for an individual outcome y at that fixed value of x
The confidence interval for the mean of y is an inference about where a population mean falls;
use a confidence interval if you want to estimate the mean of y for all individuals having a
particular x value; it is approximately equal to y_hat ± 2s/sqrt(n)
The prediction interval for y is an inference about where individual observations fall; use it if
you want to predict where a single observation on y will fall for a particular x value; it is
approximately equal to y_hat ± 2s, where s is the residual standard deviation