AP Statistics Key Summary


Data Collection Data Collection Data Analysis I

Methods of data collection More on experiments Graphs for data analysis


Census Sample Survey x is the explanatory variable (factor) Bar Chart
entire population part of a population y is the response variable
time and cost time and cost experimental units (nonhuman) are
sometimes called subjects (human)
Experiment Observational Study control group does not receive treatment
part of a population part of a population treatment group receives treatment
subjects controlled subjects observed placebo effect is a response to "fake"
blocking (block) stratification (strata) treatment
indicates causation indicates correlation single blinding (only subjects are blind)
control group and confounding (lurking double blinding (both subjects and
treatment group variable) may occur evaluators are blind)
single-blind and completely randomized design Note: there are gaps between bars.

double-blind
테스트프렙어학원

Methods of data planning


(for surveys)
different samples get different
treatments
randomized paired comparison design
Dotplot

one sample gets different treatments


1. simple random sampling (equal prob. for all)
randomized block design
2. systematic sampling (every nth person)
population undergoes blocking
3. stratified sampling (homogeneous strata)
each block receives randomization
4. proportional sampling (proportional to pop.)
random samples get different
5. cluster sampling (heterogeneous clusters)
treatments
6. multistage sampling (methods combined)
control, blocking, randomization, replicability
Bias in data planning and generalizability
Histogram
(for surveys)
1. household bias (family of two vs. five) Sampling error vs. bias
2. nonresponse bias (refuse to respond) Sampling Error Bias
3. response bias (respond untruthfully)
natural variability tendency to favor
4. voluntary response bias (strong opinions)
when taking the selection of
5. quota sampling bias (homogeneous group)
samples from a certain members of
6. selection bias (specific subjects are chosen)
population a population
7. size bias (big vs. small coins)
cannot be avoided can be avoided
8. undercoverage bias (part of pop. ignored)
9. wording bias (poorly worded questions) Note: there may be no gaps between bars.
Data Analysis I Data Analysis I Data Analysis I
Graphs for data analysis (cont.) Special features of graphs Shape
Stemplot Clusters

Note: the leaves may not be skipped and the key Outlier
must be clearly indicated.

Cumulative Relative Frequency (CRF) Plot Gaps

Center
(Measures of Central Tendency)
mean: add all values and divide by n
Describing distributions (SOCS)
Shape Outlier
Note: the median value can be found by drawing symmetric by inspection median: arrange values in ascending order
a horizontal line across 0.5 on the y-axis. skewed to the right by formula and do one of the following
Boxplot median < mean less than if n is odd, select the middle value
skewed to the left Q1 - 1.5 IQR if n is even, take the average of two
Note: min, Q1, median, Q3, max are indicated. median > mean greater than middle values
Outliers are indicated by separate dots bell-shaped Q3 + 1.5 IQR mode: most frequently occurring value
uniform Spread
Choosing the right graphs Center Spread
(Measures of Dispersion)
Qualitative Quantitative range: max - min
mean range
(Categorical) (Numerical) interquartile range: Q3 - Q1
median interquartile range
Variable Variable variance:
divides area variance
dotplot dotplot under graph into standard deviation
bar chart histogram two equal parts
stemplot standard deviation
mode
boxplot uni vs. bimodal
CRF plot
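The 1.5 IQR outlier rule and the five values marked on a boxplot can be checked with a short Python sketch. It assumes the common classroom convention of taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data (excluding the overall median when n is odd); the function names and the sample data are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # When n is odd, the middle value belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [2, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data set
print(five_number_summary(data))
print(outliers(data))
```

Other software may compute quartiles with a different interpolation rule, so Q1 and Q3 can differ slightly from a calculator's output.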
Data Analysis I Data Analysis I Data Analysis I
Measures of Position Transforming distributions Comparing distributions (cont.)
simple ranking: indicates rank from an Addition and subtraction affect Parallel Boxplots
ordered list mean
percentile ranking: indicates a percentage of median
values under the value under consideration mode
z-score: indicates specifically by how many
Multiplication and division affect
standard deviations the value under
mean
consideration varies from the mean
median
mode
range
IQR
variance Overlapping CRF Plots
standard deviation

Comparing distributions
Empirical Rule (68-95-99.7 Rule) (Use SOCS when making comparisons)
(applies to bell-shaped curves only)
Double Bar Charts


Note: range is approximately equal to 6 standard


deviations in a bell-shaped distribution.
Back-to-back Stemplots
Resistance to outliers
Not resistant Resistant
mean median
range mode
variance IQR
standard deviation
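A small Python sketch can confirm the transformation rules and the z-score definition on this page: adding a constant shifts the measures of center but leaves the measures of spread alone, while multiplying by a constant rescales both. The scores are invented, and the sample standard deviation (`statistics.stdev`) is an assumption about which SD is intended.

```python
from statistics import mean, median, stdev

def summarize(xs):
    """Center and spread measures for a list of numbers."""
    return {"mean": mean(xs), "median": median(xs),
            "range": max(xs) - min(xs), "sd": stdev(xs)}

def z_score(x, xs):
    """How many standard deviations x lies from the mean of xs."""
    return (x - mean(xs)) / stdev(xs)

scores = [60, 70, 80, 90, 100]      # hypothetical data
shifted = [x + 5 for x in scores]   # addition: center moves, spread does not
scaled = [x * 2 for x in scores]    # multiplication: center and spread both scale

for label, xs in [("original", scores), ("+5", shifted), ("x2", scaled)]:
    print(label, summarize(xs))
print("z-score of 100:", z_score(100, scores))
```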
Data Analysis II Data Analysis II Data Analysis II
Exploring bivariate data Least squares regression line (LSRL) Residual plots
Scatterplot

There could be many lines of best fit, but the residual = observed (actual) - predicted
x is the explanatory (independent) variable. one which minimizes the sum of the squares ê=y-ŷ
y is the response (dependent) variable. of residuals is called the least squares
A line of best fit describes the overall pattern. regression line. The residual plot is used as evidence that a
The correlation coefficient (r) gives the linear regression is a good fit when the
strength of association between the two residual plot shows no overall pattern as
variables. -1 ≤ r ≤ 1 shown below.

The sum of residuals is always zero.

LSRL passes through the point (x̄, ȳ) and has a The standard deviation of the residuals can
slope b1 , which has the same sign as r. be calculated as follows.

Population Regression Line


It gives a measure of how the data points are
spread around the regression line.
The coefficient of determination (r²) gives the Sample Regression Line
percentage of variation in y that is If the residual plot shows a pattern, it means
explained by the variation in x. that a nonlinear model is more appropriate.
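The LSRL facts on this page (slope with the same sign as r, line through (x̄, ȳ), residuals summing to zero) can be verified numerically. The sketch below uses the standard textbook formulas b1 = r·sy/sx, b0 = ȳ − b1·x̄, and s = sqrt(Σ residual² / (n − 2)); these are assumed to be the formulas intended here, and the data points are invented.

```python
from math import sqrt
from statistics import mean, stdev

def lsrl(xs, ys):
    """Least squares regression line: returns (b0, b1, r, residuals, s)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    # Correlation coefficient r from the sum of cross-deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx            # slope has the same sign as r
    b0 = y_bar - b1 * x_bar     # forces the line through (x_bar, y_bar)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # observed - predicted
    s = sqrt(sum(e * e for e in residuals) / (n - 2))        # SD of the residuals
    return b0, b1, r, residuals, s

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
b0, b1, r, residuals, s = lsrl(xs, ys)
print("y-hat =", b0, "+", b1, "x")
print("r =", r, " r^2 =", r * r)
print("sum of residuals =", sum(residuals), " s =", s)
```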
Data Analysis II Data Analysis III Probability Distribution
Transformation to achieve linearity Exploring categorical variables Law of large numbers
Instead of using nonlinear regression models, Marginal Frequency and Distribution Relative frequency tells you the percentage of
we can transform existing data so that the an event
that happened relative to the whole.

scatterplot shows a linear pattern (which Relative frequencies vary from experiment to
means that linear regression could be used). experiment.

The main transformations are log, square root, and When an experiment is performed a large
reciprocal transformations. number of times, the relative frequency
converges to a certain value. We call this
Before log transformation
value the probability of that event.

In other words, probability is long-term


relative frequency.

Calculating probabilities
Conditional Frequency and Distribution
General formulae

After log transformation

(for people who saw baby animals)

Mutually exclusive events


(implies that there is no intersection)

Independent events
(for people who saw adult animals) (implies that events do not influence each other)


Conditional probability
(probability of B given A)
(for people who saw tasty foods)
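The probability headings on this page (general formulae, mutually exclusive events, independent events, conditional probability) correspond to the standard identities below. They are given in their usual textbook form, which is assumed to match what the sheet intends.

```latex
% General addition and multiplication rules
P(A \cup B) = P(A) + P(B) - P(A \cap B)
P(A \cap B) = P(A)\,P(B \mid A)

% Mutually exclusive events (no intersection)
P(A \cap B) = 0 \quad\Rightarrow\quad P(A \cup B) = P(A) + P(B)

% Independent events (events do not influence each other)
P(A \cap B) = P(A)\,P(B), \qquad P(B \mid A) = P(B)

% Conditional probability (probability of B given A)
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
```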
Probability Distribution Probability Distribution Probability Distribution
Calculating multistage probabilities Binomial distribution (cont.) Geometric distribution
Product Principle For instance, when you toss a coin, the two When there are two possible outcomes
(multiply probabilities that possible
outcomes are heads (H) and tails (T).
(binomial)
and you want to find the probability
occur together or in series) But you can have various combinations of that the first success occurs on the nth trial,
them: HHHH, HHTT, TTHT, etc. model the situation using a geometric
distribution.
Each of these combinations has a probability
associated with them: binomial probability. For example, what is the probability that the
first honest man Diogenes encounters will be
the third man he meets?
where
Addition Principle
p is the probability of success This is clearly a binomial problem (person met
(add probabilities that cannot occur together or
q is the probability of failure (1-p) is honest or not honest), and we are
are from entirely different scenarios)
n is the number of trials interested in meeting an honest man on the
k is the number of successes third trial. This implies that Diogenes would
(n-k) is the number of failures not meet an honest man in the first and

second trials. So, the probability would be
Types of probability distributions (failure) x (failure) x (success). In general,
Alternatively, you can use the
Discrete Continuous binompdf (n, p, k) function to calculate
binomial normal specific binomial probabilities. where
geometric p is the probability of success
If you have to add up binomial probabilities q is the probability of failure (1-p)
Binomial distribution (starting from 0), use the binomcdf (n, p, k) k is the trial number when success occurs
Binomial distributions are used to model where k is the number of successes you want
problems
that have two possible outcomes to add up to. You can use the geometpdf (p, k) or the
(successes or failures). Examples of such geometcdf (p, k) accordingly.
scenarios include:
defective vs. not defective
Binomial distribution keywords Geometric distribution keywords
5 on a die vs. not 5 on a die (happens first, first success is, first occurrence is)
heads vs. tails binompdf binomcdf
score a goal vs. not score a goal exactly at most, at least geometpdf geometcdf
Remember that although there are only two ____ out of ____ more than, less than first, second, third no later than
possible outcomes, you can still have different
combinations of these two outcomes.
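The calculator functions named on this page can be mirrored in a few lines of Python to see what each one computes. The functions below are re-implementations that borrow the calculator names for readability, not the calculator's own code; the binomial formula is C(n, k)·p^k·q^(n−k) and the geometric formula is q^(k−1)·p. The success probability 0.2 in the Diogenes example is an invented value, since the sheet does not give one.

```python
from math import comb

def binompdf(n, p, k):
    """P(exactly k successes in n trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomcdf(n, p, k):
    """P(at most k successes in n trials): adds binompdf from 0 up to k."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

def geometpdf(p, k):
    """P(first success occurs on trial k): k-1 failures, then one success."""
    return (1 - p)**(k - 1) * p

def geometcdf(p, k):
    """P(first success occurs on or before trial k)."""
    return 1 - (1 - p)**k

print(binompdf(4, 0.5, 2))       # exactly 2 heads in 4 tosses
print(binomcdf(4, 0.5, 2))       # at most 2 heads in 4 tosses
print(1 - binomcdf(4, 0.5, 1))   # at least 2 heads = 1 - P(at most 1)
print(geometpdf(0.2, 3))         # first honest man is the third man met
print(geometcdf(0.2, 3))         # first honest man no later than the third
```

"At least k" and "more than k" questions use the complement of binomcdf, as in the third print line.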
Probability Distribution Probability Distribution Probability Distribution
Discrete distribution Discrete distribution (cont.) Designing simulations
When there are two or more possible Formula for discrete random variable
outcomes,
you can use a discrete distribution
to model the problem.

For example, a highway engineer knows that


his crew can lay 5 miles of highway on a clear
day, 2 miles on a rainy day, and only 1 mile on
Formula for binomial random variable
a snowy day. You can construct a discrete
probability distribution as follows.
In performing a simulation, you must do the
following.

1. Set up a correspondence between


outcomes and random numbers (0~6 is

Note that the binomial distribution is a special Combining random variables success and 7~9 is failure).
(random variables must be independent) 2. Give a procedure for choosing random
case of the discrete distribution when there
numbers.
are only two possible outcomes.
3. Give a stopping rule.

4. Note what is to be counted.
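The four simulation steps can be sketched in Python. The correspondence (digits 0 to 6 mean success, 7 to 9 mean failure) comes from the list above; the stopping rule of 10 digits per run, the 10,000 runs, and the seed are assumptions chosen only to make the example concrete.

```python
import random

def run_once(rng, trials=10):
    """One run: draw `trials` random digits, stop, and count the successes."""
    # Step 2: procedure for choosing random numbers (one digit per trial).
    # Step 3: stopping rule (stop after `trials` digits).
    digits = [rng.randint(0, 9) for _ in range(trials)]
    # Steps 1 and 4: digits 0-6 are successes; count them.
    return sum(1 for d in digits if d <= 6)

def simulate(runs=10_000, trials=10, seed=1):
    """Average number of successes per run over many repeated runs."""
    rng = random.Random(seed)
    counts = [run_once(rng, trials) for _ in range(runs)]
    return sum(counts) / runs

print(simulate())  # should land near 10 * 0.7 = 7 successes per run
```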
For example, in a lottery, 10,000 tickets are The variance of sums and the variance of
sold at $1 each with a prize of $7,500 for one differences
always adds the individual
winner. You can construct a discrete/binomial variances. Variance may be combined only
probability distribution as follows. when the two random variables are
independent.

There is no formula for combining standard


deviations of two random variables. So, you
In both cases, the discrete random variable must calculate the variances of each random
(usually denoted as X) is associated with a variable, use the variance combination
numerical value. formula above, and then take the square root
of the combined variance.
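The procedure for combining standard deviations, together with the mean and variance of a discrete random variable, can be sketched as follows. It uses the standard formulas μ = Σ x·P(x) and σ² = Σ (x − μ)²·P(x). Treating the lottery variable X as the net winnings per ticket (prize minus the $1 price) is an assumption, and the standard deviations 3 and 4 are invented.

```python
from math import sqrt

def mean_rv(dist):
    """Expected value of a discrete random variable given (value, prob) pairs."""
    return sum(x * p for x, p in dist)

def var_rv(dist):
    """Variance of a discrete random variable given (value, prob) pairs."""
    mu = mean_rv(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)

def sd_of_sum_or_difference(sd_x, sd_y):
    """SD of X + Y or X - Y for independent X and Y: variances always add."""
    return sqrt(sd_x ** 2 + sd_y ** 2)

# Lottery: 10,000 tickets at $1 each, one prize of $7,500.
lottery = [(7500 - 1, 1 / 10_000), (-1, 9_999 / 10_000)]
print("expected net winnings:", mean_rv(lottery))
print("standard deviation:", sqrt(var_rv(lottery)))
print("combined SD:", sd_of_sum_or_difference(3, 4))
```

Adding the standard deviations directly (3 + 4 = 7) would overstate the spread; the combined SD is 5.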

Transforming random variables


(random variables must be independent)
Probability Distribution Probability Distribution Probability Distribution
Normal distribution Normal distribution (cont.) Normal distribution (cont.)
The normal distribution is a type of continuous distribution. It is
symmetric, bell-shaped, and unimodal. Its two tails approach the
horizontal axis but never touch it.

It is useful in describing various natural phenomena.

The normal distribution is the limiting case of the binomial
distribution as n → ∞.

Finding area under a normal curve

The area under the normal curve is the probability of whatever you are
solving for.

You can use a z-table to find the area under a normal curve. For this
method, you need to calculate the z-score. Remember that the standard
z-table gives you the area to the left of that z-score.

Alternatively, you can use normalcdf (lower bound, upper bound, mean,
standard deviation) to find the area under a normal curve. You can enter
raw values into this function.

However, if you want to enter z-scores for the lower and upper bound,
you must set the mean to 0 and the standard deviation to 1.

To find the z-score given the area to its left under the normal curve,
use invNorm (area).
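For readers without a graphing calculator, the same areas can be found with Python's standard library. The wrappers below are a sketch that imitates the calculator functions using `statistics.NormalDist`; the names `normalcdf` and `inv_norm` and the example numbers (mean 100, SD 15) are illustrative choices.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Value whose area to the left equals area_left."""
    return NormalDist(mean, sd).inv_cdf(area_left)

# Raw values: area between 85 and 115 when mean = 100 and SD = 15.
print(normalcdf(85, 115, 100, 15))
# Same area with z-scores, so mean = 0 and SD = 1.
print(normalcdf(-1, 1))
# z-score with 97.5% of the area to its left.
print(inv_norm(0.975))
```

The first two results agree because 85 and 115 are exactly one standard deviation below and above the mean.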
Probability Distribution Probability Distribution Probability Distribution
Common probabilities and z-scores Checking for normality Checking for normality (cont.)
For statistical inference You must be able to decide whether it is Normal probability plot
reasonable
to assume the data come from a
normal population. This skill is especially
important when you have to do statistical
inference.
For percentile ranking

To check for normality, you should create a


graph of the data. For example, the ages at
inauguration of U.S. presidents were: {57, 61,
57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42,
51, 56, 55, 51, 54, 51, 60, 61, 43, 55, 56, 61, A diagonal straight line pattern in the normal
Normal approximation to binomial 52, 69, 64, 46, 54}. Can we conclude that the probability plot is an indication that the
distribution is roughly normal? distribution of data is roughly normal.
The binomial distribution takes values only at
integers,
while the normal distribution is

continuous with probabilities corresponding Parameter vs. statistic


to areas over intervals. A parameter is a number that describes

some characteristic of the population.
For approximation purposes, we think of each
binomial probability corresponding to the A statistic is a number that describes some
normal probability over a unit interval characteristic of a sample. It is used to
centered at the desired value. estimate the parameter of interest.

For example, to approximate the binomial Population Sample


probability of five successes we determine parameter statistic
the normal probability of being between 4.5 pop. proportion (p) sample proportion (p̂)
and 5.5. pop. mean (μ) sample mean (x̄)
pop. standard sample standard
deviation (σ) deviation (s)
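The continuity correction described on this page (the probability of five successes is approximated by the normal area between 4.5 and 5.5) can be compared with the exact binomial probability. The values n = 100 and p = 0.1 are assumptions picked so that np and n(1 − p) are both at least 10; the approximating normal curve uses mean np and standard deviation sqrt(np(1 − p)).

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_exact(n, p, k):
    """Exact binomial probability of k successes."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_approx(n, p, k):
    """Normal area over the unit interval centered at k."""
    dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
    return dist.cdf(k + 0.5) - dist.cdf(k - 0.5)

n, p, k = 100, 0.1, 5   # hypothetical setting
print("exact binomial:", binomial_exact(n, p, k))
print("normal approx.:", normal_approx(n, p, k))
```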
Probability Distribution Probability Distribution Probability Distribution
Sampling distribution Sampling distribution (cont.) Sampling distribution of p̂
The AP Statistics exam tests you on five types The population distribution of a variable When we want information about the
of sampling distributions. describes the values of the variable for all population proportion p of successes, we
sampling distribution of sample individuals in a population. often take an SRS and use the sample
proportions proportion p̂ to estimate the unknown
sampling distribution of sample means The sample distribution describes the values parameter p.
sampling distribution of differences of the variable for all individuals in a
between sample proportions particular sample. The sampling distribution of the sample
sampling distribution of differences proportion p̂ describes how the statistic p̂
between sample means Biased and unbiased estimators varies in all possible samples of the same size
sampling distribution of slope of sample from the population.
A statistic can be an unbiased estimator or a
LSRL
biased estimator of a parameter.

The mean of the sampling distribution of p̂ is $\mu_{\hat{p}} = p$.
When random samples are taken from a
A statistic is an unbiased estimator if the
population, the sample statistics vary from So, p̂ is an unbiased estimator of p.
center (mean) of its sampling distribution is
sample to sample. This natural deviation is
equal to the true value of the parameter.
called sampling variability. It refers to the The standard deviation of the sampling
fact that different random samples of the distribution of p̂ is
When trying to estimate a parameter, choose
same size from the same population produce
a statistic with low or no bias and minimum
different values for a statistic.
variability.

For example, every sample taken will have a Conditions you need to check are:
unique sample proportion p̂. The various p̂ SRS
values possible can then be plotted to create 10% condition (n < 0.10N)
a distribution. This distribution of various large counts condition (np≥10, n(1-p)≥10)
sample proportions is called the sampling state that since np≥10 and n(1-
distribution of sample proportions. p)≥10, the sampling distribution of
p̂ is approximately normal by the
A similar case can be made for sampling large counts condition.
distributions of other sample statistics.

Note that sampling distributions are


different from sample distributions and
population distributions.
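Sampling variability is easy to see by simulation. The sketch below draws many random samples of the same size, records p̂ for each, and compares the center and spread of those p̂ values with the standard results: mean p and standard deviation sqrt(p(1 − p)/n). The population proportion 0.3, sample size 100, number of samples, and seed are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def sample_proportions(p, n, num_samples, seed=0):
    """Simulate p-hat for num_samples random samples of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]

p, n = 0.3, 100
p_hats = sample_proportions(p, n, 5000)

print("mean of p-hat values:", mean(p_hats), "(theory:", p, ")")
print("SD of p-hat values:  ", pstdev(p_hats),
      "(theory:", sqrt(p * (1 - p) / n), ")")
```

A histogram of `p_hats` would show the roughly normal shape promised by the large counts condition, since np = 30 and n(1 − p) = 70.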
Probability Distribution Probability Distribution Probability Distribution
Sampling distribution of x̄ Sampling distribution of p̂1-p̂2 Sampling distribution of x̄1-x̄2
When we want information about the The mean of the sampling distribution of p̂1- The mean of the sampling distribution of x̄1-
population mean μ for some quantitative p̂2 is x̄2 is
variable, we often take an SRS and use the
sample mean x̄ to estimate the unknown So p̂1-p̂2 is an unbiased estimator of p1-p2. So x̄1-x̄2 is an unbiased estimator of μ1-μ2.
parameter μ.
The standard deviation of the sampling The standard deviation of the sampling
The sampling distribution of the sample mean distribution of p̂1-p̂2 is distribution of x̄1-x̄2 is
x̄ describes how the statistic x̄ varies in all
possible samples of the same size from the
population.

Conditions you need to check for both Conditions you need to check for both
The mean of the sampling distribution of x̄ is samples are: samples are:
SRS for both samples SRS for both samples
So x̄ is an unbiased estimator of μ. 10% condition for both samples 10% condition for both samples
n1 < 0.10N1 and n2 < 0.10N2 n1 < 0.10N1 and n2 < 0.10N2
The standard deviation of the sampling large counts condition for both samples normality for both samples
distribution of x̄ is n1p1≥10, n1(1-p1)≥10 if both are normal, say that they are
n2p2≥10, n2(1-p2)≥10 if both are not normal, use central
state that since the large counts limit theorem on both samples and
condition is met for both samples, state that since the central limit
Conditions you need to check are: the sampling distribution of p̂1-p̂2 theorem is met for both samples
SRS is approximately normal. (n1≥30 and n2≥30), the sampling
10% condition (n < 0.10N) independence condition distribution of x̄1-x̄2 is
normality mention that the two samples are approximately normal.
if normal, say that it is independent random samples if one is normal but the other isn't,
if not normal, use central limit state that it is normal for the normal
theorem and state that since n≥30, data but use the central limit theorem
the sampling distribution of x̄ is on the other
approximately normal by the independence condition
central limit theorem. mention that the two samples are
independent random samples
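The means and standard deviations referred to on this page are the standard results below, written in LaTeX. They are the usual formula-sheet forms, which are assumed to be the ones intended here.

```latex
% Sampling distribution of the sample mean
\mu_{\bar{x}} = \mu, \qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

% Sampling distribution of the difference in sample proportions
\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, \qquad
\sigma_{\hat{p}_1 - \hat{p}_2} =
  \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}

% Sampling distribution of the difference in sample means
\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2, \qquad
\sigma_{\bar{x}_1 - \bar{x}_2} =
  \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
```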
Probability Distribution Probability Distribution Statistical Inference I
Sampling distribution of b1 Sampling distribution of b1 (cont.) Confidence interval
The mean of the sampling distribution of b1 is distribution of residuals is approximately The AP Statistics exam tests you on five types
normal of confidence intervals (CI).
So b1 is an unbiased estimator of β1. CI for population proportion
CI for population mean
The standard deviation of the sampling CI for difference between population
distribution of b1 is proportions
CI for difference between population
means
Conditions you need to check are: CI for slope of the LSRL
SRS
10% condition (n < 0.10N) A confidence interval gives an interval of
scatterplot of sample data is plausible values for a parameter based on
approximately linear sample data.

A point estimator is a statistic that provides


an estimate of a population parameter.

The value of that statistic from a sample is


no apparent pattern in residuals plot called a point estimate.
(=equal SD; residuals have roughly equal
variability at all x-values in sample data) The confidence level c gives the overall
success rate of the method used to calculate
the confidence interval. To interpret the
confidence level: if we were to select many
random samples from a population and
construct a [c]% confidence interval
using each sample, about [c]% of the
intervals would capture the true
[parameter in context].
Statistical Inference I Statistical Inference I Statistical Inference I
Confidence interval (cont.) CI for population proportion CI for population mean
To interpret the confidence interval: we are σ is known σ is unknown
[c]% confident that the interval from
[lower bound] to [upper bound] captures
the true [parameter in context]. where
where
The margin of error of an estimate
describes how far, at most, we expect the Identify: one sample z interval for p Identify: one sample z interval for μ OR
estimate to vary from the true population
one sample t interval for μ (df = n-1)
value. Conditions:


SRS Conditions:
Affecting margin of error 10% condition (n<0.10N) SRS
In general, we prefer an estimate with a small large counts condition (np̂≥10, n(1-p̂)≥10) 10% condition (n<0.10N)
margin of error. The margin of error gets state that the number of successes normality
smaller when: (np̂) and the number of failures if normal, say that it is
the confidence level decreases. To (n(1-p̂)) are both greater than or if not normal, use central limit
obtain a smaller margin of error from the equal to 10, so the sampling theorem and state that since n≥30,
same data, you must be willing to accept distribution of p̂ is approximately the sampling distribution of x̄ is
less confidence. normal. approximately normal by the
central limit theorem
Calculate: if the conditions are met, perform if not normal and n < 30, draw graph
the sample size n increases. In general, the calculations (1-PropZInt) to check for normality, no strong
increasing the sample size n reduces the skewness, and no outliers
margin of error for any fixed confidence Conclude: interpret your confidence interval
level. in the context of the problem. Calculate: if the conditions are met, perform
Sample size for a desired margin of error the calculations (ZInterval or TInterval)

Conclude: interpret your confidence interval


The critical value is a multiplier that makes in the context of the problem.
the interval wide enough to have the stated
capture rate. The critical value depends on where Sample size for a desired margin of error
both the confidence level c and the sampling
distribution of the statistic.
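The pieces of a confidence interval (point estimate, critical value, margin of error) and the sample size calculation can be sketched in Python for the one-sample z interval for p. The formulas assumed are the standard ones: p̂ ± z*·sqrt(p̂(1 − p̂)/n) for the interval and n ≥ z*²·p(1 − p)/ME² for the sample size, with p = 0.5 as the conservative guess. The function names and the counts (520 successes out of 1000) are invented.

```python
from math import sqrt, ceil
from statistics import NormalDist

def critical_z(conf_level):
    """Critical value z* that leaves (1 - c)/2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

def one_prop_z_interval(successes, n, conf_level=0.95):
    """One-sample z interval for p: p-hat +/- z* * standard error."""
    p_hat = successes / n
    margin = critical_z(conf_level) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size_for_proportion(margin, conf_level=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    z = critical_z(conf_level)
    return ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

print(one_prop_z_interval(520, 1000))      # 95% interval
print(one_prop_z_interval(520, 1000, 0.90))  # lower confidence, narrower interval
print(sample_size_for_proportion(0.03))    # n needed for a 3% margin of error
```

Comparing the first two lines shows the trade-off described above: lowering the confidence level shrinks the margin of error.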
Statistical Inference I Statistical Inference I Statistical Inference I
CI for difference in CI for difference in CI for slope of
population proportions population means population regression line
σ is known σ is unknown

Identify: one sample t interval for β (df = n-2)


Identify: two sample z interval for μ1-μ2 OR
Identify: two sample z interval for p1-p2 Conditions:
two sample t interval for μ1-μ2
SRS

10% condition (n<0.10N)
Conditions: *df = (n1-1 or n2-1, whichever is smaller) OR
SRS for both samples scatterplot of sample data is
(use technology for precision)
10% condition for both samples approximately linear
n1 < 0.10N1 and n2 < 0.10N2 no apparent pattern in residuals plot
Conditions:
large counts condition for both samples (=equal SD; residuals have roughly equal
SRS for both samples
n1p̂1≥10, n1(1-p̂1)≥10 variability at all x-values in sample data)
10% condition for both samples
n2p̂2≥10, n2(1-p̂2)≥10 distribution of residuals is approximately
n1 < 0.10N1 and n2 < 0.10N2
state that the numbers of normal
normality for both samples
successes (n1p1, n2p2) and the if normal, say that it is
numbers of failures (n1(1-p1), Calculate: if the conditions are met, perform
if not normal, use central limit
n2(1-p2)) are both greater than or the calculations (LinRegTInt)
theorem and state that since n≥30,
equal to 10, so the sampling the sampling distribution of x̄1- x̄2
distribution of p̂1-p̂2 is Conclude: interpret your confidence interval
is approximately normal by the
approximately normal. in the context of the problem.
central limit theorem
independence condition if not normal and n < 30, draw graph
mention that the two samples are to check for normality, no strong
independent random samples skewness, and no outliers
independence condition
Calculate: if the conditions are met, perform mention that the two samples are IMPORTANT
the calculations (2-PropZInt) For paired data (data that are not independent), create
independent random samples
a new variable d, which is the variable for differences,
by taking the difference between each value x1 from sample 1 and the
Conclude: interpret your confidence interval Calculate: if the conditions are met, perform corresponding value x2 from sample 2.
in the context of the problem. the calculations (2-SampZInt or 2-SampTInt)
Then, you need to create a one sample t interval
using the new variable d by using TInterval.
Conclude: interpret your confidence interval
in the context of the problem.
Statistical Inference II Statistical Inference II Statistical Inference II
Test of significance for Types of hypothesis tests Type I and Type II Errors
quantitative data: hypothesis test The inequality symbol used in the alternative When we make a conclusion in a significance
hypothesis determines whether a one-tailed test, there are two kinds of mistakes we can
The AP Statistics exam tests you on five types
or two-tailed test should be performed. make.
of significance tests for quantitative data.
hypothesis test for population proportion
When the < or > symbol is used, conduct a
hypothesis test for population mean
one-tailed test. One-tailed tests are also
hypothesis test for difference between
known as one-sided tests.
population proportions
hypothesis test for difference between
When the ≠ symbol is used, conduct a two-
population means
tailed test. Two-tailed tests are also known
hypothesis test for slope of population
as two-sided tests. Do not forget to double You need to be able to describe type I and
LSRL
type II errors in context.

Confidence interval vs.


significance test

the p-value at one tail for two-tailed tests.

P-value vs. significance level


The probability of making a type I error is α.
The probability of making a type II error is β.

Confidence intervals aim to estimate over The p-value refers to the probability of
α and β are inversely related: with the sample size fixed, lowering one raises the other. They do
which interval the unknown parameter may getting evidence for the alternative
not necessarily add up to 1.
lie. hypothesis as strong or stronger than the
observed evidence assuming the null
The probability of avoiding a type II error is
Significance tests aim to investigate whether hypothesis is true.
called power. Power = 1 - β
the claimed parameter value is plausible or needs to be
changed. The significance level (α) is the value that
we use as a boundary for deciding whether
Types of hypotheses an observed result is unlikely to happen by
The hypothesis test is conducted by setting chance alone assuming the null hypothesis is
up the null hypothesis (Ho) and the true. α = 1 - c
alternative hypothesis (Ha). Ha is also
known as the research hypothesis. In a hypothesis test, the p-value is compared
with the significance level (α).
The null hypothesis almost always uses the
equality symbol (=) while the alternative
hypothesis uses inequality symbols (<, >, ≠).
Statistical Inference II Statistical Inference II Statistical Inference II
HT for population proportion HT for population proportion HT for population mean
Identify: one sample z test for p Calculate: if the conditions are met, perform Identify: one sample z test for μ OR
the calculations (1-PropZTest) one sample t test for μ (df = n-1)

Alternatively, find the p-value using


normalcdf and compare this value with the
significance level.

When calculating the standard deviation of


sample proportions, use the following
formula.

where p0 is the null proportion, not the


sample proportion.

where p is (description in context). Conclude: Since the p-value is (less where μ is (description in context).
than/greater than or equal to) the

State the significance level. (If not given, use significance level α, we (reject/fail to State the significance level. (If not given, use
0.05). reject) the null hypothesis. We (have/do 0.05).
not have) sufficient evidence that (state
Conditions: your alternative hypothesis). Conditions:
SRS SRS
10% condition (n<0.10N) 10% condition (n<0.10N)
large counts condition (np0≥10, n(1- normality
p0)≥10) if normal, say that it is
state that the number of successes if not normal, use CLT and state that
(np0) and the number of failures since n≥30, the sampling
(n(1-p0)) are both greater than or distribution of x̄ is approximately
equal to 10, so the sampling normal by CLT
distribution of p̂ is approximately if not normal and n < 30, draw graph
normal. to check for normality, no strong
skewness, and no outliers
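The calculation step of the one-sample z test for p can be sketched in Python. The test statistic is z = (p̂ − p0) / sqrt(p0(1 − p0)/n), using the null proportion p0 in the standard deviation as noted above, and the p-value is doubled for a two-sided alternative. The function name, the `alternative` labels, and the counts (60 successes in 100 trials against p0 = 0.5) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for H0: p = p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)     # uses p0, not p-hat
    z = (p_hat - p0) / se
    area_left = NormalDist().cdf(z)
    if alternative == "less":        # Ha: p < p0
        p_value = area_left
    elif alternative == "greater":   # Ha: p > p0
        p_value = 1 - area_left
    else:                            # Ha: p != p0, double the one-tail area
        p_value = 2 * min(area_left, 1 - area_left)
    return z, p_value

alpha = 0.05
z, p_value = one_prop_z_test(60, 100, 0.5, alternative="greater")
print("z =", z, " p-value =", p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```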
Statistical Inference II Statistical Inference II Statistical Inference II
HT for population mean HT for difference in HT for difference in
Calculate: if the conditions are met, perform population proportions population proportions
the calculations (Z-Test or T-Test) Identify: two sample z test for p1-p2

Alternatively, you can find the p-value using


where p is the pooled proportion, which can
normalcdf or tcdf and compare this value Conditions:
be calculated using the formula below.
with the significance level. SRS for both samples


10% condition for both samples
When using tcdf, remember that df = n - 1. n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
Conclude: Since the p-value is (less
When calculating the standard deviation of n1p1≥10, n1(1-p1)≥10
than/greater than or equal to) the
sample means, use the following formula. n2p2≥10, n2(1-p2)≥10
significance level α, we (reject/fail to
state that the numbers of
σ is known σ is unknown reject) the null hypothesis. We (have/do
successes (n1p1, n2p2) and the
not have) sufficient evidence that (state
numbers of failures (n1(1-p1),
your alternative hypothesis).
n2(1-p2)) are both greater than or
equal to 10, so the sampling
Conclude: Since the p-value is (less distribution of p̂1-p̂2 is
than/greater than or equal to) the approximately normal.
significance level α, we (reject/fail to independence condition
reject) the null hypothesis. We (have/do mention that the two samples are
not have) sufficient evidence that (state independent random samples
your alternative hypothesis).
Calculate: if the conditions are met, perform
the calculations (2-PropZTest)

Alternatively, find the p-value using


normalcdf and compare this value with the
significance level.

When calculating the standard deviation of


the difference in sample proportions, use the
pooled proportion in the following formula.
Statistical Inference II Statistical Inference II Statistical Inference II
HT for difference in HT for difference in HT for slope of
population means population means population regression line
Identify: two sample z test for μ1-μ2 OR When calculating the standard deviation of Identify: t test for slope β1 (df = n-2)
two sample t test for μ1-μ2 sample means, use the following formula.
σ is known σ is unknown Remember, β1 = 0 simply states that there is no
linear relationship between two variables.
Conditions:

SRS for both samples For Ha, use
10% condition for both samples β1 > 0 if you need to prove a positive linear
n1 < 0.10N1 and n2 < 0.10N2 Conclude: Since the p-value is (less relationship between two variables,
normality for both samples than/greater than or equal to) the β1 < 0 if you need to prove a negative linear
if normal, say that it is significance level α, we (reject/fail to relationship between two variables, or
if not normal, use central limit reject) the null hypothesis. We (have/do β1 ≠ 0 if you need to show there is "some"
theorem and state that since n≥30, not have) sufficient evidence that (state linear relationship between two variables
the sampling distribution of x̄1- x̄2 your alternative hypothesis).
is approximately normal by CLT Conditions:
if not normal and n < 30, draw graph IMPORTANT SRS
check for normality, no strong For paired data (data that are not 10% condition (n<0.10N)
skewness, and no outliers independent), create a new variable d, which scatterplot of sample data is approx linear
independence condition is the variable for differences, by taking the no apparent pattern in residuals plot
mention that the two samples are differences of x̄1 from sample 1 and the (=equal SD; residuals have roughly equal
independent random samples corresponding x̄2 from sample 2. variability at all x-values in sample data)
distribution of residuals it approx normal
Calculate: if the conditions are met, perform Then, you need to perform a one sample t
the calculations (2-SampZTest or 2- test using the new variable d as shown below. Calculate: if the conditions are met, perform
SampTTest) Do not perform a two sample test on paired the calculations (LinRegTTest)
data!
Alternatively, you can find the p-value using Conclude: Since the p-value is (less
normalcdf or tcdf and compare this value Identify: paired t-test for the set of differences than/greater than or equal to) the
with the significance level. significance level α, we (reject/fail to
reject) the null hypothesis. We (have/do
When using tcdf, use df = n1-1 or df = n2-1, Then, follow a similar procedure for HT for not have) sufficient evidence that (state
whichever is smaller. Or use technology for a population mean. your alternative hypothesis).
more precise df.
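
For the slope test, the quantity being tested can be sketched in Python using the standard textbook formulas: the least-squares slope b1, its standard error, and t = b1 / SE(b1) with df = n-2. This is only an illustration under those assumptions, not the calculator's LinRegTTest itself; the function name `slope_t_statistic` and the five data points are made up. The p-value is left to tcdf, since the Python standard library has no t-distribution CDF.

```python
from math import sqrt


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for a test of H0: beta1 = 0 (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # least-squares slope
    b0 = y_bar - b1 * x_bar     # least-squares intercept
    # Residual standard deviation s uses n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    se_b1 = s / sqrt(sxx)       # standard error of the slope
    t = b1 / se_b1
    return b1, se_b1, t, n - 2


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se_b1, t, df = slope_t_statistic(x, y)
print(round(b1, 3), round(se_b1, 4), round(t, 3), df)
```

With t and df in hand, the p-value comes from tcdf over the tail (or tails) named in Ha.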
Statistical Inference III
Test of significance for qualitative data: chi-square test
The AP Statistics exam tests you on three types of significance tests for qualitative data.
  chi-square test for goodness-of-fit
  chi-square test for independence
  chi-square test for homogeneity

Chi-square test statistic
The chi-square test statistic is a measure of how far the observed counts are from the expected counts.
  χ2 = Σ (observed - expected)² / expected

You should be able to do a follow up analysis on which category has the largest contribution to the chi-square test statistic.

Chi-square distribution
A chi-square distribution is defined by a density curve that takes only nonnegative values and is skewed to the right. A particular chi-square distribution is specified by its degrees of freedom.

Chi-square test for goodness-of-fit
The χ2 test for goodness-of-fit compares the distribution of observed counts in the sample with the distribution of expected counts if Ho were true.

The expected count for any category is found by multiplying the sample size (n) by the proportion in each category according to the null hypothesis.

Data is usually given in a one-way table.

Identify: chi-square test for goodness-of-fit

Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5

Calculate: if the conditions are met, perform the calculations (χ2 GOF-Test)

Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = (number of categories)-1.

Conclude: compare p-value and significance level; reject or fail to reject Ho

Chi-square test for homogeneity
The χ2 test for homogeneity compares the distribution of a single categorical variable for each of several populations.

The expected count for any category is found using the formula below.
  expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for homogeneity

Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5

Calculate: if the conditions are met, perform the calculations (χ2-Test)

Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].

Conclude: compare p-value and significance level; reject or fail to reject Ho
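
The expected counts and the chi-square statistic for both table types can be sketched in Python as below. This is a minimal standard-library illustration, not the calculator's χ2 GOF-Test or χ2-Test; the helper names (`chi_square`, `gof_expected`, `two_way_expected`) and all counts are made up. The per-cell contributions are returned as well, since they support the follow-up analysis of which category contributes most.

```python
def chi_square(observed, expected):
    """Return (statistic, contributions) for matching lists of counts."""
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(contributions), contributions


def gof_expected(n, proportions):
    """Goodness-of-fit: sample size times each proportion claimed by Ho."""
    return [n * p for p in proportions]


def two_way_expected(table):
    """Two-way table: (row total * column total) / table total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Goodness-of-fit with made-up counts; Ho claims proportions 0.25, 0.50, 0.25.
observed = [30, 50, 20]
expected = gof_expected(sum(observed), [0.25, 0.50, 0.25])
gof_stat, gof_parts = chi_square(observed, expected)
gof_df = len(observed) - 1                      # (number of categories) - 1

# Two-way table with made-up counts (rows = populations, columns = categories).
table = [[20, 30], [30, 20]]
exp_table = two_way_expected(table)
flat_obs = [cell for row in table for cell in row]
flat_exp = [cell for row in exp_table for cell in row]
tw_stat, tw_parts = chi_square(flat_obs, flat_exp)
tw_df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1) x (c - 1)

print(gof_stat, gof_df, tw_stat, tw_df)
```

Before trusting either statistic, check that every expected count is at least 5; the p-value then comes from χ2cdf with the matching df.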
Chi-square test for independence
The χ2 test for independence is used to test the association/relationship between two categorical variables in a single population.

The null hypothesis is that there is no association between the two categorical variables in the population of interest.

Another way to state the null hypothesis is that the two categorical variables are independent in the population of interest.

The expected count for any category is found using the formula below.
  expected count = (row total × column total) / table total

Data is usually given in a two-way table.

Identify: chi-square test for independence

Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5

Calculate: if the conditions are met, perform the calculations (χ2-Test)

Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].

Conclude: compare p-value and significance level; reject or fail to reject Ho

How to Get a 5 in AP Statistics
Get a perfect score on the MCQs.
Write something (that is sensical) on the FRQs.
And do your homework.