AP Statistics Key Summary
double-blind
테스트프렙어학원
Note: the leaves may not be skipped and the key must be clearly indicated.
Cumulative Relative Frequency (CRF) Plot
Note: the median value can be found by drawing a horizontal line across 0.5 on the y-axis.

Boxplot
Note: min, Q1, median, Q3, max are indicated. Outliers are indicated by separate dots.

Choosing the right graphs
Qualitative (Categorical) Variable: dotplot, bar chart
Quantitative (Numerical) Variable: dotplot, histogram, stemplot, boxplot, CRF plot

Describing distributions (SOCS)
Shape: symmetric, skewed to the right (median < mean), skewed to the left (median > mean), bell-shaped, uniform, uni vs. bimodal
Outlier (and gaps): by inspection, or by formula (less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR)
Center: mean, median (divides area under graph into two equal parts), mode
Spread: range, interquartile range, variance, standard deviation

Center (Measures of Central Tendency)
mean: add all values and divide by n
median: arrange values in ascending order and do one of the following
  if n is odd, select the middle value
  if n is even, take the average of the two middle values
mode: most frequently occurring value

Spread (Measures of Dispersion)
range: max - min
interquartile range: Q3 - Q1
variance: square of the standard deviation
standard deviation: typical distance of the values from the mean (square root of the variance)
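The center, spread, and outlier rules above can be sketched in Python. This is only an illustration: the data values are made up, and Q1 and Q3 are taken as the medians of the lower and upper halves, which is an assumption (calculators and software may use slightly different quartile rules).

```python
from statistics import mean, median, stdev, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are the medians of the lower and upper halves (an assumed
    convention); the middle value is excluded from both halves when n is odd.
    """
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half + 1:] if len(xs) % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def find_outliers(values):
    """Flag values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # made-up values
print(five_number_summary(data))
print(find_outliers(data))
print(mean(data), max(data) - min(data))   # mean and range
print(variance(data), stdev(data))         # sample versions (divide by n - 1)
```

Note how the single large value pulls the mean well above the median, which matches the "skewed to the right: median < mean" rule.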
Data Analysis I

Measures of Position
simple ranking: indicates rank from an ordered list
percentile ranking: indicates the percentage of values under the value under consideration
z-score: indicates specifically by how many standard deviations the value under consideration varies from the mean

Empirical Rule (68-95-99.7 Rule)
(applies to bell-shaped curves only)

Transforming distributions
Addition and subtraction affect: mean, median, mode
Multiplication and division affect: mean, median, mode, range, IQR, variance, standard deviation

Comparing distributions
(Use SOCS when making comparisons)
Double Bar Charts
Parallel Boxplots
Overlapping CRF Plots
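A short Python sketch of the z-score and of how transformations change center and spread. The numbers are made up for illustration, and `z_score` is a hypothetical helper name.

```python
from statistics import mean, stdev


def z_score(value, mu, sigma):
    """Number of standard deviations that value lies from the mean."""
    return (value - mu) / sigma


scores = [2, 4, 4, 6, 9]               # made-up values
shifted = [x + 10 for x in scores]     # addition: center moves, spread does not
scaled = [x * 3 for x in scores]       # multiplication: center and spread both change

print(z_score(85, 70, 10))             # a value of 85 when mean = 70, SD = 10
print(mean(scores), mean(shifted), mean(scaled))
print(stdev(scores), stdev(shifted), stdev(scaled))
```

Adding 10 raises the mean by 10 and leaves the standard deviation unchanged; multiplying by 3 multiplies both the mean and the standard deviation by 3.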
x is the explanatory (independent) variable.
y is the response (dependent) variable.
A line of best fit describes the overall pattern.
The correlation coefficient (r) gives the strength and direction of the linear association between the two variables. -1 ≤ r ≤ 1
There could be many lines of best fit, but the one which minimizes the sum of the squares of residuals is called the least squares regression line (LSRL).
The LSRL passes through the point (x̄, ȳ) and has a slope b1, which has the same sign as r.
residual = observed (actual) - predicted
ê = y - ŷ
The residual plot is used as evidence that a linear regression is a good fit when the residual plot shows no overall pattern.
The standard deviation of the residuals can be calculated as follows: $s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$
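The LSRL facts above can be checked with a small Python sketch. The data are made up, `least_squares_line` is a hypothetical helper, and the slope and intercept use the standard least-squares formulas rather than anything specific to this handout.

```python
from math import sqrt
from statistics import mean


def least_squares_line(xs, ys):
    """Return (b0, b1, r, residuals, s) for the least squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)          # correlation coefficient
    b1 = sxy / sxx                     # slope (same sign as r)
    b0 = y_bar - b1 * x_bar            # forces the line through (x_bar, y_bar)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # observed - predicted
    s = sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))  # SD of residuals
    return b0, b1, r, residuals, s


xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values
b0, b1, r, residuals, s = least_squares_line(xs, ys)
print(b0, b1, r, s)
print(residuals)
```

The residuals of a least-squares line sum to (approximately) zero, which is why the residual plot is read for pattern rather than for average height.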
scatterplot shows a linear pattern (which means that linear regression could be used).
The main transformations are log, square root, and reciprocal transformations.
Before log transformation

Relative frequencies vary from experiment to experiment.
When an experiment is performed a large number of times, the relative frequency converges to a certain value. We call this value the probability of that event.
Calculating probabilities
General formulae
Independent events (implies that events do not influence each other)
Conditional probability (probability of B given A)

Conditional Frequency and Distribution
(for people who saw adult animals)
(for people who saw tasty foods)
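The formulas for this part did not survive in the text, so the sketch below uses the standard probability rules (general addition rule, conditional probability, and the multiplication test for independence) on hypothetical counts. Treat it as an illustration under those assumptions, not as the handout's own worked example.

```python
from fractions import Fraction

# Hypothetical counts out of 100 people (not from the handout).
total = 100
n_a = 50          # people in event A
n_b = 40          # people in event B
n_a_and_b = 20    # people in both A and B

p_a = Fraction(n_a, total)
p_b = Fraction(n_b, total)
p_a_and_b = Fraction(n_a_and_b, total)

p_a_or_b = p_a + p_b - p_a_and_b        # general addition rule
p_b_given_a = p_a_and_b / p_a           # conditional probability of B given A
independent = p_a_and_b == p_a * p_b    # independent if P(A and B) = P(A)P(B)

print(p_a_or_b, p_b_given_a, independent)
```

With these counts P(B given A) equals P(B), which is another way of seeing that A and B do not influence each other.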
Probability Distribution

Calculating multistage probabilities
Product Principle (multiply probabilities that occur together or in series)
Addition Principle (add probabilities that cannot occur together or are from entirely different scenarios)

Types of probability distributions
Discrete: binomial, geometric
Continuous: normal

Binomial distribution
Binomial distributions are used to model problems that have two possible outcomes (successes or failures). Examples of such scenarios include:
  defective vs. not defective
  5 on a die vs. not 5 on a die
  heads vs. tails
  score a goal vs. not score a goal
Remember that although there are only two possible outcomes, you can still have different combinations of these two outcomes.
For instance, when you toss a coin, the two possible outcomes are heads (H) and tails (T). But you can have various combinations of them: HHHH, HHTT, TTHT, etc.
Each of these combinations has a probability associated with it: the binomial probability.
$P(X = k) = \binom{n}{k} p^k q^{n-k}$
where
  p is the probability of success
  q is the probability of failure (1-p)
  n is the number of trials
  k is the number of successes
  (n-k) is the number of failures
Alternatively, you can use the binompdf (n, p, k) function to calculate specific binomial probabilities.
If you have to add up binomial probabilities (starting from 0), use the binomcdf (n, p, k) function, where k is the number of successes you want to add up to.

Binomial distribution keywords
binompdf: exactly, ____ out of ____
binomcdf: at most, at least, more than, less than

Geometric distribution
When there are two possible outcomes (binomial) and you want to find the probability that the first success occurs on the nth trial, model the situation using a geometric distribution.
For example, what is the probability that the first honest man Diogenes encounters will be the third man he meets?
This is clearly a two-outcome (binomial) problem (person met is honest or not honest), and we are interested in meeting an honest man on the third trial. This implies that Diogenes would not meet an honest man in the first and second trials. So, the probability would be (failure) x (failure) x (success). In general,
$P(X = k) = q^{k-1} p$
where
  p is the probability of success
  q is the probability of failure (1-p)
  k is the trial number when success occurs
You can use the geometpdf (p, k) or the geometcdf (p, k) accordingly.

Geometric distribution keywords
(happens first, first success is, first occurrence is)
geometpdf: first, second, third
geometcdf: no later than
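The calculator functions named in this section can be mimicked in plain Python. The functions below are stand-ins written for illustration (they borrow the calculator names but are not the calculator's implementation), and the probabilities used are made up.

```python
from math import comb


def binompdf(n, p, k):
    """P(exactly k successes in n trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomcdf(n, p, k):
    """P(at most k successes in n trials): adds binompdf from 0 up to k."""
    return sum(binompdf(n, p, i) for i in range(k + 1))


def geometpdf(p, k):
    """P(first success occurs on trial k): (k - 1) failures, then a success."""
    return (1 - p) ** (k - 1) * p


def geometcdf(p, k):
    """P(first success occurs no later than trial k)."""
    return 1 - (1 - p) ** k


print(binompdf(4, 0.5, 2))       # exactly 2 heads in 4 tosses
print(binomcdf(4, 0.5, 1))       # at most 1 head in 4 tosses
print(1 - binomcdf(4, 0.5, 0))   # at least 1 head = 1 - P(0 heads)
print(geometpdf(0.5, 3))         # first head on the third toss
print(geometcdf(0.5, 3))         # first head no later than the third toss
```

"At least" and "more than" questions are answered with the complement, as in the third line: subtract the cumulative probability of the values you do not want from 1.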
Discrete distribution
When there are two or more possible outcomes, you can use a discrete distribution to model the problem.
Note that the binomial distribution is a special case of the discrete distribution when there are only two possible outcomes.
For example, in a lottery, 10,000 tickets are sold at $1 each with a prize of $7,500 for one winner. You can construct a discrete/binomial probability distribution as follows.
Formula for discrete random variable

Combining random variables
(random variables must be independent)
The variance of a sum and the variance of a difference both add the individual variances. Variances may be combined only when the two random variables are independent.

Designing simulations
1. ... success and 7~9 is failure).
2. Give a procedure for choosing random numbers.
3. Give a stopping rule.
4. Note what is to be counted.
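The lottery example can be worked through in Python. The sketch assumes the random variable is the net gain per ticket (prize minus the $1 price), which the surviving text does not state explicitly, and it uses the standard mean and variance formulas for a discrete random variable.

```python
def expected_value(dist):
    """Mean of a discrete random variable given (value, probability) pairs."""
    return sum(x * p for x, p in dist)


def rv_variance(dist):
    """Variance of a discrete random variable given (value, probability) pairs."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist)


def variance_of_sum_or_difference(var_x, var_y):
    """For independent X and Y, Var(X + Y) and Var(X - Y) both equal Var(X) + Var(Y)."""
    return var_x + var_y


# Net gain per $1 ticket (assumed definition): win 7500 - 1, otherwise lose 1.
lottery = [(7499, 1 / 10000), (-1, 9999 / 10000)]

print(expected_value(lottery))   # average net gain per ticket
print(rv_variance(lottery))
print(variance_of_sum_or_difference(rv_variance(lottery), rv_variance(lottery)))
```

The expected net gain is negative, so on average a ticket buyer loses money. The last line treats two tickets as independent, for example tickets in two separate lotteries of the same design.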
For example, every sample taken will have a unique sample proportion p̂. The various p̂ values possible can then be plotted to create a distribution. This distribution of various sample proportions is called the sampling distribution of sample proportions.
A similar case can be made for sampling distributions of other sample statistics.

Sampling distribution of p̂
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
large counts condition (np≥10, n(1-p)≥10)
  state that since np≥10 and n(1-p)≥10, the sampling distribution of p̂ is approximately normal by the large counts condition.

Sampling distribution of x̄
The mean of the sampling distribution of x̄ is $\mu_{\bar{x}} = \mu$
So x̄ is an unbiased estimator of μ.
The standard deviation of the sampling distribution of x̄ is $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
normality
  if normal, say that it is
  if not normal, use central limit theorem and state that since n≥30, the sampling distribution of x̄ is approximately normal by the central limit theorem.

Sampling distribution of p̂1-p̂2
Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that since the large counts condition is met for both samples, the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples

Sampling distribution of x̄1-x̄2
Conditions you need to check for both samples are:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if both are normal, say that they are
  if both are not normal, use central limit theorem on both samples and state that since the central limit theorem is met for both samples (n1≥30 and n2≥30), the sampling distribution of x̄1-x̄2 is approximately normal.
  if one is normal but the other isn't, state that it is normal for the normal data but use the central limit theorem on the other
independence condition
  mention that the two samples are independent random samples
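The numeric conditions above are easy to check mechanically. The helpers below are hypothetical names written for illustration; the standard deviation of p̂ uses the standard formula sqrt(p(1-p)/n), which does not appear in the surviving text. The SRS and independence conditions depend on how the data were collected and cannot be checked by code.

```python
from math import sqrt


def ten_percent_ok(n, population_size):
    """10% condition: the sample is less than 10% of the population."""
    return n < 0.10 * population_size


def large_counts_ok(n, p):
    """Large counts condition: expected successes and failures are both at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10


def clt_ok(n):
    """Central limit theorem rule of thumb used in these notes: n >= 30."""
    return n >= 30


def sd_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of p-hat (standard formula)."""
    return sqrt(p * (1 - p) / n)


def sd_sample_mean(sigma, n):
    """Standard deviation of the sampling distribution of x-bar: sigma / sqrt(n)."""
    return sigma / sqrt(n)


print(ten_percent_ok(100, 5000), large_counts_ok(100, 0.5), clt_ok(25))
print(sd_sample_proportion(0.5, 100), sd_sample_mean(10, 25))
```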
Sampling distribution of b1
The mean of the sampling distribution of b1 is $\mu_{b_1} = \beta_1$
So b1 is an unbiased estimator of β1.
The standard deviation of the sampling distribution of b1 is
Conditions you need to check are:
SRS
10% condition (n < 0.10N)
scatterplot of sample data is approximately linear
distribution of residuals is approximately normal

Statistical Inference I

Confidence interval
The AP Statistics exam tests you on five types of confidence intervals (CI).
  CI for population proportion
  CI for population mean
  CI for difference between population proportions
  CI for difference between population means
  CI for slope of the LSRL
A confidence interval gives an interval of plausible values for a parameter based on sample data.
Affecting margin of error
In general, we prefer an estimate with a small margin of error. The margin of error gets smaller when:
  the confidence level decreases. To obtain a smaller margin of error from the same data, you must be willing to accept less confidence.
  the sample size n increases. In general, increasing the sample size n reduces the margin of error for any fixed confidence level.
Sample size for a desired margin of error

CI for population proportion
Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np̂≥10, n(1-p̂)≥10)
  state that the number of successes (np̂) and the number of failures (n(1-p̂)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.
Calculate: if the conditions are met, perform the calculations (1-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

CI for population mean
Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use central limit theorem and state that since n≥30, the sampling distribution of x̄ is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
Calculate: if the conditions are met, perform the calculations (ZInterval or TInterval)

CI for difference between population proportions
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p̂1≥10, n1(1-p̂1)≥10
  n2p̂2≥10, n2(1-p̂2)≥10
  state that the numbers of successes (n1p̂1, n2p̂2) and the numbers of failures (n1(1-p̂1), n2(1-p̂2)) are all greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-PropZInt)
Conclude: interpret your confidence interval in the context of the problem.

CI for difference between population means
*df = (n1-1 or n2-1, whichever is smaller) OR (use technology for precision)
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by the central limit theorem
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-SampZInt or 2-SampTInt)
Conclude: interpret your confidence interval in the context of the problem.

IMPORTANT
For paired data (data that are not independent), create a new variable d, which is the variable for differences, by taking the difference between each value x1 from sample 1 and the corresponding value x2 from sample 2.
Then, you need to create a one sample t interval using the new variable d by using TInterval.

CI for slope of the LSRL
Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal
Calculate: if the conditions are met, perform the calculations (LinRegTInt)
Conclude: interpret your confidence interval in the context of the problem.
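A rough Python counterpart to 1-PropZInt, useful for seeing how confidence level and sample size move the margin of error. The function names and sample numbers are made up, and the interval and sample-size formulas are the standard ones (p̂ ± z*·sqrt(p̂(1-p̂)/n) and n ≥ z*²·p(1-p)/ME²), assumed here because the handout's own formulas were lost.

```python
from math import ceil, sqrt
from statistics import NormalDist


def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for a population proportion (standard z interval)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)           # margin of error
    return p_hat - margin, p_hat + margin


def sample_size_for_margin(margin, confidence=0.95, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin` (p_guess = 0.5 is conservative)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z_star ** 2 * p_guess * (1 - p_guess) / margin ** 2)


print(one_prop_z_interval(60, 100))          # 95% interval
print(one_prop_z_interval(60, 100, 0.90))    # lower confidence: narrower interval
print(one_prop_z_interval(600, 1000))        # larger n: narrower interval
print(sample_size_for_margin(0.03))
```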
Statistical Inference II

Test of significance for quantitative data: hypothesis test
The AP Statistics exam tests you on five types of significance tests for quantitative data.
  hypothesis test for population proportion
  hypothesis test for population mean
  hypothesis test for difference between population proportions
  hypothesis test for difference between population means
  hypothesis test for slope of population LSRL
Confidence intervals aim to estimate over which interval the unknown parameter may lie.
Significance tests aim to investigate whether the claimed (hypothesized) value of the parameter is valid or needs to be changed.

Types of hypotheses
The hypothesis test is conducted by setting up the null hypothesis (Ho) and the alternative hypothesis (Ha). Ha is also known as the research hypothesis.
The null hypothesis almost always uses the equality symbol (=) while the alternative hypothesis uses inequality symbols (<, >, ≠).

Types of hypothesis tests
The inequality symbol used in the alternative hypothesis determines whether a one-tailed or two-tailed test should be performed.
When the < or > symbol is used, conduct a one-tailed test. One-tailed tests are also known as one-sided tests.
When the ≠ symbol is used, conduct a two-tailed test. Two-tailed tests are also known as two-sided tests. Do not forget to double ...
The p-value refers to the probability of getting evidence for the alternative hypothesis as strong or stronger than the observed evidence assuming the null hypothesis is true.
The significance level (α) is the value that we use as a boundary for deciding whether an observed result is unlikely to happen by chance alone assuming the null hypothesis is true. α = 1 - c, where c is the confidence level.
In a hypothesis test, the p-value is compared with the significance level (α).

Type I and Type II Errors
When we make a conclusion in a significance test, there are two kinds of mistakes we can make. A type I error is rejecting a null hypothesis that is true (probability α); a type II error is failing to reject a null hypothesis that is false (probability β).
You need to be able to describe type I and type II errors in context.
α and β are inversely related: for a fixed sample size, decreasing one increases the other. They do not necessarily add up to 1.
The probability of avoiding a type II error is called power. Power = 1 - β
HT for population proportion
Identify: one sample z test for p
where p is (description in context).
State the significance level. (If not given, use 0.05).
Conditions:
SRS
10% condition (n<0.10N)
large counts condition (np0≥10, n(1-p0)≥10)
  state that the number of successes (np0) and the number of failures (n(1-p0)) are both greater than or equal to 10, so the sampling distribution of p̂ is approximately normal.
Calculate: if the conditions are met, perform the calculations (1-PropZTest)
Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

HT for population mean
Identify: one sample z test for μ OR one sample t test for μ (df = n-1)
where μ is (description in context).
State the significance level. (If not given, use 0.05).
Conditions:
SRS
10% condition (n<0.10N)
normality
  if normal, say that it is
  if not normal, use CLT and state that since n≥30, the sampling distribution of x̄ is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
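A rough Python counterpart to 1-PropZTest, showing how the alternative hypothesis decides which tail (or tails) the p-value comes from. The function name, its `alternative` argument, and the sample numbers are made up for illustration; the test statistic is the standard z = (p̂ - p0) / sqrt(p0(1-p0)/n).

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a one sample z test for p.

    alternative: "less" (Ha: p < p0), "greater" (Ha: p > p0),
    or "two-sided" (Ha: p != p0).
    """
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    lower_tail = NormalDist().cdf(z)
    if alternative == "less":
        p_value = lower_tail
    elif alternative == "greater":
        p_value = 1 - lower_tail
    else:
        # two-tailed test: double the smaller tail area
        p_value = 2 * min(lower_tail, 1 - lower_tail)
    return z, p_value


alpha = 0.05                                   # default significance level
z, p_value = one_prop_z_test(60, 100, 0.5)     # made-up sample: 60 successes in 100
decision = "reject Ho" if p_value < alpha else "fail to reject Ho"
print(z, p_value, decision)
```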
HT for population mean (cont.)
Calculate: if the conditions are met, perform the calculations (Z-Test or T-Test)
When using tcdf, remember that df = n - 1.
When calculating the standard deviation of sample means, use the following formula.
  σ is known: $\sigma / \sqrt{n}$
  σ is unknown: $s / \sqrt{n}$
Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

HT for difference in population proportions
Identify: two sample z test for p1-p2
Conditions:
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
large counts condition for both samples
  n1p1≥10, n1(1-p1)≥10
  n2p2≥10, n2(1-p2)≥10
  state that the numbers of successes (n1p1, n2p2) and the numbers of failures (n1(1-p1), n2(1-p2)) are all greater than or equal to 10, so the sampling distribution of p̂1-p̂2 is approximately normal.
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-PropZTest)
Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

HT for difference in population means
Conditions:
SRS for both samples
10% condition for both samples
  n1 < 0.10N1 and n2 < 0.10N2
normality for both samples
  if normal, say that it is
  if not normal, use central limit theorem and state that since n1≥30 and n2≥30, the sampling distribution of x̄1-x̄2 is approximately normal by CLT
  if not normal and n < 30, draw graph to check for normality, no strong skewness, and no outliers
independence condition
  mention that the two samples are independent random samples
Calculate: if the conditions are met, perform the calculations (2-SampZTest or 2-SampTTest)
Alternatively, you can find the p-value using normalcdf or tcdf and compare this value with the significance level.
When using tcdf, use df = n1-1 or df = n2-1, whichever is smaller. Or use technology for a more precise df.
Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).

IMPORTANT
For paired data (data that are not independent), create a new variable d, which is the variable for differences, by taking the difference between each value x1 from sample 1 and the corresponding value x2 from sample 2.
Then, you need to perform a one sample t test using the new variable d. Do not perform a two sample test on paired data!
Identify: paired t-test for the set of differences
Then, follow a similar procedure for HT for a population mean.

HT for slope of population LSRL
For Ha, use
  β1 > 0 if you need to show a positive linear relationship between two variables,
  β1 < 0 if you need to show a negative linear relationship between two variables, or
  β1 ≠ 0 if you need to show there is "some" linear relationship between two variables
Conditions:
SRS
10% condition (n<0.10N)
scatterplot of sample data is approximately linear
no apparent pattern in residuals plot (=equal SD; residuals have roughly equal variability at all x-values in sample data)
distribution of residuals is approximately normal
Calculate: if the conditions are met, perform the calculations (LinRegTTest)
Conclude: Since the p-value is (less than/greater than or equal to) the significance level α, we (reject/fail to reject) the null hypothesis. We (have/do not have) sufficient evidence that (state your alternative hypothesis).
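For paired data, the key step is reducing two columns to one column of differences before running a one sample t procedure. The sketch below does that reduction and computes the t statistic and df; the before/after values are made up, and the p-value is left to tcdf or other technology because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(sample1, sample2, mu_d0=0):
    """Return (t, df) for a paired t-test on the differences d = x1 - x2.

    mu_d0 is the hypothesized mean difference (0 under the usual null hypothesis).
    """
    d = [x1 - x2 for x1, x2 in zip(sample1, sample2)]   # new variable d
    n = len(d)
    standard_error = stdev(d) / sqrt(n)                 # s_d / sqrt(n)
    t = (mean(d) - mu_d0) / standard_error
    return t, n - 1


before = [10, 12, 9, 11, 13]   # made-up paired measurements
after = [8, 11, 9, 8, 10]
t, df = paired_t_statistic(before, after)
print(t, df)
```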
Statistical Inference III

Test of significance for qualitative data: chi-square test
The AP Statistics exam tests you on three types of significance tests for qualitative data.
  chi-square test for goodness-of-fit
  chi-square test for independence
  chi-square test for homogeneity

Chi-square test statistic
$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
A chi-square distribution is defined by a density curve that takes only nonnegative values and is skewed to the right. A particular chi-square distribution is specified by its degrees of freedom.

Chi-square test for goodness-of-fit
The χ2 test for goodness-of-fit compares the distribution of observed counts in the sample with the distribution of expected counts if Ho were true.
The expected count for any category is found by multiplying the sample size (n) by the proportion in each category according to the null hypothesis.
Data is usually given in a one-way table.
Calculate: if the conditions are met, perform the calculations (χ2 GOF-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = (number of categories)-1.
Conclude: compare p-value and significance level; reject or fail to reject Ho

Chi-square test for homogeneity
The χ2 test for homogeneity compares the distribution of a single categorical variable for each of several populations.
The expected count for any category is found using the formula below.
expected count = (row total x column total) / table total
Data is usually given in a two-way table.
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].
Conclude: compare p-value and significance level; reject or fail to reject Ho
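The expected counts and the chi-square statistic can be computed by hand in a few lines of Python. The helper names and the counts are made up for illustration; the two-way expected count uses the standard (row total x column total) / table total rule. The p-value is left to χ2cdf or other technology, since the Python standard library has no chi-square distribution.

```python
def gof_expected(n, proportions):
    """Goodness-of-fit expected counts: sample size times each null proportion."""
    return [n * p for p in proportions]


def two_way_expected(table):
    """Expected counts for a two-way table: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Goodness-of-fit: made-up counts from 120 rolls of a die, Ho: each face has p = 1/6.
observed = [18, 22, 20, 20, 25, 15]
expected = gof_expected(sum(observed), [1 / 6] * 6)
print(chi_square_statistic(observed, expected), len(observed) - 1)   # statistic, df

# Two-way table (made-up counts): df = (rows - 1) x (columns - 1).
table = [[10, 20], [30, 40]]
print(two_way_expected(table), (len(table) - 1) * (len(table[0]) - 1))
```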
Chi-square test for independence
The χ2 test for independence is used to test the association/relationship between two categorical variables in a single population.
The null hypothesis is that there is no association between the two categorical variables in the population of interest.
Another way to state the null hypothesis is that the two categorical variables are independent in the population of interest.
The expected count for any category is found using the formula below.
expected count = (row total x column total) / table total
Data is usually given in a two-way table.
Identify: chi-square test for independence
Conditions:
SRS
10% condition
  n < 0.10N
all expected counts are at least 5
Calculate: if the conditions are met, perform the calculations (χ2-Test)
Alternatively, you can find the p-value using χ2cdf and compare this value with the significance level. When using χ2cdf, df = [r-1] x [c-1].

How to Get a 5 in AP Statistics
Get a perfect score on the MCQs.
Write something (that is sensical) on the FRQs.
And do your homework.