Confidence Intervals, Limits, and Levels?
SHIKEN: JALT Testing & Evaluation SIG Newsletter. October 2011. 15(2) 23 - 27
Statistics Corner
Questions and answers about language testing statistics:
[Footnote 1] Interestingly perhaps, given that the various standard error statistics are themselves estimates, it must be possible to
estimate the standard errors of standard error statistics. For example, it should be possible to estimate the error
involved (i.e., the standard error) in estimating the standard error of the mean. But ultimately, who would care?
One simple way to look at the mean of a set of scores is to think about it as a sample-based
estimate of the mean of the population from which the sample was drawn. Since that estimate is
never perfect, it is reasonable to want to know how much error there may be in that estimate of the
population mean. The magnitude of this error can be calculated using the seM as follows:
$$se_{M} = \frac{S}{\sqrt{N}}$$
Where the seM = the standard error of the mean, S = the standard deviation of the scores on a test,
and N = the number of examinees who took the test. Consider a test that has a mean of 51, S = 12.11,
and N = 64. The seM would be:
$$se_{M} = \frac{S}{\sqrt{N}} = \frac{12.11}{\sqrt{64}} = \frac{12.11}{8} = 1.51375 \approx 1.51$$
This seM is an estimate of the amount of variation due to error that we can expect in sample means.
For more information on interpreting the seM, see the discussion below of confidence intervals,
limits, and levels.
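The seM arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name standard_error_of_mean is illustrative and does not come from the article, and the inputs are the example values S = 12.11 and N = 64.

```python
import math


def standard_error_of_mean(sd: float, n: int) -> float:
    """Return S / sqrt(N), the standard error of the mean (seM)."""
    if n <= 0:
        raise ValueError("n must be a positive number of examinees")
    return sd / math.sqrt(n)


# Example values from the text: S = 12.11, N = 64
se_m = standard_error_of_mean(12.11, 64)
print(round(se_m, 2))  # 1.51
```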
Standard error of measurement (SEM)
Language testers use reliability estimates to investigate the proportion of consistent variation in
scores on a test (for more on this topic see Bachman, 2004; Brown, 1997, 1998, 2002, 2005).
Another more useful way to look at the consistency of test scores is to estimate the magnitude of the
error by calculating the SEM as follows:
$$SEM = S\sqrt{1 - r_{xx'}}$$
Where S = the standard deviation of the scores on a test and rxx' = the reliability estimate for those
scores (e.g., Cronbach alpha, K-R20, etc.). Consider a test that has a mean of 31, S of 5.15, and rxx'
of .93. The SEM would be:
$$SEM = S\sqrt{1 - r_{xx'}} = 5.15\sqrt{1 - .93} = 5.15\sqrt{.07} = 5.15(.2646) = 1.36269 \approx 1.36$$
This SEM is an estimate of the amount of variation in the scores (in score points) that is due to error in the
sample score estimates of the examinees’ true scores. For more information on interpreting the SEM,
see the discussion below of confidence intervals, limits, and levels.
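The SEM calculation can be sketched the same way in Python. The function name standard_error_of_measurement is an illustrative assumption, not something taken from the article; the inputs are the example values S = 5.15 and a reliability of .93. Note that the unrounded result (about 1.3626) differs slightly in the later decimals from the hand calculation, which rounds the square root to .2646, but both round to 1.36.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Return S * sqrt(1 - r_xx'), the standard error of measurement (SEM)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Example values from the text: S = 5.15, reliability = .93
sem = standard_error_of_measurement(5.15, 0.93)
print(round(sem, 2))  # 1.36
```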
Standard error of estimate (see)
Language testers use regression to predict scores on one test (usually labeled Test Y) from
scores on another (usually labeled Test X). One useful way to think about those predictions of Y
scores is to estimate how much error there is in the Test Y predictions by calculating the see as
follows:
$$see = S_{y}\sqrt{1 - r_{xy}^{2}}$$
Where Sy = the standard deviation of the scores on Test Y and rxy = the correlation coefficient for
the degree of relationship between the Test X scores and those on Test Y. Consider a regression
analysis where Sy = 9.54 and rxy = .80. The see would be:
$$see = S_{y}\sqrt{1 - r_{xy}^{2}} = 9.54\sqrt{1 - .80^{2}} = 9.54\sqrt{1 - .64} = 9.54\sqrt{.36} = 9.54(.60) = 5.724 \approx 5.72$$
This see is an estimate of the amount of variation due to error that we can expect in the predicted
Test Y scores based on scores on Test X in a particular regression analysis. For more information
on interpreting the see, go to the discussion below of confidence intervals, limits, and levels.
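As a rough check on the see arithmetic, the formula can be written as a short Python function. This is a sketch under the stated example values (Sy = 9.54, rxy = .80); the function name standard_error_of_estimate is hypothetical and chosen only for illustration.

```python
import math


def standard_error_of_estimate(sd_y: float, r_xy: float) -> float:
    """Return S_y * sqrt(1 - r_xy^2), the standard error of estimate (see)."""
    if not -1.0 <= r_xy <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return sd_y * math.sqrt(1.0 - r_xy ** 2)


# Example values from the text: Sy = 9.54, rxy = .80
see = standard_error_of_estimate(9.54, 0.80)
print(round(see, 2))  # 5.72
```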
What Are Confidence Intervals, Confidence Limits, Confidence Levels, etc.?
The confidence intervals, limits, and levels that you asked about in your question all have to do
with the next step after you have calculated the standard error. This next step is to interpret the
standard error. In order to do so, we need to understand the differences among confidence intervals,
limits, and levels so we can think, talk, and write clearly about our interpretations of standard errors.
Before we turn to using any of the types of standard errors described above to help us interpret
our sample statistics, we need to understand that errors are typically assumed to be normally
distributed. Since all of the error estimates that we are talking about are standard errors, they are
standardized and can be described as shown in Figure 1.
[Figure 1. Distribution of standardized errors; horizontal axis marked at -3 se, -2 se, -1 se, 0, +1 se, +2 se, +3 se]
[Footnote 2] About this term parameter, note that statistics are used in samples to estimate analogous parameters in the population
from which the sample was drawn. For example, a sample mean statistic, M, is often calculated to estimate the
analogous population parameter µ.
Why should we care? Consider what the latest APA Manual (APA, 2010) says: “The inclusion
of confidence intervals (for estimates of parameters, for functions of parameters such as differences
in means, and for effect sizes) can be an extremely effective way of reporting results. Because
confidence intervals combine information on location and precision and can often be directly used
to infer significance levels, they are, in general, the best reporting strategy. The use of confidence
intervals is therefore strongly recommended” (p. 34, italics added).
In language testing we use confidence intervals to interpret at least the standard error of the
mean (seM), standard error of measurement (SEM), and standard error of estimate (see) as I will
explain in three separate subsections.
Confidence and the seM
Let’s begin by considering the example used above for the seM, where the mean was 51 and the
seM turned out to be 1.51. The mean (M) for the sample of 51 is the best estimate that we have of the
parameter μ. However, the seM of 1.51 tells us that there is error in that estimate and how big the
error is. Since we assume that error is normally distributed, we can estimate the range within which
the population mean is likely to lie in probability terms. In this case, we know that the population
mean is likely to fall within ±1 seM 68% of the time (34.13 + 34.13 = 68.26 ≈ 68), ±2 seM 95% of
the time (34.13 + 34.13 + 13.59 + 13.59 = 95.44 ≈ 95), and ±3 seM more than 99% of the time (34.13 + 34.13 +
13.59 + 13.59 + 2.14 + 2.14 = 99.72).
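The percentages attached to ±1, ±2, and ±3 standard errors are areas under the normal curve, and they can be recomputed directly. The sketch below uses the error function from Python's standard math module; the helper name normal_coverage is an illustrative assumption. The exact areas can differ in the second decimal place from sums of rounded segment percentages.

```python
import math


def normal_coverage(k: float) -> float:
    """Proportion of a normal distribution lying within +/- k standard errors."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"+/-{k} se: {100 * normal_coverage(k):.2f}%")
# +/-1 se: 68.27%
# +/-2 se: 95.45%
# +/-3 se: 99.73%
```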
Hence, we can say that the population μ in our example will fall within a confidence interval of plus or
minus one seM around the sample mean of 51, that is, from 49.49 to 52.51, about 68% of the time
(±1 se in this case = ±1.51; 51 – 1.51 = 49.49; 51 + 1.51 = 52.51). Using the same reasoning, we
can say that the population μ in our example will fall within plus or minus two seMs
of the sample mean, that is, from 47.98 to 54.02, with 95% probability (±2 se in this case = ±3.02;
51 – 3.02 = 47.98; 51 + 3.02 = 54.02), and that the population μ in our example will fall within plus
or minus three seMs of the sample mean, that is, from 46.47 to 55.53, with 99%
probability (±3 se in this case = ±4.53; 51 – 4.53 = 46.47; 51 + 4.53 = 55.53).
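The three bands around the sample mean can be reproduced with a small helper. This is a minimal sketch, assuming the example values of a mean of 51 and an seM of 1.51; the function name confidence_limits and the pairing of multipliers with rounded percentage labels are illustrative choices, not part of the article.

```python
from typing import Tuple


def confidence_limits(center: float, se: float, k: int) -> Tuple[float, float]:
    """Return the lower and upper limits k standard errors either side of center."""
    return (center - k * se, center + k * se)


mean, se_m = 51.0, 1.51  # example values from the text
for k, level in ((1, "68%"), (2, "95%"), (3, "99%")):
    lower, upper = confidence_limits(mean, se_m, k)
    print(f"{level}: {lower:.2f} to {upper:.2f}")
# 68%: 49.49 to 52.51
# 95%: 47.98 to 54.02
# 99%: 46.47 to 55.53
```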
Confidence and the SEM
The SEM calculated in the example above turned out to be 1.36, which can be used to
estimate confidence intervals that indicate how many score points of variation can reasonably be
expected with 68%, 95%, or 99% probability around any given point (e.g., a score or a cut-point).
Let’s say a student scored 32; that student (or any student with that same score) has a 68%
probability of getting a score between 30.64 and 33.36 (32 – 1.36 = 30.64; 32 + 1.36 = 33.36) by
chance alone if the test were administered repeatedly. Similarly, the scores of any examinee with a score of 32 are
likely to fall within plus or minus two SEMs (1.36 + 1.36 = 2.72; 32 – 2.72 = 29.28; 32 + 2.72 =
34.72), or a band from 29.28 to 34.72, 95% of the time by chance alone. And finally, that examinee’s
scores are likely to fluctuate within plus or minus three SEMs (3 × 1.36 = 4.08; 32 – 4.08 = 27.92;
32 + 4.08 = 36.08), or a band from 27.92 to 36.08, 99% of the time. In practical
terms, language testers most often use the SEM in cut-point decision making, where they may want
to at minimum consider gathering additional information about any examinees who have scores
within the band of plus or minus one SEM of a given cut-point in order to increase the reliability of
that decision making. However, whether the tester chooses a 68%, 95%, or 99% confidence level is
a judgment call.
For additional information on SEM, see Bachman (2004, pp. 171-174), or Brown (2005, pp.
188-190, 193-195).
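The cut-point use of the SEM described above can be sketched in Python. Everything here beyond the SEM of 1.36 and the score of 32 is an assumption made for illustration: the function names score_band and needs_more_information are hypothetical, and the cut-point of 33 is an invented value, not one from the article.

```python
from typing import Tuple


def score_band(score: float, sem: float, k: int = 1) -> Tuple[float, float]:
    """Return the band k SEMs either side of an observed score."""
    return (score - k * sem, score + k * sem)


def needs_more_information(score: float, cut_point: float, sem: float, k: int = 1) -> bool:
    """True if the score lies within k SEMs of the cut-point."""
    return abs(score - cut_point) <= k * sem


sem = 1.36        # SEM from the example in the text
cut_point = 33.0  # hypothetical cut-point, chosen only for illustration

lower, upper = score_band(32, sem)
print(f"{lower:.2f} to {upper:.2f}")               # 30.64 to 33.36
print(needs_more_information(32, cut_point, sem))  # True: 32 is within 1.36 points of 33
print(needs_more_information(30, cut_point, sem))  # False: 30 is 3 points below 33
```

Passing k=2 or k=3 widens the band to the 95% or 99% level; which level to use remains the tester's judgment call.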
References
APA (2010). The publication manual of the American Psychological Association (6th ed.). Washington, DC: American
Psychological Association.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge University Press.
Brown, J. D. (1997). Statistics corner: Questions and answers about language testing statistics: Reliability of surveys.
Shiken: JALT Testing & Evaluations SIG Newsletter, 1(2), 18-21. Retrieved from http://jalt.org/test/bro_2.htm
Brown, J. D. (1998). Statistics corner: Questions and answers about language testing statistics: Cloze tests and optimum
test length. Shiken: JALT Testing & Evaluations SIG Newsletter, 2(2), 18-22. Retrieved from
http://jalt.org/test/bro_3.htm
Brown, J. D. (1999). Statistics corner: Questions and answers about language testing statistics: Standard error vs.
standard error of measurement. Shiken: JALT Testing & Evaluations SIG Newsletter, 3(1), 20-25. Retrieved from
http://jalt.org/test/bro_4.htm
Brown, J. D. (2002). Statistics corner: Questions and answers about language testing statistics: The Cronbach alpha
reliability estimate. Shiken: JALT Testing & Evaluations SIG Newsletter, 6(1), 17-19. Retrieved from
http://jalt.org/test/bro_13.htm
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment (New
edition). New York: McGraw-Hill.
Vogt, W. P., & Johnson, R. B. (2011). Dictionary of statistics & methodology: A nontechnical guide for the social
sciences. Thousand Oaks, CA: Sage.
HTML: http://jalt.org/test/bro_35.htm / PDF: http://jalt.org/test/PDF/Brown35.pdf