Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411-433. http://dx.doi.org/10.1111/j.1468-2958.2004.tb00738.x
Issues of Scale
Let me start with the ranges of the two broad classes of agreement coefficients, chance-
corrected agreement and raw or %-agreement. While both kinds equal 1.000 or 100% when agreement
is perfect, and data are considered reliable, %-agreement is zero when absolutely no agreement is
observed; when one coder's categories unfailingly differ from the categories used by the other; or
disagreement is systematic and extreme. Extreme disagreement is statistically almost as unexpected as
perfect agreement. It should not occur, however, when coders apply the same coding instruction to the
same set of units of analysis and work independently of each other, as is required when generating data
for testing reliability.
Where the reliability of data is an issue, the worst situation is not when one coder looks over
the shoulder of another coder and selects a non-matching category, but when coders do not understand
what they are asked to interpret, categorize by throwing dice, or examine unlike units of analysis,
causing research results that are indistinguishable from chance events. While zero %-agreement has no
meaningful reliability interpretation, chance-corrected agreement coefficients, by contrast, become
zero when coders' behavior bears no relation to the phenomena to be coded, leaving researchers
clueless as to what their data mean. Thus, the scales of chance-corrected agreement coefficients are
anchored at two points of meaningful reliability interpretations, zero and one, whereas %-like
agreement indices are anchored in only one, 100%, which renders all deviations from 100%
uninterpretable, as far as data reliability is concerned. %-agreement has other undesirable properties;
for example, it is limited to nominal data; can compare only two coders2; and high %-agreement
becomes progressively unlikely as more categories are available. I am suggesting that the convenience
of calculating %-agreement, which is often cited as its advantage, cannot compensate for its
meaninglessness. Let me hasten to add that chance-correction is not a panacea either. Chance-corrected
agreement coefficients do not form a uniform class. Benini (1901), Bennett, Alpert, and Goldstein
(1954), Cohen (1960), Goodman and Kruskal (1954), Krippendorff (1970, 2004), and Scott (1955)
build different corrections into their coefficients, thus measuring reliability on slightly different scales.
Chance can mean different things. Discussing these coefficients in terms of being conservative
(yielding lower values than expected) or liberal (yielding higher values than expected) glosses over
their crucial mathematical differences and privileges an intuitive sense of the kind of magnitudes that
are somehow considered acceptable.
If the issue were merely one of striking a balance between conservative and liberal coefficients, it would be easy to follow statistical practice and modify larger coefficients by squaring them and smaller coefficients by taking their square roots. However, neither transformation would alter what these mathematical functions actually measure, only the sizes of the intervals between 0 and 1.
Lombard et al., by contrast, attempt to resolve their dilemma by recommending that content analysts
use several reliability measures. In their own report, they use α, "an index …known to be conservative," but when α measures below .700, they revert to %-agreement, "a liberal index," and
accept data as reliable as long as the latter is above .900 (2002, p. 596). They give no empirical
justification for their choice. I shall illustrate below the kind of data that would pass their criterion.
Figure 1
Reliability Data of Two Coders, A and B, on a Dichotomous Variable, and Population Estimates

                 Coder A
  Values:         0     1            Population Estimates
  Coder B   0     a     b    pB      p̄ = (pA + pB)/2 = proportion of 0s in data
            1     c     d    qB      q̄ = (qA + qB)/2 = 1 - p̄ = proportion of 1s in data
                 pA    qA     1
Figure 2 states the above-mentioned agreement coefficients in terms of Figure 1 and in α's economical form:

    Agreement = 1 - Do/De = 1 - (Observed Disagreement)/(Expected Disagreement)

where, when the observed disagreement Do = 0, agreement = 1; and when the two disagreements are equal, Do = De, agreement = 0. So, Do expresses the lack of agreement, whereas De defines the zero point of the measure.
Figure 2
The Dichotomous Forms of Seven Agreement Coefficients

Agreement = 1 - Observed / Expected Disagreements

%-agreement               Ao = 1 - (b + c)
Osgood (1959); Holsti's   CR = [1 - (b + c)] · 2NAB/(NA + NB)
Bennett et al. (1954)     S  = 1 - (b + c) / (2 · ½ · ½)          where ½ is the logical probability of 0 and of 1
Scott (1955)              π  = 1 - (b + c) / 2p̄q̄                  where p̄ = (pA + pB)/2 and q̄ = 1 - p̄
Krippendorff (1970)       α  = 1 - (b + c) / [(n/(n-1)) · 2p̄q̄]    where n = the number of 0s and 1s used jointly
Cohen (1960)              κ  = 1 - (b + c) / (pAqB + pBqA)
Benini (1901)             β  = 1 - [(b + c) - |b-c|] / [pAqB + pBqA - |b-c|]
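To make these dichotomous forms concrete, here is a minimal computational sketch of Figure 2 – my own illustration in Python, not code from the article or from the software Lombard et al. used; the function name and argument conventions are mine. It takes the cell counts a, b, c, d of Figure 1 and, optionally, the unit counts NA, NB, and NAB that Osgood/Holsti's CR requires.

def dichotomous_coefficients(a, b, c, d, N_A=None, N_B=None, N_AB=None):
    """The dichotomous coefficients of Figure 2, from the 2-by-2 cell counts of Figure 1."""
    N = a + b + c + d                      # jointly coded units
    Do = (b + c) / N                       # observed disagreement: proportion of mismatches
    pA, qA = (a + c) / N, (b + d) / N      # coder A's marginal proportions of 0s and 1s
    pB, qB = (a + b) / N, (c + d) / N      # coder B's marginal proportions of 0s and 1s
    p_bar = (pA + pB) / 2                  # estimated population proportion of 0s
    q_bar = 1 - p_bar                      # estimated population proportion of 1s
    n = 2 * N                              # number of 0s and 1s used jointly
    diff = abs(b - c) / N                  # |b - c| as a proportion, used by Benini's beta
    results = {
        "%-agreement Ao":     1 - Do,
        "Bennett et al. S":   1 - Do / (2 * 0.5 * 0.5),
        "Scott pi":           1 - Do / (2 * p_bar * q_bar),
        "Krippendorff alpha": 1 - Do / ((n / (n - 1)) * 2 * p_bar * q_bar),
        "Cohen kappa":        1 - Do / (pA * qB + pB * qA),
        "Benini beta":        1 - (Do - diff) / (pA * qB + pB * qA - diff),
    }
    if None not in (N_A, N_B, N_AB):       # Osgood/Holsti's CR needs the three unit counts
        results["Osgood/Holsti CR"] = (1 - Do) * 2 * N_AB / (N_A + N_B)
    return results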
Evidently, all coefficients in Figure 2 contain the same observed disagreement, the proportion
of mismatches (b+c), which satisfies part of (2). The %-agreement measure, Ao, stops there, making no
allowance for disagreements that are expected by chance, assuming nothing about the properties of the
data in question, and depriving researchers, as already stated, of a meaningful second anchor for their
reliability scale. Ao cannot indicate the absence of reliability, as called for in (3).
Osgood's (1959, p. 44) coefficient, named CR by Holsti (1969, p. 140), amounts to the product of two proportions: the %-agreement, Ao, equivalent to 1 - (b+c), and the proportion of the number NAB of units coded jointly to the average number of those coded individually, NA and NB. Unlike the other coefficients reviewed here, Osgood's responds not only to disagreements in coding but also to disagreements in the numbers of identified units, yet it does so without reference to what would amount to the absence of reliability: chance. Thus, Osgood's coefficient suffers from the same
problems that %-agreement does.
Bennett et al. (1954) were probably the first to realize that %-agreement becomes more difficult
to achieve as the number of available categories increases. Their coefficient, S, corrects for this effect.
For just two categories, S calculates the disagreement in the two cells, b and c, that can be expected by chance as 2 · ½ · ½ or 50%. Here, ½ is the logical probability of the distinction between categories 0 and 1. It is not very flattering to the literature on content analysis that this coefficient has been reinvented with minor variations at least five times since its original proposal: as Guilford's G (Holley & Guilford, 1964); as the R.E. (random error) coefficient (Maxwell, 1970); as C (Janson & Vegelius, 1979); as κn (Brennan & Prediger, 1981); and as the intercoder reliability coefficient Ir (Perreault &
Leigh, 1989). The authors of the last two derivations at least knew of S. The justifications given in the
literature for using this coefficient range from fairness to each category and appropriateness to the
discipline of its advocates,3 to the absence of hard knowledge about the true distribution of categories
in the population from which reliability data are sampled. By treating all categories as equally likely, S
is insensitive to unequal (non-uniform) distributions of categories in the population of data, fails to
respond to disagreements among coders regarding their frequencies of using these categories, becomes inflated by unused categories, and, not satisfying (3), cannot indicate the reliability of the population of data.
Regarding the absence of knowledge about the true distribution of categories in the population
of data, it would make good sense, indeed, to calculate expected disagreements from the proportions of
categories in that population. After all, it is the nature of the data – not of the coders' proclivity for
particular categories or systematic coding habits; not the categorical structure of a coding instrument –
that empirical inquiries are ultimately concerned with and on which researchers hope all coders would
agree. As stated in (2), content analysts have to accept the epistemological fact that data are knowable
only through their descriptions and the true proportions of categories in the population of data remain
unknown until the whole population of units of analysis is reliably observed, transcribed, categorized,
or coded. Without a priori knowledge of the data, according to (3), these proportions must be
estimated from the reliability sample, using as many coders as possible (at least two), and assuming
that individual differences among them wash out with large numbers. It is standard statistical practice
to take the mean of multiple coder judgments on a sample as estimates of the otherwise unknowable
population proportions. With only two coders and two categories, as in Figure 1, p̄ = (pA + pB)/2 is the best estimate of the proportion of 0s in the population of data and its complement, q̄ = (qA + qB)/2 = 1 - p̄, is the best estimate of the proportion of 1s in the same data.
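For instance (my own illustrative numbers, not the article's): if coder A assigns category 0 to 80% of the units and coder B to 70%, the estimates are p̄ = (.80 + .70)/2 = .75 and q̄ = 1 - .75 = .25.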
In Figure 2, π and α can be seen to be alike in calculating their expected disagreements in cells b and c as 2p̄q̄, relying on precisely this population estimate, thus satisfying the inductive step of (3) and the interchangeability of coders mentioned in (2). Evidently, nowhere do π and α "assume" that "coders have distributed their values across the categories identically" as Lombard et al. (2002, p. 591) claim. π and α merely estimate the population proportions and calculate their expected disagreements in these terms. Confusing the computation of expected disagreements from population estimates with the assumption that coders have used their categories with identical frequencies and that disagreements between them are ignored is rooted in the failure to recognize that coder interchangeability is necessary to get to the population estimates. In Figure 2, π and α can be seen to refer to the population of data whereas the other coefficients do not.
π and α differ in one respect, in the factor n/(n-1), which is recognizable in α but not in π. n is the total number of categories used to describe all units by all coders. This factor corrects for the effects of small sample sizes and few coders. Numerically, α exceeds π by (1-π)/n. But as sample sizes increase, the factor n/(n-1) converges to 1, the difference (1-π)/n converges to 0, and α and π become asymptotically indistinguishable.
Turning now to κ, its expected disagreement differs from π's and α's. The sum of pAqB and pBqA computes the proportions in cells b and c that can be expected under the condition that coders A and B are statistically independent. κ does this much like the familiar χ² statistic does. The latter is used to test null-hypotheses regarding associations, not agreement. Thus, and in violation of the second part of (2), κ's expected disagreement is a function of the individual coders' preferences for the two categories 0 and 1, not of the estimated proportions p̄ of 0s and q̄ of 1s in the population of data. This expected disagreement renders κ zero when the two coders' uses of categories are statistically independent. As κ deviates from perfect agreement, it becomes increasingly determined by coder preferences and says less about the data it is to evaluate.
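A quick check with the sketch given after Figure 2 (again my own illustration, with hypothetical counts) makes this concrete: when the cells are the products of the coders' margins, that is, when the coders are statistically independent, κ is exactly zero even though half of the units match.

# Hypothetical counts: coder A splits 50/50, coder B splits 60/40, and each cell equals
# the product of its margins, i.e., the two coders are statistically independent.
independent = dichotomous_coefficients(a=30, b=30, c=20, d=20)
print(round(independent["Cohen kappa"], 3))   # 0.0, despite a %-agreement of .50
print(round(independent["Scott pi"], 3))      # about -0.01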
I suggested elsewhere (Krippendorff, 1978) that κ is a hybrid coefficient. It enters the observed disagreement just as all agreement measures do but corrects this by a conception of chance that derives its logic from association measures. This inconsistency explains why κ behaves so oddly in the numerical examples in Figure 3 below. But, faced with this characterization of κ, Fleiss (1978, p. 144), a major proponent of κ, conceded that when coders are interchangeable, π (and α) would be the correct measure of reliability. The use of κ, he wrote, should be restricted to reliability studies in which one pair of coders judges all units of analysis and unequal coder preferences are not problematic. Thus, κ fails to recognize that the two coders' unequal uses of categories could be a reliability problem. Notwithstanding κ's popularity, the amount of research devoted to this coefficient, and the interpretations that Lombard et al. cite from the literature, the mathematical structure of Cohen's κ is simply incommensurate with the logic of the situation that content analysts are facing when the reliability of their data is in question. κ cannot be recommended as one of several alternative indices, as Lombard et al. are suggesting.
As seen in Figure 2, Benini's (1901) β4 differs from κ only in its subtracting the absolute difference |b-c| from both κ's observed and expected disagreements. This adjustment to κ preserves its reliance on the statistical independence of the two coders and therefore disqualifies β from being interpretable as an index of the reliability of data. The importance of this seemingly small adjustment is that β, unlike κ, carries its dependence on the coders' unequal use of categories to its logical conclusion, measuring 1.000 when agreement is the largest one possible, given these coders' marginal distributions. This might not be so easily recognizable in the mathematical form of Figure 2, but the behavior that follows from it might become clear in Figure 3 below.
Figure 3
Three Contingency Tables with Equal/Unequal Margins and Largest Agreement
(columns: Coder A's categories; rows: Coder B's categories)

Left table (equal margins):
  Categories:     a     b     c
         a       12     9     9 |  30
         b        9    14     9 |  32
         c        9     9    20 |  38
                 30    32    38 | 100
  Ao = .460    π = .186    κ = .186    β = .186

Center table (unequal margins, same %-agreement):
         a       12    18    18 |  48
         b        0    14    18 |  32
         c        0     0    20 |  20
                 12    32    56 | 100
  Ao = .460    π = .186    κ = .258    β = .511

Right table (same margins as the center table, largest possible agreement):
         a       12     0    36 |  48
         b        0    32     0 |  32
         c        0     0    20 |  20
                 12    32    56 | 100
  Ao = .640    π = .457    κ = .506    β = 1.000
The left and center tables are identical in the %-agreement they exhibit but differ in how their mismatching categories are distributed. In the left table, coders A and B agree on their marginal frequencies; in the center and right tables, they do not. When they do agree, π, κ, and β are equal, as they should be. But when coders disagree on these frequencies, when they show unequal proclivities for the available categories, as is apparent in the margins of the table in the middle, κ exceeds π. κ does not ignore the disagreements between the coders' uses of categories, but adds them to the measure as agreement! This highly undesirable property benefits coders who disagree on these margins over those
who agree and it clearly contradicts what its proponents (Cohen, 1960; Fleiss, 1975) argued and what
Lombard et al. (2002) have found to be the dominant opinion in the literature. Evidently, there are still
46 out of 100 units with matching categories in the diagonal cells. What accounts for this difference is
that the 54 mismatches, occupying the cells of both off-diagonal triangles in the left table of Figure 3,
have now migrated to one off-diagonal triangle in the center table. It makes for an uneven distribution
of the mismatching categories, increasing not agreement, but the predictability of the mismatching
pairs of categories. Unlike κ, π is evidently not affected by where the mismatching categories occur,
satisfying (2) by not distinguishing who contributed which disagreements and, when data are nominal,
which categories are confused. Predictability has nothing to do with reliability.
Figure 3 demonstrates another peculiarity of κ. Not only does κ counterintuitively exceed π when disagreements in marginal frequencies are present; unlike β, it also cannot reach 1.000 when such disagreements exist. This had already been observed by Cohen (1960), was noted as a drawback by Brennan and Prediger (1981) and others, and may also be seen in the right table of Figure 3. This table has the same marginal frequencies as the one in the center but exhibits the largest possible agreement, given the marginal constraints. Under these conditions, κ cannot exceed .505, its largest possible value for these marginal frequencies. By contrast, β registers this very condition by measuring 1.000.5
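The Figure 3 values can be checked with a short sketch of the multi-category forms of these coefficients – again my own code, not the author's, using the standard generalizations: the product of the two coders' margins as κ's chance term, the squared mean margins as π's, and the sum of the smaller margins as the largest attainable agreement for Benini's β and for the maximum of κ.

def multicategory_coefficients(table):
    """table[i][j] = units that coder B put in category i and coder A in category j."""
    N = sum(sum(row) for row in table)
    k = len(table)
    po = sum(table[i][i] for i in range(k)) / N                    # observed agreement
    colA = [sum(row[j] for row in table) / N for j in range(k)]    # coder A's margins
    rowB = [sum(table[i]) / N for i in range(k)]                   # coder B's margins
    pe_kappa = sum(colA[i] * rowB[i] for i in range(k))            # chance agreement for kappa
    pe_pi = sum(((colA[i] + rowB[i]) / 2) ** 2 for i in range(k))  # chance agreement for pi
    p_max = sum(min(colA[i], rowB[i]) for i in range(k))           # largest attainable agreement
    return {
        "Ao":        po,
        "pi":        (po - pe_pi) / (1 - pe_pi),
        "kappa":     (po - pe_kappa) / (1 - pe_kappa),
        "beta":      (po - pe_kappa) / (p_max - pe_kappa),
        "max kappa": (p_max - pe_kappa) / (1 - pe_kappa),
    }

left   = [[12, 9, 9], [9, 14, 9], [9, 9, 20]]     # equal margins:   Ao=.46, pi=kappa=beta=.186
center = [[12, 18, 18], [0, 14, 18], [0, 0, 20]]  # unequal margins: Ao=.46, kappa=.258, beta=.511
right  = [[12, 0, 36], [0, 32, 0], [0, 0, 20]]    # largest agreement for the center table's margins:
                                                  # Ao=.64, kappa=.506 (its maximum here), beta=1.0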
I am less concerned with this additional peculiarity of κ, except to note that β is always equal to or larger than κ, and κ, in turn, is always equal to or larger than π. Having shown the reasons for these inequalities, both in mathematical terms and by numerical examples, characterizing these coefficients in terms of the aforementioned conservative/liberal dimension would be beside the point of this demonstration. When the reliability of data is the issue, κ is simply wrong in what it does. Its behavior clearly invalidates widely held beliefs about κ, which are uncritically reproduced in the literature.
I have to say that the above misinterpretation of κ goes back to its inception. To justify his unfortunate modification of Scott's (1955) π, Cohen incorrectly criticized π for ignoring "one source of disagreement between a pair of judges, … their proclivity to distribute their judgments differently over the categories" (1960, p. 41). Figure 3 showed that κ behaves contrary to what Cohen had intended. Instead of including this error as disagreement, κ credits this error towards agreement. Brennan and Prediger (1981) observed this highly undesirable property of κ as well, pointing out that "two judges who independently, and without prior knowledge, produce similar marginal distributions must obtain a much higher agreement rate to obtain a given value of kappa, than two judges who produce radically different marginals. … [The former judges] are in a sense penalized" (p. 692) for agreeing on marginal frequencies. Zwick (1988) has considered this statistical artifact. Her advice to users of κ is to test for unequal margins before applying κ. Its violating (2) and (3) renders κ just about worthless as a reliability index in content analysis. The same can be said about β, although I have not heard anyone claiming as much.
Numerical Comparisons
Following their own recommendation to compute several agreement coefficients and to find a
balance between conservative and liberal coefficients, Lombard et al. calculated the values of the four
aforementioned indices, %-agreement, π, α, and κ, for 36 of their variables. Their corrected table
(2003, pp. 470-471) provides good empirical examples for discussing what their numerical differences
mean. However, since all content analysts work hard to achieve reliable data, such a table cannot
possibly reveal the full ranges of these coefficients. Therefore, let me state them generally:
    0 ≤ %-agreement ≤ 1
    -1 ≤ π, α, and κ ≤ +1
For nominal variables, which account for the majority of the authors' data, their inequalities are:

    π ≤ κ ≤ %-agreement and π ≤ nominal α
Careful readers of Lombard et al.'s corrected table will notice the small differences among the
three chance-corrected agreement coefficients and might come to the seriously mistaken conclusion
that the choice among these coefficients would not matter much. However, even small differences
mean rather different things, starting with their zero values:
%-agreement = 0: one coder describes all units of analysis in terms not chosen by the other
π = 0: multiple descriptions are chance events, assuming large numbers of units of analysis
α = 0: multiple descriptions are chance events, adjusted for variable numbers of units and coders
κ = 0: coders are statistically independent of each other, assuming large numbers of units of analysis
As already stated, when the sample size is large, theoretically infinite, nominal α = π. Otherwise, nominal α exceeds π by (1-π)/n, which corrects α for small reliability sample sizes. With the authors' sample size of n = 256 (2 coders × 128 units), that difference is noticeable only in the third decimal place. Smaller samples would result in larger differences.
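For instance (my own arithmetic): with π = .700 and n = 256, nominal α = .700 + (1 - .700)/256 ≈ .701, a difference of about .001.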
As demonstrated above, when coders agree on their use of categories, that is, on their marginal distributions, κ = π. When coders disagree regarding these distributions, κ exceeds π, responding to the increased predictability of one coder's categories from those of the other. Predictability has nothing to do with reliability measures and must not contaminate them. In the authors' table, the values of π and κ turn out to barely differ, suggesting that the two coders exhibit only small marginal differences. However, Figure 3 shows that such differences could be much larger.
Lombard et al. also report the reliabilities for ordered data. If agreement concerns ordered reliability data – ranks, intervals, and proportions – an agreement coefficient that is appropriate to these data utilizes this information and can be expected to exceed nominal coefficients, which ignore that information. α is applicable to metrics other than nominal; %-agreement, π, and κ are not. In the authors' table, the names of variables with ratio metrics are superscripted "b." %-agreement, π, and κ are inappropriate for these variables. However, since the authors happen to calculate these coefficients, comparing them with the values of the ratio coefficient may show the reader how much %-agreement, π, and κ, respectively, omit.
Figure 4
Reliability Data on the Agreement Coefficient Used: "'Simple Agreement' Only"
(columns: Coder C; rows: Coder J)

  Categories:     0     1
         0       83     1 |  84
         1        2     0 |   2
                 85     1 |  86 = NCJ
  NJ = 86 + 3 units without a match by C = 89
  NC = 86 + 1 unit without a match by J = 87

  Ao = .965
  CR = .943
  S  = .930
  α  = -.012
  κ  = -.016
  π  = -.018
  β  = -.024
Figure 4 also lists the value of Osgood's coefficient (Holsti's CR), which Lombard et al. discuss but do not report,7 and of Bennett et al.'s S and Benini's β, for comparison.
The 0-0 cell in this table shows the two coders agreeing that this category was absent in 83
articles. Its 0-1 and 1-0 cells indicate a total of three cases of one coder identifying this category while
the other did not. And in four cases, one coder noted the absence of this category while the other
abstained from coding the article. The four chance-corrected agreement coefficients for these data are
near zero, suggesting the virtual absence of reliability. Yet the authors' decision criterion suggests otherwise. Unable to accept the data on account of α = -.012, which measures far below .700, the criterion falls back on the %-agreement of 96.5%, which is well above the 90% that Lombard et al. require, and so the authors feel justified in accepting this variable as reliable and report that 1% (or 2/137) of the articles they examined mention "simple agreement" only (2003, p. 471).8
Note that in Figure 4, all of the 96.5% coincidences pertain to absences, the 0s. Regarding the 1s of the variable mentioning "'simple agreement' only," which the authors report as their findings, the two coders do not agree at all, not even once! The 1-1 cell in Figure 4 is completely empty. And in the three cases in which one coder identifies "'simple agreement' only," the other does not. If the %-agreement measure were allowed to go down to 90%, the number of mismatches could triple without shaking the authors' confidence in the reliability of the reported finding. Eighty-six out of 137
units of analysis is a decent reliability sample, but could one trust a claim that the 137 articles in the
data contained two mentions of this category when coders cannot agree on even one? In the calculation
of reliability, large numbers of absences should not overwhelm the small number of occurrences that
authors care to report.9 Without a single concurrence and three mismatches, the report of finding 2 out
of 137 cases is about as close to chance as one can get – and this is borne out by the near-zero values of all the chance-corrected agreement coefficients.
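Running the Figure 4 table through the dichotomous sketch given after Figure 2 (my own code, not the authors') makes this contrast explicit.

# Figure 4, with coder C in the role of Figure 1's coder A (columns) and coder J as coder B (rows).
fig4 = dichotomous_coefficients(a=83, b=1, c=2, d=0, N_A=87, N_B=89, N_AB=86)
# Roughly: Ao = .965, CR = .943, S = .930, while pi, alpha, kappa, and beta all fall
# slightly below zero -- agreement on the one category that the authors report is absent.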
For Lombard et al., this case was not an oversight. In their Table 1 (2002, p. 592), they reproduce Perreault and Leigh's hypothetical 2-by-2 data (1989, p. 139) with very uneven marginal frequencies that yield a chance-corrected agreement of .000 while showing 82% agreement – just to argue for the conservative nature of that coefficient and, by extension, of all chance-corrected agreement measures. The marginal frequencies in the table of Figure 4 are even more uneven. Yet the case that is most striking, and that often mystifies those who hold on to the %-agreement conception, is the one in which all coders use one and the same category for all units of analysis, yielding 100% agreement. Such data can be obtained from broken instruments or from coders who fell asleep or agreed in advance of the coding effort to make their task easy. As suggested in (3), appropriate indices of reliability cannot stop at measuring agreement but must infer the reproducibility of a population of data; and one cannot talk about reproducibility without evidence that it could be otherwise. When all coders use only one category, there is no variation and, hence, no evidence of reliability. In the case of the slightly less extreme data in Figure 4, Lombard et al.'s criterion for
accepting data as reliable clearly fails to warn researchers about significant unreliabilities in data and
induces a false sense of certainty about the conclusions drawn from these data when they actually are
indistinguishable from chance events. Their criterion for accepting data as reliable does not separate
the wheat from the chaff. The use of %-agreement should be actively discouraged, especially as a
fallback criterion. Instead, I recommend that only chance-corrected agreement coefficients that satisfy
(2) and (3) be used for inferring the reliability of data.
Because agreement coefficients are averages over the categories in a variable, which allows
unreliable categories to hide behind reliable ones, I am suggesting that reliabilities be obtained for all
distinctions that matter. To state proportions of frequencies, the distinctions between these categories
and their complements need to be reliable. If differences in frequencies of two categories are to be
reported, the two categories must be reliably distinguishable. Overall agreement measures applied to a
multi-category variable do not provide such assurances. For a simple numerical example, consider one of Lombard et al.'s variables, the 20th, recording whether articles report reliability figures (2003, p. 470). It recorded data in three categories: 0=No; 1=Yes, together with findings; and 2=Yes, separately,10 and measures α = .686. This borderline measure should signal doubt. The data for this variable are reproduced in the left table of Figure 5.
Figure 5
Reliability Data on Whether Article Reports Reliability Figures and Two Distinctions
(columns: Coder J; rows: Coder C)

Whole variable:
  Categories:     0     1     2
         0       80     0     1 | 81
         1        1     0     1 |  2
         2        0     0     3 |  3
                 81     0     5 | 86

1st Distinction (0 versus 1&2):
                  0   1&2
         0       80     1 | 81
       1&2        1     4 |  5
                 81     5 | 86

2nd Distinction (1 versus 2, over units both coders placed in 1 or 2):
                  1     2
         1        0     1 | 1
         2        0     3 | 3
                  0     4 | 4
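How such per-distinction reliabilities can be obtained is sketched below – again my own illustration, assuming that the two distinctions are formed by collapsing, respectively restricting, the categories of the left table; the function implements the two-coder nominal form of α.

def nominal_alpha_two_coders(table):
    """Nominal alpha for two coders from a square contingency table of jointly coded units."""
    N = sum(sum(row) for row in table)                     # jointly coded units
    n = 2 * N                                              # values in the reliability data
    matches = sum(table[i][i] for i in range(len(table)))
    n_c = [sum(table[i]) + sum(row[i] for row in table)    # both coders' totals per category
           for i in range(len(table))]
    return 1 - (n - 1) * 2 * (N - matches) / (n * n - sum(x * x for x in n_c))

whole_variable     = [[80, 0, 1], [1, 0, 1], [0, 0, 3]]  # categories 0, 1, 2; alpha comes to about .686
first_distinction  = [[80, 1], [1, 4]]                   # 0 versus 1&2, over all 86 units
second_distinction = [[0, 1], [0, 3]]                    # 1 versus 2, over the units both coders placed in 1 or 2
for name, t in [("whole variable", whole_variable), ("1st distinction", first_distinction),
                ("2nd distinction", second_distinction)]:
    print(name, round(nominal_alpha_two_coders(t), 3))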
Recommendations
Let me conclude with four recommendations for establishing the reliability of given data,
measured by the degree to which a coding process is reproducible with different coders, elsewhere, and
under conditions that should not affect the results:13
(i) Reliability data, the sample of data from which the trustworthiness of a population of data is to be
inferred, have to be generated by coders that are widely available, follow explicit and
communicable instructions (a data language), and work independently of each other. Reliability
data must be representative of the data whose reliability is in question (not of the population of
ultimate research interest); and the more coders participate in the process and the more common they are, the more likely they can ensure the reliability of the data. Coders must be interchangeable and may code different subsamples of data, provided there is enough duplication or overlap.
(ii) A decisive agreement coefficient should measure agreements within multiple descriptions,
regardless of numbers and kinds of coders. Its values should be indicative of the likelihood that
conclusions drawn from imperfect data are valid beyond chance. For two coders, large sample sizes, and nominal data, π is such a coefficient. When data are ordered, it is advantageous to select a coefficient that responds to the information in their metric (scale characteristic or level of measurement) but assumes no more than is warranted by the data in hand. α can handle multiple coders; nominal, ordinal, interval, ratio, and other metrics; missing data; and small sample sizes (a minimal computational sketch follows these recommendations). Content analyses that assess reliability in terms of any association coefficient, Pearson's r, for example, Benini's (1901) β, Cohen's (1960) κ, Cronbach's (1951) alpha, Goodman and Kruskal's (1954) λr (lambda r), and %-agreement should be rejected as these measures are incompatible with
reliability concerns in content analysis. For any other measure and when in doubt, the
mathematical structures of proposed indices should be examined for their ability to shed light on
the reproducibility of the data making process. Unsubstantiated claims should be questioned.
(iii) An acceptable level of agreement below which data are to be rejected as too unreliable must be
chosen depending on the costs of drawing invalid conclusions from these data. When human lives
hang on the results of a content analysis, whether they inform a legal decision or tip the scale from
peace to war, decision criteria have to be set far higher than when a content analysis is intended to
merely support scholarly arguments. In the case of the latter, to be sure that the data under consideration are at least similarly interpretable by other scholars (as represented by different coders), I suggested elsewhere to require α ≥ .800 and, where tentative conclusions are still acceptable, α ≥ .667 (Krippendorff, 2004, p. 241).14 Except for perfect agreement, there are no magical numbers, however. The ones suggested here should be verified by suitable experiments.
To ensure that the measured agreement is representative of the data in question, confidence
intervals should be consulted. Testing the null-hypothesis that observed agreement deviates from
chance has no bearing on reliability, which concerns deviations from perfect agreement or 1.000.
(iv) All distinctions that matter should be tested for their reliability. Where a system of several
variables is intended to support a conclusion (e.g., as in an index, a regression equation, or any
multi-variate analysis), the reliability of each variable should be measured and the smallest among
them should be taken as the reliability of the whole system. Averaging the agreement measures of
several variables, especially when they include easily coded clerical ones, can easily mislead
researchers about the reliability of variables that matter. This logic applies to individual categories
as well. Where differences in frequencies of the categories of a variable influence the conclusions
of a research effort (e.g., in reports on differences, changes, or proportions – as exemplified in
Lombard et al. [2003]), the reliability of each distinction should be tested and the smallest one
should be taken as the reliability of the whole variable. This may not be required when a
subsequent analysis concerns variances (e.g., in tests concerning correlations or associations),
which are averages, just as measures of agreement of multi-category variables are. After data have
been generated, reliability may be improved by discarding unreliable distinctions, recoding or
lumping categories or dropping variables that do not meet the criterion adopted in (iii). Resolving
disagreements by majority among three or more coders may make researchers feel better about
their data, but does not affect the measured reliability (Krippendorff, 2004, p. 219).
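As a minimal computational sketch of the nominal case referred to in recommendation (ii) – my own code, built on the usual coincidence-matrix construction, with None standing for a value a coder did not supply; it illustrates the calculation and is not the author's software – α for any number of coders and for incomplete data can be obtained as follows:

from collections import Counter

def nominal_alpha(units):
    """Nominal Krippendorff's alpha. `units` is a list of units; each unit lists the category
    values assigned by the coders, with None where a coder did not code that unit."""
    coincidences = Counter()                 # o_ck: coincidences of values c and k within units
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:                            # units with fewer than two values carry no information
            continue
        for i, c in enumerate(values):       # every ordered pair of values from different coders
            for j, k in enumerate(values):
                if i != j:
                    coincidences[(c, k)] += 1 / (m - 1)
    n = sum(coincidences.values())           # total number of pairable values
    n_c = Counter()                          # marginal totals per category
    for (c, _k), o in coincidences.items():
        n_c[c] += o
    matched = sum(o for (c, k), o in coincidences.items() if c == k)
    D_o = (n - matched) / n                                           # observed disagreement
    D_e = (n * n - sum(x * x for x in n_c.values())) / (n * (n - 1))  # expected disagreement
    return 1 - D_o / D_e                     # undefined when all values fall into one category:
                                             # no variation, hence no evidence of reliability

# Hypothetical usage with three coders and one missing value:
# nominal_alpha([["a", "a", "a"], ["a", "b", None], ["b", "b", "b"]])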
I commend Lombard et al. (2002, 2003) for bringing the sad state of reliability testing to the attention of content analysts. The above criticism is directed less at the authors than at the literary
practices in communication research. As a critical scholar, I defend the principle of encouraging
multiple voices to speak through a text. However, when it comes to discussing mathematical objects,
such as agreement measures and their use as indices of the reliability of data, mathematical proofs and
demonstrations should speak louder than majority opinions, even when published in respectable
journals. Quoting from the work of other scholars does not absolve us of the responsibility to investigate and judge what we are reproducing.
References
Benini, R. (1901). Principii di Demographia. Firenze: G. Barbera. No. 29 of Manuali Barbera di
Scienze Giuridiche Sociali e Politiche.
Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited response
questioning. Public Opinion Quarterly, 18, 303-308.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives.
Educational and Psychological Measurement, 41, 687-699.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20, 37-46.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-
334.
Craig, R. T. (1981). Generalization of Scott's index of intercoder agreement. Public Opinion
Quarterly, 45, 260-264.
Fleiss, J. L. (1975). Measuring agreement between two judges on the presence and absence of a trait.
Biometrics 31, 651-659.
Fleiss, J. L. (1978). Reply to Klaus Krippendorff‟s “Reliability of binary attribute data.” Biometrics,
34, 144.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley & Sons.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of
the American Statistical Association, 49, 732-764.
Holley, W., & Guilford, J. P. (1964). A note on the G-index of agreement. Educational and
Psychological Measurement, 24, 749-754.
Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-
Wesley.
Hughes, M. A., & Garrett, D. E. (1990). Intercoder reliability estimation – Approaches in marketing: A
generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-
195.
Janson, S., & Vegelius, J. (1979). On generalizations of the G index and the phi coefficient to nominal
scales. Multivariate Behavioral Research, 14, 255-269.
Krippendorff, K. (1970). Bivariate agreement coefficients for reliability data. In E. R. Borgatta & G.
W. Bohrnstedt (Eds.), Sociological methodology 1970 (pp. 139-150). San Francisco, CA: Jossey
Bass.
Krippendorff, K. (1978). Reliability of binary attribute data. Biometrics, 34, 142-144.
Krippendorff, K. (2004). Content analysis: An introduction to its methodology. Second Edition.
Thousand Oaks, CA: Sage.
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication
research: An assessment and reporting of intercoder reliability. Human Communication
Research, 28, 587-604.
Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2003). Correction. Human Communication
Research, 29, 469-472.
Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. British
Journal of Psychiatry, 116, 651-655.
Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Osgood, C. E. (1959). The representational model and relevant research. In I. de Sola Pool (Ed.),
Trends in content analysis (pp. 33-88). Urbana: University of Illinois Press.
Pasadeos, Y., Huhman, B., Standley, T., & Wilson, G. (1995, May). Applications of content analysis in
news research: A critical examination. Paper presented to the annual convention of the
Association for Education in Journalism and Mass Communication, Washington, DC. Cited in
M. Lombard, J. Snyder-Duch, & C. C. Bracken (2002). Content analysis in mass communication
research: An assessment and reporting of intercoder reliability. Human Communication
Research, 28, 587-604.
Perreault, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments.
Journal of Marketing Research, 26, 135-148.
Potter, W. J., & Levine-Donnerstein, D. (1999). Rethinking validity and reliability in content analysis.
Journal of Applied Communication Research, 27, 258-284.
Riffe, D., & Freitag, A. (1996, August). Twenty-five years of content analyses in Journalism & Mass
Communication Quarterly. Paper presented to the annual convention of the Association for
Education in Journalism and Mass Communication, Anaheim, CA, cited in D. Riffe, S. Lacy, &
F. G. Fico (1998). Analyzing media messages. Mahwah, NJ: Erlbaum.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion
Quarterly, 19, 321-325.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103, 347-387.
Endnotes
1. The authors used a beta version of the software package PRAM, an acronym of "Program for Reliability Assessment with Multiple-coders" (Skymeg Software, 2002), also described by Neuendorf (2002, pp. 241-242), to calculate %-agreement, π, and κ, and separate unpublished software to calculate α (Lombard et al., 2002, p. 596).
2. Lombard et al. (2002, p. 590) claim that %-agreement can also be computed for any number of coders without explaining how this could be accomplished. The aforementioned software PRAM includes a feature to average pairwise %-agreements. This average of averages cannot express patterns of disagreement that inevitably arise when multiple coders are involved and becomes of dubious validity when coders code different sets and numbers of units.
3. For example, Perreault and Leigh argue that "in most marketing research studies (and in many other areas of applied research), there is no a priori knowledge of the likely distribution or responses" (1989, p. 139), much as I said in (2), but they then proceed to define expected disagreement as in S, in terms of the number of available categories, which is equivalent to assuming categories to be uniformly distributed.
4. Since β is less familiar than the other coefficients, I offer this definition: β = [Σi pii - Σi pAipBi] / [Σi min(pAi, pBi) - Σi pAipBi], where, in a contingency table, i is a generic category, pii is the proportion of pairs of matching categories i, and pAi and pBi are the marginal sums for category i used by coders A and B respectively.
5. It might be noted that Cohen (1960), probably unfamiliar with Benini's β, discussed a ratio κ/κmax (p. 43), which equals β. I have not seen it used, however.
6. On December 12, 2002, Matthew Lombard kindly made the authors' data available to me and in return received the recalculations of α.
7. In a footnote to the original table, Lombard et al. write, "Holsti's method is not reported because it is identical to Scott's pi in the case of two coders evaluating the same units" (2002, p. 598). In the revised table, the authors replaced "Scott's pi" by "percent agreement" (2003, p. 471), which makes this statement a mathematical possibility, but one that is not borne out by their data. Figure 4 shows reliability data in which one coder categorized NC = 87 articles, another categorized NJ = 89 articles, and both categorized NCJ = 86 articles, rendering Osgood's coefficient (Holsti's CR) = Ao · 2NCJ/(NC + NJ) = .965 · (2 · 86)/(87 + 89) = .943.
8. The original table reports 2% (Lombard et al., 2002, p. 579). I do not know what prompted this revision.
9. Arguably, 99% is a large proportion and 1% is a small one. Considering small errors, say ±1%, 99±1% still defines a large proportion with a relatively small error, but 1±1% refers to a small proportion with a relatively large error. Thus, a range between 0% and 2% seems more severe than a range between 98% and 100%.
10. http://astro.temple.edu/~lombard/carman.htm, accessed in January 2003.
11. Lombard et al. are not explicit about the 6% (8 articles) they report as containing reliability information (2003, p. 470). I presume, however, that it refers to categories 1 and 2 lumped together, in which case the proper reliability should have been computed with data on the 1st distinction, not for the whole variable, and reported as α = .739, not as α = .686 – widespread practice notwithstanding.
12. PRAM, op. cit.
13. These recommendations do not agree with Lombard et al.'s (2002) guidelines 2, 4, parts of 8 and 9, or with the common practice of calculating average reliabilities for multi-category variables of which the frequencies and proportions (%) of individual categories are reported, but particularly not with the criterion they have adopted in accepting their own findings as reliable (p. 596; pp. 600-602).
14. These standards were suggested for α, and the experiments that led to them concerned α only. Other coefficients may require different standards. Setting standards for all coefficients alike, even discussing them as if that made sense, glosses over their mathematical differences and the assumptions that go into their construction. This would apply also to conceptualizing agreement coefficients on a conservative/liberal continuum according to the numerical results they produce, as discussed above.