Bsi 18
The psychometric structure of the Brief Symptom Inventory-18 (BSI-18; Derogatis, 2001) was investigated
using Mokken scaling and parametric item response theory. Data of 487 outpatients, 266 students,
and 207 prisoners were analyzed. Results of the Mokken analysis indicated that the BSI-18 formed a
strong Mokken scale for outpatients and prisoners, indicating strong unidimensionality. For students,
only the depression and anxiety items formed a medium Mokken scale. Parametric item response theory
analyses showed that the best discriminating items came from the depression and anxiety subscales.
Keywords: Brief Symptom Inventory, Mokken scaling, item response theory, psychological distress
This article was published Online First January 31, 2011.
Rob R. Meijer and Rivka M. de Vries, Department of Psychometrics and Statistics, University of Groningen, Groningen, the Netherlands; Vincent van Bruggen, Dimence, Almelo, the Netherlands.
Correspondence concerning this article should be addressed to Rob R. Meijer, University of Groningen, Department of Psychometrics and Statistics, Grote Kruisstraat 2/1, 9712 TS Groningen, the Netherlands. E-mail: r.r.meijer@rug.nl
The Brief Symptom Inventory-18 (BSI-18; Derogatis, 2001) is a widely used self-report questionnaire that measures general psychological distress. It is the briefest and latest version in a series of instruments designed by Derogatis (Derogatis, 1983; Derogatis & Melisaratos, 1983; Derogatis, 2001) to measure general distress. The questionnaire consists of 18 descriptions of physical and emotional complaints; respondents are asked to indicate on a scale from 0 (not at all) through 4 (very much) to what extent they are troubled by the complaints. Table 1 shows the item content.
The BSI-18 is a shortened version of the BSI, which consists of 53 items distributed over nine subscales: Somatization, Obsessive-Compulsive Disorder, Interpersonal Sensitivity, Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation, and Psychoticism. Piersma, Boes, and Reaume (1994), among others, found that responses to all items can be described by a unidimensional model. The BSI was reduced to the BSI-18 to decrease the average completion time and to improve its structural validity (Derogatis, 2001). According to Derogatis (2001), the structural validity has improved because the reduced scale is composed of only three dimensions (namely, somatization, depression, and anxiety), which together are more homogeneous than other dimensions from previous instruments, both conceptually and empirically. Each subscale of the BSI-18 contains six items from the corresponding subscale of the BSI. A total score over all items can be calculated representing general distress, which is highly correlated with the total score from the BSI (r > .90; Andreu et al., 2008; Dura et al., 2006; Galdon et al., 2008).
Although some authors (e.g., Piersma et al., 1994) have claimed that one dimension underlies the item scores of the BSI, Derogatis (2001) and several researchers claim that the BSI-18 has a multidimensional structure. Piersma et al. (1994) administered the complete BSI to 217 adults and 188 adolescents at admission and discharge from a private psychiatric hospital. Principal components factor analysis revealed that most variance between dimension scores was accounted for by one unrotated factor. In contrast, Galdon et al. (2008), using a sample of 175 breast cancer patients, found three dimensions underlying the BSI-18. These dimensions corresponded to the hypothesized subscales somatization, depression, and anxiety. The same structure was found by Dura et al. (2006) and Recklitis et al. (2006). Dura et al. (2006) investigated 114 patients with temporomandibular disorders. Recklitis et al. (2006) investigated a sample of 14,193 adult survivors of childhood cancer. The mean scores on the items varied between .12 (Item 17, suicidal thoughts) and .77 (Item 6, feeling tense). Thus, the general distress level was low in this sample. Andreu et al. (2008) also reported multidimensionality but preferred a four-dimensional structure, where the anxiety dimension was split into a general anxiety dimension and a panic dimension. The sample in the Andreu et al. (2008) study consisted of 200 outpatients with psychological symptomatology. Fifty-two percent of the outpatients were recruited from a private psychology clinic, whereas 48% came from public services. Fifty-five percent were women and the remaining 45% were men. The majority of the sample (53.4%) were diagnosed with anxiety disorders; 32% were diagnosed with major depression disorders. The mean item scores varied between .94 (Item 1, dizziness) and 2.51 (Item 3, feeling blue) and were considerably higher than in the Recklitis et al. (2006) study. According to Andreu et al. (2008), the four-dimensional structure may be specific for patients with psychiatric disorders, in contrast to patients with medical problems for whom the threat caused by the medical condition might homogenize their experience of fear.
In the studies cited above, a multidimensional structure was preferred to a unidimensional structure based on model fit. However, when we consider the fit indices reported in these studies, differences between unidimensional and multidimensional models
194 MEIJER, DE VRIES, AND VAN BRUGGEN
Table 1
Item Descriptives, Item-Total Correlation With Subscale (rs), Item-Total Correlation With Total Scale (rt), Rho, Scale H, Item H
Value Within Subscale (His), and Item H Value Within Total Scale (Hit) in the Sample With Clinical Patients
were not always compelling. For example, Recklitis et al. (2006) found a three-factor structure when conducting a confirmatory factor analysis. But when fitting a hierarchical model with depression, anxiety, and somatization as first-order factors and general distress as a second-order factor, they found very high correlations between the first-order factors and the second-order factor. Somatization correlated r = .98 with general distress; depression and anxiety correlated r = .79 and r = .74 with general distress, respectively. These high correlations suggest that there is a strong general dimension here. In sum, from the existing literature it is unclear whether the BSI-18 items form a unidimensional scale or whether different scales can be distinguished.
Thus far, the dimensionality of the BSI-18 has been investigated using models based on classical test theory (CTT) and factor analysis. In the present study we first investigated the dimensionality of the BSI-18 using Mokken scaling (Sijtsma & Molenaar, 2002), which is a nonparametric item response theory (IRT; Embretson & Reise, 2000) technique. Second, we used a parametric IRT model to obtain more detailed information about the quality of the items across different psychological distress levels.
IRT is a collection of statistical models that can be used to evaluate and construct psychological tests and questionnaires. Although there are similarities between IRT, CTT, and factor analysis, IRT has a number of advantages when evaluating the psychometric quality of a scale (see, e.g., Egberink & Meijer, 2010; Reise & Waller, 2009; Santor & Ramsay, 1998). An important advantage of IRT is that to judge the quality of an item, one can obtain the item information function, which shows how much psychometric information (a number that represents an item's ability to differentiate between people) the item provides at each trait level (such as psychological distress). Different items can provide different amounts of information in different ranges of a given latent trait. Item and scale information are analogous to CTT's item and test reliability. An important difference, however, is that under an IRT framework, information (precision) can vary depending on where an individual falls along the trait range, whereas in CTT, the scale reliability (precision) is assumed to be the same for all individuals, regardless of their raw-score levels. As some authors have discussed (Recklitis et al., 2006), it is important for studies to verify that the BSI-18 is sensitive to change in a population so that it can monitor clinical course or the outcome of an intervention. When measuring change, it is thus important to have knowledge about the measurement precision conditional on the latent trait.
Furthermore, we extend the literature by investigating the psychometric structure in a sample of outpatients with anxiety and unipolar and bipolar disorders, a sample of students, and a sample of prisoners. When a scale is applied to populations with different characteristics, the psychometric properties of the scale may vary. Most studies on the BSI-18 have been conducted with medical samples, in particular with samples of cancer patients (Dura et al., 2006; Galdon et al., 2008; Recklitis et al., 2006). An exception is Andreu et al. (2008), who analyzed data of a sample consisting of outpatients with psychiatric disorders.
Our aim is to discover (a) how BSI-18 items are functioning in different populations, (b) which items have the strongest relation to the constructs being measured, and (c) whether we can scale persons on the basis of the 18 items of the BSI-18.
Method
Samples
The first sample consisted of 487 outpatients with anxiety and depression disorders. Fifty-three percent were diagnosed with an anxiety disorder and 37% with a unipolar or bipolar disorder; for the remaining patients, the disorder was unknown. The sample was 62.5% female and 37.5% male. Mean age was 34.4 (SD = 10.9). Data were obtained
PSYCHOLOGICAL DISTRESS 195
as part of a psychological assessment and treatment program in the east of the Netherlands. The items were used as a screening instrument for patients before starting treatment in a psychological clinic. Treatments that the patients received were mainly based on cognitive behavioral approaches. Note that this sample has some similarities with the sample analyzed by Andreu et al. (2008) and that it consists of persons for whom the BSI-18 is often used.
A second sample consisted of 266 psychology students (29% male, 71% female; mean age = 22.2, SD = 4.1) who filled out the BSI-18 for screening purposes. These students were not selected on the basis of reporting psychological problems or on the basis of elevated distress levels. Instead, the BSI-18 was used to screen for students with potential psychological problems. Because we expect that in this population the general distress level is lower than the distress level in medical or clinical samples, this sample is useful to obtain information about the psychometric structure for populations with low general distress levels.
A third sample consisted of 207 prisoners (94% male, 6% female; mean age = 34.10, SD = 9.5). Their self-reported ethnicity was 51% African descent, 25% White, 4% Hispanic, 3% Asian; for the remaining prisoners, ethnicity was unknown. Data were collected at different prisons in the Netherlands as part of forensic research. All testing was done by forensic psychologists at the various institutions, and all the prisoners were tested on intake.
Each sample was analyzed separately. We did not combine samples because that would lead to misleading results. Waller (2008) showed that reliability coefficients and related indices (such as the H values we use) are severely biased when samples are commingled, that is, when they are drawn from multiple populations. In most cases the estimates are inflated. Furthermore, results based on samples from two or more populations (e.g., combined community and clinical samples) will yield a mean on the latent trait that is difficult to interpret (cf. Reise & Waller, 2009).
Item Response Theory
For dichotomous items, unidimensional IRT is based on the assumption that a person's performance on a test item can be predicted by the interplay between a latent trait theta (θ) and the item characteristics, such as item discrimination and item difficulty (e.g., Hambleton, Swaminathan, & Rogers, 1991). The relationship between item performance and the trait level theta can be described by a monotonically increasing function, which is called the item characteristic function, the item characteristic curve, or the item response function (IRF). Let Pi(θ) be the probability of a positive response (i.e., a correct answer or the agreement with a specific statement) on item i for a given level of θ. Then the core assumption states that when the trait level theta increases, the probability of a positive item response Pi(θ) also increases. For polytomous items, this assumption is made at the level of item steps, which are the transitions from one answering category to the next. For example, subjects choosing Category 2 on a 5-point scale (scored 0 through 4) have a score of 1 on the first two item steps (from 0 to 1 and from 1 to 2) and a score of 0 on the second two item steps (from 2 to 3 and from 3 to 4). A distinction can be made between parametric and nonparametric IRT models.
Parametric IRT. Parametric IRT models describe the relationship between the probability of a positive response and the latent trait level (the IRF) by means of a parametric (e.g., logistic) function. An often-used model for dichotomous items is the two-parameter logistic model (Embretson & Reise, 2000). This model contains, in addition to the latent trait parameter theta, two parameters representing item characteristics. One parameter is the item location or item difficulty βi, which is the location on the theta scale for which Pi(θ) = .5. The other parameter is the discrimination parameter or αi parameter. The αi parameter is the steepness of the IRF at the item difficulty level. In practice, αi ranges from 0 (flat IRF) to 3 (steep IRF). Items with a larger αi parameter are more useful for separating examinees near a trait level.
For polytomous items, extensions of the models for dichotomous items have been developed. An extension of the two-parameter logistic model is the graded response model of Samejima (1969) for ordinal answering categories, which models the probability that an examinee responds in a particular answering category n. The model contains a discrimination parameter and a number of location parameters (denoted βn) equal to the number of answering categories minus 1. As before, the discrimination parameter reflects the strength of the relationship between the item and the latent trait. The location parameter for a specific category n is the location on the theta scale for which the probability of scoring in this category or higher is .5. Together the location parameters reflect the spacing of the answering categories on theta.
In addition to the difficulty and discrimination parameters, a useful concept in describing the quality of the items is the item information. The item information is inversely related to the standard error of measurement (the standard error equals the reciprocal of the square root of the information), so more measurement error results in less information. In IRT the measurement error and information depend on theta, which is different from the single estimate of measurement error in CTT, which is assumed to be equal across different values of theta. The information an item provides about a person is higher when the item difficulty βi is close to theta and when αi is high. Once a valid model has been constructed, it can be used to estimate theta for specific persons on the basis of their test scores. Examples of applications of parametric IRT modeling can be found in, for example, Lambert et al. (2003), who investigated the Youth Self-Report; Teresi et al. (2000), who investigated the Comprehensive Assessment and Referral Evaluation diagnostic scale; and Emons, Meijer, and Denollet (2007), who investigated a questionnaire measuring Type D personality.
Nonparametric IRT. In contrast to parametric models, nonparametric models do not fully determine the IRFs (Hambleton, Swaminathan, & Rogers, 1991). Examples of nonparametric IRT models are the Mokken models (Sijtsma & Molenaar, 2002). The least restrictive Mokken model is the monotone homogeneity model (MMH model), which only requires that the relationship between Pi(θ) and θ is monotonically nondecreasing. That is, if for two persons a and b it holds that θa < θb, then it should also hold that Pi(θa) ≤ Pi(θb). A more restrictive Mokken model is the double monotonicity model, where the additional assumption of nonintersecting IRFs is made (e.g., Meijer, 2010).
Mokken models do not offer estimates of parameters like αi and βi, nor do they allow for point estimates of theta. However, several measures can be used to obtain an idea about the quality of the scale, such as the item proportion correct score reflecting item difficulty and scalability coefficients (H) reflecting discrimination power. In addition to the scale level, H is also defined at the item(step)-pair level (Hij) and item level (Hi) and can be expressed
in terms of observed versus expected number of Guttman errors or in terms of observed versus maximal possible covariance between items (for exact formulas, see, e.g., Sijtsma & Molenaar, 2002, pp. 51-58). For the interpretation of H, Sijtsma and Molenaar (2002, p. 60) give the following guidelines. The scale H should be above .3 for the items to form a scale. When .3 ≤ H < .4, the scale is considered weak; when .4 ≤ H < .5, the scale is considered medium; and when H ≥ .5, the scale is considered strong. In addition, although point estimates of theta are not possible, an estimated ordering of subjects by their theta values is possible using the number-correct score. Examples of applications of Mokken scaling in the typical performance domain can be found in, for example, Meijer and Baneke (2004), who showed the usefulness of Mokken scaling to analyze the MMPI depression scale; Moorer, Suurmeijer, Foets, and Molenaar (2001), who applied Mokken scaling to the Rand-36; and Meijer, Egberink, Emons, and Sijtsma (2008), who discussed the use of Mokken scaling to identify atypical response behavior.
Analysis
First, we used Mokken models to investigate the psychometric structure of the data. These models are excellent tools for a first exploration of the psychometric structure of test and questionnaire data. In contrast to parametric IRT models, Mokken models do not specify the exact relation between endorsing an item and the latent trait level; as a result, they are less restrictive to empirical data than parametric models. Mokken scale analyses were performed using the computer program Mokken Scale Analysis for Polytomous Items (MSP5.0; Molenaar & Sijtsma, 2000).
We started with running the TEST option in MSP5.0. This is a procedure where the researcher specifies which items form a scale. The three subscales as defined in the literature (somatization, depression, and anxiety) were analyzed separately, as well as together forming the total scale that measures general distress. The main focus was on the H values of the different scales and the total scale. Also, a reliability coefficient rho (ρ) was estimated for each scale, which is an unbiased estimate rather than a lower bound like Cronbach's alpha (Moorer et al., 2001).
In addition, the SEARCH option was applied, which is an exploratory procedure searching for unidimensional scales within a specified set of items. The procedure starts with the item pair consisting of the items i and j with the largest pairwise H value, Hij (alternatively, the researcher can specify a start set of items). In the next step, one of the remaining items is selected that (a) correlates positively with each of the items already selected in the scale, (b) has an Hi with respect to the selected items significantly larger than 0 and also larger than a prespecified value c, and (c) maximizes the total H of the scale. This step is repeated until none of the items left meet the criteria for selection. Then the procedure starts again, now applied on the remaining items, if any, until no items are left. Some items may not reach the criteria for any scale and are left out. Thus c is a constant, and often c = .3 is used. The higher c is, the more confidence we have in the ordering of persons according to their total score.
The SEARCH procedure is useful for investigating the dimensionality of the scale. Sijtsma and Molenaar (2002, pp. 81-82) give the following guidelines for determination of the dimensionality. For unidimensional scales, the typical results with increasing c are (a) most or all items are on one scale, (b) one smaller scale is found, and (c) one or a few small scales are found and several items are excluded. In multidimensional scales, the typical results with increasing c are (a) most or all items are on one scale, (b) two or more scales are formed, and (c) two or more smaller scales are formed and several items are excluded. A strength of this procedure is that it removes most of the items that do not satisfy the MMH model; depending on the choice of the lower bound, it removes items that hardly contribute or contribute only modestly to the dimensional structure of the data. A drawback may be that this search algorithm is not a formal test of the MMH model. Sometimes an item may be rejected that shows a few local decreases in the IRF or has an increasing but relatively flat IRF.
Second, the graded response model was estimated using MULTILOG7 (Thissen, Chen, & Bock, 2003). The graded response model was estimated to obtain item and person parameters and to estimate the information curves for the subscales and total scale.
Results
Descriptive Statistics
Tables 1, 2, and 3 present the item means with their standard deviations, together with the item-total correlations for the subscales and the total scale (the item and scale H values are also presented in the table and will be discussed below) for the clinical, prisoner, and student samples, respectively. The items are clustered according to their theoretical assignment to the different dimensions in earlier research (e.g., Derogatis, 2001). Table 4 presents the intercorrelations between the subscales and the total scale scores for the three samples.
Clinical sample. In the clinical sample, the mean total score for the complete scale (which is also referred to as the Global Severity Index) equaled M = 21.27 (SD = 16.02). Compared with the mean item scores found in medical samples (cancer patients), which are generally below 1 (Dura et al., 2006; Galdon et al., 2008; Recklitis et al., 2006), the average item scores in this sample tended to be higher. That is, the subjects were more distressed. Except for one item (suicidal thoughts), all depression and anxiety items had means between 1.07 and 1.78. This is lower, however, than the average item scores found by Andreu et al. (2008) in a psychiatric sample, which ranged from 1.27 through 2.54 with two exceptions. Subscale reliabilities were high; all rho values were .86 or higher. The rho estimate of the total scale equaled .95. As shown in Table 1, both the item-total correlations for the subscales and the total scale were high. High correlations were also observed between the subscales and between the subscales and the total scale (Table 4, upper panel). Correlations between the subscales ranged from .63 to .76, and correlations between subscales and the total scale ranged from .88 to .92.
Prisoner sample. The mean total score for the complete scale was lower than in the clinical sample (M = 16.34, SD = 14.20). The rho estimate equaled .94. As shown in Table 2, the mean item scores on the somatization items were lower than on the depression and anxiety items. The item-total correlations for each subscale and for the total scale were high, although somewhat lower than for the clinical sample. Interesting was that the only item on which the prisoners scored higher than the patients
Table 2
Item Descriptives and Item and Scale Statistics in the Sample With Prisoners
in the clinical sample was Item 5 ("Feeling lonely"). Intercorrelations between the scales were comparable with the clinical sample (Table 4, middle panel).
Student sample. The mean total score equaled M = 8.41 (SD = 7.83, ρ = .88) and was comparable with the mean score found in the Recklitis et al. (2006) study using cancer survivors (their mean score equaled 6.18 for the total sample). This may seem surprising, but Recklitis et al. (2006) already discussed that the low distress level in their study may be due to "improved coping skills or social supports . . . over time this develops into a form of resilience, making survivors less prone to psychological symptoms under subsequent stress" (p. 29). It is interesting that the item-total correlations for each subscale were, in general, somewhat lower than in the clinical and prisoner samples and that the item-total correlations for the total scale were considerably lower than in the clinical and prisoner samples, especially for the somatization items. Also, intercorrelations between the scales were lower than for the clinical and prisoner samples (Table 4, lower panel). In the samples there were few missing values (between zero and four per item). We used multiple imputation (Van Ginkel, Van der Ark, & Sijtsma, 2007) to obtain item scores for these items.
Table 3
Item Descriptives and Item and Scale Statistics in the Sample With Students
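The scale H reported in these tables can be read as a ratio of observed to maximal possible inter-item covariance. The following is a minimal sketch of that ratio, assuming complete polytomous item scores and taking the maximal covariance of an item pair as the covariance obtained when both score vectors are sorted (the largest value the fixed marginals allow). The function names `covariance`, `scale_h`, and `label_scale` and the toy data are hypothetical; this is not the MSP5.0 implementation, and it omits the significance tests that program performs.

```python
from itertools import combinations


def covariance(a, b):
    """Sample covariance of two equally long score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)


def scale_h(rows):
    """Scale H: summed observed covariances over summed maximal covariances."""
    columns = list(zip(*rows))  # one tuple of scores per item
    observed = 0.0
    maximal = 0.0
    for a, b in combinations(columns, 2):
        observed += covariance(a, b)
        # Sorting both items pairs low with low and high with high,
        # which gives the maximum covariance for these marginals.
        maximal += covariance(sorted(a), sorted(b))
    return observed / maximal


def label_scale(h):
    """Rule-of-thumb labels: .3 to .4 weak, .4 to .5 medium, .5 and up strong."""
    if h < 0.3:
        return "no scale"
    if h < 0.4:
        return "weak"
    if h < 0.5:
        return "medium"
    return "strong"


# Toy data (made up): five respondents, three items scored 0 through 4.
toy = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 2, 1), (4, 4, 3)]
h = scale_h(toy)
print(round(h, 3), label_scale(h))
```

With perfectly consistent (Guttman-like) data the observed and maximal covariances coincide and H equals 1; reversing one item drives H negative.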
Table 6
Estimated Item Parameters (SD) for the Graded Response Model (Clinical Sample)
Item α β1 β2 β3 β4
Som 16 2.43 (0.27) 0.48 (0.10) 0.21 (0.10) 0.80 (0.12) 1.52 (0.16)
Som 7 1.77 (0.23) 0.25 (0.11) 0.54 (0.13) 1.08 (0.17) 1.80 (0.26)
Som 13 1.66 (0.22) 0.20 (0.11) 0.85 (0.17) 1.36 (0.22) 2.09 (0.30)
Som 1 1.65 (0.22) 0.37 (0.12) 0.99 (0.17) 1.62 (0.25) 2.82 (0.46)
Som 10 1.72 (0.26) 0.08 (0.12) 0.97 (0.18) 1.65 (0.24) 2.65 (0.42)
Som 4 1.60 (0.22) 0.03 (0.12) 0.90 (0.17) 1.52 (0.23) 2.55 (0.42)
Dep 8 2.71 (0.30) 1.18 (0.11) 0.31 (0.09) 0.30 (0.09) 0.92 (0.13)
Dep 2 2.57 (0.28) 0.70 (0.09) 0.22 (0.09) 0.61 (0.11) 1.21 (0.17)
Dep 5 2.54 (0.21) 1.14 (0.16) 0.07 (0.12) 0.57 (0.15) 1.26 (0.22)
Dep 14 2.17 (0.25) 0.85 (0.12) 0.01 (0.10) 0.61 (0.12) 1.25 (0.16)
Dep 11 2.01 (0.24) 0.66 (0.11) 0.28 (0.11) 0.89 (0.14) 1.49 (0.22)
Dep 17 1.37 (0.21) 0.43 (0.15) 1.66 (0.29) 2.61 (0.42) 3.25 (0.59)
Anx 6 3.26 (0.35) 1.25 (0.10) 0.34 (0.07) 0.31 (0.08) 0.96 (0.13)
Anx 3 2.06 (0.22) 1.23 (0.13) 0.08 (0.11) 0.57 (0.11) 1.53 (0.19)
Anx 18 2.67 (0.29) 0.77 (0.10) 0.09 (0.09) 0.09 (0.09) 1.39 (0.16)
Anx 12 2.46 (0.28) 0.38 (0.09) 0.36 (0.10) 0.92 (0.14) 1.61 (0.18)
Anx 9 2.57 (0.27) 0.54 (0.09) 0.40 (0.10) 0.90 (0.14) 1.40 (0.17)
Anx 15 1.61 (0.21) 0.60 (0.13) 0.57 (0.14) 1.18 (0.19) 1.95 (0.29)
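To make the meaning of one row of such a table concrete, the sketch below turns a discrimination parameter and four location parameters into category probabilities under the graded response model: the probability of scoring in category n or higher is a logistic function of theta, and each category probability is the difference between adjacent boundary curves. The function names and the parameter values are illustrative assumptions (magnitudes loosely patterned on the table, with signs assumed); they are not the estimates reported here, and this is not the MULTILOG estimation routine.

```python
import math


def boundary_probs(theta, alpha, betas):
    """P(score >= n) for n = 0..m+1; fixed at 1 for n = 0 and 0 past the top."""
    inner = [1.0 / (1.0 + math.exp(-alpha * (theta - b))) for b in betas]
    return [1.0] + inner + [0.0]


def category_probs(theta, alpha, betas):
    """P(score == n) as the difference of adjacent boundary probabilities."""
    star = boundary_probs(theta, alpha, betas)
    return [star[n] - star[n + 1] for n in range(len(betas) + 1)]


# Hypothetical item: high discrimination, locations spread around theta = 0.
alpha = 2.7
betas = [-1.2, -0.3, 0.3, 0.9]  # must be increasing

for theta in (-1.0, 0.0, 1.0):
    probs = category_probs(theta, alpha, betas)
    print(theta, [round(p, 3) for p in probs])
```

At theta equal to the first location parameter, the probability of scoring in the lowest category is exactly .5, which matches the definition of the location parameter given in the Method section.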
imply that the depression and anxiety items had a stronger relationship with the general distress trait level than the somatization items. For comparison, when the items are ordered according to the item H values instead of the estimated alpha values, also nine out of the 10 items with the highest item H values are depression and anxiety items.
Because αi is related to the item information (a larger αi implies more information), the depression and anxiety items tend to provide more information about general distress than the somatization items. To illustrate this, consider the item information curves for three items in Figure 1. On the x axis the estimated latent trait values are given in standard score form. The mean of the latent trait (estimated θ = 0) reflects the mean of this specific clinical population. Thus all IRT indices must be interpreted relative to the metric in the calibration sample. As can be seen in Figure 1, Item 6 from the anxiety scale ("feeling tense") provided much more information (with a peak above 3) than Item 4 from the somatization scale ("pains in chest," with a peak below 1). Because the information is inversely related to the standard error of measurement, items that provide little information did not add much to the reliability of the scale.
For most items with relatively high estimated alpha parameters, the item location parameters and thus the item information were symmetrical around estimated θ = 0 (see Table 6). For example, for the item "I feel blue" (Item 8), the item location ranged between −1.18 and 0.92; and for the item "feeling tense" (Item 6), the item location ranged between −1.25 and 0.96. That is, the standard error of the trait estimate is similar for persons with θ = −1 and θ = 1. This contrasts with the findings in many clinical studies (see Reise & Waller, 2009) that the majority of items have
Figure 2. Distributions of total scale information and standard error of measurement in the clinical sample.
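The link between the two curves in such a figure is that total scale information is the sum of the item information functions and the standard error of the trait estimate is the reciprocal of its square root. A minimal sketch under the graded response model follows; the function names and the item parameters are hypothetical and chosen only for illustration, not taken from the estimates in this study.

```python
import math


def item_information(theta, alpha, betas):
    """Graded response model item information at a given theta."""
    # Boundary curves P(score >= n), padded with 1 below and 0 above.
    star = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta - b))) for b in betas] + [0.0]
    # Slope of each boundary curve with respect to theta (0 at both pads).
    slope = [alpha * p * (1.0 - p) for p in star]
    info = 0.0
    for n in range(len(betas) + 1):
        prob = star[n] - star[n + 1]
        if prob > 0.0:  # skip categories whose probability underflows to 0
            info += (slope[n] - slope[n + 1]) ** 2 / prob
    return info


def standard_error(theta, items):
    """SE of the trait estimate: 1 / sqrt(sum of item informations)."""
    total = sum(item_information(theta, alpha, betas) for alpha, betas in items)
    return 1.0 / math.sqrt(total)


# Hypothetical scale: one highly and one weakly discriminating item.
items = [
    (2.7, [-1.2, -0.3, 0.3, 0.9]),
    (1.4, [0.4, 1.7, 2.6, 3.2]),
]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(standard_error(theta, items), 3))
```

For a dichotomous item the expression reduces to the familiar two-parameter form, alpha squared times P times (1 − P), which peaks at theta equal to the item location.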
structure above a unidimensional structure based on model fit. Although differences may be due to different types of samples (cancer patients in earlier research vs. clinical, prisoner, and student samples in the present study), we think that at least part of the difference may be explained by the way results are interpreted. In general, sets of items will always show multidimensionality to some extent, and a multidimensional model will therefore always fit better than a unidimensional model. Our scale analysis, however, showed that all 18 items formed a strong Mokken scale. This is in line with earlier studies that investigated the dimensionality of the BSI consisting of 53 items (e.g., Piersma et al., 1994).
Of the four studies cited above, only two studies estimated a unidimensional model in addition to a three- and four-dimensional model (Galdon et al., 2008; Recklitis et al., 2006). Results of Recklitis et al. (2006) showed that the fit of the unidimensional model was only slightly worse than the fit of the multidimensional model, and the fit indices did not even reach the criteria levels. Galdon et al. (2008) did not find such satisfactory results for the unidimensional model, but neither did they for the three- and four-dimensional models. Only after improving the three-dimensional model by including the correlations between some errors did the fit become acceptable. However, improving the unidimensional model in this way might have resulted in a satisfactory model as well. Thus, on the basis of a slightly different interpretation of the results of previous studies and the results of the present study, we conclude that there is a strong common factor underlying all items of the BSI-18, at least for the clinical and the prisoner sample.
Another argument for why we are skeptical about forming subscales is that subscales are often so unreliable compared to composite scores that the composite scores often better predict the true score on a subscale than the subscale score itself. Sinharay, Puhan, and Haberman (2010) showed, using results from operational and simulated data, that diagnostic scores have to be based on a sufficient number of items and have to be sufficiently distinct from each other to be worth reporting and that several operationally reported subtest scores are actually not worth reporting. Emons, Sijtsma, and Meijer (2007) also showed that the classification consistency using short scales (at most 15 items) is at most 50%.
Another interesting finding in the clinical and prisoner samples was that the overall construct of psychological distress, defined by
combination of depression, anxiety, and somatization, it is more of a depression/anxiety scale.
Results for the student sample showed that the depression and anxiety items form one scale and that the somatization items do not discriminate between students' total scores. It is thus important to realize that when the BSI-18 is used as a screening tool in a population with a low distress level, such as in a general population or in a student population, the scale characteristics may be different from those in a population with a high distress level.
Some authors claim that although high intercorrelations may exist between several factors, as we found in the present study, the scale may still be multidimensional because factors may have a different pattern of correlations with other measures relevant for a particular patient group and that these patterns of correlations should be identified. We do not think that this is a very fruitful strategy. Any two items that are not perfectly correlated must correlate with an external variable differently. If we were to follow the strategy of looking at different correlation patterns, scale analysis would be irrelevant. Instead, we advocate a strategy where in order for a measure's external correlates to be meaningful, a coherent latent structure must be identified first. To the extent that this is not the case (such as for the somatization items in the student sample), it is challenging to fully understand the sources of the BSI-18 score variation. The strong relation between the BSI-18 scales and their theoretical structure does have practical significance; it indicates that the scales can be meaningfully interpreted in the population of prisoners and clinical patients as a general distress scale. Note that this is concordant with the original idea by Derogatis (2001) that the BSI-18 is constructed to be more homogeneous than earlier instruments. The results also emphasize that one should be careful in clinical practice not to overemphasize the difference between depression, anxiety, and somatization subscale scores, because subtest scores are highly related.
We hypothesize that relations between subtest scores and other variables will not show large differences and that much of the variance explained in the subscores is due to the underlying factor of psychological distress (for a related discussion, see the remarks made by Reise & Waller, 2009, with respect to the bifactor model). Although clinicians and educational researchers use diagnostic scores on subtests as added value to the total score, we fully agree with Sinharay et al. (2010) that these scores are only useful when they provide a more accurate measure of the construct being measured (e.g., depression or algebra) than is provided by the total
the somatization, depression, and anxiety items, is in particular score (psychological distress or content knowledge of mathemat-
defined by the depression and anxiety items. The somatization ics). Therefore, future research regarding convergent and discrim-
items are clearly less related to the overall construct of psycho- inant validity of the BSI-18 may take the general factor of psy-
logical distress. The parametric analysis showed that nine out of 10 chological distress into account (Brouwer, Meijer, Weekers, &
best discriminating items all came from the depression and anxiety Baneke, 2008) .
subscales, and most of the worst discriminating items came from
the somatization subscale. For example, the estimated discrimina-
tion parameter of the anxiety item feeling tense was twice as References
high as the estimated discrimination parameter of the somatization
item pains in heart or chest. This difference was also reflected in Andreu, Y., Galdon, M. J., Dura, E., Ferrando, M., Murgui, S., Garca, A.,
& Ibanez, E. (2008). Psychometric properties of the Brief Symptoms
the item information curves.
Inventory18 (BSI-18) in a Spanish sample of outpatients with psychi-
Thus, although our scale analysis showed that each of the three
atric disorders. Psicothema, 20, 844 850.
a priori scales is a strong scale, when the scales are combined to Brouwer, D., Meijer, R. R., Weekers, A. M., & Baneke, J. J. (2008). On the
form one scale to measure the overall construct psychological dimensionality of the Dispositional Hope Scale. Psychological Assess-
distress, the depression and anxiety items mostly define the scale. ment, 20, 310 315. doi:10.1037/1040-3590.20.3.310
Although theoretically psychological distress is claimed to be a Derogatis, L. R. (1983). SCL90 R administration, scoring, and proce-
202 MEIJER, DE VRIES, AND VAN BRUGGEN
dures manual (2nd ed.). Baltimore, MD: Clinical Psychometric Re- Moorer, P., Suurmeijer, Th. P. B. M., Foets, M., & Molenaar, I. W. (2001).
search. Psychometric properties of the RAND-36 among three chronic diseases
Derogatis, L. R. (2001). Brief Symptom Inventory (BSI)-18: Administra- (multiple sclerosis, rheumatic diseases and COPD) in the Netherlands.
tion, scoring and procedures manual. Minneapolis, MN: NCS Pearson. Quality of Life Research, 10, 637 645. doi:10.1023/A:1013131617125
Derogatis, L. R., & Melisaratos, N. (1983). The Brief Symptom Inventory Piersma, H. L., Boes, J. L., & Reaume, W. M. (1994). Unidimensionality
(BSI): An introductory report. Psychological Medicine, 13, 595 605. of the Brief Symptom Inventory (BSI) in adult and adolescent inpatients.
doi:10.1017/S0033291700048017 Journal of Personality Assessment, 63, 338 344. doi:10.1207/
Dura, E., Andreu, Y., Galdon, M. J., Ferrando, M., Murgui, S., Poveda, R., s15327752jpa6302_12
& Jimenez, Y. (2006). Psychological assessment of patients with tem- Recklitis, C. J., Parsons, S. K., Shih, M., Mertens, A., Robison, L. L., &
poromandipular disorders: Confirmatory analysis of the dimensional Zeltzer, L. (2006). Factor structure of the Brief Symptom Inventory-18
structure of the Brief Symptoms Inventory 18. Journal of Psychosomatic in adult survivors of childhood cancer: Results from the Childhood
Research, 60, 365370. doi:10.1016/j.jpsychores.2005.10.013 Cancer Survivor Study. Psychological Assessment, 18, 2232. doi:
Egberink, I. J. L., & Meijer, R. R. (2010). An item response theory analysis 10.1037/1040-3590.18.1.22
of Harters self-perception profile for children or why strong clinical Reise, S. P., & Havilund, M. G. (2005). Item response theory and the
scales should be distrusted. Assessment. Advance online publication. measurement of clinical change. Journal of Personality Assessment, 84,
doi:10.1177/1073191110367778 228 238. doi:10.1207/s15327752jpa8403_02
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychol- Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical
ogists. Mahwah, NJ: Erlbaum. measurement. Annual Review of Clinical Psychology, 5, 27 48. doi:
Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2007). On the consistency 10.1146/annurev.clinpsy.032408.153553
of individual classification using short scales. Psychological Methods, Samejima, F. (1969). Estimation of latent ability using a response pattern
12, 105120. doi:10.1037/1082-989X.12.1.105 of graded scores (Psychometric Monograph No. 17). Iowa City, IA:
Emons, W. H. M., Meijer, R. R., & Denollet, J. (2007). Negative affectivity Psychometric Society.
and social inhibition in cardiovascular disease: Evaluating type-D person- Santor, D. A., & Ramsay, J. O. (1998). Progress in the technology of
ality and its assessment using item response theory. Journal of Psychoso- measurement: Applications of item response models. Psychological
matic Research, 63, 2739. doi:10.1016/j.jpsychores.2007.03.010 Assessment, 10, 345359. doi:10.1037/1040-3590.10.4.345
Galdon, M. J., Dura, E., Andreu, Y., Ferrando, M., Murgui, S., Perez, S., Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item
& Ibanez, E. (2008). Psychometric properties of the Brief Symptom response theory. Thousand Oaks, CA: Sage.
Inventory-18 in a Spanish breast cancer sample. Journal of Psychoso- Sinharay, S. Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic
matic Research, 65, 533539. doi:10.1016/j.jpsychores.2008.05.009 scores in educational testing: Temptations, pitfalls, and some solutions.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamen- Multivariate Behavioral Research, 45, 553573. doi:10.1080/
tals of item response theory. Newbury Park, CA: Sage. 00273171.2010.483382
Lambert, M. C., Schmitt, N., Samms-Vaughan, M. E., An, J. C., Fair- Teresi, J. A., Kleinman, M., Ocepek-Welikson, K., Ramirez, M., Gurland,
clough, M., & Nutter, C. A. (2003). Is it prudent to administer all items B., Lantigua, R., & Holmes, D. (2000). Applications of item response
for each child behavior checklist cross-informant syndrome? Evaluating theory to the examination of the psychometric properties and differential
the psychometric properties of the youth self-report dimensions with item functioning of the comprehensive assessment and referral evalua-
confirmatory factor analysis and item response theory. Psychological tion dementia diagnostic scale among samples of Latino, African Amer-
Assessment, 15, 550 568. doi:10.1037/1040-3590.15.4.550 ican, and white non-Latino elderly. Research on Aging, 22, 738 773.
Meijer, R. R. (2010). A comment on Watson, Deary, and Austin (2007) and doi:10.1177/0164027500226007
Watson, Roberts, Gow, and Deary (2008): How to investigate whether Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG (Version 7)
personality items form a hierarchical scale? Personality and Individual [Computer software]. Lincolnwood, IL: Scientific Software Interna-
Differences, 48, 502503. doi:10.1016/j.paid.2009.11.004 tional.
Meijer, R. R., & Baneke, J. J. (2004). Analyzing psychopathology items: A van Ginkel, J. R., van der Ark, A. L., & Sijtsma, K. (2007). Multiple
case for nonparametric item response theory modeling. Psychological imputation of items scores in test and questionnaire data, and influence
Methods, 9, 354 368. doi:10.1037/1082-989X.9.3.354 on psychometric results. Multivariate Behavioral Research, 42, 387
Meijer, R. R., Egberink, I. J. L., Emons, W. H. M., & Sijtsma, K. (2008). 414.
Detection and validation of unscalable item score patterns using item Waller, N. G. (2008). Commingled samples: A neglected source of bias in
response theory: An illustration with Harters self-perception profile for reliability analysis. Applied Psychological Measurement, 32, 211223.
children. Journal of Personality Assessment, 90, 227238. doi:10.1080/ doi:10.1177/0146621607300860
00223890701884921
Molenaar, I. W., & Sijtsma, K. (2000). MSP5 for Windows: A program for Received September 18, 2009
Mokken scale analysis for polytomous items (Version 5.0) [Users man- Revision received July 2, 2010
ual]. Groningen, the Netherlands: IecProGAMMA. Accepted July 9, 2010