Measurement Errors
Some systematic and random errors may occur during measurement (Table 1). Of interest to clinical trials are the strategies to reduce performance bias (additional therapeutic interventions preferentially provided to one of the groups) and to limit information and detection bias (ascertainment or measurement bias) by masking (blinding) [9]. Masking is a process whereby people are kept unaware of which interventions have been used throughout the study, including when the outcome is being assessed. Patient/clinician blinding is not always practical or feasible, for example in trials comparing surgery with non-surgical management, diets, or lifestyles. Finally, measurement error can occur in the statistical analysis of the data.
Important elements to specify in the protocol include: definition of the primary and secondary outcome measures; how missing data will be handled (different techniques apply depending on the nature of the data); subgroup (secondary) analyses of interest; consideration of multiple comparisons and the inflation of the type I error rate as the number of tests increases; the potential confounders to control for; and the possible effect modifiers (interaction). The last of these issues has implications for the choice of modeling techniques.
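To make the multiplicity problem concrete, the short Python sketch below (the test counts are arbitrary, chosen only for illustration) computes the family-wise probability of at least one false-positive finding when several independent tests are each run at a nominal significance level of 0.05, along with the per-test level a simple Bonferroni correction would use:

```python
# Sketch: inflation of the type I error rate with multiple comparisons.
# Assumes independent tests; the test counts are illustrative values.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k     # P(at least one false positive)
    bonferroni = alpha / k          # per-test level keeping FWER <= 0.05
    print(f"{k:2d} tests: FWER = {fwer:.2f}, Bonferroni per-test alpha = {bonferroni:.4f}")
```

With 10 independent tests the chance of at least one spurious "significant" result is already about 40%, which is why pre-specifying subgroup analyses and accounting for multiplicity matter.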
The operational criteria applied in the design influence the external and internal validity of the study (Fig. 3). Both construct validity and external validity relate to generalization. However, construct validity involves generalizing from the study to the underlying concept of the study. It reflects how well
the variables in the study (and their relationships) represent the phenomena of interest. For example, how
well does the level of proteinuria represent the presence of kidney disease? Construct validity becomes
important when a complex process, such as care for chronic kidney disease, is being described.
Maintaining consistency between the idea or concept of a certain care program and the operational details
of its specific components in the study may be challenging. External validity involves generalizing
conclusions from the study context to other people, places, or times. External validity is reduced if study
eligibility criteria are strict, or the exposure or intervention is hard to reproduce in practice. The closer the
intended sample is to the target population, the more relevant the study is to this wider, but defined, group
of people, and the greater is its external validity. The same applies to the chosen intervention, control and
outcome including the study context. The internal validity of a study depends primarily on the degree to
which bias is minimized. Selection, measurement, and confounding biases can all affect the internal
validity.
Fig. 3 Structure of study design. The left panel represents the design phase of a study, when
Patient, Intervention, Control and Outcome (PICO) are defined (conceptualization and
operationalization). The right panel corresponds to the implementation phase. Different types of bias
can occur during sampling, data collection and measurement. The extent to which the results in the
study can be considered true and generalizable depends on its internal and external validity. With permission from Ravani et al., Nephrol Dial Transpl [20]
In any study there is always a balance between external and internal validity, as it is difficult and
costly to maximize both. Designs that have strict inclusion and exclusion criteria tend to maximize internal
validity, while compromising external validity. Internal validity is especially important in efficacy trials to
understand the maximum likely benefit that might be achieved with an intervention, whereas external
validity becomes more important in effectiveness studies. Involvement of multiple sites is an important
way to enhance both internal validity (faster recruitment, quality control, and standardized procedures for
data collection, management, and analysis) and external validity (generalizability is enhanced because the participants are recruited from a wider range of settings and populations).
The concepts of clinical relevance and statistical significance are often confused. Clinical
relevance refers to the amount of benefit or harm apparently resulting from an exposure or intervention
that is sufficient to change clinical practice or health policy. In planning study sample size, the researcher
has to determine the minimum level of effect that would have clinical relevance. The level of statistical
significance chosen is the probability that the observed results are due to chance alone. This will
correspond to the probability of making a type I error, i.e., claiming an effect when in fact there is none.
By convention, this probability is usually set at 0.05 (but can be as low as 0.01). The P value or the limits of the appropriate confidence interval (a 95% interval corresponds to a significance level of 0.05, for example) is examined to see whether the results of the study might be explained by chance. If P < 0.05, the null hypothesis of no effect is rejected in favor of the study hypothesis, although it is still possible that the observed results are simply due to chance. However, since statistical significance depends on both the magnitude of the effect and the sample size, trials with very large sample sizes can in theory detect statistically significant effects that are too small to be clinically relevant.
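A minimal sketch can make this last point concrete. The following Python fragment (the difference, standard deviation, and sample sizes are invented purely for illustration) applies a two-sample z test to a fixed, clinically trivial difference of 0.5 mmHg and shows the P value shrinking as the per-group sample size grows:

```python
import math

# Hypothetical two-sample z test: all numbers are invented for illustration.
delta = 0.5      # observed mean difference (e.g., 0.5 mmHg), clinically trivial
sd = 10.0        # common standard deviation in each group
for n in (100, 1_000, 10_000, 100_000):    # per-group sample sizes
    se = sd * math.sqrt(2 / n)             # standard error of the difference
    z = delta / se
    # two-sided P value from the normal approximation
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    print(f"n = {n:>6} per group: z = {z:5.2f}, P = {p:.4f}")
```

At 100,000 patients per group even this trivial difference becomes "highly significant", which is why the size of the effect and its confidence interval, not the P value alone, should drive clinical interpretation.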
Hierarchy of Evidence
Fundamental to evidence-based health care is the concept of “hierarchy of evidence” deriving
from different study designs addressing a given research question (Fig. 4). Evidence grading is based on
the idea that different designs vary in their susceptibility to bias and, therefore, in their ability to predict the
true effectiveness of health care practices. For assessment of interventions, randomized controlled trials
(RCTs) or systematic review of good quality RCTs are at the top of the evidence pyramid, followed by
longitudinal cohort, case–control, and cross-sectional studies, with case series at the bottom [10]. However, the
choice of the study design depends on the question at hand and the frequency of the disease. Intervention
questions ideally are addressed with experiments (RCTs) since observational data are prone to
unpredictable bias and confounding that only the randomization process will control. Appropriately
designed RCTs also allow stronger causal inference about disease mechanisms. Prognostic and etiologic
questions are best addressed with longitudinal cohort studies in which exposure is measured first and
participants are followed forward in time. At least two (and possibly more) waves of measurements over
time are undertaken. Initial assessment of an input–output relationship may derive from case–control studies.
Fig. 4 Examples of study designs. In cross-sectional studies inputs and output are measured
simultaneously and their relationship is assessed at a particular point in time. In case–control studies
participants are identified based on presence or absence of the disease and the temporal direction of
the inquiry is reversed (retrospective). Temporal sequences are better assessed in longitudinal cohort
studies where exposure levels are measured first and participants are followed forward in time. The same occurs in randomized controlled trials (RCTs), where the assignment of the exposure is under the control of the researcher. With permission from Ravani et al., Nephrol Dial Transpl [20]. P Probability (or
risk)
Participants are identified by the presence or absence of disease and exposure is assessed
retrospectively. Cross-sectional studies may be appropriate for an initial evaluation of the accuracy of new
diagnostic tests as compared to a gold standard. Further assessments of diagnostic programs are performed
with longitudinal studies (observational and experimental). Common biases afflicting observational designs are defined in Chapter 3 and discussed in more detail in Chapter 20.
Experimental Designs for Intervention Questions
The RCT design is appropriate for assessment of the clinical effects of drugs, procedures, or care processes, and for definition of target levels in risk factor modification (e.g., blood pressure and lipid levels). A related design, the non-inferiority trial, aims to show that a new treatment is not worse than a standard treatment by more than a pre-specified margin; in one blood pressure example, the acceptable difference between them was no more than 3 mmHg. Non-inferiority trials have been criticized, as imperfections in study execution, which tend to prevent detection of a difference between treatments, actually work in favor of a conclusion of non-inferiority. Thus, in distinction to the usual superiority trial, poorly done studies may falsely support a conclusion of non-inferiority.
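As a sketch of the decision rule involved (all numbers are hypothetical, and the customary one-sided test is represented here by the upper limit of a two-sided 95% confidence interval), non-inferiority is typically claimed only when the confidence limit for the new-minus-standard difference stays below the pre-specified margin:

```python
# Hypothetical non-inferiority check (a larger difference = worse outcome).
margin = 3.0                 # pre-specified non-inferiority margin, e.g., 3 mmHg
diff, se = 1.2, 0.8          # invented estimate of the difference and its SE
upper_95 = diff + 1.96 * se  # upper limit of the two-sided 95% CI
print("non-inferior" if upper_95 < margin else "non-inferiority not shown")
```

Note that biases which dilute the estimated difference toward zero pull the confidence interval toward the "safe" side of the margin, so poor execution can favor a non-inferiority conclusion rather than being conservative, as it would be in a superiority trial.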
Designs for Diagnostic Questions
Direct measurements of the pathologic changes underlying many disorders are often either inaccessible to clinicians or avoided for reasons of cost or risk. Therefore, the relationship between more easily measured phenomena (patient history, physical and instrumental examination, and levels of constituents of body fluids and tissues) and the final diagnosis is an important
subject of clinical research. Unfortunately, even the most promising diagnostic tests are never completely
accurate. Clinical implications of test results should ideally be assessed in four types of diagnostic studies.
Table 4 shows examples from diagnostic studies of troponins in coronary syndromes. As a first step, one
might compare test results among those known to have established disease to results from those free of
disease. Cross-sectional studies can address this question (Fig. 4). However, since the direction of
interpretation is from diagnosis back to the test, the results do not assess test performance. To examine test
performance requires data on whether those with positive test results are more likely to have the disease
than those with normal results [12]. When the test variable is not binary (i.e., when it can assume more
than two values) it is possible to assess the trade-off between sensitivity and specificity at different test
result cutoff points [13]. In these studies it is crucial to ensure independent blind assessment of results of
the test being assessed and the gold standard to which it is compared, without the completion of either
being contingent on results of the other. Longitudinal studies are required to assess diagnostic tests aimed
at predicting future prognosis or development of established disease [12]. The most stringent evaluation
of a diagnostic test is to determine whether those tested have more rapid and accurate diagnosis, and as a
result better health outcomes, than those not tested. The RCT design is the proper tool to answer this type
of question [14].
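To illustrate the sensitivity–specificity trade-off described above, the following Python sketch uses entirely fabricated test values (loosely imagined as a troponin-like assay) and tabulates test performance at several cutoffs; it is an illustration of the idea, not data from any study:

```python
# Fabricated test values (imagine a troponin-like assay); higher = more abnormal.
diseased     = [0.9, 1.4, 2.1, 3.0, 4.2, 5.5]   # results in patients with disease
disease_free = [0.1, 0.3, 0.5, 0.8, 1.2, 2.0]   # results in people without disease

for cutoff in (0.5, 1.0, 2.0, 3.0):
    sensitivity = sum(x >= cutoff for x in diseased) / len(diseased)
    specificity = sum(x < cutoff for x in disease_free) / len(disease_free)
    print(f"cutoff {cutoff:>3}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

Raising the cutoff makes the test more specific but less sensitive; the choice of operating point depends on the relative costs of false positives and false negatives, which is what receiver operating characteristic (ROC) analysis formalizes.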
Maximizing the Validity of Non-experimental Studies
When randomization is not feasible, knowledge of the most important sources of bias is essential to increase the validity of the study. Randomization may be unfeasible for a variety of reasons: study participants cannot be assigned to intervention groups by chance, either for ethical reasons (e.g., in a study of smoking) or because of participant preference (e.g., comparing hemodialysis to peritoneal dialysis); the exposure is fixed (e.g., gender); or the disease is rare and participants cannot be enrolled in a timely manner. When
strategies are in place to prevent bias, results of non-experimental studies may approach those of rigorous
RCTs.
Reporting
Adequate reporting is critical to the proper interpretation and evaluation of any study results.
Guidelines for reporting primary (CONSORT, STROBE, and STARD, for example) and secondary studies (PRISMA) are in place to help both investigators and consumers of clinical research [15–18].
Scientific reports may not fully reflect how the investigators conducted their studies, but the quality of the scientific report is a reasonable marker for how the overall project was conducted. The interested reader is referred to the above referenced citations for more details of what to look for in reports from prognostic, diagnostic, and intervention studies.