Goyal M, Singh S, Sibinga EMS, et al. Meditation Programs for Psychological Stress and Well-Being [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Jan. (Comparative Effectiveness Reviews, No. 124.)

Methods

The methods for this comparative effectiveness review follow the methods suggested in the Agency for Healthcare Research and Quality (AHRQ) “Methods Guide for Effectiveness and Comparative Effectiveness Reviews” (www.effectivehealthcare.ahrq.gov/methodsguide.cfm). The main sections of this chapter reflect the elements of the protocol established for comparative effectiveness reviews; certain methods map to the PRISMA checklist.36 We carried out this systematic review according to a prespecified protocol registered at the AHRQ Web site.37

Topic Development

The Division of Extramural Research of the National Center for Complementary and Alternative Medicine, National Institutes of Health, nominated the topic for this report in a public process. We recruited six Key Informants to provide input on the selection and refinement of the questions for the systematic review. To develop the Key Questions (KQs), we reviewed existing systematic reviews, developed an analytic framework, and solicited input from our Key Informants through email and conference calls. We posted our draft KQs on the Effective Health Care Program Web site for public comment on October 14, 2011. We revised the KQs, as necessary, based on comments.

We drafted a protocol and recruited a multidisciplinary Technical Expert Panel (TEP) that included methods experts, tai chi and qigong experts, and meditation experts. With input from the TEP and representatives from AHRQ, we finalized the protocol. Initially, we planned to include physiologic outcomes and the various movement-based meditation programs. Based on TEP input, we eliminated the biological outcomes, both to limit the scope of this broad review and because a number of these outcomes, such as inflammatory markers, were considered intermediate rather than final outcomes. We also eliminated the movement-based meditation programs because we felt their relevance would be greatest for the physiologic markers. We uploaded the protocol to the Effective Health Care Program Web site on February 22, 2012.

Search Strategy

We searched the following databases for primary studies: MEDLINE®, PsycINFO, Embase®, PsycArticles, SCOPUS, CINAHL, AMED, and the Cochrane Library through October 11, 2011. We developed a search strategy for MEDLINE, accessed via PubMed®, based on medical subject headings (MeSH®) terms and text words of key articles that we identified a priori (Appendix B). We reviewed the reference lists of included articles, relevant review articles, and 20 related systematic reviews to identify articles that the database searches might have missed. Our search did not have any language restrictions. We updated the search in November 2012.

We selected databases after internal deliberation and input from the TEP. We did not include meeting proceedings or abstracts of reports of unpublished studies. We searched clinicaltrials.gov. We evaluated the search strategy by examining whether it retrieved a sample of key articles. We did not limit our searches to any geographic regions. For articles written in languages other than English, we relied on individuals familiar with the language or on the Google Translate Web site to assess whether an article fit our inclusion criteria.38

Study Selection

Two investigators independently screened titles and abstracts and excluded a citation if both investigators agreed that the article met one or more of the exclusion criteria. (Inclusion and exclusion criteria are listed in Table 1, and the Abstract Review Form appears in Appendix C.) We resolved differences between investigators regarding abstract eligibility through consensus.

Table 2. Organization of various scales (instruments or measurement tools) for each Key Question.

Citations promoted on the basis of the title and abstract screen received a second independent screen of the full-text article (Appendix C, Article Review Form). We resolved differences regarding article inclusion through consensus. Paired investigators conducted another independent review of the full-text articles to determine whether they included applicable information and, if so, included them in the full data abstraction (Appendix C, Key Question Applicability Form). We resolved disagreements about the eligibility of an article through discussion between the two reviewers or through adjudication by a third reviewer.

We required that studies report on populations with a clinical condition, either medical or psychiatric. Although meditation programs may have an impact on healthy populations, we limited our evaluation to clinical populations. Since trials examine meditation programs in diverse populations, we defined a clinical condition broadly to include mental health/psychiatric conditions (e.g., anxiety or stress) and physical conditions (e.g., low back pain, heart disease, or advanced age). Additionally, since stress was of particular interest for meditation studies, we also included trials that studied stressed populations even if they did not have a defined medical or psychiatric diagnosis. We excluded studies among the otherwise healthy. We also excluded studies among children or adolescents because meditation instruction for non-adults differs from that for adults, owing to differences in maturity, understanding, and discipline; studies in non-adults would also measure outcomes differently, making a synthesis difficult.

We excluded movement-based techniques that involve meditation due to the confounding effects of the exercise component of those techniques on outcomes (Table 1). To evaluate programs that are more than a brief mental exercise, yet remain broadly inclusive, we defined a meditation program as any systematic or protocolized meditation program that follows a predetermined curriculum. We required that these programs involve at least 4 hours of training with instructions to practice outside the training session.

Table 1. Study inclusion and exclusion criteria.

We included both specific and nonspecific active controlled trials. We defined an active control as any control in which the control group is matched in time and attention to the intervention group. A nonspecific active control matches only time, attention, and expectation, similar to a placebo pill in a drug trial; it is not a known therapy. Examples include “attention control” and “educational control.” A specific active control compares the intervention with another known therapy, such as progressive muscle relaxation.34,35,39,40

We defined any control group that does not match time and attention for the purposes of matching expectation as an inactive control. Examples include wait-list or usual-care controls. We excluded such trials since it would be difficult to assess whether any changes in outcomes were due to the nonspecific effects of time and attention. We excluded observational studies susceptible to confounding and selection biases.

We evaluated the effect of these meditation programs on a range of stress-related outcomes and used the framework from the Patient-Reported Outcomes Measurement Information System (PROMIS) to help guide our categorization of outcomes.41 The PROMIS framework is a National Institutes of Health-sponsored project to optimize and standardize patient-reported health status tools. This framework divides self-reported outcomes into the three broad categories of physical, mental, and social health, and then subdivides these categories further. Our outcomes included negative affect, positive affect, well-being, cognition, pain, and health-related behaviors affected by stress such as substance abuse, sleeping, and eating.41 Based on input from technical experts, we also evaluated the effect of meditation programs on weight, an additional stress-related outcome we deemed important.

We included randomized controlled trials (RCTs) in which the control group was matched in time and attention to the intervention group. The inclusion of such trials allowed us to evaluate the specific effects of meditation programs separate from the nonspecific effects of attention and expectation. Our team thought this was the most rigorous standard for determining the efficacy of the interventions and contributing to the current literature on the effects of meditation. We did not include observational studies because they are likely to have an extremely high risk of bias due to problems such as self-selection of interventions (people who believe in the benefits of meditation or who have prior experience with meditation are more likely to enroll in a meditation program) and use of outcome measures that can be easily biased by participants' beliefs in the benefits of meditation.

Data Abstraction and Data Management

We used DistillerSR (Evidence Partners, 2010) to manage the screening and review process. We uploaded all citations identified by the search strategies to the system. We created standardized forms for data extraction (Appendix C) and pilot tested the forms prior to beginning the data extraction. Reviewers extracted information on general study characteristics, study participants, eligibility criteria, interventions, and outcomes. Two investigators reviewed each article for data abstraction. For study characteristics, participant characteristics, and intervention characteristics, the second reviewer confirmed the first reviewer's data abstraction for completeness and accuracy. For outcome data and risk-of-bias scoring, we used dual and independent review. Reviewer pairs included personnel with both clinical and methodological expertise. We resolved differences between investigators regarding data through consensus.

For each meditation program we extracted information on measures of intervention fidelity including dose, training, and receipt of intervention. We measured duration and maximal hours of structured training in meditation, amount of home practice recommended, description of instructor qualifications, and description of participant adherence, if any. Many of the meditation techniques do not have clearly defined training and certification requirements for instructors. However, when available, we extracted data on whether instructors had specialized training or course certification in the particular meditative technique being assessed.

Since studies provided a variety of measures for many of our KQs, we included any RCT of a meditation program with an active control that potentially applied to any KQ. We then went through each of the papers to identify all the scales (instruments or measurement tools) that could potentially apply to a KQ. We then revised this list and organized the instruments according to relevance for the KQs. We extracted data from instruments with which researchers have broad experience and which they commonly used to measure relevant outcomes. We prioritized instruments that were common across the numerous trials in our review, so as to allow more direct comparisons between trials (Table 2).

We entered all information from the article review process into the DistillerSR database, which we used to maintain the data and from which we exported the data into Excel to prepare the evidence tables.

Data Synthesis

For each KQ, we created a detailed set of evidence tables containing all information abstracted from eligible studies.

Trials used either nonspecific active controls or specific active controls (Table 1, Figure 1). Nonspecific active controls (e.g., education or attention control) control for the nonspecific effects of time, attention and expectation. Comparisons against these controls allow for assessments of the specific effectiveness of the meditation program (above and beyond the nonspecific effects of time, attention, and expectation). This is similar to a comparison against a placebo pill in a drug trial, where one is concerned with the nonspecific effects of interacting with a provider, taking a pill and expecting the pill to work. Specific active controls are therapies (e.g., exercise or progressive muscle relaxation) known or expected to change clinical outcomes. Comparisons against these controls allow for assessments of comparative effectiveness. In a drug trial, this would be similar to comparing one drug against another known drug. Since these study designs using different types of controls would yield quite different conclusions (efficacy vs. comparative effectiveness), we separated them in our analyses.

To display the outcome data, we calculated relative difference-in-change scores (i.e., the change from baseline in an outcome measure in the treatment group minus the change from baseline in the outcome measure in the control group, divided by the baseline score in the treatment group). We calculated this with the formula [(meditation T2 - meditation T1) - (control T2 - control T1)] / (meditation T1), where T1 is the baseline mean score and T2 is the followup mean score. However, many studies did not report enough information to calculate confidence intervals for the relative difference-in-change scores. When we evaluated point estimates and confidence intervals for just the post-intervention or end-of-study differences between groups and compared them with the point estimates for the relative difference-in-change scores at those time points, some of the estimates that did not account for baseline differences appeared to favor a different group (i.e., treatment or control) than the estimates that did account for baseline differences. We therefore used the relative difference-in-change scores to estimate the direction and approximate magnitude of effect for all outcomes. We used the relative difference-in-change graphs to determine consistency; they are not a statistical analysis but a visual way to display the data.
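
To make the calculation concrete, the following is a minimal sketch in Python (the function and variable names are illustrative and are not part of the review's actual tooling):

```python
def relative_difference_in_change(med_t1, med_t2, ctl_t1, ctl_t2):
    """Relative difference-in-change score:
    [(meditation T2 - T1) - (control T2 - T1)] / (meditation T1),
    where T1 is the baseline mean score and T2 is the followup mean score."""
    return ((med_t2 - med_t1) - (ctl_t2 - ctl_t1)) / med_t1

# Hypothetical anxiety scores (lower = better) at baseline (T1) and followup (T2).
score = relative_difference_in_change(med_t1=20.0, med_t2=18.0,
                                      ctl_t1=20.0, ctl_t2=19.5)
print(f"{score:.1%}")  # -7.5%; the sign is reversed later for scales on which lower is better
```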

For the purpose of generating an aggregate quantitative estimate of the effect of an intervention and the associated 95 percent confidence interval, we performed meta-analysis using standardized mean differences (effect sizes) calculated by Cohen's method (Cohen's d).42 For each outcome, we displayed the resulting effect size estimate according to the type of control group and duration of followup. Some studies did not report enough information to be included in meta-analysis. For that reason, we decided to display the relative difference-in-change scores along with the effect size estimates from meta-analysis so that readers can see the full extent of the available data. We used statistical significance of the meta-analytic result to guide our reporting of precision.
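
As an illustration of this step, the following Python sketch computes Cohen's d from summary statistics and pools effect sizes with a simple fixed-effect, inverse-variance model; the names are illustrative, and the review's actual meta-analytic model and software may have differed:

```python
import math

def cohens_d(mean_tx, sd_tx, n_tx, mean_ctl, sd_ctl, n_ctl):
    """Standardized mean difference (Cohen's d) using the pooled standard deviation."""
    sd_pooled = math.sqrt(((n_tx - 1) * sd_tx ** 2 + (n_ctl - 1) * sd_ctl ** 2)
                          / (n_tx + n_ctl - 2))
    return (mean_tx - mean_ctl) / sd_pooled

def d_variance(d, n_tx, n_ctl):
    """Approximate sampling variance of Cohen's d."""
    return (n_tx + n_ctl) / (n_tx * n_ctl) + d ** 2 / (2 * (n_tx + n_ctl))

def fixed_effect_pool(estimates):
    """Inverse-variance pooled effect size and 95 percent confidence interval.
    `estimates` is a list of (d, variance) pairs, one per trial."""
    weights = [1.0 / var for _, var in estimates]
    pooled = sum(w * d for (d, _), w in zip(estimates, weights)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

# Hypothetical example: pool two trials reporting means, SDs, and sample sizes.
d1 = cohens_d(12.0, 5.0, 40, 14.5, 5.5, 42)
d2 = cohens_d(30.0, 9.0, 25, 33.0, 10.0, 27)
pooled, ci = fixed_effect_pool([(d1, d_variance(d1, 40, 42)),
                                (d2, d_variance(d2, 25, 27))])
```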

We calculated point estimates for the relative difference-in-change scores for all outcomes. Since these studies examined short interventions and relatively low doses of meditation, we considered a 5 percent relative difference-in-change score to be potentially clinically significant. In synthesizing the results of these trials, we considered both statistical and clinical significance. Statistical significance was assessed according to study-specific criteria, and we reported p-values and confidence intervals where present. We defined clinical significance as a 5 percent relative difference-in-change.
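
For illustration, using hypothetical scores rather than data from any included trial: if a treatment group's mean anxiety score fell from 20 at baseline to 18 at followup while the control group's fell from 20 to 19.5, the relative difference-in-change would be [(18 - 20) - (19.5 - 20)] / 20 = -1.5/20 = -7.5 percent. Because lower anxiety scores indicate improvement, the sign is reversed (as described below), yielding a 7.5 percent relative improvement favoring the intervention, which exceeds the 5 percent threshold.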

On some scales, improvement corresponds to more positive numbers; on others, to less positive numbers. After calculating the relative difference-in-change scores, we reversed the sign on the scales for which improvement corresponds to less positive numbers, so that all scales show improvement in the positive direction. We oriented the meta-analysis graphs similarly, so that effect sizes are shown according to which treatment arm they favor rather than according to increases or decreases on each scale.
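
A small sketch of this reorientation step (illustrative only; the flag indicating scale direction is an assumed input rather than part of any trial's reporting):

```python
def orient(rel_change, lower_is_better):
    """Reverse the sign for scales on which improvement is a decrease,
    so that positive values always indicate improvement."""
    return -rel_change if lower_is_better else rel_change

print(orient(-0.075, lower_is_better=True))  # 0.075: the anxiety example above, reoriented
```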

During data synthesis, if trials reported on more than one scale for a particular outcome, we prioritized the scale that was most common to all the trials to improve comparability between trials. To arrive at an overall strength of evidence (SOE), we used only one scale per outcome per trial in order to avoid giving extra weight to trials that reported on the same outcome with multiple scales. For this reason, although we describe the various scales reported on by the trials in the text, the graphical displays show only the scale that was compared with other studies to arrive at the SOE. Since many trials reported on the same scale at multiple time points, we provided graphs showing the effects at the end of intervention and at the end of study. Wherever meta-analysis was possible, we separated outcomes by time-point. For most, these were at 2–3 months (post intervention) and beyond 3 months (end of study). We describe relevant changes in outcomes over time in the results, but for purposes of consistency we used the first time-point only for describing the magnitude of change in the SOE tables.

Some trials specified primary and secondary outcomes, while others did not. Since the direction and magnitude of an effect may differ based on whether it is a primary or secondary outcome, we categorized and labeled each outcome as primary or secondary on the difference-in-change graphs. For trials that did not specify primary or secondary outcomes, two reviewers independently assessed whether an outcome was a primary focus of the study or was the outcome on which the population was selected; such outcomes were classified as primary. We resolved any conflicts by consensus.

Although some trials had more than two arms, we report the sample sizes only for the two arms we examined. The numbers reported are the numbers that the trials used to calculate their effects. If a trial had some attrition but imputed data for the missing participants, then we reported those intent-to-treat (ITT) numbers. If a trial did not impute data for the missing participants, we reported the numbers they used to calculate effects. For this reason, our report of the number of participants randomized in each trial may differ from the number of participants the trials reported as randomized.

We combined stress and distress into a single outcome due to the paucity of studies and similarities between these outcomes. For studies that reported on both a stress and a distress scale, we prioritized using the scale that was most common in the group of studies. For the same reasons, we also combined well-being and positive mood into the single outcome of positive affect.43

To analyze the effects of meditation programs on negative affect, we combined one negative affect scale per trial with the others. Since some trials reported on more than one negative affect scale, we prioritized anxiety, then depression, then stress/distress. Anxiety is a primary dimension of negative affect and a common symptom of stress. Anxiety is highly correlated with depressive symptoms, and thus, when more than one measure of negative affect was available, we considered anxiety a good primary marker of negative affect.44 We also conducted a sensitivity analysis by reversing the prioritization order, prioritizing stress/distress over depression, and depression over anxiety. For the large bulk of outcomes, we rated measures as direct measures of that outcome. However, since anxiety, depression, stress, and distress are components of negative affect, we rated them as indirect measures of negative affect. If a direct measure of negative affect was available (e.g., the Positive and Negative Affect Schedule), we used that measure instead of any indirect measures.
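
A minimal sketch of this selection logic (the construct labels and the direct-measure check are assumptions based on the description above, not the review's actual code):

```python
PRIMARY_ORDER = ["anxiety", "depression", "stress/distress"]    # main analysis
SENSITIVITY_ORDER = list(reversed(PRIMARY_ORDER))               # sensitivity analysis

def select_negative_affect_scale(reported, order=PRIMARY_ORDER):
    """Choose one negative affect measure per trial.
    `reported` maps a construct name to the scale a trial reported for it."""
    if "negative affect" in reported:       # a direct measure (e.g., PANAS) takes precedence
        return reported["negative affect"]
    for construct in order:                 # otherwise fall back to the priority order
        if construct in reported:
            return reported[construct]
    return None

# Hypothetical trial reporting two indirect measures: the anxiety scale is selected.
print(select_negative_affect_scale({"depression": "BDI-II", "anxiety": "STAI"}))  # STAI
```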

Assessment of Methodological Quality of Individual Studies

We assessed the risk of bias in studies independently and in duplicate based on the recommendations in the Guide for Conducting Comparative Effectiveness Reviews.45 We supplemented these tools with additional assessment questions based on the Cochrane Collaboration's Risk of Bias Tool.46,47 While many of the tools to evaluate risk of bias are common to behavioral as well as pharmacologic interventions, some items are more specific to behavioral interventions. After discussion with experts in meditation programs and clinical trials, we emphasized four major and four minor criteria in assessing bias of meditation programs. The four major criteria were: matching control for time and attention; description of withdrawals and dropouts; attrition; and blinding of outcome assessors. We considered as minor criteria the description of randomization, allocation concealment, ITT analysis, and credibility evaluation (Table 3).

Table 3. List of major and minor criteria in assessing risk of bias.

Matching controls for time and attention is a prerequisite to matching expectations of benefit. We extracted data on time and attention for both groups. If the control arm received at least 75 percent of the time and attention given to the intervention arm, we gave the study credit for matching. Evaluating credibility is also important, albeit as a followup step. Clearly identifying the number of withdrawals and dropouts is necessary for estimating the role that attrition may play in biasing the results. If attrition was very large (greater than 20 percent), we felt it reflected a potentially large bias and a lower-quality trial. Finally, although double blinding is not possible, single blinding of the data collectors is possible and important in reducing risk of bias. While all studies should clearly describe the randomization procedure rather than simply stating that “participants were randomized,” we felt that some studies, especially older ones, may have conducted appropriate randomization but not reported the procedures in detail. We therefore listed this as a minor criterion. The same applied to ITT analysis; however, if a study stated that it conducted an ITT analysis but did not impute missing data, we did not give it credit for an ITT analysis. Credibility is evaluated by administering a scale that measures participants' expectations of benefit before or during the trial. If credibility scores are similar in both arms of a trial, it suggests that those in the control group had beliefs and expectations of benefit similar to those in the treatment arm. We gave 1 point for this criterion only if the trial specified administration of a measure of credibility.

We assigned 2 points to each major criterion, weighting them more heavily in assessing risk of bias (Table 3), and 1 point to each minor criterion. Studies could therefore receive a maximum of 12 points. If a study met a minimum of three major criteria and three minor criteria (9–12 points), we classified it as having a “low risk of bias.” Studies receiving 6–8 points were classified as having a “medium risk of bias,” and studies receiving 5 or fewer points were classified as having a “high risk of bias.” Under this scoring system, a study that did not meet one major criterion could still be considered low risk of bias if it met enough minor criteria, but a study that failed to meet two major criteria could be graded only as medium or high risk of bias.
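
As an illustration, the scoring and classification rules above can be expressed as follows (a sketch with abbreviated criterion labels; not the review's actual software):

```python
MAJOR_CRITERIA = ["control matched for time and attention",
                  "withdrawals and dropouts described",
                  "attrition 20 percent or less",
                  "outcome assessors blinded"]                 # 2 points each
MINOR_CRITERIA = ["randomization procedure described",
                  "allocation concealment",
                  "intent-to-treat analysis with imputation",
                  "credibility measured"]                      # 1 point each

def risk_of_bias(criteria_met):
    """Score a trial and classify its risk of bias.
    `criteria_met` is the set of criteria the trial satisfies."""
    score = (2 * sum(c in criteria_met for c in MAJOR_CRITERIA)
             + sum(c in criteria_met for c in MINOR_CRITERIA))
    if score >= 9:      # reaching 9 points requires meeting at least three major criteria
        rating = "low risk of bias"
    elif score >= 6:
        rating = "medium risk of bias"
    else:
        rating = "high risk of bias"
    return score, rating

print(risk_of_bias(set(MAJOR_CRITERIA[:3]) | set(MINOR_CRITERIA)))  # (10, 'low risk of bias')
```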

Low risk-of-bias studies had the least bias and we considered the results valid. Medium risk-of-bias studies were susceptible to some bias, but not enough to invalidate the results. High risk-of-bias studies had significant flaws that might have invalidated the results. In addition, if there were other issues with the studies that were not captured by the above criteria, such as significantly greater than 20 percent attrition (e.g. 40 or 50 percent attrition) or significant errors in reporting, we categorized such studies as high risk of bias on a study-by-study basis.

Assessment of Potential Publication Bias

Sometimes studies with positive results for a particular outcome get published while studies with negative results do not, erroneously leading readers to conclude that an intervention has positive effects on a given outcome when it may not. Even when an intervention does have an effect on an outcome, we expect that the distribution of results (by chance) will include null results. When conducting a meta-analysis, a funnel plot allows us to see if the results of the studies were spread in a distribution reflecting what we might expect by chance. It assumes that the largest studies will be near the average, and small studies will be spread on both sides of the average. However, this requires that we have the data to represent the results of each study in a meta-analysis. Anticipating that we might not find enough studies to support a quantitative assessment of publication bias, we conducted a qualitative assessment of publication bias by reviewing all the RCTs of meditation listed in the clinicaltrials.gov registry. We searched for any trials that completed recruitment 3 or more years ago that did not publish results, or that listed outcomes for which they did not report results.48 To assess for selective outcomes reporting, we examined the methods section for all the scales used to measure outcomes and assessed whether the studies had reported results for all of them.
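
To make the funnel-plot concept concrete, the following sketch plots effect size against standard error for a set of entirely hypothetical trials (this is not data from the review, which ultimately relied on a qualitative assessment):

```python
import matplotlib.pyplot as plt

# Hypothetical per-trial effect sizes and standard errors.
effects = [0.35, 0.10, 0.55, 0.25, 0.40]
std_errors = [0.05, 0.20, 0.25, 0.10, 0.15]

plt.scatter(effects, std_errors)
plt.gca().invert_yaxis()                                    # most precise studies at the top
plt.axvline(sum(effects) / len(effects), linestyle="--")    # rough average effect
plt.xlabel("Effect size")
plt.ylabel("Standard error")
plt.title("Funnel plot (hypothetical data)")
plt.show()
```

With no publication bias, small studies should scatter symmetrically around the average; a one-sided gap suggests that some studies may be missing.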

Strength of the Body of Evidence

After synthesizing the evidence, two reviewers graded the quantity and quality of the best available evidence addressing KQs 1–4 by adapting an evidence grading scheme recommended in the “Methods Guide for Effectiveness and Comparative Effectiveness Reviews.”45 In assigning evidence grades, we considered the four recommended domains: risk of bias in the included studies, consistency across studies, precision of the pooled estimate or the individual study estimates, and directness of the evidence.

We derived the risk of bias for an individual study from the algorithm described above. We assessed the aggregate risk of bias of studies and integrated these assessments into a qualitative assessment of the summary risk-of-bias score. Since the studies in our evidence base were at varying risk of bias, we based most aggregate scores on a combination of high, moderate, or low risk-of-bias ratings. Where there was heterogeneity, we prioritized the lowest risk-of-bias studies.

We used the direction of effect of outcomes falling in the same category, irrespective of statistical significance, to evaluate consistency. In evaluating consistency, given the heterogeneity of the studies, we qualitatively considered giving greater weight to low risk-of-bias studies and/or those with large sample sizes when they were accompanied by one or two conflicting studies at high risk of bias. If all the studies in an evidence base showed a similar direction of effect, we rated the evidence base as consistent. We rated evidence from a single study as having unknown consistency.

We assessed the precision of individual studies by evaluating the statistical significance of comparisons through meta-analysis, using confidence intervals or p-values. When meta-analysis was not possible, we prioritized difference-in-change or “group-by-time interaction” confidence intervals or p-values where available. Few of the studies reported effect sizes and 95 percent confidence intervals, and we estimated the confidence intervals for some of the outcomes. If all studies in an evidence base were precise, we rated the evidence base as precise. We designated studies as imprecise when the confidence interval for the effect size crossed the line of no difference. When studies did not report measures of dispersion or variability, we rated the precision as unknown.

We rated the evidence as direct if the intervention was directly linked to the patient-oriented outcomes of interest. We rated the evidence as indirect when studies measured the outcome using scales such as the Penn Alcohol Craving Scale, an impaired response inhibition scale for alcohol use, or attention dot scales, as these are indirect measures of substance use behavior. We conducted internal deliberations to arrive at a consensus on what was direct or indirect. For the large bulk of outcomes, we rated measures as direct measures of that outcome. However, since anxiety, depression, and stress/distress are components of negative affect, we rated them as indirect measures of negative affect. If a direct measure of negative affect such as the Positive and Negative Affect Schedule was available, we used that measure instead of any indirect measures. Similarly, we rated well-being and positive mood as indirect measures of positive affect.

To incorporate multiple domains into an overall grade of the SOE, we used the estimate of the summary risk-of-bias score, directness, consistency, and precision to evaluate an intervention. We used a qualitative approach to incorporating these multiple domains into an overall grade. We initially assigned SOE for all outcomes based on their risk-of-bias ratings. We assigned low risk-of-bias studies a high SOE and vice versa. We rated consistent, precise, and direct evidence from such low risk-of-bias studies as high-grade SOE. We downgraded the SOE when we could not determine consistency (i.e., single study) or when we deemed results inconsistent. We downgraded the SOE when evidence was indirect. Imprecision or unknown precision also led to a downgrade in the SOE (Figure 2).

Figure 2 describes the flow of the algorithm for rating the strength of evidence. We started by assigning a strength of evidence for the group of trials based on their aggregate risk of bias: if trials were generally low risk of bias, we assigned them high strength of evidence; if they were overall medium risk of bias, moderate strength of evidence; and if they were overall high risk of bias, low strength of evidence. Then we looked at the consistency of findings among those trials. If they were consistent, we kept the previous strength of evidence rating; if they were inconsistent, we reduced the strength of evidence by one level (e.g., if the evidence was initially high strength but the findings were inconsistent, we dropped it to moderate strength). Similarly, we looked at the precision of findings: if they were precise, we did not change the rating, but we reduced it by one level if they were imprecise. Lastly, we looked at the directness of measures and made no change if the measures used were direct, and reduced the rating by one level if they were indirect.

Figure 2. Algorithm for rating the strength of evidence.
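
A minimal sketch of this rating algorithm in Python (the mapping from risk of bias to a starting grade and the one-level downgrades follow the description above; allowing the grade to bottom out at "insufficient" is an assumption for illustration):

```python
GRADES = ["insufficient", "low", "moderate", "high"]

def strength_of_evidence(aggregate_risk_of_bias, consistent, precise, direct):
    """Start from the aggregate risk of bias, then downgrade one level for
    inconsistency (or unknown consistency), imprecision, and indirectness."""
    start = {"low": "high", "medium": "moderate", "high": "low"}[aggregate_risk_of_bias]
    level = GRADES.index(start)
    for domain_ok in (consistent, precise, direct):
        if not domain_ok:
            level = max(level - 1, 0)
    return GRADES[level]

# Example: medium risk-of-bias trials with consistent, direct, but imprecise evidence.
print(strength_of_evidence("medium", consistent=True, precise=False, direct=True))  # low
```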

We classified evidence pertaining to KQs 1–4 into four categories: (1) “High” grade, indicating high confidence that the evidence reflects the true effect and that further research is very unlikely to change our confidence in the estimate of the effect; (2) “Moderate” grade, indicating moderate confidence that the evidence reflects the true effect and that further research may change our confidence in the estimate of the effect and may change the estimate; (3) “Low” grade, indicating low confidence that the evidence reflects the true effect and that further research is likely to change our confidence in the estimate of the effect and is likely to change the estimate; and (4) “Insufficient” grade, indicating that evidence is either unavailable or inadequate to draw a conclusion.

We did not incorporate the optional domain of publication bias in the evidence grade. However, if we found qualitative evidence of publication bias, the ultimate conclusions took that into consideration. Thus, low SOE with probable publication bias translated into a very weak conclusion.

Applicability

We assessed applicability separately for the different outcomes for the entire body of evidence, guided by the PICOTS framework as recommended in the “Methods Guide for Effectiveness and Comparative Effectiveness Reviews.”45 One of the potential factors we assessed was intervention fidelity (e.g., duration of structured meditation training, total amount of meditation practice [dose of meditation], subject adherence to meditation, subject proficiency with meditation, instructor qualifications, and study selection criteria for participants). We also assessed the selection process of these studies to evaluate the concern that participants in meditation studies are highly selected, such as trained meditators. In addition, we assessed whether findings were applicable to various ethnic groups or whether the applicability of the evidence was limited by race, ethnicity, or education.

Peer Review and Public Commentary

We invited experts in mind/body medicine and TM, as well as individuals representing stakeholder and user communities to provide external peer review of this comparative effectiveness review; AHRQ and an associate editor also provided comments. The draft report was posted on the AHRQ Web site for 4 weeks to elicit public comment. We addressed all reviewer comments, revising the text as appropriate, and documented everything in a disposition of comments report that we will make available 3 months after AHRQ posts the final comparative effectiveness review on its Web site.
