Digging Into Interpretation of TSH Results

Charna Albert
May 2024—In a time of wellness testing and high rates of levothyroxine prescribing for hypothyroidism, it may also
be time to rethink TSH test result interpretation.

Laboratory testing is more accessible now than it used to be and patients are more involved in their own care.
“You’re trying to give the patient access and ability to take care of their own health. But the double-edged sword is
you can start over-ordering things, and the way we’ve designed lab testing was not made for that,” says Joe El-
Khoury, PhD, D(ABCC), associate professor of laboratory medicine, Yale University School of Medicine, and director
of the clinical chemistry laboratory, Yale New Haven Health.

Commercial providers of thyroid-stimulating hormone assays suggest an upper reference limit of about 4.0 mIU/L
or slightly above. But in the absence of clinical symptoms, a patient with a TSH of 4.0 mIU/L doesn’t necessarily
require treatment.

“Guidelines based on cardiovascular outcome trials advise that treatment of hypothyroidism may not need to be
initiated unless TSH is above 10 mIU/L or significant symptoms of hypothyroidism or lipid profile abnormalities can
be improved,” says Laura Boucai, MD, endocrinologist at Memorial Sloan Kettering Cancer Center. “Evaluating for
thyroid autoimmunity and understanding the true probability that a minimal elevation of TSH indeed represents
disease is critical.”

The clinical symptoms of hypothyroidism are nonspecific, says Joely A. Straseski, PhD, MS, MT(ASCP), D(ABCC),
professor in the Department of Pathology at the University of Utah School of Medicine and section chief for clinical
chemistry, medical director of endocrinology, and co-director of the automated core laboratory at ARUP
Laboratories. “It’s not like there’s one thing that nails this diagnosis, and that’s what makes it difficult. I will
probably have five of the 10 symptoms before I go home today,” jokes Dr. Straseski, a member of the CAP
Accuracy-Based Programs Committee. In the absence of obvious clinical symptoms, physicians may look to flagged results
to make decisions about initiating levothyroxine. “It’s the way our systems work. But if someone is primarily
treating to the flag, if they see there are results outside of what we would expect to see in a healthy individual,
then they are compelled to follow up with treatment.”

TSH is commonly ordered, “and the more you test, the more you find” and see subclinical versions of disease, she
says. “This trigger reaction to something that’s just barely over the upper reference limit is problematic.” Other
tests, like the thyroid antibodies, can help confirm the diagnosis. “There are other tests that have to be considered
in conjunction.”

Subclinical hypothyroidism is defined as an elevated TSH and a free thyroxine (T4) within the normal reference
range. A TSH between the upper reference limit and 10 mIU/L is usually thought to represent subclinical disease
(Ross DS. J Intern Med. 2022;291[2]:128–140). But intermethod biases add to the confusion around diagnosis, Dr.
Straseski says. “You’ll see quite a bit of variation in reference intervals among labs, primarily in the upper
reference limit.” A patient with borderline TSH values may have a result above the upper limit with one test and
below it with another. “The differences between the assays as far as what upper limit is used is confusing to
clinicians.”
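
The biochemical definition above can be sketched as a small classification rule. This is a minimal illustration in Python, not a diagnostic tool. The function name, the default upper reference limit of 4.2 mIU/L (limits vary by assay), and the free T4 limits supplied by the caller are all assumptions made for the example.

```python
def classify_thyroid_result(tsh, free_t4, ft4_low, ft4_high, tsh_upper=4.2):
    """Label a TSH / free T4 pair using biochemical definitions only.

    tsh is in mIU/L; free_t4, ft4_low, and ft4_high share one unit.
    tsh_upper, ft4_low, and ft4_high are assay-specific, hypothetical inputs.
    """
    if tsh <= tsh_upper:
        return "TSH not elevated"
    if free_t4 < ft4_low:
        # Elevated TSH with low free T4 is the overt pattern.
        return "elevated TSH, low free T4 (overt pattern)"
    if free_t4 <= ft4_high:
        # Elevated TSH with free T4 inside its reference range.
        if tsh <= 10.0:
            return "subclinical pattern, TSH at or below 10 mIU/L"
        return "subclinical pattern, TSH above 10 mIU/L"
    return "elevated TSH, high free T4 (atypical, needs review)"


# The same borderline result read against two different upper limits.
for upper in (4.2, 5.0):
    print(upper, classify_thyroid_result(4.5, 1.2, 0.8, 1.8, tsh_upper=upper))
```

With these made-up inputs, a TSH of 4.5 mIU/L is labeled subclinical against a 4.2 limit and not elevated against a 5.0 limit, which is the intermethod confusion Dr. Straseski describes.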

To complicate matters further, Dr. El-Khoury says, TSH is subject to diurnal and seasonal variation, with values
peaking in the winter and at their lowest in the summer in healthy individuals, while free T4 remains relatively
stable (Yamada S, et al. J Endocr Soc. 2022;6[6]:bvac054). “These seasonal changes . . . are not captured by our
reference intervals and may lead to false diagnoses of subclinical hypothyroidism and unnecessary prescriptions of
levothyroxine to euthyroid individuals,” he wrote in a 2023 letter to the editor (El-Khoury JM. Clin Chem.
2023;69[5]:537–538). TSH levels also can rise with nonthyroidal illness, he says.

Dr. Boucai notes that TSH is also known to vary with, among other things, race, body mass index, age, and
whether the patient is a smoker. “Confirming the diagnosis of subclinical hypothyroidism is critical before initiating
lifelong replacement therapy,” she says. “This includes measuring TSH in different seasons, in fasting and fed
states, multiple times before prescribing levothyroxine replacement.”

The data suggest patients are being overprescribed levothyroxine, Dr. El-Khoury says. He points to a 2021 study by
Juan Brito, MBBS, of Mayo Clinic, and others (Brito JP, et al. JAMA Intern Med. 2021;181[10]:1402–1405). Dr. Brito
and his coauthors analyzed insurance claims data linked with laboratory results from patients throughout the U.S.
who were prescribed levothyroxine between 2008 and 2018. In a subset of 58,706 patients with thyrotropin and
FT4 or T4 levels available, Dr. Brito and his coauthors found that levothyroxine was initiated for overt
hypothyroidism in 8.4 percent, for subclinical hypothyroidism in 61 percent, and for patients with normal thyroid
levels in 30.5 percent.

“My concern,” Dr. El-Khoury says, “is that the general population is being treated at such low levels just because
they’re flagging high when in fact there’s nothing wrong with them.” Another problem: The drug’s side effects,
which include heart palpitations and headaches, among others, have been understated, he says. Response to a
public service video he posted on YouTube about thyroid testing and levothyroxine has been large: “Many patients
say their doctors do not believe them when they say they have these symptoms.”

TSH levels tend to rise with age, says James D. Faix, MD, medical director of immunology at Quest Diagnostics and
a member of the CAP Clinical Chemistry Committee. “Many elderly people are being treated for hypothyroidism
because people aren’t aware that the reference interval that’s used generally is too low for elderly individuals,” Dr.
Faix says. “An elderly person with a TSH of six is probably completely euthyroid” (Biondi B, et al. Lancet Diabetes
Endocrinol. 2022;10[2]:129–141). And thyroid hormone replacement can precipitate atrial fibrillation or accelerate
osteoporosis, he says.

In his review of hypothyroidism treatment, Douglas Ross, MD, of Massachusetts General Hospital’s Endocrinology
Division, says epidemiology studies demonstrate increased cardiovascular mortality in patients with subclinical
hypothyroidism. In addition, subclinical hypothyroidism has been associated with measures that correlate with cardiovascular
disease, he writes, including higher lipid levels, increased epicardial adipose tissue, increased carotid intima-media
thickness, and endothelial dysfunction, and treatment improves total and LDL cholesterol levels. Dr. Ross describes it as
“reasonable and necessary,” based on studies, to treat patients with subclinical hypothyroidism if their TSH levels
exceed 7–10 mIU/L. “However, treatment of lower levels of TSH has not been shown to provide a clear benefit,” he
writes, though “increasingly most patients with TSH values flagged as abnormal, because they exceed 4.2–5.0
mIU/L, are treated with thyroid hormone” (Ross DS. J Intern Med. 2022;291[2]:128–140).

In a randomized, placebo-controlled trial of 737 adults (≥ 65) with subclinical hypothyroidism, Stott, et al., found no
apparent change at one year in the Hypothyroid Symptoms score or Tiredness score when TSH levels were lowered
from a mean of 6.4 mIU/L at baseline to 3.63 mIU/L with treatment. Half received placebo with mock dose
adjustment (Stott DJ, et al. N Engl J Med. 2017;376[26]:2534–2544). And in a meta-analysis of 21 randomized
clinical trials that compared treatment of subclinical hypothyroidism with placebo, treatment was not associated
with improvements in general quality of life or thyroid-related symptoms (Feller M, et al. JAMA.
2018;320[13]:1349–1359).

“The problem,” Dr. El-Khoury says, “is that subclinical hypothyroidism is a biochemically defined disease, meaning
that the definition comes purely from the changes in TSH versus T4.” That’s what is being looked to for the
diagnosis, he says. “It’s okay to have biochemical-based definitions, as long as they continue to be relevant and
have not been challenged by new data—which is what I feel is happening here.”

At Dr. El-Khoury’s institution, the clinical laboratory and endocrinology department opted to append a comment to all
results that fall between 4.2, the assay’s suggested upper reference limit, and 10 mIU/L.
The comment explains that TSH is known to naturally increase in winter, with age, and with certain nonthyroidal
illnesses, and recommends retesting patients with mild abnormalities in two to three months. The European
Thyroid Association guideline for managing subclinical hypothyroidism also recommends retesting after a two- to
three-month interval (Ross DS. J Intern Med. 2022;291[2]:128–140). In a study published this year, van der Spoel,
et al., found that in a large proportion of adults 65 and older with mild subclinical hypothyroidism, TSH levels
spontaneously normalized in a median follow-up of one year even after two consecutive measurements of elevated
levels. A third measurement may be recommended in older adults, the authors said, before treatment is
considered (van der Spoel E, et al. J Clin Endocrinol Metab. 2024;109[3]:e1167–e1174).
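
A reporting rule of the kind described above might look something like the following Python sketch. The comment wording is a paraphrase written for this example, not the laboratory's actual text, and the function name, the report format, and the choice to include the 10 mIU/L boundary are assumptions.

```python
# Paraphrased comment text; the laboratory's real wording is not given.
RETEST_COMMENT = (
    "TSH is known to increase naturally in winter, with age, and with "
    "certain nonthyroidal illnesses. Consider retesting in two to three months."
)


def report_line(tsh, upper_limit=4.2, comment_ceiling=10.0):
    """Format a TSH result, appending the comment for mild elevations only."""
    line = f"TSH {tsh:.2f} mIU/L"
    # Results above the upper limit and up to the ceiling get the comment.
    if upper_limit < tsh <= comment_ceiling:
        line += " | " + RETEST_COMMENT
    return line


for value in (3.0, 6.4, 12.0):
    print(report_line(value))
```

Only the middle value falls in the commented range; results within the reference interval or above 10 mIU/L are reported without it.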

In Dr. El-Khoury’s view, retesting patients falls short, in part because physicians may not be aware of the guidance
on retesting or overlook the laboratory’s comment and prescribe levothyroxine at first elevation. Then, too, “We’re
only retesting because we don’t have the process set up right from the beginning,” he says. Raising the assay’s
upper limit to 7 mIU/L and establishing it as a clinical decision limit, or the universal point at which treatment
should be considered, would be a better fix, he says.

But big obstacles stand in the way of adopting a universal clinical decision limit for TSH, and the lack of
intermethod agreement is one. It isn’t like glucose or vitamin D or other standardized tests like cholesterol, Dr.
Straseski says. “If your cholesterol is above 200, guidelines say you’re likely going to do something. We all know
that. We don’t ask, ‘Was it a Roche cholesterol? Was it an Abbott cholesterol?’ Clinicians don’t evaluate results like
that, though they likely should if an assay isn’t standardized.”

Katleen Van Uytfanghe, PhD, MSc, postdoctoral fellow in the Ref4U reference laboratory (part of the laboratory of
toxicology) at Ghent University, Belgium, explains that for TSH, a true standardization approach isn’t possible. She
is chair of the International Federation of Clinical Chemistry and Laboratory Medicine Committee for
Standardization of Thyroid Function Tests, which has been chipping away at standardization for TSH and serum
total and free thyroid hormone testing since 2005. “For standardization, we need a primary reference material
and/or a reference measurement procedure,” Dr. Van Uytfanghe says. “For TSH, both are missing.”

There is a WHO international standard that’s used as a primary reference material and calibrator for the current
assays. “But this material is not commutable” with patient samples; it doesn’t behave quite like a native human
sample in an immunoassay (Cowper B, et al. Clin Chem Lab Med. 2024;62[5]:824–829). TSH is a complex molecule
that can undergo many modifications after it’s formed, she explains. “It’s very difficult to make just one material
that would represent that complex mixture of different forms as you encounter it in a human serum sample.” And
there isn’t yet a mass spectrometry-based method for quantification of TSH, she says, because the molecule is so
complex.

In lieu of standardization, the committee has pursued a harmonization approach. They’ve developed a clinical
serum-based reference panel for TSH, traceable to the WHO reference material. The panel includes 100 serum
samples, spanning the full concentration range of TSH. Committee members worked with manufacturers on a
comparison study to prove that the reference panel could be used to recalibrate the commercial assays based on
statistically inferred targets, leading to improved harmonization, and published their proof-of-concept study
(Thienpont LM, et al. Clin Chem. 2017;63[7]:1248–1260). “In an ideal world where there is harmonization, we could
use a common reference interval, and with the publication we wanted to prove that this is possible,” Dr. Van
Uytfanghe says.

In the committee’s follow-up study, published this year, the reported TSH concentrations of the reference panel
samples relative to assay calibrators, across 15 immunoassays, were better harmonized, suggesting that some
manufacturers have made improvements to their assays since the committee made the sample reference panel
available (Cowper B, et al. Clin Chem Lab Med. 2024;62[5]:824–829). “We can see that the difference between the
methods is smaller compared to five years ago,” Dr. Van Uytfanghe says. In Japan, she notes, the regulatory
agencies now require TSH assay harmonization, which may have an indirect influence on manufacturers
internationally.

Assay harmonization could pave the way toward a more uniform reference interval, or what Dr. Van Uytfanghe
calls generalized reference intervals. But “a uniform reference interval does not indicate a one-size-fits-all
reference interval,” she says. “What we can do is come up with reference ranges that should be applicable to
multiple assays should they be harmonized, but it will never be one reference range.” Using one number for the
upper reference limit will not be possible because TSH values depend on too many factors.

Dr. Boucai of Memorial Sloan Kettering notes that “different populations may have their own set points for TSH,” as
with older individuals, for example. “It would be ideal to establish a TSH reference range defined by age, race, BMI,
in specific populations, but this is hard to practically implement,” she says.

If the assays were harmonized, Dr. Van Uytfanghe says, it would be easier to establish reference intervals specific
to these populations. “There are a lot of variables to take into account, which means there’s a lot of work in
establishing these kinds of reference intervals.” But with assay harmonization, “the manufacturers could share the
work, and it would take less effort to come up with region-specific and age-specific reference intervals and so on.”

Harmonization also would make it easier for laboratories to implement recommendations from the medical
literature. Now, it’s an issue when the assay used in the study differs from the assay the laboratory uses. “If that
doesn’t matter, it’s way easier to use someone else’s research for yourself.”

The manufacturers have been partners in the committee’s work on standardization, Dr. Van Uytfanghe says. “It’s
not just some scientists who say it can be done. They all were coauthors. They all cooperated.” That’s been the
case throughout, she says, despite the likelihood that recalibration could be demanding from a regulatory
standpoint. The FDA would probably require new 510(k) clearance. “But then you have Europe, which has a set of
regulations, you have China with different regulations, so for the manufacturers, it’s not easy.”

Even as progress on harmonization marches on, Dr. Van Uytfanghe acknowledges that the TSH problem has no simple fixes.
“There has been a lot of controversy on the upper limit of the TSH reference interval,” she says. “The controversy
is still here.”

Understanding the assay’s history may shed light on the current situation. “The first generation of TSH assays
couldn’t differentiate normal from low,” says Dr. Faix, a former member of the IFCC Committee for Standardization
of Thyroid Function Tests. “Because the focus was on the low end, the upper limit of TSH was not given as much
attention.” The first-generation tests used an upper limit of 10 mIU/L. As sensitivity improved, manufacturers
lowered upper reference limits, typically to around 5 mIU/L.

Dr. Straseski says there have been long-standing questions about how the earliest reference interval studies for
TSH were conducted. The study populations likely included individuals with subclinical hypothyroidism, skewing the
upper limit. “That issue is still plaguing us today, and that was many decades ago, at this point.” And package
inserts tend to provide limited information about how reference populations are derived, Dr. Straseski says. Take
the TSH assay her institution uses. The package insert says it was derived from “a total of 516 healthy test
subjects,” she says, but nothing about how those subjects were ruled in or out. “Much more information needs to
be provided when we’re looking at who the reference population is made up of.”

In 2002, Dr. Faix says, researchers analyzed a nationally representative sample of thyroid function test results
from National Health and Nutrition Examination Survey data, finding that the mean TSH level in the general U.S.
population is about 1.5 mIU/L (Hollowell JG, et al. J Clin Endocrinol Metab. 2002;87[2]:489–499). That finding
precipitated calls to further lower the upper limit, he says, though physicians were concerned about false-positives
and resisted. “You want to detect people who have early hypothyroidism,” he says, “but you don’t want too many
false-positives. You don’t want too many people who are euthyroid and have just a slightly elevated TSH.”

Christopher Naugler, MD, observed this problem firsthand in Alberta, where he is professor in the Department of
Pathology and Laboratory Medicine at the Cumming School of Medicine, University of Calgary. One of two major
laboratories in the province lowered the upper limit of the reference range for TSH from 6 to 4 mIU/L to match that
of the other laboratory, following an administrative directive to harmonize reference ranges among laboratories in
Alberta (Symonds C, et al. CMAJ. 2020;192[18]:E469–E475). “New abnormal TSH results nearly tripled, from 3.3 percent to
9.1 percent,” and new levothyroxine prescriptions increased by about 20 percent, Dr. Naugler says. “Now we have
an entire new cohort of individuals overnight who were believed by their clinicians to be hypothyroid when the only
thing that happened was a minor change in the reference range.”

The endocrinology community in Alberta was not in favor of the change, he says. “They were worried it was going
to create potentially unnecessary endocrinology referrals and an increase in prescription rates and cost to patients,
and that’s exactly what happened.”

Sachin Majumdar Jr., MD, associate professor of medicine, Yale University School of Medicine, and director of the
endocrine neoplasia clinic, Yale New Haven Health, says he sees a fair number of referrals for patients with
subclinical hypothyroidism, but he doesn’t have a large data set with which to determine if the number has risen.
Many of these referrals come from primary care and include patients seen by physician assistants and advanced
practice registered nurses. “In general, we get more referrals from those providers for things like hypothyroidism,”
he says. “In the past, primary doctors would manage that more, but in their absence, other practitioners probably
feel less comfortable.” And “as things become protocolized, people may tend to go more by reference ranges
now.”

In the bigger picture, Dr. El-Khoury says, the way reference intervals typically are derived in laboratory medicine may
not be suited to a test like TSH, which is often ordered in healthy individuals.
With reference intervals based on a central 95 percent of the reference population, “that means two and a half
percent of people who get tested are going to flag as high, with no other problems.”
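
The arithmetic behind that figure is short enough to sketch in Python. The sketch assumes the results excluded from the central interval are split evenly between the low and high tails, and that the people tested resemble the healthy reference population. The function names and the 10,000-person example are illustrative only, and the 99 percent line is included purely for comparison.

```python
def flagged_high_fraction(central_fraction):
    """Fraction of a healthy reference population above the upper limit,
    assuming excluded results are split evenly between the two tails."""
    return (1.0 - central_fraction) / 2.0


def expected_high_flags(n_tested, central_fraction):
    """Expected number of healthy people flagged high out of n_tested."""
    return n_tested * flagged_high_fraction(central_fraction)


for central in (0.95, 0.99):
    share = flagged_high_fraction(central)
    count = round(expected_high_flags(10_000, central))
    print(f"central {central:.0%}: {share:.1%} flag high, about {count} per 10,000")
```

Under these assumptions, a central 95 percent interval flags about 250 of every 10,000 healthy people as high, and a central 99 percent interval flags about 50.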

In an opinion piece in Clinical Chemistry, he and his coauthors argue for reevaluating the 95 percent inclusion
criteria for defining reference intervals, where appropriate (El-Khoury JM, et al. Clin Chem. Published online March
18, 2024. doi:10.1093/clinchem/hvae026). “What we’re calling it is ‘separate approaches,’” he says. For tests
commonly ordered in healthy people, they suggest a central 99 percent spread. This approach could increase
false-negatives but would significantly reduce the false-positive rate, they write. Widening the spread, Dr. El-
Khoury tells CAP TODAY, would keep the same general model while accounting for more natural variation. For TSH,
“I’m not saying widen it to the 99th percentile; I’m saying use the existing studies that tell us where it’s beneficial
and then set the limit there,” as was done for glucose and vitamin D. “Those are cutoffs based on clinical studies
that assessed the risk of developing disease. With TSH, the risk is not getting benefit and more side effects.”

Other options involve moving away from the within-the-reference-interval-is-good, outside-is-bad approach. “That’s
not how biology works. It’s a spectrum,” Dr. El-Khoury says. “The problem is you have people who may [in the lab
report] look ‘out’ but are fine, and TSH is proof of that.” (The opposite can be true also.) Clinical decision limits are
one option, for tests that are standardized. Personalized reference intervals are another, though that won’t help a
patient without a recorded medical history who presents in a crisis state. Reporting results with z-scores, or how far
the result is from the mean in terms of standard deviations, is yet another possibility, though it’s not one with
which many physicians are familiar. “Instead of flagging one as in or out, it shows how far you are without giving
you one critical indication that something is wrong,” he explains.
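
As a rough illustration of z-score reporting, the conventional calculation divides the distance from the reference mean by the reference standard deviation. In the Python sketch below, the function names and every numeric input are hypothetical. The log-scale variant is an assumption added because TSH values are commonly described as right-skewed, and the article does not say how a laboratory would implement this.

```python
import math


def z_score(result, reference_mean, reference_sd):
    """Distance of a result from the reference mean, in standard deviations."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (result - reference_mean) / reference_sd


def log_z_score(result, log_mean, log_sd):
    """The same calculation on the natural-log scale, for skewed analytes.

    log_mean and log_sd describe the log-transformed reference values.
    """
    return z_score(math.log(result), log_mean, log_sd)


# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
print(z_score(3.0, 2.0, 0.5))
```

A report built this way would show a continuous distance rather than a single in-or-out flag, which is the point Dr. El-Khoury makes.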

Dr. Straseski believes personalized reference intervals are “where the world will eventually end up. Maybe a four is
completely appropriate for me, and a 4.17 is completely appropriate for you.”

“We are absolutely not there,” she says. “But it’s something we’ll be considering in the future.”

For now, Dr. Straseski points to the other tests that support the diagnosis.
“I can name almost a dozen tests that support thyroid function testing,” she says. “TSH is our jumping-off point,
but it’s not the end-all, be-all.” Her own hospital system is evaluating the testing formularies for thyroid disease,
she says.

The early reference interval studies for TSH likely didn’t employ that supportive testing in recruiting reference
populations, she says. “Free T4 hasn’t been a great test for all that long.” Neither has free T3 or the thyroid
antibodies. With supportive testing, it would be possible to eliminate from the studies individuals with subclinical
disease. “Ruling in or ruling out individuals from these reference populations is the critical part of this, no matter
what analyte we’re talking about, but particularly for the thyroid. More careful consideration of who’s included in
these populations would be helpful.”

So too would more granularity in results interpretation, says Dr. Naugler of the University of Calgary. He adds, “We
need to give guidance to clinicians in terms of what to do with that result.”

“In a time when labs are increasingly automated and lab tests are commoditized, this is a real opportunity for
laboratory professionals to value-add by providing interpretive explanations for clinicians and to be available for
our clinical colleagues to consult on questions that arise,” he says.

The need is greatest in primary care, Dr. Naugler says, and the laboratory could partner with endocrinology groups
to help design and draft the interpretive comments. He cites as an example an initiative his research laboratory
spearheaded, in which it partnered with local cardiology groups to provide, for primary care physicians,
interpretive comments for lipid tests and prescription recommendations. “They would get the lipid result, a
Framingham risk score, and a recommendation as to whether the patient should be prescribed a statin.” A similar
approach could be used for TSH testing, he says, in which the laboratory partners with endocrinology groups to
generate the comments and algorithms for TSH interpretation and possibly also recommendations in the report
about whether levothyroxine should be prescribed.

“The diagnostic interpretive part for general lab tests, for chemistry, hematology, is often overlooked,” he says.
“And that’s something we need to be aware of and take every opportunity to show the value in laboratory
professionals by providing those interpretive services.”

Charna Albert is CAP TODAY associate contributing editor.
