
nature medicine

Article https://doi.org/10.1038/s41591-024-03185-2

A generalist vision–language foundation model for diverse biomedical tasks

Kai Zhang1, Rong Zhou1, Eashan Adhikarla1, Zhiling Yan1, Yixin Liu1, Jun Yu1, Zhengliang Liu2, Xun Chen3, Brian D. Davison1, Hui Ren4, Jing Huang5,6, Chen Chen7, Yuyin Zhou8, Sunyang Fu9, Wei Liu10, Tianming Liu2, Xiang Li4, Yong Chen5,11,12,13, Lifang He1, James Zou14,15, Quanzheng Li4, Hongfang Liu9 & Lichao Sun1

A full list of affiliations appears at the end of the paper. e-mail: xli60@mgh.harvard.edu; lih319@lehigh.edu; lis221@lehigh.edu

Received: 29 January 2024; Accepted: 10 July 2024; Published online: xx xx xxxx

Traditional biomedical artificial intelligence (AI) models, designed for specific
tasks or modalities, often exhibit limited flexibility in real-world deployment
and struggle to utilize holistic information. Generalist AI holds the potential
to address these limitations due to its versatility in interpreting different data
types and generating tailored outputs for diverse needs. However, existing
biomedical generalist AI solutions are typically heavyweight and closed
source to researchers, practitioners and patients. Here, we describe
BiomedGPT, the first open-source and lightweight vision–language
foundation model, designed as a generalist capable of performing various
biomedical tasks. BiomedGPT achieved state-of-the-art results in 16 out of 25
experiments while maintaining a computing-friendly model scale. We also
conducted human evaluations to assess the capabilities of BiomedGPT in
radiology visual question answering, report generation and summarization.
BiomedGPT exhibits robust prediction ability with a low error rate of 3.8% in
question answering, satisfactory performance with an error rate of 8.3% in
writing complex radiology reports, and competitive summarization ability
with a nearly equivalent preference score to human experts. Our method
demonstrates that effective training with diverse data can lead to more
practical biomedical AI for improving diagnosis and workflow efficiency.

AI techniques, especially transformer-based foundation models, have demonstrated their power in solving a wide range of biomedical tasks, including radiology interpretation, clinical-information summarization and precise disease diagnostics1. However, most of today’s biomedical models act as specialist systems, tailored to specific tasks and modalities2. Such specialization comes with substantial challenges in model deployment, especially with the growing interest in using AI for precision medicine and patient-centered care, which require the integration and analysis of diverse data types and patient-specific details3,4. Furthermore, the hyper-specialization of AI in narrow disciplines often fails to provide the comprehensive insights necessary to assist doctors in real-world settings, where the flow of information can be slow and sporadic2,5. A generalist biomedical AI has the potential to overcome these limitations by using versatile models that can be applied to different tasks and are robust enough to handle the intricacies of medical data effectively2,6.

The emergence of general-purpose foundation models7,8 offers a prototype for the development of biomedical generalist AI. These advanced models serialize diverse datasets, regardless of their modalities, tasks or domains, into a uniform sequence of tokens, which are then processed using a transformer neural network9. Unlike large language models10,11, which are primarily designed for processing textual data, generalist models can handle both textual and visual information simultaneously.



This capability is pivotal for complex biomedical applications, in which the integration of diverse data types—such as clinical text and radiographic imaging—is crucial for accurate analysis and decision-making. Furthermore, generalist models exhibit impressive multitasking capabilities, greatly simplifying the deployment and management of AI systems by reducing the need to maintain numerous narrowly focused specialist models.

In this paper, we introduce BiomedGPT, a prototype for a generalist vision–language foundation model designed to perform diverse biomedical tasks across modalities using natural-language instructions (Fig. 1). Unlike multimodal biomedical AI systems that are specialized for a single task12, focused solely on one discipline13 or not publicly accessible6, BiomedGPT is trained with cross-disciplinary data and evaluated on a wide range of tasks. BiomedGPT is fully transparent, open-source and lightweight (for example, it is 3,088 times smaller than the commercial generalist biomedical AI model Med-PaLM M, which has 562 billion parameters6), thereby facilitating broader implementation. To empower the generalist capabilities of BiomedGPT, we curated a large-scale pretraining corpus comprising 592,567 images, approximately 183 million text sentences, 46,408 object–label pairs and 271,804 image–text pairs (Fig. 2c,d). Furthermore, to enhance its ability to follow instructions, we developed a variant called Instruct-BiomedGPT with specifically curated instruction-tuning data (Supplementary Fig. 1).

To our knowledge, BiomedGPT is the first fully transparent generalist medical AI model that has been comprehensively evaluated on publicly accessible datasets and by medical professionals. This study first highlights the transfer-learning capabilities of BiomedGPT, demonstrating how the model uses knowledge from pretraining to specialize effectively across 25 datasets through fine-tuning (Extended Data Tables 1 and 2 and Supplementary Table 7). We used recognized metrics from the literature to benchmark our model against state-of-the-art (SOTA) results. Additionally, BiomedGPT is a zero-shot learner that can answer multimodal medical questions without further training for adaptation, and its performance is comparable to that of leading AI systems. Furthermore, doctors evaluated BiomedGPT in tasks such as visual question answering (VQA), report generation and summarization within the radiology domain, and it demonstrated satisfactory performance. Although our results highlight BiomedGPT’s potential in medical applications, they also indicate that substantial enhancements are required to make it usable in the clinic. Critical evaluations for BiomedGPT are particularly needed in the areas of safety, equity and bias. Our findings underscore the challenges that must be addressed before these models can be deployed effectively in clinical settings. We outline these limitations and suggest directions for future research.

Results

Pretraining using large and diverse datasets
BiomedGPT uses pretraining techniques including masked modeling and supervised learning, aiming to establish robust and general data representations by learning from extensive datasets across diverse tasks (Extended Data Table 3). To maximize the generalization of BiomedGPT, we sourced the pretraining data from 14 freely available datasets, ensuring the diversity of modalities (Figs. 1a and 2c,d and Extended Data Fig. 1a). In addition, to investigate how BiomedGPT performs across scales, we specifically introduced three versions of the model: BiomedGPT-S, BiomedGPT-M and BiomedGPT-B, which correspond to small, medium and base sizes, respectively (Fig. 2a and Extended Data Figs. 2 and 3).

Fine-tuning for downstream tasks
Multitasking is fundamental to a generalist AI. Following previous biomedical research14–16 and aiming for sufficiently effective performance, we primarily fine-tuned our model to adapt to various biomedical tasks (Fig. 1b,c). Our selection of downstream tasks stemmed from their potential real-world applications: medical-image classification can aid in disease diagnostics and lesion recognition; text understanding and summarization can streamline clinic operations, such as easing doctors’ note-writing burden. Furthermore, image captioning and VQA lay the groundwork for future healthcare chatbots, addressing challenges in which common language might be ambiguous but medical terminology is too complex for most people to understand. The complete statistics of downstream datasets used in this article are shown in Extended Data Figure 1b.

BiomedGPT is lightweight but competitive in multimodal tasks
We fine-tuned BiomedGPT on two primary multimodal tasks, VQA and image captioning, each using three downstream datasets. The VQA datasets included radiology data covering five anatomies (VQA-RAD17 and Semantically-Labeled Knowledge-Enhanced Dataset (SLAKE)18), in addition to pathology data that captures both anatomical and tissue-specific details (PathVQA19). For captioning, we incorporated chest X-ray (CXR) datasets (IU X-ray20 and Medical Information Mart for Intensive Care III-CXR (MIMIC-CXR)21) as well as clinical photographs from Peir Gross22. For comparison, we benchmarked BiomedGPT against leading models for each dataset15,23–25.

We evaluated our model’s VQA performance by comparing generated answers with the ground truths. The overall accuracy of our BiomedGPT model is detailed in Extended Data Table 1. Notably, BiomedGPT achieved an 86.1% overall accuracy on the SLAKE dataset, surpassing the previous state-of-the-art (SOTA) performance of 85.4%, set by BiomedCLIP15. Additionally, we dissected the accuracy of both ‘closed ended’ and ‘open ended’ question–answer pairs (Fig. 3a). Our model recorded promising closed-ended accuracies: 88.0% on PathVQA, up by 1.0% compared with the performance of the current SOTA model25. On the SLAKE dataset, BiomedGPT-B achieved an 89.9% closed-ended accuracy, down by 1.1% compared with the M2I2 model’s performance23. In open-ended scenarios, our model excelled with an 84.3% accuracy, surpassing M2I2’s 74.7%. However, for the VQA-RAD and PathVQA datasets, BiomedGPT’s performance on open-ended queries was less competitive, recording accuracies of 60.9% and 28.0%, respectively.

In addition, we compared BiomedGPT-B with Med-PaLM M (12 billion parameters) using the weighted F1 score, as reported in the paper6. Other metrics could not be calculated owing to the closed-source nature of Med-PaLM M. Remarkably, despite its much smaller size, BiomedGPT-B achieved impressive results (Fig. 2b). On the VQA-RAD and SLAKE datasets, BiomedGPT-B attained scores of 73.2% and 85.2%, respectively, which represent a substantial increase of 22.5% on VQA-RAD and a slight improvement of 0.02% on SLAKE. Additionally, on the PathVQA dataset, BiomedGPT-B had a weighted F1 score of 56.9%, only 0.4% lower than Med-PaLM M, while utilizing a model with 98.5% fewer parameters.

To evaluate the model’s image-captioning ability (Fig. 3b), we meticulously assessed the quality of machine-generated text using three metrics: recall-oriented understudy for gisting evaluation-longest common subsequence (ROUGE-L)26, metric for evaluation of translation with explicit ordering (METEOR)27 and consensus-based image description evaluation (CIDEr)28. We compared the performance of BiomedGPT to that of established models13,29–33. These evaluation metrics are useful for assessing the similarity and consensus between the generated text and the reference text written by medical experts. They have also shown some alignment with ratings given by physicians34. Consequently, models that score higher on these natural-language processing (NLP) metrics can be selected as candidates for further human evaluation35. On the Peir Gross dataset, our BiomedGPT model surpassed the existing SOTA benchmark36, demonstrating improvements of 8.1 percentage points in ROUGE-L and 0.5 points in METEOR, and a substantial gain of 89.8 points in the CIDEr metric.
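To make the first of these metrics concrete, the following sketch computes a ROUGE-L score as the F-measure of the longest common subsequence between a generated caption and a reference. The whitespace tokenization, the symmetric F1 weighting and the example sentences are simplifying assumptions for illustration, not the exact implementation behind the reported scores.

# Minimal illustration of ROUGE-L: longest-common-subsequence F-measure
# between a generated caption and a reference. Tokenization and F-measure
# weighting are simplified; published implementations differ in detail.

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    generated = "mild bibasilar atelectasis without pleural effusion"
    reference = "there is mild bibasilar atelectasis and no pleural effusion"
    print(f"ROUGE-L (F1) = {rouge_l(generated, reference):.3f}")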


Fig. 1 | BiomedGPT can process diverse modalities and perform versatile tasks. a, BiomedGPT focuses primarily on visual and textual inputs, but can also process tabular data through serialization. CT, computed tomography; EHR, electronic health records; EKG, electrocardiogram; MRI, magnetic resonance imaging. b, Examples of the supported downstream visual-language tasks of BiomedGPT demonstrate its versatility. Additional tasks can be incorporated to meet further clinical needs through lightweight, task-specific fine-tuning. c, Examples of clinically relevant use-cases for BiomedGPT include tasks in which the input consists of both image and text or only text; the model responds to queries (Q) by generating responses (A). Thanks to its unified framework design and comprehensive pretraining on biomedical data, BiomedGPT is highly adaptable and can be applied to a variety of downstream tasks. BP, blood pressure; CABG, coronary artery bypass graft surgery; CAD, coronary artery disease; ER, estrogen receptor; GnRH, gonadotropin-releasing hormone; HR, heart rate; NRB, non-rebreather mask; PR, progesterone receptor; RR, respiratory rate; Reg#, de-identified ‘Medical Record Number’.
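As a concrete illustration of the tabular serialization mentioned in Fig. 1a, the sketch below converts a structured record, with values drawn from the treatment-suggestion example in Fig. 1c, into a natural-language prompt. The field names, helper functions and template wording are illustrative assumptions, not the actual preprocessing used for the SEER data.

# Illustrative serialization of a tabular patient record into a text prompt,
# mirroring the treatment-suggestion example in Fig. 1c. Field names and the
# sentence template are assumptions for demonstration only.

def serialize_patient(record):
    # Turn structured fields into one descriptive paragraph.
    parts = [
        f"The patient is a {record['age']}-year-old {record['race'].lower()} {record['sex'].lower()}.",
        f"The tumor measures {record['tumor_size_mm']} mm.",
        f"A total of {record['nodes_examined']} regional lymph nodes were examined and {record['nodes_positive']} tested positive.",
        f"Estrogen receptor status is {record['er_status'].lower()} and progesterone receptor status is {record['pr_status'].lower()}.",
    ]
    return " ".join(parts)

def build_prompt(record):
    # Append the task instruction so the model knows which output to generate.
    return serialize_patient(record) + " Please provide treatment suggestion given the patient's information."

if __name__ == "__main__":
    example = {
        "age": 44, "sex": "Female", "race": "White",
        "tumor_size_mm": 23, "nodes_examined": 34, "nodes_positive": 5,
        "er_status": "Positive", "pr_status": "Positive",
    }
    print(build_prompt(example))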


Fig. 2 | An overview of BiomedGPT: workflow, performance and pretraining datasets. a, Illustration of how BiomedGPT handles multimodal inputs and performs diverse downstream tasks. The expected form of output for each task is determined by feeding the specific instruction to the model. 2D, two-dimensional. b, Comparative performance analysis contrasting the achievements of BiomedGPT with prior SOTA results and Med-PaLM M (12 billion parameters). The evaluation metrics include accuracy for image classification, medical language inference and VQA (benchmarked against SOTA results); CIDEr for image captioning; ROUGE-L for text summarization; weighted F1 scores for VQA (in comparison with Med-PaLM M); and F1-macro for breast mass and calcification classification (also in comparison with Med-PaLM M). c, Distribution of pretraining datasets including image captioning and VQA as vision and language datasets, object-detection datasets and image-only datasets for masked image modeling. d, Density plot of the number of words per sentence in the text-only pretraining datasets. e, A comparison of scale-related performance. BiomedGPT exhibits superior performance on the SLAKE VQA dataset, although it has considerably fewer parameters than its counterparts. B, billion; M, million.
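The weighted F1 and F1-macro metrics named in this caption can be computed with standard tooling; the sketch below shows the corresponding scikit-learn calls on a small fabricated label set, purely to clarify how the two averaging schemes differ.

# Toy illustration of the two F1 variants used for comparison with Med-PaLM M:
# 'weighted' averages per-class F1 scores by class frequency, while 'macro'
# averages them uniformly. The label vectors below are fabricated examples.
from sklearn.metrics import f1_score

y_true = ["benign", "malignant", "benign", "normal", "malignant", "benign"]
y_pred = ["benign", "benign",    "benign", "normal", "malignant", "normal"]

print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))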


Fig. 3 | BiomedGPT performs fine-tuning for vision–language and medical-image-classification downstream tasks. a, Medical VQA performance of BiomedGPT and the leading models, in terms of closed-ended and open-ended accuracies. The information in parentheses indicates the performance change compared to BiomedGPT-B. × denotes the multiple of the parameter size of other models relative to that of our model. ↓ denotes the performance decrease compared to our model. ↑ denotes the performance increase compared to our model. For example, 0.5↓ means that the corresponding model has 0.5 lower accuracy than BiomedGPT-B. b, Image-captioning performance of BiomedGPT and SOTA platforms on IU X-ray, Peir Gross and MIMIC-CXR data. The evaluation metrics are ROUGE-L, METEOR and CIDEr. c, Evaluation of image classification on the MedMNIST-Raw dataset for each domain type. d, Image-classification performance with accuracy across two super-resolution image datasets. e, Image-classification performance as assessed by the F1-macro on the CBIS-DDSM dataset. f, Accuracies across nine datasets with different resolutions (shown on the graph, in pixels) vary with model scale. In general, larger models tend to perform better.
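A minimal sketch of the answer-matching step behind the closed-ended and open-ended accuracies in Fig. 3a is shown below. It assumes case-insensitive exact matching of generated answers against ground truths, a common convention for medical VQA benchmarks, and is not the exact scoring script used in this study.

# Illustrative exact-match scoring for medical VQA, reported separately for
# closed-ended (e.g. yes/no) and open-ended questions. Normalization is
# deliberately simple; benchmark-specific scripts may differ.

def normalize(answer):
    return " ".join(answer.lower().strip().rstrip(".").split())

def vqa_accuracy(samples):
    # samples: list of dicts with 'question_type', 'prediction', 'ground_truth'
    results = {"closed": [], "open": []}
    for s in samples:
        correct = normalize(s["prediction"]) == normalize(s["ground_truth"])
        results[s["question_type"]].append(correct)
    return {k: (100.0 * sum(v) / len(v) if v else None) for k, v in results.items()}

if __name__ == "__main__":
    demo = [
        {"question_type": "closed", "prediction": "No.", "ground_truth": "no"},
        {"question_type": "open", "prediction": "right lower lobe", "ground_truth": "Right lower lobe"},
        {"question_type": "open", "prediction": "pneumonia", "ground_truth": "atelectasis"},
    ]
    print(vqa_accuracy(demo))  # {'closed': 100.0, 'open': 50.0}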


Conversely, on the IU X-ray dataset, BiomedGPT achieved a leading CIDEr score of 40.1, marking a 5.0-point improvement over the SOTA model31. On the MIMIC-CXR dataset, in terms of METEOR, our model recorded a score of 15.9%, surpassing the previous leading result30.

BiomedGPT enables accurate medical-image classification
For the medical-image-classification task, we curated a biomedical image dataset, named MedMNIST-Raw, encompassing seven modalities following ref. 37: (1) colon pathology with nine tissue types; (2) dermoscopy images of seven typical pigmented skin lesions; (3) breast ultrasound (normal, benign and malignant); (4) retinal optical coherence tomography (OCT) categorized into four types of retinal diseases; (5) CXR images for binary-class classification of pneumonia against normal; (6) blood cell microscope showcasing eight kinds of normal cells; and (7) abdominal computed tomography (CT) with 11 body organs across the coronal view. Additionally, we tested the model on two super-resolution pulmonary disease datasets, with a specific focus on pulmonary tuberculosis (TB), which has a limited number of samples: (8) the Montgomery County CXR set (MC-CXR), with dimensions of either 4,020 × 4,892 or 4,892 × 4,020 pixels; and (9) the Shenzhen CXR set (SZ-CXR), with approximate dimensions of 3,000 × 3,000 pixels. To be consistent with prior works, we used accuracy for evaluation. As shown in Figure 3c–e, BiomedGPT outperformed previous SOTA systems on seven of the nine biomedical image-classification datasets after five-epoch fine-tuning.

Notably, on the SZ-CXR and MC-CXR datasets38 (binary classification), BiomedGPT had accuracies of 97.0% and 89.7%, reflecting improvements of 6.0% and 0.8%, respectively, over the previously leading model, LightTBNet39 (Fig. 3d). For MedMNIST-Raw, we selected two top-performing approaches on biomedical imaging analysis, MedViT (Large)40 and BiomedCLIP15, as benchmarks for comparison. For BiomedCLIP, we added a decision layer and fine-tuned the entire model. BiomedGPT achieved 5 out of 7 best accuracies on MedMNIST-Raw (Fig. 3c): for example, on the dermoscopy dataset, BiomedGPT surpassed the two baseline models by more than 14%. On average, BiomedGPT achieved performance improvements of 6.1% and 3.3% over MedViT and BiomedCLIP, respectively.

BiomedGPT exhibits performance enhancements as its scale increases (Fig. 3f). Specifically, on the MC-CXR dataset, the small model had an accuracy of 75.9%. By contrast, the medium model had a score of 82.8%, which is 6.9% higher than its smaller counterpart’s performance. The base model continued this upward trajectory, with a score of 89.7%, surpassing the medium model by 6.9%. However, we also observed performance saturation on several datasets, such as SZ-CXR. We also tested the extreme situation in which the images were resized to a very small scale and found that performance saturation became much more pronounced (Supplementary Table 1).

Additionally, we benchmarked BiomedGPT against Med-PaLM M on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset41 for both three-class lesion-level mass classification and calcification classification. Using the macro-averaged F1 score (F1-macro) as the evaluation metric, consistent with how Med-PaLM M was evaluated, we found that BiomedGPT-B outperforms all versions of Med-PaLM M, spanning 12 billion, 84 billion and 584 billion parameters (Fig. 3e and Extended Data Fig. 4a). These findings underscore the impressive efficiency and efficacy of BiomedGPT, even relative to models with larger scales.

BiomedGPT understands and summarizes clinical text
We assessed BiomedGPT’s proficiency in understanding and condensing complex medical narratives that hold potential for addressing real-world clinical needs: (1) medical natural-language inference, using the MedNLI dataset42, which tests the model’s comprehension in deducing hypotheses from provided premises; (2) treatment suggestions for radiation therapy and chemotherapy based on the Surveillance, Epidemiology, and End Results (SEER) dataset43; (3) in-hospital mortality prediction on the basis of admission notes; and (4) clinical-trial matching that identifies lists of candidate clinical trials suitable for individuals. Moreover, we explored BiomedGPT’s performance in medical-text summarization, which was applied to datasets of doctor–patient dialogues (MedQSum44 and HealthCareMagic45) as well as radiology reports (MIMIC-CXR21 and MIMIC-III46).

While evaluating the MedNLI dataset for three-class classification (entailment, contradiction or neutral), we used accuracy as our evaluation metric, consistent with prior research (Fig. 4e). Notably, when compared with the SOTA performance of SciFive-Large16 at 86.6% accuracy, BiomedGPT-B, which has merely a quarter of SciFive-Large’s parameter count, exhibited a decline in accuracy of only 2.8%.

For the treatment-suggestion task, we adopted the preprocessing steps as described in prior work47. An example output is: ‘Recommend using beam radiation, suggesting that the sequence for radiation should be post-surgery. Furthermore, chemotherapy should indeed be considered.’ To evaluate the effectiveness of three variants in treatment suggestions, we used a tenfold cross-validation method and compared current open-source SOTA methods, including BioGPT14 and LLaVA-Med12 (Fig. 4a), which have 347 million and 7 billion parameters, respectively—approximately 11 and 212 times larger, respectively, than BiomedGPT-S. BiomedGPT-B achieved a mean accuracy of 50.0% ± 5.3%, outperforming BioGPT and LLaVA-Med, which had accuracies of 45.9% ± 4.8% and 41.5% ± 7.1%, respectively. Considering the complexity involved with six types of radiation therapy, seven radiation sequences and two types of chemotherapy47, which together imply a random-guess accuracy of 1.2% (1/(6 × 7 × 2) ≈ 1.2%), both BiomedGPTs and the baseline models have much higher accuracies than this baseline.

For the clinical-trial matching task, we collected a dataset from Text Retrieval Conference (TREC) 202248, categorized into three groups: eligible, irrelevant and ineligible. We randomly chose 80% of the data from each group as the training set and the remaining 20% as the test set, and reported the average results across 10 repetitions. Again, all three versions of BiomedGPT outperformed the baselines (Fig. 4b). In particular, BiomedGPT-B achieved a mean accuracy of 85.2% ± 1.5%, substantially outperforming BioGPT and LLaVA-Med, which had accuracies of 42.0% ± 1.8% and 48.7% ± 2.4%, respectively.

To assess BiomedGPT’s performance in predicting in-hospital mortality, we used admission notes extracted from the MIMIC-III database, following ref. 49, with the official test set. Figure 4c presents the prediction-accuracy results for five models, demonstrating that all three versions of BiomedGPT outperformed BioGPT and LLaVA-Med. Notably, BiomedGPT-B achieved an accuracy improvement of more than 15% compared with these two baselines.

We used the ROUGE-L metric to assess BiomedGPT-B’s text-summarization performance across four benchmark datasets (Fig. 4d). BiomedGPT-B demonstrated its ability to summarize doctor–patient dialogues on the MedQSum and HealthCareMagic datasets, achieving ROUGE-L scores of 52.3% and 42%, respectively. Compared with leading models32, which have 400 million parameters (at least twice as large as BiomedGPT-B) and recorded ROUGE-L scores of 53.2% and 44.7%, BiomedGPT-B showed only minor performance drops of 0.9% and 2.7%. Additionally, in summarizing radiology reports, and specifically in generating impressions from radiologists’ findings, BiomedGPT-B achieved a ROUGE-L score of 44.4% on the MIMIC-CXR dataset. This result is closely aligned with the performance of the SOTA model, trailing by a mere 0.1% from the top score of 44.5%33. In the MIMIC-III dataset, BiomedGPT-B’s performance stood out with a ROUGE-L score of 30.7%, surpassing Med-PaLM M (12 billion parameters), which scored 29.5%.

BiomedGPT can perform zero-shot prediction on new data
We focused on evaluating the zero-shot capabilities of BiomedGPT in VQA, highlighting its ability to answer biomedical questions in a freeform manner at scale, without requiring retraining.


Fig. 4 | BiomedGPT performs few-epoch transfer learning for clinical-text understanding and summarization and generates a response through zero-shot transfer learning. a, Evaluation of models for the treatment-suggestion task in terms of accuracy using tenfold cross-validation (n = 4,680 data samples). b, Comparison of performance, assessed using accuracy, on the patient–trial matching dataset, derived from the TREC 2022 dataset, using tenfold cross-validation (n = 7,079 data samples). c, Accuracy across three BiomedGPT variants and two SOTA models, BioGPT and LLaVA-Med, for in-hospital mortality prediction. d, ROUGE-L scores across four text-summarization datasets, relative to model scale. e, Medical language inference performance on the MedNLI dataset. f, Comparison of zero-shot question-alignment accuracy among Instruct-BiomedGPTs (base, medium, small), BiomedGPTs, OFAs (large, huge), LLaVA-Med and GPT-4V. An example illustrating a mismatch between the generated answer and the question is shown. g, Average zero-shot accuracy across seven question types on the VQA-RAD dataset. h, Overall zero-shot learning performance on the VQA-RAD dataset over 50 repeated samplings (n = 39 data samples).
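The tenfold cross-validation protocol behind the accuracies in Fig. 4a,b can be sketched as follows; the synthetic data and the logistic-regression stand-in are placeholders, and only the fold-wise mean ± s.d. reporting mirrors the procedure described in the text.

# Sketch of tenfold cross-validation with mean ± s.d. accuracy, mirroring how
# the treatment-suggestion and trial-matching results are reported. The data
# and the logistic-regression stand-in model are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"accuracy = {100 * np.mean(accuracies):.1f}% ± {100 * np.std(accuracies):.1f}%")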


This contrasts sharply with earlier biomedical AI models, such as bidirectional encoder representations from transformers (BERT)-based or vision transformer (ViT)-based models40, which are incapable of zero-shot prediction, or contrast language–image pretraining (CLIP)-based models15, which require predefined answers (Extended Data Fig. 5a). Unlike these models, BiomedGPT can generate answers by simply processing the input data, offering more flexible and dynamic AI-driven solutions for biomedical inquiries. In addition to medical VQA, BiomedGPT showcased zero-shot capabilities in disease diagnosis and X-ray report generation, matching the performance of Med-PaLM M and LLaVA-Med (Extended Data Fig. 5b,c).

We used the VQA-RAD dataset18 (which was absent from the pretraining data) for evaluation, through 50 random samplings. Our evaluation of BiomedGPT’s performance centered on two key metrics: (1) the accuracy of the model in providing correct answers, and (2) its ability to understand the questions and respond in a contextually relevant way, measured as alignment accuracy. We noted low alignment accuracy, indicating poor question comprehension, by our pretrained models (Fig. 4f). To address this, we developed Instruct-BiomedGPT, which was fine-tuned using instruction-tuning data (Supplementary Fig. 1). We assessed this model against current SOTA models, including GPT-4V50, LLaVA-Med (7B)12, OFA-Huge (930 million parameters) and OFA-Large (470 million parameters)51 in a zero-shot setting, analyzing various question types (Extended Data Table 4). Specifically, Instruct-BiomedGPT-B achieved a zero-shot accuracy of 54.7% ± 5.7%, surpassing GPT-4V’s 53.0% ± 6.7% (Fig. 4h). Despite this improvement in understanding medical questions, neither model reached clinically acceptable performance. For example, the current top-performing medical vision–language model, LLaVA-Med, achieved accuracies of only 42.0% and 40.6% in disease diagnosis and lesion detection, respectively (Fig. 4g). Although Instruct-BiomedGPT-B showed a more than 10% improvement over LLaVA-Med, accuracies remained under 60%. These results highlight the complexity of diagnosis and the need for ongoing fine-tuning in the development of visual-language biomedical AI.

Regarding alignment accuracy, GPT-4V and LLaVA-Med outperformed the other models (Fig. 4f); specifically, they achieved impressive scores of 99.5% ± 1.1% and 98.2% ± 2.0%, respectively, likely owing to the advanced large language models on which they are built10,11. The marked improvement in alignment accuracy between Instruct-BiomedGPT and the pretrained BiomedGPT exemplifies the effectiveness of instruction tuning in enhancing the model’s capability to follow instructions accurately. For instance, BiomedGPT-B achieved a mean alignment accuracy of 79.2%, but Instruct-BiomedGPT-B reached 95%.

Human evaluation of BiomedGPT for radiology tasks
To evaluate the clinical applicability and deployment challenges of BiomedGPT, we conducted a series of analyses through radiologist evaluations of the model’s generated responses to a wide range of tasks, including VQA, report generation and report summarization in radiology. Examples of human evaluation on these three tasks in terms of response factuality, omissions and severity of the errors are shown in Figure 5a. The detailed evaluation procedure and performance analysis are as follows.

Radiology VQA. To clinically evaluate the correctness of BiomedGPT’s responses, we randomly selected 52 question–answer samples from 16 images in the official test set of MIMIC-Diff-VQA52 over 6 categories (Supplementary Table 2): abnormality, presence, location, type, view and severity level. For a fair comparison, we collected the answers generated by BiomedGPT, LLaVA-Med after fine-tuning and GPT-4V (zero-shot). The generated answers were presented to a seasoned radiologist at Massachusetts General Hospital for scoring (Fig. 5b,c). The answers were categorized as correct, partially correct, incorrect or unrelated, and were assigned scores of 2, 1, 0 and −1, respectively. Additionally, the original radiology reports were provided to the radiologist to serve as a reference, potentially facilitating a more precise evaluation.

BiomedGPT achieved an average score of 1.75 across all 52 samples, accumulating a total score of 91. In comparison, GPT-4V and LLaVA-Med attained average scores of 1.17 and 1.4, resulting in total scores of 61 and 73, respectively. BiomedGPT demonstrated superior performance in four out of five question categories. In addition, despite the radiologist identifying some errors in the sampled gold labels from MIMIC-Diff-VQA, we conducted a comparison using an exact match score based on these labels across the test set with non-difference questions. In this evaluation, BiomedGPT-B showed the best performance (Supplementary Table 3).

Radiology report generation. This task’s complexity arises from the need for long-form outputs that provide detailed descriptions of various aspects, such as the presence, location and severity of abnormalities. In this study, we randomly selected 30 sample image–report pairs from the MIMIC-CXR dataset21. We then applied BiomedGPT-B and BiomedGPT-M to generate the ‘findings’ section of the radiology report based on the input CXR image. The radiologist assessed the quality of the generated text by addressing several aspects. First, they identified any disagreements with the generated report, such as incorrect finding locations, incorrect severity levels, references to views not present or mentions of prior studies that do not exist. Second, the radiologist determined whether the errors in the generated report are critical, with the options being critical, noncritical or N/A if more information is needed. Third, they pinpointed any omissions in the generated text. Finally, the radiologist judged whether the omissions are clinically critical.

In the evaluation, we focused on finding-level metrics, in which the generated text would be split into individual findings. For instance, the report ‘PA and lateral views of the chest provided. Cardiomegaly is again noted with mild pulmonary edema. No large effusion or pneumothorax.’ consists of three findings. To clearly demonstrate the quality of the generated findings, we quantified the error rates and omission rates (Fig. 5d). In the analysis of 192 generated findings, BiomedGPT-B achieved a rate of ‘critical error’ of 8.3%, whereas BiomedGPT-M exhibited a rate of 11.0% (excluding one case that required additional information for a comprehensive impact assessment). These rates are comparable to the human observer variabilities on the MIMIC-CXR, which has an error rate of approximately 6%53. We also reported the rate of ‘harmless error’; BiomedGPT-B and BiomedGPT-M achieved 5.2% and 11.5%, respectively. Our observations included an analysis of 254 findings from the reference report to calculate the omission rates. The total omission rates for BiomedGPT-B and BiomedGPT-M were 23.3% and 23.5%, respectively. Because not all findings described in the reference are clinically necessary, our analysis primarily focused on critical omissions; BiomedGPT-B and BiomedGPT-M had similar rates, of 7.0% and 6.9%, respectively.

Radiology report summarization. We evaluated 100 summaries generated by BiomedGPT-B based on findings from MIMIC-CXR data21, alongside the ‘Impression’ sections of corresponding reference reports. Our evaluation focused on completeness, correctness and potential medically adverse effects due to any omissions or incorrect interpretations (Fig. 5a). Completeness is rated from 1 (very incomplete) to 5 (very complete), with 3 representing a borderline (neutral) encapsulation. Accuracy is assessed by how well the content reflects the clinical implications for the patient, rated from 1 (very incorrect) to 5 (very correct). The potential for medically adverse effects from errors is classified as ‘no harm’, ‘mild’ or ‘severe’, on the basis of their clinical impact. Finally, we compared which summary, generated or referenced, better encapsulated all clinically relevant information, providing a comprehensive comparison of AI-generated summaries with traditional radiology reports in terms of relevance, accuracy and safety.
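The radiologist VQA scoring scheme described earlier in this section (correct = 2, partially correct = 1, incorrect = 0, unrelated = −1) reduces to simple bookkeeping. The sketch below reproduces that calculation with a category breakdown consistent with the reported BiomedGPT result of 91 points over 52 samples (an average of 1.75); the function and variable names are illustrative.

# Bookkeeping for the radiologist VQA evaluation: each answer is rated
# correct (2), partially correct (1), incorrect (0) or unrelated (-1).
# The counts below are a breakdown consistent with the reported BiomedGPT
# result of 91 total points across 52 samples (average 1.75).

CATEGORY_SCORES = {"correct": 2, "partially_correct": 1, "incorrect": 0, "unrelated": -1}

def summarize(counts):
    # Return the total score and the per-sample average score.
    total = sum(CATEGORY_SCORES[cat] * n for cat, n in counts.items())
    n_samples = sum(counts.values())
    return total, total / n_samples

if __name__ == "__main__":
    biomedgpt_counts = {"correct": 41, "partially_correct": 9, "incorrect": 2, "unrelated": 0}
    total, average = summarize(biomedgpt_counts)
    print(total, round(average, 2))  # 91, 1.75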


Fig. 5 | Human evaluation of the VQA, text-summarization and captioning tasks. a, Examples of human evaluation for three tasks in terms of response factuality, omissions and severity of the errors. In the given X-ray image, L indicates the left side of the patient’s body; the ‘O’ is not a letter but the imaging of a foreign object either inside or outside the subject’s body. b, Comparison of performance between three models across six question categories for radiology VQA. c, Average answer score for radiology VQA. d, Error and omission rates of BiomedGPT-B and BiomedGPT-M in the generated radiology report. e, Human evaluation of report summarization considers three attributes: completeness, correctness and potential harm, with the radiologist’s preference (specifically, across all comparison pairs of the reference summary from the medical expert and the BiomedGPT-generated summary, the radiologist evaluator preferred the reference summary in 52% of cases and the BiomedGPT-generated summary in the remaining 48%).

BiomedGPT-generated summaries generally exhibit higher completeness (Fig. 5e), achieving average completeness (score > 3) in 81.0% of cases, 15.0% higher than the reference summaries. Additionally, only 5% of BiomedGPT-generated summaries are considered incomplete (score < 3), compared with 4% for the reference summaries.


Despite these findings, the average completeness score for BiomedGPT is slightly lower at 3.9, versus 4.0 for reference summaries, with no significant difference (P > 0.05). BiomedGPT also had a higher correctness rate, with 90.0% of its summaries scoring above 3, compared with 86.0% for the reference impressions. The Wilcoxon rank-sum test showed no significant difference (P > 0.05) in average correctness scores between BiomedGPT and the reference summaries, both averaging 4.4 out of 5.

Fig. 6 | Results of the ablation study on the impact of diversity of pretraining datasets and tasks and a graphical demonstration of BiomedGPT’s design. a, Performance comparison excluding the specific task. The metrics used are accuracy for radiology VQA, medical language inference and image classification; CIDEr for radiology captioning; and ROUGE-L for medical-question summarization. Pretraining without using masked image modeling, w/o MIM; without using masked language modeling, w/o MLM; without using object detection, w/o OD. b, Cross-domain transferability of BiomedGPT across four datasets. RadGPT is a variant of BiomedGPT but was pretrained with radiology-only data. SLAKE-MRI and SLAKE-CT are the modality-specific subsets of the SLAKE data. c, In-domain transferability of BiomedGPT across three radiology modalities and datasets. d, Description of the unified vocabulary used in BiomedGPT for pretraining and inference. Tokenization of bounding boxes and text was achieved using Pix2Seq and byte-pair encoding (BPE), respectively. There are three types of tokens: location tokens, text tokens and image tokens from frozen pretrained tokenizers, such as VQ-GAN. An illustration of masked image modeling in pretraining, which involves learning representations by reconstructing masked patches, is also shown. [S] and [M] indicate the starting token and masked patch embedding, respectively.
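To make the unified vocabulary in Fig. 6d concrete, the sketch below lays out text, location and image token ranges and quantizes a bounding box into Pix2Seq-style location tokens. The 1,000 location bins and 8,192 image codes follow the figure, and a text vocabulary of 50,265 BPE tokens is assumed so that the total matches the 59,457 reported there; the offsets, helper names and rounding convention are illustrative rather than the repository's actual implementation.

# Illustrative layout of a unified token vocabulary (text + 1,000 location
# bins + 8,192 VQ-GAN image codes, as in Fig. 6d) and Pix2Seq-style
# quantization of a bounding box into location tokens. Offsets, names and
# the assumed text-vocabulary size are for demonstration only.

NUM_LOCATION_BINS = 1000      # loc1 ... loc1000
NUM_IMAGE_CODES = 8192        # img1 ... img8192
TEXT_VOCAB_SIZE = 50265       # assumed BPE text vocabulary; 50265 + 1000 + 8192 = 59457

LOC_OFFSET = TEXT_VOCAB_SIZE
IMG_OFFSET = TEXT_VOCAB_SIZE + NUM_LOCATION_BINS

def location_token(coord, image_extent):
    # Map a pixel coordinate to one of the discrete location bins.
    bin_index = min(int(coord / image_extent * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
    return LOC_OFFSET + bin_index

def box_to_tokens(x0, y0, x1, y1, width, height):
    # A bounding box becomes four location tokens: x0, y0, x1, y1.
    return [
        location_token(x0, width), location_token(y0, height),
        location_token(x1, width), location_token(y1, height),
    ]

def image_code_token(code_index):
    # Map a VQ-GAN codebook index to its slot in the unified vocabulary.
    return IMG_OFFSET + code_index

if __name__ == "__main__":
    print(box_to_tokens(120, 80, 360, 240, width=500, height=500))
    print(image_code_token(456))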


In addition, our analysis found that 6.0% of BiomedGPT-generated summaries contained medically adverse items, categorized as either ‘mild’ or ‘severe’, which is identical to the rate observed in the reference impressions. This indicates that BiomedGPT has comparable performance to human experts in summarizing radiology reports, particularly in terms of assessing medical safety. Notably, there was one instance of a ‘severe’ adverse effect identified in the reference impressions, with no such cases found in the BiomedGPT-generated summaries. The overall score of summaries generated by BiomedGPT closely matches the score of those produced by the reference, with preference scores of 48% for BiomedGPT and 52% for the reference (Fig. 5e). The results of the Sign test (P > 0.05) indicate that there is no significant preference for either system, suggesting comparable performance in delivering quality and safety in medical summarization.

Discussion
In this study, we have shown that BiomedGPT can achieve competitive transfer-learning performance across vision, language and multimodal domains by integrating diverse biomedical modalities and tasks within a unified pretraining framework. However, the experimental results also revealed limitations, offering insights for potential improvement.

The development of AI critically depends on the availability of high-quality, annotated data. This requirement poses a unique challenge in the biomedical domain, in which data annotation is expensive, time-consuming and demands extensive domain expertise54. Consequently, AI researchers often resort to public datasets, which can compromise data quality. When dealing with multimodal biomedical datasets, particularly image–text pairs, issues become more pronounced: (1) most existing datasets focus primarily on radiology, leading to a substantial modality imbalance; and (2) the scale of images with detailed annotation is still limited in comparison with unlabeled or weakly-labeled biomedical images and accessible biomedical articles from PubMed or PubMed Central. In our study, we considered diverse modalities and ensured that the data scale is sufficient to train high-performance models. As more biomedical data are curated and made open source, we can obtain better visual–semantic mappings (Fig. 6).

Evaluating the quality of generated text presents considerable challenges. Although metrics such as CIDEr and ROUGE-L can measure the agreement between generated content and a gold standard, and are commonly used for model selection to further assess clinical applicability35, ensuring the factual accuracy of these outputs remains a concern. To address this, recent research has introduced the F1-RadGraph score55, which qualitatively assesses the factual correctness and completeness of generated reports. In other domains, such as pathology, similar evaluation metrics are not yet prevalent. We anticipate the emergence of analogous metrics for these domains that draw inspiration from factual-concerned metrics developed in radiology56. These would further enhance our ability to measure the factual integrity and overall quality of AI-generated medical content across various biomedical fields.

BiomedGPT is currently adept at processing images and text, and its capabilities could potentially be extended to other types of biomedical data, such as video and time-series or sequential data. For instance, we demonstrated how BiomedGPT can be extended to handle three-dimensional (3D) images by introducing a 3D image encoder into the framework (Extended Data Table 5 and Supplementary Table 4). Nevertheless, these expansions raise concerns about negative transfer, in which learning from additional modalities might inadvertently hamper performance on certain tasks. For instance, our ablation study revealed that excluding image data during pretraining improves performance on language-only downstream tasks (Fig. 6a), highlighting the risk of negative transfer. To mitigate this, we propose exploring controllable learning strategies, such as the mixture of experts57.

Evidence from our comprehensive analysis (Figs. 3a,b,f and 4a–e,h) indicates a direct correlation between increased model scale and enhanced performance, applicable to both zero-shot predictions and post-fine-tuning. However, scaling brings its own set of challenges, particularly concerning fine-tuning efficiency, training speed and memory requirements. We have tried to address the efficiency challenges of BiomedGPT by exploring prompt tuning, which adds small-scale parameters to condition frozen models56. However, this method incurred large performance degradation (Extended Data Fig. 4b).

Our zero-shot transfer-learning tests (Fig. 4f–h) indicated that BiomedGPT’s text-comprehension capabilities, especially in comparison with those of GPT-4V, are not fully established. Two main factors contribute to this limitation: first, the current scale of BiomedGPT, particularly the language backbone, is limited by available resources, although it is expandable. Our preliminary observations indicate that, even if a model has seven billion parameters and effective training, achieving robust zero-shot in-context or text understanding remains challenging in complex medical applications. However, fine-tuning, even with a smaller-scale model such as BiomedGPT, proves to be a promising approach to mitigate risks (Supplementary Fig. 3). Second, the use of a single encoder that handles multiple input types complicates the separation of diverse modality representations, requiring more refined training strategies.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-024-03185-2.

References
1. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
2. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
3. Moody, L. et al. The person-centred care guideline: from principle to practice. J. Patient Exp. 5, 282–288 (2018).
4. Langberg, E. M., Dyhr, L. & Davidsen, A. S. Development of the concept of patient-centredness – a systematic review. Patient Educ. Couns. 102, 1228–1236 (2019).
5. Bates, D. W. et al. Reducing the frequency of errors in medicine using information technology. J. Am. Med. Inform. Assoc. 8, 299–308 (2001).
6. Tu, T. et al. Towards generalist biomedical AI. NEJM AI https://doi.org/10.1056/AIoa2300138 (2024).
7. Reed, S. et al. A generalist agent. Transact. Mach. Learn. Res. https://openreview.net/pdf?id=1ikK0kHjvj (2022).
8. Driess, D. et al. Palm-e: an embodied multimodal language model. In Proc. 40th International Conference on Machine Learning 8469–8488 (JMLR.org, 2023).
9. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (Neural Information Processing Systems Foundation, 2017).
10. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
11. Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
12. Li, C. et al. Llava-med: training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems 36 (Neural Information Processing Systems Foundation, 2024).
13. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology. Preprint at https://arxiv.org/abs/2308.02463 (2023).


14. Luo, R. et al. BioGPT: generative pretrained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).
15. Zhang, S. et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. Preprint at https://arxiv.org/abs/2303.00915 (2023).
16. Phan, L. N. et al. Scifive: a text-to-text transformer model for biomedical literature. Preprint at https://arxiv.org/abs/2106.03598 (2021).
17. Lau, J. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5, 180251 (2018).
18. Liu, B. et al. Slake: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proc. IEEE International Symposium on Biomedical Imaging (ISBI) 1650–1654 (Institute of Electrical and Electronics Engineers, 2021).
19. He, X. et al. Towards visual question answering on pathology images. In Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) 708–718 (Association for Computational Linguistics, 2021).
20. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23, 304–310 (2016).
21. Johnson, A. E. et al. MIMIC-CXR-JPG — chest radiographs with structured labels. PhysioNet 101, 215–220 (2019).
22. Pavlopoulos, J., Kougia, V. & Androutsopoulos, I. A survey on biomedical image captioning. In Proc. Second Workshop on Shortcomings in Vision and Language 26–36 (Association for Computational Linguistics, 2019).
23. Li, P. et al. Self-supervised vision-language pretraining for medial visual question answering. In Proc. IEEE 20th International Symposium on Biomedical Imaging (ISBI) 1–5 (Institute of Electrical and Electronics Engineers, 2023).
24. Zhang, X. et al. Pmc-vqa: visual instruction tuning for medical visual question answering. Preprint at https://arxiv.org/abs/2305.10415 (2023).
25. Van Sonsbeek, T. et al. Open-ended medical visual question answering through prefix tuning of language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention 726–736 (MICCAI, 2023).
26. Lin, C. Y. Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
27. Banerjee, S. & Lavie, A. Meteor: an automatic metric for mt evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (eds. Goldstein, J., Lavie, A., Lin, C.-Y. & Voss, C.) 65–72 (Association for Computational Linguistics, 2005).
28. Vedantam, R., Zitnick, C. L. & Parikh, D. Cider: consensus-based image description evaluation. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR) 4566–4575 (Institute of Electrical and Electronics Engineers, 2015).
29. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (Association for Computational Linguistics, 2017).
30. Chen, Z. et al. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 1439–1449 (Association for Computational Linguistics, 2020).
31. Liu, F. et al. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13753–13762 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).
32. Yuan, H. et al. Biobart: pretraining and evaluation of a biomedical generative language model. In Proc. 21st Workshop on Biomedical Language Processing (eds. Demner-Fushman, D., Cohen, K. B., Ananiadou, S. & Tsujii, J.) 97–109 (Association for Computational Linguistics, 2022).
33. Van Veen, D. et al. Radadapt: radiology report summarization via lightweight domain adaptation of large language models. In 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks (eds. Demner-fushman, D., Ananiadou, S. & Cohen, K.) 449–460 (Association for Computational Linguistics, 2023).
34. Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 9 (2023).
35. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
36. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. Proc. 56th Annual Meeting of the Association for Computational Linguistics 1 (eds. Gurevych, I. & Miyao, Y.) 2577–2586 (2018).
37. Yang, J. et al. MedMNIST v2 - a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 10, 41 (2023).
38. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 4, 475–477 (2014).
39. Capellán-Martín, D. et al. A lightweight, rapid and efficient deep convolutional network for chest x-ray tuberculosis detection. In Proc. 2023 IEEE 20th Int. Symp. Biomed. Imaging (ISBI) 1–5 (IEEE, 2023).
40. Manzari, O. N. et al. Medvit: a robust vision transformer for generalized medical image classification. Comput. Biol. Med. 157, 106791 (2023).
41. Lee, R. S. et al. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci. Data 4, 1–9 (2017).
42. Romanov, A. & Shivade, C. Lessons from natural language inference in the clinical domain. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing 1586–1596 (Association for Computational Linguistics, 2018).
43. Gloeckler Ries, L. A. et al. Cancer survival and incidence from the surveillance, epidemiology, and end results (SEER) program. Oncologist 8, 541–552 (2003).
44. Abacha, A. B. & Demner-Fushman, D. On the summarization of consumer health questions. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2228–2234 (2019).
45. Zeng, G. et al. Meddialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 9241–9250 (Association for Computational Linguistics, 2020).
46. Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 1–9 (2019).
47. Dubey, S. et al. Using machine learning for healthcare treatment planning. Front. Artif. Intell. 6, 1124182 (2023).
48. Roberts, K. et al. Overview of the TREC 2021 clinical trials track. In Proc. Thirtieth Text Retrieval Conference (TREC, 2021).
49. Van Aken, B. et al. Clinical outcome prediction from admission notes using self-supervised knowledge integration. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 881–893 (Association for Computational Linguistics, 2021).
50. OpenAI. GPT-4V(ision) system card. OpenAI https://openai.com/research/gpt-4v-system-card (2023).


51. Wang, P. et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Proc. Int. Conf. Mach. Learn. PMLR 162, 23318–23340 (2022).
52. Hu, X. et al. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In Proc. 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 4156–4165 (Association for Computing Machinery, 2023).
53. Jeong, J. et al. Multimodal image-text matching improves retrieval-based chest x-ray report generation. In Proc. Medical Imaging with Deep Learning 227 978–990 (Proceedings of Machine Learning Research, 2024).
54. Fu, S. et al. Assessment of data quality variability across two EHR systems through a case study of post-surgical complications. In Proc. AMIA Joint Summits on Translational Science 196–205 (American Medical Informatics Association, 2022).
55. Delbrouck, J. B. et al. Improving the factual correctness of radiology report generation with semantic rewards. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 4348–4360 (Association for Computational Linguistics, 2022).
56. Yang, H., Lin, J., Yang, A., Wang, P. & Zhou, C. Prompt tuning for unified multimodal pretrained models. In Findings of the Association for Computational Linguistics: ACL 2023 (eds. Rogers, A., Boyd-Graber, J. & Okazaki, N.) 402–416 (Association for Computational Linguistics, 2023).
57. Chen, Z. et al. Towards understanding the mixture-of-experts layer in deep learning. Adv. Neural Inf. Process. Syst. 35, 23049–23062 (2022).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

© The Author(s), under exclusive licence to Springer Nature America, Inc. 2024

1Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA, USA. 2School of Computing, University of Georgia, Athens, GA, USA. 3Samsung Research America, Mountain View, CA, USA. 4Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA. 5Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA. 6PolicyLab, Children's Hospital of Philadelphia, Philadelphia, PA, USA. 7Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA. 8Department of Computer Science and Engineering, University of California, Santa Cruz, CA, USA. 9McWilliams School of Biomedical Informatics, UTHealth, Houston, TX, USA. 10Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA. 11The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania, Philadelphia, PA, USA. 12Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA. 13Leonard Davis Institute of Health Economics, Philadelphia, PA, USA. 14Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA. 15Department of Computer Science, Stanford University, Stanford, CA, USA. e-mail: xli60@mgh.harvard.edu; lih319@lehigh.edu; lis221@lehigh.edu


Methods

BiomedGPT is a transformer-based architecture specifically designed for the biomedical field, built on the success of existing unified models for general data. We follow the fundamental principles of a unified model51: (1) modality-agnostic, (2) task-agnostic and (3) modality and task comprehensiveness. By discretizing data into patches or tokens, we achieve input–output unification using ideas from ViT58 and language models10,11.

BiomedGPT architecture
There are three principal architectures among pretrained foundation models: encoder-only, decoder-only and encoder–decoder. Encoder-only models, such as BERT and its variants59, primarily use the transformer's encoder to learn representations of input data, and require additional modules, such as classification heads or task-specific decoders, during fine-tuning. This architecture may struggle with aligning inputs and outputs across distinctly different modalities, limiting its capability in complex zero-shot prediction or generation tasks. Conversely, decoder-only models, exemplified by GPT10, rely solely on the transformer's decoder to process raw text inputs. Although proficient in text-based tasks, their architecture is not inherently equipped to handle multiple modalities, often leading to challenges in learning joint representations across diverse data types. This can diminish flexibility and performance in multimodal tasks, particularly in biomedical applications. Therefore, we selected the encoder–decoder architecture to design BiomedGPT, which is more adept at mapping various modalities into a unified semantic representation space, thereby enhancing task handling across a broader spectrum.

BiomedGPT is implemented with a BERT-style encoder59 over corrupted text and a GPT-style left-to-right autoregressive decoder10. All these models rely on the transformer with the popular multi-head attention mechanism (Extended Data Fig. 3a), which allows the model to jointly attend to the information from different representation sub-spaces60. To improve the convergence efficiency and stability in the pretraining, we added three normalization operations to each layer: a post-attention Layer Norm (LN)61, post-first-FFN LN and head-wise scaling within self-attention (Extended Data Fig. 2b), following ref. 62. To encode positional information, we incorporated two sets of absolute position embeddings for both text and images. Rather than merely combining these embeddings with token and patch embeddings, we implemented a decoupling method to separate position correlation (Extended Data Fig. 3b), which could bring unnecessary randomness in the attention and further limit the expressiveness of the model60. Furthermore, we also incorporated one-dimensional relative position bias for text and 2D relative position bias for images (Extended Data Fig. 3c), as described in previous works63,64. To investigate the performance of BiomedGPT for tasks at different scales, we explicitly designed three scaling models, that is, BiomedGPT-S (33 million parameters), BiomedGPT-M (93 million parameters) and BiomedGPT-B (182 million parameters). The configurations for each model are detailed in Extended Data Figure 2a.

Unifying input–output
To handle diverse modalities without relying on task-specific output structures, we represented them as tokens drawn from a unified and finite vocabulary (Fig. 6d). To achieve this, we used frozen image quantization65 and object descriptor66 to discretize images and objects, respectively, on the target side. We encoded text outputs, including object labels and summarizations, using BPE tokens67. Specifically, an image with a resolution of 256 × 256 pixels is sparsely encoded into a 16 × 16 sequence of image codes, which correlates strongly with the corresponding patches and can effectively reduce the sequence length of the image representation. The bounding boxes of objects in an image are expressed as sequences of location tokens in the format of integers. We thereby built a unified vocabulary for all tokens of multimodal outputs. The total vocabulary size is 59,457 tokens, including 50,265 language tokens, 1,000 location tokens and 8,192 vision tokens. The number of vision tokens was determined by the variant of the pretrained VQ-GAN models used in BiomedGPT; specifically, we used the variant with a patch size of 8 and a vocabulary size of 8,192. During training, we randomly subsampled 196 image patches for pretraining. The maximum model input length is truncated to 512.

Ablation study on modality comprehensiveness. Additional evaluations were conducted to address the query: 'Can the proposed model handle unseen data modalities (for example, images from a different imaging device, such as ultrasound)?' To investigate this, we adjusted our dataset selection for both pretraining and downstream tasks (Supplementary Fig. 2b). Specifically, we used all 3,489 and 6,461 CXR image–text pairs from the SLAKE and IU X-ray datasets, respectively. Additionally, we randomly selected 7,452 images from CheXpert while disabling MLM and OD during pretraining for simplification (Supplementary Fig. 2a). The pretrained BiomedGPT on X-ray modality, denoted as RadGPT-{size}, was then fine-tuned on radiology datasets: CXR, breast ultrasound and liver CT (coronal view). As a comparative baseline, we selected ResNet-50 (ref. 68), which was trained from scratch on these three datasets. We observed impressive in-domain transferability of BiomedGPT from the outcome (Fig. 6c): RadGPT-B outperformed the baseline, achieving 93.0% classification accuracy on the CXR images, a 7.6% improvement. However, for liver CT scans, we had to scale up the model to attain comparable results to the baseline. This highlights the challenges in domain adaptation for medical applications when the pretrained model does not learn diverse medical knowledge.

We further explored the aspect of cross-domain transferability (Fig. 6b). Specifically, we fine-tuned the aforementioned pretrained model, RadGPT, using datasets from other domains, such as blood cell microscopy and dermoscopy, for image classification. Additionally, we selected MRI-only and CT-only image–text pairs from SLAKE and conducted VQA fine-tuning. The results were compared with the benchmark (the original BiomedGPT-B pretrained with all modalities) and were measured in terms of accuracy. We found that cross-modality transfer with our model is feasible, albeit with potentially substantial performance degradation. For example, RadGPT-B exhibited a notable decrease in accuracy compared with the baseline on both the DermaMNIST dataset (dermoscopy), with an 8.1% drop, and the SLAKE-CT VQA dataset, with a more substantial reduction of 15.2%. Notably, we had to double the training epochs as compared with the previous fine-tuning with a pretrained model encompassing all modalities (100 versus 50). Therefore, we conclude that modality comprehensiveness is essential for a generalist biomedical AI model to facilitate efficient knowledge transfer.

Natural language as a task instructor
Multitasking is a key attribute of a unified and generalist model. Following the literature on language models using prompt and instruction learning10,69,70 and existing unified frameworks to eliminate task-specific modules, we defined each task with a custom instruction, excluding VQA tasks, which are fully specified by their text inputs. BiomedGPT supports abstractions of several tasks, including vision-only, text-only and vision–language, to achieve task comprehensiveness. We provide details of the pretraining tasks and fine-tuning and inference tasks, as well as their corresponding instructions, in the following sections.

Pretraining tasks. We considered two vision-only tasks in the pretraining process: for MIM as well as image infilling, we borrowed the idea of block-wise masking71 and let the model recover the masked patches in the middle part by generating the corresponding codes (see Fig. 6d). The corresponding instruction is 'What is the image in the middle part?'. For object detection, the model learns to generate the bounding box of an object with the instruction 'What are the objects in the image?'. For the text-only task, we adopted the commonly used MLM, whose logic is similar to MIM but the instruction is 'What is the complete text of "{Text}"?'. Two types of multimodal tasks were selected, including image captioning with the instruction of 'What does the image describe?' and VQA with the instruction of '{Question}'. The addition of OD for pretraining BiomedGPT serves to enhance visual learning, inspired by ref. 72. The mixture of pretraining tasks is effective, especially for processing multimodal inputs (Fig. 6a).

Fine-tuning and downstream tasks. Besides image captioning and VQA used in pretraining, we covered one more vision-only task and two more text-only tasks. Specifically, we used the instruction 'What does the image describe?' to differentiate image classification. 'What is the summary of text "{Text}"?' and 'Can text1 "{Text1}" imply text2 "{Text2}"?' were exploited for text summarization and natural-language inference, respectively. Notably, BiomedGPT is extendable, allowing for customization of instructions for specific downstream tasks (Fig. 1c and Supplementary Figs. 4–9).
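As a concrete illustration of how these natural-language instructions drive multitasking, the sketch below gathers the instruction templates quoted above into a single lookup table. The helper name and dictionary layout are our own illustrative choices, not the organization of the released BiomedGPT code.

```python
# Illustrative sketch: the task instructions quoted in the Methods text,
# collected into one lookup table. Names are hypothetical.
INSTRUCTIONS = {
    # pretraining tasks
    "image_infilling": "What is the image in the middle part?",
    "object_detection": "What are the objects in the image?",
    "masked_language_modeling": "What is the complete text of '{text}'?",
    "image_captioning": "What does the image describe?",
    "vqa": "{question}",  # VQA is fully specified by its question
    # fine-tuning / downstream tasks
    "image_classification": "What does the image describe?",
    "text_summarization": "What is the summary of text '{text}'?",
    "natural_language_inference": "Can text1 '{text1}' imply text2 '{text2}'?",
}


def build_instruction(task, **fields):
    """Fill the template for a task, e.g. build_instruction('vqa', question='...')."""
    return INSTRUCTIONS[task].format(**fields)


print(build_instruction("text_summarization", text="The patient presents with ..."))
```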

Ablation study on task comprehensiveness. To gain a deeper understanding of the impact of individual pretraining tasks on downstream performance, we implemented an ablation study that excludes either image-only or text-only tasks during pretraining, followed by fine-tuning of the resultant models on five downstream tasks. To ensure a fair comparison, we utilized downstream datasets that were excluded from the pretraining phase: (1) PneumoniaMNIST36 for image classification; (2) ROCO (https://github.com/razorx89/roco-dataset) for image captioning; (3) VQA-RAD for VQA; (4) MeQSum for text summarization; and (5) MedNLI for text understanding. Moreover, each model was fine-tuned using consistent training recipes across the same datasets.

Owing to the limited computing resources, we performed this study using only BiomedGPT-S. Referring to Supplementary Figure 2c, we used the BiomedGPT-S model, pretrained with all tasks, as the baseline. We observed several empirical phenomena in this ablation study (Fig. 6a): (1) excluding the MIM component resulted in decreased performance in image-centric and multimodal tasks, such as image classification and VQA accuracy. Conversely, text-centric tasks showed improvement. These outcomes indicate that MIM is not crucial for text-only tasks, potentially explaining the enhancements in those areas. (2) When MLM was excluded during pretraining, performance declined across all tasks in downstream evaluation. Text-centric tasks were substantially impacted. These findings underscore the importance of MLM for unified models, even for image-only tasks that require text-token dictionaries for label generation. (3) Excluding object detection during pretraining led to notable performance reductions in tasks such as image classification and radiology captioning. However, changes in performance for other datasets were relatively minor, likely owing to the limited number of object-detection samples and the weak connection to language-only tasks. In summary, our study highlights the importance of task diversity in pretraining for the unified medical AI. Although the exclusion of image-specific tasks might benefit performance on text-only tasks downstream, a varied task regime is essential for maintaining generalization across both unimodal and multimodal applications.

Model pretraining
We adopted sequence-to-sequence (seq2seq) learning73, which is a commonly used approach for large language models, to train our BiomedGPT. Formally, suppose we are given a sequence of tokens $x_{i,b}$ as input, where $i = 1, \cdots, I$ indexes the tokens in a data sample and $b = 1, \cdots, B$ indexes a sample in a training batch. Let a model be parametrized by θ. Then we autoregressively train the model by minimizing the loss function $\mathcal{L}_\theta$:

$$\mathcal{L}_\theta(x_{1,1}, \cdots, x_{I,B}) = -\sum_{b=1}^{B} \log \prod_{i=1}^{I} p_\theta\left(x_{i,b} \mid x_{1,b}, \cdots, x_{i-1,b}\right) = -\sum_{b=1}^{B} \sum_{i=1}^{I} \log p_\theta\left(x_{i,b} \mid x_{<i,b}\right).$$

In the context of BiomedGPT, x could refer to both linguistic and visual tokens in the pretraining tasks, including subwords, image codes and location tokens. Specifically, subwords were extracted by a BPE tokenizer, and we masked 15% of the subword tokens in the input in the MLM task, because these medical words show relatively high degrees of overlap. For the object-detection task, location tokens are generated following Pix2Seq66, conditioned on the observed pixel inputs. Data preprocessing was required for quantizing biomedical images using VQ-GAN67 owing to trivial semantics such as black backgrounds and the need to meet specific input size requirements. Therefore, we first removed the trivial background and cropped the image to the bounding box of the object of interest. We then resized the cropped image to 256 × 256 pixels and fed the center part, with a resolution of 128 × 128 pixels, into the pretrained VQ-GAN to generate the corresponding sparse image codes, which were the target output in the masked image modeling task. Vision–language tasks followed the same tokenization flow. For fine-tuning, we also applied seq2seq learning using different datasets and tasks.
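A minimal PyTorch sketch of the loss function defined above is shown below: the decoder is assumed to return next-token logits for every position, and the masked sum of token log-probabilities reproduces the double sum over batch and token indices. Tensor names and the toy usage are illustrative only, not the fairseq-based implementation used in the study.

```python
import torch
import torch.nn.functional as F


def seq2seq_loss(logits, targets, pad_id):
    """Autoregressive seq2seq loss: -sum_b sum_i log p(x_{i,b} | x_{<i,b}).

    logits:  (B, I, V) next-token distributions produced by the decoder
    targets: (B, I)    ground-truth token ids (subwords, image codes or location tokens)
    """
    log_probs = F.log_softmax(logits, dim=-1)                 # (B, I, V)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1))    # (B, I, 1) log p(x_{i,b} | x_{<i,b})
    mask = (targets != pad_id).unsqueeze(-1)                  # ignore padding positions
    return -(token_ll * mask).sum()                           # negative log-likelihood over the batch


# Toy usage with random tensors, only to show the expected shapes.
B, I, V = 2, 5, 100
logits = torch.randn(B, I, V)
targets = torch.randint(0, V, (B, I))
loss = seq2seq_loss(logits, targets, pad_id=1)
```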

To pretrain our BiomedGPT, we used the AdamW74 optimizer with exponential decay rates for the first and second momentum estimates β1 = 0.9 and β2 = 0.999, respectively, and a small constant ε = 1 × 10−8 added to the denominator to improve numerical stability. The peak learning rate is set to 1 × 10−4, and we applied a linear decay scheduler with a warmup ratio of 0.01 to control the learning rate. For regularization, we set the dropout to 0.1 and used a weight decay of 0.01. To enhance the training process, we used stochastic depth with a rate of 0.1, which was applied to the encoder and decoder, except for convolution blocks. Furthermore, we used a diversified approach in mixing all pretraining data within each batch. This included an assortment of multimodal, text-only, vision-only and object-detection samples. These were used in an 8:2:1:1 ratio to emphasize learning and enhance the interaction between vision and language. In addition, to address the potential feature shift caused by the inherent modality imbalance within the pretraining data, we adopted modality sampling strategies in each pretraining batch to ensure balance. The models were pretrained with 10 NVIDIA A5000 GPUs and mixed precision75. Pretraining of the base, medium and small models took approximately 87, 32 and 9 h, respectively. We initialized BiomedGPT with the pretrained OFA model51 and adapted it to the biomedical domain using our curated multimodal biomedical dataset. Specifically, we continued training from OFA's pretrained checkpoints to align biomedical concepts using diverse modality data through masked modeling, OD and image–text matching (Extended Data Table 3). This approach could reduce computational cost, as the continued training incorporates general-domain knowledge from OFA, including language-understanding capabilities that are beneficial for question-answering tasks.

Model fine-tuning and inference
Fine-tuning, a form of transfer learning, involves adapting a pretrained model's weights to new data. The practice of fine-tuning pretrained models, a widely acknowledged and highly effective approach in natural-language processing and computer vision, has also found important application in medical AI76,77. Unlike most previous biomedical models that necessitate the addition and training of extra components, such as a linear output layer or a decoder, our BiomedGPT model relies solely on fine-tuning the existing structure. The specific instructions used for this fine-tuning procedure mirror those in the pretraining workflow, thereby maintaining consistency and efficiency in model adaptation. We observed that, in tasks requiring long-context outputs, such as image captioning, the model's performance is influenced by hyperparameters, specifically beam search size and output length constraints (Supplementary Table 6). These findings informed our selection of hyperparameters for fine-tuning, which should be based on data statistics from the training set, such as the maximum length of the target text (Supplementary Table 7). For datasets with an official split, we selected the checkpoint that achieved the highest metric on the validation data for inference during model evaluation (Supplementary Table 7). For datasets lacking an official split, we employed k-fold cross-validation, used the checkpoint from the last epoch for inference and reported the mean and s.d.

Similar to existing large language models and multimodal models28, in inference, we used decoding strategies such as beam search to improve generation quality. However, this approach poses challenges for classification tasks, including unnecessary searching of the entire vocabulary and the possibility of generating invalid labels beyond the closed label set. To tackle these issues, we applied a beam search strategy incorporating a prefix tree (also known as a trie), limiting the number of candidate tokens and resulting in more efficient and accurate decoding. Extended Data Figure 3d demonstrates an example of trie-based beam search; along the path across 'Lipid' and 'breakdown', BiomedGPT sets logits for all invalid tokens ('mechanism' and 'pathway') to −∞ while computing log-probabilities for the target token 'in'. It is worth noting that trie-based search was also applied during the validation phase of the fine-tuning stage for acceleration (approximately 16× increase in speed in our experiments).

Model instruction-tuning and zero-shot prediction
Instruction-tuning was developed to improve the question-understanding capabilities of the pretrained BiomedGPT. Following the data-curation method used for LLaVA-Med12, we diverged from the traditional VQA approach, in which a pre-built answer set is used during both training and inference. Instead, in our instruction-tuning method, an open-vocabulary setting is used, allowing the model to operate without a predefined set of answers and thereby enabling it to independently determine the most appropriate response during both the training and inference phases.

We summarized experimental settings for each zero-shot trial as follows. In the VQA-RAD zero-shot experiment (Fig. 4), we used the original questions from the dataset as prompts or instructions. For the disease-diagnosis zero-shot experiments (Extended Data Fig. 5b), we used a common prompt template: 'Does the patient have <disease> given the image?'. The evaluation datasets were curated on the basis of the RSNA Pneumonia Detection Challenge (2018) (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018) and MedMNIST v2 (images with a resolution of 224 × 224 pixels)36. Specific evaluations were conducted across different medical datasets: (1) pneumonia detection involved 1,000 randomly sampled cases from RSNA, including 548 pneumonia and 452 normal cases. (2) Malignant tumor detection used the BreastMNIST dataset, comprising 114 normal or benign cases and 42 malignant cases. (3) Melanoma recognition was based on a subset of DermaMNIST with 223 positive melanoma cases. (4) Drusen recognition utilized a subset of OCTMNIST, featuring 250 positive drusen cases. (5) Cancer tissue identification was assessed on a PathMNIST subset, which included 1,233 colorectal adenocarcinoma epithelium cases, 421 cancer-associated stroma cases, 339 debris cases and 741 normal colon mucosa cases. In TB detection and report generation using two-view CXRs (Extended Data Fig. 5c), we replicated the experimental settings and prompt templates used by Med-PaLM M. Additionally, we incorporated the MIMIC-CXR training set, which includes single-view image–caption pairs, during continual pretraining to ensure a fair comparison with Med-PaLM M. For report generation, we utilized common NLP metrics to align with Med-PaLM M.

Furthermore, we conducted preliminary zero-shot studies on two instruction-tuned large language models, aiming to explore the upper bounds of in-context learning performance using advanced language backbones. We considered the potential integration of these elements into BiomedGPT to enhance reasoning capabilities. However, these models exhibited notable discrepancies when compared with fine-tuned models (Supplementary Fig. 3). These findings suggest that future academic research in medical AI should focus on improving in-context learning abilities and text comprehension, which are crucial for real-world clinical tasks.

Model extension
BiomedGPT was initially developed to process visual (specifically 2D images) and text data. However, the prototype's capabilities could be extended to encompass additional tasks and modalities. For example, we have extended BiomedGPT to include 3D medical imaging classification (Extended Data Table 5 and Supplementary Table 4). This extension involved implementing both pretraining and fine-tuning stages. It requires only integrating a pretrained 3D VQ-GAN for tokenizing 3D images in masked image modeling and adding a learnable 3D visual encoder into the pipeline (Fig. 2a). To further extend the model's capabilities, especially for non-text generation tasks, such as segmentation, introducing additional decoders, such as a mask decoder, is appropriate.

Computing hardware and software
We used Python (version 3.7.4) for all experiments and analyses in the study, which can be replicated using open-source libraries as outlined below. For pretraining, we used ten 24-GB NVIDIA A5000 GPUs configured for multi-GPU training using DistributedDataParallel (DDP) as implemented by the framework PyTorch (version 1.8.1, CUDA 12.2) with the sequence-to-sequence toolkit fairseq (version 1.0.0). For masked image modeling, we first cropped the middle part of the image and converted it to a sequence of visual tokens based on the pretrained VQ-GAN model (https://heibox.uni-heidelberg.de/d/2e5662443a6b4307b470/). The Pillow library (version 9.0.1) was used to read images, which were then converted to the base64 string format using Python. The Timm library (version 0.6.12), torchvision (version 0.9.1) and opencv-python (version 4.6.0) were applied for image processing and loading during training. We used the ftfy library (version 6.0.3) to fix potentially broken Unicode for text processing and loading. The Einops library (version 0.6.0) was applied for tensor operations in modeling. For model evaluation, we used pycocotools (version 2.0.4) and pycocoevalcap (version 1.2) to calculate NLP metrics such as ROUGE-L and CIDEr. Other metrics were calculated on the basis of torchmetrics (version 0.11.0). Numpy (version 1.21.5) and Pandas (version 1.3.5) were used in data collection, preprocessing and data analysis.

Evaluation metrics
We used several evaluation metrics to thoroughly assess the capabilities of our BiomedGPT model across different tasks. Accuracy is a primary metric used for evaluating the performance in medical-image classification, VQA and natural-language inference. In addition to accuracy, we also used the F1 score for the tasks in which class imbalance was considered. The F1 score is derived as the harmonic mean of precision and recall:

$$\mathrm{F1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

For a more convenient comparison with SOTA approaches, we used the weighted F1 score for VQA. This measure is computed by averaging the F1 scores across each class, with the individual class scores weighted according to their frequency of occurrence:

$$\text{Weighted F1} = \sum_{i=1}^{N} \frac{n_i}{N} \times \mathrm{F1}_i,$$

where $n_i$ is the number of instances in class $i$, $N$ is the total number of instances across all classes and $\mathrm{F1}_i$ is the F1 score for class $i$. Furthermore, we applied the macro-average F1 score (F1-macro) in image-classification tasks on the CBIS-DDSM dataset. The F1-macro score is calculated by determining the F1 score for each class independently and then averaging these scores across all classes. This approach does not account for class imbalances, treating each class with equal importance:

$$\text{F1-macro} = \frac{1}{N} \times \sum_{i=1}^{N} \mathrm{F1}_i.$$

The higher the accuracy and F1 score (either weighted- or macro-average), the better the performance the model achieves.

ROUGE-L26 was used to evaluate the quality of the generated text on the image-captioning and text-summarization tasks. Given the candidate C and reference R, let LCS(C, R) be the length of the longest common subsequence, which is determined by using dynamic programming. ROUGE-L can then be expressed as:

$$\text{ROUGE-L} = \frac{(1 + \beta^2)\, R_{\mathrm{LCS}}\, P_{\mathrm{LCS}}}{R_{\mathrm{LCS}} + \beta^2 P_{\mathrm{LCS}}},$$

where $R_{\mathrm{LCS}} = \frac{\mathrm{LCS}(C,R)}{r}$, $P_{\mathrm{LCS}} = \frac{\mathrm{LCS}(C,R)}{c}$ and $\beta = \frac{P_{\mathrm{LCS}}}{R_{\mathrm{LCS}}}$, with c and r representing the lengths of the candidate and reference, respectively. A higher ROUGE-L score means that the generated text shares more of the same sequences of words as the reference text, which typically indicates better quality in terms of capturing the salient points of the reference. It suggests that the generated text is more similar to the reference summaries that it is being compared with, which is usually desirable in summarization tasks.
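The following sketch implements ROUGE-L exactly as defined above, with the LCS length computed by dynamic programming over word sequences. It is a plain-Python illustration, not the pycocoevalcap implementation used for the reported scores.

```python
def lcs_length(candidate, reference):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(candidate), len(reference)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def rouge_l(candidate, reference):
    """ROUGE-L = (1 + beta^2) * R_LCS * P_LCS / (R_LCS + beta^2 * P_LCS)."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    p_lcs = lcs / len(cand_tokens)   # precision: LCS(C, R) / c
    r_lcs = lcs / len(ref_tokens)    # recall:    LCS(C, R) / r
    beta = p_lcs / r_lcs             # beta as defined in the text
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)


print(rouge_l("no acute cardiopulmonary abnormality", "no acute cardiopulmonary process"))
```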

In addition to ROUGE-L, we also used METEOR27 and CIDEr28 to obtain a more comprehensive evaluation of captioning generation quality. For METEOR, we represented precision and recall as $P = \frac{m}{c}$ and $R = \frac{m}{r}$, where m is the number of common words in the candidate C and the reference R, and c and r are their respective numbers of words. METEOR is calculated as follows:

$$\text{METEOR} = (1 - p)\, \frac{PR}{\alpha P + (1 - \alpha) R},$$

where p is the penalty factor, denoted as $p = \gamma\left(\frac{ch}{m}\right)^{\theta}$, and ch is the number of chunks, where a chunk is defined as a set of unigrams that are adjacent in the candidate and reference. α, θ and γ are hyperparameters that are set as 0.1, 3 and 0.5, respectively, in our calculation.

CIDEr is specifically designed to evaluate the quality of image captions. The CIDEr score is calculated using n-gram matching, considering both precision (how many n-grams in the generated caption are also in the reference captions) and recall (how many n-grams in the reference captions are also in the generated caption). It also weighs the n-grams based on their saliency (importance in describing the image) and rarity (uncommonness in the dataset), which helps to emphasize the importance of capturing the most relevant aspects of the image in the caption. CIDEr is obtained by averaging the similarity of different lengths:

$$\mathrm{CIDEr}_n(c, S) = \frac{1}{M} \sum_{i=1}^{M} \frac{g^n(c) \cdot g^n(S_i)}{\lVert g^n(c) \rVert\, \lVert g^n(S_i) \rVert},$$

where c is a candidate caption, S is the set of reference captions, M denotes the number of reference captions and $g^n(\cdot)$ is an n-gram-based term frequency–inverse document frequency vector. A higher CIDEr score suggests that the generated caption is more accurate and descriptive of the image content, aligning well with human judgments of what the image represents. CIDEr can range from 0 to 100. Typically, human captions tend to score near 90 (ref. 28).

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All data in this study are publicly available and can be accessed from: IU X-ray and Peir Gross (https://github.com/nlpaueb/bioCaption), MedICat (https://github.com/allenai/medicat), PathVQA (https://huggingface.co/datasets/flaviagiammarino/path-vqa), SLAKE 1.0 (https://www.med-vqa.com/slake/), DeepLesion (https://nihcc.app.box.com/v/DeepLesion), OIA-DDR (https://github.com/nkicsl/OIA), CheXpert-v1.0-small (https://www.kaggle.com/datasets/willarevalo/chexpert-v10-small), CytoImageNet (https://www.kaggle.com/datasets/stanleyhua/cytoimagenet), ISIC 2020 (https://challenge2020.isic-archive.com), Retinal Fundus (https://www.kaggle.com/c/diabetic-retinopathy-detection), MIMIC-III Clinic Notes (https://paperswithcode.com/dataset/hospital-admission-notes-from-mimic-iii), NCBI BioNLP (https://www.ncbi.nlm.nih.gov/research/bionlp/Data/), PubMed abstracts derived from the BLUE benchmark (https://github.com/ncbi-nlp/BLUE_Benchmark), VQA-RAD (https://osf.io/89kps/), CBIS-DDSM (https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset), SZ-CXR and MC-CXR (access can be requested via the contact at http://archive.nlm.nih.gov/repos/chestImages.php), MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.1.0/), MedNLI (https://physionet.org/content/mednli/1.0.0/), TREC 2022 (https://www.trec-cds.org/2022.html), SEER (https://seer.cancer.gov), MIMIC-III (https://physionet.org/content/mimiciii/1.4/), HealthcareMagic (https://huggingface.co/datasets/UCSD26/medical_dialog), MeQSum (https://huggingface.co/datasets/sumedh/MeQSum), MedMNIST v2 (https://medmnist.com) and ROCO (https://github.com/razorx89/roco-dataset). A randomly sampled subset of the RSNA Pneumonia Detection Challenge (2018) was used for zero-shot prediction (https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018). The MedMNIST-Raw is curated using multiple sources, including NCT-CRC-HE-100K (colon pathology) (https://zenodo.org/records/1214456), HAM10000 (dermoscopy) (https://github.com/ptschandl/HAM10000_dataset), OCT and Chest X-ray (https://data.mendeley.com/datasets/rscbjbr9sj/3), breast ultrasound (https://scholar.cu.edu.eg/Dataset_BUSI.zip), blood cell microscopy (https://data.mendeley.com/datasets/snkd93bnjr/1) and the Liver Tumor Segmentation Benchmark (LiTS) (https://competitions.codalab.org/competitions/17094). The VQA data for human evaluation are derived from Medical-Diff-VQA (https://physionet.org/content/medical-diff-vqa/1.0.0/), with the exclusion of questions related to differences, as these require a two-image input. Report generation and summarization samples for human evaluations are extracted from MIMIC-CXR. The instruction-following data used in this article are derived from PubMed (https://pubmed.ncbi.nlm.nih.gov) following the LLaVA-Med approach (https://github.com/microsoft/LLaVA-Med/blob/main/download_data.sh) and are combined with training sets from PathVQA and SLAKE. We also provided a table with more details of the major datasets in Extended Data Table 2.

Code availability
The pretrained and fine-tuned models, as well as source code for training, inference and data preprocessing, can be accessed at https://github.com/taokz/BiomedGPT.

References
58. Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021).


59. Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pretraining of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).
60. Ke, G., He, D. & Liu, T. Y. Rethinking positional encoding in language pretraining. In International Conference on Learning Representations (ICLR, 2019).
61. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
62. Shleifer, S., Weston, J. & Ott, M. Normformer: improved transformer pretraining with extra normalization. Preprint at https://arxiv.org/abs/2110.09456 (2021).
63. Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: marrying convolution and attention for all data sizes. In Proc. Advances in Neural Information Processing Systems 34 (NeurIPS 2021) 3965–3977 (Neural Information Processing Systems, 2021).
64. Wang, Z. et al. SimVLM: simple visual language model pretraining with weak supervision. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).
65. Esser, P., Rombach, R. & Ommer, B. Taming transformers for high-resolution image synthesis. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12873–12883 (Institute of Electrical and Electronics Engineers/Computer Vision Foundation, 2021).
66. Chen, T. et al. Pix2seq: a language modeling framework for object detection. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).
67. Gage, P. A new algorithm for data compression. C. Users J. 12, 23–38 (1994).
68. He, K. et al. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (Institute of Electrical and Electronics Engineers, 2016).
69. Wei, J. et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).
70. Schick, T. & Schütze, H. It's not just size that matters: small language models are also few-shot learners. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds. Toutanova, K. et al.) 2339–2352 (Association for Computational Linguistics, 2021).
71. Bao, H. et al. BEiT: BERT pretraining of image transformers. In International Conference on Learning Representations (International Conference on Learning Representations, 2022).
72. Xu, H. et al. E2E-VLP: end-to-end vision-language pretraining enhanced by visual learning. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (eds. Zong, C. et al.) 503–513 (2021).
73. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (Conference on Neural Information Processing Systems, 2014).
74. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (International Conference on Learning Representations, 2019).
75. Micikevicius, P. et al. Mixed precision training. In International Conference on Learning Representations (International Conference on Learning Representations, 2018).
76. Raghu, M. et al. Transfusion: understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems 32 (Conference on Neural Information Processing Systems, 2019).
77. Zhou, C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Preprint at https://arxiv.org/abs/2302.09419 (2023).

Acknowledgements
NSF grant CRII-2246067, NSF POSE: Phase II-2346158 and Lehigh Grant FRGS00011497 supported L.S., K.Z., Z.Y. and Y.L. NIH grant R21EY034179, NSF grants NCS-2319451, MRI-2215789 and IIS-1909879, as well as Lehigh's Accelerator and CORE grants S00010293 and S001250, supported L.H. and R.Z. NIH grants R01HL159183 and RF1AG057892 supported Q.L. NIH grant R03AG078625 supported X.L. NIH grants R01EB19403 and R01LM11934 supported S.F. and H.L. Icons used in Fig. 2 were made by Freepike, surang, Smartline and Blackonion02 at www.flaticon.com.

Author contributions
K.Z. and L.S. designed the study. K.Z., R.Z. and E.A. carried out data collection, data preprocessing, model construction and model validation. J.Y., Z.Y., Y.L. and Z.L. carried out the data analysis benchmarking results. X.C., B.D.D., J.H., C.C., Y.Z., S.F., W.L., T.L., X.L., Y.C., L.H., J.Z., Q.L. and H.L. provided knowledge support and interpreted the findings. H.R. carried out the human evaluation for the generated text from BiomedGPT as well as GPT-4V. L.S. provided knowledge support, interpreted the findings and supervised the study. All authors contributed to manuscript writing and reviewed and approved the final version. L.H., X.L. and L.S. co-supervised the study.

Competing interests
The research was conducted independently of any commercial or financial relationships that could be construed as a potential conflict of interest. Although X.C. is employed by Samsung, the company was not involved in any aspect of this research. The other authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41591-024-03185-2.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41591-024-03185-2.

Correspondence and requests for materials should be addressed to Xiang Li, Lifang He or Lichao Sun.

Peer review information Nature Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Reprints and permissions information is available at www.nature.com/reprints.


Extended Data Fig. 1 | Statistics of pretraining and fine-tuning datasets. (a) Modality distribution of pretraining data used in BiomedGPT. (b) For the training and testing splits of datasets used in downstream fine-tuning, we typically follow the format of number of training samples/number of validation samples/number of test samples to detail each dataset. More details of the data split are described in Supplementary Table 7.


Extended Data Fig. 2 | Overview of BiomedGPT's model configuration and architecture. (a) Detailed model configuration of BiomedGPT. Here, '#' indicates number of. 'Att.', 'Enc.' and 'Dec.' indicate Attention, Encoder and Decoder, respectively. The hidden size is the size of the embeddings and the size of the output of each self-attention and feed-forward layer. The first layer of FFN expands the hidden size to the intermediate size, and the second layer contracts it back to the hidden size. This expansion and contraction allow the network to create more complex representations. During the pretraining phase, image processing involves resizing and cropping the images to varying resolutions, corresponding to the input sizes listed in the table. It should be noted that during fine-tuning and inference stages, the input resolution of BiomedGPT can be flexibly adjusted according to the specific requirements of the task. (b) The neural network architecture of BiomedGPT, which includes bidirectional encoder blocks and autoregressive decoder blocks. The number of blocks varies for different model scales.




Extended Data Fig. 3 | The graphical illustrations of the key components in BiomedGPT. (a) Head-scale multi-head attention module in BiomedGPT. The trainable parameter γh is applied prior to the output projection for each head. (b) Instead of adding the absolute positional embedding Pi to the input embedding Ii (left), we compute the positional correlation and input correlation separately with different projection matrices and add them together in the self-attention module (right). (c) Graphical illustration of relative position bias. Such an inductive bias Bj−i is a learnable parameter and can be viewed as the embedding of the relative position j−i, which is injected into the Query–Key product, $\frac{1}{\sqrt{d}}(I_i W^Q)(P_i W^K)^\top + B_{j-i}$, and shared in all layers. (d) An example of trie-based beam search: along the path across 'Lipid' and 'breakdown', BiomedGPT sets logits for all invalid tokens ('mechanism' and 'pathway') to −∞ when computing log-probabilities for the target token 'in'. It is worth noting that trie-based search is also applied during the validation phase of the fine-tuning stage for acceleration (approximately 16× increase in speed in our experiments).
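To make panel (d) concrete, the sketch below shows the core of trie-constrained decoding: candidate labels are inserted into a prefix tree, and at each step the logits of tokens that do not continue any valid label are set to −∞ before log-probabilities are computed. It is a simplified scoring version with an invented token-id vocabulary and a placeholder model callback, not the fairseq beam-search code used by BiomedGPT.

```python
import math
import torch


def build_trie(label_token_ids):
    """Build a prefix tree over tokenized candidate labels, e.g. [[12, 7, 3], [12, 9]]."""
    trie = {}
    for ids in label_token_ids:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
    return trie


def constrained_log_probs(logits, trie_node):
    """Set logits of tokens that do not continue any valid label to -inf, then normalize."""
    allowed = list(trie_node.keys())
    masked = torch.full_like(logits, -math.inf)
    masked[allowed] = logits[allowed]
    return torch.log_softmax(masked, dim=-1)


def score_label(next_token_logits, trie, label_ids):
    """Sum constrained log-probabilities along one label path of the trie.

    `next_token_logits(prefix)` is a placeholder callback returning a 1-D logits
    tensor for the next token given the already-decoded prefix.
    """
    node, prefix, total = trie, [], 0.0
    for tok in label_ids:
        total += float(constrained_log_probs(next_token_logits(prefix), node)[tok])
        prefix.append(tok)
        node = node[tok]
    return total

# The predicted class is the candidate label with the highest constrained score:
# best = max(candidate_label_ids, key=lambda ids: score_label(next_token_logits, trie, ids))
```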

Extended Data Fig. 4 | Comparative Performance of BiomedGPT and Med-PaLM M and the prompt tuning results in Image classification. (a) Comparison between BiomedGPT-B and Med-PaLM M on CBIS-DDSM dataset. (b) The experimental results of prompt tuning BiomedGPT-B on three image classification datasets. Prompt tuning learns 'soft prompts' or extra model parameters for each task instead of making a task-specific copy of the entire pretrained model for each downstream task and inference must be performed in separate batches. We must mention that the addition of soft prompts is contrary to the design principle of the generalist model. We injected two prompt layers into the encoder and decoder, and varied the prompt length {20, 40, 60, 80, 100, 120} to investigate the performance comparison against full-model fine-tuning. The preliminary results of 'Colon pathology', 'Blood cell microscope', and 'Chest X-ray' were obtained after 100, 512, and 55 training epochs respectively, all with a consistent batch size of 512. We observed that as the prompt length increases, the model performance tends to improve. However, despite an increased number of tuning epochs compared with fine-tuning on the original BiomedGPT (Fig. 3c), the performance after prompt tuning notably lags behind that of model fine-tuning. Specifically, considering only the best results in prompt tuning, there are substantial accuracy reductions of 32.3%, 54.6%, and 32.6% on these three datasets, respectively.
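For reference, the sketch below illustrates the general soft-prompt idea evaluated here: a small matrix of trainable prompt embeddings is prepended to a frozen model's input embeddings, so only the prompt parameters are updated. It is a generic illustration of prompt tuning (ref. 56) with an invented module name, not the exact layer-injection scheme used in this experiment.

```python
import torch
import torch.nn as nn


class SoftPromptWrapper(nn.Module):
    """Prepend trainable prompt vectors to a frozen encoder's input embeddings."""

    def __init__(self, encoder, embed_dim, prompt_length=64):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # keep the pretrained weights frozen
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeddings], dim=1))
```

Only `self.prompt` receives gradients, which is why the number of tuned parameters grows with the prompt length swept in panel (b).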

Extended Data Fig. 5 | Additional zero-shot results of BiomedGPT. (a) Graphical illustration of zero-shot classification using CLIP-style models, linear probing transfer learning using VIT or BERT-style models, and zero-shot generation of BiomedGPT. Notably, our model can generate the response without providing additional components such as the label candidates for CLIP or linear classifier requiring training for ViT. (b) Zero-shot performance on five disease diagnosis tasks. (c) BiomedGPT shows competitive zero-shot performance compared with Med-PaLM M with a much smaller model scale. The SOTA fine-tuned model for TB detection is TBLightNet. Note that no single model consistently outperforms the others across all four metrics used in report generation. Here, SOTAs represent the best performance achieved in each specific metric. We fine-tuned our pretrained BiomedGPT-B on MultiMedBench, which Med-PaLM M proposed and used for fine-tuning based on the pretrained PaLM-E. We also attempted to fine-tune LLaVA-Med; however, the time and computational costs were prohibitive due to the large scale of the model and data. Therefore, we reported the results using the pretrained checkpoint of LLaVA-Med.
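The sketch below shows how the zero-shot diagnosis prompt described in the Methods ('Does the patient have <disease> given the image?') can be issued to a generative model and mapped back to a binary label; `model.generate` is a placeholder for whichever text-generation interface is available, not a documented BiomedGPT API.

```python
TEMPLATE = "Does the patient have {disease} given the image?"


def zero_shot_diagnose(model, image, disease):
    """Ask a yes/no diagnostic question and map the generated answer to a label."""
    prompt = TEMPLATE.format(disease=disease)
    answer = model.generate(image=image, instruction=prompt)  # placeholder interface
    return answer.strip().lower().startswith("yes")


# Example over a small candidate set (hypothetical model and image objects):
# for disease in ["pneumonia", "melanoma", "drusen"]:
#     print(disease, zero_shot_diagnose(model, image, disease))
```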

Extended Data Table 1 | Fine-tuned experimental results of BiomedGPT on 25 diverse experiments


Extended Data Table 2 | Datasets used in BiomedGPT for pretraining, fine-tuning, evaluation with details


Extended Data Table 3 | Instructions for pretraining tasks along with the corresponding format of the output

Here, <img> represents the image token derived from VQ-GAN’s vocabulary. <loc> represents the location token. The instruction for the VQA task is the question itself from the dataset.
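As an illustration of how a bounding box becomes a short sequence of <loc> tokens on the target side, the sketch below quantizes normalized corner coordinates into a fixed number of location bins, Pix2Seq style; the Methods specify 1,000 location tokens, but the exact bin-to-token mapping shown here is an assumption for illustration rather than the released vocabulary layout.

```python
NUM_LOC_BINS = 1000  # matches the 1,000 location tokens in the unified vocabulary


def box_to_location_tokens(x0, y0, x1, y1, width, height):
    """Quantize a pixel-space box (x0, y0, x1, y1) into four <loc_*> tokens."""
    def quantize(value, size):
        value = min(max(value / size, 0.0), 1.0)          # normalize to [0, 1]
        return int(round(value * (NUM_LOC_BINS - 1)))      # map to a discrete bin
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    return ["<loc_{}>".format(b) for b in bins]


print(box_to_location_tokens(34, 50, 180, 220, width=256, height=256))
# ['<loc_133>', '<loc_195>', '<loc_702>', '<loc_859>']
```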


Extended Data Table 4 | Description of the question types for human evaluation

Description of the question types in the selected VQA-RAD data samples, which are used for the evaluation of zero-shot learning performance.


Extended Data Table 5 | 3D medical image classification performance

3D medical image classification performance in terms of accuracy and F1-Macro. (Details of data and training are described in Supplementary Table 4).
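A minimal sketch of the 'learnable 3D visual encoder' mentioned in the Methods for this extension is shown below: a single Conv3d turns a volume into a sequence of patch tokens that can be fed to the same transformer encoder. The patch size and embedding dimension are placeholders, not the configuration used for the reported 3D results.

```python
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Turn a 3D volume (B, C, D, H, W) into a sequence of patch embeddings (B, N, dim)."""

    def __init__(self, in_channels=1, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, volume):
        x = self.proj(volume)                 # (B, dim, D', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) with N = D' * H' * W'


tokens = PatchEmbed3D()(torch.randn(1, 1, 64, 128, 128))
print(tokens.shape)  # torch.Size([1, 256, 768])
```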
