Article

A whole-slide foundation model for digital pathology from real-world data

Hanwen Xu1,2,7, Naoto Usuyama1,7, Jaspreet Bagga1, Sheng Zhang1, Rajesh Rao1, Tristan Naumann1, Cliff Wong1, Zelalem Gero1, Javier González1, Yu Gu1, Yanbo Xu1, Mu Wei1, Wenhui Wang1, Shuming Ma1, Furu Wei1, Jianwei Yang1, Chunyuan Li1, Jianfeng Gao1, Jaylen Rosemon3, Tucker Bower3, Soohee Lee4, Roshanthi Weerasinghe4, Bill J. Wright4, Ari Robicsek4, Brian Piening3,5, Carlo Bifulco3,5 ✉, Sheng Wang2,6 ✉ & Hoifung Poon1 ✉

https://doi.org/10.1038/s41586-024-07441-w
Received: 30 November 2023
Accepted: 19 April 2024
Published online: 22 May 2024

Open access
Digital pathology poses unique computational challenges, as a standard gigapixel
slide may comprise tens of thousands of image tiles1–3. Prior models have often
resorted to subsampling a small portion of tiles for each slide, thus missing the
important slide-level context4. Here we present Prov-GigaPath, a whole-slide
pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image
tiles in 171,189 whole slides from Providence, a large US health network comprising
28 cancer centres. The slides originated from more than 30,000 patients covering 31
major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision
transformer architecture for pretraining gigapixel pathology slides. To scale
GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath
adapts the newly developed LongNet5 method to digital pathology. To evaluate
Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer
subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data6. With
large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains
state-of-the-art performance on 25 out of 26 tasks, with significant improvement
over the second-best method on 18 tasks. We further demonstrate the potential of
Prov-GigaPath on vision–language pretraining for pathology7,8 by incorporating
the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model
that achieves state-of-the-art performance on various digital pathology tasks,
demonstrating the importance of real-world data and whole-slide modelling.

Computational pathology has the potential to transform cancer diagnostics by empowering diverse clinical applications, including cancer subtyping2,9,10, cancer staging1,11–13, diagnostic prediction14–17 and prognostic prediction18–23. Despite the encouraging performance of existing computational approaches, these are often developed for a specific application and require a large amount of annotated data for supervised learning. Data annotation is expensive and time-consuming and has emerged as an important bottleneck for computational pathology. Recently, self-supervised learning has shown promising results in leveraging unlabelled data to pretrain a foundation model, which can substantially reduce the demand for task-specific annotations24–28. Owing to their strong generalizability, foundation models have been developed for biomedical domains where labelled data are scarce but unlabelled data are abundant, a situation that aptly describes computational pathology29–33.

There are three major challenges that hinder the development and use of pathology foundation models for real-world clinical applications. First, publicly available pathology data are relatively scarce and of varying quality, which limits the performance of foundation models pretrained on such data. For example, existing pathology foundation models were mainly pretrained on whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA), an expert-curated dataset comprising approximately 30,000 slides and 208 million image tiles. Although they are a tremendous resource, TCGA data might not be sufficiently large to fully address the challenges around real-world digital pathology in clinical practice, such as heterogeneity and noise artefacts34, leading to a substantial performance drop when using TCGA-based predictive models and biomarkers on out-of-distribution samples. Second, it remains challenging to design a model architecture that can effectively capture both local patterns in individual tiles and global patterns across whole slides35–39. Existing models often treat each image tile as an independent sample and formulate slide-level modelling as multiple instance learning4,40–43, thus limiting their ability to model complex global patterns in gigapixel whole slides. A notable exception is Hierarchical Image Pyramid Transformer (HIPT), which explores hierarchical self-attention over the tiles35.

1Microsoft Research, Redmond, WA, USA. 2Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA. 3Providence Genomics, Portland, OR, USA. 4Providence Research Network, Renton, WA, USA. 5Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA. 6Department of Surgery, University of Washington, Seattle, WA, USA. 7These authors contributed equally: Hanwen Xu, Naoto Usuyama. ✉e-mail: carlo.bifulco@providence.org; swang@cs.washington.edu; hoifung@microsoft.com

Third, in the rare cases in which pretraining has been conducted on large-scale real-world patient data, the resulting foundation models are typically not accessible to the public, thus limiting their broader applicability in clinical research and applications.

We have developed Prov-GigaPath, an open-weight pathology foundation model, to address these three challenges (Supplementary Table 1). First, Prov-GigaPath is pretrained on Prov-Path, a large digital pathology dataset from the Providence health network across 28 cancer centres. Prov-Path contains 1,384,860,229 image tiles from 171,189 haematoxylin and eosin (H&E)-stained and immunohistochemistry pathology slides, which originated from biopsies and resections in more than 30,000 patients, covering 31 major tissue types. Prov-Path is more than five times larger than TCGA in terms of the number of image tiles and more than two times larger than TCGA in terms of the number of patients. Our pretraining leverages all 1.3 billion image tiles, which, to our knowledge, constitutes the largest pretraining effort to date. These large, diverse, real-world data serve as the foundation for pretraining Prov-GigaPath. Prov-Path also encompasses a hierarchy of valuable information, including histopathology findings, cancer staging and genomic mutation profiles, along with the associated pathology reports.

Second, to capture both local and global patterns across the entire slide, we propose GigaPath, a novel vision transformer for pretraining large pathology foundation models on gigapixel pathology slides. The key idea is to embed image tiles as visual tokens, thus turning a slide into a long sequence of tokens. The transformer44 is a powerful neural architecture for sequence modelling that can distil arbitrarily complex patterns among the tokens. However, we cannot directly apply a conventional vision transformer to digital pathology, as a pathology slide may contain tens of thousands of tiles (as many as 70,121 in the Providence data) and computation with self-attention in a transformer grows quadratically in the sequence length. To address this problem, we leverage dilated self-attention by adapting our recently developed LongNet method5. Pretraining starts with image-level self-supervised learning using DINOv224 with a standard vision transformer, followed by whole-slide-level self-supervised learning using a masked autoencoder45 with LongNet.
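To make the scaling argument concrete, the sketch below illustrates the dilated self-attention idea in isolation. It is a hedged, single-head simplification written for this article rather than code from the released GigaPath implementation (which combines multiple segment lengths, dilation rates and attention heads); it only shows why attention over tens of thousands of tile tokens remains tractable when each token attends within a dilated segment instead of the full sequence.

```python
# Hedged sketch of dilated self-attention over a long tile-token sequence.
import torch
import torch.nn.functional as F

def dilated_self_attention(x, segment_len=2048, dilation=4):
    """x: (n_tiles, dim) tile embeddings; returns contextualized embeddings."""
    n, d = x.shape
    out = torch.zeros_like(x)
    for start in range(0, n, segment_len):
        seg = x[start:start + segment_len]                 # one local segment
        for offset in range(dilation):
            idx = torch.arange(offset, seg.shape[0], dilation)
            sub = seg[idx]                                 # dilated subsequence
            attn = F.softmax(sub @ sub.T / d ** 0.5, dim=-1)
            out[start + idx] = attn @ sub                  # write results back
    return out

tiles = torch.randn(70_121, 768)       # a Providence slide can hold ~70,000 tiles
mixed = dilated_self_attention(tiles)   # feasible, unlike full O(n^2) attention
```

With segment length w and dilation r, each attention matrix has roughly (w/r) squared entries, so the total cost grows near-linearly with the number of tiles rather than quadratically.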
Finally, to accelerate research progress in digital pathology, we make Prov-GigaPath fully open-weight, including source code and pretrained model weights.

To systematically investigate the effectiveness of Prov-GigaPath as a pathology foundation model in real-world scenarios, we established a comprehensive digital pathology benchmark spanning 26 prediction tasks such as pathomics and cancer subtyping, using data from both Providence and TCGA. We compare Prov-GigaPath against the state-of-the-art pathology foundation models that are publicly available, including HIPT35, CtransPath41 and REMEDIS42. Combining large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method in 18 tasks (Supplementary Table 2). For example, on the TCGA dataset for EGFR mutation prediction, Prov-GigaPath attained an improvement of 23.5% in AUROC and 66.4% in AUPRC compared with the second-best model, REMEDIS. This is particularly remarkable as REMEDIS was pretrained on TCGA data whereas Prov-GigaPath was not. For cancer subtyping, Prov-GigaPath outperforms all other models in all nine cancer types, with significant improvement over the second-best method in six cancer types. This bodes well for its broad applicability across cancer types. Finally, we explore vision–language pretraining by leveraging the associated pathology report for each slide to continue pretraining Prov-GigaPath with vision–language contrastive learning. We showed that the resulting Prov-GigaPath exhibits state-of-the-art capability in standard vision–language modelling tasks such as zero-shot subtyping and mutation prediction, illustrating its potential for multimodal integrative data analysis. In sum, Prov-GigaPath demonstrates the possibility to assist clinical diagnostics and decision support using large-scale machine learning models.

Overview of Prov-GigaPath

Prov-GigaPath takes the image tiles in a pathology slide as input and outputs the slide-level embeddings that can be used as features for diverse clinical applications (Fig. 1a). Prov-GigaPath excels in long-context modelling of gigapixel pathology slides, by distilling varied local pathological structures and integrating global signatures across the whole slide. Prov-GigaPath consists of a tile encoder for capturing local features and a slide encoder for capturing global features. The tile encoder individually projects all tiles into compact embeddings. The slide encoder then takes the sequence of tile embeddings as input and generates contextualized embeddings that take into account the entire sequence using a transformer. The tile encoder is pretrained using DINOv2, the state-of-the-art image self-supervised learning framework24. The slide encoder combines masked autoencoder pretraining with LongNet5, our recently developed method for ultra long-sequence modelling. In downstream tasks, the output of the slide encoder is aggregated using a simple softmax attention layer. Prov-GigaPath is a general pretraining method for high-resolution imaging data, which can potentially be extended to other biomedical problems, including the analysis of large 2D and 3D images and videos.
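The end-to-end flow from tiles to a slide-level feature vector can be summarized in a short sketch. The module names below (tile_encoder, slide_encoder, AttentionPool) are placeholders standing in for the pretrained ViT tile encoder, the LongNet slide encoder and the softmax attention layer described above; this is an illustrative outline under those assumptions, not the released API.

```python
# Hedged sketch of the two-stage encoding flow: tiles -> tile embeddings ->
# contextualized embeddings -> attention-pooled slide embedding.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simple softmax attention over contextualized tile embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                        # h: (n_tiles, dim)
        w = torch.softmax(self.score(h), dim=0)  # (n_tiles, 1) attention weights
        return (w * h).sum(dim=0)                # (dim,) slide-level embedding

def embed_slide(tiles, tile_encoder, slide_encoder, pool):
    # tiles: (n_tiles, 3, 256, 256) crops from one WSI, in row-major order
    with torch.no_grad():
        tile_emb = tile_encoder(tiles)           # (n_tiles, dim) local features
        ctx_emb = slide_encoder(tile_emb)        # (n_tiles, dim) global context
    return pool(ctx_emb)                         # vector used by downstream heads
```

The pooled vector is what task-specific classifiers consume during fine-tuning and evaluation.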
We pretrained Prov-GigaPath on the large and diverse real-world data in Prov-Path. Given a downstream task, the pretrained Prov-GigaPath is fine-tuned using task-specific training data, as standard in the use of a foundation model. The resulting task-specific model can then be evaluated on the test data for the given task. Prov-GigaPath attained significant improvements compared to prior state-of-the-art public pathology foundation models across 17 pathomics tasks and 9 subtyping tasks. Our pretraining dataset Prov-Path consists of 1,384,860,229 256 × 256 image tiles in 171,189 H&E-stained and immunohistochemistry pathology slides, which stem from biopsies and resections of 31 major tissue types in over 30,000 patients (Supplementary Figs. 1–3). We summarize the demographics, including the distribution of sex, age and race, in Supplementary Tables 3–5 and the mutation rates in Supplementary Table 6.

Prov-GigaPath improves mutation prediction

A variety of function-altering somatic gene mutations underlie cancer progression and development, and thus may have utility in both cancer diagnostics and prognostics. Although the cost of sequencing has dropped substantially, there are still critical healthcare gaps in terms of access to tumour sequencing worldwide. Therefore, predicting tumour mutations from pathology images may help to inform treatment selection and increase personalized medicine utilization17. We compared Prov-GigaPath with competing methods on five-gene mutation prediction benchmarks (Fig. 2 and Extended Data Figs. 1–4) by formulating this task as an image classification task. First, we examined the prediction of 18 biomarkers that are most frequently mutated in a pan-cancer setting (Fig. 2a,f,l and Extended Data Fig. 1). Prov-GigaPath achieved 3.3% macro-area under the receiver operator characteristic (AUROC) improvement and 8.9% macro-area under the precision-recall curve (AUPRC) improvement across these 18 biomarkers compared with the best competing method. Given known associations between specific tumour mutations and overall tumour composition and morphology, we attribute this improvement to the ability of LongNet to effectively capture the global image patterns. Next, we focused on lung adenocarcinoma (LUAD), which is one of the most widely studied cancer types for image-based mutation prediction (Fig. 2b,g and Extended Data Fig. 2). We focused on five genes (EGFR, FAT1, KRAS, TP53 and LRP1B) that are closely related to LUAD diagnosis and treatment in the literature46–48. Prov-GigaPath demonstrated the best performance by achieving an average macro-AUROC of 0.626, surpassing all competing approaches (P value < 0.01).



[Figure 1 schematic (panels a–c): a, the 256 × 256 image tile sequence passes through the tile-level encoder (vision transformer) and the slide-level encoder (LongNet with dilated attention) to produce image-level and slide-level embeddings; b, DINOv2 teacher–student pretraining of the tile encoder on global and local crops with a contrastive loss; c, masked-autoencoder pretraining of the LongNet-based slide encoder and decoder to reconstruct target tile embeddings from masked inputs.]
Fig. 1 | Overview of Prov-GigaPath. a, Flow chart showing the model architecture of Prov-GigaPath. Prov-GigaPath first serializes each input WSI into a sequence of 256 × 256 image tiles in row-major order and uses an image tile-level encoder to convert each image tile into a visual embedding. Then Prov-GigaPath applies a slide-level encoder based on the LongNet architecture to generate contextualized embeddings, which can serve as the basis for various downstream applications. b, Image tile-level pretraining using DINOv2. c, Slide-level pretraining with LongNet using masked autoencoder. [CLS] is the classification token.

On the pan-cancer analysis, Prov-GigaPath also outperformed the best competing methods on these 5 genes, with 6.5% macro-AUROC improvement and 18.7% AUPRC improvement (Fig. 2c,h and Extended Data Fig. 3).

We also conducted head-to-head comparison of all approaches on TCGA data to examine the generalizability of Prov-GigaPath. We again used LUAD-specific five-gene mutation prediction as a key evaluation task (Fig. 2d,i and Extended Data Fig. 4). We observed similar advantage of Prov-GigaPath over the competing methods. This is all the more remarkable given that the competing methods35,41,42 were all pretrained on TCGA. To further test the generalizability of Prov-GigaPath, we collected a new cohort of 403 patients with colorectal cancer from Providence. These data were collected after March 2023, whereas all data used for pretraining Prov-GigaPath were collected before March 2023. We found that Prov-GigaPath again outperformed competing methods on this cohort. We also noted that the performance was not significantly different from that on previous data from patients with colorectal cancer (Extended Data Fig. 5). Finally, we examined the prediction of overall tumour mutation burden (TMB), a predictive biomarker in solid tumours that is particularly relevant for immunotherapy. Prov-GigaPath achieved the best performance with an average AUROC of 0.708, with significant improvement over the second-best method (Fig. 2e,j).
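Throughout these comparisons, significance is assessed with a one-sided Wilcoxon test over n = 10 independent runs (see the Fig. 2 legend). A minimal sketch of such a comparison, using illustrative placeholder AUROC values rather than the study's actual per-run scores, is:

```python
# Hedged sketch: one-sided Wilcoxon signed-rank test over paired per-run scores.
from scipy.stats import wilcoxon

# Illustrative, made-up AUROCs for 10 runs of two methods (not real results).
prov_gigapath = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.71, 0.70, 0.69, 0.72]
second_best   = [0.66, 0.65, 0.68, 0.64, 0.66, 0.67, 0.65, 0.66, 0.64, 0.67]

# Test whether the first method's scores are systematically higher.
stat, p = wilcoxon(prov_gigapath, second_best, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, one-sided P={p:.4f}")
```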
We observed that GigaPath pretrained on Prov-Path achieves a substantial improvement against the same model architecture pretrained on TCGA data when tested on LUAD-specific five-gene mutation prediction in TCGA, indicating the high quality of Prov-Path (Extended Data Fig. 6). We further found that GigaPath outperformed HIPT when both are trained on Prov-Path, indicating the effectiveness of the GigaPath framework (Extended Data Figs. 7 and 8). To further assess the pretraining strategy of our method, we observed that pretraining using DINOv2 is better than pretraining using a contrastive-learning-based approach (SimCLR26) or masked autoencoders45 (Supplementary Fig. 4), demonstrating the effectiveness of our pretraining strategy. Prov-GigaPath also outperformed a supervised learning approach that utilizes an ImageNet-trained model, underscoring the necessity of our self-supervised learning framework (Supplementary Fig. 4).

Overall, Prov-GigaPath demonstrated clear performance gains on various pathomics tasks over prior state-of-the-art pathology foundation models. We hypothesize that such significant improvement reflects the differentiation advantage of our whole-slide modelling.

Prov-GigaPath improves cancer subtyping

Given the overall utility of pathology images in assigning tumour subtypes2,9,10,49, we next examined whether Prov-GigaPath can accurately predict cancer subtypes from images. We evaluated our method on subtyping for nine major cancer types in Prov-Path (Fig. 3). Prov-GigaPath outperformed all competing approaches on all nine cancer types and achieved significant improvements compared with the second-best method on six cancer types, indicating that our tile encoder and slide encoder work synergistically to extract meaningful features for differentiating minute pathological patterns. A key difference between HIPT and Prov-GigaPath is the aggregation layer over image tile embeddings. The substantial improvement of Prov-GigaPath over HIPT demonstrates the promise in using LongNet for efficient and effective aggregation of the ultra-large collection of image tiles in a whole slide.

[Figure 2 panels a–l: bar plots of AUROC and AUPRC for Prov-GigaPath, HIPT, CtransPath and REMEDIS on pan-cancer 18-biomarker, LUAD 5-gene, pan-cancer 5-gene, LUAD 5-gene (TCGA) and pan-cancer TMB prediction; per-gene AUROC for EGFR, FAT1, KRAS, LRP1B and TP53 on LUAD 5-gene prediction; and per-biomarker AUROC on the pan-cancer 18-biomarker task.]

Fig. 2 | Gene mutation prediction. a−j, Bar plots comparing the AUROC and AUPRC scores of Prov-GigaPath and competing methods on pan-cancer 18-biomarker (a,f), LUAD-specific 5-gene mutation prediction (b,g), pan-cancer 5-gene mutation prediction (c,h), LUAD-specific 5-gene mutation prediction on TCGA (d,i) and pan-cancer TMB prediction (e,j). k, Bar plot showing AUROC for each gene on LUAD-specific five-gene mutation prediction on TCGA. a−k, Data are mean ± s.e.m. across n = 10 independent experiments. The listed P value indicates the significance for Prov-GigaPath outperforming the best comparison approach, with one-sided Wilcoxon test. l, Comparison of AUROC scores for individual biomarkers in pan-cancer 18-biomarker predictions.

Finally, we conducted ablation studies to systematically assess how each component of Prov-GigaPath contributes to its performance in cancer subtyping (Supplementary Fig. 5). To examine the importance of LongNet pretraining, we replaced the LongNet encoder pretrained on Prov-Path with a randomly initialized model. We observed a substantial performance decrease in average AUROC from 0.903 to 0.886 (P value < 2.0 × 10−3), indicating that pretraining our LongNet encoder could better capture the slide-level cancer heterogeneity. We observed that freezing and unfreezing the LongNet encoder achieved comparable performance on cancer subtyping tasks. This suggests that our pretraining approach can effectively learn high-quality representations, reducing the need for additional fine-tuning of LongNet. To verify the superiority of using the LongNet encoder to aggregate image patterns across the whole slide, we then tested one alternative by removing LongNet and only aggregating through the attention-based deep multiple instance learning (ABMIL) layer. On average, the ABMIL layer alone could not achieve performance similar to the LongNet slide encoder (P value < 0.012), confirming the necessity of modelling long-range dependencies in pathology slides.



[Figure 3 panels a–f: bar plots of AUROC and balanced accuracy (BACC) for Prov-GigaPath, HIPT, CtransPath and REMEDIS on subtyping of OVT, CNS, EGC, NSCLC, BRCA, RCC, COADREAD, HB and DIFG.]

Fig. 3 | Comparison on cancer subtyping. a–f, Bar plots comparing cancer subtyping performance in terms of AUROC (a,c,e) and balanced accuracy (b,d,f) on nine cancer types. Data are mean ± s.e.m. across n = 10 independent experiments. The listed P value indicates the significance for Prov-GigaPath outperforming the best comparison approach, with one-sided Wilcoxon test. BACC, balanced accuracy. BRCA, breast invasive carcinoma; CNS, central nervous system; COADREAD, colorectal adenocarcinoma; DIFG, diffuse intrinsic pontine glioma; EGC, early gastric cancer; HB, hepatobiliary; NSCLC, non-small cell lung cancer; OVT, ovarian cancer; RCC, renal cell cancer.

Slide-level vision–language alignment

The promising results of Prov-GigaPath on pathology images further motivated us to explore Prov-GigaPath in multimodal vision–language processing. Prior work on pathology vision–language modelling tends to focus on tile-level alignment of pathology images and text, as their studies were limited by the sources of image–text pairs (textbook examples7 or Twitter data8). By contrast, we examined slide-level alignment of pathology images and text by leveraging the associated report for each slide (Fig. 4a). Such naturally occurring slide–report pairs can potentially uncover richer slide-level information, but the modelling is considerably more challenging as we do not have fine-grained alignment information between individual image tiles and text snippets. We used the standard cross-modal contrastive loss in continual pretraining of Prov-GigaPath as the visual encoder and PubMedBERT29, a state-of-the-art biomedical language model, as the textual encoder (Fig. 4b).
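Once the slide encoder and the text encoder share an embedding space, zero-shot prediction reduces to comparing a slide embedding against prompt embeddings. The sketch below is a hedged illustration of this CLIP-style scoring; the prompt wording, the temperature value and the text_encoder call are placeholders rather than the exact prompts or code used in the study.

```python
# Hedged sketch of zero-shot classification from prompt-slide similarity.
import torch
import torch.nn.functional as F

prompts = {
    "LUAD": "a histopathological image of lung adenocarcinoma",
    "LUSC": "a histopathological image of lung squamous cell carcinoma",
}

def zero_shot_probs(slide_embedding, text_encoder, temperature=0.07):
    # slide_embedding: (dim,) output of the slide encoder for one WSI
    img = F.normalize(slide_embedding, dim=-1)
    txt = F.normalize(torch.stack([text_encoder(p) for p in prompts.values()]), dim=-1)
    logits = img @ txt.T / temperature          # scaled cosine similarities
    return dict(zip(prompts, logits.softmax(dim=-1).tolist()))
```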
We evaluated the resulting Prov-GigaPath on zero-shot cancer subtyping in NSCLC and COADREAD following the same setting used in MI-Zero7, a state-of-the-art pathology vision–language model. In the zero-shot setting, no training images are provided for any of the target cancer subtypes. Slides and the corresponding cancer subtypes were collected from Prov-Path. Compared with three state-of-the-art pathology vision–language models, Prov-GigaPath attained the best zero-shot classification results on all three metrics in both cancer types (Fig. 4c,e, Extended Data Fig. 9 and Supplementary Fig. 6), suggesting that slide-level alignment enabled by LongNet is indeed advantageous. Prov-GigaPath attained larger improvement on NSCLC than COADREAD, which can be ascribed to the more prevalent presence of lung tissue in Prov-Path. Prov-GigaPath outperformed PLIP by a considerable margin, which potentially reflects the superiority of real-world clinical data over Twitter data.

Next, we examined the possibility of predicting gene mutations using the vision–language pretrained Prov-GigaPath (Fig. 4d,e and Extended Data Fig. 9) in the same zero-shot setting. We adopted the prompts used for cancer subtyping by replacing the cancer type name with the gene name for which we want to predict the binary mutation status. Prov-GigaPath substantially outperformed state-of-the-art pathology vision–language models across all six mutations we examined (P value < 0.001) (Fig. 4d,e). The improvement of our approach is larger on mutation prediction than on cancer subtyping, which may be partially attributable to richer mutation information in pathology reports from real-world data compared with text commentary in Twitter8 and scientific papers50. To our knowledge, this is the first time zero-shot gene mutation prediction has been evaluated in pathology vision–language modelling. The promising performance of Prov-GigaPath on this novel task bodes well for potential future applications in studying rare cancer types and new mutations.

[Figure 4 panels a–e: schematic of cleaning raw pathology reports with GPT-3.5 and CLIP-based contrastive alignment of Prov-GigaPath with PubMedBERT; zero-shot prediction of cancer subtypes and mutations from prompt–slide similarity; and bar and scatter plots comparing Prov-GigaPath with MI-Zero, BiomedCLIP and PLIP on zero-shot subtyping (NSCLC, COADREAD) and zero-shot mutation prediction (LRP1B, KRAS, TP53, SPTA1, FAT1, KMT2D).]

Fig. 4 | Comparison on image–report alignment. a, Flow chart showing the fine-tuning of Prov-GigaPath using pathology reports. Real-world pathology reports are processed using GPT-3.5 from OpenAI to remove information irrelevant to cancer diagnosis. We performed the CLIP-based contrastive learning to align Prov-GigaPath and PubMedBERT. b, The fine-tuned Prov-GigaPath can then be used to perform zero-shot cancer subtyping and mutation prediction. The input of Prov-GigaPath is a sequence of tiles segmented from a WSI, and the inputs of the text encoder PubMedBERT are manually designed prompts representing cancer types and mutations. Based on the output of Prov-GigaPath and PubMedBERT, we can calculate the probability of the input WSI being classified into specific cancer subtypes and mutations. c, Bar plots comparing zero-shot subtyping performance on NSCLC and COADREAD in terms of BACC, precision and f1. d, Bar plots comparing the performance on mutation prediction using the fine-tuned model for six genes. c,d, Data are mean ± s.e.m. across n = 50 experiments. The listed P value indicates the significance for Prov-GigaPath outperforming the best comparison approach, with one-sided Wilcoxon test. e, Scatter plots comparing the performance between Prov-GigaPath and MI-Zero in terms of BACC on zero-shot cancer subtyping. Each dot indicates one trial with a particular set of text query formulations.

Discussion

We have introduced Prov-GigaPath, a pathology foundation model for a broad range of digital pathology applications. Prov-GigaPath was pretrained on a large real-world dataset, Prov-Path, derived from the Providence health system with diverse types and qualities. Prov-Path is substantially larger than TCGA, comprising 1,384,860,229 image tiles from 171,189 whole pathology slides of around 30,000 patients. We proposed GigaPath for pretraining, which adapted the cutting-edge LongNet5 as the vision transformer to facilitate ultra-large-context modelling of gigapixel WSIs. In comprehensive evaluation on both Providence and TCGA datasets, we demonstrated state-of-the-art performance for Prov-GigaPath on a variety of pathomics and cancer subtyping tasks, as well as on vision–language processing. Prov-GigaPath has the potential to assist clinical diagnostics and decision support, and GigaPath can potentially be applicable to broader biomedical domains for efficient self-supervised learning from high-resolution images.



We noted substantial variability in the performance of our method across different tasks. First, the performance on subtyping is substantially better than the performance on mutation prediction. Although different tasks are not directly comparable owing to differences in the number of training samples, our observations suggest that image-based mutation prediction is more challenging. One particular reason could be that the pathology image information is not enough to predict certain mutations. Therefore, we plan to utilize other modalities and features to enhance the prediction in the future. Nevertheless, our method outperforms existing approaches on mutation prediction tasks, offering an opportunity to improve diagnostics and prognostics. Moreover, we found that foundation models, including our method and competing approaches, are much more effective than task-specific models (for example, SL-ImageNet in Supplementary Fig. 4), underscoring the necessity of the self-supervised learning framework in these foundation models.

We currently select a magnification of 20× during preprocessing. A larger magnification will quadruple the processing time but also reveal more details of the image. Therefore, we are interested in exploring other magnifications in the future. Scaling laws have been observed in large language models when modelling text data. We have observed that GigaPath pretrained on the larger Prov-Path data outperforms GigaPath pretrained on the smaller TCGA data (Extended Data Fig. 6). Despite having different model architectures, we have also observed that GigaPath, which has more parameters, outperforms HIPT when both are pretrained on Prov-Path. These two results indicate the effectiveness of larger pretraining data and larger models, and partly suggest that the model performance may further improve with more pretraining tokens. We are interested in further validating scaling laws in the context of pathology foundation models by comparing models at different sizes and pretraining data at different sizes.

Although initial results are promising, growth opportunities abound. First, it would be interesting to study scaling laws51 on the pathology foundation models by comparing the performance using different sizes of vision transformers. In particular, we found that a smaller version of Prov-GigaPath using 23 million parameters also attained superior performance to existing approaches, demonstrating the application of two models in real-world clinics: a small model for fast inference and fine-tuning, and a large model (Prov-GigaPath) for more accurate inference. Second, the pretraining process can be further optimized. In slide-level self-supervised learning, we froze the tile-level encoder when pretraining the slide-level encoder to reduce memory cost, which may be suboptimal. We plan to explore end-to-end pretraining with larger graphics processing unit (GPU) clusters, on which we can compute image encoding on the fly and fine-tune all the way. Third, we conducted an initial exploration of vision–language pretraining and demonstrated promising results in zero-shot subtyping and mutation prediction, but this remains far from the potential to serve as a conversational assistant for clinicians. In future, we plan to incorporate advanced multimodal learning frameworks, such as LLaVA-Med52, into our work.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-07441-w.

1. Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).
2. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
3. Song, A. H. et al. Artificial intelligence for digital and computational pathology. Nat. Rev. Bioeng. 1, 930–949 (2023).
4. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 2127–2136 (IMLS, 2018).
5. Ding, J. et al. LongNet: scaling transformers to 1,000,000,000 tokens. Preprint at https://doi.org/10.48550/arXiv.2307.02486 (2023).
6. Network, C. G. A. R. et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543 (2014).
7. Lu, M. Y. et al. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19764–19775 (2023).
8. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
9. Ozyoruk, K. B. et al. A deep-learning model for transforming the style of tissue images from cryosectioned to formalin-fixed and paraffin-embedded. Nat. Biomed. Eng. 6, 1407–1419 (2022).
10. Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).
11. Tellez, D. et al. Whole-slide mitosis detection in h&e breast histology using phh3 as a reference to train distilled stain-invariant convolutional networks. IEEE Trans. Med. Imag. 37, 2126–2136 (2018).
12. Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. NPJ Digit. Med. 4, 71 (2021).
13. Tsai, P.-C. et al. Histopathology images predict multi-omics aberrations and prognoses in colorectal cancer patients. Nat. Commun. 14, 2102 (2023).
14. Diao, J. A. et al. Human-interpretable image features derived from densely mapped cancer pathology slides predict diverse molecular phenotypes. Nat. Commun. 12, 1613 (2021).
15. Echle, A. et al. Deep learning in cancer pathology: a new generation of clinical biomarkers. Br. J. Cancer 124, 686–696 (2021).
16. Van der Laak, J., Litjens, G. & Ciompi, F. Deep learning in histopathology: the path to the clinic. Nat. Med. 27, 775–784 (2021).
17. Kohane, I. S., Churchill, S., Tan, A. L. M., Vella, M. & Perry, C. L. The digital–physical divide for pathology research. Lancet Digit. Health 5, e859–e861 (2023).
18. Huang, Z. et al. Artificial intelligence reveals features associated with breast cancer neoadjuvant chemotherapy responses from multi-stain histopathologic images. NPJ Precis. Oncol. 7, 14 (2023).
19. Chen, R. J. et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).
20. Wang, X. et al. Predicting gastric cancer outcome from resected lymph node histopathology images using deep learning. Nat. Commun. 12, 1637 (2021).
21. Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer 3, 1026–1038 (2022).
22. Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25, 1519–1525 (2019).
23. Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).
24. Oquab, M. et al. DINOv2: learning robust visual features without supervision. Transact. Mach. Learn. Res. (2023).
25. Chen, X., Xie, S. & He, K. An empirical study of training self-supervised vision transformers. In Proc. of the IEEE/CVF International Conference on Computer Vision, 9640–9649 (IEEE, 2021).
26. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (eds Daumé, H. & Singh, A.) 1597–1607 (PMLR, 2020).
27. Kenton, J. D. M.-W. C. & Toutanova, L. K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT 2019 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
28. Bao, H., Dong, L., Piao, S. & Wei, F. BEIT: BERT pre-training of image transformers. In International Conference on Learning Representations (2021).
29. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 2 (2021).
30. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
31. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
32. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
33. Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
34. Daniel, N. et al. Between generating noise and generating images: noise in the correct frequency improves the quality of synthetic histopathology images for digital pathology. In 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) 1–7 (IEEE, 2023).
35. Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16144–16155 (IEEE, 2022).
36. Balkwill, F. R., Capasso, M. & Hagemann, T. The tumor microenvironment at a glance. J. Cell Sci. 125, 5591–5596 (2012).
37. Javed, S. et al. Cellular community detection for tissue phenotyping in colorectal cancer histology images. Med. Image Anal. 63, 101696 (2020).
38. Saltz, J. et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Rep. 23, 181–193 (2018).
39. Shao, Z. et al. Hvtsurv: hierarchical vision transformer for patient-level survival prediction from whole slide image. In Proc. AAAI Conference on Artificial Intelligence, vol. 37, 2209–2217 (2023).
40. Li, B., Li, Y. & Eliceiri, K. W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328 (2021).



41. Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 81, 102559 (2022).
42. Azizi, S. et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 7, 756–779 (2023).
43. Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).
44. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems vol. 30 (eds Guyon, I. et al.) (Curran Associates, 2017).
45. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (IEEE, 2022).
46. Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
47. Brown, L. C. et al. LRP1B mutations are associated with favorable outcomes to immune checkpoint inhibitors across multiple cancer types. J. Immunother. Cancer 9, e001792 (2021).
48. Morris, L. G. et al. Recurrent somatic mutation of fat1 in multiple human cancers leads to aberrant wnt activation. Nat. Genet. 45, 253–261 (2013).
49. Hong, R., Liu, W., DeLair, D., Razavian, N. & Fenyö, D. Predicting endometrial cancer subtypes and molecular features from histopathology images using multi-resolution deep learning models. Cell Rep. Med. 2, 100400 (2021).
50. Zhang, S. et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. Preprint at https://doi.org/10.48550/arXiv.2303.00915 (2023).
51. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
52. Li, C. et al. Llava-med: training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems vol. 36 (eds Oh, A. et al.) 28541–28564 (Curran Associates, 2024).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024



Methods

Preprocessing WSIs
We first established our preprocessing pipeline for the 171,189 H&E-stained53 and immunohistochemistry54 pathology slides. The statistics of slides and patients for each organ are shown in Supplementary Figs. 1 and 2. First, we performed tissue segmentation to filter background regions. Following HIPT, we ran the Otsu55 image thresholding at a downsampled resolution (for example, 1,024 pixels) for its computational efficiency and effectiveness in differentiating tissues from the background. Second, we resized the WSIs to a standard resolution of 0.5 μm per pixel (MPP)—that is, 20× magnification using the pyvips library. This step is necessary because some slides have higher resolution depending on the scanner settings. Finally, the images were cropped into 256 × 256-pixel tile images. Tiles with an occupancy value of less than 0.1, determined by the Otsu algorithm, were discarded to focus on tissue-covered regions. We performed these operations on a cluster of up to 200 nodes, where each node was equipped with 32 CPU cores and 256 GB RAM, completing preprocessing in about 157 hours. Tasks were parallelized, so that each node processed a set of tiles independently. Finally, we collected 1,384,860,229 tiles in total, with the number of tiles in each WSI shown in Supplementary Fig. 3.
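The tissue-filtering step can be illustrated with a small sketch. It is a simplified, single-machine stand-in (using skimage's Otsu threshold on a downsampled thumbnail) for the production pyvips pipeline that ran on a 200-node cluster; the function names and the exact masking convention are assumptions.

```python
# Hedged sketch of Otsu-based tissue masking and tile occupancy filtering.
import numpy as np
from skimage.filters import threshold_otsu

def tissue_mask(thumbnail_gray):
    """thumbnail_gray: 2D uint8 array of a downsampled slide (e.g. ~1,024 px wide)."""
    t = threshold_otsu(thumbnail_gray)
    return thumbnail_gray < t          # tissue is darker than the white background

def keep_tile(mask_patch, min_occupancy=0.1):
    """Keep a 256 x 256 tile only if its tissue occupancy is at least 0.1."""
    return mask_patch.mean() >= min_occupancy
```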
Details of Prov-GigaPath pretraining
The Prov-GigaPath tile encoder used the ViT model architecture with standard DINOv2 settings24. We pretrained the model on 1,384,860,229 segmented tiles, treating each tile as one data instance. The base learning rate in DINOv2 pretraining was set to 4 × 10−3. We set the batch size on each GPU device to 12, with a total effective batch size of 384. The Prov-GigaPath slide encoder used the LongNet model architecture with standard settings5. For discretizing the tile coordinates, we set the grid size dgrid to 256 and the number of rows and columns ngrid to 1,000. For the input sequence augmentations, we set the cropping ratio to 0.875. The moving distances were randomly generated with a uniform distribution while keeping all tiles within the created grid overlay. We horizontally flipped the tile coordinates for each slide with a 0.5 probability. To pretrain our Prov-GigaPath slide encoder with the masked autoencoder, we set the learning rate to 5 × 10−4 and the batch size on each GPU device to 4. We also set the training epochs to 30 with the initial epoch as the warmup phase. The slide encoder pretraining utilized 16 nodes with 4 × 80 GB A100 GPUs and was completed in approximately 2 days (3,072 A100 GPU hours). The inference duration for a WSI is on average 0.7 s, including 0.4 s on computing tile embeddings and 0.3 s on LongNet inference.
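A hedged sketch of how tile coordinates can be discretized onto such a grid and serialized in row-major order for the slide encoder is shown below; it mirrors the dgrid = 256 and ngrid = 1,000 settings above, but the function and variable names are illustrative rather than taken from the released preprocessing code.

```python
# Hedged sketch of coordinate discretization and row-major serialization.
import numpy as np

D_GRID = 256      # grid cell size in pixels
N_GRID = 1000     # number of rows and columns in the grid overlay

def serialize_tiles(tile_coords):
    """tile_coords: (n, 2) int array of (x, y) pixel offsets of each 256 x 256 tile."""
    cols = np.clip(tile_coords[:, 0] // D_GRID, 0, N_GRID - 1)
    rows = np.clip(tile_coords[:, 1] // D_GRID, 0, N_GRID - 1)
    order = np.argsort(rows * N_GRID + cols)            # row-major ordering
    return order, np.stack([rows, cols], axis=1)[order]  # sequence + grid indices
```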
Competing methods and benchmarks
We compared Prov-GigaPath to 4 comparison approaches. HIPT35 was a released model pretrained on 10,678 gigapixel WSIs from TCGA. It utilized a hierarchical image pyramid transformer architecture with 256 × 256 and 4,096 × 4,096 image views. We can also view the HIPT model as a tile encoder with an additional embedding aggregation encoder on the 4,096 × 4,096 view. Since it used the DINO self-supervised learning approach to train the 256 × 256 image encoder and 4,096 × 4,096 image encoder, the tile encoder pretraining of HIPT was the same as Prov-GigaPath. The key difference between HIPT and Prov-GigaPath was the aggregation mechanism. Prov-GigaPath approached aggregation using long-sequence representation learning with a slide encoder, whereas HIPT employed a second-stage ViT on the 4,096 × 4,096 image view. CtransPath41 combined a CNN model with a multi-scale SwinTransformer. CtransPath used a semantically relevant contrastive-learning objective to pretrain the model, which treated each input image and its augmentation views as positive pairs and S retrieved semantically relevant images as pseudo-positive pairs. REMEDIS42 used a Resnet as the backbone and pretrained with the SimCLR approach on 50 million pathology images randomly sampled from 29,018 TCGA slides. In our experiments, we selected the Resnet 152 × 2 model for evaluation.

We fine-tuned Prov-GigaPath and other baseline models on diverse downstream tasks. For Prov-GigaPath, we froze the tile encoder and only fine-tuned the LongNet slide-level encoder. For each slide, LongNet produces a set of contextualized tile embeddings. These are aggregated using a shallow ABMIL layer to obtain the slide embeddings, which are then used in additional classifiers for downstream prediction tasks. When applying the HIPT model, we followed the default setting by freezing both the 256 × 256 and 4,096 × 4,096 image encoder and tuning the parameters of the additional transformer layer and ABMIL layer. Since both CtransPath and REMEDIS are tile-level encoders, we directly applied one ABMIL layer to get slide-level embeddings and mainly tuned the ABMIL layer and classifier.
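For reference, a minimal ABMIL aggregation head of the kind described above (gated attention in the style of Ilse et al.4) can be sketched as follows; the hidden sizes and class count are illustrative, and this is not the exact layer used in the released code.

```python
# Hedged sketch of a gated attention-based MIL (ABMIL) aggregation head.
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, h):                     # h: (n_tiles, dim) tile embeddings
        a = torch.softmax(self.w(self.v(h) * self.u(h)), dim=0)  # (n_tiles, 1)
        slide = (a * h).sum(dim=0)            # attention-weighted slide embedding
        return self.classifier(slide), a      # logits plus attention weights
```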
Mutation prediction
From Prov-Path, we constructed five mutation prediction tasks: pan-cancer 18-biomarker prediction, LUAD 5-gene mutation prediction, pan-cancer 5-gene mutation prediction, LUAD 5-gene mutation prediction on TCGA and overall TMB prediction (Supplementary Tables 7 and 9). The 18-biomarker prediction is an 18-class multi-label classification problem, with each class being either a mutation or PD-L1. The positive status for each gene indicates that it is mutated or that PD-L1 (encoded by CD274) is highly expressed. The 5-gene mutation prediction tasks are 5-class problems covering EGFR, FAT1, KRAS, TP53 and LRP1B, formulated as multi-label prediction in which the model is asked to predict the mutation status of all five genes. The overall TMB prediction is a 2-class classification (High TMB versus Low TMB). We formulated this task as an image binary classification task where each image is annotated as 'High TMB' or 'Low TMB' based on the number of somatic mutations of the tumour56. Such evaluations reflect the capability of the model to extract diverse molecular patterns from the WSIs. For each patient, who typically has multiple WSIs, we selected the largest WSI. This naturally enabled patient-level stratification when splitting the datasets into training, validation and test sets. We fine-tuned the Prov-GigaPath model with a base learning rate of 2 × 10−3 and a weight decay of 0.01. Following the default settings in HIPT, we trained the comparison models with a learning rate of 2 × 10−4. The training batch size for all approaches was set to 1 with 32 gradient accumulation steps. We trained all approaches for 20 epochs. The performances were evaluated in terms of the AUROC and AUPRC using 10-fold cross-validation.
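The multi-label evaluation can be sketched as below; the label and score arrays are random placeholders standing in for per-slide mutation labels and model outputs, and the snippet only illustrates how macro-AUROC and macro-AUPRC are computed.

```python
# Hedged sketch of macro-AUROC and macro-AUPRC for multi-label gene mutation prediction.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

genes = ["EGFR", "FAT1", "KRAS", "TP53", "LRP1B"]
y_true = np.random.randint(0, 2, size=(100, len(genes)))   # placeholder labels
y_prob = np.random.rand(100, len(genes))                   # placeholder scores

macro_auroc = roc_auc_score(y_true, y_prob, average="macro")
macro_auprc = average_precision_score(y_true, y_prob, average="macro")
print(f"macro-AUROC={macro_auroc:.3f}  macro-AUPRC={macro_auprc:.3f}")
```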
Cancer subtyping
We conducted the subtyping evaluations on nine cancer types, including NSCLC (LUAD versus LUSC), BRCA (IDC versus ILC), RCC (CCRCC versus PRCC versus CHRCC), COADREAD (COAD versus READ), HB (CHOL versus HCC), DIFG (GBM versus ODG versus AODG versus HGGNOS versus AASTR), OVT (CCOV versus EOV versus HGSOC versus LGSOC versus MOV versus OCS), CNS (ATM versus MNG) and EGC (ESCA versus GEJ versus STAD); details and definitions are provided in Supplementary Tables 8 and 9. We fine-tuned Prov-GigaPath with a base learning rate of 4 × 10−3, a weight decay of 0.001 and a layer-wise learning rate decay of 0.9. The training hyperparameters were chosen based on performance on the validation set. All models were fine-tuned for 20 epochs and evaluated using tenfold cross-validation. For Prov-GigaPath, we additionally added a shortcut to the slide-level encoder to pay more attention to tile-level features.

Vision–language alignment
We constructed 17,383 pathology WSI–report pairs and employed the OpenCLIP codebase for vision–language processing. Since real-world pathology reports are noisy and lengthy, we first cleaned the raw pathology reports by removing information irrelevant to cancer diagnosis, including hospital location, doctor name and patient name. Specifically, we first clustered the clinical reports into four clusters using k-means and picked the cluster centres as four representative reports. We then manually cleaned these four reports and obtained four pairs of original and cleaned reports. We used these four reports as in-context learning examples and asked GPT-3.5 to clean all other reports according to these four in-context learning examples (Supplementary Fig. 9). The distributions of the overall token length before and after the filtering are shown in Supplementary Fig. 10. The text embeddings were calculated using the text-embedding-ada-002 model from OpenAI. Finally, we constructed 17,383 vision–language pairs of WSIs and the cleaned reports. We held out 20% of the patients from CLIP pretraining for zero-shot prediction tasks. We set the learning rate of the CLIP training to 5 × 10−4 and the batch size to 32. We trained both the visual encoder and the text encoder for 10 epochs with the first 100 iterations as the warmup stage.
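A hedged sketch of the representative-report selection step is shown below; report_embeddings stands in for the text-embedding-ada-002 vectors, and choosing the member closest to each centroid is one reasonable reading of "picked the cluster centres", not a quote of the actual implementation.

```python
# Hedged sketch of picking k representative reports via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def pick_representatives(report_embeddings, k=4):
    """report_embeddings: (n_reports, dim) array; returns indices of k representatives."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(report_embeddings)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(report_embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[dists.argmin()]))   # report closest to the centre
    return reps
```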
In zero-shot prediction tasks, we chose MI-Zero (PubMedBERT)7, BiomedCLIP50 and PLIP8 as the comparison models. MI-Zero (PubMedBERT) was trained on 33,480 pathology image–caption pairs curated from educational resources and the ARCH dataset. It is a multiple-instance-learning-based zero-shot transfer approach that aggregates multiple tiles with a top-K pooling strategy. BiomedCLIP was trained on 15 million biomedical domain-specific image–caption pairs from research articles. PLIP was a pathology domain-specific vision–language pretrained model using image–text pairs from Twitter. We evaluated the comparison approaches and Prov-GigaPath on NSCLC and COADREAD subtyping tasks and LRP1B, KRAS, TP53, SPTA1, FAT1 and KMT2D mutation status prediction. We followed the settings and prompt templates in MI-Zero7 and evaluated these approaches with 50 randomly sampled prompt sets.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The pathology imaging data used for the pretraining were created from oncology pathology slides at Providence. The associated clinical data used for fine-tuning and testing were obtained from the corresponding medical records. These proprietary data cannot be made publicly available. Researchers may obtain a de-identified test subset from Providence Health System by reasonable request and subject to local and national ethical approvals. To help researchers use our model, we provide a de-identified subset of our data at https://doi.org/10.5281/zenodo.10909616 (ref. 57) and https://doi.org/10.5281/zenodo.10909922 (ref. 58) for a few patients. We also collected publicly available TCGA WSIs from the NIH Genomic Data Commons Data Portal. The TCGA-LUAD dataset, comprising whole pathology slides and labels, is available via the NIH Genomic Data Commons portal at https://portal.gdc.cancer.gov/projects/TCGA-LUAD.

Code availability
Prov-GigaPath is a vision transformer model created by tile-level pretraining using DINOv2, followed by slide-level pretraining using masked autoencoder and LongNet, on more than 170,000 whole slides with more than a billion pathology image tiles. The pathology slides were stripped of the identification barcodes before pretraining. Prov-GigaPath can be accessed at https://github.com/prov-gigapath/prov-gigapath, including the model weights and relevant source code. We include detailed methods and implementation steps in the Methods and Supplementary Information to enable independent replication.

53. Fischer, A. H., Jacobson, K. A., Rose, J. & Zeller, R. Hematoxylin and eosin staining of tissue and cell sections. Cold Spring Harbor Protoc. 2008, prot4986 (2008).
54. Duraiyan, J., Govindarajan, R., Kaliyappan, K. & Palanisamy, M. Applications of immunohistochemistry. J. Pharm. Bioallied Sci. 4, S307 (2012).
55. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybernet. 9, 62–66 (1979).
56. Jain, M. S. & Massoud, T. F. Predicting tumour mutational burden from histopathological images using multiscale deep learning. Nat. Mach. Intel. 2, 356–362 (2020).
57. Usuyama, N. Prov-Path Sample Data 1. Zenodo https://doi.org/10.5281/zenodo.10909616 (2024).
58. Usuyama, N. Prov-Path Sample Data 2. Zenodo https://doi.org/10.5281/zenodo.10909922 (2024).

Acknowledgements The authors thank D. Tan, J. Carlson and the Microsoft Health Futures team for support and helpful discussions; T. Darcet and M. Oquab for their insights on DINOv2; and M. Tanaka for his insights into optimizing GPU operations on Azure.

Author contributions H.X., N.U., C.B., S.W. and H.P. contributed to the conception and design of the work. C.B., H.P., B.P., T.B., J.R., R.W., S.L., N.U., R.R., J.B., S.Z., T.N., C.W., Z.G., J. González, Y.G. and Y.X. contributed to the data acquisition and curation. H.X., N.U., R.R., W.W. and S.M. contributed to the technical implementation. M.W., F.W., J.Y., C.L. and J. Gao contributed to technical discussions. H.X., N.U., C.B., S.W. and H.P. contributed to the evaluation framework used in the study. C.B. and B.P. provided clinical inputs to the study. A.R., B.W., C.B. and H.P. contributed to securing funding. All authors contributed to the drafting and revision of the manuscript.

Competing interests C.B. is a member of the scientific advisory board and owns stock in PrimeVax and BioAI; is on the scientific board of Lunaphore and SironaDx; has a consultant or advisory relationship with Sanofi, Agilent, Roche and Incendia; contributes to institutional research for Illumina, and is an inventor on US patent applications US20180322632A1 (Image Processing Systems and Methods for Displaying Multiple Images of a Biological Specimen) filed by Ventana Medical Systems, Providence Health and Services Oregon and US20200388033A1 (System and Method for Automatic Labeling of Pathology Images) filed by Providence Health and Services Oregon, Omics Data Automation. The other authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-024-07441-w.
Correspondence and requests for materials should be addressed to Carlo Bifulco, Sheng Wang or Hoifung Poon.
Peer review information Nature thanks Akshay Chaudhari, Joe Yeong and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Comparison on Pan-cancer 18-biomarker prediction. Bar plot showing the AUPRC score for each biomarker on the 18-biomarker
prediction by Prov-GigaPath and competing methods.
Extended Data Fig. 2 | Comparison on LUAD 5-gene mutation prediction. Bar plots showing AUROC and AUPRC scores for predicting each gene mutation on LUAD 5-gene mutation prediction. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms the best comparison approach, with one-sided Wilcoxon test.
Extended Data Fig. 3 | Comparison on Pan-cancer 5-gene mutation prediction. Bar plots showing AUROC and AUPRC scores for predicting each gene mutation on Pan-cancer 5-gene mutation prediction. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms the best comparison approach, with one-sided Wilcoxon test.
Extended Data Fig. 4 | Comparison on LUAD 5-gene mutation prediction in TCGA. Bar plots showing AUPRC scores for predicting each gene mutation on LUAD 5-gene mutation prediction in TCGA. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms the best comparison approach, with one-sided Wilcoxon test.
Extended Data Fig. 5 | Comparison on mutation prediction on new colorectal patients. Bar plots showing AUROC and AUPRC scores for predicting 5-gene mutation and TMB status on new patients from Providence. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms the best comparison approach, with one-sided Wilcoxon test.
Extended Data Fig. 6 | Comparison between pretraining the same model using Prov-Path and TCGA. a,b, Bar plots showing the AUROC (a) and AUPRC (b) on LUAD 5-gene mutation prediction in TCGA using models trained on Prov-Path and TCGA. Prov-GigaPath is GigaPath trained on Prov-Path. GigaPath-TCGA is GigaPath trained on TCGA. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms GigaPath-TCGA, with one-sided Wilcoxon test.
Extended Data Fig. 7 | Comparison between GigaPath trained using Prov-Path and HIPT trained using Prov-Path on mutation prediction. a–j, Bar plots showing the AUROC (a–e) and AUPRC (f–j) of mutation prediction tasks by Prov-GigaPath and HIPT-Prov-Path. HIPT-Prov-Path indicates HIPT pretrained on Prov-Path. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms HIPT-Prov-Path, with one-sided Wilcoxon test.
Extended Data Fig. 8 | Comparison between GigaPath trained using Prov-Path and HIPT trained using Prov-Path on cancer subtyping. a–f, Bar plots showing the AUROC (a,c,e) and BACC (b,d,f) of cancer subtyping tasks by Prov-GigaPath and HIPT-Prov-Path. HIPT-Prov-Path indicates HIPT pretrained on Prov-Path. The error bars show the standard error across n = 10 independent experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms HIPT-Prov-Path, with one-sided Wilcoxon test.
Extended Data Fig. 9 | Alignment between pathology reports and images. a–d, Bar plots showing the performance of F1 (a), precision (b), AUROC (c) and AUPRC (d) using fine-tuned Prov-GigaPath to predict mutations in the zero-shot learning setting. The error bars show the standard error across n = 50 experiments and the bar centre shows the mean value. The listed p-value indicates the significance level that Prov-GigaPath outperforms the best comparison approach, with one-sided Wilcoxon test. e, Scatter plots comparing Prov-GigaPath and MI-Zero on cancer subtyping prediction and mutation prediction in terms of balanced accuracy (BACC).
Reporting Summary
Corresponding author(s): Hoifung Poon
Last updated by author(s): 2024/04/01
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided (only common tests should be described solely by name; describe more complex techniques in the Methods section)
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted (give P values as exact values whenever suitable)
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection We used the Databricks (runtime version 12.2 LTS) platform to collect the whole-slide images from Providence. We used Microsoft SQL Azure (RTM) - 12.0.2000.8 and python==3.10 to collect histopathology findings, cancer staging and genomic mutation profiles, along with the associated pathology reports. For each whole-slide image, we ran the Otsu algorithm for tissue segmentation to filter out background regions. For the pathology reports, we used GPT-3.5 provided by Azure OpenAI to extract clinically relevant information.
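As a rough, illustrative sketch of the tissue-segmentation step described above (not the pipeline's actual implementation), Otsu thresholding can be computed directly from the grey-level histogram with numpy, and tiles can then be kept only if enough of their pixels fall on the tissue side of the threshold. The tile size, the 10% occupancy cut-off and all helper names below are assumptions for illustration.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Grey level (0-255) that maximises between-class variance (Otsu, 1979)."""
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    prob = hist.astype(np.float64) / hist.sum()
    omega = np.cumsum(prob)                    # class probability up to each level
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan                 # ignore degenerate splits
    sigma_b2 = (mu[-1] * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b2))

def tissue_fraction(tile: np.ndarray, threshold: int) -> float:
    """Fraction of pixels darker than the threshold (tissue on a bright background)."""
    return float((tile < threshold).mean())

# Toy example on a random 'thumbnail'; a real pipeline would use a downsampled slide.
thumbnail = (np.random.rand(512, 512) * 255).astype(np.uint8)
t = otsu_threshold(thumbnail)
tiles = [thumbnail[i:i + 64, j:j + 64] for i in range(0, 512, 64) for j in range(0, 512, 64)]
kept = [tile for tile in tiles if tissue_fraction(tile, t) > 0.1]   # 0.1 is an assumed cut-off
```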

Data analysis This work uses open-source codebases and libraries to analyze the data. We used DINOv2 (https://github.com/facebookresearch/dinov2/tree/main) to pretrain the ViT tile encoder and OpenCLIP (https://github.com/mlfoundations/open_clip) to train the vision–language alignment model. For the LongNet model, we used the implementation in torchscale==0.1.1. To install torchscale, we used the following public packages: torch==2.0.0+cu117, torchvision==0.15.0+cu117, tensorboard==2.15.1, timm==0.9.12, xformers==0.0.18, einops==0.7.0, fairscale==0.4.13 and huggingface-hub==0.19.4. We used scikit-learn==1.3.2, scipy==1.11.4 and numpy==1.24.1 to evaluate model performance, and matplotlib==3.3.0 to visualize the data. All the code to reproduce our experiments will be made public upon publication.
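For concreteness, the kind of evaluation these packages support might look like the minimal sketch below: per-run AUROC/AUPRC with scikit-learn, then mean and standard error across runs with numpy. The function name and the toy labels/scores are illustrative, not taken from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def summarize_runs(run_labels, run_scores):
    """Mean and standard error of AUROC/AUPRC across independent runs."""
    aurocs = np.array([roc_auc_score(y, s) for y, s in zip(run_labels, run_scores)])
    auprcs = np.array([average_precision_score(y, s) for y, s in zip(run_labels, run_scores)])
    sem = lambda x: x.std(ddof=1) / np.sqrt(len(x))
    return {"AUROC": (aurocs.mean(), sem(aurocs)), "AUPRC": (auprcs.mean(), sem(auprcs))}

# Two made-up runs (binary labels and predicted scores).
labels = [np.array([0, 1, 1, 0, 1]), np.array([1, 0, 1, 1, 0])]
scores = [np.array([0.2, 0.8, 0.6, 0.3, 0.9]), np.array([0.7, 0.4, 0.9, 0.6, 0.1])]
print(summarize_runs(labels, scores))
```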
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

The pathology imaging data used for the pretraining were created from oncology pathology slides at Providence. The associated clinical data used for fine-tuning
and testing were obtained from the corresponding medical records. These proprietary data cannot be made publicly available. Researchers may obtain a de-
identified test subset from Providence Health System by reasonable request and subject to local and national ethical approvals. To help researchers use our model,
we provide a de-identified subset of our data at https://doi.org/10.5281/zenodo.10909616 and https://doi.org/10.5281/zenodo.10909922 for a few patients. We
also collected publicly available TCGA whole-slide images from the NIH Genomic Data Commons Data Portal. The TCGA-LUAD dataset, comprising whole pathology slides
and labels, is available via the NIH Genomic Data Commons portal at https://portal.gdc.cancer.gov/projects/TCGA-LUAD.

Research involving human participants, their data, or biological material


Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation),
and sexual orientation and race, ethnicity and racism.
Reporting on sex and gender Use the terms sex (biological attribute) and gender (shaped by social and cultural circumstances) carefully in order to avoid
confusing both terms. Indicate if findings apply to only one sex or gender; describe whether sex and gender were considered in
study design; whether sex and/or gender was determined based on self-reporting or assigned and methods used.
Provide in the source data disaggregated sex and gender data, where this information has been collected, and if consent has
been obtained for sharing of individual-level data; provide overall numbers in this Reporting Summary. Please state if this
information has not been collected.
Report sex- and gender-based analyses where performed, justify reasons for lack of sex- and gender-based analysis.

Reporting on race, ethnicity, or Please specify the socially constructed or socially relevant categorization variable(s) used in your manuscript and explain why
other socially relevant they were used. Please note that such variables should not be used as proxies for other socially constructed/relevant variables
groupings (for example, race or ethnicity should not be used as a proxy for socioeconomic status).
Provide clear definitions of the relevant terms used, how they were provided (by the participants/respondents, the
researchers, or third parties), and the method(s) used to classify people into the different categories (e.g. self-report, census or
administrative data, social media data, etc.)
Please provide details about how you controlled for confounding variables in your analyses.

Population characteristics Describe the covariate-relevant population characteristics of the human research participants (e.g. age, genotypic
information, past and current diagnosis and treatment categories). If you filled out the behavioural & social sciences study
design questions and have nothing to add here, write "See above."

Recruitment Describe how participants were recruited. Outline any potential self-selection bias or other biases that may be present and
how these are likely to impact results.

Ethics oversight Identify the organization(s) that approved the study protocol.

Note that full information on the approval of the study protocol must also be provided in the manuscript.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size During the pretraining of our model, we used all 1,384,860,229 tiles in 171,189 pathology slides from 30,000 patients collected in the Providence health network, which comprises 28 cancer centers. For fine-tuning, we collected patients with the cancer type under investigation. Each pathology slide is a slide sample and each patient is a patient sample; each patient can have several slides. When performing mutation prediction, we selected the largest slide for each patient to analyze. The sample size was determined by all the samples collected by August 2023.
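Read literally, the "largest slide per patient" rule could be implemented as in the sketch below, which assumes "largest" means the slide with the most tissue tiles; the record layout and identifiers are hypothetical and only illustrate the selection logic.

```python
# Hypothetical (patient_id, slide_id, n_tissue_tiles) records.
records = [
    ("P001", "S1", 5231), ("P001", "S2", 10894),
    ("P002", "S3", 7712),
]

largest_per_patient = {}
for patient_id, slide_id, n_tiles in records:
    best = largest_per_patient.get(patient_id)
    if best is None or n_tiles > best[1]:
        largest_per_patient[patient_id] = (slide_id, n_tiles)

print(largest_per_patient)  # {'P001': ('S2', 10894), 'P002': ('S3', 7712)}
```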

Data exclusions We identified tiles without substantial tissue occupancy as background areas and filtered them out of pretraining and fine-tuning.

Replication Across all 26 tasks, we ran 10-fold cross-validation with 10 different seeds to determine whether the improvement of our model is significant compared to baseline approaches.
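The per-task significance levels reported in the figures use a one-sided Wilcoxon test over the repeated runs; a minimal sketch of such a test with the listed scipy version is shown below. The paired AUROC values are made up for illustration only.

```python
import numpy as np
from scipy.stats import wilcoxon

# Made-up paired AUROCs from 10 runs of our model and the best baseline on the same splits.
ours = np.array([0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.86, 0.88, 0.85, 0.87])
baseline = np.array([0.82, 0.83, 0.85, 0.81, 0.84, 0.82, 0.83, 0.85, 0.82, 0.84])

# One-sided test of whether our scores are shifted above the baseline's.
stat, p_value = wilcoxon(ours, baseline, alternative="greater")
print(f"statistic={stat:.1f}, one-sided p={p_value:.4f}")
```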


Randomization For the subtyping tasks and mutation prediction tasks, we randomly split the fine-tuning dataset into 7:1:2 train/validation/test splits. Hyperparameters were chosen based on accuracy on the validation set.
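A bare-bones illustration of a 7:1:2 random split over sample indices is given below; it ignores stratification and any patient-level grouping, so it sketches the split ratio rather than the exact procedure used in the study.

```python
import numpy as np

def split_indices(n_samples: int, seed: int):
    """Random 7:1:2 train/validation/test split over sample indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train, n_val = int(0.7 * n_samples), int(0.1 * n_samples)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```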

Blinding During the test, the researchers were blinded to the group allocation.

Reporting for specific materials, systems and methods


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems: Antibodies; Eukaryotic cell lines; Palaeontology and archaeology; Animals and other organisms; Clinical data; Dual use research of concern; Plants
Methods: ChIP-seq; Flow cytometry; MRI-based neuroimaging

Plants
Seed stocks Report on the source of all seed stocks or other plant material used. If applicable, state the seed stock centre and catalogue number. If
plant specimens were collected from the field, describe the collection location, date and sampling procedures.

Novel plant genotypes Describe the methods by which all novel plant genotypes were produced. This includes those generated by transgenic approaches,
gene editing, chemical/radiation-based mutagenesis and hybridization. For transgenic lines, describe the transformation method, the
number of independent lines analyzed and the generation upon which experiments were performed. For gene-edited lines, describe
the editor used, the endogenous sequence targeted for editing, the targeting guide RNA sequence (if applicable) and how the editor
was applied.
Authentication Describe any authentication procedures for each seed stock used or novel genotype generated. Describe any experiments used to assess the effect of a mutation and, where applicable, how potential secondary effects (e.g. second site T-DNA insertions, mosaicism, off-target gene editing) were examined.

