
The Journal of Molecular Diagnostics, Vol. 23, No. 10, October 2021
jmdjournal.org

Identification of Tissue of Origin and Guided Therapeutic Applications in Cancers of Unknown Primary Using Deep Learning and RNA Sequencing (TransCUPtomics)
Julien Vibert,* Gaëlle Pierron,† Camille Benoist,‡ Nadège Gruel,*§ Delphine Guillemot,† Anne Vincent-Salomon,¶ Christophe Le Tourneau,‖ Alain Livartowski,** Odette Mariani,¶ Sylvain Baulande,†† François-Clément Bidard,**‡‡ Olivier Delattre,*† Joshua J. Waterfall,§,§§ and Sarah Watson*,**

From the INSERM U830,* Équipe Labellisée Ligue Nationale Contre le Cancer, Diversity and Plasticity of Childhood Tumors Lab, PSL Research University, the Department of Translational Research,§ PSL Research University, the Institut Curie Genomics of Excellence (ICGex) Platform,†† PSL Research University, and the INSERM U830,§§ PSL Research University, Institut Curie Research Center, Paris; the Somatic Genetics Unit,† Department of Genetics, the Clinical Bioinformatic Unit,‡ Department of Diagnostic and Theranostic Medicine, the Department of Diagnostic and Theranostic Medicine,¶ and the Department of Medical Oncology,** Institut Curie Hospital, Paris; the Department of Drug Development and Innovation,‖ INSERM U900, Paris-Saclay University, Institut Curie Hospital and Research Center, Paris and Saint-Cloud; and the INSERM CIC-BT 1428,‡‡ UVSQ, Paris-Saclay University, Saint-Cloud, France

Accepted for publication July 14, 2021.

Address correspondence to Sarah Watson, M.D., Ph.D., INSERM U830, Équipe Labellisée Ligue Nationale Contre le Cancer, Diversity and Plasticity of Childhood Tumors Lab, PSL Research University, Institut Curie Research Center, Paris, France, Department of Medical Oncology, Institut Curie Hospital, Paris, France, 26 rue d’Ulm 75005 Paris, France. E-mail: sarah.watson@curie.fr.

Cancers of unknown primary (CUP) are metastatic cancers for which the primary tumor is not found despite thorough diagnostic investigations. Multiple molecular assays have been proposed to identify the tissue of origin (TOO) and inform clinical care; however, none has been able to combine accuracy, interpretability, and easy access for routine use. We developed a classifier tool based on the training of a variational autoencoder to predict tissue of origin based on RNA-sequencing data. We used as training data 20,918 samples corresponding to 94 different categories, including 39 cancer types and 55 normal tissues. The TransCUPtomics classifier was applied to a retrospective cohort of 37 CUP patients and 11 prospective patients. TransCUPtomics exhibited an overall accuracy of 96% on reference data for TOO prediction. The TOO could be identified in 38 (79%) of 48 CUP patients. Eight of 11 prospective CUP patients (73%) could receive first-line therapy guided by TransCUPtomics prediction, with responses observed in most patients. The variational autoencoder added further utility by enabling prediction interpretability, and diagnostic predictions could be matched to detection of gene fusions and expressed variants. TransCUPtomics confidently predicted TOO for CUP and enabled tailored treatments leading to significant clinical responses. The interpretability of our approach is a powerful addition to improve the management of CUP patients. (J Mol Diagn 2021, 23: 1380–1392; https://doi.org/10.1016/j.jmoldx.2021.07.009)

Cancers of unknown primary (CUP) are heterogeneous metastatic cancers for which the primary tumor cannot be identified despite a thorough diagnostic workup. CUP represent 1% to 2% of metastatic cancers and remain a diagnostic and therapeutic challenge. Usual first-line therapeutic strategy consists of unspecific platinum-based

Supported by ANR-10-EQPX-03, Institut Curie Génomique d’Excellence (ICGex) (S.B., O.D., and S.W.) and by INCa-DGOS-4654 (J.J.W.). S.W. was supported by a grant from the Institut National de la Santé et de la Recherche Medicale (INSERM) and the Foundation Bettencourt-Schueller. J.V. was supported by a grant from la Ligue Contre le Cancer and by Institut Curie. J.J.W. acknowledges support from SIRIC INCa-DGOS-INSERM_12554. The ICGex NGS platform of the Institut Curie was supported by grants ANR-10-EQPX-03 (Equipex) and ANR-10-INBS-09-08 (France Génomique Consortium) from the Agence Nationale de la Recherche (“Investissements d’Avenir” program), by the ITMO-Cancer Aviesan (Plan Cancer III), and by the SiRIC-Curie program (SiRIC Grant INCa-DGOS-4654).
Disclosures: None declared.

Copyright © 2021 Association for Molecular Pathology and American Society for Investigative Pathology. Published by Elsevier Inc. All rights reserved.
https://doi.org/10.1016/j.jmoldx.2021.07.009
RNA-seq and AI for CUP

chemotherapy, with the response rate and median overall survival remaining <30% and 9 to 12 months, respectively.1

Over the last decades, multiple attempts have been made to characterize the genomic and transcriptomic landscapes of CUP to identify relevant molecular alterations and gene expression signatures that could orient toward a specific tissue of origin and guide therapeutic strategies. Among them, data from microarrays, targeted DNA and RNA sequencing (RNA-seq), DNA methylation, and whole-genome sequencing have been used, with varying success.2 However, those techniques have not yet been widely included in the current diagnostic workup of CUP due to difficulty of access, cost, and lack of standardization. Thus, current guidelines for CUP management still rely on extensive clinical and immunohistochemical characterization for tissue of origin determination,1 despite which most cases remain unclassified and are treated with unspecific systemic drugs.

The utility of gene expression data to investigate tissue of origin has already been studied extensively, with whole transcriptome approaches (RNA-seq) being more robust than microarrays to identify tumor characteristics and to improve diagnostic accuracy.3,4 However, the amount of data generated by whole transcriptome sequencing renders their analysis difficult within standard diagnostic procedures. Recently, artificial intelligence and machine learning approaches have been successfully applied to the analysis of large, high-dimensional molecular data sets.5–8

We hypothesized that such approaches could be applied to analyze high-dimensional RNA-seq data and trained to identify tissues of origin on data sets of tumors and nonmalignant tissues. We present here a classifier based on the training of a variational autoencoder (VAE), a neural network used in the field of deep learning for dimensionality reduction,9 to predict tissue of origin based on RNA-seq data. The TransCUPtomics classifier was trained on an unprecedented reference data set of >20,000 tumors and normal tissues. We report our experience on 48 patients with CUP on whom we applied our classifier and evaluated its clinical utility by assessing the efficacy of matched therapy.

Materials and Methods

Patients and Tumors

The retrospective cohort included patients aged >18 years treated at Institut Curie over the last decade for a clinicopathologic diagnosis of CUP, as assessed by a multidisciplinary tumor board after standard diagnostic workup, and for whom fresh-frozen tissue samples were available. This cohort included patients treated at Institut Curie in the SHIVA 01 clinical trial.10

Diagnostic workup included standard biological and radiological procedures to search for the primary tumor, as well as appropriate extensive pathologic examination and immunohistochemistry (IHC) testing. The prospective cohort included all new untreated patients with CUP referred to Institut Curie from June 2019 to August 2020. The response rate to first-line therapy was evaluated according to RECIST 1.1 criteria.

The study was approved by the institutional review board of Institut Curie. All patients provided written informed consent for molecular analysis.

RNA Preparation and Sequencing

Total RNA was isolated from fresh-frozen tumor tissue samples by using TRIzol reagent. Library construction was performed following the TruSeq Stranded mRNA LS protocol (Illumina, San Diego, CA). Sequencing was performed on Illumina sequencing machines: NextSeq 500 (150 nt paired-end) and NovaSeq (100 nt paired-end). Atropos (v1.1.21) was used to trim adapters from FASTQ files. Quality controls, including RNA concentration, RNA integrity number, and standard sequencing criteria of raw read data, were performed before analysis. Reads were aligned to the human reference genome (hg19) with GENCODE version 19 as the reference gene annotation with the use of STAR, version 2.7.0e, and were quantified with the GeneCounts algorithm. Raw counts were normalized to transcripts per million.

Reference Samples Used for Training

To train the classifier, public RNA-seq data from fresh-frozen samples of all primary tumors in The Cancer Genome Atlas (TCGA) (n = 10,201) were used. To allow for contamination of samples by normal cells and prevent overfitting in the case of low-tumor content, all samples from juxta-tumor normal tissues in TCGA (n = 746), the Genotype-Tissue Expression (GTEx) project (n = 9659), and Human Protein Atlas (n = 200) were also included in the training data set. Categories with <10 samples available were excluded to avoid classes with too few samples for training. The TCGA-SARC (soft tissue sarcoma) category was divided into the different subtypes of sarcomas, and normal tissues from different platforms were fused under the same annotation (some tissues such as liver clustered separately on the Uniform Manifold Approximation and Projection plot between different platforms, either reflecting true biological variability and/or residual batch effect). Data from small cell lung cancers (n = 79, formalin-fixed, paraffin-embedded; https://www.ncbi.nlm.nih.gov/geo, accession number GSE60052)11 and pancreatic neuroendocrine tumors (n = 33; https://www.ncbi.nlm.nih.gov/geo, accession number GSE118014) were also used.12 Raw data FASTQ files were downloaded from the Genomic Data Commons archive (TCGA), Sequence Read Archive (Genotype-Tissue Expression, small cell lung cancers, and pancreatic neuroendocrine tumors) and HPA (Human Protein Atlas). In total, the reference data set contains 20,918



Vibert et al

samples divided into 94 “diagnoses.” Supplemental Table S1 lists all diagnoses and sample numbers.

The VAE and Machine Learning Classifier

The VAE encoded high-dimensional transcriptomic profiles into low-dimensional representations in a latent space composed of 100 features with interpretable gene weights (Supplemental Figure S1A). To design the VAE, we profited from the seminal work of Way and Greene13 who designed a VAE model named “Tybalt” to encode RNA-seq samples from TCGA. Because this model was already optimized in their work, we elaborated the current model based on a similar architecture: as input to the VAE, the 5000 most variable features were selected after a variance-stabilizing transformation on log-transformed transcripts per million values (SelectVariable Features in R package Seurat v3.1.4). The encoder neural network was fully connected and one layer deep, with an encoding intermediate layer of 100 neurons and a decoder network also fully connected and one layer deep. The latent space is therefore 100-dimensional. Input features were scaled between 0 and 1 before training (divided by maximum value of the corresponding feature).

The VAE was implemented and trained with Keras version 2.2.4 (TensorFlow version 1.14.0), optimized with Adam, batch-normalized. Activation was relu (rectified linear unit) for the encoding layer and sigmoid for the decoding layer. Learning rate was 0.0005, and the model was trained for 50 epochs with no evidence of overfitting. The latent space was visualized with the use of Uniform Manifold Approximation and Projection (UMAP) in two dimensions.14 The VAE was trained with all reference samples as input, resulting in 100-dimensional encoded representations for each sample. Two different machine learning classifiers were then trained on these 100 features: one random forest (RF) classifier and another using k-nearest neighbors (KNN). The RF classifier was trained by using the RandomForest (v4.6-14) package in R, with 5000 trees, mtry = 10. The KNN classifier was trained by using the kknn (v1.3.1) package in R, and 10-nearest neighbors were weighted with a Gaussian kernel.

Cross-Validation of the Classifier

To measure the performance of the classifiers on the training data set, the following threefold cross-validation procedure was used: the reference data set was randomly divided into three equally sized parts, and three classifiers were independently trained with two of the three parts, the third being reserved from the entire training procedure for use as a validation data set. Thus, the cross-validation was entirely “blind” to the training of the classifier, including in the feature selection step from the VAE. Because each of the reference samples was included in one of the three validation data sets, a confusion matrix could be calculated for the classifier on the entire reference data set, as well as precision and recall values for each of the diagnoses in the classifier.

To account for diagnoses that arise from similar tissues and, as expected, share similar transcriptomic profiles (as depicted in the UMAP plot), different labels corresponding to subtypes of the same normal tissue were grouped together for this procedure, namely: 13 subtypes of brain tissue, four subtypes of gynecologic tissue (cervix, uterus, vagina, and fallopian tube), three subtypes each of artery and esophageal tissues, and two subtypes each of adipose, colon, skin, and cardiac tissues. Moreover, tumors with similar clinical management and related transcriptomic profiles were regrouped as well, namely: five subtypes of soft tissue sarcomas from the TCGA-SARC project, colon and rectum adenocarcinomas (colorectal adenocarcinoma), and stomach and esophageal carcinomas (gastroesophageal carcinoma).

Classification of Test Samples

CUP transcriptomic profiles were encoded in the 100-dimensional latent space with the VAE encoder neural network, and the 100-feature vector was input into the RF and KNN classifiers to give a prediction of the most probable diagnosis with corresponding scores (Supplemental Figure S1B). Each test sample was projected on the original reference UMAP for visualization.

Two criteria were used for confidence of the prediction: the same diagnosis is predicted by both classifiers; and at least one of the diagnoses is predicted with a large score (>50%). Therefore, confidence of prediction can be defined as: i) high, both criteria present; ii) moderate, one of the two criteria present; or iii) low, both criteria absent, and samples are left “unclassified.”

Considering that some diagnoses may overlap and orient toward similar clinical management, criterion number one was considered to be fulfilled when a pair of diagnoses of the same family were predicted. Specifically, the following pairs of diagnoses were fused in our samples: “Kidney renal clear cell carcinoma” and “Kidney renal papillary cell carcinoma” were grouped into “Kidney carcinoma”; “Liver hepatocellular carcinoma” and “Cholangiocarcinoma” were grouped into “Liver HCC/Cholangiocarcinoma”; “Ovarian serous cystadenocarcinoma” and “Uterine corpus endometrial carcinoma” were grouped into “Gynecologic carcinoma”; subtypes of soft tissue sarcoma were grouped into “Soft tissue sarcoma”; and upper tract gastrointestinal cancers were grouped into “GI cancer.”

Exploration of VAE Features for Interpretability

The VAE encodes all samples inside a 100-dimensional latent space with 100 features that can be interpreted. For each feature, mean values were calculated for samples of each diagnosis to infer diagnoses with high values for this feature. Conversely, an “average” profile could also be calculated for each diagnosis by taking the mean value for


[Figure 1 graphic. A: UMAP plot (axes UMAP_1 and UMAP_2) of the reference data set, with clusters labeled by normal tissue (N_) and tumor type (T_) codes. B and C: enlarged views of the squamous cell carcinoma and brain regions. Codes are defined in the caption.]

Figure 1 Transcriptomic landscape of the reference data set captured by the variational autoencoder. A: In total, the reference data set contains 20,918
samples corresponding to 94 diagnostic categories, including 39 different tumor types and 55 normal tissue types. The variational autoencoder was trained to
encode all reference samples in a 100-dimensional space of latent features and later encoded in two dimensions with Uniform Manifold Approximation and
Projection (UMAP). B: Enlarged view of the transcriptomic profiles of lung, head and neck, and cervical squamous cell carcinoma showing partial overlap.
C: Enlarged view of nonmalignant brain structures showing a transcriptomic continuum between cortex and basal ganglia structures. N corresponds to normal
tissues, T to tumor types. N_ADP-SC, Adipose-Subcutaneous; N_ADP-VSC, Adipose-Visceral (Omentum); N_ADRNL, Adrenal gland; N_ART-AO, Artery-Aorta;
N_ART-CRN, Artery-Coronary; N_ART-TIB, Artery-Tibial; N_BLAD, Bladder; N_BRA-ACC, Brain-Anterior cingulate cortex (BA24); N_BRA-AMY, Brain-Amygdala;
N_BRA-CAU, Brain-Caudate (basal ganglia); N_BRA-CER, Brain-Cerebellum; N_BRA-CERH, Brain-Cerebellar hemisphere; N_BRA-CTX, Brain-Cortex; N_BRA-FCTX,
Brain-Frontal cortex (BA9); N_BRA-HIP, Brain-Hippocampus; N_BRA-HYP, Brain-Hypothalamus; N_BRA-NA, Brain-Nucleus accumbens (basal ganglia); N_BRA-
PUT, Brain-Putamen (basal ganglia); N_BRA-SN, Brain-Substantia nigra; N_BRA-SPI, Brain-Spinal cord (cervical c-1); N_BREAST, Breast; N_CERV, Cervix;
N_CLN-SIG, Colon-Sigmoid; N_CLN-TRA, Colon-Transverse; N_CML-CL, Leukemia cell line (CML); N_EBV-LYM, EBV-transformed lymphocytes; N_ESO-GEJ,
Esophagus-Gastroesophageal junction; N_ESO-MUC, Esophagus-Mucosa; N_ESO-MUS, Esophagus-Muscularis; N_FALLOP, Fallopian tube; N_HN, Head and
neck normal tissue; N_HRT-AA, Heart-Atrial appendage; N_HRT-LV, Heart-Left ventricle; N_KDN-CTX, Kidney-Cortex; N_LIVER, Liver; N_LN, Lymph node;
N_LUNG, Lung; N_MSG, Minor salivary gland; N_MUS-SKE, Muscle-Skeletal; N_NERV-TIB, Nerve-Tibial; N_OVARY, Ovary; N_PANC, Pancreas; N_PITUI, Pituitary;
N_PROST, Prostate; N_SI-TI, Small intestine-Terminal ileum; N_SKIN-NS, Skin-Not sun exposed (Suprapubic); N_SKIN-S, Skin-Sun exposed (Lower leg); N_SPLE,
Spleen; N_STOM, Stomach; N_TEST, Testis; N_TFIB, Transformed fibroblasts; N_THYR, Thyroid; N_UTER, Uterus; N_VAG, Vagina; N_WB, Whole blood; T_ACC,
Adrenocortical carcinoma; T_BLCA, Bladder urothelial carcinoma; T_BRCA, Breast invasive carcinoma; T_CESC, Cervical squamous cell carcinoma and endo-
cervical adenocarcinoma; T_CHOL, Cholangiocarcinoma; T_COAD, Colon adenocarcinoma; T_DDLPS, Dedifferentiated liposarcoma; T_DLBC, Diffuse large B-cell
lymphoma; T_ESCA, Esophageal carcinoma; T_GBM, Glioblastoma multiforme; T_HNSC, Head and neck squamous cell carcinoma; T_KICH, Kidney renal chro-
mophobe cell carcinoma; T_KIRC, Kidney renal clear cell carcinoma; T_KIRP, Kidney renal papillary cell carcinoma; T_LAML, Acute myeloid leukemia; T_LGG,
Brain lower grade glioma; T_LIHC, Liver hepatocellular carcinoma; T_LMS, Leiomyosarcoma; T_LUAD, Lung adenocarcinoma; T_LUSC, Lung squamous cell
carcinoma; T_MESO, Mesothelioma; T_MPNST, Malignant peripheral nerve sheath tumor; T_OV, Ovarian serous cystadenocarcinoma; T_PAAD, Pancreatic
adenocarcinoma; T_PANET, Pancreatic neuroendocrine tumor; T_PCPG, Pheochromocytoma and paraganglioma; T_PRAD, Prostate adenocarcinoma; T_READ,
Rectum adenocarcinoma; T_SCLC, Small cell lung cancer; T_SKCM, Skin cutaneous melanoma; T_SS, Synovial sarcoma; T_STAD, Stomach adenocarcinoma;
T_TGCT, Testicular germ cell tumors; T_THCA, Thyroid carcinoma; T_THYM, Thymoma; T_UCEC, Uterine corpus endometrial carcinoma; T_UCS, Uterine carci-
nosarcoma; T_UPS, Undifferentiated pleomorphic sarcoma; T_UVM, Uveal melanoma.
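
As a reading aid for the encoding step named in this caption, the sketch below writes out two standard ingredients of a generic VAE: the reparameterized draw of a latent vector and the KL divergence of a diagonal Gaussian from a standard normal prior. The article does not spell out its loss function, so this is a textbook formulation given as an assumption, not the authors' Keras code; the function names `sample_latent` and `kl_divergence` are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are per-dimension means and log-variances produced by an
    encoder (100 dimensions in the article; any length works here).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_divergence(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Toy usage with a 3-dimensional latent space (illustrative values only).
z = sample_latent([0.2, -0.1, 0.0], [0.0, 0.0, 0.0], random.Random(0))
penalty = kl_divergence([0.2, -0.1, 0.0], [0.0, 0.0, 0.0])
```

In a full model this KL term would be added to a reconstruction loss and minimized over the encoder and decoder weights; that part is omitted here.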


each of the 100 features. Each feature is the result of a nonlinear combination of the initial transcriptomic features, and the associated weights in the decoder network of the VAE provide an idea of the genes most contributing to this feature. Gene Ontology analysis was performed on the 100 highest-weight genes in each feature with the package gprofiler2 v0.7.0 in R. Results of these analyses are given in Supplemental Table S2.

Analysis of Expressed Variants and Fusion Detection

RNA-seq was used to detect gene fusions and infer expressed variants as described in the following sections.

Fusion Gene Detection

Two complementary approaches were performed for fusion gene detection. First, a targeted analysis was performed by using a curated list of known fusion gene sequences to detect well-documented fusions. Second, an exploratory analysis was conducted with five fusion-detection tools: i) Defuse v0.6.0; ii) StarFusion v2.5.3; iii) Fusion Catcher v1.00; iv) FusionMap Oshell toolkit v10.0.1.50; and v) ARRIBA v1.2.0.

Interpretation combined the results of the targeted fusion analysis and those of the exploratory analysis.

Variant Calling

Read alignment was performed with STAR on hg19, and read cleaning was done as described by GATK good practice recommendations (v3.5). Variant calling was performed on a list of 499 genes (Cancer Gene Census, COSMIC 24.05.2016) using HaplotypeCaller (GATK v.3.5) and Mutect2 (GATK v.4). Reads with mapping quality <6 and sequenced bases quality <20 were not considered for variant calling. Variants were annotated with ANNOVAR (v2018Apr16). Modelization of single nucleotide variants belonging to a list of 499 genes (Cancer Gene Census,

[Figure 2 graphic: confusion matrix of predicted cancer type (rows) against diagnosed cancer type (columns), colored by proportion of diagnosed cancer type (%), with recall (%) and precision (%) shown along the margins.]
Figure 2 Performance of the TransCUPtomics classifier for tissue of origin detection. Confusion matrix showing the accuracy of the random forest tool of
classification for tumor type prediction, evaluated according to a threefold cross-validation procedure. The reference data set was randomly divided into three
equally sized parts, and three classifiers were independently trained with two of the three parts, the third part being left aside from the entire training
procedure as a validation data set. Rows correspond to the predicted diagnoses and columns to the true diagnoses. Recall and precision are shown at the top
and right sides of the matrix. T_COREAD, Colorectal adenocarcinoma; T_GESCA, Gastroesophageal carcinoma; T_SARC, Soft tissue sarcoma. The rest are similar
to those given in Figure 1.
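
The evaluation summarized in this caption can be illustrated with a small sketch: a random split of sample indices into three folds, and per-diagnosis precision and recall computed from pairs of true and predicted labels. This is a minimal illustration under stated assumptions, not the authors' R code; the helper names `three_fold_indices` and `precision_recall` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def three_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and deal them into three near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::3] for k in range(3)]


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} from paired label lists.

    Precision divides correct calls by all predictions of that label (a row
    of the confusion matrix); recall divides them by all true samples of that
    label (a column).
    """
    pairs = Counter(zip(true_labels, predicted_labels))
    true_totals = Counter(true_labels)
    predicted_totals = Counter(predicted_labels)
    result = {}
    for label in set(true_labels) | set(predicted_labels):
        hits = pairs[(label, label)]
        precision = (hits / predicted_totals[label]
                     if predicted_totals[label] else float("nan"))
        recall = (hits / true_totals[label]
                  if true_totals[label] else float("nan"))
        result[label] = (precision, recall)
    return result


# Toy usage: each fold serves once as the held-out validation set.
folds = three_fold_indices(10)
scores = precision_recall(["LUAD", "LUAD", "BRCA", "BRCA"],
                          ["LUAD", "BRCA", "BRCA", "BRCA"])
```

Because every reference sample falls in exactly one validation fold, pooling the held-out predictions from the three classifiers gives one label pair per sample, which is what `precision_recall` expects.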


COSMIC 24.05.2016) was then validated by using Alamut Visual 2.9.0 (Interactive Biosoftware) and annotated for pathogenicity in five classes by using Varsome Educational use v9.3.4 (https://varsome.com) following American College of Medical Genetics and Genomics recommendations.

Table 1 Characteristics of Patients with CUP

Characteristic                                   CUP cohort (N = 48)
Age, median (range), years                       57 (30–80)
Female sex, no. (%)                              29 (60.4)
Prospective, no. (%)                             11 (22.9)
Site of metastases, no. (%)
  Lymph node                                     34 (70.8)
  Liver                                          14 (29.1)
  Bone                                           10 (20.8)
  Lung                                           6 (12.5)
  Peritoneum                                     7 (14.6)
  Brain                                          3 (6.2)
  Other                                          15 (31.2)
Histology, no. (%)
  Adenocarcinoma                                 23 (47.9)
  Squamous cell carcinoma                        5 (10.4)
  Undifferentiated carcinoma                     14 (29.2)
  Other                                          6 (12.5)
IHC, no. (%)
  CK7+ CK20-                                     32 (66.7)
  CK7+ CK20+                                     5 (10.5)
  CK7- CK20+                                     1 (2)
  CK7- CK20-                                     10 (20.8)
Suspected clinicopathologic diagnosis, no. (%)
  Unknown primary                                30 (62.5)
  GI cancer                                      8 (16.6)
  Breast cancer                                  4 (8.3)
  Lung cancer                                    3 (6.3)
  Gynecologic cancer                             3 (6.3)

The clinicopathologic diagnosis refers to the suspicion of tissue of origin that could be made based on clinical presentation and pathologic analysis.
CUP, cancers of unknown primary; GI, gastrointestinal; IHC, immunohistochemical CK7 and CK20 profiles.

Results

Training, Performance, and Interpretation of the Classifier

The TransCUPtomics classifier was trained to predict tissue of origin based on RNA-seq data. In total, the reference data set contained 20,918 samples corresponding to 94 diagnostic categories, including 39 tumor types and 55 normal tissue types (Supplemental Table S1).

UMAP visualization of the transcriptomic landscape of the reference data set captured by the VAE showed strong separation of most tumor types according to their clinical and pathologic diagnosis (Figure 1A). For example, adenocarcinoma from lung (T_LUAD), prostate (T_PRAD), and pancreatic (T_PAAD) origins clearly formed separate groups of tumors. Moreover, different histologic subtypes of tumors developing from the same primary site such as kidney tumors (T_KIRC and T_KIRP) could be distinguished, and similar tumors of distinct grades such as glioma (T_LGG) and glioblastoma (T_GBM) showed a continuous distribution. On the contrary, squamous cell carcinoma of various tissue origins, including head and neck (T_HNSC), cervix (T_CESC), and lung (T_LUSC), showed partial overlap (Figure 1B), in accordance with previous reports.15 Normal tissues clustered according to their tissue of origin and apart from malignant tumors originating from the same organs. Of note, a transcriptomic continuum could be identified between different organs of the same embryonal origin [eg, between vagina (N_VAG), cervix (N_CERV), uterus (N_UTER) and fallopian tubes (N_FALLOP)], or across different structures of the same organ such as brain (N_BRA) (Figure 1C).

The 100 VAE features were used to train two machine learning classifiers based on RF and KNN to predict the most probable diagnoses. Cross-validation showed robust overall accuracy, with predictions matching the true diagnosis in 96.26% of cases with the RF classifier and in 96.03% of cases with KNN (94.99% for RF and 94.53% for KNN when restricting the analysis to tumor samples) (Figure 2 and Supplemental Figure S2A-C).

Each of the 100 VAE features was a weighted combination of genes (Supplemental Table S2), allowing biological interpretation of the classification. Gene Ontology analysis performed on the high-weight genes of each feature enabled identification of biological processes. This included features associated with neural development (VAE_52, highly expressed in all brain samples), immune infiltration (VAE_2, 9, 39, 64, highly expressed in all blood samples), or keratinization (VAE_10, 21, 23, 65, 78, 85, 95, 98, 99) (Supplemental Table S3). VAE_2 was associated with Gene Ontology terms related to the adaptive immune system, with its highest-weight genes including numerous immunoglobulin and T-cell receptor genes. Notably, diagnoses associated with VAE_2 were related to T cell–hosting tissues (normal lymph node, diffuse large B-cell lymphoma, and thymoma) but also included tumor types with frequent T-cell infiltration and benefit from immunotherapy (kidney carcinoma and skin melanoma).

Classification of CUP

The performance of the TransCUPtomics classifier was evaluated to predict the tissue of origin in a series of 48 patients with CUP, including 37 retrospective cases and 11 prospective patients (Table 1). All CUP diagnoses were confirmed by a multidisciplinary tumor board and had gone through extensive diagnostic workup as recommended. The median age at diagnosis was 57 years (range, 30 to 80 years), and 60.4% of patients were female. The most frequent metastatic sites were the lymph nodes (70.8%),


[Figure 3 graphic: reference UMAP plot (axes UMAP_1 and UMAP_2) with normal tissue (N_) and tumor type (T_) cluster labels, and CUP samples 1 to 48 overlaid as numbered points.]
Figure 3 Detection of tissue of origin in patients with cancers of unknown primary (CUP). RNA sequencing was performed on fresh-frozen biopsy
specimens from the diagnostic workup of each patient with CUP (CUP1 to CUP48). Transcriptomic profiles were encoded in the 100-dimensional latent space of
the variational autoencoder trained on the reference data set, and then plotted on the reference Uniform Manifold Approximation and Projection (UMAP)
representation. Each CUP sample is highlighted by a red dot with its corresponding identity number.
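
The two confidence criteria described under Classification of Test Samples (agreement between the RF and KNN predictions, allowing for fused diagnosis families, and at least one score above 50%) can be sketched as a small decision function. This is a hedged illustration rather than the authors' implementation: the function name `confidence` is hypothetical, scores are assumed to be fractions between 0 and 1, and the `FAMILY` map lists only the pairs named explicitly in the text (the sarcoma and gastrointestinal groupings are left out because their member labels are not enumerated there).

```python
# Diagnosis families fused for the agreement criterion (partial list, taken
# from the pairs named in the Methods text).
FAMILY = {
    "Kidney renal clear cell carcinoma": "Kidney carcinoma",
    "Kidney renal papillary cell carcinoma": "Kidney carcinoma",
    "Liver hepatocellular carcinoma": "Liver HCC/Cholangiocarcinoma",
    "Cholangiocarcinoma": "Liver HCC/Cholangiocarcinoma",
    "Ovarian serous cystadenocarcinoma": "Gynecologic carcinoma",
    "Uterine corpus endometrial carcinoma": "Gynecologic carcinoma",
}


def confidence(rf_label, rf_score, knn_label, knn_score, threshold=0.5):
    """Combine both criteria into high / moderate / unclassified."""
    # Criterion 1: both classifiers predict the same diagnosis or family.
    same_family = (FAMILY.get(rf_label, rf_label)
                   == FAMILY.get(knn_label, knn_label))
    # Criterion 2: at least one prediction score exceeds the threshold.
    strong_score = max(rf_score, knn_score) > threshold
    criteria_met = int(same_family) + int(strong_score)
    return {2: "high", 1: "moderate", 0: "unclassified"}[criteria_met]


# Toy usage with made-up scores.
call = confidence("Kidney renal clear cell carcinoma", 0.4,
                  "Kidney renal papillary cell carcinoma", 0.3)
```

In this toy call the two kidney subtypes count as agreeing, but neither score passes the threshold, so only one criterion is met.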

liver (29.1%), and bone (20.8%). The most frequent pathologic subtypes were adenocarcinoma (47.9%) and undifferentiated carcinoma (29.2%). The extensive IHC profile of each sample is described in Supplemental Table S4.
RNA-seq was performed on fresh-frozen tissue samples from the diagnostic workup. All samples met standard quality controls (Supplemental Table S5). Transcriptomic profiles were encoded in the latent space of the VAE trained on the reference data set. When plotted on the reference UMAP representation, every CUP localized within or near a specific diagnosis, with 45 cases fitting into a specific tumor group and three cases within a normal tissue cluster (Figure 3). The most probable tissue of origin was rigorously predicted with both RF and KNN algorithms to evaluate robustness of classification. For each sample, a highly confident prediction was defined by: a similar diagnosis given by both algorithms; and at least one score of prediction over the 50% threshold. Moderate-confidence diagnoses referred to cases for which only one criterion was present. The remaining cases were considered as unclassified.
Overall, a predicted diagnosis could be established in 38 (79%) of 48 cases, and matched clinical and pathologic presentation (Table 2 and Supplemental Table S6). This included 32 (67%) of 48 high-confidence predictions and 6 (12%) of 48 moderate-confidence predictions. The most frequent diagnoses established with high-confidence scores were lung adenocarcinoma (N = 6), bladder urothelial carcinoma (N = 3), and breast invasive carcinoma (N = 3). In three cases, a high-confidence prediction was made toward a nonmalignant tissue of origin (liver, CUP11, CUP43; ovary, CUP44) and was in agreement with pathologic review exhibiting tumor cellularity <10% in all three samples.
Ten (21%) of 48 samples remained unclassified (CUP10, 12, 13, 14, 15, 19, 20, 25, 31, and 45). These samples came from lymph node biopsy samples or lymphadenectomy specimens in seven of 10 cases. Three of these 10 samples were characterized by an unusually high distance to the nearest neighbor in the 100-dimensional latent space (>99% quantile of all samples in the cohort), suggesting that their tumor type of origin may not be represented in the reference data set (Supplemental Table S6).
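The nearest-neighbor distance check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of Euclidean distance, and the choice to derive the 99% cutoff from leave-one-out distances within the reference set are all assumptions made for illustration, and the latent vectors are random toy data rather than VAE encodings.

```python
import math
import random


def nearest_neighbor_distance(point, others):
    """Euclidean distance from `point` to its closest vector in `others`."""
    return min(math.dist(point, other) for other in others)


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def flag_distant_samples(reference, queries, q=0.99):
    """Return one boolean per query: True if unusually far from the reference."""
    # Assumption: the background distribution is the set of leave-one-out
    # nearest-neighbor distances inside the reference set.
    background = [
        nearest_neighbor_distance(vec, reference[:i] + reference[i + 1:])
        for i, vec in enumerate(reference)
    ]
    cutoff = quantile(background, q)
    return [nearest_neighbor_distance(vec, reference) > cutoff for vec in queries]


# Toy data: 200 reference vectors in a 100-dimensional latent space.
rng = random.Random(0)
reference = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(200)]
typical = [value + 0.01 for value in reference[0]]  # close to a reference sample
distant = [25.0] * 100                              # far from every reference sample
flags = flag_distant_samples(reference, [typical, distant])
```

In this toy setting only the second query is flagged. A flagged sample would be a candidate for a tumor type that is absent from the reference data set, which is the interpretation the text gives for three of the unclassified cases.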


Table 2 Results of TransCUPtomics Prediction for all Patients with CUP

Patient ID | Cohort | Clinicopathologic diagnosis | Tissue | Predicted diagnosis | Confidence
CUP1 | Prospective | CUP | Bone biopsy | KyCa | High
CUP4 | Prospective | CUP | Retroperitoneal biopsy | Head and neck squamous cell carcinoma | High
CUP37 | Prospective | CUP | Muscular biopsy | Soft tissue sarcoma (UPS/LMS) | High
CUP39 | Prospective | CUP/NET | Liver biopsy | Pancreatic neuroendocrine tumor | High
CUP41 | Prospective | CUP/BrCa | Liver biopsy | Breast invasive carcinoma | High
CUP42 | Prospective | CUP | Peritoneal biopsy | Soft tissue sarcoma (UPS/DDLPS) | Moderate
CUP43 | Prospective | ACUP/LCa/PDAC | Liver biopsy | Liver | High
CUP44 | Prospective | ACUP/GaCa | Ovarian biopsy | Ovary | High
CUP46 | Prospective | ACUP/CRC | Peritoneal biopsy | Colon adenocarcinoma | Moderate
CUP47 | Prospective | ACUP/CRC/GaCa | Peritoneal biopsy | Colon adenocarcinoma | High
CUP48 | Prospective | ACUP/CRC/GaCa | Peritoneal biopsy | GI cancer | High
CUP2 | Retrospective | CUP | Muscular biopsy | Undifferentiated pleomorphic sarcoma | Moderate
CUP3 | Retrospective | ACUP | Subcutaneous biopsy | Bladder urothelial carcinoma | High
CUP5 | Retrospective | CUP/LCa | Lung biopsy | Lung squamous cell carcinoma | High
CUP6 | Retrospective | CUP/LCa | Inguinal lymph node biopsy | Cervical squamous cell carcinoma and endocervical adenocarcinoma | High
CUP7 | Retrospective | CUP/PDAC | Bone biopsy | Lung adenocarcinoma | High
CUP8 | Retrospective | ACUP/PDAC | Liver biopsy | Liver HCC/cholangiocarcinoma | High
CUP9 | Retrospective | CUP/GyCa/BrCa | Cervical lymphadenectomy | Ovarian serous cystadenocarcinoma | High
CUP10 | Retrospective | ACUP | Cervical lymphadenectomy | Unclassified | Low
CUP11 | Retrospective | ACUP | Liver biopsy | Liver | High
CUP12 | Retrospective | CUP | Retroperitoneal lymphadenectomy | Unclassified | Low
CUP13 | Retrospective | CUP/GaCa | Cavum biopsy | Unclassified | Low
CUP14 | Retrospective | CUP | Nephrectomy | Unclassified | Low
CUP15 | Retrospective | ACUP/PDAC | Cervical lymph node biopsy | Unclassified | Low
CUP16 | Retrospective | CUP | Cervical lymphadenectomy | Lung adenocarcinoma | High
CUP17 | Retrospective | CUP | Cervical lymphadenectomy | Uterine corpus endometrial carcinoma | Moderate
CUP18 | Retrospective | ACUP/LCa | Cervical lymphadenectomy | Lung adenocarcinoma | High
CUP19 | Retrospective | CUP/NET | Axillary lymphadenectomy | Unclassified | Low
CUP20 | Retrospective | CUP | Subclavicular lymphadenectomy | Unclassified | Low
CUP21 | Retrospective | ACUP | Cervical lymphadenectomy | Bladder urothelial carcinoma | High
CUP22 | Retrospective | ACUP/BrCa | Axillary lymphadenectomy | Breast invasive carcinoma | High
CUP23 | Retrospective | ACUP | Lymph node biopsy | Colon adenocarcinoma | High
CUP24 | Retrospective | CUP | Axillary lymphadenectomy | Lung squamous cell carcinoma | High
CUP25 | Retrospective | CUP/HNCa | Cervical lymphadenectomy | Unclassified | Low
CUP26 | Retrospective | ACUP/GyCa | Inguinal lymphadenectomy | GyCa | High
CUP27 | Retrospective | CUP/BrCa | Lymph node biopsy | Breast invasive carcinoma | High
CUP28 | Retrospective | ACUP | Liver biopsy | Cholangiocarcinoma | High
CUP29 | Retrospective | CUP | Lymph node biopsy | Lung adenocarcinoma | High
CUP30 | Retrospective | ACUP | Kidney biopsy | Bladder urothelial carcinoma | High
CUP31 | Retrospective | CUP/OvCa | Retroperitoneal biopsy | Unclassified | Low
CUP32 | Retrospective | ACUP | Cervical lymphadenectomy | Lung adenocarcinoma | High
CUP33 | Retrospective | ACUP | Lymph node biopsy | GyCa | High
CUP34 | Retrospective | CUP/BrCa | Cervical lymphadenectomy | Pancreatic neuroendocrine tumor | High
CUP35 | Retrospective | ACUP | Cervical lymphadenectomy | Lung adenocarcinoma | Moderate
CUP36 | Retrospective | ACUP | Lymph node biopsy | KyCa | High
CUP38 | Retrospective | CUP | Lymph node biopsy | Skin cutaneous melanoma | Moderate
CUP40 | Retrospective | ACUP/KyCa | Bone biopsy | Lung adenocarcinoma | High
CUP45 | Retrospective | ACUP/BrCa | Cervical lymphadenectomy | Unclassified | Low
The most probable tissue of origin was predicted with both random forest and k-nearest neighbors machine learning classifiers to evaluate robustness of classification. For each test sample, a highly confident prediction was defined by: a similar diagnosis given by both machine learning algorithms and at least one score of prediction over the 50% threshold. Moderate-confidence diagnoses referred to cases for which only one criterion was present. The remaining cases were considered as unclassified.
ACUP, adenocarcinoma of unknown primary; BrCa, breast carcinoma; CRC, colorectal carcinoma; CUP, cancer of unknown primary; DDLPS, dedifferentiated liposarcoma; GaCa, gastric carcinoma; GI, gastrointestinal; GyCa, gynecologic carcinoma; HCC, hepatocellular carcinoma; HNCa, head and neck carcinoma; KyCa, kidney carcinoma; LCa, lung carcinoma; LMS, leiomyosarcoma; NET, neuroendocrine tumor; OvCa, ovarian carcinoma; PDAC, pancreatic ductal adenocarcinoma; UPS, undifferentiated pleomorphic sarcoma.
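The confidence rule stated in the Table 2 footnote can be written as a short function. This is a sketch under stated assumptions: the function name, argument names, and the example labels are hypothetical, scores are assumed to be expressed as fractions between 0 and 1, and the article does not specify which label is reported when the two classifiers disagree, so only the confidence level is returned.

```python
def confidence_level(rf_label, rf_score, knn_label, knn_score, threshold=0.5):
    """Map random forest and k-nearest neighbors outputs to a confidence level.

    Criterion 1: both classifiers give the same diagnosis.
    Criterion 2: at least one prediction score is over the threshold.
    Both criteria met -> "High"; exactly one -> "Moderate"; none -> "Low"
    (reported as unclassified in Table 2).
    """
    same_diagnosis = rf_label == knn_label
    score_over_threshold = max(rf_score, knn_score) > threshold
    criteria_met = int(same_diagnosis) + int(score_over_threshold)
    return {2: "High", 1: "Moderate", 0: "Low"}[criteria_met]


# Hypothetical classifier outputs, for illustration only.
examples = [
    confidence_level("lung adenocarcinoma", 0.80, "lung adenocarcinoma", 0.40),
    confidence_level("lung adenocarcinoma", 0.80, "lung squamous cell carcinoma", 0.30),
    confidence_level("lung adenocarcinoma", 0.40, "lung adenocarcinoma", 0.30),
    confidence_level("lung adenocarcinoma", 0.40, "lung squamous cell carcinoma", 0.30),
]
```

Requiring agreement between two different classifiers, rather than a single score cutoff, means that a high score from one algorithm alone yields at most a moderate-confidence call.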


Table 3 Therapeutic Applications of TransCUPtomics Prediction

Patient ID | Cohort | Predicted diagnosis | First-line treatment at diagnosis | Tumor response at first line (3 months) | Potential therapeutic alternative with VAE (retrospective cases)
CUP1 | Prospective | Kidney carcinoma | Clinical trial anti-PD1 + VEGF inhibitor | CR |
CUP4 | Prospective | Head and neck squamous cell carcinoma | Carboplatin | PR |
CUP37 | Prospective | Soft tissue sarcoma (UPS/LMS) | Surgery | CR |
CUP39 | Prospective | Pancreatic neuroendocrine tumor | Carboplatin-etoposide | PR |
CUP41 | Prospective | Breast invasive carcinoma | Paclitaxel-trastuzumab-pertuzumab | PR |
CUP42 | Prospective | Soft tissue sarcoma (UPS/DDLPS) | Palliative care | NA |
CUP43 | Prospective | Liver | 5FU-folinic acid-oxaliplatin* | SD |
CUP44 | Prospective | Ovary | 5FU-folinic acid-oxaliplatin* | PD |
CUP46 | Prospective | Colon adenocarcinoma | 5FU-folinic acid-irinotecan-cetuximab | PR |
CUP47 | Prospective | Colon adenocarcinoma | 5FU-folinic acid-oxaliplatin-bevacizumab | PR |
CUP48 | Prospective | GI cancer | 5FU-folinic acid-oxaliplatin | SD |
CUP2 | Retrospective | Undifferentiated pleomorphic sarcoma | Pembrolizumab | PD | Adriamycin, ifosfamide, VEGFR inhibitors
CUP3 | Retrospective | Bladder urothelial carcinoma | Carboplatin-paclitaxel | PD | Immune checkpoint inhibitors
CUP5 | Retrospective | Lung squamous cell carcinoma | Vinorelbine | PD | Immune checkpoint inhibitors
CUP6 | Retrospective | Cervical squamous cell carcinoma and endocervical adenocarcinoma | Cisplatin-vinorelbine | PD | Immune checkpoint inhibitors
CUP7 | Retrospective | Lung adenocarcinoma | 5FU-folinic acid-oxaliplatin-irinotecan | PD | Immune checkpoint inhibitors
CUP8 | Retrospective | Liver HCC/cholangiocarcinoma | Gemcitabine-oxaliplatin | PD | Cisplatin, 5FU, sunitinib, clinical trials
CUP9 | Retrospective | Ovarian serous cystadenocarcinoma | Carboplatin-paclitaxel | CR | Bevacizumab, PARP inhibitors
CUP10 | Retrospective | Unclassified | Carboplatin-paclitaxel | PR | 0
CUP11 | Retrospective | Liver | Cisplatin-gemcitabine | PD | NA
CUP12 | Retrospective | Unclassified | Cisplatin-gemcitabine | PD | 0
CUP13 | Retrospective | Unclassified | Carboplatin-paclitaxel | PR | 0
CUP14 | Retrospective | Unclassified | Cisplatin-gemcitabine | SD | 0
CUP15 | Retrospective | Unclassified | Cisplatin-gemcitabine | PR | 0
CUP16 | Retrospective | Lung adenocarcinoma | Cisplatin-5FU-epirubicin | PD | Immune checkpoint inhibitors
CUP17 | Retrospective | Uterine corpus endometrial carcinoma | Cisplatin-docetaxel | SD | 0
CUP18 | Retrospective | Lung adenocarcinoma | Cisplatin-docetaxel | PR | Immune checkpoint inhibitors
CUP19 | Retrospective | Unclassified | Cisplatin-etoposide | NA | 0
CUP20 | Retrospective | Unclassified | Gemcitabine-oxaliplatin | CR | 0
CUP21 | Retrospective | Bladder urothelial carcinoma | Cisplatin-5FU-epirubicin | PR | Immune checkpoint inhibitors
CUP22 | Retrospective | Breast invasive carcinoma | 5FU-epirubicin-cyclophosphamide/docetaxel | NA | 0
CUP23 | Retrospective | Colon adenocarcinoma | Carboplatin-paclitaxel | PD | 5FU, oxaliplatin, irinotecan
CUP24 | Retrospective | Lung squamous cell carcinoma | Cisplatin-5FU-cetuximab | PR | Immune checkpoint inhibitors
CUP25 | Retrospective | Unclassified | Carboplatin-paclitaxel | PR | 0
CUP26 | Retrospective | Gynecologic carcinoma | Carboplatin-paclitaxel | CR | Bevacizumab, PARP inhibitors
CUP27 | Retrospective | Breast invasive carcinoma | Adriamycin-cyclophosphamide | SD | Eribulin, PARP inhibitors, immune checkpoint inhibitors
CUP28 | Retrospective | Cholangiocarcinoma | Carboplatin-paclitaxel | SD | Cisplatin, 5FU, sunitinib, clinical trials
CUP29 | Retrospective | Lung adenocarcinoma | Carboplatin-paclitaxel | PR | Immune checkpoint inhibitors
CUP30 | Retrospective | Bladder urothelial carcinoma | Palliative care | NA | Immune checkpoint inhibitors
CUP31 | Retrospective | Unclassified | Carboplatin-paclitaxel | PD | 0
CUP32 | Retrospective | Lung adenocarcinoma | Carboplatin-gemcitabine | PD | Immune checkpoint inhibitors
CUP33 | Retrospective | Gynecologic carcinoma | Cisplatin-gemcitabine | CR | Bevacizumab, PARP inhibitors
CUP34 | Retrospective | Pancreatic neuroendocrine tumor | Adriamycin-cyclophosphamide | PR | Etoposide, oxaliplatin, sunitinib
CUP35 | Retrospective | Lung adenocarcinoma | Cisplatin-5FU-etoposide-adriamycin | PD | Immune checkpoint inhibitors, RET inhibitors
CUP36 | Retrospective | Kidney carcinoma | Carboplatin-paclitaxel | PD | Immune checkpoint inhibitors
CUP38 | Retrospective | Skin cutaneous melanoma | Cisplatin-etoposide | PD | Immune checkpoint inhibitors
CUP40 | Retrospective | Lung adenocarcinoma | Carboplatin-paclitaxel | PD | Immune checkpoint inhibitors
CUP45 | Retrospective | Unclassified | Epirubicin-cyclophosphamide | PD | 0
For retrospective cases, the response to first-line therapy guided by clinical and pathologic suspicion is indicated, as well as potential therapeutic alternatives that could have been made based on TransCUPtomics predictions. For prospective cases, TransCUPtomics-tailored treatments and responses are indicated.
*Treatment based on clinical and pathologic characteristics due to low tumor cellularity of the tumor sample analyzed by RNA sequencing.
5FU, 5-fluorouracil; CR, complete response; DDLPS, dedifferentiated liposarcoma; LMS, leiomyosarcoma; NA, not applicable; PARP, poly (ADP-ribose) polymerase; PD, progression of disease; PD1, programmed cell death protein 1; PR, partial response; RET, rearranged during transfection; SD, stable disease; UPS, undifferentiated pleomorphic sarcoma; VAE, variational autoencoder; VEGF, vascular endothelial growth factor; VEGFR, vascular endothelial growth factor receptor.
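The response codes in Table 3 are commonly summarized as an overall response rate, meaning the share of evaluable patients with a complete or partial response. The sketch below shows that arithmetic. The function name is hypothetical, and the counts are those reported in the article text for each cohort (4 complete and 9 partial responses among 36 retrospective patients; 2 complete and 5 partial responses among the 8 prospective patients who received tailored treatment), not values recomputed from the table rows.

```python
def overall_response_rate(complete, partial, evaluable):
    """Percentage of evaluable patients with a complete or partial response."""
    if evaluable <= 0:
        raise ValueError("evaluable must be a positive number of patients")
    return 100.0 * (complete + partial) / evaluable


# Counts as reported in the article text, not recomputed from Table 3.
retrospective_orr = overall_response_rate(complete=4, partial=9, evaluable=36)
prospective_orr = overall_response_rate(complete=2, partial=5, evaluable=8)
print(f"retrospective: {retrospective_orr:.0f}%, prospective: {prospective_orr:.1f}%")
# prints "retrospective: 36%, prospective: 87.5%"
```

The two cohorts are small and were not randomized, so the two percentages describe the cohorts rather than a controlled comparison.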

Molecular Alterations

Gene fusion and variant detection algorithms were applied to all CUP samples. CUP3 showed an in-frame HELB-HMGA2 fusion, whereas no relevant gene fusion was detected in the other samples.
Variants in genes of interest in oncogenesis were detected in 46 of 48 samples (Supplemental Tables S7 and S8). The most frequently mutated genes were TP53 (N = 19 of 48) and KRAS (N = 9 of 48), and other actionable alterations were mostly detected in genes involved in DNA repair and RAS/mitogen-activated protein kinase pathways, as previously described in CUP.16 Of note, oncogenic mutations in KRAS and BRCA1 (CUP43) and TP53 (CUP11) were detected with a low depth of coverage in two samples predicted as normal tissues by TransCUPtomics, in line with their poor tumor cellularity (Supplemental Table S8).

Clinical Impact of TransCUPtomics Classification

The potential application of TransCUPtomics classification for tailored treatment guidance was evaluated next. Among the 37 retrospective cases of CUP, 29 patients had received first-line unspecific platinum-based chemotherapy, seven had received a treatment oriented toward a putative primary tumor determined by using clinical and IHC characteristics, and one had not received any systemic treatment. The overall response rate to first-line therapy was 36%, including 4 of 36 complete responses and 9 of 36 partial responses. The diagnosis prediction given by the TransCUPtomics algorithm could have given therapeutic alternatives to platinum-based chemotherapy in 24 (64.8%) of 37 patients (Table 3).
Of 11 prospective cases, eight could receive first-line systemic treatment according to the TransCUPtomics predicted tissue of origin. The remaining three patients included one who was not treated due to altered performance status and two who were treated according to clinical and pathologic characteristics due to low tumor cellularity of the sample analyzed by RNA-seq. Among the eight patients who could receive TransCUPtomics-tailored first-line treatment, there were two complete responses and five partial responses (Table 3). This included a 30-year-old man (CUP1) with diffuse bone and subdiaphragmatic lymph node metastases, whose bone biopsy specimen revealed undifferentiated adenocarcinoma with an IHC profile (CKAE1/AE3+, CK7-, CK20-, CDX2-, TTF1-, PSA-, CD20-, PAX8+, CD10+, Vimentin+, and PDL1 3+) compatible with kidney or biliopancreatic primary. TransCUPtomics showed a highly confident prediction for kidney carcinoma, further supported by the detection of a truncating SMARCA4 rearrangement.17 The patient was included in a clinical trial evaluating an anti-programmed cell death protein 1 immune checkpoint inhibitor in combination with an antiangiogenic tyrosine kinase inhibitor. First evaluation at 3 months showed a complete response, and at the time of this report, the patient remains progression-free after 12 months of follow-up. Other TransCUPtomics-tailored therapeutic strategies notably included frontline surgery for predicted soft tissue sarcoma, oxaliplatin and 5-fluorouracil-based chemotherapy for predicted colorectal carcinoma, and paclitaxel-trastuzumab-pertuzumab for predicted HER2-amplified breast carcinoma.

Discussion

CUP consist of a heterogeneous group of metastatic tumors, for which the primary tumor cannot be identified despite extensive radiological and pathologic investigations. This raises critical clinical issues, as therapeutic strategies in oncology are primarily based on the determination of tissue of origin, and treatments tailored to the primary site are more effective than unspecific chemotherapy.18
Pathologic analyses are the gold standard approaches for tissue of origin determination, by enabling the detection of tissue-specific antigens by IHC. However, IHC faces several limits, including unequal access to up-to-date panels, lack of specificity of markers, and absence of expression of any informative antigen in poorly differentiated tumors. As a result, no precise hypothesis of putative tissue of origin can be made after extensive IHC profiling in approximately 75% of CUP.19
In this study, we used the largest collection of primary cancers and normal tissues assembled thus far to design a deep learning algorithm to identify tissue of origin based on features derived from whole transcriptomic data. Our classifier reached an overall accuracy of >95% for cancer type prediction. When applied to a series of patients with CUP, TransCUPtomics could predict the likely tissue of origin in 79% of the cases, which was in line with clinical and pathologic tumor characteristics.
Over the last decades, multiple techniques have been developed to identify tumor tissue of origin based on molecular data. Transcriptomic profiling has been widely studied, and several commercial systems are available to predict primary tumor type by using microarrays or RT-qPCR for targeted mRNA or miRNA quantification.20-22 DNA methylation patterns have also been shown to be strongly correlated with tissue of origin and enable the successful identification of likely primary tumors in CUP.23 More recently, targeted DNA profiling24 and whole-genome sequencing25 have also been used for primary tumor type identification.
TransCUPtomics combines whole transcriptomic data analysis and deep learning for primary tumor prediction, and it has several advantages compared with previous methods such as RT-qPCR and microarrays in addition to its high accuracy and proportion of high-confidence predictions for CUP. First, major efforts have been made over the last decade to provide collections of RNA-seq data of most tumor types, enabling the establishment of an unprecedented reference data set of 39 different cancer types, including rare diagnoses. Second, the inclusion of normal tissues in our reference data set minimizes the risk of overclassification of samples based on expression of nonmalignant cells. Third, RNA-seq contains functional information, providing insights into the biological mechanisms underlying classification and allowing identification of genetic and immune signatures for potential therapeutic applications, as opposed to DNA sequencing, targeted RNA-seq, and DNA methylation. Combined with the VAE, a deep learning technique with high potential for biomedical big data analysis, it also allows interpretability and biological insights into the transcriptomic landscape of cancers. Last, RNA-seq enables identification of fusion transcripts and expressed variants useful for primary tumor identification and precision medicine.
We show here that the TransCUPtomics classifier results in a major clinical impact for patients with CUP, with an estimated 65% (24 of 37) of therapeutic alternatives to platinum-based chemotherapy in our retrospective cohort, a significant proportion because >75% of patients with CUP are not offered any second-line systemic therapy.19 This notably included two cases of soft tissue sarcoma, whose diagnosis may sometimes mimic undifferentiated carcinoma due to the expression of cytokeratins by rare tumor cells. Moreover, in eight prospective patients who could receive TransCUPtomics-tailored first-line chemotherapy, no patient experienced tumor progression, and seven of the eight patients exhibited tumor response at 3 months.
Prospective clinical trials investigating the efficacy of tissue-specific systemic treatments determined by molecular
profiling have thus far failed to show a survival benefit for patients with CUP,26,27 probably due to the heterogeneity and overall poor prognosis of tumor types enrolled in these trials. However, initial results on our prospective cohort suggest that using TransCUPtomics may improve the prognosis of individual patients, which should be confirmed in larger prospective studies. Of note, these results are in agreement with the EPICUP study, showing an improved overall survival in patients with CUP receiving tumor-type specific therapy compared with patients treated with empirical approaches.23
Our classifier faces several limits. Despite the attempt to design a reference data set as exhaustive as possible, many rare tumor types are missing, which may result in incorrect classification or absence of classification of samples, as seen in 10 cases. The observation that some of these samples were characterized by an unusually high distance to the nearest neighbor in the 100-dimensional latent space indeed supports the hypothesis that their tumor type of origin may not be represented in the reference data set. Also, some CUP may have lost most of their differentiation characteristics, rendering prediction of tissue of origin intrinsically impossible. However, TransCUPtomics algorithms still give a diagnostic orientation useful for treatment determination in those cases, and genomic features can help refine diagnostic hypotheses.
RNA-seq is becoming cost-effective and increasingly used to guide diagnosis and therapeutic choices in patients with cancer. This technique is widely available, standardized, and rapid: sequencing results could be delivered in a few days and prediction with the trained algorithm within minutes. It is currently routinely used in our national reference center. Thus, our classifier could be easily applied to prospective cohorts of patients and enriched with diverse diagnoses. We emphasize that such a tool will not replace clinical and pathologic diagnoses but is designed to be an additional element to help in the diagnostic workup and therapeutic decision-making, albeit availability of frozen tissue specimens is currently restricted to specialized cancer centers. However, because frozen tissue specimens allow higher quality transcriptomic profiling, and our reference data, including all tumor samples profiled by TCGA, are also from frozen tissue, we expected lower performance for classification of formalin-fixed, paraffin-embedded samples due to batch effect and did not evaluate them in our study. Developments are therefore needed for applying TransCUPtomics to RNA-seq data from formalin-fixed, paraffin-embedded samples.
Artificial intelligence, including classical machine learning and deep learning techniques, is increasingly being used with success for prediction tasks involving high-throughput biomedical data. However, interpretability of the vast majority of these approaches is often hampered by the “black-box” nature of these algorithms. The VAE used in this study is a promising technique to address this shortcoming of artificial intelligence, as it not only allows state-of-the-art predictive performance but also easier interpretation and visualization of its decision, as shown in this study for CUP, compared with other tools that primarily make a prediction without interpretation. Moreover, the VAE enables potential discovery of previously unknown biology by extracting relevant nonlinear features from high-dimensional biological data. It can also be used as a generative model to create realistic synthetic data for further purposes. Altogether, the VAE is a powerful technique from artificial intelligence that exhibits high potential in multiple biomedical contexts and is increasingly being used in tasks as diverse as imaging and pharmacology.28,29
In summary, we present a powerful and interpretable deep learning-based classifier trained on RNA-seq data to identify tissue of origin in CUP. We propose to integrate RNA-seq and TransCUPtomics into the standard management of CUP as a cost-effective aid to pathologists and oncologists, as this widely available and standardized technique may lead to a meaningful improvement of their clinical management.

Acknowledgments

We thank the patients and their family members, and clinicians involved in their care and Maud Kamal for access to the SHIVA01 CUP samples. We also acknowledge support from Institut Curie for sample collection, banking, and processing; the Biological Resource Center and its members; the Unité de Génétique Somatique and its members; and the Department of Pathology and its members.

Supplemental Data

Supplemental material for this article can be found at http://doi.org/10.1016/j.jmoldx.2021.07.009.

References

1. Fizazi K, Greco FA, Pavlidis N, Daugaard G, Oien K, Pentheroudakis G; ESMO Guidelines Committee: Cancers of unknown primary site: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann Oncol 2015, 26(Suppl 5):v133-v138
2. Rassy E, Pavlidis N: Progress in refining the clinical management of cancer of unknown primary in the molecular era. Nat Rev Clin Oncol 2020, 17:541-554
3. Gröschel S, Bommer M, Hutter B, Budczies J, Bonekamp D, Heining C, Horak P, Fröhlich M, Uhrig S, Hübschmann D, Geörg C, Richter D, Pfarr N, Pfütze K, Wolf S, Schirmacher P, Jäger D, von Kalle C, Brors B, Glimm H, Weichert W, Stenzinger A, Fröhling S: Integration of genomics and histology revises diagnosis and enables effective therapy of refractory cancer of unknown primary with PDL1 amplification. Cold Spring Harb Mol Case Stud 2016, 2:a001180
4. Wei IH, Shi Y, Jiang H, Kumar-Sinha C, Chinnaiyan AM: RNA-Seq accurately identifies cancer biomarker signatures to distinguish tissue of origin. Neoplasia 2014, 16:918-927
5. Grewal JK, Tessier-Cloutier B, Jones M, Gakkhar S, Ma Y, Moore R, Mungall AJ, Zhao Y, Taylor MD, Gelmon K, Lim H, Renouf D, Laskin J, Marra M, Yip S, Jones SJM: Application of a neural network
whole transcriptome-based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Netw Open 2019, 2:e192597
6. Xu-Monette ZY, Zhang H, Zhu F, Tzankov A, Bhagat G, Visco C, Dybkaer K, Chiu A, Tam W, Zu Y, Hsi ED, You H, Huh J, Ponzoni M, Ferreri AJM, Moller MB, Parsons BM, van Krieken JH, Piris MA, Winter JN, Hagemeister FB, Shahbaba B, De Dios I, Zhang H, Li Y, Xu B, Albitar M, Young KH: A refined cell-of-origin classifier with targeted NGS and artificial intelligence shows robust predictive value in DLBCL. Blood Adv 2020, 4:3391-3404
7. Menden K, Marouf M, Oller S, Dalmia A, Magruder DS, Kloiber K, Heutink P, Bonn S: Deep learning-based cell composition analysis from tissue expression profiles. Sci Adv 2020, 6:eaba2619
8. Zhao Y, Pan Z, Namburi S, Pattison A, Posner A, Balachander S, Paisie CA, Reddi HV, Rueter J, Gill AJ, Fox S, Raghav KPS, Flynn WF, Tothill RW, Li S, Karuturi RKM, George J: CUP-AI-Dx: a tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine 2020, 61:103030
9. Kingma DP, Welling M: Auto-encoding variational bayes. arXiv 2013, 1312.6114v10
10. Le Tourneau C, Delord J-P, Gonçalves A, Gavoille C, Dubot C, Isambert N, Campone M, Tredan O, Massiani M-A, Mauborgne C, Armanet S, Servant N, Bièche I, Bernard V, Gentien D, Jezequel P, Attignon V, Boyault S, Vincent-Salomon A, Servois V, Sablin M-P, Kamal M, Paoletti X; SHIVA Investigators: Molecularly targeted therapy based on tumour molecular profiling versus conventional therapy for advanced cancer (SHIVA): a multicentre, open-label, proof-of-concept, randomised, controlled phase 2 trial. Lancet Oncol 2015, 16:1324-1334
11. Jiang L, Huang J, Higgs BW, Hu Z, Xiao Z, Yao X, et al: Genomic landscape survey identifies SRSF1 as a key oncodriver in small cell lung cancer. PLoS Genet 2016, 12:e1005895
12. Chan CS, Laddha SV, Lewis PW, Koletsky MS, Robzyk K, Da Silva E, Torres PJ, Untch BR, Li J, Bose P, Chan TA, Klimstra DS, Allis CD, Tang LH: ATRX, DAXX or MEN1 mutant pancreatic neuroendocrine tumors are a distinct alpha-cell signature subgroup. Nat Commun 2018, 9:4158
13. Way GP, Greene CS: Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac Symp Biocomput 2018, 23:80-91
14. McInnes L, Healy J, Melville J: UMAP: uniform manifold approximation and projection for dimension reduction. arXiv 2018, 1802.03426
15. Campbell JD, Yau C, Bowlby R, Liu Y, Brennan K, Fan H, et al: Genomic, pathway network, and immunologic features distinguishing squamous carcinomas. Cell Rep 2018, 23:194-212.e6
16. Ross JS, Wang K, Gay L, Otto GA, White E, Iwanik K, Palmer G, Yelensky R, Lipson DM, Chmielecki J, Erlich RL, Rankin AN, Ali SM, Elvin JA, Morosini D, Miller VA, Stephens PJ: Comprehensive genomic profiling of carcinoma of unknown primary site: new routes to targeted therapies. JAMA Oncol 2015, 1:40-49
17. Cancer Genome Atlas Research Network: Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 2013, 499:43-49
18. Greco FA: Molecular diagnosis of the tissue of origin in cancer of unknown primary site: useful in patient management. Curr Treat Options Oncol 2013, 14:634-642
19. Varadhachary GR, Raber MN: Cancer of unknown primary site. N Engl J Med 2014, 371:757-765
20. Ferracin M, Pedriali M, Veronese A, Zagatti B, Gafà R, Magri E, Lunardi M, Munerato G, Querzoli G, Maestri I, Ulazzi L, Nenci I, Croce CM, Lanza G, Querzoli P, Negrini M: MicroRNA profiling for the identification of cancers with unknown primary tissue-of-origin. J Pathol 2011, 225:43-53
21. Bridgewater J, van Laar R, Floore A, Van’T Veer L: Gene expression profiling may improve diagnosis in patients with carcinoma of unknown primary. Br J Cancer 2008, 98:1425-1430
22. Dos Santos MT, de Souza BF, Cárcano FM, de Oliveira Vidal R, Scapulatempo-Neto C, Viana CR, Carvalho AL: An integrated tool for determining the primary origin site of metastatic tumours. J Clin Pathol 2018, 71:584-593
23. Moran S, Martínez-Cardús A, Sayols S, Musulén E, Balañá C, Estival-Gonzalez A, Moutinho C, Heyn H, Diaz-Lagares A, de Moura MC, Stella GM, Comoglio PM, Ruiz-Miró M, Matias-Guiu X, Pazo-Cid R, Antón A, Lopez-Lopez R, Soler G, Longo F, Guerra I, Fernandez S, Assenov Y, Plass C, Morales R, Carles J, Bowtell D, Mileshkin L, Sia D, Tothill R, Tabernero J, Llovet JM, Esteller M: Epigenetic profiling to classify cancer of unknown primary: a multicentre, retrospective analysis. Lancet Oncol 2016, 17:1386-1395
24. Penson A, Camacho N, Zheng Y, Varghese AM, Al-Ahmadie H, Razavi P, Chandarlapaty S, Vallejo CE, Vakiani E, Gilewski T, Rosenberg JE, Shady M, Tsui DWY, Reales DN, Abeshouse A, Syed A, Zehir A, Schultz N, Ladanyi M, Solit DB, Klimstra DS, Hyman DM, Taylor BS, Berger MF: Development of genome-derived tumor type prediction to inform clinical cancer care. JAMA Oncol 2020, 6:84-91
25. Jiao W, Atwal G, Polak P, Karlic R, Cuppen E, PCAWG Tumor Subtypes and Clinical Translation Working Group, Danyi A, de Ridder J, van Herpen C, Lolkema MP, Steeghs N, Getz G, Morris Q, Stein LD; PCAWG Consortium: A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nat Commun 2020, 11:728
26. Hayashi H, Kurata T, Takiguchi Y, Arai M, Takeda K, Akiyoshi K, Matsumoto K, Onoe T, Mukai H, Matsubara N, Minami H, Toyoda M, Onozawa Y, Ono A, Fujita Y, Sakai K, Koh Y, Takeuchi A, Ohashi Y, Nishio K, Nakagawa K: Randomized phase II trial comparing site-specific treatment based on gene expression profiling with carboplatin and paclitaxel for patients with cancer of unknown primary site. J Clin Oncol 2019, 37:570-579
27. Hainsworth JD, Rubin MS, Spigel DR, Boccia RV, Raby S, Quinn R, Greco FA: Molecular gene expression profiling to predict the tissue of origin and direct site-specific therapy in patients with carcinoma of unknown primary site: a prospective trial of the Sarah Cannon research institute. J Clin Oncol 2013, 31:217-223
28. Jørgensen PB, Schmidt MN, Winther O: Deep generative models for molecular science. Mol Inform 2018, 37
29. Kell DB, Samanta S, Swainston N: Deep learning and generative methods in cheminformatics and chemical biology: navigating small molecule space intelligently. Biochem J 2020, 477:4559-4580
