
Histopathology 2018, 72, 227–238. DOI: 10.1111/his.13333

HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in whole slide images of breast cancer tissues
Talha Qaiser,1,* Abhik Mukherjee,2,* Chaitanya Reddy PB,3 Sai D Munugoti,3 Vamsi Tallam,3 Tomi Pitkäaho,4 Taina Lehtimäki,4 Thomas Naughton,4 Matt Berseth,5 Aníbal Pedraza,6 Ramakrishnan Mukundan,7 Matthew Smith,8 Abhir Bhalerao,1 Erik Rodner,9 Marcel Simon,9 Joachim Denzler,9 Chao-Hui Huang,10,11 Gloria Bueno,6 David Snead,12 Ian O Ellis,2 Mohammad Ilyas2,13 & Nasir Rajpoot1,12
1Department of Computer Science, University of Warwick, Coventry, UK, 2Department of Histopathology, Division of Cancer and Stem Cells, School of Medicine, University of Nottingham, Nottingham, UK, 3Department of Electronics and Electrical Engineering, Indian Institute of Technology, Guwahati, India, 4Department of Computer Science, Maynooth University, Maynooth, Ireland, 5NLP Logix LLC, Jacksonville, FL, USA, 6VISILAB, E.T.S.I.I, University of Castilla-La Mancha, Ciudad Real, Spain, 7Department of Computer Science and Software Engineering, University of Canterbury, Canterbury, New Zealand, 8Department of Statistics, University of Warwick, Coventry, UK, 9Computer Vision Group, Friedrich Schiller University of Jena, Jena, Germany, 10MSD International GmbH, Singapore, 11Singapore Agency for Science, Technology and Research, Singapore, 12Department of Pathology, University Hospitals Coventry and Warwickshire, Coventry, UK, and 13Nottingham Molecular Pathology Node, University of Nottingham, Nottingham, UK

Date of submission 15 May 2017


Accepted for publication 29 July 2017
Published online (Accepted Article) 3 August 2017

Qaiser T, Mukherjee A, Reddy PB C, Munugoti S D, Tallam V, Pitkäaho T, Lehtimäki T, Naughton T, Berseth M, Pedraza A, Mukundan R, Smith M, Bhalerao A, Rodner E, Simon M, Denzler J, Huang C-H, Bueno G, Snead D, Ellis I O, Ilyas M & Rajpoot N
(2018) Histopathology 72, 227–238. https://doi.org/10.1111/his.13333
HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in
whole slide images of breast cancer tissues
Aims: Evaluating expression of the human epidermal growth factor receptor 2 (HER2) by visual examination of immunohistochemistry (IHC) on invasive breast cancer (BCa) is a key part of the diagnostic assessment of BCa due to its recognized importance as a predictive and prognostic marker in clinical practice. However, visual scoring of HER2 is subjective, and consequently prone to interobserver variability. Given the prognostic and therapeutic implications of HER2 scoring, a more objective method is required. In this paper, we report on a recent automated HER2 scoring contest, held in conjunction with the annual PathSoc meeting held in Nottingham in June 2016, aimed at systematically comparing and advancing the state-of-the-art artificial intelligence (AI)-based automated methods for HER2 scoring.
Methods and results: The contest data set comprised digitized whole slide images (WSI) of sections from 86 cases of invasive breast carcinoma stained with both haematoxylin and eosin (H&E) and IHC for HER2. The contesting algorithms predicted scores of the IHC slides automatically for an unseen subset of the data set and the predicted scores were compared with the 'ground truth' (a consensus score from at least two experts). We also report on a simple 'Man versus Machine' contest for the scoring of HER2 and show

Address for correspondence: N Rajpoot and T Qaiser, Department of Computer Science, University of Warwick, UK. e-mails: n.m.rajpoot@
warwick.ac.uk; t.qaiser@warwick.ac.uk
*These authors contributed equally to this study.

© 2017 John Wiley & Sons Ltd.


228 T Qaiser et al.

that the automated methods could beat the pathology experts on this contest data set.
Conclusions: This paper presents a benchmark for comparing the performance of automated algorithms for scoring of HER2. It also demonstrates the enormous potential of automated algorithms in assisting the pathologist with objective IHC scoring.
Keywords: automated HER2 scoring, biomarker quantification, breast cancer, digital pathology, quantitative
immunohistochemistry

Introduction

The adoption of image analysis in digital pathology has received significant attention recently due to the availability of digital slide scanners and the increasing importance of tissue-based biomarkers in stratified medicine.1 Advances in software development and an upwards trend in computational capacity have also caused an upsurge of interest in digital pathology.

Breast cancer (BCa) is the most commonly diagnosed cancer among women, and the second leading cause of death worldwide.2 According to Cancer Research UK, the risk for women of being diagnosed with breast cancer is one in eight in the United Kingdom, and approximately 11 600 women died from breast cancer in 2012.3 In routine diagnostic practice of BCa, tumour tissue is stained with haematoxylin and eosin (H&E) and then examined under the optical microscope for morphological assessment, including grade. In addition, tissues are stained by immunohistochemistry (IHC) to evaluate biomarker expression for prognostic and predictive purposes. This conventional method of diagnosis by visual examination is considered accurate in most areas, but is known to suffer from inter- and intra-observer variability in some areas, such as diagnosis of atypical hyperplasia and reporting of histological grade.4–6 Digital pathology offers significant potential to overcome this subjectivity and improve reproducibility.

The human epidermal growth factor receptor 2 (HER2) gene is amplified in approximately 15–20% of breast cancers.7 Gene amplification can also be identified through fluorescence in-situ hybridization (FISH). Alternatively, as HER2 amplification results in increased protein expression, IHC may be used. Given the technical ease of performing IHC, it has become the preferred test, and FISH is usually performed only when the IHC is equivocal. In practice, an expert histopathologist will report a score between 0 and 3+: cases scoring 0 or 1+ are classified as negative, while cases with a score of 3+ are classed as positive. Cases with score 2+ are classified as equivocal, and are assessed further by FISH to test for gene amplification. Examples of the four different HER2 scores (0 to 3+) are shown in Figure 1. A summary of recommended guidelines for HER2 IHC scoring criteria7 is shown in Table 1.

Historically, up to 20% of HER2 IHC results may contain inaccuracies8 due to variations in technical quality and the subjective nature of scoring. Although adoption of HER2 guidelines and recommendations7 has served to improve standards in HER2 testing, challenging cases remain, especially those with HER2 scores deemed borderline between categories.

Automated IHC scoring of HER2 carries promise to overcome the existing problems in conventional methods. Automated scoring methods are not prone to subjective bias, and can provide precise quantitative analysis which can assist the expert pathologist to reach a reproducible score.

The HER2 scoring contest documented in this paper was organized by the University of Warwick, the University of Nottingham and the Academic–Industrial Collaboration for Digital Pathology (AIDPATH) consortium (www.aidpath.eu). It was held in conjunction with the Pathological Society of Great Britain and Ireland meeting in Nottingham (June 2016) to provide a platform for researchers to assess the performance of computer algorithms for automated HER2 scoring on IHC-stained slides. This paper provides an overview of the automated methods for HER2 scoring as presented at the contest and a 'Man versus Machine' comparison of the degree of agreement among histopathologists and the automated methods for HER2 scoring. This may be considered an initial step towards the development of a reliable computer-assisted diagnosis tool for HER2 scoring of digitized BCa histology slides.

Materials and methods

ETHICS

Ethical approval was granted by Nottingham Research Ethics Committee 2 (Approval no.: REC 2020313); R&D reference (N) 03HI01.
© 2017 John Wiley & Sons Ltd, Histopathology, 72, 227–238.
Automated HER2 scoring challenge contest 2016 229

Figure 1. Left to right: examples of regions of interest (800 μm in height and the same in width) from whole slide images (WSIs) scored 0, 1+ (negative), 2+ (equivocal) and 3+ (positive).

Table 1. Recommended human epidermal growth factor receptor 2 (HER2) scoring criteria for immunohistochemistry (IHC)-stained breast cancer tissue slides7

Score | Cell membrane staining pattern | Staining assessment
0 | No membrane staining, or incomplete membrane staining in <10% of invasive tumour cells | Negative
1+ | Faint/barely perceptible or weak incomplete membrane staining in >10% of tumour cells | Negative
2+ | A weak to moderate complete membrane staining is observed in >10% of tumour cells, or strong complete membrane staining in ≤10% of tumour cells | Borderline (equivocal)
3+ | A strong (intense and uniform) complete membrane staining is observed in >10% of invasive tumour cells | Positive

IMAGE DATA ACQUISITION AND GROUND TRUTH

The histology slides for this contest were scanned on a Hamamatsu NanoZoomer C9600, enabling the image to be viewed from ×4 to ×40 magnification, making the process comparable to a clinician's standard microscope. Generally, WSIs are gigapixel images stored in a multiresolution pyramid structure, where the highest resolution is ×40. The contest data set comprised 172 whole slide images (WSI) extracted from 86 cases of invasive breast carcinoma and included both the H&E- and HER2-stained slides. The actual HER2 scoring is normally performed on the IHC-stained slides, while the H&E slides assist the expert pathologist to identify the areas of invasive tumour and discriminate these from areas of in-situ disease. Figure 2 shows an example of the two types of WSIs (with a corresponding zoomed-in region of interest) from the contest data set.

The ground truth (GT) was taken from the clinical reports issued on the cases at a tertiary referral centre for breast pathology (Nottingham University Hospitals NHS Trust). At this centre, each case had been reported or reviewed by at least two specialist consultant histopathologists as part of their routine practice [preliminary reporting and multidisciplinary team (MDT) review]. The centre provides regular internal quality control for HER2 assessment for immunohistochemistry runs, and contributes and participates regularly in the UK NEQAS (National External Quality Assessment Scheme) for immunocytochemistry and in-situ hybridization (ICC and ISH).

CONTESTANTS

A total of 105 teams from more than 28 countries registered to access the training data set before the registration deadline. By the submission deadline (off-site contest), a total of 18 submissions from 14 teams had been received for evaluation. The organizers provided an opportunity to each of the 14 teams to present their approach at the contest workshop, and six teams chose to present. For the Man versus Machine contest, we received markings from four pathologists. The contest website was re-opened for new submissions after the workshop concluded. Further details regarding the various stages of the contest are described in Data S1 and Table S1.

EVALUATION

The performance of each submitted algorithm was evaluated based on three criteria: (1) agreement points, (2) weighted confidence and (3) combined


points. Each assessment criterion has a separate leaderboard.

The evaluation criteria were rationalized according to the clinical significance and implications of HER2 IHC scoring as follows. In everyday clinical practice, for a score of 0 or 1+ no Herceptin is offered to the patient; for a 3+ score, Herceptin is offered. For an IHC 2+ score, a FISH test is performed; if it is positive (i.e. there is evidence of gene amplification) Herceptin is offered, while for a negative result it is not offered. The evaluation considers the impact of erroneous classification. For example, a score of 0/1+ being interpreted as 3+ or vice versa is a serious error, while a 2+ scored as 0/1+ denies a few patients valid treatment; a score of 3+ for a 2+ case bypasses the FISH test and may treat a few cases erroneously (which would have been FISH-negative) with toxic drugs, while an actual 3+ downgraded to a 2+ calls for the additional expense of FISH testing, but the end result will probably be the same and hence should not be regarded as a serious error. These considerations are summarized in Table 2.

Table 2. (A) Agreement points for predicted calls against the ground truth (GT); (B) bonus point criteria, applied when the predicted percentage of cells with complete cell membrane staining (PCMS) lies within a certain range of the GT value of the PCMS

(A) Points for predicted score

Ground truth | 0 | 1+ | 2+ | 3+
0 | 15 | 15 | 10 | 0
1+ | 15 | 15 | 10 | 0
2+ | 2.5 | 2.5 | 15 | 5
3+ | 0 | 0 | 10 | 15

(B) Bonus points by ground truth score and PCMS

Ground truth score | Bonus points
0 | 0
1+ | 1 (PCMS <3%); 3 (predicted PCMS within ±2% of the GT)
2+ | 5 (within ±5% of the GT); 2.5 (within ±10% of the GT)
3+ | 5 (within ±5% of the GT); 2.5 (within ±10% of the GT)

Figure 2. An example whole slide image (WSI) along with a zoomed-in cross-sectional area showing the tumour region: (A) haematoxylin and eosin (H&E)-stained slide; (B) immunohistochemistry (IHC)-stained slide.

For agreement points, a penalty method was employed whereby each erroneous prediction is penalized with respect to its deviation from the GT, as shown in Table 2A. It can be envisaged that the agreement points may end in a tie, where the accumulative points of two or more teams may be the same. To resolve a tie, a bonus criterion was devised, as shown in Table 2B, where the decision was made on the percentage of cells with complete cell membrane staining (PCMS) regardless of the

intensity. The bonus points were introduced for scores 2+ and 3+ as they attain more clinical significance. For the IHC score 1+, 1 bonus point was awarded if there was an accurate prediction of the IHC score and PCMS <3%, while 3 bonus points were awarded if there was an accurate prediction of the IHC score and PCMS >3% but the predicted PCMS value deviated by no more than 2% from the GT. For the IHC scores 2+ and 3+, 5 bonus points were awarded if there was an accurate prediction of the IHC score and the PCMS deviated by no more than 5% from the GT. Similarly, 2.5 bonus points were awarded for scores 2+ and 3+ if there was an accurate prediction of the IHC score and the PCMS deviated by no more than 10% from the GT.

The weighted confidence was devised to measure the credence of the predicted score by the submitted algorithm. The criteria to measure the weighted confidence wc were distinct for truly and wrongly classified cases. In cases where the predicted HER2 score pS matched the GT with higher confidence c, the weighted confidence amplified the confidence value for a true prediction, whereas wrong predictions with high confidence were penalized accordingly, as given in equation (1). This type of assessment is important for the development of an interactive diagnostic module. The confidence value may indicate those cases or regions where further examination by the experts may be required before concluding the final HER2 score.

    wc = 2c − c²             if pS = GT
    wc = 2/(c² + 1) − 2      otherwise                                  (1)

The third assessment criterion is a combination of both the agreement points and weighted confidence-based evaluations. The combined points were calculated by taking the product of the two assessment criteria for each case.

Results

CONTEST LEADERBOARDS

Comprehensive results comprising all the submissions of automated methods are shown in Table 3. The teams were ranked with respect to the combined point-based assessment with bonus points. For the off-site contest, the total possible points were 420 (28 cases with a maximum of 15 points each), whereas for weighted confidence the maximum points were 28, 1 for each case. The top three-ranked teams with respect to the point-based assessment were Team Indus, MUCS-1 and MUCS-2, whereas according to the weighted confidence assessment the top-ranked teams were VISILAB, FSUJena and MTB NLP. The combined results rank the top three teams in the following order: VISILAB, FSUJena and Huangch. The performance of the top-ranked teams including bonus points, and the trend for total points (without the bonus points), can be seen in Figure 3. MUCS-1, MUCS-3, CS_UCCGIP and MTB NLP achieved equal points, but MUCS-1 secured more bonus points, as their PCMS was more accurate compared with the remaining counterparts. Similarly, Team VISILAB and Rumrocks resulted in a tie where both teams attained equal points, but the VISILAB method was more precise in predicting PCMS. Comprehensive tables for all three leaderboards are available for download from the contest website.

SUMMARY OF PROPOSED AUTOMATED METHODS

Most of the automated methods (described in Data S2 and Figure S1) applied a supervised patch-based classification approach to solve this problem. The most common pipeline was based on three main components: (1) pre-processing, including methods to identify the regions of interest for patch generation; (2) classification based on handcrafted or neural network-learned features; and (3) post-processing techniques to aggregate the HER2 score at WSI level and to estimate the PCMS. Deep learning, especially convolutional neural network (CNN)-based approaches, dominated, as eight of the top 10 methods were based on CNN. The majority of the CNN architectures [Team Indus, MUCS-(1–3), MTB NLP, VISILAB, RumRocks, FSUJena] were inspired by state-of-the-art deep neural networks.9,10

In the pre-processing and patch extraction stage, most of the teams followed conventional thresholding techniques combined with morphological operators. These techniques are computationally inexpensive and generally work well, as background regions lack any texture content in contrast with other tissue components. MUCS-(1–3), MTB NLP, VISILAB and FSUJena probed the regions of interest manually, through calibration or customized methodologies. These methods aimed to pick the best possible regions for training their algorithms, generally without affecting the testing phase. To segment tissue regions, the RumRocks team implemented a deconvolutional neural network (DCNN) and a two-dimensional CNN for selection of patches based on their texture. The Huangch team performed mean filtering and stain normalization using the control tissue intensity values to calibrate the stain colour intensity as a pre-processing step.
© 2017 John Wiley & Sons Ltd, Histopathology, 72, 227–238.
232 T Qaiser et al.

Table 3. A summary of results on all three assessment criteria for the automated human epidermal growth factor receptor 2 (HER2) scoring contest, ordered by the combined points criterion

Team | Affiliation | Points | Points + bonus | Weighted confidence | Combined
VISILAB | Universidad de Castilla-La Mancha | 382.5 | 404.5 | 23.552 | 348.041
FSUJena | Computer Vision Group, University of Jena | 370 | 392 | 23 | 345
HUANGCH | Bioinformatics Institute, Singapore | 377.5 | 391.5 | 22.622 | 335.77
MTB NLP | NLP Logix, LLC | 390 | 405.5 | 22.937 | 335.737
VISILAB (density) | Universidad de Castilla-La Mancha | 377.5 | 391 | 21.878 | 322.067
Team Indus | Indian Institute of Technology Guwahati | 402.5 | 425 | 18.451 | 321.414
UC-CSSE-CGIP group | University of Canterbury, New Zealand | 390 | 395 | 21.07 | 316.05
MUCS-3 | Computer Science, Maynooth University | 390 | 411 | 20.434 | 300.813
HERcules | University of Oxford | 360 | 380 | 20.572 | 295.633
MUCS-2 | Computer Science, Maynooth University | 385 | 413 | 19.51 | 290.171
Rumrocks | Department of Statistics, University of Warwick | 382.5 | 395 | 19.649 | 277.705
TissueGnostics | TissueGnostics GmbH, Austria | 365 | 366 | 17.78 | 266.41
Team Indus (Stainsep) | Indian Institute of Technology Guwahati | 332.5 | 345.5 | 18.451 | 250.715
MUCS-1 | Computer Science, Maynooth University | 390 | 416 | 16.765 | 248.876
HersRockers | Indian Institute of Technology Guwahati | 320 | 330 | 17.318 | 223.007
VIP-UGR | University of Granada | 305 | 322.5 | 15.41 | 211.748
TartanSight | Computational Biology, CMU | 230 | 230 | 15.148 | 159.745
Cancer_Detector | Indian Institute of Technology Kanpur | 255 | 260 | 12.994 | 138.962
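To make the relationship between the three leaderboard columns concrete, the scoring scheme can be sketched in code. This is our own minimal reading of Table 2A and equation (1), with the combined criterion taken as the sum over cases of the per-case product; it is not the organizers' evaluation script:

```python
# Sketch of the contest's three assessment criteria (our reading, not the
# official evaluation code). Agreement points follow Table 2A; weighted
# confidence follows equation (1); the combined criterion multiplies the
# two for each case and sums over cases.

AGREEMENT = {  # Table 2A: AGREEMENT[ground_truth][predicted_score]
    "0":  {"0": 15,  "1+": 15,  "2+": 10, "3+": 0},
    "1+": {"0": 15,  "1+": 15,  "2+": 10, "3+": 0},
    "2+": {"0": 2.5, "1+": 2.5, "2+": 15, "3+": 5},
    "3+": {"0": 0,   "1+": 0,   "2+": 10, "3+": 15},
}

def weighted_confidence(predicted, truth, c):
    """Equation (1): reward confident correct calls, penalize confident errors.

    c is the algorithm's confidence in [0, 1]; the result lies in [0, 1] for
    a correct prediction and in [-1, 0] for a wrong one.
    """
    return 2 * c - c ** 2 if predicted == truth else 2 / (c ** 2 + 1) - 2

def evaluate(cases):
    """cases: iterable of (ground_truth, predicted, confidence) triples."""
    points = sum(AGREEMENT[gt][p] for gt, p, _ in cases)
    wc = sum(weighted_confidence(p, gt, c) for gt, p, c in cases)
    combined = sum(AGREEMENT[gt][p] * weighted_confidence(p, gt, c)
                   for gt, p, c in cases)
    return points, wc, combined

# A confident correct 3+ call and a confident wrong 2+ -> 3+ call.
print(evaluate([("3+", "3+", 1.0), ("2+", "3+", 0.9)]))
```

Under this reading, a perfectly confident correct call contributes 15 agreement points and a weighted confidence of 1, matching the per-case maxima quoted in the Results (420 and 28 points over the 28 off-site cases).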

Figure 3. Combined results for the top-ranked teams with respect to agreement and bonus points ('Points' and 'Points + bonus' per team). The trend shows the significance of correctly predicting the percentage of cell membrane staining. [Bar chart omitted.]
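The thresholding-based tissue detection that most teams used for pre-processing can be illustrated with a toy sketch. This is purely our own simplified illustration (pure Python on a small grey-level grid, with an assumed background threshold of 200), not any team's implementation:

```python
# Toy sketch (our illustration, not contest code) of the common pre-processing
# step: treat bright pixels as background, then keep only patches that contain
# enough tissue to be worth sending to a classifier.

def tissue_mask(gray, threshold=200):
    """Pixels darker than the (assumed) threshold are treated as tissue."""
    return [[pixel < threshold for pixel in row] for row in gray]

def tissue_patches(gray, patch=2, min_tissue=0.5, threshold=200):
    """Yield (row, col) origins of patches whose tissue fraction >= min_tissue."""
    mask = tissue_mask(gray, threshold)
    height, width = len(mask), len(mask[0])
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            window = [mask[r + i][c + j]
                      for i in range(patch) for j in range(patch)]
            if sum(window) / len(window) >= min_tissue:
                yield r, c

# 4x4 toy image: left half dark (tissue), right half bright (background).
image = [[50, 60, 250, 255],
         [55, 65, 250, 255],
         [52, 58, 245, 250],
         [54, 62, 248, 252]]
print(list(tissue_patches(image)))  # -> [(0, 0), (2, 0)]
```

In practice the teams operated on RGB whole-slide tiles and cleaned the mask with morphological operators before patch selection; the grid version above only shows the selection logic.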

In the second step, most of the teams (specifically the top 10) employed deep learning approaches, whereas other teams such as CS_UCCGIP and Huangch derived handcrafted characteristic curves and employed standard machine learning approaches. Team Indus used a combination of data-driven and handcrafted features. They incorporated the average control tissue intensity value along with learned

feature maps before passing them to the fully connected layers. Some of the top-ranked teams deployed variants of AlexNet9 and GoogLeNet10 for predicting the HER2 score. The FSUJena team computed bilinear features after retrieving activations from convolutional layers of AlexNet. The derived activations contain the learned feature maps representing a d-dimensional w × h spatial grid. This approach enabled them to perform their analysis on top of the learned feature maps from the CNN. In combination with standard approaches for data regularization, MTB NLP and RumRocks trained multiple models; the final HER2 score and PCMS were estimated by averaging over all the models. Additionally, a wide range of data augmentation and regularization techniques was employed to overcome overfitting. In practice, standard data augmentation techniques such as affine transformations (e.g. rotation, flip, translation), random cropping, blurring and elastic deformations were applied to train the networks. MUCS-2, MTB NLP and RumRocks used data augmentation broadly to help their networks generalize well on unseen data.

In the final stage of post-processing and predicting the PCMS, most of the teams employed standard image processing and machine learning approaches on top of the results attained from the previous step. A Random Forest classifier was trained by MTB NLP to produce the final class probabilities and to estimate the PCMS. FSUJena simply used the mean tumour cell percentage seen in the training set for a particular class as an estimate. Team Indus used both the IHC- and H&E-stained slides to estimate the PCMS using standard image processing approaches such as contour detection, thresholding and morphological features. All the remaining teams limited their analysis to the IHC-stained images only. All the submissions used high-magnification images (×10 or above), except MUCS and Rumrocks, who used low-resolution images for the selection of ROIs.

MAN VERSUS MACHINE EVENT

Organization

One way of evaluating the automated algorithms for IHC (HER2) scoring is to perform a comparative analysis of the assessments of expert pathologists and automated methods for a handful of cases against the scores for those cases as agreed by at least two consultant breast pathologists (GT). On the day of the contest workshop, we organized an event called Man versus Machine. The main aim of this event was to analyse the performance of the automatic methods and to explore the disagreements among conventional and automatic methods. This type of analysis can lead us to a more sophisticated protocol for automatic HER2 scoring and help overcome the inter- and intra-observer disagreement found in normal practice.

The analysis between the experts' agreement and the evaluation of the automatic HER2 scoring methods was performed on a subset (15 cases) of the off-site test data set. For this event, we set up an online webpage for the pathologists. The webpage enabled the experts to load and navigate (including pan and zoom) through the WSIs of those cases. Both the IHC (HER2)- and H&E-stained digital images were made available to mimic the conventional scoring environment. On the contest day at the Pathological Society meeting 2016, we requested the expert pathologists to score each case by providing the HER2 score, the PCMS and a confidence value.

Man versus Machine results comparison

Table 4 summarizes the overall evaluation scores achieved by each participant in this event. Each table entry gives the cumulative score over all 15 cases, which indicates overall performance. The agreement-points-based assessment was used to evaluate performance for this event. In total, we received four responses from expert pathologists and, as shown in Table 4, we ranked the top six submissions, including the top three automated methods. From the submitted responses, three participant pathologists reported themselves as 'consultant pathologist' and one as 'trainee pathologist', and all three marked breast pathology as a subspeciality.

Table 4. Summary results for the Man versus Machine event. The evaluation was carried out according to the contest criteria as described in the Evaluation section

Rank | Team name | Score | Bonus | Score + bonus
1 | Team Indus | 220 | 12.5 | 232.5
2 | Expert 2 | 210 | 20.5 | 230.5
3 | VISILAB | 212.5 | 15 | 227.5
4 | MUCS-1 | 205 | 20.5 | 225.5
5 | Expert 1 | 185 | 10 | 195
6 | Expert 3 | 180 | 13 | 193

As can be seen in Table 4, one of the automated methods slightly outperformed the top-performing participant pathologist. These results point to the potential significance of automated scoring methods

and the recent advancements in digital pathology. It is worth mentioning that the automated HER2 scoring algorithms submitted in this contest are not ready to deploy in their current form, as they will require extensive validation on a significantly larger-scale data set and also a great deal of input from experts to prepare the GT for that larger data set.

Table 5 shows the pooled data for HER2 scoring among the three top-ranked automated methods and the scores from the three participant pathologists, and a comparison with the GT. Table 5 was determined for the 15 cases selected from the off-site contest data set. On the basis of HER2 scores, 100% agreement with the GT was observed for score 3+ among the participant pathologists and the automated methods. For the scores of 1+ and 2+, there were disparities between the GT and the new scores. In all cases except one, for both man and machine, the error resulted from overcalling the score. Thus, for score 1+, six of nine (67%) were overcalled as 2+ by humans, while four of nine (44%) were overcalled by the machine algorithms. For the score of 2+, seven of 15 (46%) were overcalled as 3+ by humans, while machines overcalled one of 15 (6%) as 3+ and undercalled one of 15 (6%) as 1+. Clinically, a score of 2+ is critical, as in routine practice cases of score 2+ are recommended to undergo FISH testing. It is equally important to avoid predicting score 2+ cases as 1+ or 0, as such an erroneous prediction would deny the further assessment of HER2. As can be seen in Table 5, none of the cases with score 2+ was misclassified by the participant pathologists as either 1+ or 0, whereas for one of the cases an automated method wrongly predicted a score of 2+ as 1+.

Most of the incorrect predictions by the participant pathologists were found to be in cases where there was considerable heterogeneity. Two such examples are shown in Figure 4A–D. In tumour cells of HER2 score 2+, a pattern of weak to moderate complete membrane staining is observed, whereas for score 3+ an intense (uniform) complete membrane staining is observed. Estimating the complete membrane staining is a difficult and highly subjective process, especially for scores 2+ and 3+, as it is extremely difficult to detect subtle differences in the morphological appearance of those cases.

Discussion

A major aim of organizing this contest was to provide a platform for computer scientists and researchers to contribute and to evaluate the performance of their computer algorithms for automated IHC scoring of HER2 in images from BCa tissue slides. Automated scoring can significantly overcome the subjectivity arising from the varying standards adopted by different diagnostic laboratories. There is a current wealth of literature11,12 using individual platforms (both freely and commercially available) for digital analysis of HER2 in BCa. This, however, was the first comparison of platforms and algorithms, and provides a pilot for independent comparison of computing algorithms for HER2 assessment on a benchmark data set. The contest highlights the wealth of potential carried by artificial intelligence (AI) techniques for the assessment of IHC slides.

The contest 'training data set' was selected deliberately such that it contained a reasonable number of cases from all HER2 scores, bearing in mind the need for the training algorithms to learn features for each score. For the test data set (both off- and on-site), the GT was withheld at the time of image evaluation. Results showed that the automated analysis performed comparably to histopathologists. Many of the algorithms achieved high accuracy, often close to the maximum. Our main objective was to analyse the performance of algorithms based on clinical relevance, and hence the three particular evaluation criteria described above were chosen. It is possible that other assessment criteria might influence the ranking of comparative results.

The data from the Man versus Machine comparison showed that, reassuringly, all participants (whether human or computer) correctly identified cases with a GT score of 3+. This means that no one in that category would have been denied treatment. Similarly, for the cases with a score of 0 or 1+, although there was some overcalling, this never exceeded 2+ and thus none would have received treatment without further testing. The most problematic category was, not unexpectedly, cases with a score of 2+, in both the human and machine evaluations. If overcalled as 3+, the FISH-negative subset would be overtreated. The GT information for the FISH results was not released to the participants, as the contest was aimed only at comparing interpretation of HER2 IHC results. Hence, most of the automated algorithms aimed at predicting the equivocal cases as 2+. Table 5 incorporates the FISH results for all the cases that were marked as 2+ in the test data GT (including the Man versus Machine data set). From the Man versus Machine cases (15 in total), a score of 2+ (subsequently FISH-negative) was overcalled by the machine as 3+ in only one instance (VISILAB). In contrast, on three

Table 5. Combined matrix for agreement among the three experts and the top three automated methods, based on agreement points against the ground truth (GT) scores, for 15 cases in the Man versus Machine event. Borderline case 7 was deemed negative, and cases 16 and 19 were deemed positive for the treatment decision (based on the human epidermal growth factor receptor 2:chromosome 17 centromere (HER2:CEP17) amplification ratios for HER2 over-expression: 1.96, 2.1 and 2.07, respectively).

Case Ground truth FISH results Expert 1 Expert 2 Expert 3 Team Indus Visilab MUCS-1

1 2+ Negative 3+ 2+ 2+ 2+ 2+ 2+

2 0 – 0 1+ 1+ 1+ 1+ 0

3 3+ – 3+ 3+ 3+ 3+ 3+ 3+

4 0 – 1+ 1+ 1+ 0 1+ 1+

5 1+ – 2+ 1+ 2+ 1+ 2+ 1+

6 3+ – 3+ 3+ 3+ 3+ 3+ 3+

7 2+ Borderline amplified 3+ 3+ 3+ 2+ 2+ 2+

8 2+ Negative 3+ 2+ 3+ 2+ 3+ 2+

9 3+ – 3+ 3+ 3+ 3+ 3+ 3+

10 3+ – 3+ 3+ 3+ 3+ 3+ 3+

11 1+ – 1+ 1+ 2+ 0 1+ 1+

12 2+ Positive 2+ 2+ 3+ 2+ 2+ 2+

13 1+ – 2+ 2+ 2+ 2+ 2+ 1+

14 2+ Negative 2+ 2+ 2+ 2+ 2+ 1+

15 0 – 0 1+ 0 0 1+ 0

16 2+ Borderline amplified – – – 0 1+ 2+

17 2+ Negative – – – 2+ 2+ 2+

18 2+ Positive – – – 2+ 1+ 2+

19 2+ Borderline amplified – – – 2+ 2+ 2+

20 1+ – – – – 1+ 1+ 1+

21 1+ – – – – 1+ 1+ 2+

22 0 – – – – 1+ 0 1+

23 1+ – – – – 0 1+ 1+

24 1+ – – – – 0 1+ 2+

25 3+ – – – – 3+ 3+ 3+

26 0 – – – – 1+ 0 1+

27 0 – – – – 0 0 1+

28 0 – – – – 0 0 0

FISH, Fluorescence in-situ hybridization.
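The dichotomisation behind the borderline ratios in the table caption can be sketched in a few lines of Python. The 2.0 cut-off is the commonly used HER2:CEP17 amplification convention, and the function itself is an illustrative sketch, not code from any contest entry.

```python
def fish_amplified(her2_cep17_ratio, threshold=2.0):
    """Classify a FISH result from the HER2:CEP17 ratio.

    A ratio at or above the threshold is reported as amplified. The 2.0
    cut-off is the common convention; ratios near it (such as 1.96, 2.1
    and 2.07 for the borderline cases above) show how narrowly such
    cases fall on either side of the decision boundary.
    """
    return her2_cep17_ratio >= threshold

# Borderline cases from the Table 5 caption:
print(fish_amplified(1.96))  # case 7, deemed negative  -> False
print(fish_amplified(2.10))  # case 16, deemed positive -> True
print(fish_amplified(2.07))  # case 19, deemed positive -> True
```

This makes concrete why the three borderline cases are clinically contentious: all lie within a few hundredths of the threshold.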

occasions (subsequently FISH-negative) the participant pathologists overcalled the score 2+ as 3+. Moreover, for the remaining test data set (13 cases), in three instances the score of 2+ (subsequently FISH-positive) was predicted erroneously as either 1+ or 0 by the automated algorithms. Overall, the
236 T Qaiser et al.

Figure 4. Examples showing immunohistochemistry (IHC)-stained whole slide images (WSIs) (A,C) and zoomed-in cross-sectional areas (B,D) with corresponding human epidermal growth factor receptor 2 (HER2) ground truth (GT) scores marked by expert pathologists and predictions from the top automated methods. Scores for panels A/B: GT 2; Expert 1, 3; Expert 2, 2; Expert 3, 3; Team Indus, 2; MUCS-1, 3; VISILAB, 2. Scores for panels C/D: GT 2; Expert 1, 2; Expert 2, 2; Expert 3, 3; Team Indus, 2; MUCS-1, 2; VISILAB, 2.
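The scoring criteria discussed in this paper (weak to moderate complete membrane staining for 2+, intense uniform complete membrane staining for 3+) can be made concrete with a minimal rule-based sketch. The >10% threshold and the per-cell attributes are illustrative assumptions in the spirit of published guidelines, not the algorithm of any participant.

```python
def her2_ihc_score(cells, threshold=0.10):
    """Toy slide-level HER2 IHC score from per-cell membrane readings.

    Each cell is a (complete, intensity) tuple: `complete` is True when
    membrane staining encircles the whole cell, and `intensity` is one of
    'none', 'weak', 'moderate' or 'strong'. The >10% rule and the per-cell
    attributes are illustrative assumptions, not any contest algorithm.
    """
    n = len(cells)

    def fraction(predicate):
        return sum(1 for cell in cells if predicate(cell)) / n

    if fraction(lambda c: c[0] and c[1] == 'strong') > threshold:
        return '3+'  # intense, uniform complete membrane staining
    if fraction(lambda c: c[0] and c[1] in ('weak', 'moderate')) > threshold:
        return '2+'  # weak to moderate complete membrane staining
    if fraction(lambda c: c[1] != 'none') > threshold:
        return '1+'  # faint or partial staining only
    return '0'
```

Even this toy version shows where the subjectivity enters: the 2+/3+ boundary depends entirely on how "complete" and "strong" are judged per cell, which is exactly the estimation the text describes as difficult for human observers.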

results indicate that further fine-tuning will be required for 2+ cases with AI. While it is encouraging that automated HER2 scoring algorithms may have sufficient potential as a direct comparison to human diagnosis, it is probably worthwhile to reflect that the number of pathologists actually joining the contest was small (only four) and it would have been better to compare the pathologist's assessment of the slides on a reporting microscope rather than a computer for a fairer comparison to real-life practice.

Conventionally, expert pathologists often switch back and forth between the IHC and H&E slides to map the invasive tumour regions for estimating the percentage of complete membrane staining. With the exception of one of the participants (Team Indus), most of the algorithms reported in this paper have avoided the use of H&E slides, although the use of H&E slides for the automatic detection of ductal carcinoma in situ (DCIS) regions cannot be ruled out. In addition, the task of predicting the PCMS is extremely subjective, as the expert has to make an estimation on the basis of the physical appearance of the stained invasive tumour region. The semi-automated methods could provide a comprehensive quantitative analysis on the selected region of interest to assist the experts in estimating the PCMS and HER2 score, especially in borderline cases. As HER2 immunoscoring relies not only on intensity but the completeness of membrane

positivity, automated scoring may be helpful, as demonstrated by Brügmann et al.,13 who proposed scoring of HER2 based on an algorithm evaluating the cell membrane connectivity.

This study shows that automated IHC scoring algorithms can provide a quantitative assessment of morphological features that can assist in objective computer-assisted diagnosis and predictive modelling of the outcome and survival.14 We have demonstrated the potential significance of digital imaging and automated tools in histopathology. In the context of breast histopathology, whereby almost all the invasive tumour cases are considered for HER2 testing, an automated or semi-automated scoring method has potential for deployment in routine practice. Despite all these advances, several challenges remain for the AI algorithms to be optimized and become part of routine diagnosis. It is worth noting that serious optimization will be needed for automated methods while processing a whole-slide image. Some methods required more than 3 h per case which, in the 'real world' of diagnostic service delivery, is not feasible. Another limitation of this contest was that the image data were collected from a single site using a single scanner. A potential extension would be to collect data from multiple pathology laboratories with HER2 scores marked by different experts and images scanned using a variety of different machines. This would also test the differences inherent in staining quality that may affect such procedures. Such enhancements could significantly overcome the overfitting to one particular data set that may occur in the automated scoring methods. In moving across systems, other laboratories, for example, have acknowledged the challenges in reaching the optimum Aperio algorithm parameters to provide results that were equivalent to those of the 'automated cellular imaging system' (ACIS) or 'cell analysis system' (CAS 200) quantitation systems,15 which are fully automated environments for detecting cells based on intensity characteristics and handcrafted features found in IHC-stained images. Therefore, there is a need to learn across comparative systems, for which the current study provided a valid starting-point. Also, the study highlights the need for dialogue between histopathologists and informaticians to understand the correct identification of tissue compartments relevant for assessment, correct morphology (normal versus in-situ versus invasive) and stromal versus tumour stain. Algorithms will also need to be trained to the natural acceptable variation in staining hues and intensities (intra- and interlaboratory) to work effectively during routine practice.

All cases with score 2+ are recommended routinely for further FISH testing to validate HER2 overexpression at the gene level. It would be an added advantage if the automated methods could be trained with FISH GT to predict the final outcome, and the potential for automated algorithms in calling the actual final HER2 status with reproducible accuracy could be demonstrated. For this, a larger series with 2+ cases alone with FISH data would need to be tested. Indeed, there have been other promising studies that indicate that automated image analysis for HER2 instead of manual assessment may reduce the need for supplementary FISH testing by up to 68%.16 In a diagnostic setting, this would reduce costs and turnaround time significantly. During the last decade, IHC staining has become ubiquitous in pathology laboratories globally, and the role of IHC evaluation in a high-throughput setting becomes key for IHC-based companion diagnostics. Other possible extensions of digital pathology could be to automate the assessment of overexpression of the programmed death 1 (PD-1) receptor and its ligand (PD-L1), and to evaluate anaplastic lymphoma kinase (ALK) protein and proto-oncogene tyrosine-protein kinase ROS1 in lung cancers.17 The AI-based algorithms would be more effective if IHC staining and scoring methods were treated as a composite assay.18,19 The varying staining protocols and scoring parameters may restrain the effectiveness of AI-based automated scoring algorithms, including HER2 scoring, but with sufficiently variable data from different centres AI algorithms could be trained to overcome that problem.

This contest provides a baseline for computer science and computational pathology researchers for automated/semi-automated scoring and computer-assisted diagnosis (CAD) tools to assist the pathologists in daily routine analysis. The contest is now over but the registration and the web-portal will remain open for future participants to make novel contributions to automated HER2 scoring.

Acknowledgements

The first author (T.Q.) acknowledges the financial support provided by the University Hospital Coventry Warwickshire (UHCW) and the Department of Computer Science at Warwick. The VISILAB team (A.P. and G.B.) and UNOTT (M.I. and A.M.) acknowledge financial support from the European Project AIDPATH (no.: 612471); http://aidpath.eu/. The MUCS team wishes to acknowledge John McDonald and Ronan Reilly for their valuable contributions to the research,

and acknowledge financial support from Science Foundation Ireland (SFI) under grant no. 13/CDA/2224 and an Irish Research Council (IRC) Postgraduate Scholarship. Co-first author Dr. Mukherjee would also like to thank the NIHR and the Pathological Society of Great Britain and Ireland for support. We are also grateful to Dr. Nicholas Trahearn for his input in deriving the weighted confidence evaluation measure.

Conflicts of interest

None.

References

1. Hamilton PW, Bankhead P, Wang Y et al. Digital pathology and image analysis in tissue biomarker research. Methods 2014; 70; 59–73.
2. Ma J, Jemal A. Breast cancer statistics. In Ahmed A ed. Breast cancer metastasis and drug resistance. New York, NY: Springer New York, 2013; 1–18.
3. Breast Cancer Statistics, Cancer Research UK. Available at: http://www.cancerresearchuk.org/cancer-info/cancerstats/types/breast/ (accessed 12/09/2017).
4. Smits AJJ, Kummer JA, de Bruin PC et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod. Pathol. 2014; 27; 168–174.
5. Viray H, Li K, Long TA et al. A prospective, multi-institutional diagnostic trial to determine pathologist accuracy in estimation of percentage of malignant cells. Arch. Pathol. Lab. Med. 2013; 137; 1545–1549.
6. Rakha EA, Bennett RL, Coleman D et al. Review of the national external quality assessment (EQA) scheme for breast pathology in the UK. J. Clin. Pathol. 2017; 70; 51–57.
7. Rakha EA, Pinder SE, Bartlett JMS et al. Updated UK recommendations for HER2 assessment in breast cancer. J. Clin. Pathol. 2015; 68; 93–99.
8. Wolff AC, Hammond MEH, Schwartz JN et al. Reply to Vang Nielsen, et al. and to Raji. J. Clin. Oncol. 2007; 25; 4021–4023.
9. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In Pereira F, Burges CJC, Bottou L et al. eds. Advances in neural information processing systems 25. Red Hook, NY: Curran Associates Inc., 2012; 1097–1105.
10. Szegedy C, Liu W, Jia Y et al. Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA: 2015; 1–9.
11. Gavrielides MA, Conway C, O'Flaherty N et al. Observer performance in the use of digital and optical microscopy for the interpretation of tissue-based biomarkers. Anal. Cell. Pathol. 2014; 2014; 1–10.
12. Tuominen VJ, Tolonen TT, Isola J. ImmunoMembrane: a publicly available web application for digital image analysis of HER2 immunohistochemistry. Histopathology 2012; 60; 758–767.
13. Brügmann A, Eld M, Lelkaitis G et al. Digital image analysis of membrane connectivity is a robust measure of HER2 immunostains. Breast Cancer Res. Treat. 2012; 132; 41–49.
14. Chen J-M, Qu A-P, Wang L-W et al. New breast cancer prognostic factors identified by computer-aided image analysis of HE stained histopathology images. Sci. Rep. 2015; 5; 10690.
15. Farris AB, Cohen C, Rogers TE et al. Whole slide imaging for analytical anatomic pathology and telepathology: practical applications today, promises, and perils. Arch. Pathol. Lab. Med. 2017; 141; 542–550.
16. Holten-Rossing H, Møller Talman M-L, Kristensson M et al. Optimizing HER2 assessment in breast cancer: application of automated image analysis. Breast Cancer Res. Treat. 2015; 152; 367–375.
17. Shtivelman E, Hensing T, Simon GR et al. Molecular pathways and therapeutic targets in lung cancer. Oncotarget 2014; 5; 1392.
18. Taylor CR. Predictive biomarkers and companion diagnostics. The future of immunohistochemistry: 'in situ proteomics', or just a 'stain'? Appl. Immunohistochem. Mol. Morphol. 2014; 22; 555–561.
19. Ilie M, Hofman V, Dietel M et al. Assessment of the PD-L1 status by immunohistochemistry: challenges and perspectives for therapeutic strategies in lung cancer patients. Virchows Arch. Int. J. Pathol. 2016; 468; 511–525.

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Data S1. Contest format.
Table S1. The ground truth score for 52 cases from the training data set with the percentage of cells with complete membrane staining. The borderline case 63 was deemed negative and the amplification ratio for HER2 over-expression was 1.92.
Data S2. Description of automated methods.
Figure S1. Characteristic curves and the corresponding HER2 score. The x-axis denotes the range of the saturation value, whereas the y-axis denotes the calculated percentage from the saturation limits. The predicted HER2 scores are also shown for each curve.
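The quantity described in the Figure S1 legend, a percentage of pixels whose saturation falls within given limits, can be sketched as follows. The limit values and the input format here are assumptions for illustration only, not the parameters of any contest entry.

```python
def saturation_percentage(saturation_values, lower=0.2, upper=1.0):
    """Percentage of pixels whose HSV saturation lies within given limits.

    `saturation_values` holds per-pixel saturations in [0, 1]; the default
    limits are illustrative placeholders, not values used in the contest.
    """
    inside = sum(1 for s in saturation_values if lower <= s <= upper)
    return 100.0 * inside / len(saturation_values)
```

Sweeping `lower` over a range of saturation limits and recording the resulting percentage traces out a characteristic curve of the kind the legend describes.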