Address for correspondence: N Rajpoot and T Qaiser, Department of Computer Science, University of Warwick, UK. e-mails: n.m.rajpoot@warwick.ac.uk; t.qaiser@warwick.ac.uk
*These authors contributed equally to this study.
that the automated methods could beat the pathology experts on this contest data set.
Conclusions: This paper presents a benchmark for comparing the performance of automated algorithms for scoring of HER2. It also demonstrates the enormous potential of automated algorithms in assisting the pathologist with objective IHC scoring.
Keywords: automated HER2 scoring, biomarker quantification, breast cancer, digital pathology, quantitative immunohistochemistry
Figure 1. Left to right: examples of regions of interest (800 μm in height and the same in width) from whole slide images (WSIs) scored 0, 1+ (negative), 2+ (equivocal) and 3+ (positive).
of WSIs (with a corresponding zoomed-in region of interest) from the contest data set.

The ground truth (GT) was taken from the clinical reports issued on the cases at a tertiary referral centre for breast pathology (Nottingham University Hospitals, NHS Trust). At this centre, each case had been reported or reviewed by at least two specialist consultant histopathologists as part of their routine practice [preliminary reporting and multidisciplinary team (MDT) review]. The centre provides regular internal quality control for HER2 assessment for immunohistochemistry runs, and contributes and participates regularly in the UK NEQAS (National External Quality Assessment Scheme) for immunocytochemistry and in-situ hybridization (ICC and ISH).

Table 1. Recommended automated human epidermal growth factor receptor 2 (HER2) scoring criteria for immunohistochemistry (IHC)-stained breast cancer tissue slides7

Score | Cell membrane staining pattern | Staining assessment
0, 1+ | No membrane staining, or incomplete membrane staining in <10% of invasive tumour cells (0), or faint/barely perceptible or weak incomplete membrane staining in ≥10% of tumour cells (1+) | Negative
2+ | A weak to moderate complete membrane staining is observed in >10% of tumour cells, or strong complete membrane staining in ≤10% of tumour cells | Borderline (equivocal)
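To make the criteria in Table 1 concrete, the short Python sketch below maps membrane-staining measurements to an IHC score. It is a minimal illustration rather than the contest's reference implementation, and it assumes the standard positive rule (strong complete membrane staining in >10% of invasive tumour cells scores 3+), as in the UK recommendations,7 since that row of the table is not reproduced above.

```python
def her2_ihc_score(pct_complete: float, pct_incomplete: float, intensity: str) -> str:
    """Map membrane-staining measurements to a HER2 IHC score (after Table 1).

    pct_complete   -- % of invasive tumour cells with complete membrane staining
    pct_incomplete -- % of cells with faint/weak incomplete membrane staining
    intensity      -- dominant staining intensity: 'none', 'weak', 'moderate' or 'strong'
    """
    if intensity == 'strong' and pct_complete > 10:
        return '3+'   # assumed standard rule for a positive case
    if pct_complete > 10 or (intensity == 'strong' and 0 < pct_complete <= 10):
        return '2+'   # weak/moderate complete staining in >10%, or strong complete in <=10%
    if pct_incomplete >= 10:
        return '1+'   # faint or weak incomplete staining in >=10% of tumour cells
    return '0'        # no staining, or incomplete staining in <10% of invasive cells

# A case with weak-to-moderate complete staining in 40% of tumour cells is equivocal
print(her2_ihc_score(pct_complete=40, pct_incomplete=0, intensity='moderate'))  # '2+'
```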
Table 2. (A) Agreement points awarded for each combination of ground truth (rows) and predicted (columns) HER2 score. (B) Bonus points awarded for an accurate prediction of the percentage of cells with complete membrane staining (PCMS).

(A)
GT \ Predicted | 0 | 1+ | 2+ | 3+
0 | 15 | 15 | 10 | 0
1+ | 15 | 15 | 10 | 0
2+ | 2.5 | 2.5 | 15 | 5
3+ | 0 | 0 | 10 | 15

(B)
0 | 0 | 0
1+ | 1 (PCMS < 3%) | 3 (PCMS within 2% of GT)
2+ | 5 (PCMS within 5% of GT) | 2.5 (PCMS within 10% of GT)
3+ | 5 (PCMS within 5% of GT) | 2.5 (PCMS within 10% of GT)
intensity. The bonus points were introduced for scores 2+ and 3+ as they attain more clinical significance. For the IHC score 1+, 1 bonus point was awarded if there was an accurate prediction of the IHC score and the PCMS was <3%, while 3 bonus points were awarded if there was an accurate prediction of the IHC score and the PCMS was >3% but the predicted PCMS value deviated by only 2% from the GT. For the IHC scores 2+ and 3+, 5 bonus points were awarded if there was an accurate prediction of the IHC score and the PCMS deviated by only 5% from the GT. Similarly, 2.5 bonus points were awarded for scores 2+ and 3+ if there was an accurate prediction of the IHC score and the PCMS deviated by only 10% from the GT.
The weighted confidence was devised to measure the credence of the predicted score by the submitted algorithm. The criteria to measure the weighted confidence wc were distinct for truly and wrongly classified cases. In cases where the predicted HER2 score pS matched the GT with higher confidence c, the weighted confidence amplified the confidence value for a true prediction, whereas wrong predictions with high confidence were penalized accordingly, as given in equation (1). This type of assessment is important for the development of an interactive diagnostic module. The confidence value may indicate those cases or regions where further examination by the experts may be required before concluding the final HER2 score.

    wc = 2c·c² / (c² + 1)         if pS = GT
    wc = 1 − 2c·c² / (c² + 1)     otherwise                    (1)
The third assessment criterion is a combination of both agreement points and weighted confidence-based evaluations. The combined points were calculated by taking the product of the two assessment criteria for each case.
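As an illustration of how the criteria combine for a single case, the Python sketch below looks up agreement points from the matrix in Table 2(A), applies the bonus rules described above, computes the weighted confidence following equation (1) as reconstructed here, and forms the combined score as the product of agreement points and weighted confidence. The function names and dictionary layout are illustrative, not the organisers' reference code.

```python
SCORES = ['0', '1+', '2+', '3+']

# Agreement points from Table 2(A): rows are the ground truth, columns the predicted score.
AGREEMENT = {
    '0':  [15, 15, 10, 0],
    '1+': [15, 15, 10, 0],
    '2+': [2.5, 2.5, 15, 5],
    '3+': [0, 0, 10, 15],
}

def bonus_points(gt_score, pred_score, gt_pcms, pred_pcms):
    """Bonus for an accurate score accompanied by an accurate PCMS estimate."""
    if pred_score != gt_score:
        return 0.0
    deviation = abs(pred_pcms - gt_pcms)
    if gt_score == '1+':
        if gt_pcms < 3:
            return 1.0
        return 3.0 if deviation <= 2 else 0.0
    if gt_score in ('2+', '3+'):
        if deviation <= 5:
            return 5.0
        if deviation <= 10:
            return 2.5
    return 0.0

def weighted_confidence(gt_score, pred_score, c):
    """Equation (1): reward confident correct predictions, penalize confident errors."""
    f = 2 * c * c ** 2 / (c ** 2 + 1)
    return f if pred_score == gt_score else 1 - f

def case_scores(gt_score, pred_score, gt_pcms, pred_pcms, confidence):
    points = AGREEMENT[gt_score][SCORES.index(pred_score)]
    bonus = bonus_points(gt_score, pred_score, gt_pcms, pred_pcms)
    wc = weighted_confidence(gt_score, pred_score, confidence)
    return points, points + bonus, wc, points * wc   # points, points + bonus, wc, combined

# A correct 2+ prediction, PCMS within 5% of the GT, reported with confidence 0.9
print(case_scores('2+', '2+', gt_pcms=40, pred_pcms=37, confidence=0.9))
```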
Results

CONTEST LEADERBOARDS

Comprehensive results comprising all the submissions for automated methods are shown in Table 3. The teams were ranked with respect to the combined point-based assessment with bonus points. For the off-site contest, the total possible points were 420 (28 cases with a maximum of 15 points each), whereas for weighted confidence the maximum points were 28, 1 for each case. The top three-ranked teams with respect to point-based assessments were Team Indus, MUCS-1 and MUCS-2, whereas according to the weighted confidence assessment the top-ranked teams were VISILAB, FSUJena and MTB NLP. The combined results rank the top three teams in the following order: VISILAB, FSUJena and Huangch. The performance of the top-ranked teams including bonus points, and the trend for total points (without the bonus points), can be seen in Figure 3. MUCS-1, MUCS-3, CS_UCCGIP and MTB NLP achieved equal points, but MUCS-1 secured more bonus points, as their PCMS was more accurate compared with the remaining counterparts. Similarly, Team VISILAB and RumRocks resulted in a tie where both teams attained equal points, but the VISILAB method was more precise in predicting the PCMS. Comprehensive tables for all three leaderboards are available for download from the contest website.

SUMMARY OF PROPOSED AUTOMATED METHODS

Most of the automated methods (described in Data S2 and Figure S1) applied a supervised patch-based classification approach to solve this problem. The most common pipeline was based on three main components: (1) pre-processing, including the methods to identify the regions of interest for patch generation; (2) classification based on handcrafted or neural network-learned features; and (3) post-processing techniques to aggregate the HER2 score at WSI level and to estimate the PCMS. Deep learning, especially convolutional neural network (CNN)-based approaches, dominated, as eight of the top 10 methods were based on CNNs. The majority of the CNN architectures [Team Indus, MUCS-(1–3), MTB NLP, VISILAB, RumRocks, FSUJena] were inspired by state-of-the-art deep neural networks.9,10

In the pre-processing and patch extraction stage, most of the teams followed conventional thresholding techniques combined with morphological operators. These techniques are computationally inexpensive and generally work well, as background regions lack any texture content in contrast with other tissue components. The MUCS-(1–3), MTB NLP, VISILAB and FSUJena teams probed the regions of interest manually, through calibration or customized methodologies. These methods aimed to pick the best possible regions for training their algorithms, generally without affecting the testing phase. To segment tissue regions, the RumRocks team implemented a deconvolutional neural network (DCNN) and a two-dimensional CNN for selection of patches based on their texture. The Huangch team performed mean filtering and stain normalization, using the control tissue intensity values to calibrate the stain colour intensity as a pre-processing step.
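The thresholding-and-morphology style of tissue detection described above can be sketched in a few lines. The snippet below uses scikit-image on a low-resolution RGB thumbnail of a WSI and is a generic illustration of the idea, not a reproduction of any particular team's pipeline.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from skimage.morphology import binary_closing, disk, remove_small_objects

def tissue_mask(thumbnail_rgb: np.ndarray) -> np.ndarray:
    """Separate tissue from the bright, texture-free glass background of a WSI thumbnail."""
    grey = rgb2gray(thumbnail_rgb)
    mask = grey < threshold_otsu(grey)           # tissue is darker than the background
    mask = binary_closing(mask, disk(3))         # close small holes inside tissue regions
    return remove_small_objects(mask, min_size=500)

def patch_coordinates(mask: np.ndarray, patch: int = 32, min_tissue: float = 0.5):
    """Yield top-left (row, col) coordinates of thumbnail patches that are mostly tissue."""
    rows, cols = mask.shape
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            if mask[r:r + patch, c:c + patch].mean() >= min_tissue:
                yield r, c

# Example with a random image standing in for a real thumbnail
coords = list(patch_coordinates(tissue_mask(np.random.rand(256, 256, 3))))
```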
Table 3. A summary of results of all three assessment criteria for the automated human epidermal growth factor receptor 2 (HER2) scoring contest, ordered by the combined points criterion

Team | Affiliation | Points | Points + bonus | Weighted confidence | Combined
Team Indus | Indian Institute of Technology Guwahati | 402.5 | 425 | 18.451 | 321.414
UC-CSSE-CGIP group | University of Canterbury, New Zealand | 390 | 395 | 21.07 | 316.05
Team Indus (Stainsep) | Indian Institute of Technology Guwahati | 332.5 | 345.5 | 18.451 | 250.715
Figure 3. Leaderboard 1: points achieved by the top-ranked teams, with and without bonus points (vertical axis: 340–430 points).
In the second step, most of the teams (specifically the top 10) employed deep learning approaches, whereas other teams, such as CS_UCCGIP and Huangch, derived handcrafted characteristic curves and employed standard machine learning approaches. Team Indus used a combination of data-driven and handcrafted features.
They incorporated the average control tissue intensity value along with learned feature maps before passing them to the fully connected layers. Some of the top-ranked teams deployed variants of AlexNet9 and GoogLeNet10 for predicting the HER2 score. The FSUJena team computed bilinear features after retrieving activations from the convolutional layers of AlexNet. The derived activations contain the learned feature maps representing a d-dimensional w × h spatial grid. This approach enabled them to perform their analysis on top of the learned feature maps from the CNN. In combination with standard approaches for data regularization, MTB NLP and RumRocks trained multiple models; the final HER2 score and PCMS were estimated by averaging over all the models. Additionally, a wide range of data augmentation and regularization techniques was employed to overcome overfitting. As is standard practice, data augmentation techniques such as affine transformations (e.g. rotation, flip, translation), random cropping, blurring and elastic deformations were applied to train the networks. MUCS-2, MTB NLP and RumRocks made broad use of data augmentation to help their networks generalize well to unseen data.
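As a hedged illustration of the patch-based deep-learning step, the sketch below defines a small four-class CNN together with the kind of flip, rotation, cropping and blur augmentation mentioned above, using PyTorch and torchvision. It is a toy stand-in under those assumptions, not a reproduction of any team's AlexNet or GoogLeNet variant.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Augmentation of the kind described above: flips, rotation, random cropping and blur.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(90),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.GaussianBlur(kernel_size=3),
])

class PatchHER2Net(nn.Module):
    """Tiny CNN mapping an RGB patch to logits over the four HER2 scores (0, 1+, 2+, 3+)."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One augmented patch passed through the network
patch = augment(torch.rand(3, 256, 256))
logits = PatchHER2Net()(patch.unsqueeze(0))   # shape: (1, 4)
```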
In the final, post-processing stage of predicting the PCMS, most of the teams employed standard image processing and machine learning approaches on top of the results attained in the previous step. A Random Forest classifier was trained by MTB NLP to produce the final class probabilities and to estimate the PCMS. FSUJena simply used the mean tumour cell percentage seen in the training set for a particular class as an estimate. Team Indus used both IHC- and H&E-stained slides to estimate the PCMS, using standard image processing approaches such as contour detection, thresholding and morphological features. All the remaining teams limited their analysis to IHC-stained images only. All the submissions used high-magnification images (×10 or above), except MUCS and RumRocks, who used low-resolution images for the selection of ROIs.
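One simple way to realize this aggregation step is sketched below: patch-level class probabilities are averaged into a slide-level HER2 score, and the PCMS is approximated by the fraction of tissue patches assigned to the complete-membrane-staining classes. Both choices are illustrative assumptions rather than the rule used by any specific team.

```python
import numpy as np

SCORES = ['0', '1+', '2+', '3+']

def aggregate_slide(patch_probs: np.ndarray):
    """Aggregate patch predictions into a slide-level HER2 score and a crude PCMS estimate.

    patch_probs -- array of shape (n_patches, 4): softmax outputs for classes 0, 1+, 2+, 3+.
    """
    slide_probs = patch_probs.mean(axis=0)                 # average over all tissue patches
    score = SCORES[int(np.argmax(slide_probs))]
    patch_labels = patch_probs.argmax(axis=1)
    pcms = 100.0 * np.isin(patch_labels, [2, 3]).mean()    # % of patches called 2+ or 3+
    return score, pcms

# Example: 1000 patches with random, normalised class probabilities
probs = np.random.dirichlet(np.ones(4), size=1000)
print(aggregate_slide(probs))
```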
MAN VERSUS MACHINE EVENT

This event was organized to explore the disagreements among conventional and automatic methods. This type of analysis can lead us to a more sophisticated protocol for automatic HER2 scoring, and can help to overcome the inter- and intra-observer disagreement found in normal practice.

The analysis between the experts' agreement and the evaluation of the automatic HER2 scoring methods was performed with a subset (15 cases) of the off-site test data set. For this event, we set up an online webpage for the pathologists. The webpage enabled the experts to load and navigate (including pan and zoom) through the WSIs of those cases. Both IHC (HER2)- and H&E-stained digital images were made available to mimic the conventional scoring environment. On the contest day at the Pathological Society meeting 2016, we requested the expert pathologists to score each case by providing the HER2 score, PCMS and a confidence value.

Man versus machine results comparison

Table 4 summarizes the overall evaluation scores achieved by each participant for this event. Each table entry gives the cumulative score for all 15 cases, which indicates the overall performance. The agreement-points-based assessment was used to evaluate the performance for this event. In total, we received four responses from expert pathologists and, as shown in Table 4, we ranked the top six submissions, including the top three automated methods. From the submitted responses, three participant pathologists reported themselves as 'consultant pathologist' and one as 'trainee pathologist', and all three marked breast pathology as a subspeciality.

Table 4. Summary results for the Man versus Machine event. The evaluation was carried out according to the contest criteria as described in the Evaluation section.

As can be seen in Table 4, one of the automated methods slightly outperformed the top-performing participant pathologist.
These results point to the potential significance of automated scoring methods and the recent advancements in digital pathology. It is worth mentioning that the automated HER2 scoring algorithms submitted to this contest are not ready to deploy in their current form, as they will require extensive validation on a significantly large-scale data set and also a great deal of input from experts to prepare the GT on the larger data set.

Table 5 shows the pooled data for HER2 scoring among the three top-ranked automated methods and the scores from three participant pathologists, and the comparison with the GT. Table 5 was determined for the 15 cases selected from the off-site contest data set. On the basis of HER2 scores, 100% agreement with the GT was observed for score 3+ among the participant pathologists and the automated methods. For the scores of 1+ and 2+, there were disparities between the GT and the new scores. In all cases except one, for both man and machine, the error resulted from overcalling the score. Thus, for score 1+, six of nine (67%) were overcalled as 2+ by humans, while four of nine (44%) were overcalled by the machine algorithms. For the score of 2+, seven of 15 (46%) were overcalled as 3+ by humans, while machines overcalled one of 15 (6%) as 3+ and undercalled one of 15 (6%) as 1+. Clinically, a score of 2+ is critical, as in routine practice cases of score 2+ are recommended to undergo FISH testing. It is equally important to avoid predicting score 2+ cases as 1+ or 0, as such an erroneous prediction would deny the further assessment of HER2. As can be seen in Table 5, none of the cases with score 2+ was misclassified by the participant pathologists as either 1+ or 0, whereas for one of the cases an automated method wrongly predicted a score of 2+ as 1+.
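The over- and undercall figures above can be tallied directly from score lists of the kind shown in Table 5. The short sketch below does this for a three-case illustrative subset (cases 1, 5 and 8 as read by Expert 1); the percentages reported in the text come from pooling all expert and machine readings in the same way.

```python
ORDER = {'0': 0, '1+': 1, '2+': 2, '3+': 3}

def call_errors(gt_scores, predicted_scores):
    """Count overcalls, undercalls and totals per ground-truth HER2 score."""
    stats = {}
    for gt, pred in zip(gt_scores, predicted_scores):
        over, under, total = stats.setdefault(gt, [0, 0, 0])
        stats[gt] = [over + (ORDER[pred] > ORDER[gt]),
                     under + (ORDER[pred] < ORDER[gt]),
                     total + 1]
    return stats

# Cases 1, 5 and 8 from Table 5, ground truth versus Expert 1
print(call_errors(['2+', '1+', '2+'], ['3+', '2+', '3+']))
# {'2+': [2, 0, 2], '1+': [1, 0, 1]}
```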
Most of the incorrect predictions by the participant pathologists were found to be in cases where there was considerable heterogeneity. Two such examples are shown in Figure 4A–D. In tumour cells of HER2 score 2+, a pattern of weak to moderate complete membrane staining is observed, whereas for score 3+ an intense (uniform) complete membrane staining is observed. Estimating the complete membrane staining is a difficult and highly subjective process, especially for scores 2+ and 3+, as it is extremely difficult to detect subtle differences in the morphological appearance for those cases.

Discussion

A major aim of organizing this contest was to provide a platform for computer scientists and researchers to contribute and to evaluate the performance of their computer algorithms for automated IHC scoring of HER2 in images from BCa tissue slides. Automated scoring can significantly overcome the subjectivity found owing to the varying standards adopted by different diagnostic laboratories. There is a current wealth of literature11,12 using individual platforms (both freely and commercially available) for digital analysis of HER2 in BCa. This, however, was the first comparison of platforms and algorithms, and it provides a pilot for independent comparison of computing algorithms for HER2 assessment on a benchmark data set. The contest highlights the wealth of potential carried by artificial intelligence (AI) techniques for the assessment of IHC slides.

The contest 'training data set' was selected deliberately such that it contained a reasonable number of cases from all HER2 scores, bearing in mind the need for the training algorithms to learn features for each score. For the test data set (both off- and on-site), the GT was withheld at the time of image evaluation. Results showed that the automated analysis performed comparably to histopathologists. Many of the algorithms achieved high accuracy – often close to the maximum. Our main objective was to analyse the performance of the algorithms based on clinical relevance, and hence the three particular evaluation criteria described above were chosen. It is possible that other assessment criteria may influence the ranking of comparative results.

The data from the Man versus Machine comparison showed that, reassuringly, all participants (whether human or computer) identified cases with a GT score of 3+ correctly. This means that no one in this category would have been denied treatment. Similarly, for the cases with a score of 0 or 1+, although there was some overcalling, this never exceeded 2+ and thus none would have received treatment without further testing. The most problematic category was, not unexpectedly, cases with a score of 2+ in both human and machine evaluations. If overcalled as 3+, the FISH-negative subset would be overtreated. The GT information for the FISH results was not released to the participants, as the contest was aimed only at comparing interpretation of HER2 IHC results. Hence, most of the automated algorithms aimed at predicting the equivocal cases as 2+. Table 5 incorporates the FISH results for all the cases that were marked as 2+ in the test data GT (including the Man versus Machine data set).
Table 5. Combined matrix for agreement among the three experts and the top three automated methods, based on agreement points against the ground truth (GT) scores, for the 15 cases in the Man versus Machine event. Borderline case 7 was deemed negative, and cases 16 and 19 were deemed positive for the treatment decision [based on the human epidermal growth factor receptor 2:chromosome 17 centromere (HER2:CEP17) amplification ratio for HER2 over-expression: 1.96, 2.1 and 2.07, respectively].
Case | Ground truth | FISH results | Expert 1 | Expert 2 | Expert 3 | Team Indus | VISILAB | MUCS-1
1 | 2+ | Negative | 3+ | 2+ | 2+ | 2+ | 2+ | 2+
2 | 0 | – | 0 | 1+ | 1+ | 1+ | 1+ | 0
3 | 3+ | – | 3+ | 3+ | 3+ | 3+ | 3+ | 3+
4 | 0 | – | 1+ | 1+ | 1+ | 0 | 1+ | 1+
5 | 1+ | – | 2+ | 1+ | 2+ | 1+ | 2+ | 1+
6 | 3+ | – | 3+ | 3+ | 3+ | 3+ | 3+ | 3+
7 | 2+ | Borderline amplified | 3+ | 3+ | 3+ | 2+ | 2+ | 2+
8 | 2+ | Negative | 3+ | 2+ | 3+ | 2+ | 3+ | 2+
9 | 3+ | – | 3+ | 3+ | 3+ | 3+ | 3+ | 3+
10 | 3+ | – | 3+ | 3+ | 3+ | 3+ | 3+ | 3+
11 | 1+ | – | 1+ | 1+ | 2+ | 0 | 1+ | 1+
12 | 2+ | Positive | 2+ | 2+ | 3+ | 2+ | 2+ | 2+
13 | 1+ | – | 2+ | 2+ | 2+ | 2+ | 2+ | 1+
14 | 2+ | Negative | 2+ | 2+ | 2+ | 2+ | 2+ | 1+
15 | 0 | – | 0 | 1+ | 0 | 0 | 1+ | 0
16 | 2+ | Borderline amplified | – | – | – | 0 | 1+ | 2+
17 | 2+ | Negative | – | – | – | 2+ | 2+ | 2+
18 | 2+ | Positive | – | – | – | 2+ | 1+ | 2+
19 | 2+ | Borderline amplified | – | – | – | 2+ | 2+ | 2+
20 | 1+ | – | – | – | – | 1+ | 1+ | 1+
21 | 1+ | – | – | – | – | 1+ | 1+ | 2+
22 | 0 | – | – | – | – | 1+ | 0 | 1+
23 | 1+ | – | – | – | – | 0 | 1+ | 1+
24 | 1+ | – | – | – | – | 0 | 1+ | 2+
25 | 3+ | – | – | – | – | 3+ | 3+ | 3+
26 | 0 | – | – | – | – | 1+ | 0 | 1+
27 | 0 | – | – | – | – | 0 | 0 | 1+
28 | 0 | – | – | – | – | 0 | 0 | 0
From the Man versus Machine cases (15 in total), a score of 2+ (subsequently FISH-negative) was overcalled by the machine as 3+ in only one instance (VISILAB). In contrast, on three occasions (subsequently FISH-negative) the participant pathologists overcalled a score of 2+ as 3+. Moreover, for the remaining test data set (13 cases), in three instances a score of 2+ (subsequently FISH-positive) was predicted erroneously as either 1+ or 0 by the automated algorithms.
Figure 4. A–D, examples of cases showing considerable heterogeneity of membrane staining.
Overall, the results indicate that further fine-tuning will be required for 2+ cases with AI. While it is encouraging that automated HER2 scoring algorithms may have sufficient potential for a direct comparison with human diagnosis, it is probably worthwhile to reflect that the number of pathologists actually joining the contest was small (only four), and it would have been better to compare the pathologists' assessment of the slides on a reporting microscope rather than on a computer for a fairer comparison with real-life practice.

Conventionally, expert pathologists often switch back and forth between the IHC and H&E slides to map the invasive tumour regions when estimating the percentage of complete membrane staining. With the exception of one of the participants (Team Indus), most of the algorithms reported in this paper avoided the use of H&E slides, although the use of the H&E slide for the automatic detection of ductal carcinoma in situ (DCIS) regions cannot be ruled out. In addition, the task of predicting the PCMS is extremely subjective, as the expert has to make an estimation on the basis of the physical appearance of the stained invasive tumour region. Semi-automated methods could provide a comprehensive quantitative analysis of a selected region of interest to assist the experts in estimating the PCMS and HER2 score, especially in borderline cases.
As HER2 immunoscoring relies not only on intensity but also on the completeness of membrane positivity, automated scoring may be helpful, as demonstrated by Brügmann et al.,13 who proposed scoring of HER2 based on an algorithm evaluating cell membrane connectivity.
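As a toy illustration of the membrane-connectivity idea (and explicitly not the published algorithm of Brügmann et al.13), one can measure, for each segmented cell, what fraction of its boundary is covered by positively stained membrane:

```python
import numpy as np
from skimage.segmentation import find_boundaries

def membrane_completeness(cell_labels: np.ndarray, membrane_mask: np.ndarray) -> dict:
    """Fraction of each cell's boundary covered by positive membrane staining.

    cell_labels   -- integer label image of segmented cells (0 = background)
    membrane_mask -- boolean mask of DAB-positive membrane pixels
    """
    boundaries = find_boundaries(cell_labels, mode='inner')
    completeness = {}
    for label in np.unique(cell_labels):
        if label == 0:
            continue
        cell_boundary = boundaries & (cell_labels == label)
        n_boundary = cell_boundary.sum()
        stained = (cell_boundary & membrane_mask).sum()
        completeness[int(label)] = float(stained / n_boundary) if n_boundary else 0.0
    return completeness

# Toy example: one square 'cell' whose upper half of the boundary is stained
labels = np.zeros((20, 20), dtype=int)
labels[5:15, 5:15] = 1
stain = np.zeros((20, 20), dtype=bool)
stain[:10, :] = True
print(membrane_completeness(labels, stain))   # roughly 0.5 for cell 1
```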
This study shows that automated IHC scoring algorithms can provide a quantitative assessment of morphological features that can assist in objective computer-assisted diagnosis and in predictive modelling of outcome and survival.14 We have demonstrated the potential significance of digital imaging and automated tools in histopathology. In the context of breast histopathology, whereby almost all invasive tumour cases are considered for HER2 testing, an automated or semi-automated scoring method has potential for deployment in routine practice. Despite all these advances, several challenges remain for the AI algorithms to be optimized and become part of routine diagnosis. It is worth noting that serious optimization will be needed for automated methods when processing a whole-slide image: some methods required more than 3 h per case, which, in the 'real world' of diagnostic service delivery, is not feasible. Another limitation of this contest was that the image data were collected from a single site using a single scanner. A potential extension would be to collect data from multiple pathology laboratories, with HER2 scores marked by different experts and images scanned using a variety of different machines. This would also test the differences inherent in staining quality that may affect such procedures. Such enhancements could significantly mitigate the overfitting to one particular data set that may occur in automated scoring methods. In moving across systems, other laboratories, for example, have acknowledged the challenges in reaching the optimum Aperio algorithm parameters to provide results equivalent to those of the 'automated cellular imaging system' (ACIS) or 'cell analysis system' (CAS 200) quantitation systems,15 which are fully automated environments for detecting cells based on intensity characteristics and handcrafted features found in IHC-stained images. Therefore, there is a need to learn across comparative systems, for which the current study provides a valid starting-point. The study also highlights the need for dialogue between histopathologists and informaticians to understand the correct identification of tissue compartments relevant for assessment, correct morphology (normal versus in-situ versus invasive) and stromal versus tumour stain. Algorithms will also need to be trained to the natural acceptable variation in staining hues and intensities (intra- and interlaboratory) to work effectively during routine practice.

All cases with score 2+ are recommended routinely for further FISH testing to validate HER2 overexpression at the gene level. It would be an added advantage if the automated methods could be trained with FISH GT to predict the final outcome, so that the potential for automated algorithms in calling the actual final HER2 status with reproducible accuracy could be demonstrated. For this, a larger series of 2+ cases alone, with FISH data, would need to be tested. Indeed, there have been other promising studies indicating that automated image analysis for HER2, instead of manual assessment, may reduce the need for supplementary FISH testing by up to 68%.16 In a diagnostic setting, this would reduce costs and turnaround time significantly. During the last decade, IHC staining has become ubiquitous in pathology laboratories globally, and the role of IHC evaluation in a high-throughput setting becomes key for IHC-based companion diagnostics. Other possible extensions of digital pathology could be to automate the assessment of overexpression of the programmed death 1 (PD-1) receptor and its ligand (PD-L1), and to evaluate anaplastic lymphoma kinase (ALK) protein and proto-oncogene tyrosine-protein kinase ROS1 in lung cancers.17 AI-based algorithms would be more effective if IHC staining and scoring methods were treated as a composite assay.18,19 Varying staining protocols and scoring parameters may restrain the effectiveness of AI-based automated scoring algorithms, including HER2 scoring, but with sufficiently variable data from different centres AI algorithms could be trained to overcome that problem.

This contest provides a baseline for computer science and computational pathology researchers for automated/semi-automated scoring and computer-assisted diagnosis (CAD) tools to assist pathologists in their daily routine analysis. The contest is now over, but registration and the web portal will remain open for future participants to make novel contributions to automated HER2 scoring.

Acknowledgements

The first author (T.Q.) acknowledges the financial support provided by the University Hospital Coventry Warwickshire (UHCW) and the Department of Computer Science at Warwick. The VISILAB team (A.P. and G.B.) and UNOTT (M.I. and A.M.) acknowledge financial support from the European Project AIDPATH (no.: 612471); http://aidpath.eu/. The MUCS team wishes to acknowledge John McDonald and Ronan Reilly for their valuable contributions to the research, and acknowledges financial support from Science Foundation Ireland (SFI) under grant no. 13/CDA/2224 and an Irish Research Council (IRC) Post Graduate Scholarship. Co-first author Dr Mukherjee would also like to thank the NIHR and the Pathological Society of Great Britain and Ireland for support. We are also grateful to Dr Nicholas Trahearn for his input in deriving the weighted confidence evaluation measure.
Conflicts of interest

None.

References

1. Hamilton PW, Bankhead P, Wang Y et al. Digital pathology and image analysis in tissue biomarker research. Methods 2014; 70; 59–73.
2. Ma J, Jemal A. Breast cancer statistics. In Ahmed A ed. Breast cancer metastasis and drug resistance. New York, NY: Springer New York, 2013; 1–18.
3. Breast Cancer Statistics, Cancer Research UK. Available at: http://www.cancerresearchuk.org/cancer-info/cancerstats/types/breast/ (accessed 12/09/2017).
4. Smits AJJ, Kummer JA, de Bruin PC et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod. Pathol. 2014; 27; 168–174.
5. Viray H, Li K, Long TA et al. A prospective, multi-institutional diagnostic trial to determine pathologist accuracy in estimation of percentage of malignant cells. Arch. Pathol. Lab. Med. 2013; 137; 1545–1549.
6. Rakha EA, Bennett RL, Coleman D et al. Review of the national external quality assessment (EQA) scheme for breast pathology in the UK. J. Clin. Pathol. 2017; 70; 51–57.
7. Rakha EA, Pinder SE, Bartlett JMS et al. Updated UK recommendations for HER2 assessment in breast cancer. J. Clin. Pathol. 2015; 68; 93–99.
8. Wolff AC, Hammond MEH, Schwartz JN et al. Reply to Vang Nielsen et al. and to Raji. J. Clin. Oncol. 2007; 25; 4021–4023.
9. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In Pereira F, Burges CJC, Bottou L et al. eds. Advances in neural information processing systems 25. Red Hook, NY: Curran Associates Inc, 2012; 1097–1105.
10. Szegedy C, Liu W, Jia Y et al. Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, Massachusetts, 2015; 1–9.
11. Gavrielides MA, Conway C, O'Flaherty N et al. Observer performance in the use of digital and optical microscopy for the interpretation of tissue-based biomarkers. Anal. Cell. Pathol. 2014; 2014; 1–10.
12. Tuominen VJ, Tolonen TT, Isola J. ImmunoMembrane: a publicly available web application for digital image analysis of HER2 immunohistochemistry. Histopathology 2012; 60; 758–767.
13. Brügmann A, Eld M, Lelkaitis G et al. Digital image analysis of membrane connectivity is a robust measure of HER2 immunostains. Breast Cancer Res. Treat. 2012; 132; 41–49.
14. Chen J-M, Qu A-P, Wang L-W et al. New breast cancer prognostic factors identified by computer-aided image analysis of HE stained histopathology images. Sci. Rep. 2015; 5; 10690.
15. Farris AB, Cohen C, Rogers TE et al. Whole slide imaging for analytical anatomic pathology and telepathology: practical applications today, promises, and perils. Arch. Pathol. Lab. Med. 2017; 141; 542–550.
16. Holten-Rossing H, Møller Talman M-L, Kristensson M et al. Optimizing HER2 assessment in breast cancer: application of automated image analysis. Breast Cancer Res. Treat. 2015; 152; 367–375.
17. Shtivelman E, Hensing T, Simon GR et al. Molecular pathways and therapeutic targets in lung cancer. Oncotarget 2014; 5; 1392.
18. Taylor CR. Predictive biomarkers and companion diagnostics. The future of immunohistochemistry – 'in situ proteomics', or just a 'stain'? Appl. Immunohistochem. Mol. Morphol. 2014; 22; 555–561.
19. Ilie M, Hofman V, Dietel M et al. Assessment of the PD-L1 status by immunohistochemistry: challenges and perspectives for therapeutic strategies in lung cancer patients. Virchows Arch. Int. J. Pathol. 2016; 468; 511–525.

Supporting Information

Additional Supporting Information may be found in the online version of this article:
Data S1. Contest format.
Table S1. The ground truth score for 52 cases from the training data set with the percentage of cells with complete membrane staining. The borderline case 63 was deemed negative, and the amplification ratio for HER2 over-expression was 1.92.
Data S2. Description of automated methods.
Figure S1. Characteristic curves and the corresponding HER2 score. The x-axis denotes the range of the saturation value, whereas the y-axis denotes the calculated percentage from the saturation limits. The predicted HER2 scores are also shown for each curve.