
1 Introduction

Cell detection and classification are often the first key steps in a wide range of histology image analysis tasks, such as investigating the interplay of the tumor and immune cells [1]. Multiplex immunohistochemistry (mIHC) is a multi-parametric protocol that allows simultaneous examination of the expression of multiple markers in a single section [2, 3]. Combined with robust cell detection and classification techniques, mIHC has the potential to allow detailed investigation of the spatial interaction and signalling of cells for the study of tumor heterogeneity [2].

The field of digital pathology has recently witnessed a surge of interest in the application of deep learning for cell classification [4], cell detection [5, 6], and cell counting [7,8,9,10]. However, automated cell detection and classification remain challenging due to variation in slide preparation and the morphological diversity of cells in shape and size. For example, closely located cells with weak boundaries are often difficult to discern [5,6,7,8]. Moreover, a parameter such as kernel size often needs to be fixed in advance [5], which cannot cater for cells with a range of sizes and shapes. Furthermore, the need to differentiate cells with subtle differences in marker expression intensity, as exemplified in Fig. 1a, adds another layer of complexity to mIHC image analysis.

In this paper, to address the above stated challenges, we developed a new cell detection method followed by a multi-stage CNN to analyse mIHC images of breast cancer. Our work has the following main contributions: (1) We developed the Cell Count RegularizeD Convolutional neural Network (ConCORDe-Net), inspired by inception-v3, which incorporates a cell counter and is designed for cell detection without the need to pre-specify parameters such as cell size. (2) The parameters of ConCORDe-Net were optimized using an objective function that combines conventional Dice overlap and a new cell count loss function, which regularizes the network parameters to detect closely located cells. (3) Our quantitative experiments show that ConCORDe-Net outperforms state-of-the-art methods at detecting closely located as well as weakly stained cells.

2 Materials

The dataset used in this paper consisted of mIHC whole-tumor slide images from patients with breast cancer, scanned at 40X resolution. A total of 175 regions/patches were annotated by experts from different parts of 6 whole-tumor images. The patches were extracted from different regions of the slides to capture the variation in the data, and were then randomly split into training (120), validation (28), and testing (27) sets. Within these patches, 20477 cells were annotated, belonging to five different cell types as depicted in Table 1. Illustrative examples of patches are shown in Fig. 1a. The distribution of the data for each cell type is presented in Table 1.

Table 1. Distribution of dataset

3 Methodology

3.1 Dot Annotation to Cell Pseudo-segmentation

The reference ground truth was a dot annotation at the center of each cell rather than a segmentation of the cell's spatial extent, which is generally a tedious task to produce. However, to train the proposed cell detection pipeline, a cell mask (G) and the number of cells (\(C_t\)) were needed as targets. \(C_t\) is simply the number of annotated cells in the input patch. Cell pseudo-segmentation was generated from the dot annotations using Eq. (1).

$$G(i, j) = \begin{cases} 1 & \text{if } d < r \\ 0 & \text{otherwise} \end{cases}$$
(1)

where G(i, j) is the pixel intensity value at location (i, j) of the pseudo-segmentation image (G), \(d\) is the Euclidean distance between pixel location (i, j) and the nearest cell dot annotation, and \(r\) is a distance threshold. \(r\) was empirically set to 4 pixels to guarantee that the pseudo-segmentations of neighbouring cells do not touch each other.
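For concreteness, Eq. (1) can be implemented with a Euclidean distance transform. Below is a minimal sketch, assuming dot annotations are given as (row, col) pixel coordinates; the function name and signature are ours, not from the paper's code:

```python
# Sketch of Eq. (1): build a pseudo-segmentation mask from dot annotations.
import numpy as np
from scipy.ndimage import distance_transform_edt

def pseudo_segmentation(dots, shape, r=4):
    """Return a binary mask G with G[i, j] = 1 iff pixel (i, j) lies
    within distance r of the nearest dot annotation."""
    seeds = np.zeros(shape, dtype=bool)
    rows, cols = zip(*dots)
    seeds[rows, cols] = True
    # Distance from every pixel to the nearest annotated cell center.
    d = distance_transform_edt(~seeds)
    return (d < r).astype(np.uint8)
```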

3.2 Cell Counter

Our proposed cell counter network is shown in Fig. 1b. It is a mapping function, \(f:\mathbb {R}^{n \times n} \rightarrow \mathbb {R}^{1}\), where n is the size of the input patch, which is 224 in our case. It consists of a feature extraction part and a regression part. The feature extraction part is composed of four consecutive convolutional layers with \(3\,\times \,3\) filters and “same” padding. The numbers of neurons in these layers are \(\{16, 32, 64, 128\}\), respectively. Every convolutional layer was followed by a max-pooling layer of size \(2\,\times \,2\) with stride 2 to reduce the dimensionality of the features from the previous layer. The regression part is a series of two dense layers of \(\{200,\ 1\}\) neurons. The output dense layer has a single neuron which computes the estimated number of cells in the input tensor or image. The activation of all convolutional and dense layers was set to the rectified linear unit (ReLU).
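This description translates directly into a few lines of Keras. The following is an illustrative sketch of such a counter; layer choices follow the text above, while the function name and the single-channel input are our assumptions:

```python
# Minimal sketch of the cell counter: four 3x3 conv layers
# ({16, 32, 64, 128} filters, "same" padding), each followed by 2x2
# max-pooling with stride 2, then dense layers of {200, 1} neurons.
from tensorflow.keras import layers, models

def build_cell_counter(n=224):
    inp = layers.Input(shape=(n, n, 1))  # pseudo-segmentation / prediction map
    x = inp
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(200, activation="relu")(x)
    out = layers.Dense(1, activation="relu")(x)  # estimated cell count
    return models.Model(inp, out, name="cell_counter")
```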

Parameters of all layers were randomly initialized using uniform Glorot initialization [11]. Optimization of the parameters was done using Adam [12] with a learning rate of \(10^{-4}\). Initially, we experimented with Euclidean loss [10] and exponential loss functions. However, these suffered from loss explosion during the initial epochs, so we propose a new cell count loss (\(C_l\)) function, given in Eq. (2).

$$C_{l} = 1 - \frac{1}{1 + \frac{1}{B}\sum_{j=1}^{B}\left|C_{pj} - C_{tj}\right|}$$
(2)

where the summation is over the B mini-batch images, and \(C_{pj}\) and \(C_{tj}\) are the predicted and true numbers of cells in the \(j^{th}\) image, respectively. Figure 2a shows the profile of \(C_l\) as a function of the cell count difference (\(C_{p} - C_{t}\)); it is bounded between 0 and 1.
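Eq. (2) can be written as a short Keras-compatible loss; this is a sketch under the definitions above, not the authors' exact implementation:

```python
# Sketch of the cell count loss in Eq. (2), bounded in [0, 1) as in Fig. 2a.
import tensorflow as tf

def cell_count_loss(c_true, c_pred):
    # Mean absolute count error over the mini-batch, mapped into [0, 1).
    mae = tf.reduce_mean(tf.abs(c_pred - c_true))
    return 1.0 - 1.0 / (1.0 + mae)
```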

Before integrating the cell counter model into the cell detection pipeline, it was trained and evaluated using pseudo-segmentations as input and the number of cells as output. To increase the amount of data, horizontal and vertical flipping were applied to all input training patches. The pseudo-segmentation is a binary image; however, once integrated with the cell detection model, the counter will be fed a tensor of floating-point values. Thus, morphological and intensity deformations were applied as follows. Morphological erosion using a rectangular structuring element of width \(w=2\) was performed on every patch with probability \(p=0.4\), where p and w were empirically chosen. Then, the images were multiplied by a random matrix of the same size as the image, with an empirically chosen probability \(p=0.4\). All elements of the random matrix were in the range [0.7, 1], setting pixel values between 0.7 and 1.
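As an illustration, this deformation might be sketched as follows, with w = 2 and p = 0.4 as stated; `rng` is an assumed NumPy random generator and the function name is ours:

```python
# Sketch of the morphological/intensity deformation applied to the binary
# pseudo-segmentation during cell counter training.
import numpy as np
from scipy.ndimage import grey_erosion

def deform(patch, rng, p=0.4, w=2):
    patch = patch.astype(np.float32)
    if rng.random() < p:
        # Erosion with a rectangular structuring element of width w.
        patch = grey_erosion(patch, size=(w, w))
    if rng.random() < p:
        # Random per-pixel intensity scaling into [0.7, 1].
        patch = patch * rng.uniform(0.7, 1.0, size=patch.shape)
    return patch
```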

Fig. 1.

(a) Sample patches representing different types of cells. (b) Schematic of the ConCORDe-Net architecture. \(3\,\times \,3\) and \(1\,\times \,1\) indicate the filter sizes of convolutional layers. TC = Transposed Convolution, MP = Max-pooling, C = Concatenate. The network has two outputs: a probability map and the predicted number of cells (\(C_p\)). The probability map was thresholded using an empirically optimized threshold \(T=0.85\) to convert it to a binary image. The center of every binary object represents the center of a cell. (c) Schematic of the inception module.

3.3 Cell Detection

Figure 1b shows the proposed ConCORDe-Net cell detection convolutional neural network. The input is a patch of size \(224\,\times \,224\,\times \,3\). The network has three parts: encoder, decoder, and cell counter. The encoder-decoder section is an extended version of U-Net [13]. The standard U-Net architecture [13] uses VGG-style blocks in its encoder and decoder sections. We propose to use the inception-v3 module shown in Fig. 1c instead of the VGG block. The parallel filters of varying size in the inception block enable the network to extract multi-scale features in a given layer. The encoder contains three inception modules, the first two of which were followed by 2D max-pooling layers. The decoder is composed of transposed convolution, concatenation, and inception modules. The convolutional layer with \(1\,\times \,1\) filters at the end of the decoder reduces the dimension of the tensor from \(224\,\times \,224\,\times \,32\) to \(224\,\times \,224\,\times \,1\). The output of the decoder was taken as the cell location prediction map (P) and connected to the pretrained cell counter model (explained in Sect. 3.2), which generates the predicted number of cells (\(C_p\)). The activation of all layers was set to ReLU, except the last layer in the decoder section, which uses a sigmoid. Therefore, the cell detection architecture has two outputs: the cell location prediction map and the predicted number of cells.
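As an illustration, one such inception-style block could be sketched in Keras as follows. The branch layout follows Fig. 1c schematically, but the per-branch filter counts are our assumptions, since the paper does not list them:

```python
# Sketch of an inception-style block used in place of a VGG block:
# parallel 1x1, 3x3, stacked-3x3 (factorized 5x5), and pooling branches,
# concatenated along channels to yield multi-scale features.
from tensorflow.keras import layers

def inception_block(x, filters=32):
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b5)
    b5 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(filters, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])
```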

The parameters of the cell counter model were transferred from the model pretrained on cell pseudo-segmentations, as explained in Sect. 3.2. Parameters of the other layers were randomly initialized using uniform Glorot initialization [11] and optimized using Adam [12] with a learning rate of \(10^{-4}\) and the objective function in Eq. (3). The cell detection loss (\(D_l\)) in Eq. (3) has two parts: the first is a Dice overlap loss, and the second is the cell count loss.

$$D_{l} = \left(1 - \frac{2\sum_{j=1}^{B}\sum_{i=1}^{N} p_{ij}\, g_{ij}}{1 + \sum_{j=1}^{B}\sum_{i=1}^{N} p_{ij} + \sum_{j=1}^{B}\sum_{i=1}^{N} g_{ij}}\right) + K\left(1 - \frac{1}{1 + \frac{1}{B}\sum_{j=1}^{B}\left|C_{pj} - C_{tj}\right|}\right)$$
(3)

where the summations in the first part are over the B images of a mini-batch and the N pixels of the ground truth image G and prediction map P, with \(g_{ij} \in G\) and \(p_{ij} \in P\). The second part is the same as Eq. (2), but weighted by an empirically optimized constant \(K=0.3\).
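Putting Eq. (3) together, a sketch of the combined objective might look as follows; symbols are as defined above, and this is our reading of the equation rather than the authors' code:

```python
# Sketch of Eq. (3): Dice overlap loss on the prediction map plus the
# count loss of Eq. (2) weighted by K = 0.3.
import tensorflow as tf

K_COUNT = 0.3

def detection_loss(g, p, c_true, c_pred):
    intersection = tf.reduce_sum(p * g)
    dice_loss = 1.0 - 2.0 * intersection / (
        1.0 + tf.reduce_sum(p) + tf.reduce_sum(g))
    count_loss = 1.0 - 1.0 / (
        1.0 + tf.reduce_mean(tf.abs(c_pred - c_true)))
    return dice_loss + K_COUNT * count_loss
```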

Horizontal and vertical flipping was applied to training patches to increase the amount and diversity of our data.

3.4 Cell Classification

In our dataset, there were five types of cells: CD8, GAL8+ pSTAT−, GAL8+ pSTAT+ strong, GAL8+ pSTAT+ moderate, and GAL8+ pSTAT+ weak. GAL8+ pSTAT+ cells were divided according to the expression level of pSTAT into strong, moderate, and weak. However, discriminating among GAL8+ pSTAT+ cells is challenging, even for experts. Inspired by the principle of divide-and-conquer algorithms, we converted the problem into a multi-stage classification. The first classifier (classifier1) differentiates between CD8, GAL8+ pSTAT−, and all GAL8+ pSTAT+ cells. A second classifier (classifier2) was then trained to further divide GAL8+ pSTAT+ cells into GAL8+ pSTAT+ strong, GAL8+ pSTAT+ moderate, and GAL8+ pSTAT+ weak.

Both classifiers were trained using \(28\,\times \,28\,\times \,3\) patches, which cover the whole cell area for the majority of cells. A similar network architecture was used for both classifiers. Each classifier has a feature extraction section and a classification section. The feature extraction part is a modified version of the VGG architecture [14], consisting of four convolutional layers of \(\{32,\ 64,\ 128,\ 128\}\) neurons with filter size \(3\,\times \,3\), stride 1, and “same” padding. Each convolutional layer was followed by \(2\,\times \,2\) max-pooling. The classification part consists of two dense layers of \(\{200,\ 3\}\) neurons with a dropout layer (\(\text {rate}=0.3\)) in between. Softmax activation was applied to the last dense layer and ReLU to the other layers. A categorical cross-entropy objective function was used. Parameters of the layers were initialized using uniform Glorot initialization [11] and optimized using Adam [12] with a learning rate of \(10^{-4}\). To handle class imbalance, in each mini-batch an equal number of patches from all cell types was fed to the network, and the number of iterations was determined by the number of patches in the most under-represented class (a sketch of this balanced sampling is shown below). Moreover, runtime augmentation of flipping and zooming with scale \(s \in [0.85,\ 1.15]\) was applied with probability \(p=0.4\), where s and p were empirically optimized.
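A minimal sketch of such class-balanced sampling, assuming patches are grouped by label in a dictionary; `patches_by_class` and `per_class` are illustrative names, not from the paper:

```python
# Sketch of class-balanced mini-batch sampling: each batch draws an equal
# number of patches per cell type.
import numpy as np

def balanced_batches(patches_by_class, per_class=8, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    labels = sorted(patches_by_class)
    while True:
        xs, ys = [], []
        for label in labels:
            pool = patches_by_class[label]
            idx = rng.choice(len(pool), size=per_class, replace=True)
            xs.append(pool[idx])
            ys += [label] * per_class
        yield np.concatenate(xs), np.array(ys)
```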

4 Results and Discussion

The proposed deep learning based unified cell detection and classification pipeline was evaluated on mIHC whole-tumor slide images. The proposed approach was implemented in Python, using the Keras API [15] for the deep learning pipeline.

To investigate whether convolutional neural networks (CNNs) can regress the number of cells in an input image, the proposed cell counter model was trained and then evaluated on pseudo-segmentation images of the test patches before being integrated into ConCORDe-Net. A Pearson correlation of \(r=0.999\) was obtained between the true and predicted numbers of cells. This high correlation supports that the proposed cell counter network can be used as a cell count approximation function.

Quantitatively, we evaluated ConCORDe-Net using standard metrics: precision, recall, and F1-score. A detection was considered a true positive if it lay within a Euclidean distance of 8 pixels (2r, where r is defined in Eq. (1)) of a ground truth annotation.
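For illustration, this matching rule might be implemented greedily as follows; this is a sketch, since the exact matching procedure is not specified in the paper:

```python
# Sketch of detection scoring: a detection is a true positive if it falls
# within `radius` pixels of a not-yet-matched ground truth dot.
import numpy as np

def detection_f1(pred, gt, radius=8):
    if len(gt) == 0 or len(pred) == 0:
        return 0.0, 0.0, 0.0
    matched = np.zeros(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        d = np.linalg.norm(np.asarray(gt) - np.asarray(p), axis=1)
        d[matched] = np.inf  # each ground truth dot matches at most once
        j = int(np.argmin(d))
        if d[j] <= radius:
            matched[j] = True
            tp += 1
    precision = tp / len(pred)
    recall = tp / len(gt)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```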

Moreover, we compared ConCORDe-Net with the state-of-the-art methods MapDe [5] and U-Net [13], as shown in Table 2. The same data augmentation as explained in Sect. 3.3 was applied to all models in the table. U-Net [13] was trained to regress the pseudo-segmentation explained in Sect. 3.1. The output of each CNN model in Table 2 is a probability map that approximates the pseudo-segmentation. The centers of cells were regressed from the probability map as follows. Firstly, a global threshold maximizing the F1-score was applied for each model to generate a binary image. Secondly, a hole-filling morphological operation was applied to remove holes created by thresholding. Finally, the center of every connected component was computed, which corresponds to the center of a cell. ConCORDe-Net achieved the highest recall and F1-score compared to the state-of-the-art methods MapDe [5] and U-Net [13]. Moreover, for both ConCORDe-Net and U-Net [13], integrating the cell counter CNN improved the cell detection F1-score. For MapDe [5], we used the parameters specified in the paper; tuning the dimensions of its “mapping filter” might improve its results.
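A sketch of this three-step post-processing using SciPy, with the ConCORDe-Net threshold T = 0.85 as the default; the function name is ours:

```python
# Sketch of post-processing: global threshold, hole filling, then the
# centroid of each connected component as a cell center.
import numpy as np
from scipy.ndimage import binary_fill_holes, label, center_of_mass

def probability_map_to_centers(prob_map, threshold=0.85):
    binary = prob_map > threshold
    binary = binary_fill_holes(binary)
    labeled, n = label(binary)
    # One (row, col) centroid per connected component ~ one cell center.
    return center_of_mass(binary, labeled, range(1, n + 1))
```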

Table 2. Cell detection performance comparison. Model1 is ConCORDe-Net with the cell counter removed. U-Net [13] + Cell Counter is the CNN obtained by integrating the cell counter CNN into the original U-Net [13] architecture.

The precision of ConCORDe-Net was lower than that of the three other methods for the following reasons: (1) ConCORDe-Net identifies weakly stained cells that were missed by the other methods, and which could be missed by experts too. (2) It over-detects large cells when there is more than one intensity peak within the cell. We believe that these limitations could be mitigated by training and validating on a larger cohort.

Fig. 2.

(a) Cell count loss profile. ROC and AUC evaluation of (b) classifier1 and (c) classifier2 on test data, where s = strong, m = moderate, w = weak.

The performance of the proposed classifier models was evaluated using receiver operating characteristic (ROC) curves, area under the curve (AUC), accuracy, precision, recall, and F1-score on the test data shown in Table 1. ROC curves and AUC values for classifier1 are presented in Fig. 2b. An AUC value greater than 0.99 was achieved for all cell types. The overall accuracy computed on the original distribution of the data was around \(98\%\). Moreover, precision, recall, and F1-score were all 0.98. Figure 2c shows the ROC curves and AUC values for classifier2. For all cell types, the AUC value was higher than 0.97, and an overall accuracy of around \(93\%\) was obtained. After cascading the two classifiers, an overall accuracy of \(96.5\%\) was achieved.

Fig. 3.

Illustrative examples of the proposed unified cell detection and classification on test data, and comparison with the state-of-the-art methods MapDe [5] and U-Net [13]. White, red, yellow, cyan, and dark green points represent CD8, GAL8+ pSTAT−, GAL8+ pSTAT+ strong, GAL8+ pSTAT+ moderate, and GAL8+ pSTAT+ weak cells, respectively. The red circles on the top-left input images highlight cells that were missed by MapDe [5] and U-Net [13] but detected by ConCORDe-Net. (Color figure online)

Figure 3 shows visual outputs of ConCORDe-Net followed by cell classification, and a comparison with MapDe [5] and U-Net [13], which use Dice overlap loss as the objective function. ConCORDe-Net is better at discerning touching cells with weak boundary gradients and at detecting weakly stained GAL8+ pSTAT− cells than MapDe [5] and U-Net [13]. By regularizing the objective function with the cell count, the network was able to learn patterns that separate closely located cells and identify weakly stained cells.

5 Conclusions

In this paper, we proposed a deep learning based unified cell detection and classification method for mIHC whole-tumor slide images of breast cancer. A cell count regularized CNN was employed for cell detection, followed by a multi-stage CNN to classify cells. The parameters of the cell detection architecture were learnt using a new objective function which optimizes Dice overlap and cell count. An F1-score of 0.873 was achieved on test data, outperforming the state-of-the-art methods MapDe [5] and U-Net [13]. Our proposed approach is better at detecting closely located and weakly stained cells than MapDe [5] and U-Net [13]. Moreover, a \(96.5\%\) classification accuracy was achieved. Our experiments show that incorporating problem-specific knowledge such as cell count improves the robustness of the cell detection algorithm.