1 Introduction

An endocytoscope was recently developed as a new endoscopic imaging modality for minimally invasive diagnosis. Endocytoscopy enables direct observation of cells and their nuclei on the colon wall at up to 500× ultramagnification, as shown in Fig. 1. However, extensive pathological knowledge and clinical experience are necessary for accurate endocytoscopic diagnosis. Automated pathological diagnosis is therefore required to support physicians and prevent neoplastic lesions from being overlooked [12, 18, 19]. This automated pathological diagnosis is achieved by robust two-category image classification. Figure 2 shows typical examples of neoplastic and non-neoplastic endocytoscopic images. The differences between the two categories appear as differences in texture, as shown in Fig. 2.

Fig. 1.

(a) Endocytoscopy (CF-H290ECI, Olympus, Tokyo). (b) Conventional endoscopic view of a polyp captured by an endocytoscope. (c) Ultramagnified view by an endocytoscope. Small blue spots represent cell nuclei. In (b) and (c), the polyp's surface is stained with methylene blue. (Color figure online)

Fig. 2.

Typical examples of endocytoscopic images of neoplastic and non-neoplastic polyps. (a), (b), and (c) are categorised as neoplastic or non-neoplastic polyps.

Robust two-category classification is a fundamental problem in pattern recognition, since multi-category classification is also based on two-category classification concepts. Robust classification can be achieved by an optimal pipeline of feature extraction, feature selection, and classification of the selected features. Recent approaches in image pattern recognition adopt deep learning architectures [15, 16, 23] as a full pipeline from feature extraction to classification of the extracted features. This deep learning approach can achieve robust multi-category classification given a sufficiently large training dataset. However, deep learning fails to find optimal parameters with a small dataset, and in medical image classification, collecting a large amount of training data covering sufficient patient cases is difficult. Therefore, we have to tackle medical image classification, especially for a new medical modality, using handcrafted features and a robust classifier. Previous works [12, 18, 19, 25, 26] adopted texture-based feature extraction and support-vector-machine classification without feature selection.

Principal component analysis (PCA) is a fundamental methodology for analysis and compression in data processing [14]. PCA is also used for feature selection [2, 20], where it finds a small number of principal components that represent the patterns in a category. This feature selection is useful and optimal for representing the distribution of a pattern with respect to the mean squared error. However, it is not optimal for classifying patterns of different categories: principal components that are common or similar among the categories can lead to incorrect classification. Fukunaga and Koontz proposed a feature-selection method using PCA for clustering [7]. Their method removes the features shared by the two categories. The validity of their method was experimentally demonstrated on phantom data, where the means of the patterns in each category are known. Fukui et al. proposed a constraint mutual subspace method [5] that attempts to remove the common subspace among patterns of different categories. However, this method was designed only for the mutual subspace method [17].

Fig. 3.

Interpretation of the relation between category subspaces by the canonical angles between them: (a) linear separation by a hyperplane for ideal features; (b) before and after feature selection.

Fig. 4.

Grassmann distance between two category subspaces. (a) and (b) show the Grassmann distance between category subspaces in a D-dimensional space for different feature selections \(\psi \) and \(\tilde{\psi }\). In (a), the Grassmann distance for the \(m_L\)-dimensional subspace is larger than that for the \(m_S\)-dimensional subspace in (b). An optimal manifold gives the largest distance between the two category subspaces for classification.

We propose a new feature-selection method for the linear classification of neoplastic and non-neoplastic lesions in endocytoscopic images. This selection is achieved by searching for an optimal manifold for classification. Figure 3 summarises the concept of our feature-selection method. Triangles \(\mathcal {L}_1\) and \(\mathcal {L}_2\) represent the linear category subspaces spanned by the features of each category. An ideal discriminative feature gives a hyperplane between the features of the two categories, as shown in Fig. 3(a); such ideal feature extraction allows robust classification by a linear classifier. The difference between two category subspaces can be represented by the canonical angles between them. If a feature contains worthless elements, it gives a common subspace of the two categories with small canonical angles, and these worthless elements can lead to classification errors. The overlap region of the triangles in Fig. 3(b) shows the common subspace of the two categories. For feature selection, we have to find a map \(\varPsi \) that maximises the canonical angles between the two category subspaces on a manifold, as shown in Fig. 3(b). By projection with \(\varPsi \), we obtain discriminative features, as shown in Fig. 3(a). To find the linear map \(\varPsi \), we use feature normalisation [7] and the Grassmann distance [4, 8, 10, 22].

Feature normalisation clarifies the importance of each eigenvector for category representation in PCA [7]. The Grassmann distance represents the difference between two linear subspaces by their canonical angles [1, 4, 8, 10, 22, 24]. We obtain discriminative features by selecting the eigenvectors of the normalised features that give the maximum Grassmann distance between the two category subspaces, as shown in Fig. 4. Our proposed method can analyse the extracted features and improve the classification accuracy. Furthermore, we integrated our proposed method into the classification pipeline shown in Fig. 5, which outputs the probability of each category for practical applications. In practical applications such as computer-aided diagnosis, accurate probabilities for each category are helpful information for physicians during diagnosis. We evaluated the proposed method and its classification procedure on the classification of neoplastic and non-neoplastic colorectal endocytoscopic images in numerical experiments.

Fig. 5.

Processing flow of the classification of an endocytoscopic image. We use texture features for feature extraction. Classification is achieved by a linear support vector machine with probability estimation. In the classification, we utilise a rejection option based on the estimated probabilities. To select the discriminative features, we propose a new selection method.

2 Mathematical Preliminaries

2.1 Linear Subspace of a Pattern

We have a set \(\{ \varvec{x}_i \}_{i=1}^N\) of N D-dimensional features extracted from observed patterns \( \{ \mathcal {X}_i \}_{i=1}^N\) of a category, with the condition \(D \ll N\). These extracted features span the linear subspace \(\mathcal {L} = \mathrm {span}(\varvec{x}_1, \dots , \varvec{x}_N )\) for the category. We call this linear subspace a category subspace, which is approximated by

$$\begin{aligned} \mathcal {L} \approx \mathrm {span}(\varvec{y}_1,\dots , \varvec{y}_m), \end{aligned}$$
(1)

where \(m \le D\). In standard approaches such as subspace methods [11, 17, 20, 28], we obtain the bases \(\varvec{y}_i, \ i=1,2, \dots , m\), from the eigendecomposition

$$\begin{aligned} \varvec{My} = \eta \varvec{y}, \ \ \varvec{M} = \frac{1}{N} \sum _{i=1}^N \varvec{x}_i\varvec{x}_i^{\top }. \end{aligned}$$
(2)

In this eigendecomposition, an eigenvector \(\varvec{y}_i\) corresponds to an eigenvalue \(\eta _i\) with the conditions \(\eta _1 \ge \eta _2 \ge \dots \ge \eta _D \) and \(\eta _i \ge 0\). We project an input feature vector \(\varvec{x}\) onto the linear subspace \(\mathrm {span}(\varvec{y}_1,\dots , \varvec{y}_m)\) by \(\varvec{Y}^{\top }\varvec{x}\), where \(\varvec{Y} = [\varvec{y}_1, \varvec{y}_2, \dots , \varvec{y}_m] \).
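As a concrete illustration, the construction of a category subspace reduces to the eigendecomposition in Eq. (2). The following is a minimal NumPy sketch of this step; the function name and the use of numpy.linalg.eigh are our choices, not part of the original method description.

```python
import numpy as np

def category_subspace(X, m):
    """Basis Y of the m-dimensional category subspace of Eq. (1),
    from the autocorrelation matrix of Eq. (2).
    X is a D x N matrix whose columns are the feature vectors."""
    D, N = X.shape
    M = X @ X.T / N                  # autocorrelation matrix, Eq. (2)
    eta, Yfull = np.linalg.eigh(M)   # eigenvalues in ascending order
    order = np.argsort(eta)[::-1]    # reorder: eta_1 >= ... >= eta_D >= 0
    return Yfull[:, order[:m]]       # D x m matrix of leading eigenvectors

# Projection of an input feature x onto span(y_1, ..., y_m): Y.T @ x
```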

2.2 Grassmannian and Canonical Angle

The Grassmannian manifold (Grassmannian) \(\mathcal {G} (m,D)\) is the set of m-dimensional linear subspaces of \(\mathbb {R}^D\) [8]. An element of \(\mathcal {G} (m,D)\) can be represented by an orthogonal matrix \(\varvec{Y}\) of size D by m, where \(\varvec{Y}\) is comprised of the m basis vectors for a set of patterns in \(\mathbb {R}^D\).

Let \(\varvec{Y}_1\) and \(\varvec{Y}_2\) be the orthogonal matrices of size \(D \times m\). Canonical angles (principal angles) \(0 \le \theta _1 \le \dots \le \theta _m \le \frac{\pi }{2}\) between two subspaces \(\mathcal {L}_1= \mathrm {span} (\varvec{Y}_1)\) and \(\mathcal {L}_2 = \mathrm {span} (\varvec{Y}_2)\) are defined by

$$\begin{aligned} \cos \theta _k = \max _{\varvec{y}_{1,k} \in \mathrm {span}(\varvec{Y}_1)} \ \max _{\varvec{y}_{2,k} \in \mathrm {span}(\varvec{Y}_2)} \ \varvec{y}_{1,k}^{\top } \varvec{y}_{2,k}, \quad \mathrm {s.t.} \ \varvec{y}_{1,k}^{\top }\varvec{y}_{1,i}=0, \ \varvec{y}_{2,k}^{\top }\varvec{y}_{2,i}=0, \end{aligned}$$
(3)

where \(i=1,2, \dots , k-1\).

For two linear subspaces \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we have projection matrices \(\varvec{P}=\varvec{Y}_1\varvec{Y}_1^{\top }\) and \(\varvec{Q} = \varvec{Y}_2\varvec{Y}_2^{\top }\). We also have a set of canonical angles \(\{ \theta _k \}_{k=1}^m\) between these two linear subspaces with the conditions \(\theta _1 \le \theta _2 \le \dots \le \theta _m\). We obtain the canonical angles from the solution of the eigendecomposition problem

$$\begin{aligned} \varvec{PQPu} = \lambda \varvec{u} \ \mathrm {or} \ \varvec{QPQu} = \lambda \varvec{u}. \end{aligned}$$
(4)

The solutions of these eigendecomposition problems coincide [17]. For the practical computation of canonical angles, we use the singular value decomposition

$$\begin{aligned} \varvec{Y}_1^{\top }\varvec{Y}_2 = \varvec{U} \varvec{\varSigma } \varvec{V}^{\top }, \end{aligned}$$
(5)

where \(\varvec{\varSigma }= \mathrm {diag}(\cos \theta _1, \cos \theta _2, \dots , \cos \theta _m)\) is a diagonal matrix and \(\varvec{U}, \varvec{V} \in \mathbb {R}^{m \times m} \) are orthogonal matrices. Note that we have the following relation,

$$\begin{aligned} \lambda _i = \cos ^2 \theta _i \end{aligned}$$
(6)

for eigenvalues \(\lambda _i, \ i=1, 2, \dots , m\), in the eigendecomposition in Eq. (4). Canonical angles are used to define the geodesic distance on a Grassmannian.
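For reference, Eqs. (5) and (6) give a direct recipe for computing canonical angles. Below is a minimal NumPy sketch; the clipping of singular values against floating-point round-off is our addition.

```python
import numpy as np

def canonical_angles(Y1, Y2):
    """Canonical angles between span(Y1) and span(Y2) via the SVD of
    Y1^T Y2 in Eq. (5). Y1 and Y2 are D x m with orthonormal columns.
    Returns theta_1 <= ... <= theta_m; lambda_i = cos(theta_i)**2, Eq. (6)."""
    sigma = np.linalg.svd(Y1.T @ Y2, compute_uv=False)  # cos(theta_i), descending
    sigma = np.clip(sigma, 0.0, 1.0)                    # guard against round-off
    return np.arccos(sigma)                             # ascending angles
```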

2.3 Grassmann Distances

For two points represented by \(\varvec{Y}_1\) and \(\varvec{Y}_2\) on a Grassmannian, we have the following seven distances:

1. \(d_p(\varvec{Y}_1,\varvec{Y}_2)\): projection metric
2. \(d_{\mu }(\varvec{Y}_1,\varvec{Y}_2)\): mean distance
3. \(d_{\mathrm {min}}(\varvec{Y}_1,\varvec{Y}_2)\): minimum canonical angle
4. \(d_{\mathrm {max}}(\varvec{Y}_1,\varvec{Y}_2)\): maximum canonical angle
5. \(d_{BC}(\varvec{Y}_1,\varvec{Y}_2)\): Binet-Cauchy metric
6. \(d_g(\varvec{Y}_1,\varvec{Y}_2)\): geodesic distance
7. \(d_{c}(\varvec{Y}_1,\varvec{Y}_2)\): Procrustes (chordal) distance

These seven distances are defined using the canonical angles between two linear subspaces. The projection metric and mean distance are defined by

$$\begin{aligned} d_p(\varvec{Y}_1, \varvec{Y}_2) = \left( \sum _{i=1}^m \sin ^2 \theta _i \right) ^{1/2} = \left( m - \sum _{i=1}^m \lambda _i \right) ^{1/2} \end{aligned}$$
(7)

and

$$\begin{aligned} d_{\mu }(\varvec{Y}_1, \varvec{Y}_2) = \frac{1}{m}\sum _{i=1}^m \sin ^2 \theta _i = 1 - \frac{1}{m}\sum _{i=1}^m \lambda _i. \end{aligned}$$
(8)

Furthermore, the sines of the minimum and maximum canonical angles,

$$\begin{aligned} d_{\mathrm {min}}(\varvec{Y}_1, \varvec{Y}_2) = \sin \theta _{\mathrm {1}} = (1 - \lambda _{1})^{1/2}, \end{aligned}$$
(9)
$$\begin{aligned} d_{\mathrm {max}}(\varvec{Y}_1, \varvec{Y}_2) = \sin \theta _{\mathrm {m}} = (1 - \lambda _{m})^{1/2}, \end{aligned}$$
(10)

are also used as distances on a Grassmannian. Moreover, the Binet-Cauchy distance is defined by

$$\begin{aligned} d_{BC}(\varvec{Y}_1, \varvec{Y}_2) = \left( 1 - \varPi _{i=1}^m \cos ^2 \theta _i \right) = \left( 1 - \varPi _{i=1}^m \lambda _i \right) , \end{aligned}$$
(11)

where the product of \(\cos \theta _i\) represents the similarity between two linear subspaces. Using the canonical angles, we have the geodesic distance

$$\begin{aligned} d_g(\varvec{Y}_1, \varvec{Y}_2) = \left( \sum _{i=1}^m \theta _i^2 \right) ^{1/2} = \left( \sum _{i=1}^m \left( \arccos \lambda _i^{1/2}\right) ^2 \right) ^{1/2} \end{aligned}$$
(12)

on a Grassmann manifold. We adopt the following definition of the Procrustes (chordal) distance,

$$\begin{aligned} d_{c}(\varvec{Y}_1, \varvec{Y}_2) = \underset{\varvec{R}_1, \varvec{R}_2 \in O(m)}{\mathrm {min}} \Vert \varvec{Y}_1\varvec{R}_1 - \varvec{Y}_2\varvec{R}_2 \Vert _{\mathrm {F}} = 2 \left( \sum _{i=1}^m \sin ^2 (\theta _i/2) \right) ^{1/2}, \end{aligned}$$
(13)

where \(\Vert \cdot \Vert _{\mathrm {F}}\) is the Frobenius norm.
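Since all seven distances are functions of the canonical angles, they can be computed together once the angles are known. The following sketch collects Eqs. (7)-(13) in one helper; the dictionary keys are our own labels.

```python
import numpy as np

def grassmann_distances(theta):
    """The seven distances of Sect. 2.3, computed from the canonical
    angles theta (ascending); see Eqs. (7)-(13)."""
    lam = np.cos(theta) ** 2                 # lambda_i = cos^2(theta_i)
    return {
        'projection':   np.sqrt(np.sum(np.sin(theta) ** 2)),           # Eq. (7)
        'mean':         np.mean(np.sin(theta) ** 2),                   # Eq. (8)
        'min':          np.sin(theta[0]),                              # Eq. (9)
        'max':          np.sin(theta[-1]),                             # Eq. (10)
        'binet_cauchy': 1.0 - np.prod(lam),                            # Eq. (11)
        'geodesic':     np.sqrt(np.sum(theta ** 2)),                   # Eq. (12)
        'procrustes':   2.0 * np.sqrt(np.sum(np.sin(theta / 2) ** 2)), # Eq. (13)
    }
```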

2.4 Normalisation of Features

We project the extracted features onto the discriminative feature space for accurate classification. We extract a set of features \(\{ \varvec{x}_i \in \mathbb {R}^D \}_{i=1}^N\) from a set of images with condition \(D \ll N\). Let \(\varvec{\mu } = \mathrm {E} (\varvec{x}_i)\) be the mean of the features. By setting \( \bar{\varvec{x}}_i = \varvec{x}_i - \varvec{\mu }\), we obtain a set of centred features \(\{ \bar{\varvec{x}}_i \}_{i=1}^N \). We assume each image belongs to either category \(\mathcal {C}_1\) or \(\mathcal {C}_2\). Therefore, set \(\{ \bar{\varvec{x}}_i \}_{i=1}^N \) is divided into two sets \(\{ \varvec{x}_i^{(1)} \}_{i=1}^{N_1}\) and \(\{ \varvec{x}_i^{(2)} \}_{i=1}^{N_2}\), where \(N_1\) and \(N_2\), respectively, represent the number of images in the first and second categories.

We define the autocorrelation matrices in the centred feature space as

$$\begin{aligned} \varvec{A}_1 = \frac{1}{N_1}\varvec{X}_1\varvec{X}_1^{\top }, \ \varvec{A}_2 = \frac{1}{N_2} \varvec{X}_2\varvec{X}_2^{\top }, \end{aligned}$$
(14)

where \(\varvec{X}_1 = [\varvec{x}^{(1)}_1, \varvec{x}^{(1)}_2, \dots , \varvec{x}^{(1)}_{N_1}]\) and \(\varvec{X}_2 = [\varvec{x}^{(2)}_1, \varvec{x}^{(2)}_2, \dots , \varvec{x}^{(2)}_{N_2}]\), for the two categories. We define the covariance matrix of all the features as

$$\begin{aligned} \varvec{C}= P(\mathcal {C}_1) \varvec{A}_1 + P(\mathcal {C}_2)\varvec{A}_2, \end{aligned}$$
(15)

where we set \(P(\mathcal {C}_1)= \frac{N_1}{N_1+N_2}\) and \(P(\mathcal {C}_2) = \frac{N_2}{N_1+N_2}\). Fukunaga [7] used the autocorrelation matrix of all the features instead of \(\varvec{C}\) in Eq. (15). Fukunaga [6] used covariance matrices \(\varvec{C}_1\) and \(\varvec{C}_2\) of the two categories instead of \(\varvec{A}_1\) and \(\varvec{A}_2\) in Eq. (15). In this manuscript, we adopt the covariance matrix \(\varvec{C}\) of all the features to remove the common features of the two categories, since the autocorrelation matrices \(\varvec{A}_1\) and \(\varvec{A}_2\) include the gaps between the mean of all the features and the mean of each category. The covariance matrix gives the eigendecomposition problem \(\varvec{CV} = \varvec{V}\varvec{\varXi }\), where \(\varvec{\varXi }= \mathrm {diag}(\xi _1,\xi _2,\dots ,\xi _D)\) consists of eigenvalues \(\xi _i\), \(i=1,2, \dots , D\), with the condition \(\xi _1\ge \xi _2 \ge \dots \ge \xi _{D}\). The eigendecomposition derives a whitening matrix \(\varvec{W} = \varvec{\varXi }^{-\frac{1}{2}}\varvec{V}^{\top }\).
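A minimal sketch of this normalisation step is given below, assuming the centred features of the two categories are stored column-wise; the eps guard against numerically vanishing eigenvalues of \(\varvec{C}\) is our addition and is not discussed in the text.

```python
import numpy as np

def whiten(X1, X2, eps=1e-12):
    """Whitening matrix W from the pooled covariance C of Eq. (15) and
    the whitened autocorrelation matrix of category 1 (Eq. (16));
    A2_tilde follows as I - A1_tilde. X1 and X2 hold the centred
    D-dimensional features of the two categories column-wise."""
    N1, N2 = X1.shape[1], X2.shape[1]
    A1, A2 = X1 @ X1.T / N1, X2 @ X2.T / N2          # Eq. (14)
    p1, p2 = N1 / (N1 + N2), N2 / (N1 + N2)          # P(C_1), P(C_2)
    C = p1 * A1 + p2 * A2                            # Eq. (15)
    xi, V = np.linalg.eigh(C)
    order = np.argsort(xi)[::-1]
    xi, V = np.maximum(xi[order], eps), V[:, order]  # guard tiny eigenvalues
    W = np.diag(xi ** -0.5) @ V.T                    # W = Xi^(-1/2) V^T
    A1_tilde = W @ (p1 * A1) @ W.T                   # Eq. (16): A1_tilde + A2_tilde = I
    return W, A1_tilde
```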

Using this whitening matrix and Eq. (15), we obtain the following relation

$$\begin{aligned} \varvec{WCW}^{\top } = \varvec{W}P(\mathcal {C}_1) \varvec{A}_1 \varvec{W}^{\top } + \varvec{W}P(\mathcal {C}_2) \varvec{A}_2\varvec{W}^{\top } = \tilde{\varvec{A}}_1 + \tilde{\varvec{A}}_2 = \varvec{I}, \end{aligned}$$
(16)

where \(\varvec{I}\) is an identity matrix. The solutions of the eigenvalue problems

$$\begin{aligned} \tilde{\varvec{A}}_j \varvec{\phi }_i^{(j)} = \lambda _i ^{(j)}\varvec{\phi }_i^{(j)}, \end{aligned}$$
(17)

for \(j=1,2\), give the bases of the two category subspaces. From Eqs. (16) and (17), we have

$$\begin{aligned} \tilde{\varvec{A}}_2 \varvec{\phi }_i^{(2)} = (\varvec{I} - \tilde{\varvec{A}}_1) \varvec{\phi }_i^{(2)} = \lambda _i^{(2)} \varvec{\phi }_i^{(2)}. \end{aligned}$$
(18)

This leads to \(\tilde{\varvec{A}}_1 \varvec{\phi }_i^{(2)} = (1-\lambda _i^{(2)})\varvec{\phi }_i^{(2)} \). These relations give

$$\begin{aligned} \varvec{\phi }_i^{(2)} = \varvec{\phi }_{D-i+1}^{(1)} \end{aligned}$$
(19)

and

$$\begin{aligned} \lambda _i^{(2)} = 1- \lambda _{D-i+1}^{(1)}. \end{aligned}$$
(20)

Equations (19) and (20) show that both eigenvalue problems give the same set of eigenvectors and corresponding eigenvalues. Note that the two sets of eigenvalues are reversely ordered. The eigenvalue orders in Eq. (20) satisfy

$$\begin{aligned}&1 \ge \lambda _1^{(1)} \ge \lambda _2^{(1)} \ge \dots \ge \lambda _D^{(1)} \ge 0,&\end{aligned}$$
(21)
$$\begin{aligned}&0 \le 1- \lambda _1^{(1)} \le 1-\lambda _2^{(1)} \le \dots \le 1-\lambda _D^{(1)} \le 1.&\end{aligned}$$
(22)

These relations imply that the eigenvectors corresponding to the large eigenvalues of \(\tilde{\varvec{A}}_1\) contribute to the representation of the subspace for \(\mathcal {C}_1\), although they make only a minor contribution to the representation for \(\mathcal {C}_2\). Therefore, we obtain discriminative features by projecting the features onto the linear subspace given by \(\mathrm {span}(\varvec{\phi }_{1}^{(1)},\varvec{\phi }_{1}^{(2)},\varvec{\phi }_{2}^{(1)},\varvec{\phi }_{2}^{(2)}, \dots , \varvec{\phi }_{d}^{(1)},\varvec{\phi }_{d}^{(2)} )\). We discuss how to determine the number d in the next section.

2.5 Linear Classification with Rejection Option

We use a linear support vector machine (SVM) as a classifier. The SVM classifies an input feature vector \(\varvec{x} \in \mathbb {R}^D\) by \(\mathrm {sign}(f(\varvec{x}))\), where \(f(\cdot )\) is a decision function. The parameters and hyperparameters of this decision function are optimised by a training procedure with training data. We can estimate the probability of belonging to each category with the optimised decision function. For two categories with label \(\mathcal {L} \in \{ 0, 1\}\), we can approximately estimate the probabilities

$$\begin{aligned} P(\mathcal {L}=1 |\varvec{x}) \approx P(A,B,f(x)) = \frac{1}{1+\mathrm {exp}(Af(x)+B)}, \end{aligned}$$
(23)

where A and B are the parameters of Platt's method [21]. \(P(\mathcal {L}=0 |\varvec{x})\) is obtained as \(1-P(\mathcal {L}=1 |\varvec{x})\). After training the decision function, we obtain A and B by maximum-likelihood estimation on the training dataset. In our method, we represent the non-neoplastic and neoplastic categories by 0 and 1. The SVM can output probabilities [21] \(p_\mathrm{{non}}\) and \(p_\mathrm{{neo}}\) that satisfy \(p_\mathrm{{non}}+p_\mathrm{{neo}}=1\) and \(p_\mathrm{{non}},p_\mathrm{{neo}} \in [0,1]\) for non-neoplastic and neoplastic images. We adopt a rejection option to remove low-confidence classifications [2]. The rejection option discards classifications with low probabilities such that \(\mathrm {max}\ (p_{\mathrm {non}}, p_{\mathrm {neo}}) < \kappa \), where the criterion \(\kappa \) was decided by preliminary experiments.
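For illustration, scikit-learn's SVC (which wraps libsvm) exposes a Platt-scaled probability estimate via probability=True, so the classification-with-rejection step can be sketched as follows; X_train, y_train, X_test, and the value of \(\kappa \) are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# probability=True enables Platt scaling (Eq. (23)) fitted on the training data
clf = SVC(kernel='linear', C=1.0, probability=True)
clf.fit(X_train, y_train)            # y: 0 = non-neoplastic, 1 = neoplastic

proba = clf.predict_proba(X_test)    # columns follow clf.classes_; rows sum to 1
kappa = 0.70                         # rejection criterion from preliminary experiments
confident = proba.max(axis=1) >= kappa
y_pred = np.where(confident, proba.argmax(axis=1), -1)  # -1 marks rejected images
```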

3 Feature-Selection Method

We have a set \(\{ \mathcal {X}_i \}_{i=1}^N \) of three-channel images. We extract a feature vector \(\varvec{x}_i \in \mathbb {R}^D\) from each image \(\mathcal {X}_i\). In the same manner as in Sect. 2.4, we divide \(\{ \varvec{x}_i \}_{i=1}^N \) into two sets \(\{ \varvec{x}_i^{(1)} \}_{i=1}^{N_1}\) and \(\{ \varvec{x}_i^{(2)} \}_{i=1}^{N_2}\), where \(N_1\) and \(N_2\) represent the numbers of images in the first and second categories.

We find a linear map \(\varPsi (\varvec{x})=\varvec{P}^{*}\varvec{W}(\varvec{x}-\varvec{\mu })\) by solving

$$\begin{aligned} \mathrm {arg} \ \underset{\varvec{P}}{\mathrm {min}} (\mathrm {rank} (\varvec{P}\varvec{P}^{\top } )) \quad \mathrm {s.t.} \ \ \ \underset{\varvec{P}}{\mathrm {max}}(d(\varPsi (\varvec{Y}_1), \varPsi (\varvec{Y}_2))) \end{aligned}$$
(24)

where \(d(\cdot ,\cdot )\) represents one of the seven Grassmann distances. In this equation, \(\max (\cdot )\) returns one or more matrices that give the maximum distance, and \(\min (\cdot )\) selects the matrix with the minimum rank among them. The solution is an orthogonal matrix \(\varvec{P}^{*}= [\varvec{\phi }_{1}^{(1)},\varvec{\phi }_{1}^{(2)},\varvec{\phi }_{2}^{(1)},\varvec{\phi }_{2}^{(2)}, \dots , \varvec{\phi }_{m}^{(1)},\varvec{\phi }_{m}^{(2)}]^{\top }\), which comprises 2m eigenvectors. Algorithm 1 summarises the procedure for finding \(\varvec{P}^{*}\) from the training data; it also returns the selected features for the training data (a sketch follows below).

Algorithm 1. Procedure for finding \(\varvec{P}^{*}\) and the selected features from the training data.
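Since Algorithm 1 is given as a figure, the following sketch is our reading of the procedure, reusing the helpers sketched in Sect. 2. The sweep over m realises Eq. (24): among selections attaining the maximum Grassmann distance, the ascending sweep with a strict comparison keeps the smallest rank. Building m-dimensional category subspaces from the projected features is our assumption.

```python
import numpy as np

def select_features(X1, X2, m_max, dist_key='mean'):
    """Sketch of Algorithm 1 / Eq. (24): find P* = [phi_1^(1), phi_1^(2), ...]
    by sweeping the number m of eigenvector pairs and keeping the smallest m
    that attains the maximum Grassmann distance. Reuses category_subspace,
    canonical_angles, grassmann_distances, and whiten from earlier sketches."""
    mu = np.hstack([X1, X2]).mean(axis=1, keepdims=True)
    X1c, X2c = X1 - mu, X2 - mu              # centred features
    W, A1_tilde = whiten(X1c, X2c)
    lam, Phi = np.linalg.eigh(A1_tilde)
    Phi = Phi[:, np.argsort(lam)[::-1]]      # phi_1^(1), ..., phi_D^(1)
    D = Phi.shape[0]
    assert 2 * m_max <= D
    best_dist, best_P = -np.inf, None
    for m in range(1, m_max + 1):
        # Eq. (19): phi_i^(2) = phi_{D-i+1}^(1), i.e. the same columns
        # read from the other end of the spectrum
        cols = list(range(m)) + list(range(D - m, D))
        P = Phi[:, cols].T                   # 2m x D projection matrix
        Z1, Z2 = P @ W @ X1c, P @ W @ X2c    # selected features Psi(x)
        Y1 = category_subspace(Z1, m)        # m-dim subspaces (our assumption)
        Y2 = category_subspace(Z2, m)
        d = grassmann_distances(canonical_angles(Y1, Y2))[dist_key]
        if d > best_dist:                    # strict '>' keeps the minimum rank
            best_dist, best_P = d, P
    return best_P, mu, W                     # Psi(x) = best_P @ W @ (x - mu)
```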

4 Numerical Experiments

We evaluated our feature-selection method by applying it to the classification of neoplastic and non-neoplastic images of colon polyps. We used images of the magnified surfaces of polyps captured by an endocytoscope (CF-H290ECI, Olympus, Tokyo) with IRB approval. The neoplastic and non-neoplastic labels of the images were annotated by expert physicians. The numbers of images of neoplastic and non-neoplastic polyps are summarised in Table 1. We extracted two kinds of features from these images: the Haralick features [9] and convolutional neural network (CNN) features [15]. In this section, we first compare the classification accuracy before and after feature selection for the Haralick features, and then compare the classification accuracy before and after feature selection for the combined Haralick and CNN features. We finally analyse the relation between the output probability and the accuracy with the rejection option. This relation clarifies the validity of our feature-selection method as a manifold optimisation. In the final analysis, we also show the performance of two-category classification with the rejection option.

In these evaluations, we used an SVM [27] for the classifications and trained it on the training data with the best hyperparameters, which we obtained by five-fold cross validation on the training data. For the practical computation of the SVM, we used libsvm [3]. We note that a kernel SVM with a radial basis function gave lower classification accuracy than the linear SVM for the endocytoscopic images in our preliminary experiments. The classification accuracy of the original Haralick, CNN, and combined features without feature selection is summarised in Table 2.

Table 1. Dataset details: 14,840 training and 4,126 test images.
Table 2. Dimension and classification accuracy for extracted features with SVM.
Table 3. Dimensions and classification accuracy of the selected Haralick features for each Grassmann distance.
Table 4. Dimensions and classification accuracy of the selected combined Haralick and CNN features for each Grassmann distance.

4.1 Haralick Feature

The Haralick feature represents texture information measured by fourteen statistics of a gray-level co-occurrence matrix. In this case, we used thirteen statistics for eight directions at three scales. Before computing the statistics, we applied contrast normalisation to each local region to achieve robustness against illumination changes. We then extracted a 312-dimensional Haralick feature vector from each image, each element of which is normalised to the range [0, 1]. We applied the proposed feature-selection method to the normalised Haralick features.
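For illustration only, a rough stand-in for this step can be sketched with scikit-image's GLCM utilities. Note the mismatch with the paper: graycoprops offers six statistics rather than the thirteen used here, and the scale values and the per-image scaling are our assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img):
    """Illustrative GLCM texture vector over 8 directions and 3 scales.
    gray_img is a 2-D uint8 array. This is not a reproduction of the
    312-dimensional Haralick feature of the paper."""
    angles = np.arange(8) * np.pi / 4      # 8 directions
    distances = [1, 2, 4]                  # 3 scales (assumed values)
    glcm = graycomatrix(gray_img, distances, angles, levels=256, normed=True)
    props = ['contrast', 'dissimilarity', 'homogeneity',
             'energy', 'correlation', 'ASM']
    feat = np.hstack([graycoprops(glcm, p).ravel() for p in props])
    # Per-image min-max scaling to [0, 1]; the paper normalises per element
    return (feat - feat.min()) / (feat.max() - feat.min() + 1e-12)
```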

Figure 7(a) shows the eigenvalues of the whitened autocorrelation matrices for the two categories, and Figs. 7(b) and (c) show the Grassmann distances. Table 3 summarises the dimensions and classification accuracy for each feature selection with respect to the seven Grassmann distances.

4.2 Combination of Haralick and CNN Features

We combined the Haralick and CNN features into a single feature vector. Figure 6 illustrates the architecture of our CNN. Its settings were decided by preliminary experiments [12], in which we compared several parameter settings with AlexNet [15], VGG Net [23], and their modified versions; our shallow architecture gave the best classification accuracy among them. Before extracting the CNN features, we trained the CNN on the training data from scratch, applying batch normalisation and dropout to the convolutional and fully connected layers. We used the values in fully connected layer F7 in Fig. 6 for feature extraction. Each element of the extracted CNN feature was normalised to the range [0, 1]. For the CNN implementation, we used the Caffe platform [13]. We combined the normalised Haralick and normalised CNN features into an 888-dimensional column vector.

We applied the proposed method to these normalised combined features. Figure 7(d) shows the eigenvalues of the whitened autocorrelation matrices for the two categories, and Figs. 7(e) and (f) show the Grassmann distances. Table 4 summarises the dimensions of the selected features and the classification accuracy for the selection with respect to the seven Grassmann distances.

Fig. 6.

Architecture of convolutional neural network for feature extraction.

Fig. 7.

Eigenvalues and Grassmann distances in feature selection: upper and lower rows represent results for the Haralick features and the combined features, respectively. (a) and (d) show the eigenvalues of \(\tilde{\varvec{A}}_1\) and \(\tilde{\varvec{A}}_2\) for non-neoplastic and neoplastic images; the horizontal and vertical axes represent the eigenvalue indices and the eigenvalues. (b), (c), (e) and (f) summarise the Grassmann distance after projection with respect to \(\varvec{P}\) given by a criterion \(\tau \); the horizontal and vertical axes represent \(\tau \) and the Grassmann distance.

4.3 Analysis of an Optimal Manifold

We evaluated the classification accuracy with respect to divided ranges of the output probabilities. The relation between the classification accuracy and the output probability represents the characteristics of the optimal manifold given by our feature selection. Figure 8(a) illustrates the distributions of the output probabilities for the original Haralick features and the selected features. Table 5 summarises the accuracy for each range of the output probabilities for the original Haralick features and the selected features. We also evaluated the classification accuracy with the rejection option for \(\kappa \in \{0.50, 0.55, \dots , 0.90\}\). The rejection rate is the ratio of rejected images in the test data. The classification results with the rejection option are summarised in Fig. 8(b). We obtained classification accuracies of 82.1%, 82.9%, and 84.7% for the original Haralick features, the selected Haralick features, and the selected combined features, respectively, at the same rejection rate of about 20%; note that the percentage of inappropriate images in practical diagnosis is close to 20–30%. In these cases, we set \(\kappa \) to 0.70, 0.60, and 0.72, respectively. Figure 9 shows examples of rejected images in the classification.

5 Discussion

The curves shown in Fig. 7(a) imply that only a small number of eigenvectors are discriminative for classification, since almost all the eigenvalues are close to 0.5. Figures 7(b) and (c) imply that \(d_{\mu }\) gives the largest distance with the fewest selected eigenvectors. The accuracy is improved after the selection based on \(d_{\mu }\), as shown in Table 3. The dimension of the selected features is 30% of that of the original Haralick feature.

The curves shown in Fig. 7(d) imply that there are no particularly discriminative eigenvectors, since the eigenvalues are distributed almost uniformly from zero to one. In Table 4, \(d_g\), \(d_p\), and \(d_c\) give the largest distance with all the eigenvectors; in this case, the features are just whitened and used for classification. In both the Haralick and combined-feature cases, our proposed method found discriminative features and improved the classification accuracy.

Table 5. Classification accuracy with respect to output probability. This table summarises the classification accuracy for the original Haralick, the selected Haralick, and the Haralick\(+\)CNN features. The classification accuracy is computed for each range of the output probability p, where p represents \(\mathrm {max}(p_{\mathrm {non}}, p_{\mathrm {neo}})\). The classification accuracy of the selected Haralick features almost coincides with the mean of each range of output probability.
Fig. 8.

Analysis of the optimal manifold. (a) Distributions of the output probabilities for the original and selected features: the figure illustrates the probability distributions for the original Haralick, the selected Haralick, and the Haralick\(+\)CNN features. The vertical axis represents the percentage of output probabilities in each range; the horizontal axis represents the divided ranges. In this figure, p is given as \(\mathrm {max}(p_{\mathrm {non}}, p_{\mathrm {neo}})\). (b) Receiver operating characteristic (ROC) curves for classification accuracy and rejection rate: the vertical and horizontal axes represent the classification accuracy and the rejection rate for the original Haralick, selected Haralick, and Haralick\(+\)CNN features.

The results summarised in Fig. 8(b) indicate that the rejection option improved the classification accuracy. The rejection option correctly removed low-confidence classifications in both the Haralick and combined-feature cases. Table 5 also supports the validity of the rejection option: we observe low classification accuracy for low output probabilities. Figure 9 shows that the rejected images are inappropriate due to bad observation conditions: over-staining, bad illumination, and a lack of discriminative texture.

The mean distance gives the best feature selection for the Haralick feature. In this case, the distribution of the output probabilities was characteristic: the proportion is low for \(p > 0.9\), medium for \(0.6 \ge p > 0.5\), and highest for \(0.8 \ge p > 0.7\). For these distributions, the output probability approximated the classification accuracy well, as shown in gray in Table 5. This characteristic relation between the classification accuracy and the output probability suggests the validity of the obtained manifold. In the case of the combined features, we did not observe the same characteristic, even though the selected combined features achieved the highest classification accuracy.

Fig. 9.

Examples of rejected endocytoscopic images in the classification: (a) insufficient texture for classification; (b) bad (too dark) illumination; (c) over-staining of a polyp. These images were labelled as inappropriate for practical diagnosis; in practice, medical doctors also recognised them as inappropriate.

6 Conclusions

We proposed a feature-selection method that improves two-category classification accuracy by searching for an optimal manifold for classification. We experimentally evaluated the proposed method by comparing the classification accuracy before and after feature selection on about 19,000 endocytoscopic images. The experimental results showed the validity of our proposed method through improved classification accuracy. Furthermore, we experimentally demonstrated the validity of the obtained optimal manifold by exploring the relation between the output probability and the classification accuracy; the output probability is helpful information for practical diagnosis. Moreover, we achieved robust classification with our feature-selection method and a linear classifier with a rejection option. The classification accuracy was 84.7% at a rejection rate of 20%, which is 7.2% higher than classification with the original Haralick features and SVM.