1 Introduction

An endocytoscope was recently developed as a new endoscopic imaging modality for minimally invasive diagnosis. Endocytoscopy enables direct observation of cells and their nuclei on the colon wall at up to 500× ultramagnification, as shown in Fig. 1. However, extensive pathological knowledge and clinical experience are necessary for accurate endocytoscopic diagnosis. Automated pathological diagnosis is therefore required to support physicians and prevent neoplastic lesions from being overlooked [12, 18, 19]. This automated pathological diagnosis is achieved by robust two-category image classification. Figure 2 shows typical examples of neoplastic and non-neoplastic endocytoscopic images. The differences between the two categories appear as differences in texture, as shown in Fig. 2.

Fig. 1.

(a) Endocytoscopy (CF-H290ECI, Olympus, Tokyo). (b) Conventional endoscopic view of a polyp captured by an endocytoscope. (c) Ultramagnified view by an endocytoscope. Small blue spots represent cell nuclei. In (b) and (c), the polyp's surface is stained with methylene blue. (Color figure online)

Fig. 2.

Typical examples of endocytoscopic images of neoplastic and non-neoplastic polyps. (a), (b), and (c) are categorised as neoplastic or non-neoplastic polyps.

Robust two-category classification is a fundamental problem in pattern recognition, since multi-category classification is also based on two-category classification concepts. Robust classification can be achieved by an optimal pipeline of feature extraction, feature selection, and classification of the selected features. Recent approaches in image pattern recognition adopt deep learning architectures [15, 16, 23] as a full pipeline from feature extraction to classification of the extracted features. This deep learning approach can achieve robust multi-category classification given a sufficiently large training dataset. However, deep learning fails to find optimal parameters with a small dataset, and in medical image classification, collecting a large amount of training data covering sufficient patient cases is difficult. Therefore, we have to tackle medical image classification, especially for a new medical modality, using handcrafted features and a robust classifier. Previous works [12, 18, 19, 25, 26] adopted texture-based feature extraction and support-vector-machine classification without feature selection.

Principal component analysis (PCA) is a fundamental methodology for analysis and compression in data processing [14]. PCA is also used for feature selection [2, 20], where it finds a small number of principal components that represent the patterns in a category. This feature selection is useful and optimal for representing the distribution of a pattern with respect to the mean squared error. However, it is not optimal for classifying patterns of different categories: principal components that are common or similar among the categories can lead to incorrect classification. Fukunaga and Koontz proposed a feature-selection method using PCA for clustering [7]. Their method removes the features shared by the two categories. The validity of their method was experimentally demonstrated on phantom data, where the means of the patterns in each category are known. Fukui et al. proposed a constraint mutual subspace method [5] that attempts to remove the common subspace among patterns of different categories. However, this method was designed only for the mutual subspace method [17].

Fig. 3.

Interpretation of the relation between category subspaces by the canonical angles between them: (a) linear separation by a hyperplane for ideal features; (b) before and after feature selection.

Fig. 4.

Grassmann distance between two category subspaces. (a) and (b) show the Grassmann distance between category subspaces in a D-dimensional space for different feature selections \(\psi \) and \(\tilde{\psi }\). In (a), the Grassmann distance for the \(m_L\)-dimensional subspace is larger than that for the \(m_S\)-dimensional subspace in (b). An optimal manifold gives the largest distance between the two category subspaces for classification.

We propose a new feature-selection method for the linear classification of neoplastic and non-neoplastic lesions in endocytoscopic images. This selection is achieved by searching for an optimal manifold for classification. Figure 3 summarises the concept of our feature-selection method. Triangles \(\mathcal {L}_1\) and \(\mathcal {L}_2\) represent the linear category subspaces spanned by the features of each category. An ideal discriminative feature gives a hyperplane between the features of the two categories, as shown in Fig. 3(a); such ideal feature extraction allows robust classification by a linear classifier. The difference between two category subspaces can be represented by the canonical angles between them. If a feature contains worthless elements, it gives a common subspace of the two categories with small canonical angles, and these worthless elements can lead to classification errors. The overlap region of the triangles in Fig. 3(b) shows the common subspace of the two categories. For feature selection, we have to find a map \(\varPsi \) that maximises the canonical angles between the two category subspaces on a manifold, as shown in Fig. 3(b). By projection with \(\varPsi \), we obtain discriminative features, as shown in Fig. 3(a). To find the linear map \(\varPsi \), we use feature normalisation [7] and the Grassmann distance [4, 8, 10, 22].

Feature normalisation clarifies the importance of each eigenvector for category representation in PCA [7]. The Grassmann distance represents the difference between two linear subspaces by their canonical angles [1, 4, 8, 10, 22, 24]. We obtain discriminative features by selecting the eigenvectors of the normalised features that give the maximum Grassmann distance between the two category subspaces, as shown in Fig. 4. Our proposed method can analyse the extracted features and improve the classification accuracy. Furthermore, we integrated our proposed method into the classification pipeline shown in Fig. 5, which outputs the probability of each category for practical applications. In practical applications such as computer-aided diagnosis, accurate probabilities for each category are helpful information for physicians during diagnosis. We evaluated the proposed method and its classification procedure on the classification of neoplastic and non-neoplastic colorectal endocytoscopic images in numerical experiments.

Fig. 5.

Processing flow of the classification of an endocytoscopic image. We use texture features for feature extraction. Classification is achieved by a linear support vector machine with probability estimation. In the classification, we utilise a rejection option based on the estimated probabilities. To select the discriminative features, we propose a new selection method.

2 Mathematical Preliminaries

2.1 Linear Subspace of a Pattern

We have a set \(\{ \varvec{x}_i \}_{i=1}^N\) of N D-dimensional features extracted from observed patterns \( \{ \mathcal {X}_i \}_{i=1}^N\) of a category, with the condition \(D \ll N\). These extracted features span the linear subspace \(\mathcal {L} = \mathrm {span}(\varvec{x}_1, \dots , \varvec{x}_N )\) for the category. We call this linear subspace a category subspace, which is approximated by

$$\begin{aligned} \mathcal {L} \approx \mathrm {span}(\varvec{y}_1,\dots , \varvec{y}_m), \end{aligned}$$
(1)

where \(m \le D\). In standard approaches such as subspace methods [11, 17, 20, 28], we obtain the bases \(\varvec{y}_i, \ i=1,2, \dots , m\), from the eigendecomposition

$$\begin{aligned} \varvec{My} = \eta \varvec{y}, \ \ \varvec{M} = \frac{1}{N} \sum _{i=1}^N \varvec{x}_i\varvec{x}_i^{\top }. \end{aligned}$$
(2)

In this eigendecomposition, an eigenvector \(\varvec{y}_i\) corresponds to an eigenvalue \(\eta _i\) with the conditions \(\eta _1 \ge \eta _2 \ge \dots \ge \eta _D \) and \(\eta _i \ge 0\). We project an input feature vector \(\varvec{x}\) onto the linear subspace \(\mathrm {span}(\varvec{y}_1,\dots , \varvec{y}_m)\) by \(\varvec{Y}^{\top }\varvec{x}\), where \(\varvec{Y} = [\varvec{y}_1, \varvec{y}_2, \dots , \varvec{y}_m] \).
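As a concrete illustration, the construction of a category subspace reduces to the eigendecomposition in Eq. (2). The following is a minimal NumPy sketch of this step; the function name and the use of numpy.linalg.eigh are our choices, not part of the original method description.

```python
import numpy as np

def category_subspace(X, m):
    """Basis Y of the m-dimensional category subspace of Eq. (1),
    from the autocorrelation matrix of Eq. (2).
    X is a D x N matrix whose columns are the feature vectors."""
    D, N = X.shape
    M = X @ X.T / N                  # autocorrelation matrix, Eq. (2)
    eta, Yfull = np.linalg.eigh(M)   # eigenvalues in ascending order
    order = np.argsort(eta)[::-1]    # reorder: eta_1 >= ... >= eta_D >= 0
    return Yfull[:, order[:m]]       # D x m matrix of leading eigenvectors

# Projection of an input feature x onto span(y_1, ..., y_m): Y.T @ x
```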

2.2 Grassmannian and Canonical Angle

The Grassmannian manifold (Grassmannian) \(\mathcal {G} (m,D)\) is the set of m-dimensional linear subspaces of \(\mathbb {R}^D\) [8]. An element of \(\mathcal {G} (m,D)\) can be represented by an orthogonal matrix \(\varvec{Y}\) of size D by m, where \(\varvec{Y}\) is comprised of the m basis vectors for a set of patterns in \(\mathbb {R}^D\).

Let \(\varvec{Y}_1\) and \(\varvec{Y}_2\) be the orthogonal matrices of size \(D \times m\). Canonical angles (principal angles) \(0 \le \theta _1 \le \dots \le \theta _m \le \frac{\pi }{2}\) between two subspaces \(\mathcal {L}_1= \mathrm {span} (\varvec{Y}_1)\) and \(\mathcal {L}_2 = \mathrm {span} (\varvec{Y}_2)\) are defined by

$$\begin{aligned} \cos \theta _k = \max _{\varvec{y}_{1,k} \in \mathrm {span}(\varvec{Y}_1)} \ \max _{\varvec{y}_{2,k} \in \mathrm {span}(\varvec{Y}_2)} \ \varvec{y}_{1,k}^{\top } \varvec{y}_{2,k}, \quad \mathrm {s.t.} \ \varvec{y}_{1,k}^{\top }\varvec{y}_{1,i}=0, \ \varvec{y}_{2,k}^{\top }\varvec{y}_{2,i}=0, \end{aligned}$$
(3)

where \(i=1,2, \dots , k-1\).

For two linear subspaces \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we have projection matrices \(\varvec{P}=\varvec{Y}_1\varvec{Y}_1^{\top }\) and \(\varvec{Q} = \varvec{Y}_2\varvec{Y}_2^{\top }\). We also have a set of canonical angles \(\{ \theta _k \}_{k=1}^m\) between these two linear subspaces with the conditions \(\theta _1 \le \theta _2 \le \dots \le \theta _m\). We obtain the canonical angles from the solution of the eigendecomposition problem

$$\begin{aligned} \varvec{PQPu} = \lambda \varvec{u} \ \mathrm {or} \ \varvec{QPQu} = \lambda \varvec{u}. \end{aligned}$$
(4)

The solutions of these eigendecomposition problems coincide [17]. For the practical computation of canonical angles, we use the singular value decomposition

$$\begin{aligned} \varvec{Y}_1^{\top }\varvec{Y}_2 = \varvec{U} \varvec{\varSigma } \varvec{V}^{\top }, \end{aligned}$$
(5)

where \(\varvec{\varSigma }= \mathrm {diag}(\cos \theta _1, \cos \theta _2, \dots , \cos \theta _m)\) is a diagonal matrix and \(\varvec{U}, \varvec{V} \in \mathbb {R}^{m \times m} \) are orthogonal matrices. Note that we have the following relation,

$$\begin{aligned} \lambda _i = \cos ^2 \theta _i \end{aligned}$$
(6)

for eigenvalues \(\lambda _i, \ i=1, 2, \dots , m\), in the eigendecomposition in Eq. (4). Canonical angles are used to define the geodesic distance on a Grassmannian.
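For reference, Eqs. (5) and (6) give a direct recipe for computing canonical angles. Below is a minimal NumPy sketch; the clipping of singular values against floating-point round-off is our addition.

```python
import numpy as np

def canonical_angles(Y1, Y2):
    """Canonical angles between span(Y1) and span(Y2) via the SVD of
    Y1^T Y2 in Eq. (5). Y1 and Y2 are D x m with orthonormal columns.
    Returns theta_1 <= ... <= theta_m; lambda_i = cos(theta_i)**2, Eq. (6)."""
    sigma = np.linalg.svd(Y1.T @ Y2, compute_uv=False)  # cos(theta_i), descending
    sigma = np.clip(sigma, 0.0, 1.0)                    # guard against round-off
    return np.arccos(sigma)                             # ascending angles
```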

2.3 Grassmann Distances

For two points represented by \(\varvec{Y}_1\) and \(\varvec{Y}_2\) on a Grassmannian, we have the following seven distances:

1. \(d_p(\varvec{Y}_1,\varvec{Y}_2)\): projection metric
2. \(d_{\mu }(\varvec{Y}_1,\varvec{Y}_2)\): mean distance
3. \(d_{\mathrm {min}}(\varvec{Y}_1,\varvec{Y}_2)\): minimum canonical angle
4. \(d_{\mathrm {max}}(\varvec{Y}_1,\varvec{Y}_2)\): maximum canonical angle
5. \(d_{BC}(\varvec{Y}_1,\varvec{Y}_2)\): Binet-Cauchy metric
6. \(d_g(\varvec{Y}_1,\varvec{Y}_2)\): geodesic distance
7. \(d_{c}(\varvec{Y}_1,\varvec{Y}_2)\): Procrustes (chordal) distance

These seven distances are defined using the canonical angles between two linear subspaces. The projection metric and mean distance are defined by

$$\begin{aligned} d_p(\varvec{Y}_1, \varvec{Y}_2) = \left( \sum _{i=1}^m \sin ^2 \theta _i \right) ^{1/2} = \left( m - \sum _{i=1}^m \lambda _i \right) ^{1/2} \end{aligned}$$
(7)

and

$$\begin{aligned} d_{\mu }(\varvec{Y}_1, \varvec{Y}_2) = \frac{1}{m}\sum _{i=1}^m \sin ^2 \theta _i = 1 - \frac{1}{m}\sum _{i=1}^m \lambda _i. \end{aligned}$$
(8)

Furthermore, the sines of the minimum and maximum canonical angles,

$$\begin{aligned} d_{\mathrm {min}}(\varvec{Y}_1, \varvec{Y}_2) = \sin \theta _{\mathrm {1}} = (1 - \lambda _{1})^{1/2}, \end{aligned}$$
(9)
$$\begin{aligned} d_{\mathrm {max}}(\varvec{Y}_1, \varvec{Y}_2) = \sin \theta _{\mathrm {m}} = (1 - \lambda _{m})^{1/2}, \end{aligned}$$
(10)

are also used as distances on a Grassmannian. Moreover, the Binet-Cauchy distance is defined by

$$\begin{aligned} d_{BC}(\varvec{Y}_1, \varvec{Y}_2) = \left( 1 - \varPi _{i=1}^m \cos ^2 \theta _i \right) = \left( 1 - \varPi _{i=1}^m \lambda _i \right) , \end{aligned}$$
(11)

where the product of \(\cos \theta _i\) represents the similarity between two linear subspaces. Using the canonical angles, we have the geodesic distance

$$\begin{aligned} d_g(\varvec{Y}_1, \varvec{Y}_2) = \left( \sum _{i=1}^m \theta _i^2 \right) ^{1/2} = \left( \sum _{i=1}^m \left( \arccos \lambda _i^{1/2}\right) ^2 \right) ^{1/2} \end{aligned}$$
(12)

on a Grassmann manifold. We adopt the following definition of the Procrustes (chordal) distance,

$$\begin{aligned} d_{c}(\varvec{Y}_1, \varvec{Y}_2) = \underset{\varvec{R}_1, \varvec{R}_2 \in O(m)}{\mathrm {min}} \Vert \varvec{Y}_1\varvec{R}_1 - \varvec{Y}_2\varvec{R}_2 \Vert _{\mathrm {F}} = 2 \left( \sum _{i=1}^m \sin ^2 (\theta _i/2) \right) ^{1/2}, \end{aligned}$$
(13)

where \(\Vert \cdot \Vert _{\mathrm {F}}\) is the Frobenius norm.
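Since all seven distances are functions of the canonical angles, they can be computed together once the angles are known. The following sketch collects Eqs. (7)-(13) in one helper; the dictionary keys are our own labels.

```python
import numpy as np

def grassmann_distances(theta):
    """The seven distances of Sect. 2.3, computed from the canonical
    angles theta (ascending); see Eqs. (7)-(13)."""
    lam = np.cos(theta) ** 2                 # lambda_i = cos^2(theta_i)
    return {
        'projection':   np.sqrt(np.sum(np.sin(theta) ** 2)),           # Eq. (7)
        'mean':         np.mean(np.sin(theta) ** 2),                   # Eq. (8)
        'min':          np.sin(theta[0]),                              # Eq. (9)
        'max':          np.sin(theta[-1]),                             # Eq. (10)
        'binet_cauchy': 1.0 - np.prod(lam),                            # Eq. (11)
        'geodesic':     np.sqrt(np.sum(theta ** 2)),                   # Eq. (12)
        'procrustes':   2.0 * np.sqrt(np.sum(np.sin(theta / 2) ** 2)), # Eq. (13)
    }
```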

2.4 Normalisation of Features

We project the extracted features onto the discriminative feature space for accurate classification. We extract a set of features \(\{ \varvec{x}_i \in \mathbb {R}^D \}_{i=1}^N\) from a set of images with condition \(D \ll N\). Let \(\varvec{\mu } = \mathrm {E} (\varvec{x}_i)\) be the mean of the features. By setting \( \bar{\varvec{x}}_i = \varvec{x}_i - \varvec{\mu }\), we obtain a set of centred features \(\{ \bar{\varvec{x}}_i \}_{i=1}^N \). We assume each image belongs to either category \(\mathcal {C}_1\) or \(\mathcal {C}_2\). Therefore, set \(\{ \bar{\varvec{x}}_i \}_{i=1}^N \) is divided into two sets \(\{ \varvec{x}_i^{(1)} \}_{i=1}^{N_1}\) and \(\{ \varvec{x}_i^{(2)} \}_{i=1}^{N_2}\), where \(N_1\) and \(N_2\), respectively, represent the number of images in the first and second categories.

We define the autocorrelation matrices in the centred feature space as

$$\begin{aligned} \varvec{A}_1 = \frac{1}{N_1}\varvec{X}_1\varvec{X}_1^{\top }, \ \varvec{A}_2 = \frac{1}{N_2} \varvec{X}_2\varvec{X}_2^{\top }, \end{aligned}$$
(14)

where \(\varvec{X}_1 = [\varvec{x}^{(1)}_1, \varvec{x}^{(1)}_2, \dots , \varvec{x}^{(1)}_{N_1}]\) and \(\varvec{X}_2 = [\varvec{x}^{(2)}_1, \varvec{x}^{(2)}_2, \dots , \varvec{x}^{(2)}_{N_2}]\), for the two categories. We define the covariance matrix of all the features as

$$\begin{aligned} \varvec{C}= P(\mathcal {C}_1) \varvec{A}_1 + P(\mathcal {C}_2)\varvec{A}_2, \end{aligned}$$
(15)

where we set \(P(\mathcal {C}_1)= \frac{N_1}{N_1+N_2}\) and \(P(\mathcal {C}_2) = \frac{N_2}{N_1+N_2}\). Fukunaga [7] used the autocorrelation matrix of all the features instead of \(\varvec{C}\) in Eq. (15). Fukunaga [6] used covariance matrices \(\varvec{C}_1\) and \(\varvec{C}_2\) of the two categories instead of \(\varvec{A}_1\) and \(\varvec{A}_2\) in Eq. (15). In this manuscript, we adopt the covariance matrix \(\varvec{C}\) of all the features to remove the common features of the two categories, since the autocorrelation matrices \(\varvec{A}_1\) and \(\varvec{A}_2\) include the gaps between the mean of all the features and the mean of each category. The covariance matrix gives the eigendecomposition problem \(\varvec{CV} = \varvec{V}\varvec{\varXi }\), where \(\varvec{\varXi }= \mathrm {diag}(\xi _1,\xi _2,\dots ,\xi _D)\) consists of eigenvalues \(\xi _i\), \(i=1,2, \dots , D\), with the condition \(\xi _1\ge \xi _2 \ge \dots \ge \xi _{D}\). The eigendecomposition derives a whitening matrix \(\varvec{W} = \varvec{\varXi }^{-\frac{1}{2}}\varvec{V}^{\top }\).
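A minimal sketch of this normalisation step is given below, assuming the centred features of the two categories are stored column-wise; the eps guard against numerically vanishing eigenvalues of \(\varvec{C}\) is our addition and is not discussed in the text.

```python
import numpy as np

def whiten(X1, X2, eps=1e-12):
    """Whitening matrix W from the pooled covariance C of Eq. (15) and
    the whitened autocorrelation matrix of category 1 (Eq. (16));
    A2_tilde follows as I - A1_tilde. X1 and X2 hold the centred
    D-dimensional features of the two categories column-wise."""
    N1, N2 = X1.shape[1], X2.shape[1]
    A1, A2 = X1 @ X1.T / N1, X2 @ X2.T / N2          # Eq. (14)
    p1, p2 = N1 / (N1 + N2), N2 / (N1 + N2)          # P(C_1), P(C_2)
    C = p1 * A1 + p2 * A2                            # Eq. (15)
    xi, V = np.linalg.eigh(C)
    order = np.argsort(xi)[::-1]
    xi, V = np.maximum(xi[order], eps), V[:, order]  # guard tiny eigenvalues
    W = np.diag(xi ** -0.5) @ V.T                    # W = Xi^(-1/2) V^T
    A1_tilde = W @ (p1 * A1) @ W.T                   # Eq. (16): A1_tilde + A2_tilde = I
    return W, A1_tilde
```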

Using this whitening matrix and Eq. (15), we obtain the following relation

$$\begin{aligned} \varvec{WCW}^{\top } = \varvec{W}P(\mathcal {C}_1) \varvec{A}_1 \varvec{W}^{\top } + \varvec{W}P(\mathcal {C}_2) \varvec{A}_2\varvec{W}^{\top } = \tilde{\varvec{A}}_1 + \tilde{\varvec{A}}_2 = \varvec{I}, \end{aligned}$$
(16)

where \(\varvec{I}\) is an identity matrix. The solutions of the eigenvalue problems

$$\begin{aligned} \tilde{\varvec{A}}_j \varvec{\phi }_i^{(j)} = \lambda _i ^{(j)}\varvec{\phi }_i^{(j)}, \end{aligned}$$
(17)

for \(j=1,2\), give the bases of the two category subspaces. From Eqs. (16) and (17), we have

$$\begin{aligned} \tilde{\varvec{A}}_2 \varvec{\phi }_i^{(2)} = (\varvec{I} - \tilde{\varvec{A}}_1) \varvec{\phi }_i^{(2)} = \lambda _i^{(2)} \varvec{\phi }_i^{(2)}. \end{aligned}$$
(18)

This leads to \(\tilde{\varvec{A}}_1 \varvec{\phi }_i^{(2)} = (1-\lambda _i^{(2)})\varvec{\phi }_i^{(2)} \). These relations give

$$\begin{aligned} \varvec{\phi }_i^{(2)} = \varvec{\phi }_{D-i+1}^{(1)} \end{aligned}$$
(19)

and

$$\begin{aligned} \lambda _i^{(2)} = 1- \lambda _{D-i+1}^{(1)}. \end{aligned}$$
(20)

Equations (19) and (20) show that both eigenvalue problems give the same set of eigenvectors and corresponding eigenvalues. Note that the two sets of eigenvalues are reversely ordered. The eigenvalue orders in Eq. (20) satisfy

$$\begin{aligned}&1 \ge \lambda _1^{(1)} \ge \lambda _2^{(1)} \ge \dots \ge \lambda _D^{(1)} \ge 0,&\end{aligned}$$
(21)
$$\begin{aligned}&0 \le 1- \lambda _1^{(1)} \le 1-\lambda _2^{(1)} \le \dots \le 1-\lambda _D^{(1)} \le 1.&\end{aligned}$$
(22)

These relations imply that the eigenvectors corresponding to the large eigenvalues of \(\tilde{\varvec{A}}_1\) contribute to the representation of the subspace for \(\mathcal {C}_1\), although they make only a minor contribution to the representation for \(\mathcal {C}_2\). Therefore, we obtain discriminative features by projecting the features onto the linear subspace given by \(\mathrm {span}(\varvec{\phi }_{1}^{(1)},\varvec{\phi }_{1}^{(2)},\varvec{\phi }_{2}^{(1)},\varvec{\phi }_{2}^{(2)}, \dots , \varvec{\phi }_{d}^{(1)},\varvec{\phi }_{d}^{(2)} )\). We discuss how to determine the number d in the next section.

2.5 Linear Classification with Rejection Option

We use a linear support vector machine (SVM) as a classifier. The SVM classifies an input feature vector \(\varvec{x} \in \mathbb {R}^D\) by \(\mathrm {sign}(f(\varvec{x}))\), where \(f(\cdot )\) is a decision function. The parameters and hyperparameters of this decision function are optimised by a training procedure with training data. We can estimate the probability of belonging to each category with the optimised decision function. For two categories with label \(\mathcal {L} \in \{ 0, 1\}\), we can approximately estimate the probabilities

$$\begin{aligned} P(\mathcal {L}=1 |\varvec{x}) \approx P(A,B,f(x)) = \frac{1}{1+\mathrm {exp}(Af(x)+B)}, \end{aligned}$$
(23)

where A and B are the parameters of Platt's method [21]. \(P(\mathcal {L}=0 |\varvec{x})\) is obtained as \(1-P(\mathcal {L}=1 |\varvec{x})\). After training the decision function, we obtain A and B by maximum-likelihood estimation on the training dataset. In our method, we represent the non-neoplastic and neoplastic categories by 0 and 1. The SVM can output probabilities [21] \(p_\mathrm{{non}}\) and \(p_\mathrm{{neo}}\) that satisfy \(p_\mathrm{{non}}+p_\mathrm{{neo}}=1\) and \(p_\mathrm{{non}},p_\mathrm{{neo}} \in [0,1]\) for non-neoplastic and neoplastic images. We adopt a rejection option to remove low-confidence classifications [2]. The rejection option discards classifications with low probabilities such that \(\mathrm {max}\ (p_{\mathrm {non}}, p_{\mathrm {neo}}) < \kappa \), where the criterion \(\kappa \) was decided by preliminary experiments.
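For illustration, scikit-learn's SVC (which wraps libsvm) exposes a Platt-scaled probability estimate via probability=True, so the classification-with-rejection step can be sketched as follows; X_train, y_train, X_test, and the value of \(\kappa \) are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# probability=True enables Platt scaling (Eq. (23)) fitted on the training data
clf = SVC(kernel='linear', C=1.0, probability=True)
clf.fit(X_train, y_train)            # y: 0 = non-neoplastic, 1 = neoplastic

proba = clf.predict_proba(X_test)    # columns follow clf.classes_; rows sum to 1
kappa = 0.70                         # rejection criterion from preliminary experiments
confident = proba.max(axis=1) >= kappa
y_pred = np.where(confident, proba.argmax(axis=1), -1)  # -1 marks rejected images
```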

3 Feature-Selection Method

We have a set \(\{ \mathcal {X}_i \}_{i=1}^N \) of three-channel images. We extract a feature vector \(\varvec{x}_i \in \mathbb {R}^D\) from each image \(\mathcal {X}_i\). In the same manner as in Sect. 2.4, we divide \(\{ \varvec{x}_i \}_{i=1}^N \) into two sets \(\{ \varvec{x}_i^{(1)} \}_{i=1}^{N_1}\) and \(\{ \varvec{x}_i^{(2)} \}_{i=1}^{N_2}\), where \(N_1\) and \(N_2\) represent the numbers of images in the first and second categories.

We find a linear map \(\varPsi (\varvec{x})=\varvec{P}^{*}\varvec{W}(\varvec{x}-\varvec{\mu })\) by solving

$$\begin{aligned} \mathrm {arg} \ \underset{\varvec{P}}{\mathrm {min}} (\mathrm {rank} (\varvec{P}\varvec{P}^{\top } )) \quad \mathrm {s.t.} \ \ \ \underset{\varvec{P}}{\mathrm {max}}(d(\varPsi (\varvec{Y}_1), \varPsi (\varvec{Y}_2))) \end{aligned}$$
(24)

where \(d(\cdot ,\cdot )\) represents one of the seven Grassmann distances. In this equation, \(\max (\cdot )\) returns one or more matrices that give the maximum distance, and \(\min (\cdot )\) selects the matrix with the minimum rank among them. The solution is an orthogonal matrix \(\varvec{P}^{*}= [\varvec{\phi }_{1}^{(1)},\varvec{\phi }_{1}^{(2)},\varvec{\phi }_{2}^{(1)},\varvec{\phi }_{2}^{(2)}, \dots , \varvec{\phi }_{m}^{(1)},\varvec{\phi }_{m}^{(2)}]^{\top }\), which comprises 2m eigenvectors. Algorithm 1 summarises the procedure for finding \(\varvec{P}^{*}\) from the training data; it also returns the selected features for the training data (a sketch follows below).

Algorithm 1. Procedure for finding \(\varvec{P}^{*}\) and the selected features from the training data.
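Since Algorithm 1 is given as a figure, the following sketch is our reading of the procedure, reusing the helpers sketched in Sect. 2. The sweep over m realises Eq. (24): among selections attaining the maximum Grassmann distance, the ascending sweep with a strict comparison keeps the smallest rank. Building m-dimensional category subspaces from the projected features is our assumption.

```python
import numpy as np

def select_features(X1, X2, m_max, dist_key='mean'):
    """Sketch of Algorithm 1 / Eq. (24): find P* = [phi_1^(1), phi_1^(2), ...]
    by sweeping the number m of eigenvector pairs and keeping the smallest m
    that attains the maximum Grassmann distance. Reuses category_subspace,
    canonical_angles, grassmann_distances, and whiten from earlier sketches."""
    mu = np.hstack([X1, X2]).mean(axis=1, keepdims=True)
    X1c, X2c = X1 - mu, X2 - mu              # centred features
    W, A1_tilde = whiten(X1c, X2c)
    lam, Phi = np.linalg.eigh(A1_tilde)
    Phi = Phi[:, np.argsort(lam)[::-1]]      # phi_1^(1), ..., phi_D^(1)
    D = Phi.shape[0]
    assert 2 * m_max <= D
    best_dist, best_P = -np.inf, None
    for m in range(1, m_max + 1):
        # Eq. (19): phi_i^(2) = phi_{D-i+1}^(1), i.e. the same columns
        # read from the other end of the spectrum
        cols = list(range(m)) + list(range(D - m, D))
        P = Phi[:, cols].T                   # 2m x D projection matrix
        Z1, Z2 = P @ W @ X1c, P @ W @ X2c    # selected features Psi(x)
        Y1 = category_subspace(Z1, m)        # m-dim subspaces (our assumption)
        Y2 = category_subspace(Z2, m)
        d = grassmann_distances(canonical_angles(Y1, Y2))[dist_key]
        if d > best_dist:                    # strict '>' keeps the minimum rank
            best_dist, best_P = d, P
    return best_P, mu, W                     # Psi(x) = best_P @ W @ (x - mu)
```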

4 Numerical Experiments

We evaluated our feature-selection method by applying it to the classification of neoplastic and non-neoplastic images of colon polyps. We used images of the magnified surfaces of polyps captured by an endocytoscope (CF-H290ECI, Olympus, Tokyo) with IRB approval. The neoplastic and non-neoplastic labels of the images were annotated by expert physicians. The numbers of images of neoplastic and non-neoplastic polyps are summarised in Table 1. We extracted two kinds of features from these images: the Haralick features [9] and convolutional neural network (CNN) features [15]. In this section, we first compare the classification accuracy before and after feature selection for the Haralick features, and then compare the classification accuracy before and after feature selection for the combined Haralick and CNN features. We finally analyse the relation between the output probability and the accuracy with the rejection option. This relation clarifies the validity of our feature-selection method as a manifold optimisation. In the final analysis, we also show the performance of two-category classification with the rejection option.

In these evaluations, we used an SVM [27] for the classifications and trained it on the training data with the best hyperparameters, which we obtained by five-fold cross validation on the training data. For the practical computation of the SVM, we used libsvm [3]. We note that a kernel SVM with a radial basis function gave lower classification accuracy than the linear SVM for the endocytoscopic images in our preliminary experiments. The classification accuracy of the original Haralick, CNN, and combined features without feature selection is summarised in Table 2.

Table 1. Dataset details: 14,840 training and 4,126 test images.
Table 2. Dimension and classification accuracy for extracted features with SVM.
Table 3. Dimensions and classification accuracy of the selected Haralick features for each Grassmann distance.
Table 4. Dimensions and classification accuracy of the selected combined Haralick and CNN features for each Grassmann distance.

4.1 Haralick Feature

The Haralick feature represents texture information measured by fourteen statistics of a gray-level co-occurrence matrix. In this case, we used thirteen statistics for eight directions at three scales. Before computing the statistics, we applied contrast normalisation to each local region to achieve robustness against illumination changes. We then extracted a 312-dimensional Haralick feature vector from each image, each element of which is normalised to the range [0, 1]. We applied the proposed feature-selection method to the normalised Haralick features.
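For illustration only, a rough stand-in for this step can be sketched with scikit-image's GLCM utilities. Note the mismatch with the paper: graycoprops offers six statistics rather than the thirteen used here, and the scale values and the per-image scaling are our assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img):
    """Illustrative GLCM texture vector over 8 directions and 3 scales.
    gray_img is a 2-D uint8 array. This is not a reproduction of the
    312-dimensional Haralick feature of the paper."""
    angles = np.arange(8) * np.pi / 4      # 8 directions
    distances = [1, 2, 4]                  # 3 scales (assumed values)
    glcm = graycomatrix(gray_img, distances, angles, levels=256, normed=True)
    props = ['contrast', 'dissimilarity', 'homogeneity',
             'energy', 'correlation', 'ASM']
    feat = np.hstack([graycoprops(glcm, p).ravel() for p in props])
    # Per-image min-max scaling to [0, 1]; the paper normalises per element
    return (feat - feat.min()) / (feat.max() - feat.min() + 1e-12)
```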

Figure 7(a) shows the eigenvalues of the whitened autocorrelation matrices for the two categories, and Figs. 7(b) and (c) show the Grassmann distances. Table 3 summarises the dimensions and classification accuracy for each feature selection with respect to the seven Grassmann distances.

4.2 Combination of Haralick and CNN Features

We combined the Haralick and CNN features into a single feature vector. Figure 6 illustrates the architecture of our CNN. Its settings were decided by preliminary experiments [12], in which we compared several parameter settings with AlexNet [15], VGG Net [23], and their modified versions; our shallow architecture gave the best classification accuracy among them. Before extracting the CNN features, we trained the CNN on the training data from scratch, applying batch normalisation and dropout to the convolutional and fully connected layers. We used the values in fully connected layer F7 in Fig. 6 for feature extraction. Each element of the extracted CNN feature was normalised to the range [0, 1]. For the CNN implementation, we used the Caffe platform [13]. We combined the normalised Haralick and normalised CNN features into an 888-dimensional column vector.

We applied the proposed method to these normalised combined features. Figure 7(d) shows the eigenvalues of the whitened autocorrelation matrices for the two categories, and Figs. 7(e) and (f) show the Grassmann distances. Table 4 summarises the dimensions of the selected features and the classification accuracy for the selection with respect to the seven Grassmann distances.

Fig. 6.

Architecture of convolutional neural network for feature extraction.

Fig. 7.

Eigenvalues and Grassmann distances in feature selection: upper and lower rows represent results for the Haralick features and the combined features, respectively. (a) and (d) show the eigenvalues of \(\tilde{\varvec{A}}_1\) and \(\tilde{\varvec{A}}_2\) for non-neoplastic and neoplastic images; the horizontal and vertical axes represent the eigenvalue indices and the eigenvalues. (b), (c), (e) and (f) summarise the Grassmann distance after projection with respect to \(\varvec{P}\) given by a criterion \(\tau \); the horizontal and vertical axes represent \(\tau \) and the Grassmann distance.

4.3 Analysis of an Optimal Manifold

We evaluated the classification accuracy with respect to divided ranges of the output probabilities. The relation between the classification accuracy and the output probability represents the characteristics of the optimal manifold given by our feature selection. Figure 8(a) illustrates the distributions of the output probabilities for the original Haralick features and the selected features. Table 5 summarises the accuracy for each range of the output probabilities for the original Haralick features and the selected features. We also evaluated the classification accuracy with the rejection option for \(\kappa \in \{0.50, 0.55, \dots , 0.90\}\). The rejection rate is the ratio of rejected images in the test data. The classification results with the rejection option are summarised in Fig. 8(b). We obtained classification accuracies of 82.1%, 82.9%, and 84.7% for the original Haralick features, the selected Haralick features, and the selected combined features, respectively, at the same rejection rate of about 20%; note that the percentage of inappropriate images in practical diagnosis is close to 20–30%. In these cases, we set \(\kappa \) to 0.70, 0.60, and 0.72, respectively. Figure 9 shows examples of rejected images in the classification.

5 Discussion

The curves shown in Fig. 7(a) imply that only a small number of eigenvectors are discriminative for classification, since almost all the eigenvalues are close to 0.5. Figures 7(b) and (c) imply that \(d_{\mu }\) gives the largest distance with the fewest selected eigenvectors. The accuracy is improved after the selection based on \(d_{\mu }\), as shown in Table 3. The dimension of the selected features is 30% of that of the original Haralick feature.

The curves shown in Fig. 7(d) imply that there are no particularly discriminative eigenvectors, since the eigenvalues are distributed almost uniformly from zero to one. In Table 4, \(d_g\), \(d_p\), and \(d_c\) give the largest distance with all the eigenvectors; in this case, the features are just whitened and used for classification. In both the Haralick and combined-feature cases, our proposed method found discriminative features and improved the classification accuracy.

Table 5. Classification accuracy with respect to output probability. This table summarises the classification accuracy for the original Haralick, the selected Haralick, and the Haralick\(+\)CNN features. The classification accuracy is computed for each range of the output probability p, where p represents \(\mathrm {max}(p_{\mathrm {non}}, p_{\mathrm {neo}})\). The classification accuracy of the selected Haralick features almost coincides with the mean of each range of output probability.
Fig. 8.

Analysis of the optimal manifold. (a) Distributions of the output probabilities for the original and selected features: the figure illustrates the probability distributions for the original Haralick, the selected Haralick, and the Haralick\(+\)CNN features. The vertical axis represents the percentage of output probabilities in each range; the horizontal axis represents the divided ranges. In this figure, p is given as \(\mathrm {max}(p_{\mathrm {non}}, p_{\mathrm {neo}})\). (b) Receiver operating characteristic (ROC) curves for classification accuracy and rejection rate: the vertical and horizontal axes represent the classification accuracy and the rejection rate for the original Haralick, selected Haralick, and Haralick\(+\)CNN features.

The results summarised in Fig. 8(b) indicate that the rejection option improved the classification accuracy. The rejection option correctly removed low-confidence classifications in both the Haralick and combined-feature cases. Table 5 also supports the validity of the rejection option: we observe low classification accuracy for low output probabilities. Figure 9 shows that the rejected images are inappropriate due to bad observation conditions: over-staining, bad illumination, and a lack of discriminative texture.

The mean distance gives the best feature selection for the Haralick feature. In this case, the distribution of the output probabilities was characteristic: the proportion is low for \(p > 0.9\), medium for \(0.6 \ge p > 0.5\), and highest for \(0.8 \ge p > 0.7\). For these distributions, the output probability approximated the classification accuracy well, as shown in gray in Table 5. This characteristic relation between the classification accuracy and the output probability suggests the validity of the obtained manifold. In the case of the combined features, we did not observe the same characteristic, even though the selected combined features achieved the highest classification accuracy.

Fig. 9.

Examples of rejected endocytoscopic images in the classification: (a) insufficient texture for classification; (b) bad (too dark) illumination; (c) over-staining of a polyp. These images were labelled as inappropriate for practical diagnosis; in practice, medical doctors also recognised them as inappropriate.

6 Conclusions

We proposed a feature-selection method that improves two-category classification accuracy by searching for an optimal manifold for classification. We experimentally evaluated the proposed method by comparing the classification accuracy before and after feature selection on about 19,000 endocytoscopic images. The experimental results showed the validity of our proposed method through improved classification accuracy. Furthermore, we experimentally demonstrated the validity of the obtained optimal manifold by exploring the relation between the output probability and the classification accuracy; the output probability is helpful information for practical diagnosis. Moreover, we achieved robust classification with our feature-selection method and a linear classifier with a rejection option. The classification accuracy was 84.7% at a rejection rate of 20%, which is 7.2% higher than classification with the original Haralick features and SVM.