Yi Ren Fung∗, Ziqiang Guan∗, Ritesh Kumar, Joie Yeahuay Wu and Madalina Fiterau
College of Information and Computer Sciences
University of Massachusetts Amherst
{yfung, zguan, riteshkumar, yeahuaywu, mfiterau}
arXiv:1906.04231v1 [eess.IV] 10 Jun 2019
Table 1: Performance and methodology of some of the state-of-the-art studies on three-class Alzheimer’s Disease classification. The different
splitting schemes and subsets of the ADNI dataset used in evaluation make it hard to interpret the results meaningfully.
657 scans from 173 patients in the testing set. For consistency, we used the same number of scans in all subsets for all three splits. Our code repository is publicly available 1, and it includes the patient IDs and visits that we evaluated.
1 cnn-study

3.2 Model Architecture and Training
We compared the performance of 2D and 3D CNN architectures. For our 2D CNN architecture, we used ResNet18 [He et al., 2016] pretrained on ImageNet, so that the model starts from filters that already extract low-level image features well.
For our 3D CNN architecture, we followed a residual network-based architectural design as well. We used the “bottleneck” configuration, where the inner convolutional layer of each residual block contains half the number of filters, and we used the “full pre-activation” layout for the residual blocks. See Figure 2 for more details.
For the 2D networks, we used a learning rate of 0.0001 and an L2 regularization constant of 0.01. For the 3D network, we used a learning rate of 0.001 and a regularization constant of 0.0001. Both networks were trained for 36 epochs, with early stopping.

3.3 Splitting Methodology and Results
Our main goal is to investigate how differently the model performs under the following three scenarios: (1) random training and testing split across the brain MRIs, (2) training and testing split by patient ID, and (3) training and testing split based on visit history across the patients.

Model  Split                    Train Acc.     Test Acc.
2D     Split by MRI randomly    99.2 ± 0.7%    83.7 ± 1.1%
2D     Split by visit history   99.0 ± 0.5%    81.2 ± 0.5%
2D     Split by patients        98.8 ± 0.6%    51.7 ± 1.2%
3D     Split by MRI randomly    95.8 ± 2.3%    84.4 ± 0.6%
3D     Split by visit history   95.8 ± 2.2%    82.9 ± 0.3%
3D     Split by patients        86.3 ± 9.5%    52.4 ± 1.8%
Table 2: Classification results of our CNN models under different train/test splitting schemes, averaged over five runs.

In Table 2, we present the three-way classification results of the models described in Section 3.2. In summary, when the training and testing sets were split by MRI scans randomly, the 2D and 3D models attained accuracy close to 84%. When the training and test sets were split by visits, in which only the latest visit of each patient was set aside for testing, accuracy was slightly lower. However, accuracy dropped to around 50% when the training and test sets were split by patient.
We note that the other state-of-the-art 2D CNN architectures we tried (DenseNet, InceptionNet, VGGNet) performed similarly to ResNet, and the choice of view in the 2D slice (coronal, axial, sagittal) did not lead to significant differences in testing accuracy as long as the slices were chosen to be close to the center of the brain. We report results on the 88th slice of the coronal view for our 2D model. Our 3D model performs slightly better but is still limited in performance, suffering from the same problem of over-fitting to information on individual patients instead of learning what generally differentiates a brain in the different stages of Alzheimer’s Disease.

                     Ground Truth
                     AD    MCI    CN
Prediction    AD     96     59    17
              MCI    62    153    90
              CN     17     67    96
Table 3: Confusion matrix of the 3D CNN experiment on the test set, with data split by patient.

Figure 1: Comparison of the spatially normalized MRI scans of 4 subjects in each of the CN, MCI, and AD categories. To the human eye, distinguishing the difference across the disease stages is a difficult task.

3.4 Analysis
Analysis of the dataset shows that the frequency of disease stage transition for patients between any two consecutive visits is low in the ADNI dataset. There are only 152 transitions in the entire dataset, which contains 2,731 scans from 657 patients.
We believe that due to the relatively few transition points in the dataset, the models are still able to achieve accuracy in the 80% range for the splitting by visit experiments by repeating the diagnostic label of the previous visits. Out of the 152 total transitions we found across the whole dataset, only 52 of them happened for patients between the (n − 1)th visit and the nth (latest) visit. This suggests that the model is able to encode the structure of a patient’s brain from the training set, in turn aiding its performance on the testing set.
Furthermore, we set up additional experiments where we trained on visits 1 through n − 1 from all patients, but only tested on the 52 patients that had a transition from the (n − 1)th to the nth visit. The classification accuracy on this experiment dropped to around 54%, which suggests that the network was repeating information about the patient’s brain structure instead of
learning to be discriminative among the different stages of Alzheimer’s Disease exhibited by a particular MRI scan.

[Figure 2 diagram. Network panel: five pairs of (3x3 Conv Layer, Residual Block), followed by two 3x3 Conv Layers and a Linear Layer. Labels in the residual block panel: Batch Norm, RELU, 1x1 Conv.]
Figure 2: The architectures of our 3D CNN model and residual blocks. With the exception of the first and last convolution layers, which have a stride of 1, all other layers have a stride of 2 for downsampling. The first convolution layer takes in a 1-channel image and produces a 32-channel output.

4 Discussion
4.1 Technical Challenges
We summarize the main challenges in working with the ADNI dataset as follows:
1. Lack of transitions in a patient’s health status between consecutive visits. There are only 152 transitions total out of the entire dataset of 2,731 images collected from patient visits. It is easy for the model to overfit and memorize the state of a patient at each visit instead of generalizing the key distinctions between the different stages of Alzheimer’s Disease.
2. Coarse-grained data labels. The data labels are coarse-grained in nature, so our classifier may become confused when trying to learn from cases where a patient’s cognitive state is borderline, such as being between MCI and AD. The confusion matrix in Table 3 demonstrates this.
3. Difficulty in distinguishing the visual difference of a brain in the different Alzheimer’s stages. Human brains are distinct by nature, and the varying quality of MRI collections from different clinical settings adds to the noise level of the data. In Figure 1, we plotted the brains of subjects in CN/MCI/AD, and show that the difference in anatomical structure from CN to MCI to AD is very subtle.
4. No clear baseline. Many studies evaluated the performance of their models on different subsets of the ADNI dataset, making fair comparison a tricky task. In addition, studies that use a separate testing set do not report the subjects or scans that they used in their testing set, further complicating the comparison process.
We hope to keep these challenges in mind when designing future experiments and ultimately design models that can reliably classify each brain MRI with its true stage in Alzheimer’s Disease progression, while being robust to visit number, lack of patient transitions, and minor fluctuations in scan quality.

4.2 Insights and Future Work
Many studies in Alzheimer’s disease brain MRI classification do not take into account how the data should be properly split, putting into question the ability of the proposed models to generalize on unseen data. We fill this gap by providing detailed analysis of model performance across splitting schemes. Additionally, to our knowledge, none of the previous studies use all of the MRIs available in the ADNI dataset, nor do they present a clear explanation for this decision. To address the issue, we perform our experiments on all available data while also reporting the subjects used in the training and test split of all our experiments for reproducibility.
In the future, we would like to explore utilizing the covariate data collected from patients to aid image feature extraction. Most of the studies we have come across do not use any covariate information collected from patients. The covariates, such as patient demographics and cognitive test scores, may be helpful for the classification task since they correlate with the disease stage of the patient. One possible scenario is a multitask learning setup, where the model predicts the Mini-Mental State Examination (MMSE) and Alzheimer’s Disease Assessment Scale (ADAS) cognitive scores in addition to the labels. We think this may be helpful in training the model because the cognitive test scores can provide a finer-grained signal for the model, making the prediction more robust.
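The three splitting schemes compared in Section 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository: each scan is assumed to be a (patient_id, visit_number, label) record, and the function names, the test fraction, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical record layout: (patient_id, visit_number, label).


def split_by_scan(scans, test_frac=0.25, seed=0):
    """Random split over scans; the same patient may appear on both sides."""
    rng = random.Random(seed)
    shuffled = list(scans)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_patient(scans, test_frac=0.25, seed=0):
    """All scans of a given patient go to exactly one side."""
    rng = random.Random(seed)
    patients = sorted({pid for pid, _, _ in scans})
    rng.shuffle(patients)
    test_ids = set(patients[: int(len(patients) * test_frac)])
    train = [s for s in scans if s[0] not in test_ids]
    test = [s for s in scans if s[0] in test_ids]
    return train, test


def split_by_visit(scans):
    """Hold out each patient's latest visit; earlier visits are training data."""
    latest = defaultdict(int)
    for pid, visit, _ in scans:
        latest[pid] = max(latest[pid], visit)
    train = [s for s in scans if s[1] < latest[s[0]]]
    test = [s for s in scans if s[1] == latest[s[0]]]
    return train, test


def shared_patients(train, test):
    """Patients that occur in both the training and the test set."""
    return {s[0] for s in train} & {s[0] for s in test}


# Toy data: 8 patients with 3 visits each (invented for illustration).
scans = [(pid, visit, "CN") for pid in range(8) for visit in range(1, 4)]
for name, (train, test) in {
    "scan": split_by_scan(scans),
    "patient": split_by_patient(scans),
    "visit": split_by_visit(scans),
}.items():
    print(name, len(train), len(test), len(shared_patients(train, test)))
```

Only the split by patient guarantees that no patient contributes scans to both sets. The sketch does not equalize subset sizes across the three schemes, which the experiments in Section 3 do.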
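The transition statistics discussed in Section 3.4 can be computed along the following lines. This is a minimal sketch, assuming each patient's diagnoses are available as a list ordered by visit; the data layout and function names are hypothetical, and the toy histories are invented for illustration (they are not ADNI data).

```python
def count_transitions(history):
    """Number of consecutive-visit pairs whose diagnosis differs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def last_visit_transition(history):
    """True if the label changes between the (n-1)th and the nth (latest) visit."""
    return len(history) >= 2 and history[-2] != history[-1]


def repeat_previous_label_accuracy(histories):
    """Accuracy of predicting the latest visit by copying the previous label."""
    eligible = [h for h in histories.values() if len(h) >= 2]
    hits = sum(1 for h in eligible if h[-2] == h[-1])
    return hits / len(eligible)


# Toy diagnosis histories, ordered by visit (invented for illustration).
histories = {
    "p1": ["CN", "CN", "CN"],
    "p2": ["CN", "MCI", "MCI"],
    "p3": ["MCI", "MCI", "AD"],
    "p4": ["AD", "AD"],
}

total = sum(count_transitions(h) for h in histories.values())
at_last = sum(last_visit_transition(h) for h in histories.values())
print("transitions:", total)
print("transitions at the latest visit:", at_last)
print("copy-previous-label accuracy:", repeat_previous_label_accuracy(histories))
```

A baseline that copies the previous label is correct whenever no transition occurs at the latest visit, which illustrates why a dataset with few such transitions makes the split-by-visit setting comparatively easy.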
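As a check on Table 3, the overall accuracy and per-class recall can be recomputed from the confusion matrix. This is a minimal sketch: the counts are copied from Table 3 (rows are predictions, columns are ground truth), while the variable and function names are illustrative only.

```python
CLASSES = ["AD", "MCI", "CN"]

# CONFUSION[i][j]: scans predicted as CLASSES[i] whose ground truth is CLASSES[j].
CONFUSION = [
    [96, 59, 17],
    [62, 153, 90],
    [17, 67, 96],
]


def accuracy(matrix):
    """Fraction of scans on the diagonal (prediction equals ground truth)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Correct predictions divided by the ground-truth count of each class."""
    result = {}
    for j, name in enumerate(CLASSES):
        column_total = sum(row[j] for row in matrix)
        result[name] = matrix[j][j] / column_total
    return result


def errors_involving(matrix, name):
    """Misclassified scans where `name` is the prediction or the ground truth."""
    k = CLASSES.index(name)
    n = len(matrix)
    return sum(
        matrix[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and (i == k or j == k)
    )


print(f"accuracy: {accuracy(CONFUSION):.3f}")
for name, value in recall_per_class(CONFUSION).items():
    print(f"recall {name}: {value:.3f}")
print("errors involving MCI:", errors_involving(CONFUSION, "MCI"))
```

The diagonal sums to 345 of 657 scans, about 52.5%, in line with the 3D split-by-patient test accuracy in Table 2. Of the 312 misclassified scans, 278 involve the MCI class as either the prediction or the ground truth, which is consistent with the point about borderline cases in Section 4.1.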
