
MULTI-SCALE 3D DEEP CONVOLUTIONAL NEURAL NETWORK FOR

HYPERSPECTRAL IMAGE CLASSIFICATION

Mingyi He, Bo Li, Huahui Chen

Northwestern Polytechnical University, School of Electronics and Information


International Center for Information Acquisition & Processing, Xi’an, Shaanxi, China, 710129

ABSTRACT

Research in deep neural networks (DNNs) and deep learning has made great progress on 1D (speech), 2D (image) and 3D (3D-object) recognition/classification problems. Because a hyperspectral image (HSI), with its 2D spatial and 1D spectral information, is quite different from a 3D object image, existing DNNs cannot be directly extended to HSI classification. A multi-scale 3D deep convolutional neural network (M3D-DCNN) is proposed for HSI classification, which jointly learns both the 2D multi-scale spatial feature and the 1D spectral feature from HSI data in an end-to-end approach, promising better results on large-scale datasets. Without any hand-crafted features or pre/post-processing such as PCA or sparse coding, we achieve state-of-the-art results on the standard datasets, which shows the technical validity and advancement of our method.

Index Terms— Hyperspectral image classification, 3D convolution, multi-scale, end-to-end, deep neural network

1. INTRODUCTION

A hyperspectral image (HSI) is usually composed of hundreds of spectral bands ranging from visible light to shortwave infrared, which provides rich information for target detection and classification applications. Different from RGB CCD images or infrared images, an HSI carries spatial and spectral information simultaneously in a 3-dimensional data cube, which makes HSI processing difficult. Recent research shows that, for detecting and classifying targets in HSI, contextual information provides great advantages, leading to growing interest in joint spatial-spectral classification approaches [1][2].

A convolutional neural network (CNN) [3] can hierarchically extract implicit, deep features thanks to its layer-by-layer structure. Naturally, CNNs have attracted considerable attention in HSI classification and detection [4]. A brief overview of CNN-based methods for HSI classification is given below.

(This work is partially supported by the Natural Science Foundation of China (61420106007, 61671387) and the NPU Seed Foundation of Innovation and Creation (Z2016120).)

In [5], an unsupervised CNN was proposed for remote sensing image processing, trained with a greedy layer-wise strategy. Xing et al. [6] used stacked autoencoders to extract features for HSI. Hu et al. [7] proposed a two-layer 1D CNN architecture to extract spectral features and achieved significant results. Both [6] and [7] considered only the spectral information. For spatial-spectral feature extraction, Chen et al. [8] proposed an HSI classification model based on a deep belief network (DBN), whose input is a flattened neighbor region. Yue et al. [9] developed a 1D CNN framework for HSI classification and also utilized some hand-crafted features to further improve performance. Liang et al. [10] utilized a 2D CNN to extract spatial-spectral features through sparse representation. He et al. [11] proposed a modified deep stacking network (DSN) for HSI classification, in which coarse spectral features obtained by band selection and coarse spatial features obtained by PCA were used as inputs to the DSN. Mei et al. [12] used a 1D CNN to extract spectral features and then fused hand-crafted spatial features to improve the final performance. Du et al. [13] proposed an 8-layer 3D convolutional network named "C3D" for RGB video classification; however, it cannot be directly used for HSI classification, as HSI differs greatly from RGB video in the correlations and resolutions among the 3 data dimensions. More recently, Chen et al. [14] proposed a deep 3D CNN for HSI classification.

In this paper, we design a multi-scale 3D deep convolutional neural network, M3D-DCNN, to directly extract both the multi-scale spatial feature and the spectral feature for HSI classification. In our study, a 5-layer M3D-DCNN is finally carried out for HSI classification, with which we achieve promising results on the standard HSI datasets Indian Pines, Pavia Univ., and Salinas.

Compared with a 1D CNN, a 3D convolutional kernel can slide over the spatial and spectral dimensions jointly, meeting multi-scale and multi-resolution requirements; it thus has the power to extract more complicated spatial-spectral information in a natural and elegant way. Compared with C3D [13] for video analysis, which cannot be directly and effectively used for HSI classification, our M3D-DCNN uses smaller kernels, which reduces over-fitting since HSI datasets are usually small. Compared with the 3D-DNN [14], M3D-DCNN uses smaller kernels and deeper layers but fewer parameters, reducing over-fitting on these small HSI datasets without relying on virtual samples, and is thus closer to practical use. In addition, with the multi-scale structure, our M3D-DCNN can effectively extract HSI features, improving HSI classification accuracy.
Our main contributions can be summarized in three aspects: (1) We propose a 3D deep CNN (3D-DCNN) approach, explore the power of 3D convolution for HSI classification, and compare it with current CNN-based methods for the problem. (2) A multi-scale 3D convolution block is proposed, with which we build a multi-scale 3D deep CNN (M3D-DCNN) for HSI classification to handle multi-scale targets in the spatial domain. The experimental results show that M3D-DCNN can synchronously extract spatial and spectral features in a natural and elegant way. (3) Without any hand-crafted features or pre/post-processing such as PCA or sparse representation, our proposed M3D-DCNN achieves state-of-the-art results on the standard datasets. More importantly, our method is a totally end-to-end approach, promising better results on large-scale datasets in the future.

Fig. 1. Illustration of 1D, 2D, 3D convolution in HSI data. B, H, W represent the kernel sizes along the spectral and spatial dimensions respectively. M is the number of feature maps.
2. MULTI-SCALE 3D DEEP CONVOLUTIONAL NEURAL NETWORK

2.1. 1D, 2D, 3D Convolution For HSI Data

In the hyperspectral image processing field, researchers usually use 1D CNNs to extract spectral features separately in the spectral domain. When applied to HSI classification problems, however, it is crucial to capture joint features in both the spatial dimensions and the spectral dimension. Inspired by Ji et al.'s work [15] on human action recognition, which used 3D convolution to extract spatial-temporal features, we explore 3D kernels for jointly mining the spatial and spectral features of HSI data.

An illustration of 1D, 2D, 3D convolution is presented in Figure 1. The formulations of 1D, 2D, 3D convolution are given below:

$$v_{ij}^{z} = f\Bigl(r_{ij} + \sum_{m=0}^{M_i-1}\sum_{b=0}^{B_i-1} k_{ijm}^{b}\, v_{(i-1)m}^{(z+b)}\Bigr) \tag{1}$$

$$v_{ij}^{xy} = f\Bigl(r_{ij} + \sum_{m=0}^{M_i-1}\sum_{h=0}^{H_i-1}\sum_{w=0}^{W_i-1} k_{ijm}^{hw}\, v_{(i-1)m}^{(x+h)(y+w)}\Bigr) \tag{2}$$

$$v_{ij}^{xyz} = f\Bigl(r_{ij} + \sum_{m=0}^{M_i-1}\sum_{b=0}^{B_i-1}\sum_{h=0}^{H_i-1}\sum_{w=0}^{W_i-1} k_{ijm}^{hwb}\, v_{(i-1)m}^{(x+h)(y+w)(z+b)}\Bigr) \tag{3}$$

where v is the output value in the feature map; B, H, W represent the kernel sizes along the spectral and spatial dimensions respectively; (b, h, w) are the kernel indexes and (z, x, y) the feature-map indexes along the 1 spectral and 2 spatial dimensions; k denotes the kernel parameters; i, j, m index the input layer, the output layer and the feature map respectively; M is the number of feature maps, so M_i is the number of feature maps in the i-th layer; and r is the bias term. The rectified linear unit (ReLU) is selected as the activation function in this work:

$$f(x) = \max(0, x) \tag{4}$$
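To make Eq. (3) concrete, the following NumPy sketch (our illustration, not the authors' code; the array shapes and names are hypothetical) evaluates one ReLU-activated output feature map at stride 1 with no padding:

```python
import numpy as np

def conv3d_map(v_prev, k, r):
    """Naive evaluation of Eq. (3) for one output feature map j.

    v_prev : layer i-1 feature maps, shape (M, Z, X, Y)
             (M maps, Z spectral length, X x Y spatial size)
    k      : kernels k_ijm^{hwb}, shape (M, B, H, W)
    r      : scalar bias r_ij
    """
    M, Z, X, Y = v_prev.shape
    _, B, H, W = k.shape
    out = np.empty((Z - B + 1, X - H + 1, Y - W + 1))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # quadruple sum over m, b, h, w of Eq. (3)
                out[z, x, y] = r + np.sum(
                    k * v_prev[:, z:z + B, x:x + H, y:y + W])
    return np.maximum(0.0, out)  # ReLU activation of Eq. (4)
```

In the actual network the same operation runs with the strides and paddings of Table 1, and deep learning frameworks provide it as a built-in 3D convolution layer; the loops above are simply the literal meaning of Eq. (3).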


2.2. Multi-scale 3D Convolution Block

Multi-scale information has proved useful for related classification problems [16]. This is partly because a multi-scale structure captures abundant contextual information, yet it remains little studied in the HSI classification field. In this paper, we propose a multi-scale 3D convolution block, which can be utilized as a basic structure for constructing more powerful CNN models for HSI detection and classification.

Fig. 2. Illustration of the multi-scale 3D convolution block. m1, m2 and m3 denote the kernel sizes in the 2 spatial and 1 spectral dimensions, respectively.
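Fig. 2 and Table 1 suggest four parallel convolutions with a 1×1 spatial extent whose spectral extents are 1, 3, 5 and 11 (conv2_1 to conv2_4). How the branch outputs are merged is not spelled out in the text, so the PyTorch sketch below assumes channel-wise concatenation, an Inception-style choice of ours:

```python
import torch
import torch.nn as nn

class MultiScale3DBlock(nn.Module):
    """Sketch of the multi-scale 3D convolution block of Fig. 2.

    Four parallel 3D convolutions share a 1x1 spatial extent but differ
    in spectral extent (1, 3, 5, 11), mirroring conv2_1..conv2_4 in
    Table 1; concatenating their outputs is our assumption."""

    def __init__(self, in_ch, out_ch=16):
        super().__init__()
        # kernel_size/padding are (spectral, height, width); the padding
        # keeps the spectral length unchanged so branches can be stacked
        self.branches = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size=(b, 1, 1),
                      padding=(b // 2, 0, 0))
            for b in (1, 3, 5, 11))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, Band, H, W)
        return self.relu(torch.cat([br(x) for br in self.branches], dim=1))

# Example: two 7x7 patches with 200 bands and 16 input maps yield
# 4 * 16 = 64 output maps at unchanged spatial/spectral resolution.
block = MultiScale3DBlock(in_ch=16)
out = block(torch.randn(2, 16, 200, 7, 7))  # -> (2, 64, 200, 7, 7)
```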

2.3. M3D-DCNN Model

With our multi-scale 3D convolution block, we construct a multi-scale 3D convolutional neural network model, illustrated in Figure 3. It consists of 10 convolution layers and 1 fully connected layer, and the depth of the network is 5. We utilize a dropout [16] layer to prevent over-fitting, with a dropout ratio of 0.6 in our experiments. Considering the limited labeled data in the HSI field, the depth of our model already exceeds most other CNN models [4][5][6] for HSI classification. Considering the spatial resolution of the data and the target sizes of the classes to be classified, a relatively small kernel size in the 2 spatial dimensions suits our experiments.

The detailed hyper-parameter settings of the model are presented in Table 1. The hyper-parameters were chosen by validation on the training data; that is, we used 80% of the training samples to learn weights and the remaining 20% to choose the hyper-parameters. We used the same model setting for all three datasets; in other words, we do not deliberately tune the hyper-parameters to pursue higher performance.

Table 1. Parameters of convolutional layers
name      kernel number   kernel size H,W,B   kernel stride Δ(H,W,B)
conv1     16              3,3,11              1,1,3
conv2_1   16              1,1,1               1,1,1
conv2_2   16              1,1,3               1,1,1
conv2_3   16              1,1,5               1,1,1
conv2_4   16              1,1,11              1,1,1
conv3_1   16              1,1,1               1,1,1
conv3_2   16              1,1,3               1,1,1
conv3_3   16              1,1,5               1,1,1
conv3_4   16              1,1,11              1,1,1
conv4     16              2,2,3               1,1,1
pooling   –               2,2,3               2,2,3

Fig. 3. Our proposed M3D-DCNN model. The size of the input patch is 7 × 7 × Band. The details of the other hyper-parameters are presented in Table 1.

We train our network with the multinomial logistic loss:

$$E = -\frac{1}{N}\sum_{n=1}^{N} \log(p_{nk}) \tag{5}$$

where p is the output of the softmax layer:

$$p_i = \frac{\exp x_i}{\sum_{i'=1}^{m} \exp x_{i'}} \tag{6}$$

where N is the number of training samples, k is the label corresponding to sample n, m is the number of classes, and x is the input of the softmax layer.
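As a quick check of Eqs. (5)-(6), here is a small NumPy sketch (ours; the paper itself trains with the framework's built-in softmax loss) that evaluates the softmax and the averaged negative log-likelihood:

```python
import numpy as np

def multinomial_logistic_loss(x, labels):
    """Eqs. (5)-(6): softmax over class scores followed by the averaged
    negative log-likelihood of the true labels.

    x      : (N, m) inputs to the softmax layer
    labels : length-N integer class indexes k, one per sample n
    """
    x = x - x.max(axis=1, keepdims=True)        # stabilize the exponentials
    p = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)       # Eq. (6)
    return -np.mean(np.log(p[np.arange(len(labels)), labels])) # Eq. (5)
```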

3. EXPERIMENTS AND RESULTS

We conduct our experiments on the widely used Indian Pines, Pavia Univ. and Salinas Valley datasets. All programs are implemented in Caffe [17], a widely used deep learning framework. We train our model with the AdaGrad [18] algorithm, using a base learning rate of 0.01. In addition, we set the batch size to 40 and the weight decay to 0.01 for all layers.
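Transcribed into PyTorch for illustration (our sketch, not the paper's Caffe configuration; the stand-in model and random batch are placeholders):

```python
import torch
import torch.nn as nn

# Stand-in for the real M3D-DCNN: 7x7x200 patches, 8 output classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 200, 8))

# Solver choices from the text: AdaGrad, base learning rate 0.01,
# weight decay 0.01 on all layers, batches of 40 samples.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01,
                                weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()  # multinomial logistic loss of Eq. (5)

patches = torch.randn(40, 1, 200, 7, 7)   # one batch of input patches
labels = torch.randint(0, 8, (40,))       # their class labels
optimizer.zero_grad()
loss = loss_fn(model(patches), labels)
loss.backward()
optimizer.step()
```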

3.1. Data Sets

The Indian Pines dataset, gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in North-western Indiana, consists of 145 × 145 pixels with a ground resolution of 17 m and 220 spectral reflectance bands in the wavelength range 0.4-2.5 μm. We reduce the number of bands to 200 by removing bands covering the region of water absorption. It includes 16 classes, of which we select 8, because some classes have too few labeled samples.

The University of Pavia dataset, acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia University, consists of 610 × 340 pixels with a ground resolution of 1.3 m and 103 bands.

The Salinas dataset was collected by the 224-band AVIRIS sensor over Salinas Valley, California, comprising 512 × 217 pixels with a ground resolution of 3.7 m. We reduce the number of bands to 204 by removing bands covering the region of water absorption.

In all 3 datasets, we randomly select 200 labeled pixels per class for training and use the rest for testing. The input of our network is an HSI 3D patch of size 7 × 7 × Band, where Band denotes the total number of spectral bands. The size of the output is the number of classes. For reasons of space, only the training and test numbers for each class of the Indian Pines dataset are presented in Table 2.
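The data preparation described above can be sketched as follows (our code; edge padding at image borders is an assumption, since the paper does not say how border pixels are handled):

```python
import numpy as np

def extract_patch(cube, row, col, size=7):
    """Cut one size x size x Band input patch centered on a labeled pixel.
    cube: (H, W, Band) HSI array. Edge padding at the borders is our
    assumption; the paper does not specify its border handling."""
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="edge")
    # after padding, the original (row, col) is the patch's top-left corner
    return padded[row:row + size, col:col + size, :]

def split_train_test(labels, per_class=200, seed=0):
    """Randomly pick `per_class` labeled pixels of each class for training;
    the rest become the test set. labels: (H, W) int map, 0 = unlabeled."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels[labels > 0]):
        rows, cols = np.nonzero(labels == c)
        order = rng.permutation(len(rows))
        pick = list(zip(rows[order], cols[order]))
        train += pick[:per_class]
        test += pick[per_class:]
    return train, test
```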

Table 2. Number of training and test data used in the Indian Pines dataset.
Class   Name              Training #   Test #
1       Corn-notill       200          1228
2       Corn-mintill      200          630
3       Grass-pasture     200          283
4       Hay-windrowed     200          278
5       Soybean-notill    200          772
6       Soybean-mintill   200          2255
7       Soybean-clean     200          393
8       Woods             200          1065
Total                     1600         6904

Due to the limited training samples in the HSI field, we augment the dataset by adding Gaussian noise in the spectral domain, so that the augmented training set is twice as large as the original one.
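A minimal sketch of this augmentation (ours; the paper does not report the noise variance, so `sigma` below is a hypothetical value, and per-element noise on the spectra is our reading of "Gaussian noise in the spectral domain"):

```python
import numpy as np

def augment_spectral(patches, labels, sigma=0.1, seed=0):
    """Double the training set by adding Gaussian noise to the spectra.
    patches: (N, 7, 7, Band) float array; labels: length-N int array."""
    rng = np.random.default_rng(seed)
    noisy = patches + rng.normal(0.0, sigma, size=patches.shape)
    return (np.concatenate([patches, noisy], axis=0),
            np.concatenate([labels, labels], axis=0))
```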
3.2. Result Analysis

First, we compare our M3D-DCNN method with other state-of-the-art methods: RBF-SVM [7], Hu's CNN [7], and Mei's CNN [12]. All methods are compared under the same experimental settings, such as the number of training samples and the patch size. The results are listed in Table 3. As we can see, our M3D-DCNN has better or comparable performance than the other three methods, even though the method in [12] utilizes hand-crafted spatial features in its network.

Table 3. Comparison with 3 state-of-the-art methods: RBF-SVM [7], Hu's CNN [7], Mei's CNN [12] and M3D-DCNN
Dataset        RBF-SVM   Hu's CNN   Mei's CNN   M3D-DCNN
Indian Pines   87.45%    90.07%     95.70%      97.61%
Pavia Univ     90.59%    92.74%     98.00%      98.49%
Salinas        91.34%    92.52%     94.60%      97.24%

For visual comparison, the experimental results on the Indian Pines dataset are drawn in pseudo-color in Figure 4. It is obvious that our M3D-DCNN achieves the best performance.

Fig. 4. Classification maps for the Indian Pines dataset. From left to right: (a) ground truth, (b) RBF-SVM [7], (c) Hu's CNN [7], (d) Mei's CNN [12], and (e) our M3D-DCNN.

The extensive experiments on the three public HSI datasets demonstrate the technical validity and advancement of our method. The results show that (M)3D-DCNN is also an elegant way to jointly extract spatial-spectral features from HSI data.

In addition, we verify the effectiveness of our multi-scale design. To this end, we replace the multi-scale block with a normal 3D convolution layer. The corresponding results are presented in Table 4. As we can see, the multi-scale design improves the performance significantly.

Table 4. Influence of the multi-scale design
Dataset        3D-DCNN   M3D-DCNN
Indian Pines   95.45%    97.61%
Pavia Univ     98.04%    98.49%
Salinas        96.72%    97.24%

4. CONCLUSION

In this paper, a novel multi-scale 3-dimensional deep convolutional neural network (M3D-DCNN) is proposed, which jointly learns both the 2D multi-scale spatial feature and the 1D spectral feature from HSI data in an end-to-end approach. Compared with other state-of-the-art methods, we achieve better or comparable performance on the standard datasets. In future work, we will explore more effective data augmentation methods to overcome data limitations; furthermore, more powerful network architecture designs also deserve attention.


5. REFERENCES

[1] Xiuping Jia, Bor-Chen Kuo, and Melba Crawford, "Feature mining for hyperspectral image classification," Proceedings of the IEEE, vol. 101, no. 3, pp. 676-697, 2013.

[2] Mingyi He, Wenjuan Chang, and Shaohui Mei, "Advance in feature mining from hyperspectral remote sensing data," Spacecraft Recovery & Remote Sensing, vol. 34, no. 1, pp. 1-12, 2013.

[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.

[4] Suraj Srinivas and R. Venkatesh Babu, "Deep learning in neural networks: An overview," Computer Science, 2015.

[5] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Transactions on Geoscience & Remote Sensing, vol. 54, no. 3, pp. 1349-1362, 2015.

[6] Chen Xing, Li Ma, and Xiaoquan Yang, "Stacked denoise autoencoder based feature extraction and classification for hyperspectral images," Journal of Sensors, vol. 2016, pp. 1-10, 2016.

[7] Wei Hu, Yangyu Huang, Li Wei, Fan Zhang, and Hengchao Li, "Deep convolutional neural networks for hyperspectral image classification," Journal of Sensors, vol. 2015, no. 2, pp. 1-12, 2015.

[8] Yushi Chen, Xing Zhao, and Xiuping Jia, "Spectral-spatial classification of hyperspectral data based on deep belief network," IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing, vol. 8, no. 6, pp. 1-12, 2015.

[9] Jun Yue, Wenzhi Zhao, Shanjun Mao, and Hui Liu, "Spectral-spatial classification of hyperspectral images using deep convolutional neural networks," Remote Sensing Letters, vol. 6, no. 6, pp. 468-477, 2015.

[10] Heming Liang and Qi Li, "Hyperspectral imagery classification using sparse representations of convolutional neural network features," Remote Sensing, vol. 8, no. 2, 2016.

[11] Mingyi He and Xiaohui Li, "Deep stacking network with coarse features for hyperspectral image classification," in WHISPERS'16, Aug 2016.

[12] S. Mei, J. Ji, Q. Bi, J. Hou, and Q. Du, "Integrating spectral and spatial information into deep convolutional neural network for hyperspectral classification," in IGARSS, July 2016, pp. 5067-5070.

[13] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, "Learning spatiotemporal features with 3D convolutional networks," arXiv:1412.0767, 2014.

[14] Yushi Chen, Hanlu Jiang, Chunyang Li, and Xiuping Jia, "Deep feature extraction and classification of hyperspectral images based on convolutional neural networks," IEEE Transactions on Geoscience & Remote Sensing, vol. 54, no. 10, pp. 1-20, 2016.

[15] S. Ji, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.

[16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.

[17] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, and Jonathan Long, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint, pp. 675-678, 2014.

[18] John Duchi, Elad Hazan, and Yoram Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, pp. 257-269, 2011.


