Abstract
Convolutional neural networks (CNNs) achieve outstanding results in single-label image classification. However, because of complex underlying object layouts and a shortage of multi-label training images, achieving better performance on multi-label images with CNNs remains an open problem. In this work, we propose an improved deep CNN model that extracts features of objects at different scales in multi-label images through spatial pyramid pooling and feature fusion. For training, we first transfer parameters pre-trained on ImageNet to our model and then train an adversarial network to generate examples with occlusions, which makes our model invariant to occlusions. Experimental results on the Pascal VOC 2012 and Corel 5K image datasets show that the proposed approach outperforms many existing approaches. Our model reaches a mean average precision (mAP) of 84.0% on the VOC 2012 dataset, which significantly outperforms most approaches and is close to that of HCP, a representative multi-label classification approach.
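To illustrate the spatial pyramid pooling step mentioned above, the following is a minimal NumPy sketch, not the model's actual implementation. The function name `spatial_pyramid_pool`, the pyramid levels (1, 2, 4), and the use of max pooling are assumptions made for illustration; the abstract does not specify them. The sketch shows how feature maps of different spatial sizes are reduced to a vector of the same fixed length.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    `levels` is a hypothetical pyramid configuration: each level n splits
    the map into an n x n grid of bins and keeps the per-channel maximum
    of every bin.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Fractional bin edges so that the bins cover the whole map even
        # when H or W is not divisible by n.
        row_edges = np.linspace(0, height, n + 1)
        col_edges = np.linspace(0, width, n + 1)
        for i in range(n):
            r0 = int(np.floor(row_edges[i]))
            r1 = int(np.ceil(row_edges[i + 1]))
            for j in range(n):
                c0 = int(np.floor(col_edges[j]))
                c1 = int(np.ceil(col_edges[j + 1]))
                # Per-channel maximum over this spatial bin.
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    # Output length is C * sum(n * n for n in levels), independent of H and W.
    return np.concatenate(pooled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    small = rng.random((8, 6, 9))    # 8 channels, 6 x 9 spatial size
    large = rng.random((8, 13, 17))  # same channels, different spatial size
    print(spatial_pyramid_pool(small).shape)  # (168,)
    print(spatial_pyramid_pool(large).shape)  # (168,)
```

Both inputs yield 8 × (1 + 4 + 16) = 168 values, which is what allows a fixed-size classifier to follow convolutional layers applied to images of varying size.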
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61663004, 61762078, 61866004), the Guangxi Natural Science Foundation (Nos. 2016GXNSFAA380146, 2017GXNSFAA198365, 2018GXNSFDA281009), the Research Fund of Guangxi Key Lab of Multi-source Information Mining and Security (16-A-03-02, MIMS18-08), the Guangxi Special Project of Science and Technology Base and Talents (AD16380008), Guangxi “Bagui Scholar” Teams for Innovation and Research Project, Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhou, T., Li, Z., Zhang, C. et al. Classify multi-label images via improved CNN model with adversarial network. Multimed Tools Appl 79, 6871–6890 (2020). https://doi.org/10.1007/s11042-019-08568-z
DOI: https://doi.org/10.1007/s11042-019-08568-z