Journal of Shandong University (Engineering Science) ›› 2020, Vol. 50 ›› Issue (4): 28-34. doi: 10.6040/j.issn.1672-3961.0.2019.454


Image caption generation method based on class activation mapping and attention mechanism

LIAO Nanxing1, ZHOU Shibin1*, ZHANG Guopeng1, CHENG Deqiang2   

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China;
    2. Sun Yueqi Honors College, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
  • Published: 2020-08-13

Abstract: A class activation mapping-attention mechanism was introduced into the soft attention based image caption framework. The class activation mapping mechanism added position information to convolutional features with richer semantic information, yielding a better alignment between convolutional features and description words, so that the generated caption described the image content more completely. The attention mechanism was improved with a double-layer long short-term memory network, making it suitable for combining global and local information when generating words with specific features. Experiments showed that the improved model generated more accurate descriptions and outperformed models such as the soft attention mechanism on many evaluation criteria; in particular, the BLEU-4 score on the MSCOCO dataset increased by 16.8% compared with the soft attention based model, which showed that class activation mapping-attention could align words with convolutional features and generate more accurate descriptions with less key information lost.
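The core idea in the abstract can be illustrated with a minimal NumPy sketch: a class activation map is computed from the convolutional feature map and the classifier weights, then used to bias the soft-attention scores toward class-relevant regions. This is an illustrative assumption, not the paper's implementation; the additive fusion of the CAM with the attention scores, all tensor shapes, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_activation_map(conv_feats, fc_weights, class_idx):
    """CAM in the style of Zhou et al. [5]: weight the C feature channels
    at each of the L spatial positions by the classifier weights of one class.
    conv_feats: (L, C) flattened spatial grid; fc_weights: (num_classes, C)."""
    cam = conv_feats @ fc_weights[class_idx]          # (L,) spatial relevance
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1]

def cam_soft_attention(conv_feats, hidden, W_feat, W_hid, v, cam):
    """Soft attention as in Xu et al. [3], with the CAM added to the scores
    (one simple way to inject position/class information; an assumption here).
    conv_feats: (L, C), hidden: (H,), W_feat: (C, D), W_hid: (H, D), v: (D,)."""
    scores = np.tanh(conv_feats @ W_feat + hidden @ W_hid) @ v  # (L,)
    alpha = softmax(scores + cam)          # CAM biases attention to class regions
    context = alpha @ conv_feats           # (C,) context vector for the decoder
    return context, alpha
```

At each decoding step, the LSTM hidden state would be fed to `cam_soft_attention` and the returned context vector used to predict the next word; the double-layer LSTM of the paper is omitted here for brevity.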

Key words: image caption, attention mechanism, class activation mapping, convolutional neural network, recurrent neural network

CLC Number: TP391
[1] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C] // Proceedings of the IEEE conference on computer vision and pattern recognition. Boston, USA: IEEE, 2015: 3128-3137.
[2] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C] // Proceedings of the IEEE conference on computer vision and pattern recognition. Boston, USA: IEEE, 2015: 3156-3164.
[3] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C] // Proceedings of the International Conference on Machine Learning. Lille, France: JMLR, 2015: 2048-2057.
[4] MAO J, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks(m-RNN)[C] // Proceedings of the International Conference on Learning Representations. San Diego, USA: ICLR, 2014: 13-29.
[5] ZHOU B, KHOSLA A, LAPEDRIZA A, et al. Learning deep features for discriminative localization[C] // Proceedings of the IEEE conference on computer vision and pattern recognition. Las Vegas, USA: IEEE, 2016: 2921-2929.
[6] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C] // Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 618-626.
[7] LIN M, CHEN Q, YAN S. Network in network[C] // Proceedings of the International Conference on Learning Representations. Banff, Canada: ICLR, 2013: 284-294.
[8] MNIH V, HEESS N, GRAVES A. Recurrent models of visual attention[C] // Proceedings of the Advances in Neural Information Processing Systems. Montreal, Canada: NIPS, 2014: 2204-2212.
[9] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C] // Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Shanghai, China: IEEE, 2016: 4945-4949.
[10] LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 375-383.
[11] YANG Z, ZHANG Y-J, UR REHMAN S, et al. Image captioning with object detection and localization[C] // Proceedings of the International Conference on Image and Graphics. Singapore: Springer, 2017: 109-118.
[12] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 770-778.
[13] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C] // Proceedings of the International Conference on Learning Representations. [S.l.]: ICLR, 2014: 4-11.
[14] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 6077-6086.
[15] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C] // Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, USA: ACL, 2002: 311-318.
[16] LIN C-Y. Rouge: a package for automatic evaluation of summaries[C] // Proceedings of the Text Summarization Branches out. Barcelona, Spain: ACL, 2004: 74-81.
[17] LAVIE A, AGARWAL A. METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments[C] // Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic: ACL, 2007: 228-231.
[18] VEDANTAM R, LAWRENCE ZITNICK C, PARIKH D. CIDEr: consensus-based image description evaluation[C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 4566-4575.
[19] DONAHUE J, ANNE HENDRICKS L, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 2625-2634.
[20] KIROS R, SALAKHUTDINOV R, ZEMEL R. Multimodal neural language models[C] // Proceedings of Machine Learning Research. Beijing, China: PMLR, 2014: 595-603.