
山东大学学报(工学版) [Journal of Shandong University (Engineering Science)] ›› 2020, Vol. 50 ›› Issue (4): 28-34. doi: 10.6040/j.issn.1672-3961.0.2019.454

  • About the authors: LIAO Nanxing (1995— ), female, from Tongren, Guizhou, China, is a master's student whose main research interest is image caption generation. E-mail: nxliao@cumt.edu.cn. *Corresponding author: ZHOU Shibin (1970— ), male, from Lu'an, Anhui, China, is a lecturer with a PhD whose main research interests are machine learning and computer vision. E-mail: zhoushibin@cumt.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61971421)

Image caption generation method based on class activation mapping and attention mechanism

LIAO Nanxing1, ZHOU Shibin1*, ZHANG Guopeng1, CHENG Deqiang2   

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China;
    2. Sun Yueqi Honors College, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
  • Published: 2020-08-13


Abstract: A class activation mapping-attention mechanism was introduced into the soft-attention-based image captioning framework. Class activation mapping added localization information to convolutional features with richer semantics, yielding a better alignment between convolutional features and description words, so that the generated captions described the image content more completely. The attention mechanism was improved with a double-layer long short-term memory network, making it suitable for both global and local information when generating words from specific features. Experiments showed that the improved model generated more accurate descriptions and outperformed models such as the soft attention mechanism on many evaluation metrics; in particular, the BLEU-4 score on the MSCOCO dataset increased by 16.8% compared with the soft-attention model. This indicated that class activation mapping-attention could align words with convolutional features and generate more accurate descriptions with less key information lost.
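The class activation mapping step described above can be illustrated with a minimal sketch, following the standard CAM formulation (Zhou et al., 2016): the map for a target class is the sum of the final convolutional feature maps, each weighted by that class's fully-connected-layer weight. The toy feature maps and weights below are illustrative, not the paper's data.

```python
# Minimal sketch of class activation mapping (CAM): for a chosen class,
# CAM(x, y) = sum_k w_k * F_k(x, y), where F_k are the final conv
# feature maps and w_k are the class's FC-layer weights.
def class_activation_map(feature_maps, class_weights):
    """feature_maps: list of K maps, each H x W (lists of lists);
    class_weights: K scalar weights for the target class."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * fmap[i][j]
    return cam

# toy example: two 2x2 feature maps for one class
maps = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
weights = [0.5, 1.0]
print(class_activation_map(maps, weights))  # [[0.5, 2.0], [2.0, 0.5]]
```

High-valued regions of the resulting map localize the evidence for the class, which is what gives the convolutional features the spatial grounding the abstract refers to.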

Key words: image caption, attention mechanism, class activation mapping, convolutional neural network, recurrent neural network
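The soft-attention base framework the paper builds on can be sketched as follows. This is a generic illustration (per Xu et al., 2015), not the paper's exact double-layer LSTM scoring: unnormalized scores over the spatial feature vectors are softmax-normalized into attention weights, and the decoder receives their weighted sum as a context vector at each word-generation step.

```python
import math

# Sketch of soft attention over spatial features: softmax the scores,
# then form the context vector as the weighted sum of feature vectors.
def soft_attention(features, scores):
    """features: L vectors of dimension D; scores: L unnormalized
    attention scores. Returns (weights, context)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(features[0])
    context = [sum(a * f[d] for a, f in zip(alphas, features))
               for d in range(dim)]
    return alphas, context

# toy example: two 2-D feature vectors with equal scores
feats = [[1.0, 0.0], [0.0, 1.0]]
alphas, ctx = soft_attention(feats, [0.0, 0.0])
print(alphas, ctx)  # [0.5, 0.5] [0.5, 0.5]
```

Replacing the score computation with one driven by the CAM-enhanced features is, per the abstract, where the proposed method improves the alignment between generated words and image regions.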

CLC number: TP391