
山东大学学报(工学版) [Journal of Shandong University (Engineering Science)] ›› 2020, Vol. 50 ›› Issue (4): 28-34. doi: 10.6040/j.issn.1672-3961.0.2019.454

  • About the authors: LIAO Nanxing (1995— ), female, from Tongren, Guizhou, China, is a master's student whose main research interest is image caption generation. E-mail: nxliao@cumt.edu.cn. *Corresponding author: ZHOU Shibin (1970— ), male, from Lu'an, Anhui, China, is a lecturer with a PhD whose main research interests are machine learning and computer vision. E-mail: zhoushibin@cumt.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61971421)

Image caption generation method based on class activation mapping and attention mechanism

LIAO Nanxing1, ZHOU Shibin1*, ZHANG Guopeng1, CHENG Deqiang2   

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China;
    2. Sun Yueqi Honors College, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
  • Published: 2020-08-13


Abstract: A class activation mapping-attention mechanism was introduced into the soft-attention-based image captioning framework. Class activation mapping added localization information to convolutional features with richer semantics, yielding a better alignment between convolutional features and description words, so that the generated captions described the image content more completely. The attention mechanism was improved with a double-layer long short-term memory network, making it suitable for both global and local information when generating words from specific features. Experiments showed that the improved model generated more accurate descriptions and outperformed models such as the soft attention mechanism on many evaluation metrics; in particular, the BLEU-4 score on the MSCOCO dataset increased by 16.8% compared with the soft-attention model. This indicated that class activation mapping-attention could align words with convolutional features and generate more accurate descriptions with less key information lost.
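The class activation mapping step described above can be illustrated with a minimal sketch, following the standard CAM formulation (Zhou et al., 2016): the map for a target class is the sum of the final convolutional feature maps, each weighted by that class's fully-connected-layer weight. The toy feature maps and weights below are illustrative, not the paper's data.

```python
# Minimal sketch of class activation mapping (CAM): for a chosen class,
# CAM(x, y) = sum_k w_k * F_k(x, y), where F_k are the final conv
# feature maps and w_k are the class's FC-layer weights.
def class_activation_map(feature_maps, class_weights):
    """feature_maps: list of K maps, each H x W (lists of lists);
    class_weights: K scalar weights for the target class."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * fmap[i][j]
    return cam

# toy example: two 2x2 feature maps for one class
maps = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
weights = [0.5, 1.0]
print(class_activation_map(maps, weights))  # [[0.5, 2.0], [2.0, 0.5]]
```

High-valued regions of the resulting map localize the evidence for the class, which is what gives the convolutional features the spatial grounding the abstract refers to.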

Key words: image caption, attention mechanism, class activation mapping, convolutional neural network, recurrent neural network
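The soft-attention base framework the paper builds on can be sketched as follows. This is a generic illustration (per Xu et al., 2015), not the paper's exact double-layer LSTM scoring: unnormalized scores over the spatial feature vectors are softmax-normalized into attention weights, and the decoder receives their weighted sum as a context vector at each word-generation step.

```python
import math

# Sketch of soft attention over spatial features: softmax the scores,
# then form the context vector as the weighted sum of feature vectors.
def soft_attention(features, scores):
    """features: L vectors of dimension D; scores: L unnormalized
    attention scores. Returns (weights, context)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(features[0])
    context = [sum(a * f[d] for a, f in zip(alphas, features))
               for d in range(dim)]
    return alphas, context

# toy example: two 2-D feature vectors with equal scores
feats = [[1.0, 0.0], [0.0, 1.0]]
alphas, ctx = soft_attention(feats, [0.0, 0.0])
print(alphas, ctx)  # [0.5, 0.5] [0.5, 0.5]
```

Replacing the score computation with one driven by the CAM-enhanced features is, per the abstract, where the proposed method improves the alignment between generated words and image regions.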

CLC number: TP391