Journal of Shandong University (Engineering Science), 2020, Vol. 50, Issue (4): 28-34. doi: 10.6040/j.issn.1672-3961.0.2019.454
廖南星1,周世斌1*,张国鹏1,程德强2
LIAO Nanxing1, ZHOU Shibin1*, ZHANG Guopeng1, CHENG Deqiang2
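Abstract: Building on image captioning algorithms that use soft attention, this paper proposes an image captioning method that combines class activation mapping with an attention mechanism. The class activation mapping algorithm yields convolutional features that carry localization information and richer semantic information, so the convolutional features correspond more closely to the image description. This addresses the alignment problem between convolutional features and captions and lets the generated natural-language description cover the image content as completely as possible. A two-layer long short-term memory (LSTM) network is used to improve the structure of the attention mechanism, so that the new attention mechanism suits the feature representation of the current global and local information and can select appropriate feature representations for generating the caption. Experimental results show that the improved model outperforms the soft attention model and other models on many evaluation metrics; on the MSCOCO dataset, its BLEU-4 score is 16.8% higher than that of the soft attention model. The class activation mapping mechanism can address the alignment of image spatial information with caption semantics, so that the generated natural language loses less key information and image captioning accuracy improves.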
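As a rough illustration of the two ingredients named in the abstract, the sketch below computes a class activation map in its standard form (a weighted sum of the final convolutional feature maps, using the classifier weights of one class) and a soft attention step (softmax weights over region features, followed by a weighted average). The function names, the toy shapes, and the choice to feed the class activation map in as attention scores are all assumptions made for this example; the abstract does not specify how the paper couples the two, and the two-layer LSTM decoder is not modeled here.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """Standard CAM: M(x, y) = sum_k w_k * f_k(x, y).

    features:      (K, H, W) final-layer convolutional feature maps
    class_weights: (K,) classifier weights for one class
    """
    return np.tensordot(class_weights, features, axes=([0], [0]))

def soft_attention(region_features, scores):
    """Soft attention: softmax over scores, then a weighted average.

    region_features: (L, D) one feature vector per image region
    scores:          (L,) unnormalised relevance scores
    """
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights = exp / exp.sum()
    context = weights @ region_features  # (D,) attended feature vector
    return weights, context

# Toy input (hypothetical): 2 channels on a 2x2 grid.
features = np.zeros((2, 2, 2))
features[0, 0, 0] = 1.0  # channel 0 responds at the top-left location
features[1, 1, 1] = 1.0  # channel 1 responds at the bottom-right location

cam = class_activation_map(features, np.array([2.0, 0.5]))

# Illustrative assumption only: reuse the CAM values as attention scores.
regions = features.reshape(2, -1).T  # (4, 2): one vector per grid location
weights, context = soft_attention(regions, cam.reshape(-1))
```

In this toy case the map peaks at the top-left location, so the attention weights concentrate there and the context vector is dominated by channel 0.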