
Journal of Shandong University (Engineering Science) ›› 2025, Vol. 55 ›› Issue (3): 80-87. doi: 10.6040/j.issn.1672-3961.0.2024.018

• Machine Learning and Data Mining •

Multi-scale visual and textual semantic feature fusion for image captioning

LI Feng, WEN Yimin*

  1. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China
  • Published: 2025-06-05
  • About the authors: LI Feng (1997— ), male, from Dongguan, Guangdong, is a master's student whose main research interest is image captioning. E-mail: 21032303066@mails.guet.edu.cn. *Corresponding author: WEN Yimin (1969— ), male, from Taojiang, Hunan, Ph.D., is a professor and doctoral supervisor whose main research interests are machine learning and data mining. E-mail: ymwen@guet.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62366011); Guangxi Key Research and Development Program (Guike AB21220023); Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP2306); Postgraduate Education Innovation Program of Guilin University of Electronic Technology (2023YCXB11)

Abstract: To address object recognition errors caused by category differences between the object detector's pre-training dataset and the image captioning dataset, as well as the model's insufficient understanding of relationships between objects in rare scenes caused by varying sample sizes across scenes, a multi-scale visual and textual semantic feature fusion algorithm for image captioning (MVTFF-IC) was proposed. The multi-scale visual feature fusion (MVFF) module modeled global, grid, and region features with a graph attention network to obtain more representative visual representations. The deep semantic fusion module (DSFM) integrated textual semantic features, including object relationships, through a cross-attention mechanism to generate more accurate descriptions. Experimental results on the Microsoft common objects in context (MSCOCO) dataset showed that MVTFF-IC achieved a consensus-based image description evaluation (CIDEr) score of 136.7, outperforming many popular existing algorithms and demonstrating its ability to capture key information in images more accurately and generate high-quality descriptions.

Key words: image captioning, graph attention network, visual feature, textual feature, attention mechanism
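To make the two fusion steps described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes a fully connected graph over global, grid, and region features for the MVFF-style graph attention, and a single cross-attention layer for the DSFM-style textual fusion. All class names, dimensions, and feature counts (GraphAttentionFusion, CrossModalFusion, d_model=512, 49 grid cells, 36 regions) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionFusion(nn.Module):
    """Single-head GAT-style layer over a fully connected graph whose nodes are
    the global, grid, and region features (an assumption about the MVFF module)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.attn = nn.Linear(2 * d_model, 1, bias=False)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, n_nodes, d_model)
        h = self.proj(nodes)
        n = h.size(1)
        # Score every ordered node pair (i, j) from the concatenation [h_i; h_j].
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)            # (b, n, n, d)
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)            # (b, n, n, d)
        scores = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = scores.softmax(dim=-1)                       # attention weights over neighbours
        return alpha @ h                                     # fused visual nodes, (b, n, d)


class CrossModalFusion(nn.Module):
    """Cross-attention step loosely following the DSFM description: visual queries
    attend to textual semantic features that encode object relations."""

    def __init__(self, d_model: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query=visual, key=text, value=text)
        return fused + visual                                # residual connection


if __name__ == "__main__":
    b, d = 2, 512
    glob = torch.randn(b, 1, d)     # one global image feature
    grid = torch.randn(b, 49, d)    # 7x7 grid features
    region = torch.randn(b, 36, d)  # detector region features
    text = torch.randn(b, 20, d)    # textual semantic features (e.g. relation phrases)

    visual_nodes = torch.cat([glob, grid, region], dim=1)    # (2, 86, 512)
    fused_visual = GraphAttentionFusion(d)(visual_nodes)
    decoder_input = CrossModalFusion(d)(fused_visual, text)
    print(decoder_input.shape)                               # torch.Size([2, 86, 512])
```

In the described algorithm the fused visual features would then feed a caption decoder; this sketch only illustrates the two feature-fusion stages under the stated assumptions.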

CLC number: TP391