Journal of Shandong University (Engineering Science), 2025, Vol. 55, Issue 3: 80-87. doi: 10.6040/j.issn.1672-3961.0.2024.018

• Machine Learning & Data Mining •

Multi-scale visual and textual semantic feature fusion for image captioning

LI Feng, WEN Yimin*   

  School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China
  Published: 2025-06-05

Abstract: Category differences between the object detector's pre-training dataset and the image captioning dataset can lead to object recognition errors, and uneven sample sizes across scenes can leave a model with an insufficient understanding of the relationships between objects in rare scenes. To address these issues, a multi-scale visual and textual semantic feature fusion method for image captioning (MVTFF-IC) was proposed. The multi-scale visual feature fusion (MVFF) module modeled global, grid, and region features with a graph attention network to obtain more representative visual representations. The deep semantic fusion module (DSFM) integrated textual semantic features, including object relationships, through a cross-attention mechanism to generate more accurate descriptions. Experimental results on the Microsoft common objects in context (MSCOCO) dataset showed that MVTFF-IC achieved a consensus-based image description evaluation (CIDEr) score of 136.7, outperforming many popular existing algorithms and demonstrating its ability to capture key image information more accurately and generate high-quality descriptions.
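The abstract describes two components: fusing global, grid, and region visual features with a graph attention network (MVFF), and injecting textual semantic features through cross-attention (DSFM). The following is a minimal, illustrative PyTorch sketch of these two ideas, not the authors' released code: it stands in a standard multi-head attention layer over a fully connected node set for the paper's graph attention network, and all module names, feature dimensions, and token counts (1 global, 49 grid, 36 region, 20 text tokens) are assumptions chosen only for the example.

```python
# Hypothetical sketch of the MVFF/DSFM ideas described in the abstract.
# Not the authors' implementation; dimensions and names are assumptions.
import torch
import torch.nn as nn


class MultiScaleVisualFusion(nn.Module):
    """Fuse global, grid, and region features by attending over their union.

    Stand-in for the paper's graph attention network: every feature vector is
    treated as a node and multi-head attention plays the role of edges on a
    fully connected graph.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat, grid_feat, region_feat):
        nodes = torch.cat([global_feat, grid_feat, region_feat], dim=1)
        fused, _ = self.attn(nodes, nodes, nodes)
        return self.norm(nodes + fused)


class DeepSemanticFusion(nn.Module):
    """Cross-attention from fused visual nodes to textual semantic features."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_nodes, text_feat):
        attended, _ = self.cross_attn(visual_nodes, text_feat, text_feat)
        return self.norm(visual_nodes + attended)


if __name__ == "__main__":
    B, D = 2, 512
    g = torch.randn(B, 1, D)      # global image feature
    grid = torch.randn(B, 49, D)  # 7x7 grid features
    reg = torch.randn(B, 36, D)   # region (detector) features
    txt = torch.randn(B, 20, D)   # textual semantic features (e.g. relations)

    visual = MultiScaleVisualFusion(D)(g, grid, reg)
    out = DeepSemanticFusion(D)(visual, txt)
    print(out.shape)  # torch.Size([2, 86, 512])
```

In this sketch the fused visual-plus-textual tokens would then feed a caption decoder; how the decoder consumes them is left out because the abstract does not specify it.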

Key words: image captioning, graph attention network, visual feature, textual feature, attention mechanism

CLC Number: TP391