
Journal of Shandong University (Engineering Science) ›› 2025, Vol. 55 ›› Issue (4): 29-39. doi: 10.6040/j.issn.1672-3961.0.2024.024

• Special Topic on Deep Learning and Vision •

Multi-granularity alignment network for image-text matching

WANG Xufeng1, ZHOU Di1, ZHANG Fenglei1, SONG Xuemeng2, LIU Meng1*

  1. College of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
  2. College of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
  • Published: 2025-08-31
  • About the authors: WANG Xufeng (1997- ), male, from Xuzhou, Jiangsu, is a master's student whose research interests include multimedia computing and information retrieval. E-mail: xufeng_wang@163.com. *Corresponding author: LIU Meng (1991- ), female, from Shangzhi, Heilongjiang, Ph.D., is a professor and master's supervisor whose research interests include multimedia computing and information retrieval. E-mail: mengliu.sdu@gmail.com
  • Supported by:
    the National Natural Science Foundation of China (62376140, U23A20315, 62236003), the Natural Science Foundation of Shandong Province for Excellent Young Scholars (ZR2022YQ59), and the Youth Innovation Science and Technology Support Program of Shandong Provincial Colleges and Universities (2023KJ128)

Abstract: To precisely match image and text data, a multi-granularity alignment network (MGAN) was proposed. By adopting a contrastive language-image pre-training model and a Transformer-based bidirectional encoder model, MGAN extracted information at three different granularities: patch level, region level, and global level, addressing the shortcomings of single-granularity matching. A multi-level alignment mechanism was employed according to the characteristics of the information at each level. At the region level, a multi-view summarization module was integrated, allowing MGAN to effectively handle the one-to-many description problem between images and texts. At the patch level, a cross-modal similarity interaction modeling module was introduced to further enhance the fine-grained interactions between images and texts. Extensive experimental results on the publicly available Flickr30K and MS-COCO datasets demonstrated that MGAN achieved superior matching performance, confirming the effectiveness of the multi-granularity alignment approach.

Key words: image-text matching, cross-modal retrieval, multi-granularity, multi-view, cross-modal similarity interaction
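
To make the three-level design described in the abstract concrete, the following is a minimal PyTorch sketch of how similarities at the three granularities could be scored and fused. It assumes features have already been extracted (e.g., patch/global features from CLIP and token/global features from BERT); all function and variable names, and the max/mean aggregation choices, are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def global_alignment(img_glob, txt_glob):
        # Global level: cosine similarity between pooled image/text embeddings.
        return F.cosine_similarity(img_glob, txt_glob, dim=-1)

    def region_alignment(img_regions, txt_glob, view_logits):
        # Region level with multi-view summarization: soft-attend the regions
        # into several summary vectors ("views") and keep the best-matching
        # view, so one image can satisfy several different descriptions.
        attn = view_logits.softmax(dim=-1)          # (V, R) weights over regions
        views = attn @ img_regions                  # (V, d) view summaries
        return F.cosine_similarity(views, txt_glob.unsqueeze(0), dim=-1).max()

    def patch_alignment(img_patches, txt_tokens):
        # Patch level with similarity interaction: build the patch-token cosine
        # similarity matrix, then aggregate each token's best-matching patch.
        sim = F.normalize(img_patches, dim=-1) @ F.normalize(txt_tokens, dim=-1).T
        return sim.max(dim=0).values.mean()

    # Toy usage with random features: d=512, 36 regions, 49 patches, 12 tokens, 4 views.
    d, R, P, T, V = 512, 36, 49, 12, 4
    img_glob, txt_glob = torch.randn(d), torch.randn(d)
    score = (global_alignment(img_glob, txt_glob)
             + region_alignment(torch.randn(R, d), txt_glob, torch.randn(V, R))
             + patch_alignment(torch.randn(P, d), torch.randn(T, d)))
    print(float(score))

Here the three scores are simply summed; a learned weighting or a training objective such as a triplet ranking loss would replace this in a full system.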

CLC number: TP391