Journal of Shandong University (Engineering Science), 2025, Vol. 55, Issue 4: 29-39. DOI: 10.6040/j.issn.1672-3961.0.2024.024

• Special Issue for Deep Learning with Vision •

Multi-granularity alignment network for image-text matching

WANG Xufeng1, ZHOU Di1, ZHANG Fenglei1, SONG Xuemeng2, LIU Meng1*   

  1. College of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
  2. College of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China
  • Published: 2025-08-31

Abstract: To precisely match image and text data, a multi-granularity alignment network (MGAN) was proposed. By adopting a contrastive language-image pre-training model and a Transformer-based bidirectional encoder model, MGAN extracted information at three granularities: patch level, regional level, and global level, addressing the shortcomings of matching on single-granularity information. An alignment mechanism tailored to the characteristics of the information at each level was employed. At the regional level, a multi-view summarization module was integrated, allowing MGAN to effectively handle the one-to-many description problem between images and texts. At the patch level, a cross-modal similarity interaction modeling module was introduced to further enhance the fine-grained interactions between images and texts. Extensive experiments on the publicly available Flickr30K and MS-COCO datasets demonstrated that MGAN achieved promising performance, confirming the effectiveness of the multi-granularity alignment approach.
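To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of how global-, regional-, and patch-level similarities could be computed and combined. It is an illustration under stated assumptions, not the paper's implementation: the module name MultiViewSummarizer, the pooling choices, and the equal weighting of the three scores are hypothetical, and random tensors stand in for CLIP patch/global features and BERT word/sentence features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewSummarizer(nn.Module):
    """Summarize region/patch features into a few 'view' vectors via learned
    queries (an assumed design standing in for the multi-view summarization module)."""
    def __init__(self, dim: int, num_views: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_views, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N, D) -> views: (B, K, D)
        q = self.queries.unsqueeze(0).expand(region_feats.size(0), -1, -1)
        views, _ = self.attn(q, region_feats, region_feats)
        return views

def global_score(img_global: torch.Tensor, txt_global: torch.Tensor) -> torch.Tensor:
    # Coarse alignment: cosine similarity of the two global embeddings -> (B,)
    return F.cosine_similarity(img_global, txt_global, dim=-1)

def patch_word_score(patches: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    # Fine-grained interaction: cosine similarity between every patch and word,
    # then max over patches and mean over words -> (B,)
    p = F.normalize(patches, dim=-1)          # (B, P, D)
    w = F.normalize(words, dim=-1)            # (B, W, D)
    sim = torch.einsum('bpd,bwd->bpw', p, w)  # (B, P, W)
    return sim.max(dim=1).values.mean(dim=-1)

# Random stand-ins for CLIP patch/global and BERT word/sentence features.
B, P, W, D = 2, 49, 12, 256
img_patches, txt_words = torch.randn(B, P, D), torch.randn(B, W, D)
img_global, txt_global = img_patches.mean(dim=1), txt_words.mean(dim=1)

summarizer = MultiViewSummarizer(D, num_views=4)
views = summarizer(img_patches)                                   # (B, 4, D)
view_score = F.cosine_similarity(
    views, txt_global.unsqueeze(1).expand_as(views), dim=-1
).max(dim=1).values                                               # best view per pair

score = global_score(img_global, txt_global) + patch_word_score(img_patches, txt_words) + view_score
print(score.shape)  # torch.Size([2])
```

In practice such matching scores would typically be trained with a hinge-based triplet ranking loss over matched and mismatched image-text pairs; the equal weighting of the three levels above is an assumption, not a reported setting.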

Key words: image-text matching, cross-modal retrieval, multi-granularity, multi-view, cross-modal similarity interaction

CLC Number: TP391