Journal of Shandong University (Engineering Science) [山东大学学报(工学版)], 2025, Vol. 55(4): 29-39. DOI: 10.6040/j.issn.1672-3961.0.2024.024
• Special Topic on Deep Learning and Vision •
WANG Xufeng1, ZHOU Di1, ZHANG Fenglei1, SONG Xuemeng2, LIU Meng1*
Abstract: To match image and text data accurately, a multi-granularity alignment network (MGAN) is proposed. A contrastive language-image pre-training model (CLIP) and a Transformer-based bidirectional encoder model (BERT) are used to extract information at three granularities (patch level, region level, and global level), remedying the limitation of relying on a single kind of matching information. A multi-level alignment mechanism is adopted according to the characteristics of each level: at the region level, a multi-view summarization strategy enables MGAN to handle the one-to-many description problem between images and texts; at the patch level, a cross-modal similarity interaction modeling module further strengthens the fine-grained interaction between images and texts. Extensive experiments on two public datasets, Flickr30K and MS-COCO, show that MGAN achieves higher matching performance, verifying the effectiveness of the multi-granularity alignment approach.
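To make the pipeline concrete, below is a minimal sketch of how global-level and patch-level image-text similarity could be computed with off-the-shelf CLIP and BERT checkpoints from the HuggingFace transformers library. It is illustrative only, not the authors' implementation: the paper's region-level branch, its multi-view summarization strategy, and the learned cross-modal similarity interaction module are replaced here by a simple max-over-patches matching and a fixed equal-weight fusion, and all checkpoint names and weights below are assumptions.

```python
# Sketch of two of MGAN's three granularities (global + patch level),
# under assumed public checkpoints; region-level features from an
# object detector and the learned fusion are omitted for brevity.
import torch
from transformers import CLIPModel, CLIPProcessor, BertModel, BertTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
bert = BertModel.from_pretrained("bert-base-uncased")
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def multi_granularity_similarity(image, caption):
    # Global level: cosine similarity in CLIP's joint embedding space.
    inputs = clip_proc(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    clip_out = clip(**inputs)
    global_sim = torch.cosine_similarity(clip_out.image_embeds,
                                         clip_out.text_embeds)      # (1,)

    # Patch level: ViT patch tokens vs. BERT word tokens.
    patch_feats = clip.vision_model(
        pixel_values=inputs["pixel_values"]
    ).last_hidden_state[:, 1:, :]            # drop [CLS]: (1, P, Dv)
    tok = bert_tok(caption, return_tensors="pt")
    word_feats = bert(**tok).last_hidden_state            # (1, T, Dt)

    # Match widths before fine-grained matching (768 for both of these
    # checkpoints); this stands in for the paper's cross-modal
    # similarity interaction module.
    d = min(patch_feats.size(-1), word_feats.size(-1))
    p = torch.nn.functional.normalize(patch_feats[..., :d], dim=-1)
    w = torch.nn.functional.normalize(word_feats[..., :d], dim=-1)
    # For each word, take its best-matching patch, then average words.
    patch_sim = (w @ p.transpose(1, 2)).max(dim=-1).values.mean(dim=-1)

    # Fixed equal-weight fusion; the real model learns this combination.
    return 0.5 * global_sim + 0.5 * patch_sim
```

A usage example under the same assumptions: `multi_granularity_similarity(Image.open("example.jpg"), "a dog running on the beach")` returns a one-element tensor whose value rises when the caption describes the image, which is the quantity a retrieval system would rank by.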