Journal of Shandong University (Engineering Science), 2025, Vol. 55, Issue 4: 29-39. doi: 10.6040/j.issn.1672-3961.0.2024.024
• Special Issue for Deep Learning with Vision •
WANG Xufeng¹, ZHOU Di¹, ZHANG Fenglei¹, SONG Xuemeng², LIU Meng¹*
References
[1] WEI Y, WANG X, GUAN W, et al. Neural multimodal cooperative learning toward micro-video understanding[J]. IEEE Transactions on Image Processing, 2020, 29: 1-14.
[2] HU Y, ZHAN P, XU Y, et al. Temporal representation learning for time series classification[J]. Neural Computing and Applications, 2021, 33: 3169-3182.
[3] WEI Y, WANG X, NIE L, et al. MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM, 2019: 1437-1445.
[4] CHEN H, DING G, LIN Z, et al. Cross-modal image-text retrieval with semantic consistency[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM, 2019: 1749-1757.
[5] CHEN H, DING G, LIU X, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 12655-12663.
[6] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook, USA: ACM, 2013: 2121-2129.
[7] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. (2014-11-10)[2024-01-31]. https://arxiv.org/abs/1411.2539.
[8] LIU Y, GUO Y, BAKKER E M, et al. Learning a recurrent residual fusion network for multimodal matching[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 4107-4116.
[9] WANG L, LI Y, LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 5005-5013.
[10] SARAFIANOS N, XU X, KAKADIARIS I A. Adversarial representation learning for text-to-image matching[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 5814-5824.
[11] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
[12] LIU C, MAO Z, ZHANG T, et al. Graph structured network for image-text matching[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 10921-10930.
[13] HUANG F, ZHANG X, ZHAO Z, et al. Bi-directional spatial-semantic attention networks for image-text matching[J]. IEEE Transactions on Image Processing, 2019, 28(4): 2008-2020.
[14] WANG Z, LIU X, LI H, et al. CAMP: cross-modal adaptive message passing for text-image retrieval[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 5764-5773.
[15] WEI X, ZHANG T, LI Y, et al. Multi-modality cross attention network for image and sentence matching[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 10941-10950.
[16] WANG B, YANG Y, XU X, et al. Adversarial cross-modal retrieval[C]//Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: ACM, 2017: 154-162.
[17] LI K, ZHANG Y, LI K, et al. Visual semantic reasoning for image-text matching[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 4654-4662.
[18] LI K, ZHANG Y, LI K, et al. Image-text embedding learning via visual and textual semantic reasoning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 641-656.
[19] WANG Y, YANG H, QIAN X, et al. Position focused attention network for image-text matching[EB/OL]. (2019-07-23)[2024-01-31]. https://arxiv.org/abs/1907.09748.
[20] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]//Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018: 201-216.
[21] DENG Y, ZHANG F, CHEN X. Collaborative attention network model for cross-modal retrieval[J]. Computer Science, 2020, 47(4): 54-59.
[22] CHEN T, LUO J. Expressing objects just like words: recurrent visual embedding for image-text matching[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI, 2020: 10583-10590.
[23] ZHANG J, HE X, QING L, et al. Cross-modal multi-relationship aware reasoning for image-text matching[J]. Multimedia Tools and Applications, 2022, 81: 12005-12027.
[24] ZHANG Q, LEI Z, ZHANG Z, et al. Context-aware attention network for image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE, 2020: 3536-3545.
[25] QU L, LIU M, CAO D, et al. Context-aware multi-view summarization network for image-text matching[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020: 1047-1055.
[26] XU X, WANG T, YANG Y, et al. Cross-modal attention with semantic consistence for image-text matching[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(12): 5412-5425.
[27] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM, 2017: 6000-6010.
[28] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[EB/OL]. (2019-05-24)[2024-01-31]. https://arxiv.org/abs/1810.04805.
[29] SUN C, MYERS A, VONDRICK C, et al. VideoBERT: a joint model for video and language representation learning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Seoul, South Korea: IEEE, 2019: 7464-7473.
[30] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning. Vienna, Austria: ICML, 2021: 8748-8763.
[31] LIU Y, XIONG P, XU L, et al. TS2-Net: token shift and selection Transformer for text-video retrieval[C]//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer, 2022: 319-335.
[32] YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78.
[33] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 740-755.
[34] WEN K, GU X, CHENG Q. Learning dual semantic relations with graph attention for image-text matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(7): 2866-2879.
[35] DIAO H, ZHANG Y, MA L, et al. Similarity reasoning and filtration for image-text matching[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2021: 1218-1226.
[36] ZENG P, GAO L, LYU X, et al. Conceptual and syntactical cross-modal alignment with cross-level consistency for image-text matching[C]//Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM, 2021: 2205-2213.
[37] LONG S, HAN S C, WAN X, et al. GraDual: graph-based dual-modal representation for image-text matching[C]//Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE, 2022: 3459-3468.
[38] ZHAO G, ZHANG C, SHANG H, et al. Generative label fused network for image-text matching[J]. Knowledge-Based Systems, 2023, 263: 110280.
[39] PAN Z, WU F, ZHANG B. Fine-grained image-text matching by cross-modal hard aligning network[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE, 2023: 19275-19284.
[40] DIAO H, ZHANG Y, LIU W, et al. Plug-and-play regulators for image-text matching[J]. IEEE Transactions on Image Processing, 2023, 32: 2322-2334.
Related Articles
[1] ZHU Changming, YUE Wen, WANG Panhong, SHEN Zhenyu, ZHOU Rigui. Global and local multi-view multi-label learning with active three-way clustering[J]. Journal of Shandong University (Engineering Science), 2021, 51(2): 34-46.
[2] ZHANG Peirui, YANG Yan, XING Huanlai, YU Xiuying. Incremental multi-view clustering algorithm based on kernel K-means[J]. Journal of Shandong University (Engineering Science), 2018, 48(3): 48-53.
[3] GUO Chao, YANG Yan, JIANG Yongquan, SONG Yi. Condition recognition of high-speed train based on multi-view classification ensemble[J]. Journal of Shandong University (Engineering Science), 2017, 47(1): 7-14.
[4] GAO Shuang¹,², ZHANG Hua-xiang¹,²*, FANG Xiao-nan¹,². Independent component analysis and co-training based Web spam detection[J]. Journal of Shandong University (Engineering Science), 2013, 43(2): 29-34.