
Journal of Shandong University (Engineering Science) ›› 2025, Vol. 55 ›› Issue (1): 58-65. doi: 10.6040/j.issn.1672-3961.0.2023.172

• Machine Learning and Data Mining •

Video moment location method based on cross-modal attention hashing

TAN Zhifang1, DONG Fei2, LU Pengyu1, PAN Jianan1, NIE Xiushan1*, YIN Yilong3

  1. College of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
  2. School of Journalism and Communication, Shandong Normal University, Jinan 250014, Shandong, China;
  3. College of Software, Shandong University, Jinan 250100, Shandong, China
  • Published: 2025-02-20
  • About the authors: TAN Zhifang (1997— ), male, born in Weifang, Shandong, is a master's student whose research focuses on image processing in computer vision. E-mail: 826133130@qq.com. *Corresponding author: NIE Xiushan (1981— ), male, born in Xuzhou, Jiangsu, Ph.D., is a professor and doctoral supervisor whose research focuses on computer vision. E-mail: niexiushan@163.com
  • Supported by:
    the National Natural Science Foundation of China (62176141, 62102235), the Taishan Scholars Program of Shandong Province (tsqn202103088), and the Natural Science Foundation of Shandong Province (ZR2020QF029)

Abstract: To improve the accuracy and retrieval efficiency of video moment localization, a video moment location method based on cross-modal attention hashing was proposed. The query sentence and the raw video features were converted into compact binary hash codes by a hash learning model; a soft attention module weighted the key words in the query sentence, and the video hash codes and query hash codes were fed into an enhanced cross-modal attention model to mine the semantic relations between vision and language; a score prediction and location prediction network was then designed to localize the starting timestamp of the queried moment. Experiments on two public datasets showed that the proposed method improved retrieval efficiency by about seven times.
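
The abstract describes a four-stage pipeline: hash video-clip and query-word features into binary codes, weight the key query words with soft attention, fuse the two code streams with cross-modal attention, and predict a matching score and temporal location for each clip. The PyTorch sketch below is only an illustrative reading of that description, not the authors' implementation; the module names, feature dimensions (I3D-style clip features, GloVe-style word vectors), code length, and the straight-through sign estimator are all assumptions.

# Minimal, hypothetical PyTorch sketch of the pipeline outlined in the abstract:
# (1) hash clip and word features into binary codes, (2) soft attention over
# query words, (3) cross-modal attention between the two code streams,
# (4) score and location heads. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignSTE(torch.autograd.Function):
    """sign() with a straight-through gradient so hashing stays trainable."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass gradients through unchanged


class CrossModalAttentionHashing(nn.Module):
    def __init__(self, video_dim=1024, word_dim=300, code_bits=256):
        super().__init__()
        self.video_hash = nn.Linear(video_dim, code_bits)  # video clips -> hash logits
        self.query_hash = nn.Linear(word_dim, code_bits)   # query words -> hash logits
        self.word_attn = nn.Linear(code_bits, 1)            # soft attention over words
        self.cross_attn = nn.MultiheadAttention(code_bits, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(code_bits, 1)           # per-clip matching score
        self.loc_head = nn.Linear(code_bits, 2)             # start/end offsets per clip

    def forward(self, clip_feats, word_feats):
        # clip_feats: (B, T, video_dim); word_feats: (B, L, word_dim)
        v_code = SignSTE.apply(torch.tanh(self.video_hash(clip_feats)))  # (B, T, bits)
        q_code = SignSTE.apply(torch.tanh(self.query_hash(word_feats)))  # (B, L, bits)

        # Soft attention weights the key words in the query.
        w = F.softmax(self.word_attn(q_code), dim=1)         # (B, L, 1)
        q_weighted = q_code * w                               # emphasized word codes

        # Cross-modal attention: video codes attend to the weighted query codes.
        fused, _ = self.cross_attn(v_code, q_weighted, q_weighted)  # (B, T, bits)

        scores = self.score_head(fused).squeeze(-1)           # (B, T) clip relevance
        bounds = self.loc_head(fused)                          # (B, T, 2) boundary regression
        return scores, bounds


# Usage with random stand-ins for clip features and word embeddings.
model = CrossModalAttentionHashing()
scores, bounds = model(torch.randn(2, 64, 1024), torch.randn(2, 12, 300))
print(scores.shape, bounds.shape)  # torch.Size([2, 64]) torch.Size([2, 64, 2])

Because the learned codes are binary, candidate moments can be compared with the query by Hamming distance (XOR plus popcount) rather than floating-point similarity, which is the usual reason hashing-based retrieval is faster than dense-feature matching.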

Key words: visual understanding, video moment localization, multimodal retrieval, hash learning, cross-modal

CLC number: TP37
[1] HENDRICKS L A, WANG O, SHECHTMAN E, et al. Localizing moments in video with natural language[C] //Proceedings of the 2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 5804-5813.
[2] GAO J, SUN C, YANG Z, et al. TALL: temporal activity localization via language query[C] //Proceedings of the 2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 5277-5285.
[3] CHEN J, CHEN X, MA L, et al. Temporally grounding natural sentence in video[C] //Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, USA: ACL, 2018: 162-171.
[4] XU H, HE K, PLUMMER B A, et al. Multilevel language and vision integration for text-to-clip retrieval[C] //Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park, USA: AAAI, 2019: 9062-9069.
[5] ZHANG D, DAI X, WANG X, et al. MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment[C] //Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Long Beach, USA: IEEE, 2019: 1247-1257.
[6] WANG W, HUANG Y, WANG L. Language-driven temporal activity localization: a semantic matching reinforcement learning model[C] //Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Long Beach, USA: IEEE, 2019: 334-343.
[7] GHOSH S, AGARWAL A, PAREKH Z, et al. ExCL: extractive clip localization using natural language descriptions[EB/OL].(2019-04-04)[2023-11-12]. https://arxiv.org/pdf/1904.02755.
[8] SHOU Z, CHAN J, ZAREIAN A, et al. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos[C] //Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Honolulu, USA: IEEE, 2017: 1417-1426.
[9] ZENG R, GAN C, CHEN P, et al. Breaking winner-takes-all: iterative-winners-out networks for weakly supervised temporal action localization[J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808.
[10] SHOU Z, WANG D, CHANG S F. Temporal action localization in untrimmed videos via multi-stage CNNs[C] //Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Las Vegas, USA: IEEE, 2016: 1049-1058.
[11] XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C] //Proceedings of the 2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 5794-5803.
[12] LIN T, ZHAO X, SHOU Z. Single shot temporal action detection[C] //Proceedings of the 25th ACM International Conference on Multimedia. New York, USA: ACM, 2017: 988-996.
[13] WANG J, CHENG Y, FERIS R S. Walk and learn: facial attribute representation learning from egocentric video and contextual data[C] //Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Las Vegas, USA: IEEE, 2016: 2295-2304.
[14] MITHUN N C, LI J, METZE F, et al. Learning joint embedding with multimodal cues for cross-modal video-text retrieval[C] //Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. New York, USA: ACM, 2018: 19-27.
[15] YU Y, KIM J, KIM G. A joint sequence fusion model for video question answering and retrieval[C] //Proceedings of the European Conference on Computer Vision(ECCV). Piscataway, USA: IEEE, 2018: 471-487.
[16] CHEN S, ZHAO Y, JIN Q, et al. Fine-grained video-text retrieval with hierarchical graph reasoning[C] //Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Seattle, USA: IEEE, 2020: 10635-10644.
[17] GIONIS A, INDYK P, MOTWANI R. Similarity search in high dimensions via hashing[C] //Proceedings of the International Conference on Very Large Data Bases. New York, USA: Springer, 1999: 518-529.
[18] DATAR M, IMMORLICA N, INDYK P, et al. Locality-sensitive hashing scheme based on p-stable distributions[C] //Proceedings of the Twentieth Annual Symposium on Computational Geometry. New York, USA: ACM, 2004: 253-262.
[19] ANDONI A, INDYK P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[C] //Proceedings of the 2006 47th Annual IEEE Symposium on Foundations of Computer Science(FOCS'06). Berkeley, USA: IEEE, 2006: 459-468.
[20] KULIS B, GRAUMAN K. Kernelized locality-sensitive hashing for scalable image search[C] //Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 2130-2137.
[21] LUO W, LIU W, GAO S. A revisit of sparse coding based anomaly detection in stacked RNN framework[C] //Proceedings of the 2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 341-349.
[22] LIU L, SHAO L. Sequential compact code learning for unsupervised image hashing[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 27(12): 2526-2536.
[23] ZHU L, SHEN J, XIE L, et al. Unsupervised visual hashing with semantic assistant for content-based image retrieval[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 29(2): 472-486.
[24] ZHU L, HUANG Z, LI Z, et al. Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(11): 5264-5276.
[25] LI G, SHEN C, VAN DEN HENGEL A. Supervised hashing using graph cuts and boosted decision trees[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(11): 2317-2331.
[26] WANG Q, ZHANG Z, SI L. Ranking preserving hashing for fast similarity search[C] //Proceedings of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina: AAAI, 2015: 3911-3917.
[27] SHEN F, SHEN C, LIU W, et al. Supervised discrete hashing[C] //Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Boston, USA: IEEE, 2015: 37-45.
[28] LIU X, NIE X, ZENG W, et al. Fast discrete cross-modal hashing with regressing from semantic labels[C] //Proceedings of the 26th ACM International Conference on Multimedia. New York, USA: ACM, 2018: 1662-1669.
[29] GUI J, LI P. R2SDH: robust rotated supervised discrete hashing[C] //Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM, 2018: 1485-1493.
[30] WEISS Y, TORRALBA A, FERGUS R. Spectral hashing[C] //Proceedings of the 21st International Conference on Neural Information Processing Systems. Vancouver, Canada: ACM, 2008: 1753-1760.
[31] LIU Q, LIU G, LI L, et al. Reversed spectral hashing[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 29(6): 2441-2449.
[32] HU Z, PAN G, WANG Y, et al. Sparse principal component analysis via rotation and truncation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2015, 27(4): 875-890.
[33] GONG Y, LAZEBNIK S, GORDO A, et al. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(12): 2916-2929.
[34] GUI J, LIU T, SUN Z, et al. Fast supervised discrete hashing[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(2): 490-496.
[35] GUI J, LIU T, SUN Z, et al. Supervised discrete hashing with relaxation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 29(3): 608-617.
[36] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C] //Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP). Stroudsburg, USA: ACL, 2014: 1532-1543.
[37] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C] //Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Honolulu, USA: IEEE, 2017: 6299-6308.
[38] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL].(2014-12-11)[2023-11-12]. https://arxiv.org/pdf/1412.3555.
[39] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C] //Proceedings of the 2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 706-715.
[40] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C] //Proceedings of the 2015 IEEE International Conference on Computer Vision(ICCV). Santiago, Chile: IEEE, 2015: 4489-4497.
[41] TAN Z, DONG F, LIU X, et al. VMLH: efficient video moment location via hashing[J]. Electronics, 2023, 12(2): 420.
[42] GE R, GAO J, CHEN K, et al. MAC: mining activity concepts for language-based temporal localization[C] //Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision(WACV). Waikoloa, USA: IEEE, 2019: 245-253.
[43] YUAN Y, MEI T, ZHU W. To find where you talk: temporal sentence localization in video with attention based location regression[C] //Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park, USA: AAAI, 2019: 9159-9166.
[44] HU Y, LIU M, SU X, et al. Video moment localization via deep cross-modal hashing[J]. IEEE Transactions on Image Processing, 2021, 30: 4667-4677.