
Journal of Shandong University (Engineering Science) ›› 2025, Vol. 55 ›› Issue (1): 58-65. DOI: 10.6040/j.issn.1672-3961.0.2023.172

• Machine Learning and Data Mining •

Video moment location method based on cross-modal attention hashing

TAN Zhifang1, DONG Fei2, LU Pengyu1, PAN Jianan1, NIE Xiushan1*, YIN Yilong3

  1. College of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
  2. School of Journalism and Communication, Shandong Normal University, Jinan 250014, Shandong, China;
  3. College of Software, Shandong University, Jinan 250100, Shandong, China
  • Published: 2025-02-20
  • About the authors: TAN Zhifang (1997— ), male, from Weifang, Shandong, master's student; his research focuses on image processing in computer vision. E-mail: 826133130@qq.com. *Corresponding author: NIE Xiushan (1981— ), male, from Xuzhou, Jiangsu, Ph.D., professor and doctoral supervisor; his research focuses on computer vision. E-mail: niexiushan@163.com
  • Supported by: the National Natural Science Foundation of China (62176141, 62102235), the Taishan Scholars Program of Shandong Province (tsqn202103088), and the Natural Science Foundation of Shandong Province (ZR2020QF029)

Abstract: To improve the accuracy and retrieval efficiency of video moment localization, a video moment localization method based on cross-modal attention hashing was proposed. The query sentence and the raw video features were first converted into compact binary hash codes by a hash learning model. A soft attention module was then used to weight the key words in the query sentence, and the video hash codes and query hash codes were fed into an enhanced cross-modal attention model to mine the semantic relations between vision and language. Finally, a score prediction and location prediction network was designed to localize the starting timestamp of the queried moment. The proposed method was validated on two public datasets, and the results showed that it improved retrieval efficiency by about 7 times.
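The pipeline described in the abstract can be illustrated with a small sketch. The snippet below is a minimal PyTorch illustration, not the authors' released implementation: the feature dimensions (1024-D clip features, 300-D GloVe-style word vectors), the 256-bit code length, the tanh relaxation of the sign function, and all module names are assumptions introduced only to show how hash codes, soft word attention, cross-modal attention, and the score/boundary heads could fit together.

```python
# A minimal, hypothetical sketch of cross-modal attention hashing for moment localization.
# Dimensions, module names, and the tanh relaxation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionHashing(nn.Module):
    def __init__(self, video_dim=1024, word_dim=300, code_len=256):
        super().__init__()
        self.video_hash = nn.Linear(video_dim, code_len)   # video clip features -> hash logits
        self.query_hash = nn.Linear(word_dim, code_len)    # word embeddings -> hash logits
        self.word_attn = nn.Linear(word_dim, 1)            # soft attention weights over query words
        self.cross_attn = nn.MultiheadAttention(code_len, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(code_len, 1)           # per-clip matching score
        self.boundary_head = nn.Linear(code_len, 2)        # start/end offset regression

    def binarize(self, logits):
        # tanh relaxation at training time; sign() would give true ±1 codes at retrieval time
        return torch.tanh(logits)

    def forward(self, video_feats, word_feats):
        # video_feats: (B, T, video_dim); word_feats: (B, L, word_dim)
        v_code = self.binarize(self.video_hash(video_feats))          # (B, T, code_len)

        attn = F.softmax(self.word_attn(word_feats), dim=1)           # (B, L, 1) soft word weights
        q_code = self.binarize(self.query_hash(word_feats)) * attn    # weighted word codes
        q_code = q_code.sum(dim=1, keepdim=True)                      # (B, 1, code_len) sentence code

        # cross-modal attention: video codes attend to the query sentence code
        fused, _ = self.cross_attn(v_code, q_code, q_code)            # (B, T, code_len)

        scores = self.score_head(fused).squeeze(-1)                   # (B, T) clip scores
        bounds = self.boundary_head(fused)                            # (B, T, 2) boundary offsets
        return scores, bounds

# Example with random tensors standing in for I3D/C3D clip features and GloVe word vectors.
model = CrossModalAttentionHashing()
scores, bounds = model(torch.randn(2, 32, 1024), torch.randn(2, 12, 300))
print(scores.shape, bounds.shape)  # torch.Size([2, 32]) torch.Size([2, 32, 2])
```

Because both modalities are reduced to compact binary codes before matching, candidate moments can be compared by cheap Hamming-style similarity rather than dense feature similarity, which is the source of the retrieval-efficiency gain reported in the abstract.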

Key words: visual understanding, video moment localization, multimodal retrieval, hash learning, cross-modal

CLC Number: TP37