
Journal of Shandong University (Engineering Science) ›› 2025, Vol. 55 ›› Issue (1): 58-65. DOI: 10.6040/j.issn.1672-3961.0.2023.172

• Machine Learning and Data Mining •

Video moment location method based on cross-modal attention hashing

TAN Zhifang1, DONG Fei2, LU Pengyu1, PAN Jianan1, NIE Xiushan1*, YIN Yilong3

  1. College of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
  2. School of Journalism and Communication, Shandong Normal University, Jinan 250014, Shandong, China;
  3. College of Software, Shandong University, Jinan 250100, Shandong, China
  • Published: 2025-02-20
  • About the authors: TAN Zhifang (1997— ), male, from Weifang, Shandong, master's student; his research focuses on image processing in computer vision. E-mail: 826133130@qq.com. *Corresponding author: NIE Xiushan (1981— ), male, from Xuzhou, Jiangsu, Ph.D., professor and doctoral supervisor; his research focuses on computer vision. E-mail: niexiushan@163.com
  • Supported by: the National Natural Science Foundation of China (62176141, 62102235), the Taishan Scholars Program of Shandong Province (tsqn202103088), and the Natural Science Foundation of Shandong Province (ZR2020QF029)

Abstract: To improve the accuracy and retrieval efficiency of video moment localization, a video moment localization method based on cross-modal attention hashing was proposed. The query sentence and the raw video features were first converted into compact binary hash codes by a hash learning model. A soft attention module was then used to weight the key words in the query sentence, and the video hash codes and query hash codes were fed into an enhanced cross-modal attention model to mine the semantic relations between vision and language. Finally, a score prediction and location prediction network was designed to localize the starting timestamp of the queried moment. The proposed method was validated on two public datasets, and the results showed that it improved retrieval efficiency by about 7 times.
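The pipeline described in the abstract can be illustrated with a small sketch. The snippet below is a minimal PyTorch illustration, not the authors' released implementation: the feature dimensions (1024-D clip features, 300-D GloVe-style word vectors), the 256-bit code length, the tanh relaxation of the sign function, and all module names are assumptions introduced only to show how hash codes, soft word attention, cross-modal attention, and the score/boundary heads could fit together.

```python
# A minimal, hypothetical sketch of cross-modal attention hashing for moment localization.
# Dimensions, module names, and the tanh relaxation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttentionHashing(nn.Module):
    def __init__(self, video_dim=1024, word_dim=300, code_len=256):
        super().__init__()
        self.video_hash = nn.Linear(video_dim, code_len)   # video clip features -> hash logits
        self.query_hash = nn.Linear(word_dim, code_len)    # word embeddings -> hash logits
        self.word_attn = nn.Linear(word_dim, 1)            # soft attention weights over query words
        self.cross_attn = nn.MultiheadAttention(code_len, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(code_len, 1)           # per-clip matching score
        self.boundary_head = nn.Linear(code_len, 2)        # start/end offset regression

    def binarize(self, logits):
        # tanh relaxation at training time; sign() would give true ±1 codes at retrieval time
        return torch.tanh(logits)

    def forward(self, video_feats, word_feats):
        # video_feats: (B, T, video_dim); word_feats: (B, L, word_dim)
        v_code = self.binarize(self.video_hash(video_feats))          # (B, T, code_len)

        attn = F.softmax(self.word_attn(word_feats), dim=1)           # (B, L, 1) soft word weights
        q_code = self.binarize(self.query_hash(word_feats)) * attn    # weighted word codes
        q_code = q_code.sum(dim=1, keepdim=True)                      # (B, 1, code_len) sentence code

        # cross-modal attention: video codes attend to the query sentence code
        fused, _ = self.cross_attn(v_code, q_code, q_code)            # (B, T, code_len)

        scores = self.score_head(fused).squeeze(-1)                   # (B, T) clip scores
        bounds = self.boundary_head(fused)                            # (B, T, 2) boundary offsets
        return scores, bounds

# Example with random tensors standing in for I3D/C3D clip features and GloVe word vectors.
model = CrossModalAttentionHashing()
scores, bounds = model(torch.randn(2, 32, 1024), torch.randn(2, 12, 300))
print(scores.shape, bounds.shape)  # torch.Size([2, 32]) torch.Size([2, 32, 2])
```

Because both modalities are reduced to compact binary codes before matching, candidate moments can be compared by cheap Hamming-style similarity rather than dense feature similarity, which is the source of the retrieval-efficiency gain reported in the abstract.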

Key words: visual understanding, video moment localization, multimodal retrieval, hash learning, cross-modal

CLC Number: TP37