Video anomaly detection method based on video caption augmentation and dual-stream feature fusion

doi:10.6040/j.issn.1672-3961.0.2025.031

Abstract

To address the limitations of existing anomaly detection methods in exploiting semantic context and modeling spatio-temporal features, a video anomaly detection method based on video caption augmentation and dual-stream feature fusion is proposed. Video captions are extracted automatically and encoded with the contrastive language-image pre-training (CLIP) model, and the resulting embeddings serve as contextual semantic features that assist anomaly detection. A spatio-temporal adaptive embedding module captures subtle temporal variations and complex spatial structures in the video separately and then fuses the two. A cross-modal alignment module fuses the contextual semantic features with the spatio-temporal visual features, so that the joint spatio-temporal-semantic characteristics of anomalous events are captured more accurately. In experiments, the method reaches an area under the curve (AUC) of 97.54% on the ShanghaiTech dataset and 90.54% on the CUHK Avenue dataset, indicating strong performance and robustness on these public video anomaly detection benchmarks and offering an effective solution for video anomaly detection.

Key words: video anomaly detection, video caption, spatio-temporal adaptive embedding, temporal Transformer, spatial Transformer
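
The abstract reports results as AUC. As a reading aid, the following is a minimal sketch of how an area under the ROC curve can be computed from per-frame anomaly scores, assuming frame-level scoring with binary ground-truth labels (1 = anomalous, 0 = normal). The function name `roc_auc` and the toy scores are hypothetical and are not taken from the paper; this is not the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    AUC equals the probability that a randomly chosen anomalous frame
    receives a higher score than a randomly chosen normal frame,
    with ties counted as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalous frames
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal frames
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal frame")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): one anomalous frame is ranked
# below one normal frame, so 3 of the 4 pairs are ordered correctly.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of frames, which is fine for illustration; a rank-based implementation would be preferable for full test sets.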

CLC number: TP391

ZHENG Xiao, CHEN He, ZHOU Dongao, GONG Yongshun. Video anomaly detection method based on video caption augmentation and dual-stream feature fusion[J]. Journal of Shandong University (Engineering Science), 2025, 55(5): 110-119.

