Journal of Shandong University (Engineering Science), 2025, Vol. 55, Issue (5): 110-119. doi: 10.6040/j.issn.1672-3961.0.2025.031

• Machine Learning & Data Mining •

Video anomaly detection method based on video caption augmentation and dual-stream feature fusion

ZHENG Xiao1, CHEN He2, ZHOU Dongao3*, GONG Yongshun1   

  1. School of Software, Shandong University, Jinan 250101, Shandong, China;
    2. Shandong Jiaokong Technology Co., Ltd., Jinan 250022, Shandong, China;
    3. PLA Academy of Military Science, Beijing 100091, China
  • Published: 2025-10-17

Abstract: To address the limited use of semantic context and the weak spatio-temporal feature modeling in existing anomaly detection methods, a video anomaly detection method based on video caption augmentation and dual-stream feature fusion was proposed. Video captions were automatically extracted and encoded with the contrastive language-image pre-training (CLIP) model to serve as auxiliary semantic context for anomaly detection. A spatio-temporal adaptive embedding module was introduced to capture subtle temporal variations and complex spatial structures within videos, enabling effective spatio-temporal feature fusion. A cross-modal alignment module was further designed to integrate the contextual semantic features with the spatio-temporal visual features, so that joint spatio-temporal-semantic representations of anomalous events could be captured more accurately. Experimental results showed that the method achieved area under the curve (AUC) scores of 97.54% on the ShanghaiTech dataset and 90.54% on the CUHK Avenue dataset. These results supported the performance and robustness of the method on public video anomaly detection datasets.
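As an illustration of the AUC metric reported in the abstract, the following minimal Python sketch computes the area under the ROC curve from per-frame anomaly scores. It is not taken from the paper: the function name `roc_auc`, the frame-level granularity, and the toy scores are assumptions made for illustration, and the abstract does not state the exact evaluation protocol that was used.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise-ranking identity.

    scores: anomaly score per frame (higher means more anomalous).
    labels: 1 for anomalous frames, 0 for normal frames.
    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one anomalous frame")

    # AUC equals the fraction of (anomalous, normal) pairs in which the
    # anomalous frame receives the higher score; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: four frames, the last two are anomalous.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(roc_auc(toy_scores, toy_labels))  # 0.75
```

The pairwise loop is quadratic in the number of frames, which is acceptable for a sketch; on full datasets a sort-based or library implementation would normally be used instead.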

Key words: video anomaly detection, video caption, spatio-temporal adaptive embedding, temporal Transformer, spatial Transformer

CLC Number: TP391