
山东大学学报 (工学版) ›› 2025, Vol. 55 ›› Issue (5): 110-119. doi: 10.6040/j.issn.1672-3961.0.2025.031

• Machine Learning and Data Mining • Previous article

Video anomaly detection method based on video caption augmentation and dual-stream feature fusion

ZHENG Xiao1, CHEN He2, ZHOU Dongao3*, GONG Yongshun1

  1. School of Software, Shandong University, Jinan 250101, Shandong, China;
  2. Shandong Jiaokong Technology Co., Ltd., Jinan 250022, Shandong, China;
  3. PLA Academy of Military Science, Beijing 100091, China
  • Published: 2025-10-17
  • About the authors: ZHENG Xiao (2000—), male, born in Dezhou, Shandong, is a master's student; his research interests include computer vision and anomaly detection. E-mail: xzheng@mail.sdu.edu.cn. *Corresponding author: ZHOU Dongao (1990—), male, born in Xinshao, Hunan, is an assistant research fellow with a PhD; his research interests include artificial intelligence and signal detection. E-mail: zhoudongao08@nudt.edu.cn
  • Supported by: the Shandong Provincial Fund for Excellent Young Scholars (Overseas) (2022HWYQ-044) and a Shandong Jiaokong Technology Co., Ltd. research project (1480024005)


Abstract: To address the limitations in semantic context utilization and spatio-temporal feature modeling in existing anomaly detection methods, a video anomaly detection method based on video caption augmentation and dual-stream feature fusion was proposed. Video captions were automatically extracted and encoded using the contrastive language-image pre-training (CLIP) model to serve as auxiliary semantic context information for anomaly detection. A spatio-temporal adaptive embedding module was introduced to capture subtle temporal variations and complex spatial structures within videos, enabling effective spatio-temporal feature fusion. A cross-modal alignment module was further designed to deeply integrate contextual semantic features with spatio-temporal visual features, allowing more accurate capture of joint spatio-temporal-semantic representations of anomalous events. Experimental results showed that the method achieved area under the curve (AUC) scores of 97.54% on the ShanghaiTech dataset and 90.54% on the CUHK Avenue dataset. The results confirmed the performance and robustness of the method across multiple public video anomaly detection datasets, providing an effective solution for this critical task.

Key words: video anomaly detection, video caption, spatio-temporal adaptive embedding, temporal Transformer, spatial Transformer
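The fusion step described in the abstract — caption-derived semantic features aligned with spatio-temporal visual features — can be sketched with scaled dot-product attention. This is an illustrative stand-in, not the authors' implementation: the feature dimension, token counts, and single-head attention below are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(semantic, visual):
    """Fuse caption-derived semantic features (queries) with
    spatio-temporal visual features (keys/values) using scaled
    dot-product attention -- a hypothetical stand-in for the
    paper's cross-modal alignment module."""
    d = visual.shape[-1]
    attn = softmax(semantic @ visual.T / np.sqrt(d))  # (Ns, Nv) attention weights
    return attn @ visual                              # (Ns, d) fused features

rng = np.random.default_rng(0)
semantic = rng.standard_normal((4, 64))   # 4 caption tokens (e.g. CLIP text features)
visual   = rng.standard_normal((16, 64))  # 16 spatio-temporal patch embeddings
fused = cross_modal_align(semantic, visual)
print(fused.shape)  # prints (4, 64)
```

Each fused row is a convex combination of the visual embeddings weighted by their similarity to one caption token, so semantically relevant regions dominate the joint representation.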

CLC number: TP391
[1] DENG H Q, ZHANG Z X, ZOU S H, et al. Bi-directional frame interpolation for unsupervised video anomaly detection[C] //2023 IEEE/CVF Winter Conference on Applications of Computer Vision(WACV). Waikoloa, USA: IEEE, 2023: 2633-2642.
[2] CHANG Y P, TU Z G, XIE W, et al. Video anomaly detection with spatio-temporal dissociation[J]. Pattern Recognition, 2022, 122: 108213.
[3] LYU Hao, YI Pengfei, LIU Rui, et al. Sequential multi-scale autoencoder for video anomaly detection[J]. Journal of Graphics, 2022, 43(2): 223-229. (in Chinese)
[4] DI MAURO M, GALATRO G, FORTINO G, et al. Supervised feature selection techniques in network intrusion detection: a critical review[J]. Engineering Applications of Artificial Intelligence, 2021, 101: 104216.
[5] WAN B Y, FANG Y M, XIA X, et al. Weakly supervised video anomaly detection via center-guided discriminative learning[C] //2020 IEEE International Conference on Multimedia and Expo(ICME). London, UK: IEEE, 2020: 1-6.
[6] SULTANI W, CHEN C, SHAH M. Real-world anomaly detection in surveillance videos[C] //2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 6479-6488.
[7] XU D, WU P, YUAN L. Video anomalous behaviour detection based on compressed-inflated attention module[J]. Journal of Computer Science and Electrical Engineering, 2024, 6(2): 2663-1946.
[8] ULLAH W, HUSSAIN T, ULLAH F U M, et al. TransCNN: hybrid CNN and Transformer mechanism for surveillance anomaly detection[J]. Engineering Applications of Artificial Intelligence, 2023, 123: 106173.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C] //Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: ACM, 2017: 6000-6010.
[10] HUSSAIN A, ULLAH W, KHAN N, et al. TDS-Net: Transformer enhanced dual-stream network for video anomaly detection[J]. Expert Systems with Applications, 2024, 256: 124846.
[11] HUANG Shaonian, WEN Peiran, QUAN Qi, et al. Future frame prediction based on multi-branch aggregation for lightweight video anomaly detection[J]. Journal of Graphics, 2023, 44(6): 1173-1182. (in Chinese)
[12] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C] //International Conference on Machine Learning. [S.l.] : PMLR, 2021: 8748-8763.
[13] LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C] //International Conference on Machine Learning. Baltimore, USA: PMLR, 2022: 12888-12900.
[14] PU Y J, WU X Y, YANG L L, et al. Learning prompt-enhanced context features for weakly-supervised video anomaly detection[J]. IEEE Transactions on Image Processing, 2024, 33: 4923-4936.
[15] WU P, ZHOU X, PANG G, et al. VadCLIP: adapting vision-language models for weakly supervised video anomaly detection[C] //Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI, 2024: 6074-6082.
[16] CHEN W L, MA K T, YEW Z J, et al. TEVAD: improved video anomaly detection with captions[C] //2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops(CVPRW). Vancouver, Canada: IEEE, 2023: 5549-5559.
[17] SHI Y Z, YAMASHITA T, HIRAKAWA T, et al. Caption-guided interpretable video anomaly detection based on memory similarity[J]. IEEE Access, 2024, 12: 63995-64005.
[18] HONG W Y, WANG W H, DING M, et al. CogVLM2: visual language models for image and video understanding[EB/OL].(2024-08-29)[2025-03-01]. https://arxiv.org/abs/2408.16500v1
[19] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C] //2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Honolulu, USA: IEEE, 2017: 4724-4733.
[20] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C] //2015 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Boston, USA: IEEE, 2015: 1-9.
[21] CHEN T S, SIAROHIN A, MENAPACE W, et al. Panda-70M: captioning 70M videos with multiple cross-modality teachers[C] //2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR). Seattle, USA: IEEE, 2024: 13320-13331.
[22] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL].(2021-06-03)[2025-03-01]. https://arxiv.org/abs/2010.11929v2
[23] WAN B Y, JIANG W H, FANG Y M, et al. Anomaly detection in video sequences: a benchmark and computational model[J]. IET Image Processing, 2021, 15(14): 3454-3465.
[24] KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL].(2017-01-30)[2025-03-01]. https://arxiv.org/abs/1412.6980v9
[25] LU C W, SHI J P, JIA J Y. Abnormal event detection at 150 FPS in MATLAB[C] //2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013: 2720-2727.
[26] LUO W X, LIU W, GAO S H. A revisit of sparse coding based anomaly detection in stacked RNN framework[C] //2017 IEEE International Conference on Computer Vision(ICCV). Venice, Italy: IEEE, 2017: 341-349.
[27] CHANG Y P, TU Z G, XIE W, et al. Video anomaly detection with spatio-temporal dissociation[J]. Pattern Recognition, 2022, 122: 108213.
[28] QIU S M, YE J F, ZHAO J C, et al. Video anomaly detection guided by clustering learning[J]. Pattern Recognition, 2024, 153: 110550.
[29] YANG Z Y, RADKE R J. Context-aware video anomaly detection in long-term datasets[C] //2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops(CVPRW). Seattle, USA: IEEE, 2024: 4002-4011.