A survey of micro-video scene classification

doi:10.6040/j.issn.1672-3961.0.2023.109

Abstract

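Traditional video scene classification methods typically extract features that characterize image scenes from the visual modality and combine them with supervised learning methods such as support vector machines to classify certain categories of scenes. With the rapid emergence of micro-videos (short videos) across major platforms, scene feature representations tailored to the characteristics of micro-videos have attracted growing attention from researchers. Because micro-video data suffer from noise, missing data, and inconsistent semantic strength across modalities, traditional video scene representation methods cannot learn semantically rich micro-video scene representations. In recent years, some studies on micro-video scene classification have taken these challenges into account and proposed corresponding methods. This paper surveys the current state of research on micro-video scene classification, introduces micro-video scene feature representation and classification methods, and analyzes scene classification methods on different datasets. Based on the problems of existing methods, it also analyzes the challenging issues that future work on micro-video scene classification needs to resolve.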

Keywords: video scene, feature representation, micro-video scene classification, multimodal fusion, deep learning

CLC number: TP391

NIE Xiushan, GONG Rui, DONG Fei, GUO Jie, MA Yuling. A survey of micro-video scene classification[J]. Journal of Shandong University (Engineering Science), 2024, 54(3): 1-11.

