Journal of Shandong University (Engineering Science) ›› 2024, Vol. 54 ›› Issue (3): 1-11. doi: 10.6040/j.issn.1672-3961.0.2023.109
• Machine Learning and Data Mining •
NIE Xiushan1, GONG Rui1, DONG Fei2, GUO Jie1*, MA Yuling1
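Abstract: Traditional video scene classification methods typically extract features that characterize image scenes from the visual modality and combine them with supervised learning methods, such as support vector machines, to classify scenes of certain categories. With the rapid emergence of short videos on major platforms, scene feature representation based on the characteristics of short videos has attracted growing attention from researchers. Because short-video data suffer from noise, missing data, and inconsistent semantic strength across modalities, traditional video scene representation methods cannot learn semantically rich short-video scene representations. In recent years, some studies on short-video scene classification have taken these challenges into account and proposed corresponding methods. This paper surveys the state of research on short-video scene classification, introduces short-video scene feature representation and classification methods, and analyzes scene classification methods on different datasets. In view of the problems of existing methods, it analyzes the challenging problems that remain to be solved in future short-video scene classification.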