Journal of Shandong University (Engineering Science) ›› 2024, Vol. 54 ›› Issue (3): 1-11. doi: 10.6040/j.issn.1672-3961.0.2023.109

• Machine Learning & Data Mining •    

A survey of micro-video scene classification

NIE Xiushan1, GONG Rui1, DONG Fei2, GUO Jie1*, MA Yuling1   

  1. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China;
    2. School of Journalism and Communication, Shandong Normal University, Jinan 250358, Shandong, China
  • Published: 2024-06-28

CLC Number: TP391

References
[1] OLIVA A, TORRALBA A. Modeling the shape of the scene: a holistic representation of the spatial envelope[J]. International Journal of Computer Vision, 2001, 42(3): 145-175.
[2] SUDDERTH E B, TORRALBA A, FREEMAN W T, et al. Learning hierarchical models of scenes, objects, and parts[C] //Tenth IEEE International Conference on Computer Vision(ICCV'05): Volume 1. Piscataway, USA: IEEE, 2005: 1331-1338.
[3] ZUO Zhen, WANG Gang, SHUAI Bing, et al. Exemplar based deep discriminative and shareable feature learning for scene image classification[J]. Pattern Recognition, 2015, 48(10): 3004-3015.
[4] SINGH V, GIRISH D, RALESCU A L. Image understanding-a brief review of scene classification and recognition[J]. MAICS, 2017: 85-91.
[5] XIAO J, HAYS J, EHINGER K A, et al. SUN database: large-scale scene recognition from abbey to zoo[C] // Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2010.
[6] OLIVA A, TORRALBA A. Modeling the shape of the scene: a holistic representation of the spatial envelope[J]. International Journal of Computer Vision, 2001, 42(3): 145-175.
[7] OLIVA A, TORRALBA A. Building the gist of a scene: the role of global image features in recognition[J]. Progress in Brain Research, 2006, 155: 23-36.
[8] BROWN M, SÜSSTRUNK S. Multi-spectral SIFT for scene category recognition[C] //IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2011: 177-184.
[9] BAY H, ESS A, TUYTELAARS T, et al. Speeded-up robust features(SURF)[J]. Computer Vision and Image Understanding, 2008, 110(3): 346-359.
[10] DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C] //2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR'05). Piscataway, USA: IEEE, 2005, 1: 886-893.
[11] WU J, REHG J M. Centrist: a visual descriptor for scene categorization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 33(8): 1489-1501.
[12] ZABIH R, WOODFILL J. Non-parametric local transforms for computing visual correspondence[C] //Computer Vision: ECCV'94: Third European Conference on Computer Vision, Stockholm, Sweden: Volume II. Berlin, Germany: Springer, 1994: 151-158.
[13] FEICHTENHOFER C, PINZ A, WILDES R P. Space-time forests with complementary features for dynamic scene recognition[C] //British Machine Vision Conference. Berlin, Germany: Springer, 2013: 6.
[14] GANGOPADHYAY A, TRIPATHI S M, JINDAL I, et al. Dynamic scene classification using convolutional neural networks[C] //2016 IEEE Global Conference on Signal and Information Processing(GlobalSIP). Piscataway, USA: IEEE, 2016: 1255-1259.
[15] DORETTO G, CHIUSO A, WU Y N, et al. Dynamic textures[J]. International Journal of Computer Vision, 2003, 51: 91-109.
[16] SHROFF N, TURAGA P, CHELLAPPA R. Moving vistas: exploiting motion for describing scenes[C] // IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2010: 1911-1918.
[17] MARSZALEK M, LAPTEV I, SCHMID C. Actions in context[C] //IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2009: 2929-2936.
[18] VASUDEVAN A B, MURALIDHARAN S, CHINTAPALLI S P, et al. Dynamic scene classification using spatial and temporal cues[C] //Proceedings of the IEEE International Conference on Computer Vision Workshops. Piscataway, USA: IEEE, 2013: 803-810.
[19] FEICHTENHOFER C, PINZ A, WILDES R P. Dynamic scene recognition with complementary spatiotemporal features[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(12): 2389-2401.
[20] FEICHTENHOFER C, PINZ A, WILDES R P. Bags of spacetime energies for dynamic scene recognition[C] // IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2014: 2681-2688.
[21] DERPANIS K G, LECCE M, DANIILIDIS K, et al. Dynamic scene understanding: the role of orientation features in space and time in scene classification[C] // IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2012: 1306-1313.
[22] DU Liang, LING Haibin. Dynamic scene classification using redundant spatial scenelets[J]. IEEE Transactions on Cybernetics, 2015, 46(9): 2156-2165.
[23] THERIAULT C, THOME N, CORD M. Dynamic scene classification: learning motion descriptors with slow features analysis[C] //IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2013: 2603-2610.
[24] WISKOTT L, SEJNOWSKI T J. Slow feature analysis: unsupervised learning of invariances[J]. Neural Computation, 2002, 14(4): 715-770.
[25] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C] //Proceedings of the IEEE International Conference on Computer Vision. Piscataway, USA: IEEE, 2015: 4489-4497.
[26] HUANG Yuanjun, CAO Xianbin, WANG Qi, et al. Long-short-term features for dynamic scene classification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 29(4): 1038-1047.
[27] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25(2): 1097-1105.
[28] ZHANG Jianglong, NIE Liqiang, WANG Xiang, et al. Shorter-is-better: venue category estimation from micro-video[C] //Proceedings of the 24th ACM International Conference on Multimedia. New York, USA: ACM, 2016: 1415-1424.
[29] NIE Liqiang, WANG Xiang, ZHANG Jianglong, et al. Enhancing micro-video understanding by harnessing external sounds[C] //Proceedings of the 25th ACM International Conference on Multimedia. New York, USA: ACM, 2017: 1192-1200.
[30] GRAVES A. Long short-term memory[J]. Supervised Sequence Labelling with Recurrent Neural Networks, 2012, 385: 37-45.
[31] LIPTON Z C, BERKOWITZ J, ELKAN C. A critical review of recurrent neural networks for sequence learning[EB/OL].(2015-10-17)[2023-05-18]. https://arxiv.org/abs/1506.00019.
[32] ZHOU B, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1452-1464.
[33] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL].(2015-05-10)[2023-05-18]. https://arxiv.org/abs/1409.1556.
[34] GUO Jie, NIE Xiushan, CUI Chaoran, et al. Getting more from one attractive scene: venue retrieval in micro-videos[C] //Advances in Multimedia Information Processing: PCM 2018: 19th Pacific-Rim Conference on Multimedia. Berlin, Germany: Springer, 2018: 721-733.
[35] GUO Jie, NIE Xiushan, JIAN Muwei, et al. Binary feature representation learning for scene retrieval in micro-video[J]. Multimedia Tools and Applications, 2019, 78: 24539-24552.
[36] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C] // IEEE Conference on Computer Vision & Pattern Recognition. Piscataway, USA: IEEE, 2016: 770-778.
[37] WEI Yinwei, WANG Xiang, GUAN Weili, et al. Neural multimodal cooperative learning toward micro-video understanding[J]. IEEE Transactions on Image Processing, 2019, 29: 1-14.
[38] WANG Bing, HUANG Xianglin, CAO Gang, et al. Hybrid-attention and frame difference enhanced network for micro-video venue recognition[J]. Journal of Intelligent & Fuzzy Systems, 2022, 43(3): 3337-3353.
[39] WANG Bing, HUANG Xianglin, CAO Gang, et al. Attention-enhanced and trusted multimodal learning for micro-video venue recognition[J]. Computers and Electrical Engineering, 2022, 102: 108127.
[40] EL-NOUBY A, IZACARD G, TOUVRON H, et al. Are large-scale datasets necessary for self-supervised pre-training?[EB/OL].(2021-12-20)[2023-05-18]. https://arxiv.org/abs/2112.10740.
[41] VINCENT P, LAROCHELLE H, BENGIO Y, et al. Extracting and composing robust features with denoising autoencoders[C] //Proceedings of the 25th International Conference on Machine Learning. New York, USA: ACM, 2008: 1096-1103.
[42] KIROS R, ZHU Y, SALAKHUTDINOV R R, et al. Skip-thought vectors[J]. Advances in Neural Information Processing Systems, 2015, 28: 1-9.
[43] ARORA S, LIANG Y, MA T. A simple but tough-to-beat baseline for sentence embeddings[C] //International Conference on Learning Representations. New York, USA: ICML, 2017: 1-16.
[44] RONG X. Word2vec parameter learning explained[EB/OL].(2016-06-05)[2023-05-18]. https://arxiv.org/abs/1411.2738.
[45] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C] //International Conference on Machine Learning. New York, USA: ACM, 2014: 1188-1196.
[46] GUO Jie, NIE Xiushan, MA Yuling, et al. Attention based consistent semantic learning for micro-video scene recognition[J]. Information Sciences, 2021, 543: 504-516.
[47] FAN Weiquan, HE Zhiwei, XING Xiaofen, et al. Multi-modality depression detection via multi-scale temporal dilated CNNs[C] //Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. New York, USA: ACM, 2019: 73-80.
[48] YIN Shi, LIANG Cong, DING Heyan, et al. A multi-modal hierarchical recurrent neural network for depression detection[C] //Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. New York, USA: ACM, 2019: 65-71.
[49] RAY A, KUMAR S, REDDY R, et al. Multi-level attention network using text, audio and video for depression prediction[C] //Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. New York, USA: ACM, 2019: 81-88.
[50] MENG Hongying, HUANG Di, WANG Heng, et al. Depression recognition based on dynamic facial and vocal expression features using partial least square regression[C] //Proceedings of the 3rd ACM International Workshop on Audio/visual Emotion Challenge. New York, USA: ACM, 2013: 21-30.
[51] SAMAREH A, JIN Y, WANG Z, et al. Detect depression from communication: how computer vision, signal processing, and sentiment analysis join forces[J]. IISE Transactions on Healthcare Systems Engineering, 2018, 8(3): 196-208.
[52] NIE Weizhi, YAN Yan, SONG Dan, et al. Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition[J]. Multimedia Tools and Applications, 2021, 80: 16205-16214.
[53] VERMA S, WANG J, GE Z, et al. Deep-HOSeq: deep higher order sequence fusion for multimodal sentiment analysis[C] //2020 IEEE International Conference on Data Mining(ICDM). Piscataway, USA: IEEE, 2020: 561-570.
[54] LIU Meng, NIE Liqiang, WANG Meng, et al. Towards micro-video understanding by joint sequential-sparse modeling[C] //Proceedings of the 25th ACM International Conference on Multimedia. New York, USA: ACM, 2017: 970-978.
[55] LIU Meng, NIE Liqiang, WANG Xiang, et al. Online data organizer: micro-video categorization by structure-guided multimodal dictionary learning[J]. IEEE Transactions on Image Processing, 2018, 28(3): 1235-1247.
[56] LIU Wei, HUANG Xianglin, CAO Gang, et al. Joint learning of LSTMs-CNN and prototype for micro-video venue classification[C] //Advances in Multimedia Information Processing: PCM 2018: 19th Pacific-Rim Conference on Multimedia. Berlin, Germany: Springer, 2018: 705-715.
[57] LIU Wei, HUANG Xianglin, CAO Gang, et al. Joint learning of NeXtVLAD, CNN and context gating for micro-video venue classification[J]. IEEE Access, 2019, 7: 77091-77099.
[58] LIU Wei, HUANG Xianglin, CAO Gang, et al. Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification[J]. Multimedia Tools and Applications, 2020, 79(9/10): 6709-6726.
[59] LI Xin, GUO Yuhong. Multi-level adaptive active learning for scene classification[C] //European Conference on Computer Vision. Berlin, Germany: Springer, 2014: 234-249.
[60] GUO Jie, NIE Xiushan, YIN Yilong. Mutual complementarity: multi-modal enhancement semantic learning for micro-video scene recognition[J]. IEEE Access, 2020, 8: 29518-29524.
[61] LU Wei, LI Desheng, NIE Liqiang, et al. Learning dual low-rank representation for multi-label micro-video classification[J]. IEEE Transactions on Multimedia, 2023, 25: 77-89.
[62] LU Wei, LIN Jiaxin, JING Peiguang, et al. A multimodal aggregation network with serial self-attention mechanism for micro-video multi-label classification[J]. IEEE Signal Processing Letters, 2023, 30: 60-64.
[63] ABU-EL-HAIJA S, KOTHARI N, LEE J, et al. YouTube-8M: a large-scale video classification benchmark[EB/OL].(2016-09-27)[2023-05-18]. https://arxiv.org/abs/1609.08675.