Journal of Shandong University (Engineering Science) ›› 2020, Vol. 50 ›› Issue (2): 60-65. doi: 10.6040/j.issn.1672-3961.0.2019.760

• Machine Learning & Data Mining •

LDA-based topic feature representation method for symbolic sequences

Chao FENG1,2, Kunpeng XU1,2, Lifei CHEN1,2,*

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, Fujian, China
  2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fuzhou 350117, Fujian, China
  • Received: 2019-12-18 Online: 2020-04-20 Published: 2020-04-16
  • Contact: Lifei CHEN E-mail: fc_fight2017@163.com; clfei@fjnu.edu.cn
  • Supported by:
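    National Natural Science Foundation of China (61672157); National Natural Science Foundation of China (U1805263); Fujian Normal University Innovation Team Project (IRTL1704)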

Abstract:

To address the high feature dimensionality and high time complexity of existing methods, a topic feature representation method was proposed that transforms symbolic sequences into a set of topic probability vectors, based on latent Dirichlet allocation (LDA), a topic model commonly used in text mining. In the new method, each short gram of a sequence was treated as a shallow feature (a word) of that sequence, and the topics, together with their probability distributions, were extracted as the deep features of the sequences using the LDA model learning algorithm. Experiments were carried out on six real-world sequence sets, and the method was compared with existing gram-based and Markov model-based methods. The results showed that the new method improved the learning efficiency of the representation model while reducing the feature dimensionality, and achieved better accuracy in symbolic sequence classification.
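The pipeline described in the abstract (grams as words, topic distributions as features) can be illustrated with a minimal sketch. It assumes overlapping 2-grams and a collapsed Gibbs sampler for plain LDA. The function names, hyperparameter values, and toy sequences are hypothetical and are not taken from the paper, whose SLDA learning algorithm may differ.

```python
import random


def extract_grams(sequence, n=2):
    """Return all overlapping n-grams of a symbolic sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def lda_topic_features(sequences, num_topics=2, n=2, alpha=0.1, beta=0.01,
                       iterations=200, seed=0):
    """Map each sequence to a topic probability vector (illustrative defaults)."""
    rng = random.Random(seed)
    docs = [extract_grams(s, n) for s in sequences]
    vocab = sorted({g for d in docs for g in d})
    index = {g: i for i, g in enumerate(vocab)}
    docs = [[index[g] for g in d] for d in docs]
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # gram count per (sequence, topic)
    topic_word = [[0] * V for _ in range(K)]   # gram count per (topic, gram)
    topic_total = [0] * K                      # total grams assigned to each topic
    assignments = []

    # Random initial topic assignment for every gram occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Collapsed Gibbs sweeps: resample each assignment given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-sequence topic proportions serve as the feature vectors.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + K * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(K)])
    return features


# Toy sequences (hypothetical data, not from the paper's datasets).
sequences = ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT",
             "TTGGCCTTGGCCTTGG", "TTGGCCTTGGCATTGG"]
features = lda_topic_features(sequences, num_topics=2)
for seq, vec in zip(sequences, features):
    print(seq, [round(p, 3) for p in vec])
```

Each sequence is reduced to a vector of length `num_topics`, regardless of how many distinct grams occur, which is where the dimensionality reduction relative to a raw gram-count representation would come from. The resulting vectors could then be passed to any standard classifier.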

Key words: feature representation, symbolic sequences, latent Dirichlet allocation, topics, classification

CLC Number: 

  • TP311

Fig.1

Schematic of the SLDA model

Table 1

Summarized parameters of the experimental datasets

Dataset  Number of sequences (M)  Number of classes (C)  Average sequence length  Average number of symbols
GS1      771                      8                      1594                     4
GS2      281                      6                      1318                     4
GS3      310                      6                      1536                     4
SS1      50                       5                      1899                     15
SS2      50                       5                      1498                     16
SS3      50                       5                      925                      15

Fig.2

Change of F1 with various numbers of SLDA topics

Table 2

Comparison of the number of extracted features and F1 of the classification results yielded by the different representation methods on the symbolic sequence sets

Dataset  SLDA F1  SLDA features  MM-FR F1  MM-FR features  G-FR F1  G-FR features
GS1      0.9468   35             0.9118    16              0.9805   81
GS2      1.0000   15             0.9964    16              0.9964   85
GS3      0.9677   35             0.9226    16              0.9968   72
SS1      1.0000   15             1.0000    324             0.9800   2277
SS2      1.0000   5              1.0000    400             1.0000   2529
SS3      1.0000   5              1.0000    400             1.0000   2121