符号序列的LDA主题特征表示方法

doi:10.6040/j.issn.1672-3961.0.2019.760

山东大学学报 (工学版) ›› 2020, Vol. 50 ›› Issue (2): 60-65.doi: 10.6040/j.issn.1672-3961.0.2019.760

符号序列的LDA主题特征表示方法

冯超^1,²(),徐鲲鹏^1,²,陈黎飞^1,^2,*()

1. 福建师范大学数学与信息学院，福建福州 350117
2. 数字福建环境监测物联网实验室，福建福州 350117

收稿日期:2019-12-18 出版日期:2020-04-20 发布日期:2020-04-16
通讯作者: 陈黎飞 E-mail:fc_fight2017@163.com;clfei@fjnu.edu.cn
作者简介:冯超(1994—)，男，安徽六安人，硕士研究生，主要研究方向为数据挖掘. E-mail：fc_fight2017@163.com
基金资助:
国家自然科学基金资助项目(61672157);国家自然科学基金资助项目(U1805263);福建师范大学创新团队资助项目(IRTL1704)

LDA-based topic feature representation method for symbolic sequences

Chao FENG^1,²(),Kunpeng XU^1,²,Lifei CHEN^1,^2,*()

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, Fujian, China
2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fuzhou 350117, Fujian, China

Received:2019-12-18 Online:2020-04-20 Published:2020-04-16
Contact: Lifei CHEN E-mail:fc_fight2017@163.com;clfei@fjnu.edu.cn
Supported by:
国家自然科学基金资助项目(61672157);国家自然科学基金资助项目(U1805263);福建师范大学创新团队资助项目(IRTL1704)

摘要/Abstract

摘要：

针对现有序列挖掘算法特征维度高、学习算法时间复杂度高等方面的不足，提出一种主题特征表示法，将符号序列转换为一组表示多个主题呈现度的概率向量。基于文本挖掘中常用的隐含狄利克雷分配(latent Dirichlet allocation, LDA)主题模型，视短序列元组为序列的浅层特征(词)，利用LDA模型学习算法提取主题及其概率分布，作为序列的深层特征。在6个实际序列数据集上进行试验，并与基于元组、Markov模型的现有方法作对比，结果表明，新方法在降低特征维度的同时提高了表示模型的学习效率，在符号序列分类应用中可以取得较理想的分类精度。

关键词: 特征表示, 符号序列, 隐含狄利克雷分配, 主题, 分类

Abstract:

To address the problems of high feature dimensionality and high algorithm time complexity in the existing methods, a topic feature representation method was proposed to transform the symbolic sequences into a set of topic probability vectors, based on the topic model latent Dirichlet allocation (LDA) commonly used in text mining. In the new method, each short sequence gram was considered as the shallow feature (word) of the sequence, and the topics with their probability distributions were extracted as the deep features of the sequences using the LDA model learning algorithm.Experiments were carried out on six real-world sequence sets, and compared with the existing grams-based and Markov model-based methods. The results showed that the new method improved the learning efficiency of the representation model while reducing the feature dimensionality, and achieved better accuracy in the application of symbolic sequence classification.

Key words: feature representation, symbolic sequences, latent Dirichlet allocation, topics, classification

中图分类号:

TP311

冯超,徐鲲鹏,陈黎飞. 符号序列的LDA主题特征表示方法[J]. 山东大学学报 (工学版), 2020, 50(2): 60-65.

Chao FENG,Kunpeng XU,Lifei CHEN. LDA-based topic feature representation method for symbolic sequences[J]. Journal of Shandong University(Engineering Science), 2020, 50(2): 60-65.

图/表 4

图1

表1

图2

表2

参考文献 26

1	DONG G , PEI J . Sequence data mining[M]. Berlin: Springer, 2007: Ⅶ- 69.
2	BENGIO Y , COURVILLE A , VINCENT P . Representation learning: a review and new perspectives[J]. IEEE Transactions on Pattern Analysis and Machine Intellig-ence, 2013, 35 (8): 1798- 1828. doi: 10.1109/TPAMI.2013.50
3	COVER T M , HART P E . Nearest neighborpatternclassification[J]. IEEE Transactions on Information Theory, 1967, 13 (1): 21- 27.
4	XING Z , PEI J , KEOGH E J . A brief survey on sequence classification[J]. ACM SIGKDD Explorations Newsletter, 2010, 12 (1): 40- 48. doi: 10.1145/1882471.1882478
5	BLASIAK S, RANGWALA H. A hidden Markov model variant for sequence classification[C]//Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain: IJCAI, 2011: 1192-1197.
6	郭彦明.基于隐马尔可夫模型的DNA序列分类研究[D].福州:福建师范大学, 2015.
	GUO Yanming. A study of DNA sequence classification based on hidden Markov model[D]. Fuzhou: Fujian Normal University, 2015.
7	GUO G , CHEN L , YE Y , et al. Cluster validation method for determining the number of clusters in categorical sequences[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (12): 2936- 2948. doi: 10.1109/TNNLS.2016.2608354
8	YUAN L , WANG W , CHEN L . Two-stage pruning method for gram-based categorical sequence clustering[J]. International Journal of Machine Learning and Cybernetics, 2019, 10 (4): 631- 640. doi: 10.1007/s13042-017-0744-y
9	GERS F A , SCHMIDHUBER J , Cummins F . Learning toforget: continual prediction with LSTM[J]. Neural Computation, 2000, 12 (10): 2451- 2471. doi: 10.1162/089976600300015015
10	GREFF K , SRIVASTAVA R K , KOUTNIK J , et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (10): 2222- 2232. doi: 10.1109/TNNLS.2016.2582924
11	GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada: IEEE, 2013: 6645-6649.
12	TANG D, QIN B, LIU T. Document modeling with gated recurrent neural network for sentiment classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: ACL, 2015: 1422-1432.
13	JEBARA T , KONDOR R I , HOWARD A . Probability product kernels[J]. Journal of Machine Learning Research, 2004, 5 (5): 819- 844.
14	SALTON G . A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18 (11): 613- 620. doi: 10.1145/361219.361220
15	XIONG T, WANG S, JIANG Q. A new Markov Model for clustering categorical sequences[C]//Proceedings of the 11^th IEEE International Conference on Data Mining. Vancouver, Canada: IEEE Computer Society, 2011: 854-863.
16	程铃钫, 郭躬德, 陈黎飞. 符号序列多阶Markov分类[J]. 计算机应用, 2017, 37 (7): 1977- 1982.
	CHENG Lingfang , GUO Gongde , CHEN Lifei . Classification of symbolic sequences with multi-order Markov Model[J]. Journal of Computer Applications, 2017, 37 (7): 1977- 1982.
17	郭彦明, 陈黎飞, 郭躬德. 基于隐马尔科夫模型的DNA序列分类方法[J]. 计算机系统应用, 2014, 23 (7): 24- 30. doi: 10.3969/j.issn.1003-3254.2014.07.005
	GUO Yanming , CHEN Lifei , GUO Gongde . DNA sequence classification method based on Hidden Markov Model[J]. Computer Systems & Applications, 2014, 23 (7): 24- 30. doi: 10.3969/j.issn.1003-3254.2014.07.005
18	周玉元, 周铁军. DNA序列分类的Fisher判别法[J]. 湖南农业大学学报(自然科学版), 2003, 29 (5): 437- 440.
	ZHOU Yuyuan , ZHOU Tiejun . The Fisher criterion on classification of the DNA sequence[J]. Journal of Hunan Agricultural University (Natural Sciences), 2003, 29 (5): 437- 440.
19	DAI A M , LE Q V . Semi-supervised sequence learning[J]. Advances in Neural Information Processing Systems, 2015, 3079- 3087.
20	GRAVES A, JAITLY N, MOHAMED A R. Hybrid speech recognition with deep Bidirectional LSTM[C]//Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic: IEEE, 2013: 273-278.
21	DEERWESTER S , DUMAIS S T , LANDAUER T K , et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990, 41 (6): 391- 407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
22	THOMAS H. Probabilisticlatent semantic analysis[C]//Proceedings of the 15^th Conference on Uncertainty in Artificial Intelligence. Stockholm, Sweden: Morgan Kaufmann, 1999: 289-296.
23	BLEI D M , NG A Y , JORDAN M I . Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3, 993- 1022.
24	GRIFFITHS T L , STEYVERS M . Finding scientific topics[J]. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101 (1): 5228- 5235.
25	KELIL A, WANG S. SCS: a new similarity measure for categorical sequences[C]// Proceedings of the 8^th IEEE International Conference on Data Mining. Pisa, Italy: IEEE Computer Society, 2008: 343-352.
26	WEI D , JIANG Q , WEI Y , et al. A novel hierarchical clustering algorithm for gene sequences[J]. BMC Bioinformatics, 2012, 13 (1): 174- 186.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed

数据集	序列数M	类别数目C	平均序列长度	平均符号数目
GS₁	771	8	1 594	4
GS₂	281	6	1 318	4
GS₃	310	6	1 536	4
SS₁	50	5	1 899	15
SS₂	50	5	1 498	16
SS₃	50	5	925	15

数据集	SLDA	特征数目	MM-FR	特征数目	G-FR	特征数目
GS₁	0.946 8	35	0.911 8	16	0.980 5	81
GS₂	1.000 0	15	0.996 4	16	0.996 4	85
GS₃	0.967 7	35	0.922 6	16	0.996 8	72
SS₁	1.000 0	15	1.000 0	324	0.980 0	227 7
SS₂	1.000 0	5	1.000 0	400	1.000 0	252 9
SS₃	1.000 0	5	1.000 0	400	1.000 0	212 1

[1]	宋士奇,朴燕,蒋泽新. 基于改进YOLOv3的复杂场景车辆分类与跟踪[J]. 山东大学学报 (工学版), 2020, 50(2): 27-33.
[2]	李春阳,李楠,冯涛,王朱贺,马靖凯. 基于深度学习的洗衣机异常音检测[J]. 山东大学学报 (工学版), 2020, 50(2): 108-117.
[3]	陈馨菂,李天瑞,杨欢欢. 基于时间序列数据的交互式主题河流可视化[J]. 山东大学学报 (工学版), 2019, 49(4): 29-35, 43.
[4]	张红斌,邱蝶蝶,邬任重,朱涛,滑瑾,姬东鸿. 基于极端梯度提升树算法的图像属性标注[J]. 山东大学学报 (工学版), 2019, 49(2): 8-16.
[5]	高明霞,李经纬. 基于word2vec词模型的中文短文本分类方法[J]. 山东大学学报 (工学版), 2019, 49(2): 34-41.
[6]	侯霄雄,许新征,朱炯,郭燕燕. 基于AlexNet和集成分类器的乳腺癌计算机辅助诊断方法[J]. 山东大学学报 (工学版), 2019, 49(2): 74-79.
[7]	朱映雪,黄瑞章,马灿. 一种具有新主题偏向性的短文本动态聚类方法[J]. 山东大学学报 (工学版), 2018, 48(6): 8-18.
[8]	屈庆涛,刘其成,牟春晓. 基于N-Gram语言模型的并行自适应新闻话题追踪算法[J]. 山东大学学报 (工学版), 2018, 48(6): 37-43.
[9]	李尧,王志海,孙艳歌,张伟. 一种基于深度属性加权的数据流自适应集成分类算法[J]. 山东大学学报 (工学版), 2018, 48(6): 44-55, 66.
[10]	张璞,刘畅,王永. 基于特征融合和集成学习的建议语句分类模型[J]. 山东大学学报 (工学版), 2018, 48(5): 47-54.
[11]	王换,周忠眉. 一种基于聚类的过抽样算法[J]. 山东大学学报(工学版), 2018, 48(3): 134-139.
[12]	叶明全,高凌云,万春圆. 基于人工蜂群和SVM的基因表达数据分类[J]. 山东大学学报(工学版), 2018, 48(3): 10-16.
[13]	曹雅,邓赵红,王士同. 基于单调约束的径向基函数神经网络模型[J]. 山东大学学报(工学版), 2018, 48(3): 127-133.
[14]	龙柏,曾宪宇,李徵,刘淇. 电商商品嵌入表示分类方法[J]. 山东大学学报(工学版), 2018, 48(3): 17-24.
[15]	谢志峰,吴佳萍,马利庄. 基于卷积神经网络的中文财经新闻分类方法[J]. 山东大学学报(工学版), 2018, 48(3): 34-39.

符号序列的LDA主题特征表示方法

LDA-based topic feature representation method for symbolic sequences

RichHTML

PDF (PC)

摘要/Abstract

引用本文

使用本文

图/表 4

参考文献 26

相关文章 15

多维度评价

本文评价

推荐阅读 10