您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报 (工学版) ›› 2020, Vol. 50 ›› Issue (2): 60-65.doi: 10.6040/j.issn.1672-3961.0.2019.760

• 机器学习与数据挖掘 • 上一篇    下一篇

符号序列的LDA主题特征表示方法

冯超1,2(),徐鲲鹏1,2,陈黎飞1,2,*()   

  1. 1. 福建师范大学数学与信息学院,福建 福州 350117
    2. 数字福建环境监测物联网实验室,福建 福州 350117
  • 收稿日期:2019-12-18 出版日期:2020-04-20 发布日期:2020-04-16
  • 通讯作者: 陈黎飞 E-mail:fc_fight2017@163.com;clfei@fjnu.edu.cn
  • 作者简介:冯超(1994—),男,安徽六安人,硕士研究生,主要研究方向为数据挖掘. E-mail:fc_fight2017@163.com
  • 基金资助:
    国家自然科学基金资助项目(61672157);国家自然科学基金资助项目(U1805263);福建师范大学创新团队资助项目(IRTL1704)

LDA-based topic feature representation method for symbolic sequences

Chao FENG1,2(),Kunpeng XU1,2,Lifei CHEN1,2,*()   

  1. 1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, Fujian, China
    2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fuzhou 350117, Fujian, China
  • Received:2019-12-18 Online:2020-04-20 Published:2020-04-16
  • Contact: Lifei CHEN E-mail:fc_fight2017@163.com;clfei@fjnu.edu.cn
  • Supported by:
    国家自然科学基金资助项目(61672157);国家自然科学基金资助项目(U1805263);福建师范大学创新团队资助项目(IRTL1704)

摘要:

针对现有序列挖掘算法特征维度高、学习算法时间复杂度高等方面的不足,提出一种主题特征表示法,将符号序列转换为一组表示多个主题呈现度的概率向量。基于文本挖掘中常用的隐含狄利克雷分配(latent Dirichlet allocation, LDA)主题模型,视短序列元组为序列的浅层特征(词),利用LDA模型学习算法提取主题及其概率分布,作为序列的深层特征。在6个实际序列数据集上进行试验,并与基于元组、Markov模型的现有方法作对比,结果表明,新方法在降低特征维度的同时提高了表示模型的学习效率,在符号序列分类应用中可以取得较理想的分类精度。

关键词: 特征表示, 符号序列, 隐含狄利克雷分配, 主题, 分类

Abstract:

To address the problems of high feature dimensionality and high algorithm time complexity in the existing methods, a topic feature representation method was proposed to transform the symbolic sequences into a set of topic probability vectors, based on the topic model latent Dirichlet allocation (LDA) commonly used in text mining. In the new method, each short sequence gram was considered as the shallow feature (word) of the sequence, and the topics with their probability distributions were extracted as the deep features of the sequences using the LDA model learning algorithm.Experiments were carried out on six real-world sequence sets, and compared with the existing grams-based and Markov model-based methods. The results showed that the new method improved the learning efficiency of the representation model while reducing the feature dimensionality, and achieved better accuracy in the application of symbolic sequence classification.

Key words: feature representation, symbolic sequences, latent Dirichlet allocation, topics, classification

中图分类号: 

  • TP311

图1

SLDA模型示意图"

表1

试验数据集的参数汇总表"

数据集 序列数M 类别数目C 平均序列长度 平均符号数目
GS1 771 8 1 594 4
GS2 281 6 1 318 4
GS3 310 6 1 536 4
SS1 50 5 1 899 15
SS2 50 5 1 498 16
SS3 50 5 925 15

图2

F1随SLDA主题数的变化图"

表2

符号序列集上不同方法提取的特征数目及分类结果F1指标对比"

数据集 SLDA 特征数目 MM-FR 特征数目 G-FR 特征数目
GS1 0.946 8 35 0.911 8 16 0.980 5 81
GS2 1.000 0 15 0.996 4 16 0.996 4 85
GS3 0.967 7 35 0.922 6 16 0.996 8 72
SS1 1.000 0 15 1.000 0 324 0.980 0 227 7
SS2 1.000 0 5 1.000 0 400 1.000 0 252 9
SS3 1.000 0 5 1.000 0 400 1.000 0 212 1
1 DONG G , PEI J . Sequence data mining[M]. Berlin: Springer, 2007: Ⅶ- 69.
2 BENGIO Y , COURVILLE A , VINCENT P . Representation learning: a review and new perspectives[J]. IEEE Transactions on Pattern Analysis and Machine Intellig-ence, 2013, 35 (8): 1798- 1828.
doi: 10.1109/TPAMI.2013.50
3 COVER T M , HART P E . Nearest neighborpatternclassification[J]. IEEE Transactions on Information Theory, 1967, 13 (1): 21- 27.
4 XING Z , PEI J , KEOGH E J . A brief survey on sequence classification[J]. ACM SIGKDD Explorations Newsletter, 2010, 12 (1): 40- 48.
doi: 10.1145/1882471.1882478
5 BLASIAK S, RANGWALA H. A hidden Markov model variant for sequence classification[C]//Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain: IJCAI, 2011: 1192-1197.
6 郭彦明.基于隐马尔可夫模型的DNA序列分类研究[D].福州:福建师范大学, 2015.
GUO Yanming. A study of DNA sequence classification based on hidden Markov model[D]. Fuzhou: Fujian Normal University, 2015.
7 GUO G , CHEN L , YE Y , et al. Cluster validation method for determining the number of clusters in categorical sequences[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (12): 2936- 2948.
doi: 10.1109/TNNLS.2016.2608354
8 YUAN L , WANG W , CHEN L . Two-stage pruning method for gram-based categorical sequence clustering[J]. International Journal of Machine Learning and Cybernetics, 2019, 10 (4): 631- 640.
doi: 10.1007/s13042-017-0744-y
9 GERS F A , SCHMIDHUBER J , Cummins F . Learning toforget: continual prediction with LSTM[J]. Neural Computation, 2000, 12 (10): 2451- 2471.
doi: 10.1162/089976600300015015
10 GREFF K , SRIVASTAVA R K , KOUTNIK J , et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (10): 2222- 2232.
doi: 10.1109/TNNLS.2016.2582924
11 GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada: IEEE, 2013: 6645-6649.
12 TANG D, QIN B, LIU T. Document modeling with gated recurrent neural network for sentiment classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: ACL, 2015: 1422-1432.
13 JEBARA T , KONDOR R I , HOWARD A . Probability product kernels[J]. Journal of Machine Learning Research, 2004, 5 (5): 819- 844.
14 SALTON G . A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18 (11): 613- 620.
doi: 10.1145/361219.361220
15 XIONG T, WANG S, JIANG Q. A new Markov Model for clustering categorical sequences[C]//Proceedings of the 11th IEEE International Conference on Data Mining. Vancouver, Canada: IEEE Computer Society, 2011: 854-863.
16 程铃钫, 郭躬德, 陈黎飞. 符号序列多阶Markov分类[J]. 计算机应用, 2017, 37 (7): 1977- 1982.
CHENG Lingfang , GUO Gongde , CHEN Lifei . Classification of symbolic sequences with multi-order Markov Model[J]. Journal of Computer Applications, 2017, 37 (7): 1977- 1982.
17 郭彦明, 陈黎飞, 郭躬德. 基于隐马尔科夫模型的DNA序列分类方法[J]. 计算机系统应用, 2014, 23 (7): 24- 30.
doi: 10.3969/j.issn.1003-3254.2014.07.005
GUO Yanming , CHEN Lifei , GUO Gongde . DNA sequence classification method based on Hidden Markov Model[J]. Computer Systems & Applications, 2014, 23 (7): 24- 30.
doi: 10.3969/j.issn.1003-3254.2014.07.005
18 周玉元, 周铁军. DNA序列分类的Fisher判别法[J]. 湖南农业大学学报(自然科学版), 2003, 29 (5): 437- 440.
ZHOU Yuyuan , ZHOU Tiejun . The Fisher criterion on classification of the DNA sequence[J]. Journal of Hunan Agricultural University (Natural Sciences), 2003, 29 (5): 437- 440.
19 DAI A M , LE Q V . Semi-supervised sequence learning[J]. Advances in Neural Information Processing Systems, 2015, 3079- 3087.
20 GRAVES A, JAITLY N, MOHAMED A R. Hybrid speech recognition with deep Bidirectional LSTM[C]//Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic: IEEE, 2013: 273-278.
21 DEERWESTER S , DUMAIS S T , LANDAUER T K , et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990, 41 (6): 391- 407.
doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
22 THOMAS H. Probabilisticlatent semantic analysis[C]//Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Stockholm, Sweden: Morgan Kaufmann, 1999: 289-296.
23 BLEI D M , NG A Y , JORDAN M I . Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3, 993- 1022.
24 GRIFFITHS T L , STEYVERS M . Finding scientific topics[J]. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101 (1): 5228- 5235.
25 KELIL A, WANG S. SCS: a new similarity measure for categorical sequences[C]// Proceedings of the 8th IEEE International Conference on Data Mining. Pisa, Italy: IEEE Computer Society, 2008: 343-352.
26 WEI D , JIANG Q , WEI Y , et al. A novel hierarchical clustering algorithm for gene sequences[J]. BMC Bioinformatics, 2012, 13 (1): 174- 186.
[1] 白琳,俱通,王浩,雷明珠,潘晓英. 面向不平衡数据的提升均衡集成学习算法[J]. 山东大学学报 (工学版), 2024, 54(4): 59-66.
[2] 陈晓江,杨晓奇,陈广豪,刘伍颖. 混合BERT和宽度学习的低时间复杂度短文本分类[J]. 山东大学学报 (工学版), 2024, 54(4): 51-58.
[3] 宋辉,张轶哲,张功萱,孟元. 基于类权重和最小化预测熵的测试时集成方法[J]. 山东大学学报 (工学版), 2024, 54(3): 36-43.
[4] 聂秀山,巩蕊,董飞,郭杰,马玉玲. 短视频场景分类方法综述[J]. 山东大学学报 (工学版), 2024, 54(3): 1-11.
[5] 徐金华,罗义凯,李昱燃,李岩. 基于时频分解与深度学习的轨道客流预测[J]. 山东大学学报 (工学版), 2024, 54(2): 60-68.
[6] 马坤,刘筱云,李乐平,纪科,陈贞翔,杨波. 用于意图识别的自适应多标签信息学习模型[J]. 山东大学学报 (工学版), 2024, 54(1): 45-51.
[7] 于泓,杜娟,魏琳,张利. 计及行为特征的市场化用户电量数据拟合方法[J]. 山东大学学报 (工学版), 2023, 53(4): 113-119.
[8] 李颖,王建坤. 基于监督图正则化和信息融合的轻度认知障碍分类方法[J]. 山东大学学报 (工学版), 2023, 53(4): 65-73.
[9] 张喜龙,韩萌,陈志强,武红鑫,李慕航. 动态集成选择的不平衡漂移数据流Boosting分类算法[J]. 山东大学学报 (工学版), 2023, 53(4): 83-92.
[10] 刘财辉,周琪,叶晓文. 一种基于改进ReliefF算法的入侵检测模型[J]. 山东大学学报 (工学版), 2023, 53(2): 1-10.
[11] 吴艳丽,刘淑薇,何东晓,王晓宝,金弟. 刻画多种潜在关系的泊松-伽马主题模型[J]. 山东大学学报 (工学版), 2023, 53(2): 51-60.
[12] 孟令灿,聂秀山,张雪. 基于遮挡目标去除的公交车拥挤度分类算法[J]. 山东大学学报 (工学版), 2022, 52(4): 83-88.
[13] 孙志巍,宋明阳,潘泽华,景丽萍. 上下文感知的判别式主题模型[J]. 山东大学学报 (工学版), 2022, 52(4): 131-138.
[14] 王丽,于明仟,刘文鹏,周瑜,郑蕊蕊,贺建军. 面向类不平衡数据的K近邻偏标记学习算法[J]. 山东大学学报 (工学版), 2022, 52(3): 18-24.
[15] 龚楷伦,翟婷婷,唐鸿成. 一种面向多标签分类的在线主动学习算法[J]. 山东大学学报 (工学版), 2022, 52(2): 80-88.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 施来顺,万忠义 . 新型甜菜碱型沥青乳化剂的合成与性能测试[J]. 山东大学学报(工学版), 2008, 38(4): 112 -115 .
[2] 刘文亮,朱维红,陈涤,张泓泉. 基于雷达图像的运动目标形态检测及跟踪技术[J]. 山东大学学报(工学版), 2010, 40(3): 31 -36 .
[3] 赵然杭,陈守煜 . 水资源数量与质量联合评价理论模型研究[J]. 山东大学学报(工学版), 2006, 36(3): 46 -50 .
[4] 孙从征,管从胜,秦敬玉,程川 . 铝合金化学镀镍磷合金结构和性能[J]. 山东大学学报(工学版), 2007, 37(5): 108 -112 .
[5] 胡天亮,李鹏,张承瑞,左毅 . 基于VHDL的正交编码脉冲电路解码计数器设计[J]. 山东大学学报(工学版), 2008, 38(3): 10 -13 .
[6] 田芳1,张颖欣2,张礼3,侯秀萍3,裘南畹3. 新型金属氧化物薄膜气敏元件基材料的开发[J]. 山东大学学报(工学版), 2009, 39(2): 104 -107 .
[7] 张恭孝,杨荣华 . 水杨醛缩甲基氨基硫脲Schiff碱配合物的合成与表征[J]. 山东大学学报(工学版), 2008, 38(3): 108 -111 .
[8] 陈朋 胡文容 裴海燕. 一株反硝化细菌LZ-14的筛选及其脱氮特性[J]. 山东大学学报(工学版), 2009, 39(5): 133 -138 .
[9] 许延生,刘兴芳 . 模糊聚类迭代模型在水资源承载能力评价中的应用[J]. 山东大学学报(工学版), 2007, 37(3): 100 -104 .
[10] 高阳 张庆松 原小帅 许振浩 刘斌. 地质雷达在岩溶隧道超前预报中的应用[J]. 山东大学学报(工学版), 2009, 39(4): 82 -86 .