山东大学学报 (工学版) ›› 2020, Vol. 50 ›› Issue (2): 60-65.doi: 10.6040/j.issn.1672-3961.0.2019.760
Chao FENG1,2(
),Kunpeng XU1,2,Lifei CHEN1,2,*(
)
摘要:
针对现有序列挖掘算法特征维度高、学习算法时间复杂度高等方面的不足,提出一种主题特征表示法,将符号序列转换为一组表示多个主题呈现度的概率向量。基于文本挖掘中常用的隐含狄利克雷分配(latent Dirichlet allocation, LDA)主题模型,视短序列元组为序列的浅层特征(词),利用LDA模型学习算法提取主题及其概率分布,作为序列的深层特征。在6个实际序列数据集上进行试验,并与基于元组、Markov模型的现有方法作对比,结果表明,新方法在降低特征维度的同时提高了表示模型的学习效率,在符号序列分类应用中可以取得较理想的分类精度。
中图分类号:
| 1 | DONG G , PEI J . Sequence data mining[M]. Berlin: Springer, 2007: Ⅶ- 69. |
| 2 |
BENGIO Y , COURVILLE A , VINCENT P . Representation learning: a review and new perspectives[J]. IEEE Transactions on Pattern Analysis and Machine Intellig-ence, 2013, 35 (8): 1798- 1828.
doi: 10.1109/TPAMI.2013.50 |
| 3 | COVER T M , HART P E . Nearest neighborpatternclassification[J]. IEEE Transactions on Information Theory, 1967, 13 (1): 21- 27. |
| 4 |
XING Z , PEI J , KEOGH E J . A brief survey on sequence classification[J]. ACM SIGKDD Explorations Newsletter, 2010, 12 (1): 40- 48.
doi: 10.1145/1882471.1882478 |
| 5 | BLASIAK S, RANGWALA H. A hidden Markov model variant for sequence classification[C]//Proceedings of the 22nd International Joint Conference on Artificial Intelligence. Barcelona, Catalonia, Spain: IJCAI, 2011: 1192-1197. |
| 6 | 郭彦明.基于隐马尔可夫模型的DNA序列分类研究[D].福州:福建师范大学, 2015. |
| GUO Yanming. A study of DNA sequence classification based on hidden Markov model[D]. Fuzhou: Fujian Normal University, 2015. | |
| 7 |
GUO G , CHEN L , YE Y , et al. Cluster validation method for determining the number of clusters in categorical sequences[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (12): 2936- 2948.
doi: 10.1109/TNNLS.2016.2608354 |
| 8 |
YUAN L , WANG W , CHEN L . Two-stage pruning method for gram-based categorical sequence clustering[J]. International Journal of Machine Learning and Cybernetics, 2019, 10 (4): 631- 640.
doi: 10.1007/s13042-017-0744-y |
| 9 |
GERS F A , SCHMIDHUBER J , Cummins F . Learning toforget: continual prediction with LSTM[J]. Neural Computation, 2000, 12 (10): 2451- 2471.
doi: 10.1162/089976600300015015 |
| 10 |
GREFF K , SRIVASTAVA R K , KOUTNIK J , et al. LSTM: a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28 (10): 2222- 2232.
doi: 10.1109/TNNLS.2016.2582924 |
| 11 | GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada: IEEE, 2013: 6645-6649. |
| 12 | TANG D, QIN B, LIU T. Document modeling with gated recurrent neural network for sentiment classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: ACL, 2015: 1422-1432. |
| 13 | JEBARA T , KONDOR R I , HOWARD A . Probability product kernels[J]. Journal of Machine Learning Research, 2004, 5 (5): 819- 844. |
| 14 |
SALTON G . A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18 (11): 613- 620.
doi: 10.1145/361219.361220 |
| 15 | XIONG T, WANG S, JIANG Q. A new Markov Model for clustering categorical sequences[C]//Proceedings of the 11th IEEE International Conference on Data Mining. Vancouver, Canada: IEEE Computer Society, 2011: 854-863. |
| 16 | 程铃钫, 郭躬德, 陈黎飞. 符号序列多阶Markov分类[J]. 计算机应用, 2017, 37 (7): 1977- 1982. |
| CHENG Lingfang , GUO Gongde , CHEN Lifei . Classification of symbolic sequences with multi-order Markov Model[J]. Journal of Computer Applications, 2017, 37 (7): 1977- 1982. | |
| 17 |
郭彦明, 陈黎飞, 郭躬德. 基于隐马尔科夫模型的DNA序列分类方法[J]. 计算机系统应用, 2014, 23 (7): 24- 30.
doi: 10.3969/j.issn.1003-3254.2014.07.005 |
|
GUO Yanming , CHEN Lifei , GUO Gongde . DNA sequence classification method based on Hidden Markov Model[J]. Computer Systems & Applications, 2014, 23 (7): 24- 30.
doi: 10.3969/j.issn.1003-3254.2014.07.005 |
|
| 18 | 周玉元, 周铁军. DNA序列分类的Fisher判别法[J]. 湖南农业大学学报(自然科学版), 2003, 29 (5): 437- 440. |
| ZHOU Yuyuan , ZHOU Tiejun . The Fisher criterion on classification of the DNA sequence[J]. Journal of Hunan Agricultural University (Natural Sciences), 2003, 29 (5): 437- 440. | |
| 19 | DAI A M , LE Q V . Semi-supervised sequence learning[J]. Advances in Neural Information Processing Systems, 2015, 3079- 3087. |
| 20 | GRAVES A, JAITLY N, MOHAMED A R. Hybrid speech recognition with deep Bidirectional LSTM[C]//Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic: IEEE, 2013: 273-278. |
| 21 |
DEERWESTER S , DUMAIS S T , LANDAUER T K , et al. Indexing by Latent Semantic Analysis[J]. Journal of the American Society for Information Science, 1990, 41 (6): 391- 407.
doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 |
| 22 | THOMAS H. Probabilisticlatent semantic analysis[C]//Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Stockholm, Sweden: Morgan Kaufmann, 1999: 289-296. |
| 23 | BLEI D M , NG A Y , JORDAN M I . Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3, 993- 1022. |
| 24 | GRIFFITHS T L , STEYVERS M . Finding scientific topics[J]. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101 (1): 5228- 5235. |
| 25 | KELIL A, WANG S. SCS: a new similarity measure for categorical sequences[C]// Proceedings of the 8th IEEE International Conference on Data Mining. Pisa, Italy: IEEE Computer Society, 2008: 343-352. |
| 26 | WEI D , JIANG Q , WEI Y , et al. A novel hierarchical clustering algorithm for gene sequences[J]. BMC Bioinformatics, 2012, 13 (1): 174- 186. |
| [1] | 白琳,俱通,王浩,雷明珠,潘晓英. 面向不平衡数据的提升均衡集成学习算法[J]. 山东大学学报 (工学版), 2024, 54(4): 59-66. |
| [2] | 陈晓江,杨晓奇,陈广豪,刘伍颖. 混合BERT和宽度学习的低时间复杂度短文本分类[J]. 山东大学学报 (工学版), 2024, 54(4): 51-58. |
| [3] | 宋辉,张轶哲,张功萱,孟元. 基于类权重和最小化预测熵的测试时集成方法[J]. 山东大学学报 (工学版), 2024, 54(3): 36-43. |
| [4] | 聂秀山,巩蕊,董飞,郭杰,马玉玲. 短视频场景分类方法综述[J]. 山东大学学报 (工学版), 2024, 54(3): 1-11. |
| [5] | 徐金华,罗义凯,李昱燃,李岩. 基于时频分解与深度学习的轨道客流预测[J]. 山东大学学报 (工学版), 2024, 54(2): 60-68. |
| [6] | 马坤,刘筱云,李乐平,纪科,陈贞翔,杨波. 用于意图识别的自适应多标签信息学习模型[J]. 山东大学学报 (工学版), 2024, 54(1): 45-51. |
| [7] | 于泓,杜娟,魏琳,张利. 计及行为特征的市场化用户电量数据拟合方法[J]. 山东大学学报 (工学版), 2023, 53(4): 113-119. |
| [8] | 李颖,王建坤. 基于监督图正则化和信息融合的轻度认知障碍分类方法[J]. 山东大学学报 (工学版), 2023, 53(4): 65-73. |
| [9] | 张喜龙,韩萌,陈志强,武红鑫,李慕航. 动态集成选择的不平衡漂移数据流Boosting分类算法[J]. 山东大学学报 (工学版), 2023, 53(4): 83-92. |
| [10] | 刘财辉,周琪,叶晓文. 一种基于改进ReliefF算法的入侵检测模型[J]. 山东大学学报 (工学版), 2023, 53(2): 1-10. |
| [11] | 吴艳丽,刘淑薇,何东晓,王晓宝,金弟. 刻画多种潜在关系的泊松-伽马主题模型[J]. 山东大学学报 (工学版), 2023, 53(2): 51-60. |
| [12] | 孟令灿,聂秀山,张雪. 基于遮挡目标去除的公交车拥挤度分类算法[J]. 山东大学学报 (工学版), 2022, 52(4): 83-88. |
| [13] | 孙志巍,宋明阳,潘泽华,景丽萍. 上下文感知的判别式主题模型[J]. 山东大学学报 (工学版), 2022, 52(4): 131-138. |
| [14] | 王丽,于明仟,刘文鹏,周瑜,郑蕊蕊,贺建军. 面向类不平衡数据的K近邻偏标记学习算法[J]. 山东大学学报 (工学版), 2022, 52(3): 18-24. |
| [15] | 龚楷伦,翟婷婷,唐鸿成. 一种面向多标签分类的在线主动学习算法[J]. 山东大学学报 (工学版), 2022, 52(2): 80-88. |
|