您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报 (工学版) ›› 2018, Vol. 48 ›› Issue (6): 8-18.doi: 10.6040/j.issn.1672-3961.0.2018.193

• 机器学习与数据挖掘 • 上一篇    下一篇

一种具有新主题偏向性的短文本动态聚类方法

朱映雪1,2(),黄瑞章1,2,*(),马灿1,2   

  1. 1. 贵州大学大学计算机科学与技术学院,贵州 贵阳 550025
    2. 贵州省公共大数据重点实验室,贵州 贵阳 550025
  • 收稿日期:2018-05-31 出版日期:2018-12-20 发布日期:2018-12-26
  • 通讯作者: 黄瑞章 E-mail:zhuyingxue1993@gmail.com;rzhuang@gzu.edu.cn
  • 作者简介:朱映雪(1993—),女,贵州毕节人,硕士研究生,主要研究方向为数据挖掘与机器学习.E-mail:zhuyingxue1993@gmail.com
  • 基金资助:
    国家自然科学基金项目(61462011);国家自然科学基金重大研究计划项目(91746116);贵州省自然科学基金(黔科合基础[2018]1035)

A short text dynamic clustering approach bias on new topic

Yingxue ZHU1,2(),Ruizhang HUANG1,2,*(),Can MA1,2   

  1. 1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
    2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received:2018-05-31 Online:2018-12-20 Published:2018-12-26
  • Contact: Ruizhang HUANG E-mail:zhuyingxue1993@gmail.com;rzhuang@gzu.edu.cn
  • Supported by:
    国家自然科学基金项目(61462011);国家自然科学基金重大研究计划项目(91746116);贵州省自然科学基金(黔科合基础[2018]1035)

摘要:

为了解决短文本数据流的动态聚类问题,提出动态的狄利克雷多项混合(dynamic Dirichlet multinomial mixture,DDMM)模型。模型能够很好地捕获短文本数据流中主题随时间变化而变化的动态过程,同时考虑到已有历史主题和新主题之间的关系,能够对主题继承性的强弱进行调整,从而增大新主题产生的可能。在Gibbs采样过程中,能够自动估算出聚类个数。模拟数据和真实数据上的试验表明,DDMM模型是有效的。同时将提出的方法和传统动态聚类方法进行对比,结果表明DDMM模型能够进行有效的文本动态聚类,并且聚类效果表现良好。

关键词: 动态聚类, 新主题偏向, Gibbs采样, 主题模型, 文本挖掘

Abstract:

The dynamic Dirichlet multinomial mixture (DDMM) model for short textual data stream dynamic clustering problem was proposed.The model could capture the change of topics in the short textual data stream over time, and take the relationship between existing historical topics and new topics into consideration, which could adjust the strength of the lineage of topics, and increase the likelihood of new topic emergence.In addition, the proposed approach could infer the number of clusters automatically in the process of Gibbs sampling.Experiments indicated that the DDMM model performed well on the synthetic data set as well as real data sets.And the comparison between the proposed approach and state-of-the-art dynamic clustering approaches showed that the DDMM model was effective for document dynamic clustering, and performed well on short text dynamic clustering.

Key words: dynamic clustering, new topic bias, Gibbs sampling, topic model, text mining

中图分类号: 

  • TP391.1

表1

主要使用的符号"

符号 释义
d, z, w 文档,主题,词
t 时间
K 初始聚类个数
K* 实际估算出的聚类个数
V 词典大小
dt 时间片t内的文档集
Nd 文档d的词数
Nd, w 文档d中词w出现的次数
Θt, Φt 时间片t内的主题分布,时间片t内的词分布
γ, αt, βt 模型的先验参数

图1

DDMM的图模型表示"

表2

模拟数据集"

时间片 类标签(文本数)
1 0(50), 1(50), 2(50)
2 0(50), 1(50), 2(50), 3(50)
3 0(50), 1(50), 2(50), 3(50), 4(50)

图2

DCT模型估计出的模拟数据集的聚类标签"

图3

DDMM模型估计出的模拟数据集的聚类标签"

图4

DDMM模型在每次迭代中估算出的聚类个数"

表3

20个新闻数据集"

时间片 类标签(文本数)
1 0(50), 1(50), 2(50)
2 0(50), 1(50), 2(50), 3(50)
3 0(50), 1(50), 2(50), 3(50), 4(50)

表4

twitter数据集上的动态聚类模型的NMI值"

时间片 DTM DMM DCT DDMM
1 0.284 0.311 0.448 0.506
2 0.327 0.394 0.422 0.527
3 0.319 0.373 0.415 0.483

表5

twitter数据集上的动态聚类模型的Purity值"

时间片 DTM DMM DCT DDMM
1 0.412 0.451 0.543 0.597
2 0.479 0.492 0.538 0.589
3 0.463 0.478 0.514 0.573

表6

20个新闻数据集上的动态聚类模型的NMI值"

时间片 DTM DMM DCT DDMM
1 0.359 0.368 0.412 0.482
2 0.382 0.355 0.398 0.436
3 0.305 0.304 0.334 0.406

表7

20个新闻数据集上的动态聚类模型的Purity值"

时间片 DTM DMM DCT DDMM
1 0.423 0.427 0.504 0.549
2 0.434 0.435 0.468 0.546
3 0.418 0.426 0.457 0.535

图5

DCT模型估计出的20个新闻数据集的聚类标签"

图6

DDMM模型估计出的20个新闻数据集的聚类标签"

图7

γ对DDMM模型聚类效果的影响"

1 YIN J, WANG J.A Dirichlet multinomial mixture model-based approach for short text clustering[C]//Proc of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: KDD'14.New York, USA: ACM, 2014: 233-242.
2 AHMED A, XING E P.Timeline: a dynamic hierarchical dirichlet process model for recovering birth/death and evolution of topics in text stream[C]//Proc of the 26th Conference on Uncertainty in Artificial Intelligence.New York, USA: AUAI Press, 2010: 20-29.
3 PITMAN J , YOR M . The two-parameter poisson dirichlet distribution derived from a stable subordinator[J]. Annals of Probability, 1995, 25 (2): 885- 900.
4 TEH Y W.A hierarchical bayesian language model based on pitman-yor processes[C]//Proc of the 21st Iternational Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics.Sydney, Australia: ACM, 2006: 985-992.
5 PORTEOUS I, NEWMAN D, IHLER A, et al.Fast collapsed gibbs sampling for latent dirichlet allocation[C]//Proc of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Las Vegas, USA: ACM, 2008: 569-577.
6 BRUCE C , DONALD M , TREVOR S . Search engines: information retrieval in practice[M]. Boston, USA: Addison-Wesley, 2010: 22- 23.
7 HOFMANN T.Probabilistic latent semantic indexing[C]//Proc of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.New York, USA: ACM, 1999: 50-57.
8 BLEI D M , NG A Y , JORDAN M I . Latent dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3, 993- 1022.
9 BLEID M, LAFFERTY J D.Dynamic topic models[C]//Proc of the 23rd International Conference on Machine Learning (ICML′06).New York, USA: ACM, 2006: 113-120.
10 WEI X, SUN J, WANG X.Dynamic mixture models for multiple time-Series[C]// Proc of the 20th International Joint Conference on Artificial Intelligence.Hyderabad, India: ACM, 2007: 2909-2914.
11 IWATA T, WATANABE S, YAMADA T, et al.Topic tracking model for analyzing consumer purchase behavior[C]//Proc of 21st International Joint Conference on Artificial Intelligence.San Francisco, USA: ACM, 2009: 1427-1432.
12 WANG X, MCCALLUM A.Topics over time: a non-markov continuous-time model of topical trends[C]//Proc of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York, USA: ACM, 2006: 424-433.
13 李雷, 朱玉婷, 施化吉, 等. 社会网络中基于U_BTM模型的主题挖掘[J]. 计算机应用研究, 2017, 34 (1): 132- 135, 146.
doi: 10.3969/j.issn.1001-3695.2017.01.028
LI Lei , ZHU Yuting , SHI Huaji , et al. Topic mining based on U_BTM model in social networks[J]. Application Research of Computers, 2017, 34 (1): 132- 135, 146.
doi: 10.3969/j.issn.1001-3695.2017.01.028
14 谢珺, 郝洁, 苏婧琼, 等. 一种针对短文本的主题情感混合模型[J]. 中文信息学报, 2017, 31 (1): 162- 168.
XIE Jun , HAO Jie , SU Jingqiong , et al. A joint topic and sentiment model for short texts[J]. Journal of Chinese Information Processing, 2017, 31 (1): 162- 168.
15 刘泽锦, 王洁. 同主题词短文本分类算法中BTM的应用与改进[J]. 计算机系统应用, 2017, 26 (11): 213- 219.
LIU Zejin , WANG Jie . Application and improvement of BTM in short text classification algorithm of the same topic[J]. Computer Systems & Applications, 2017, 26 (11): 213- 219.
16 YAN X, GUO J, LAN Y, et al.A biterm topic model for short texts[C]//Proc of the 22nd International Conference on World Wide Web.New York, USA: ACM, 2013: 1445-1456.
17 ZHOU X , OUYANG J , LI X . Two time-efficient gibbs sampling inference algorithms for biterm topic model[J]. Applied Intelligence, 2018, 48 (3): 730- 754.
doi: 10.1007/s10489-017-1004-2
18 CHENG X , YAN X , LAN Y , et al. BTM:topic modeling over short texts[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26 (12): 2928- 2941.
doi: 10.1109/TKDE.2014.2313872
19 WANG Y, AGICHTEIN E, BENZI M.TM-LDA: efficient online modeling of latent topic transitions in social media[C]//Proc of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ′12).New York, USA: ACM, 2012: 123-131.
20 ZHAO W X , JIANG J , WENG J , et al. Comparing twitter and traditional media using topic models[J]. Berlin, Germany: Springer, 2011, 338- 349.
21 SASAKI K, YOSHIKAWA T, FURUHASHI T.Online topic model for twitter considering dynamics of user interests and topic trends[C] //Proc of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics.Doha, Qatar: ACL, 2014: 1977-1985.
22 LIANG S, YILMAZ E, KANOULAS E.Dynamic clustering of streaming short documents[C]//Proc of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD′16).New York, USA: ACM, 2016: 995-1004.
23 LIANG S, REN Z, YILMAZ E, et al.Collaborative user clustering for short text streams[C]//Proc of the 31st AAAI Conference on Artificial Intelligence.San Francisco, USA: ACM, 2017: 3504-3510.
24 刘冰玉, 王翠荣, 王聪, 等. 基于动态主题模型融合多维数据的微博社区发现算法[J]. 软件学报, 2017, 28 (2): 246- 261.
LIU Bingyu , WANG Cuirong , WANG Cong , et al. Microblog community discovery algorithm based on dynamic topic model with multidimensional data fusion[J]. Journal of Software, 2017, 28 (2): 246- 261.
25 PHADIA E G . Prior processes and their applications[M]. New York, USA: Springer, 2016: 77- 79.
26 ZHONG S . Semi-supervised model-based document clustering: a comparative study[J]. Machine Learning, 2006, 65 (3): 3- 29.
27 TEH Y W , JORDAN M I , BEAL M J , et al. Hierarchical dirichlet process[J]. Journal of American Statistical Association, 2006, 101 (476): 1566- 1581.
doi: 10.1198/016214506000000302
[1] 马坤,刘筱云,李乐平,纪科,陈贞翔,杨波. 用于意图识别的自适应多标签信息学习模型[J]. 山东大学学报 (工学版), 2024, 54(1): 45-51.
[2] 吴艳丽,刘淑薇,何东晓,王晓宝,金弟. 刻画多种潜在关系的泊松-伽马主题模型[J]. 山东大学学报 (工学版), 2023, 53(2): 51-60.
[3] 孙志巍,宋明阳,潘泽华,景丽萍. 上下文感知的判别式主题模型[J]. 山东大学学报 (工学版), 2022, 52(4): 131-138.
[4] 闫盈盈,黄瑞章,王瑞,马灿,刘博伟,黄庭. 一种长文本辅助短文本的文本理解方法[J]. 山东大学学报(工学版), 2018, 48(3): 67-74.
[5] 卢文羊, 徐佳一, 杨育彬. 基于LDA主题模型的社会网络链接预测[J]. 山东大学学报(工学版), 2014, 44(6): 26-31.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 王素玉,艾兴,赵军,李作丽,刘增文 . 高速立铣3Cr2Mo模具钢切削力建模及预测[J]. 山东大学学报(工学版), 2006, 36(1): 1 -5 .
[2] 李 侃 . 嵌入式相贯线焊接控制系统开发与实现[J]. 山东大学学报(工学版), 2008, 38(4): 37 -41 .
[3] 施来顺,万忠义 . 新型甜菜碱型沥青乳化剂的合成与性能测试[J]. 山东大学学报(工学版), 2008, 38(4): 112 -115 .
[4] 孔祥臻,刘延俊,王勇,赵秀华 . 气动比例阀的死区补偿与仿真[J]. 山东大学学报(工学版), 2006, 36(1): 99 -102 .
[5] 陈瑞,李红伟,田靖. 磁极数对径向磁轴承承载力的影响[J]. 山东大学学报(工学版), 2018, 48(2): 81 -85 .
[6] 李可,刘常春,李同磊 . 一种改进的最大互信息医学图像配准算法[J]. 山东大学学报(工学版), 2006, 36(2): 107 -110 .
[7] 季涛,高旭,孙同景,薛永端,徐丙垠 . 铁路10 kV自闭/贯通线路故障行波特征分析[J]. 山东大学学报(工学版), 2006, 36(2): 111 -116 .
[8] 浦剑1 ,张军平1 ,黄华2 . 超分辨率算法研究综述[J]. 山东大学学报(工学版), 2009, 39(1): 27 -32 .
[9] 王丽君,黄奇成,王兆旭 . 敏感性问题中的均方误差与模型比较[J]. 山东大学学报(工学版), 2006, 36(6): 51 -56 .
[10] 孙殿柱,朱昌志,李延瑞 . 散乱点云边界特征快速提取算法[J]. 山东大学学报(工学版), 2009, 39(1): 84 -86 .