
Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (6): 8-18. doi: 10.6040/j.issn.1672-3961.0.2018.193

• Machine Learning & Data Mining •

A short text dynamic clustering approach with new topic bias

Yingxue ZHU1,2, Ruizhang HUANG1,2,*, Can MA1,2

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
    2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2018-05-31  Online: 2018-12-20  Published: 2018-12-26
  • Corresponding author: Ruizhang HUANG  E-mail: zhuyingxue1993@gmail.com; rzhuang@gzu.edu.cn
  • About the authors: ZHU Yingxue (1993—), female, born in Bijie, Guizhou, is a master's student whose research interests are data mining and machine learning. E-mail: zhuyingxue1993@gmail.com
  • Supported by:
    the National Natural Science Foundation of China (61462011), the Major Research Plan of the National Natural Science Foundation of China (91746116), and the Natural Science Foundation of Guizhou Province (黔科合基础[2018]1035)



Abstract:

The dynamic Dirichlet multinomial mixture (DDMM) model was proposed to address the dynamic clustering problem for short text data streams. The model could capture how topics in a short text data stream change over time and, by taking the relationship between existing historical topics and new topics into consideration, could adjust the strength of topic inheritance and thereby increase the likelihood of new topics emerging. In addition, the proposed approach could infer the number of clusters automatically during Gibbs sampling. Experiments showed that the DDMM model performed well on a synthetic data set as well as on real data sets, and a comparison with state-of-the-art dynamic clustering approaches showed that the DDMM model was effective for dynamic document clustering and performed well on short text dynamic clustering.

Key words: dynamic clustering, new topic bias, Gibbs sampling, topic model, text mining
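
The page gives no update equations for DDMM, but the abstract pins down the overall mechanism: a Dirichlet multinomial mixture sampled by collapsed Gibbs sweeps, with empty clusters discarded so that the cluster count is inferred, and a parameter γ that weakens the inheritance of existing topics in favor of opening new ones. The Python sketch below illustrates one such sweep in the style of a GSDMM-type sampler; the function gibbs_sweep, its data structures, and the exact way alpha and gamma enter the sampling weights are illustrative assumptions, not the published DDMM updates.

import math
import random
from collections import defaultdict

def gibbs_sweep(docs, z, clusters, alpha, beta, gamma, V):
    """One collapsed Gibbs sweep for a Dirichlet multinomial mixture over short texts.

    docs     : list of tokenized documents (lists of word ids or strings)
    z        : current cluster id per document (None = not yet assigned)
    clusters : dict cluster_id -> {"m": doc count, "n": token count,
                                   "nw": defaultdict(int) of per-word counts}
    alpha    : pseudo-count for joining an existing cluster
    beta     : symmetric Dirichlet prior on per-cluster word distributions
    gamma    : pseudo-count for opening a new cluster; raising it weakens
               topic inheritance and favors the emergence of new topics
    V        : vocabulary size
    """
    for d, words in enumerate(docs):
        k_old = z[d]
        if k_old is not None:                      # remove d from its current cluster
            c = clusters[k_old]
            c["m"] -= 1
            c["n"] -= len(words)
            for w in words:
                c["nw"][w] -= 1
            if c["m"] == 0:                        # empty clusters vanish, so the
                del clusters[k_old]                # effective cluster count is inferred

        candidates = list(clusters) + ["new"]
        log_w = []
        for k in candidates:
            if k == "new":
                prior, n_k, nw = gamma, 0, defaultdict(int)
            else:
                c = clusters[k]
                prior, n_k, nw = c["m"] + alpha, c["n"], c["nw"]
            lp, seen = math.log(prior), defaultdict(int)
            for i, w in enumerate(words):          # collapsed multinomial likelihood
                lp += math.log(nw[w] + beta + seen[w])
                lp -= math.log(n_k + V * beta + i)
                seen[w] += 1
            log_w.append(lp)

        top = max(log_w)                           # sample a cluster for document d
        k_new = random.choices(candidates,
                               weights=[math.exp(x - top) for x in log_w])[0]
        if k_new == "new":
            k_new = max(clusters, default=-1) + 1  # fresh integer id
            clusters[k_new] = {"m": 0, "n": 0, "nw": defaultdict(int)}
        c = clusters[k_new]
        c["m"] += 1
        c["n"] += len(words)
        for w in words:
            c["nw"][w] += 1
        z[d] = k_new
    return z, clusters

Repeating such sweeps until the assignments stabilize yields both the cluster labels and the number of non-empty clusters, which is the quantity Figure 4 below tracks across iterations.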

CLC number: TP391.1

Table 1

Main notation used

Symbol  Description
d, z, w  document, topic, word
t  time slice index
K  initial number of clusters
K*  number of clusters actually estimated
V  vocabulary size
d^t  document set in time slice t
N_d  number of words in document d
N_{d,w}  number of occurrences of word w in document d
Θ^t, Φ^t  topic distribution and per-topic word distributions in time slice t
γ, α^t, β^t  prior parameters of the model
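
Using the notation of Table 1, a plain Dirichlet multinomial mixture for a single time slice t can be written as the baseline below. How DDMM couples consecutive slices through the priors α^t and β^t, and where the new-topic weight γ enters, is precisely what this page does not spell out, so the block is a per-slice baseline sketch rather than the DDMM generative process itself.

\[
\begin{aligned}
\Theta^{t} &\sim \operatorname{Dirichlet}(\alpha^{t}) \\
\Phi_{k}^{t} &\sim \operatorname{Dirichlet}(\beta^{t}), \qquad k = 1,\dots,K \\
z_{d} &\sim \operatorname{Multinomial}(\Theta^{t}), \qquad d \in d^{t} \\
w_{d,n} &\sim \operatorname{Multinomial}(\Phi_{z_{d}}^{t}), \qquad n = 1,\dots,N_{d}
\end{aligned}
\]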

Figure 1

Graphical model representation of DDMM

Table 2

Synthetic data set

Time slice  Class labels (number of documents)
1  0(50), 1(50), 2(50)
2  0(50), 1(50), 2(50), 3(50)
3  0(50), 1(50), 2(50), 3(50), 4(50)
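
Table 2 fully specifies the layout of the synthetic stream: every time slice keeps the classes of the previous slice and introduces one new class, with 50 documents per class. A hypothetical generator of such a stream is sketched below; the vocabulary size, document length, and per-class word distributions are made-up values for illustration and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
V, DOC_LEN, DOCS_PER_CLASS = 1000, 10, 50       # illustrative sizes, not from the paper
schedule = {1: [0, 1, 2], 2: [0, 1, 2, 3], 3: [0, 1, 2, 3, 4]}   # layout of Table 2

# one sparse word distribution per class, standing in for a topic
class_phi = {c: rng.dirichlet(np.full(V, 0.05)) for c in range(5)}

stream = []                                     # (time slice, class label, word ids)
for t, labels in schedule.items():
    for c in labels:
        for _ in range(DOCS_PER_CLASS):
            words = rng.choice(V, size=DOC_LEN, p=class_phi[c]).tolist()
            stream.append((t, c, words))

print(len(stream))                              # 150 + 200 + 250 = 600 short documents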

Figure 2

Cluster labels of the synthetic data set estimated by the DCT model

Figure 3

Cluster labels of the synthetic data set estimated by the DDMM model

Figure 4

Number of clusters estimated by the DDMM model at each iteration

Table 3

20 Newsgroups data set

Time slice  Class labels (number of documents)
1  0(50), 1(50), 2(50)
2  0(50), 1(50), 2(50), 3(50)
3  0(50), 1(50), 2(50), 3(50), 4(50)

Table 4

NMI values of the dynamic clustering models on the Twitter data set

Time slice  DTM  DMM  DCT  DDMM
1  0.284  0.311  0.448  0.506
2  0.327  0.394  0.422  0.527
3  0.319  0.373  0.415  0.483

Table 5

Purity values of the dynamic clustering models on the Twitter data set

Time slice  DTM  DMM  DCT  DDMM
1  0.412  0.451  0.543  0.597
2  0.479  0.492  0.538  0.589
3  0.463  0.478  0.514  0.573
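
Tables 4-7 score every model with NMI and purity in each time slice. The page does not show how these values were computed, but both metrics are standard and can be obtained from true and predicted labels roughly as follows (scikit-learn for NMI; purity as the share of documents that fall into their cluster's majority class). The label arrays are toy values, not data from the paper.

from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    # each cluster is credited with its majority class; purity = credited docs / N
    cm = contingency_matrix(y_true, y_pred)     # rows: true classes, columns: clusters
    return cm.max(axis=0).sum() / cm.sum()

y_true = [0, 0, 1, 1, 2, 2]                     # toy labels for illustration only
y_pred = [1, 1, 0, 0, 2, 0]
print(normalized_mutual_info_score(y_true, y_pred))
print(purity(y_true, y_pred))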

Table 6

NMI values of the dynamic clustering models on the 20 Newsgroups data set

Time slice  DTM  DMM  DCT  DDMM
1  0.359  0.368  0.412  0.482
2  0.382  0.355  0.398  0.436
3  0.305  0.304  0.334  0.406

Table 7

Purity values of the dynamic clustering models on the 20 Newsgroups data set

Time slice  DTM  DMM  DCT  DDMM
1  0.423  0.427  0.504  0.549
2  0.434  0.435  0.468  0.546
3  0.418  0.426  0.457  0.535

Figure 5

Cluster labels of the 20 Newsgroups data set estimated by the DCT model

Figure 6

Cluster labels of the 20 Newsgroups data set estimated by the DDMM model

Figure 7

Effect of γ on the clustering performance of the DDMM model

1 YIN J, WANG J. A Dirichlet multinomial mixture model-based approach for short text clustering[C]//Proc of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14). New York, USA: ACM, 2014: 233-242.
2 AHMED A, XING E P. Timeline: a dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream[C]//Proc of the 26th Conference on Uncertainty in Artificial Intelligence. New York, USA: AUAI Press, 2010: 20-29.
3 PITMAN J, YOR M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator[J]. Annals of Probability, 1997, 25(2): 855-900.
4 TEH Y W. A hierarchical Bayesian language model based on Pitman-Yor processes[C]//Proc of the 21st International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: ACM, 2006: 985-992.
5 PORTEOUS I, NEWMAN D, IHLER A, et al. Fast collapsed Gibbs sampling for latent Dirichlet allocation[C]//Proc of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA: ACM, 2008: 569-577.
6 CROFT W B, METZLER D, STROHMAN T. Search engines: information retrieval in practice[M]. Boston, USA: Addison-Wesley, 2010: 22-23.
7 HOFMANN T. Probabilistic latent semantic indexing[C]//Proc of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM, 1999: 50-57.
8 BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
9 BLEI D M, LAFFERTY J D. Dynamic topic models[C]//Proc of the 23rd International Conference on Machine Learning (ICML'06). New York, USA: ACM, 2006: 113-120.
10 WEI X, SUN J, WANG X. Dynamic mixture models for multiple time-series[C]//Proc of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India: ACM, 2007: 2909-2914.
11 IWATA T, WATANABE S, YAMADA T, et al. Topic tracking model for analyzing consumer purchase behavior[C]//Proc of the 21st International Joint Conference on Artificial Intelligence. San Francisco, USA: ACM, 2009: 1427-1432.
12 WANG X, MCCALLUM A. Topics over time: a non-Markov continuous-time model of topical trends[C]//Proc of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 2006: 424-433.
13 LI Lei, ZHU Yuting, SHI Huaji, et al. Topic mining based on U_BTM model in social networks[J]. Application Research of Computers, 2017, 34(1): 132-135, 146. (in Chinese)
doi: 10.3969/j.issn.1001-3695.2017.01.028
14 XIE Jun, HAO Jie, SU Jingqiong, et al. A joint topic and sentiment model for short texts[J]. Journal of Chinese Information Processing, 2017, 31(1): 162-168. (in Chinese)
15 LIU Zejin, WANG Jie. Application and improvement of BTM in short text classification algorithm of the same topic[J]. Computer Systems & Applications, 2017, 26(11): 213-219. (in Chinese)
16 YAN X, GUO J, LAN Y, et al. A biterm topic model for short texts[C]//Proc of the 22nd International Conference on World Wide Web. New York, USA: ACM, 2013: 1445-1456.
17 ZHOU X, OUYANG J, LI X. Two time-efficient Gibbs sampling inference algorithms for biterm topic model[J]. Applied Intelligence, 2018, 48(3): 730-754.
doi: 10.1007/s10489-017-1004-2
18 CHENG X, YAN X, LAN Y, et al. BTM: topic modeling over short texts[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2928-2941.
doi: 10.1109/TKDE.2014.2313872
19 WANG Y, AGICHTEIN E, BENZI M. TM-LDA: efficient online modeling of latent topic transitions in social media[C]//Proc of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'12). New York, USA: ACM, 2012: 123-131.
20 ZHAO W X, JIANG J, WENG J, et al. Comparing Twitter and traditional media using topic models[C]//Proc of the 33rd European Conference on Information Retrieval. Berlin, Germany: Springer, 2011: 338-349.
21 SASAKI K, YOSHIKAWA T, FURUHASHI T. Online topic model for Twitter considering dynamics of user interests and topic trends[C]//Proc of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: ACL, 2014: 1977-1985.
22 LIANG S, YILMAZ E, KANOULAS E. Dynamic clustering of streaming short documents[C]//Proc of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16). New York, USA: ACM, 2016: 995-1004.
23 LIANG S, REN Z, YILMAZ E, et al. Collaborative user clustering for short text streams[C]//Proc of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: ACM, 2017: 3504-3510.
24 LIU Bingyu, WANG Cuirong, WANG Cong, et al. Microblog community discovery algorithm based on dynamic topic model with multidimensional data fusion[J]. Journal of Software, 2017, 28(2): 246-261. (in Chinese)
25 PHADIA E G. Prior processes and their applications[M]. New York, USA: Springer, 2016: 77-79.
26 ZHONG S. Semi-supervised model-based document clustering: a comparative study[J]. Machine Learning, 2006, 65(3): 3-29.
27 TEH Y W, JORDAN M I, BEAL M J, et al. Hierarchical Dirichlet processes[J]. Journal of the American Statistical Association, 2006, 101(476): 1566-1581.
doi: 10.1198/016214506000000302