Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (6): 8-18. doi: 10.6040/j.issn.1672-3961.0.2018.193

• Machine Learning & Data Mining •

A short text dynamic clustering approach biased toward new topics

Yingxue ZHU1,2, Ruizhang HUANG1,2,*, Can MA1,2

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
    2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2018-05-31 Online: 2018-12-20 Published: 2018-12-26
  • Contact: Ruizhang HUANG E-mail: zhuyingxue1993@gmail.com; rzhuang@gzu.edu.cn
  • Supported by:
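    National Natural Science Foundation of China (61462011); Major Research Plan of the National Natural Science Foundation of China (91746116); Natural Science Foundation of Guizhou Province (Qiankehe Jichu [2018]1035)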

Abstract:

A dynamic Dirichlet multinomial mixture (DDMM) model is proposed for dynamic clustering of short text streams. The model captures how topics in a short text stream change over time and takes the relationship between existing historical topics and new topics into account, so that it can adjust how strongly topics are inherited over time and increase the likelihood that new topics emerge. In addition, the approach infers the number of clusters automatically during Gibbs sampling. Experiments indicated that the DDMM model performed well on a synthetic data set as well as on real data sets. A comparison with state-of-the-art dynamic clustering approaches showed that the DDMM model was effective for dynamic document clustering and performed well on dynamic clustering of short texts.
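To make the phrase "infer the number of clusters automatically in the process of Gibbs sampling" concrete, the following is a minimal sketch of a collapsed Gibbs sampler for a plain, static Dirichlet multinomial mixture. It is not the DDMM model itself: the time-dependent priors and the new-topic bias γ are not specified on this page, so they are left out. The function name `dmm_gibbs` and the default values for `K`, `alpha`, `beta`, and `iters` are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def dmm_gibbs(docs, K=8, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampling for a static Dirichlet multinomial mixture.

    docs: list of token lists. K is only an upper bound on the number of
    clusters; clusters that lose all their documents stay empty, so the
    number of distinct labels returned is the estimated cluster count.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    z = [rng.randrange(K) for _ in docs]       # random initial assignment
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # tokens per cluster
    nw = [Counter() for _ in range(K)]         # per-cluster word counts
    for d, doc in enumerate(docs):
        m[z[d]] += 1
        n[z[d]] += len(doc)
        nw[z[d]].update(doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            counts = Counter(doc)
            # Remove document d from its current cluster.
            k = z[d]
            m[k] -= 1
            n[k] -= len(doc)
            nw[k].subtract(counts)
            # Log of the unnormalised conditional for every cluster.
            logp = []
            for c in range(K):
                lp = math.log(m[c] + alpha)
                for w, cnt in counts.items():
                    for j in range(cnt):
                        lp += math.log(nw[c][w] + beta + j)
                for i in range(len(doc)):
                    lp -= math.log(n[c] + V * beta + i)
                logp.append(lp)
            # Sample a new cluster and add the document back.
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            k = rng.choices(range(K), weights=weights)[0]
            z[d] = k
            m[k] += 1
            n[k] += len(doc)
            nw[k].update(counts)
    return z
```

Calling `len(set(dmm_gibbs(docs)))` gives the number of non-empty clusters after sampling, which plays the role of K* in Table 1; a dynamic variant would additionally carry statistics from one time slice into the priors of the next.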

Key words: dynamic clustering, new topic bias, Gibbs sampling, topic model, text mining

CLC Number: 

  • TP391.1

Table 1

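Main notation used

Symbol          Meaning
d, z, w         document, topic, word
t               time
K               initial number of clusters
K*              estimated actual number of clusters
V               vocabulary size
d_t             set of documents in time slice t
N_d             number of words in document d
N_{d,w}         number of occurrences of word w in document d
Θ_t, Φ_t        topic distribution in time slice t, word distribution in time slice t
γ, α_t, β_t     prior parameters of the model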

Fig.1

Graphical representation of DDMM

Table 2

Synthetic data set

Time slice      Class label (number of documents)
1               0(50), 1(50), 2(50)
2               0(50), 1(50), 2(50), 3(50)
3               0(50), 1(50), 2(50), 3(50), 4(50)

Fig.2

Estimated cluster labels of the synthetic data set obtained by the DCT model

Fig.3

Estimated cluster labels of the synthetic data set obtained by the DDMM model

Fig.4

Number of clusters estimated by the DDMM model at each iteration

Table 3

20 news datasets

Time slice      Class label (number of documents)
1               0(50), 1(50), 2(50)
2               0(50), 1(50), 2(50), 3(50)
3               0(50), 1(50), 2(50), 3(50), 4(50)

Table 4

NMI results of the dynamic clustering models on the Twitter data set

Time slice      DTM     DMM     DCT     DDMM
1               0.284   0.311   0.448   0.506
2               0.327   0.394   0.422   0.527
3               0.319   0.373   0.415   0.483
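For readers unfamiliar with the metric, normalized mutual information (NMI) compares predicted cluster labels with ground-truth class labels and does not depend on how the clusters are numbered. The sketch below is one common formulation; the normalization (mutual information divided by the arithmetic mean of the two entropies) is an assumption, since this page does not state which variant the authors used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings of the same items."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))

    # Mutual information I(T; P) from the contingency counts.
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    # Entropies of the two labelings.
    h_true = -sum((c / n) * math.log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_counts.values())

    denom = (h_true + h_pred) / 2  # assumed normalization: arithmetic mean
    # Both labelings put everything in one group: treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0
```

A clustering that matches the classes up to a renaming of labels scores 1, while a clustering that is statistically independent of the classes scores 0.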

Table 5

Purity results of the dynamic clustering models on the Twitter data set

Time slice      DTM     DMM     DCT     DDMM
1               0.412   0.451   0.543   0.597
2               0.479   0.492   0.538   0.589
3               0.463   0.478   0.514   0.573
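Purity, the second metric reported here, can be sketched in a few lines: each predicted cluster is credited with its most frequent true class, and the credited documents are divided by the total number of documents. This is the standard textbook definition, assumed here because the page does not spell it out; the function name `purity` is hypothetical.

```python
from collections import Counter, defaultdict


def purity(true_labels, pred_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    clusters = defaultdict(Counter)
    for t, p in zip(true_labels, pred_labels):
        clusters[p][t] += 1  # count true classes inside each predicted cluster
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(true_labels)
```

Purity rises trivially as the number of clusters grows (one cluster per document gives 1.0), which is one reason it is usually reported alongside NMI.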

Table 6

NMI results of the dynamic clustering models on 20 news datasets

Time slice      DTM     DMM     DCT     DDMM
1               0.359   0.368   0.412   0.482
2               0.382   0.355   0.398   0.436
3               0.305   0.304   0.334   0.406

Table 7

Purity results of the dynamic clustering models on 20 news datasets

Time slice      DTM     DMM     DCT     DDMM
1               0.423   0.427   0.504   0.549
2               0.434   0.435   0.468   0.546
3               0.418   0.426   0.457   0.535
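As a reading aid, the per-slice scores in Tables 4 to 7 can be averaged over the three time slices to compare the models at a glance. The snippet below only re-uses the numbers printed in those tables; the averaging itself and the names `RESULTS` and `summarize` are illustrative assumptions, not part of the paper.

```python
# Scores per time slice (1, 2, 3), copied from Tables 4-7 on this page.
RESULTS = {
    ("Twitter", "NMI"): {
        "DTM": [0.284, 0.327, 0.319], "DMM": [0.311, 0.394, 0.373],
        "DCT": [0.448, 0.422, 0.415], "DDMM": [0.506, 0.527, 0.483],
    },
    ("Twitter", "Purity"): {
        "DTM": [0.412, 0.479, 0.463], "DMM": [0.451, 0.492, 0.478],
        "DCT": [0.543, 0.538, 0.514], "DDMM": [0.597, 0.589, 0.573],
    },
    ("20 news", "NMI"): {
        "DTM": [0.359, 0.382, 0.305], "DMM": [0.368, 0.355, 0.304],
        "DCT": [0.412, 0.398, 0.334], "DDMM": [0.482, 0.436, 0.406],
    },
    ("20 news", "Purity"): {
        "DTM": [0.423, 0.434, 0.418], "DMM": [0.427, 0.435, 0.426],
        "DCT": [0.504, 0.468, 0.457], "DDMM": [0.549, 0.546, 0.535],
    },
}


def summarize(results):
    """Map (dataset, metric) to (mean score per model, DDMM mean minus DCT mean)."""
    summary = {}
    for key, table in results.items():
        means = {model: sum(s) / len(s) for model, s in table.items()}
        summary[key] = (means, means["DDMM"] - means["DCT"])
    return summary


if __name__ == "__main__":
    for (dataset, metric), (means, gap) in summarize(RESULTS).items():
        row = ", ".join(f"{m}={v:.3f}" for m, v in means.items())
        print(f"{dataset} {metric}: {row} (DDMM - DCT = {gap:.3f})")
```

In every table DDMM has the highest score in each time slice, and DCT is the second-best model in each slice, which is why the gap to DCT is the comparison singled out here.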

Fig.5

Estimated cluster labels of the 20 news datasets obtained by the DCT model

Fig.6

Estimated cluster labels of the 20 news datasets obtained by the DDMM model

Fig.7

Influence of γ on the clustering performance of the DDMM model
