Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (6): 8-18. doi: 10.6040/j.issn.1672-3961.0.2018.193

A short text dynamic clustering approach bias on new topic

Yingxue ZHU1,2, Ruizhang HUANG1,2,*, Can MA1,2

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
  2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2018-05-31 Online: 2018-12-20 Published: 2018-12-26
  • Contact: Ruizhang HUANG E-mail: zhuyingxue1993@gmail.com; rzhuang@gzu.edu.cn
  • About the author: Yingxue ZHU (1993-), female, from Bijie, Guizhou; master's student; main research interests: data mining and machine learning. E-mail: zhuyingxue1993@gmail.com


To address the dynamic clustering of short-text data streams, a dynamic Dirichlet multinomial mixture (DDMM) model is proposed. The model captures how the topics in a short-text data stream change over time. It also takes the relationship between existing historical topics and new topics into account: the strength of topic inheritance can be adjusted, which increases the likelihood that new topics emerge. The number of clusters is estimated automatically during Gibbs sampling. Experiments on a synthetic data set and on real data sets indicate that the DDMM model is effective. A comparison with traditional dynamic clustering approaches shows that the DDMM model clusters text streams effectively and performs well on short texts.

Key words: dynamic clustering, new topic bias, Gibbs sampling, topic model, text mining


  • CLC number: TP391.1



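Symbol          Meaning
d, z, w         Document, topic, word
t               Time
K               Initial number of clusters
K*              Number of clusters actually estimated
V               Vocabulary size
d_t             Set of documents in time slice t
N_d             Number of words in document d
N_{d,w}         Number of occurrences of word w in document d
Θ_t, Φ_t        Topic distribution in time slice t; word distribution in time slice t
γ, α_t, β_t     Prior parameters of the model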
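
The notation above can be made concrete with a small sketch. The code below is a collapsed Gibbs sampler for a static Dirichlet multinomial mixture over a single time slice, in the style of the well-known GSDMM sampler. It is not the DDMM model itself: the time-dependent priors and the new-topic bias are not reproduced, because their equations are not given in this page. The function name `gsdmm`, the fixed values of `alpha` and `beta`, and the toy documents are assumptions made for illustration. The sketch shows how sampling starts from K clusters and how the number of non-empty clusters at the end plays the role of K*.

```python
import math
import random
from collections import Counter

def gsdmm(docs, K=8, alpha=0.1, beta=0.1, iters=20, seed=0):
    """Collapsed Gibbs sampling for a static Dirichlet multinomial mixture.

    docs: list of token lists (one time slice). Returns one cluster label per document.
    Illustrative only: the priors are fixed scalars here, not the paper's alpha_t, beta_t.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})    # vocabulary size V
    counts = [Counter(d) for d in docs]      # N_{d,w} for every document
    m = [0] * K                              # number of documents in each cluster
    n = [0] * K                              # number of words in each cluster
    nw = [Counter() for _ in range(K)]       # per-cluster word counts
    z = []
    for d, c in zip(docs, counts):           # random initial assignment
        k = rng.randrange(K)
        z.append(k)
        m[k] += 1
        n[k] += len(d)
        nw[k].update(c)

    for _ in range(iters):
        for i, (d, c) in enumerate(zip(docs, counts)):
            k = z[i]                         # remove document i from its cluster
            m[k] -= 1
            n[k] -= len(d)
            nw[k].subtract(c)
            logp = []
            for k in range(K):               # unnormalised log conditional per cluster
                lp = math.log(m[k] + alpha)
                for w, cnt in c.items():
                    for j in range(cnt):
                        lp += math.log(nw[k][w] + beta + j)
                for j in range(len(d)):      # N_d factors in the denominator
                    lp -= math.log(n[k] + V * beta + j)
                logp.append(lp)
            mx = max(logp)                   # subtract the max for numerical stability
            weights = [math.exp(lp - mx) for lp in logp]
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k                         # add document i to the sampled cluster
            m[k] += 1
            n[k] += len(d)
            nw[k].update(c)
    return z

# Toy short texts (hypothetical data, two obvious themes).
docs = [["apple", "banana", "fruit"], ["banana", "fruit", "juice"],
        ["goal", "match", "team"], ["team", "coach", "goal"],
        ["fruit", "apple", "juice"], ["match", "coach", "team"]]
labels = gsdmm(docs, K=6)
k_star = len(set(labels))                    # non-empty clusters, i.e. the estimated K*
print(labels, k_star)
```

Because clusters that lose all their documents become progressively less likely to be chosen again, K acts only as an upper bound and the count of occupied clusters is read off after sampling.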





Time slice  Class labels (number of documents)
1           0(50), 1(50), 2(50)
2           0(50), 1(50), 2(50), 3(50)
3           0(50), 1(50), 2(50), 3(50), 4(50)
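
The table above can be read as a stream in which earlier classes persist and one new class appears in each later time slice. The following sketch encodes the table as a dictionary and lists the labels that first appear in each slice. The counts are copied from the table; the names `slices` and `new_labels` are hypothetical and chosen only for this illustration.

```python
# Class label -> number of documents, per time slice (values copied from the table).
slices = {
    1: {0: 50, 1: 50, 2: 50},
    2: {0: 50, 1: 50, 2: 50, 3: 50},
    3: {0: 50, 1: 50, 2: 50, 3: 50, 4: 50},
}

def new_labels(slices):
    """Return, for each time slice, the labels not seen in any earlier slice."""
    seen, result = set(), {}
    for t in sorted(slices):
        result[t] = sorted(set(slices[t]) - seen)
        seen |= set(slices[t])
    return result

emerging = new_labels(slices)                               # labels first seen per slice
sizes = {t: sum(c.values()) for t, c in slices.items()}     # documents per slice
print(emerging)  # {1: [0, 1, 2], 2: [3], 3: [4]}
print(sizes)     # {1: 150, 2: 200, 3: 250}
```

Slices 2 and 3 each introduce exactly one class that was absent before, which is the situation a clustering method biased toward new topics is meant to handle.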












1 0.284 0.311 0.448 0.506
2 0.327 0.394 0.422 0.527
3 0.319 0.373 0.415 0.483



1 0.412 0.451 0.543 0.597
2 0.479 0.492 0.538 0.589
3 0.463 0.478 0.514 0.573



1 0.359 0.368 0.412 0.482
2 0.382 0.355 0.398 0.436
3 0.305 0.304 0.334 0.406



1 0.423 0.427 0.504 0.549
2 0.434 0.435 0.468 0.546
3 0.418 0.426 0.457 0.535







