山东大学学报 (工学版) ›› 2018, Vol. 48 ›› Issue (6): 37-43.doi: 10.6040/j.issn.1672-3961.0.2018.204
Qingtao QU(),Qicheng LIU*(),Chunxiao MU
摘要:
针对传统的向量空间模型及一元语法模型表示话题的文本特征时忽略词语之间语序关系的问题,提出一种基于N-Gram语言模型的并行自适应新闻话题追踪算法。使用N-Gram语言模型,利用新闻报道中词语间的语序关系进行文本表示,根据贝叶斯分类算法进行话题追踪,利用最小特征平均可信度阈值更新策略,采用测试新闻报道更新训练集,完善话题模型,并在MapReduce分布式计算模型上予以实现。试验表明,该算法不仅有效地提高了话题追踪效果,而且具有良好的并行加速比和可扩展性。
中图分类号:
1 | 中国互联网信息中心.中国互联网络发展状况统计报告[R/OL]. [2018-3-5]. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201803/t20180305_70249.htm. |
2 |
游丹丹, 陈福集. 我国网络舆情热点话题发现研究综述[J]. 现代情报, 2017, 37 (3): 165- 171.
doi: 10.3969/j.issn.1008-0821.2017.03.029 |
YOU Dandan , CHEN Fuji . The literature review about the hotspot topic detection of network public opinion in China[J]. Journal of Modern Information, 2017, 37 (3): 165- 171.
doi: 10.3969/j.issn.1008-0821.2017.03.029 |
|
3 | CARBONELL J. CMU Report on TDT-2: segmentation, detection and tracking[C]//Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. San Francisco, USA: Morgan Kaufmann, 1999: 117-120. |
4 | SCHULTZ J M. Topic detection and tracking using idf-weighted cosine coefficient[C]//Proceedings of the DARPA Broadcast News Workshop. San Francisco, USA: Morgan Kaufmann, 1999: 189-192. |
5 | 武军娜.自适应话题跟踪技术研究[D].保定:华北电力大学, 2013. |
WU Junna. Research on technologies of adaptive topic tracking[D]. Baoding: North China Electric Power University, 2013. | |
6 | RAGHAVAN V V , WONG S K M . A critical analysis of vector space model for information retrieval[J]. Journal of the Association for Information Science & Technology, 1986, 37 (5): 279- 287. |
7 | 王会珍,朱靖波,陈文亮,等.基于一元语法模型的中文话题追踪[C]//第二届全国学生计算语言学研讨会[出版地不详]: [出版者不详], 2004: 422-427. |
WANG Huizhen, ZHU Jingbo, CHEN Wenliang, et al. Chinese topic tracking based on unigram model [C]//2nd Student Workshop on Computational Linguistics.[S.1]: [s.n], 2004: 422-427. | |
8 |
张辉, 周敬民, 王亮, 等. 基于三维文档向量的自适应话题追踪器模型[J]. 中文信息学报, 2010, 24 (5): 70- 76.
doi: 10.3969/j.issn.1003-0077.2010.05.012 |
ZHANG Hui , ZHOU Jingmin , WANG Liang , et al. An adaptive topic tracking model based on 3-dimension document vector[J]. Journal of Chinese Information Processing, 2010, 24 (5): 70- 76.
doi: 10.3969/j.issn.1003-0077.2010.05.012 |
|
9 |
王会珍, 朱靖波, 季铎, 等. 基于反馈学习自适应的中文话题追踪[J]. 中文信息学报, 2006, 20 (3): 92- 98.
doi: 10.3969/j.issn.1003-0077.2006.03.014 |
WANG Huizhen , ZHU Jingbo , JI Duo , et al. Adaptive Chinese topic tracking based on feedback learning[J]. Journal of Chinese Information Processing, 2006, 20 (3): 92- 98.
doi: 10.3969/j.issn.1003-0077.2006.03.014 |
|
10 | 毛伟, 徐蔚然, 郭军. 基于N-Gram语言模型和链状朴素贝叶斯分类器的中文文本分类系统[J]. 中文信息学报, 2006, 20 (3): 31- 37. |
MAO Wei , XU Weiran , GUO Jun . A Chinese text classifier based on N-Gram language model and chain augmented naïve bayesian classifier[J]. Journal of Chinese Information Processing, 2006, 20 (3): 31- 37. | |
11 | 胡睿.基于贝叶斯分类的中文垃圾邮件过滤方法研究和改进[D].北京:清华大学, 2006. |
HU Rui. Research and improvement of Chinese spam emails filtering method based on bayesian classification[D]. Beijing: Tsinghua University, 2006. | |
12 | 李超, 刘辉. 一种基于关联分析与N-Gram的错误参数检测方法[J]. 软件学报, 2018, 29 (8): 1- 15. |
LI Chao , LIU Hui . Association analysis and N-Gram based detection of incorrect arguments[J]. Journal of Software, 2018, 29 (8): 1- 15. | |
13 |
柏文言, 张闯, 徐克付, 等. 一种融合用户关系的自适应微博话题跟踪方法[J]. 电子学报, 2017, 45 (6): 1375- 1381.
doi: 10.3969/j.issn.0372-2112.2017.06.014 |
BAI Wenyan , ZHANG Chuang , XU Kefu , et al. A self-adaptive microblog topic tracking method by user relationship[J]. Chinese Journal of Electronics, 2017, 45 (6): 1375- 1381.
doi: 10.3969/j.issn.0372-2112.2017.06.014 |
|
14 |
魏景璇, 鲁燃, 张艳辉. 基于动态阈值和命名实体的双重过滤话题追踪[J]. 计算机应用研究, 2015, 32 (4): 982- 985.
doi: 10.3969/j.issn.1001-3695.2015.04.005 |
WEI Jingxuan , LU Ran , ZHANG Yanhui . Double filtering based on dynamic threshold and named entity of topic tracking[J]. Application Research of Computers, 2015, 32 (4): 982- 985.
doi: 10.3969/j.issn.1001-3695.2015.04.005 |
|
15 | 彭敏, 官宸宇, 朱佳晖, 等. 面向社交媒体文本的话题检测与追踪技术研究综述[J]. 武汉大学学报(理学版), 2016, 62 (3): 197- 217. |
PENG Min , GUAN Chenyu , ZHU Jiahui , et al. A survey on topic detection and tracking in social media text[J]. Journal of Wuhan University(Natural Science Edition), 2016, 62 (3): 197- 217. |
[1] | 于江德1,赵红丹1,郑勃举1,余正涛2. 基于中文人名用字特征的性别判定方法[J]. 山东大学学报(工学版), 2014, 44(1): 13-18. |
|