Journal of Shandong University (Engineering Science), 2018, Vol. 48, Issue (6): 37-43. doi: 10.6040/j.issn.1672-3961.0.2018.204

• Machine Learning & Data Mining •

A parallel adaptive news topic tracking algorithm based on N-Gram language model

Qingtao QU, Qicheng LIU*, Chunxiao MU

  1. School of Computer and Control Engineering, Yantai University, Yantai 264005, Shandong, China
  • Received:2018-05-25 Online:2018-12-20 Published:2018-12-26
  • Contact: Qicheng LIU E-mail:992883600@qq.com;ytliuqc@163.com
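  • Supported by:
    Natural Science Foundation of Shandong Province (ZR2016FM42); Key Research and Development Program of Shandong Province (2016GGX109004); State Oceanic Administration "13th Five-Year" Key Demonstration Project for Innovative Development of the Marine Economy (YHC-ZB-P201701); National Natural Science Foundation of China (61702439)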

Abstract:

When the traditional vector space model and the unigram model are used to express the text features of a topic, the word order relations between words are ignored. To address this issue, a parallel adaptive news topic tracking algorithm based on the N-Gram language model was proposed. The N-Gram language model was used to express the text features, which made use of the word order relations in news reports. The Bayes classification algorithm was applied to conduct topic tracking. Under a minimum feature average confidence threshold update strategy, the training set was updated with test news reports to improve the topic model. The parallel adaptive news topic tracking algorithm based on the N-Gram language model (PATT-Gram) was implemented on the MapReduce distributed computing model. Experiments showed that the algorithm effectively improved the topic tracking effect and had good parallel speedup and scalability.
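The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a bigram model (N = 2), add-one smoothing, equal class priors, and a simple average log-likelihood ratio as a stand-in for the paper's "minimum feature average confidence" criterion, whose exact definition is not given on this page. The names `BigramTopicModel`, `track`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs, with a start marker so the first word has a context."""
    padded = ["<s>"] + list(tokens)
    return list(zip(padded, padded[1:]))


class BigramTopicModel:
    """Toy bigram language model for one class (on-topic or off-topic)."""

    def __init__(self):
        self.pair_counts = Counter()     # counts of (previous word, word)
        self.context_counts = Counter()  # counts of the previous word alone
        self.vocab = set()

    def update(self, tokens):
        """Add one tokenized story to the model."""
        for prev, word in bigrams(tokens):
            self.pair_counts[(prev, word)] += 1
            self.context_counts[prev] += 1
            self.vocab.add(word)

    def log_prob(self, tokens, vocab_size):
        """Log-probability of a story; add-one smoothing is an assumption of this sketch."""
        total = 0.0
        for prev, word in bigrams(tokens):
            numerator = self.pair_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def track(on_topic, off_topic, stories, threshold):
    """Label each story, and fold confidently on-topic stories back into the topic model."""
    labels = []
    for tokens in stories:
        vocab_size = len(on_topic.vocab | off_topic.vocab) + 1  # +1 for unseen words
        n = max(len(tokens), 1)
        # Average per-bigram log-likelihood ratio, used here as the confidence score.
        score = (on_topic.log_prob(tokens, vocab_size)
                 - off_topic.log_prob(tokens, vocab_size)) / n
        is_on_topic = score > 0
        labels.append(is_on_topic)
        if is_on_topic and score >= threshold:
            on_topic.update(tokens)  # adaptive step: the training set grows
    return labels
```

Because each probability is conditioned on the previous word, two stories with the same words in a different order receive different scores, which is the property a unigram or vector space representation lacks.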

Key words: topic tracking, N-Gram language model, naive Bayes classification, MapReduce computational model

CLC Number: 

  • TP391

Fig.1 Computational model of MapReduce
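The figure itself is not reproduced here. As a rough illustration of the map, shuffle, and reduce phases, the following single-process sketch counts bigrams per topic label, which is the kind of statistic an N-Gram model needs. It only simulates the computational model in plain Python and is not Hadoop code; `map_story`, `shuffle`, `reduce_counts`, and `run_job` are hypothetical names.

```python
from collections import defaultdict
from itertools import chain


def map_story(record):
    """Map: emit ((topic_label, bigram), 1) for each adjacent word pair in one story."""
    label, tokens = record
    for prev, word in zip(tokens, tokens[1:]):
        yield (label, (prev, word)), 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the partial counts for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of (label, tokens) records."""
    mapped = chain.from_iterable(map_story(r) for r in records)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map calls run on different machines over different input splits, and each reducer receives only the keys assigned to it; the data flow is otherwise the same.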

Fig.2 Algorithm of PATT-Gram

Table 1 The experimental results of adaptivity (%)

Data size/GB   Algorithm   Pmicro   Rmicro   F1micro
0.97           PTT-Gram    83       81       81
0.97           PATT-Gram   90       75       82
2.10           PTT-Gram    85       85       85
2.10           PATT-Gram   90       81       85
3.80           PTT-Gram    83       83       83
3.80           PATT-Gram   88       78       83
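The Pmicro, Rmicro, and F1micro columns are micro-averaged precision, recall, and F1. The sketch below assumes the conventional definition, in which true positives, false positives, and false negatives are pooled over all topics before the ratios are taken; the page does not spell out the formula, and `micro_metrics` is a hypothetical helper.

```python
def micro_metrics(per_topic_counts):
    """Micro-averaged precision, recall, and F1.

    per_topic_counts: iterable of (true_positives, false_positives, false_negatives),
    one tuple per topic.
    """
    counts = list(per_topic_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this definition, a method that raises precision while lowering recall can end up with nearly the same F1, which is the pattern visible in the PTT-Gram and PATT-Gram rows.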

Table 2 The experimental results of topic tracking (%)

Data size/GB   Algorithm   Pmicro   Rmicro   F1micro
0.97           PATT-NB     90       44       59
0.97           PATT-Gram   90       75       82
2.10           PATT-NB     91       56       69
2.10           PATT-Gram   90       81       85
3.80           PATT-NB     88       56       68
3.80           PATT-Gram   88       78       83

Table 3 The parallel processing time of the proposed algorithm (s)

Data size/GB   1 machine   2 machines   3 machines   4 machines   5 machines
0.97           3086        1841         1349         1077         859
2.10           7224        4045         2822         2135         1762
3.80           13848       7611         5076         3950         3236

Table 4 The speed ratio of the proposed algorithm

Data size/GB   2 machines   3 machines   4 machines   5 machines
0.97           1.67         2.28         2.86         3.59
2.10           1.78         2.59         3.38         4.17
3.80           1.82         2.73         3.51         4.27
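A speed ratio is usually the single-machine time divided by the time on p machines. The sketch below assumes that definition (the page does not state the formula) and applies it to the times in Table 3; values recomputed this way need not agree with every printed entry of Table 4. The parallel efficiency, speedup divided by machine count, is added for illustration and does not appear in the source. `speedup` and `efficiency` are hypothetical helper names.

```python
def speedup(times):
    """times[i] is the running time on i + 1 machines; returns T1 / Tp for each p."""
    base = times[0]
    return [base / t for t in times]


def efficiency(times):
    """Parallel efficiency: speedup on p machines divided by p."""
    return [s / (p + 1) for p, s in enumerate(speedup(times))]


# Processing times in seconds from Table 3, keyed by data size in GB.
table3 = {
    0.97: [3086, 1841, 1349, 1077, 859],
    2.10: [7224, 4045, 2822, 2135, 1762],
    3.80: [13848, 7611, 5076, 3950, 3236],
}

if __name__ == "__main__":
    for size, times in table3.items():
        ratios = ", ".join(f"{s:.2f}" for s in speedup(times)[1:])
        print(f"{size} GB: {ratios}")
```

The ratios grow with the data size, which matches the scalability claim in the abstract: a larger input amortizes the fixed cost of job start-up and communication over more useful work.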