JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, Vol. 48, Issue (3): 67-74. doi: 10.6040/j.issn.1672-3961.0.2017.402


A document understanding method for short texts by auxiliary long documents

YAN Yingying1,2, HUANG Ruizhang1,2*, WANG Rui1,2, MA Can1,2, LIU Bowei1,2, HUANG Ting1,2   

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China;
    2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2017-08-23 Online: 2018-06-20 Published: 2017-08-23

Abstract: Building on the Dirichlet-multinomial regression (DMR) model, a dual Dirichlet-multinomial regression (DDMR) model was proposed that uses auxiliary long documents to improve the understanding of short texts. Long documents and short texts, which came from different data sources, shared a single topic set, while two separate Dirichlet priors generated the topic allocations of the long documents and of the short texts. This allowed topic knowledge learned from the long documents to be transferred to the short texts. Experiments showed that the DDMR model was effective for topic discovery in short texts.
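The generative idea in the abstract (one shared topic set, two source-specific Dirichlet priors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sizes, and the prior values below are all hypothetical, and the fixed prior vectors stand in for whatever feature-dependent priors the DDMR model actually learns.

```python
import random

def sample_dirichlet(alpha, rng):
    # Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topics, length, rng):
    # theta: this document's topic proportions, drawn from its source's prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic from theta, then a word id from that shared topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(range(len(topics[z])), weights=topics[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
NUM_TOPICS, VOCAB_SIZE = 3, 8  # illustrative sizes, not taken from the paper

# One topic set (topic-word distributions) shared by both data sources.
shared_topics = [sample_dirichlet([0.1] * VOCAB_SIZE, rng) for _ in range(NUM_TOPICS)]

# Two separate Dirichlet priors, one per source (values are assumptions).
# In DMR the prior is computed from document features; fixed vectors are used here.
alpha_long = [0.5] * NUM_TOPICS
alpha_short = [0.1] * NUM_TOPICS

theta_long, long_doc = generate_document(alpha_long, shared_topics, 200, rng)
theta_short, short_doc = generate_document(alpha_short, shared_topics, 10, rng)
```

Because both calls draw words from the same `shared_topics`, anything inferred about those topics from the long documents also constrains how the sparse short texts are explained, which is the transfer effect the abstract describes.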

Key words: short text understanding, dual Dirichlet-multinomial regression model, topic model

CLC Number: TP391.1