JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, Vol. 48, Issue (3): 67-74. doi: 10.6040/j.issn.1672-3961.0.2017.402

A document understanding method for short texts by auxiliary long documents

YAN Yingying1,2, HUANG Ruizhang1,2*, WANG Rui1,2, MA Can1,2, LIU Bowei1,2, HUANG Ting1,2   

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China;
  2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2017-08-23 Online: 2018-06-20 Published: 2017-08-23

Abstract: Building on the Dirichlet-multinomial regression (DMR) model, a dual Dirichlet-multinomial regression (DDMR) model was proposed that uses auxiliary long documents to help understand short texts. Long documents and short texts, which came from different data sources, shared a single topic set, and two separate Dirichlet priors were used to generate the topic allocations of the long documents and of the short texts. This allowed topic knowledge from the long documents to be transferred to the short texts and improved short-text understanding. Experiments showed that the DDMR model was effective at topic discovery for short texts.
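
The generative idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the model's equations, so the sketch assumes the standard DMR form of the prior (alpha_k = exp(weights_k . features)) and assumes that the "two Dirichlet priors" are obtained from one hypothetical weight matrix per data source. The function names, feature vectors, and all numeric settings are invented for illustration, and no inference step is shown.

```python
import math
import random


def dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dmr_alpha(weights, features):
    """DMR-style prior: alpha_k = exp(weights_k . features)."""
    return [math.exp(sum(w * x for w, x in zip(row, features)))
            for row in weights]


def sample_corpus(n_topics=4, vocab_size=30, n_features=3, seed=0):
    rng = random.Random(seed)

    # One topic set (topic-word distributions) shared by both sources.
    phi = [dirichlet(rng, [0.5] * vocab_size) for _ in range(n_topics)]

    # Assumption: each source has its own regression weights, which gives
    # long documents and short texts different Dirichlet priors.
    weights = {
        source: [[rng.gauss(0.0, 0.5) for _ in range(n_features)]
                 for _ in range(n_topics)]
        for source in ("long", "short")
    }

    docs = []
    # (source, number of documents, words per document): all hypothetical.
    for source, n_docs, length in (("long", 5, 200), ("short", 5, 8)):
        for _ in range(n_docs):
            features = [rng.random() for _ in range(n_features)]
            alpha = dmr_alpha(weights[source], features)
            theta = dirichlet(rng, alpha)  # document-topic distribution
            topics = rng.choices(range(n_topics), weights=theta, k=length)
            words = [rng.choices(range(vocab_size), weights=phi[k])[0]
                     for k in topics]
            docs.append({"source": source, "alpha": alpha,
                         "theta": theta, "words": words})
    return phi, docs


if __name__ == "__main__":
    phi, docs = sample_corpus()
    for doc in docs:
        print(doc["source"], len(doc["words"]),
              [round(t, 2) for t in doc["theta"]])
```

Because both sources draw words from the same `phi`, anything learned about the topics from the long documents would also apply to the short texts, which is the transfer effect the abstract describes.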

Key words: short text understanding, dual Dirichlet-multinomial regression model, topic model

CLC Number: TP391.1