Journal of Shandong University (Engineering Science), 2018, Vol. 48, Issue (3): 67-74. doi: 10.6040/j.issn.1672-3961.0.2017.402
YAN Yingying (闫盈盈)1,2, HUANG Ruizhang (黄瑞章)1,2*, WANG Rui (王瑞)1,2, MA Can (马灿)1,2, LIU Bowei (刘博伟)1,2, HUANG Ting (黄庭)1,2
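Abstract: Building on the Dirichlet-multinomial regression (DMR) model, this paper proposes a dual Dirichlet-multinomial regression (DDMR) model in which long texts assist the understanding of short texts. Long and short texts from different data sources share a single set of topics, while different Dirichlet priors generate the topic assignments of the long texts and of the short texts. As a result, topical knowledge from the long texts can be transferred to the short texts, improving short-text understanding. Experiments show that the DDMR model substantially improves topic discovery on short texts.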
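The idea in the abstract (one shared topic set, separate Dirichlet priors for long and short texts) can be illustrated with a small generative sketch. This is not the authors' implementation: the abstract gives no equations, so the code assumes the standard DMR form in which a document's Dirichlet parameters are the exponential of a linear function of its features. The names `sample_corpus`, `lam_long`, `lam_short`, the feature vectors, and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_corpus(features, lam, phi, doc_len, rng):
    """Sample documents whose topic prior depends on per-document features.

    Assumption (DMR-style, not taken from the abstract):
    alpha_d = exp(x_d @ lam), theta_d ~ Dirichlet(alpha_d).
    """
    num_topics, vocab_size = phi.shape
    docs = []
    for x in features:
        alpha = np.exp(x @ lam)                # document-specific Dirichlet parameters, shape (K,)
        theta = rng.dirichlet(alpha)           # topic proportions of this document
        topics = rng.choice(num_topics, size=doc_len, p=theta)  # one topic per token
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in topics]
        docs.append(words)
    return docs

rng = np.random.default_rng(0)
K, V, F = 4, 30, 3                             # topics, vocabulary size, feature dimension

# One topic-word matrix shared by both collections.
phi = rng.dirichlet(np.full(V, 0.5), size=K)   # shape (K, V)

# Separate regression weights, hence different Dirichlet priors per collection.
lam_long = rng.normal(0.0, 0.5, size=(F, K))
lam_short = rng.normal(0.0, 0.5, size=(F, K))

x_long = rng.normal(size=(5, F))               # hypothetical features of 5 long documents
x_short = rng.normal(size=(8, F))              # hypothetical features of 8 short documents

long_docs = sample_corpus(x_long, lam_long, phi, doc_len=200, rng=rng)
short_docs = sample_corpus(x_short, lam_short, phi, doc_len=8, rng=rng)
```

Because both collections draw words from the same `phi`, anything learned about the topics from the word-rich long documents applies directly to the sparse short ones. Only the priors over topic proportions differ between the two sources.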