Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (3): 67-74. doi: 10.6040/j.issn.1672-3961.0.2017.402

A document understanding method for short texts by auxiliary long documents

YAN Yingying1,2, HUANG Ruizhang1,2*, WANG Rui1,2, MA Can1,2, LIU Bowei1,2, HUANG Ting1,2

  1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China;
  2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China
  • Received: 2017-08-23 Online: 2018-06-20 Published: 2017-08-23
  • Corresponding author: HUANG Ruizhang (b. 1979), female, from Tianjin, associate professor, Ph.D.; research interests: data mining and machine learning. E-mail: rzhuang@gzu.edu.cn
  • About the author: YAN Yingying (b. 1991), female, from Lüliang, Shanxi, master's student; research interests: data mining and machine learning. E-mail: yyingy0921@163.com
  • Supported by:
    National Natural Science Foundation of China (61462011, 61540050); Guizhou University Research Fund for Introduced Talents (2011015); Guizhou Province Major Applied Basic Research Project (JZ20142001)

Abstract: Building on the Dirichlet-multinomial regression (DMR) model, a dual Dirichlet-multinomial regression (DDMR) model is proposed that uses auxiliary long documents to help understand short texts. Long documents and short texts from different data sources share a single topic set, while separate Dirichlet priors generate their topic assignments. This allows topic knowledge from the long documents to transfer to the short texts and improves short-text understanding. Experiments show that the DDMR model substantially improves topic discovery on short texts.

Key words: short text understanding, topic model, dual Dirichlet-multinomial regression model
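
The generative idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vector, the regression weights, the topic count, and the vocabulary size below are all hypothetical, and inference is omitted entirely. It assumes the usual DMR form of the prior, alpha_k = exp(x · lambda_k), and shows only that long and short documents draw words from one shared topic set while each source uses its own Dirichlet prior.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dmr_prior(features, lam):
    """DMR-style prior (assumed form): alpha_k = exp(x . lambda_k) per topic k."""
    return [math.exp(sum(x * w for x, w in zip(features, lam_k))) for lam_k in lam]


def generate_document(length, alpha, topics, rng):
    """theta ~ Dir(alpha); for each word: z ~ theta, then w ~ topics[z]."""
    theta = sample_dirichlet(alpha, rng)
    topic_ids = range(len(topics))
    vocab = range(len(topics[0]))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words


rng = random.Random(0)
K, V = 3, 8  # hypothetical number of topics and vocabulary size

# One topic set (topic-word distributions) shared by both data sources.
topics = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]

# Hypothetical regression weights, one set per source (the "dual" part).
lam_long = [[0.5], [0.0], [-0.5]]
lam_short = [[-1.0], [-1.5], [-2.0]]
x = [1.0]  # a single bias feature, for illustration only

alpha_long = dmr_prior(x, lam_long)
alpha_short = dmr_prior(x, lam_short)

theta_long, long_doc = generate_document(200, alpha_long, topics, rng)
theta_short, short_doc = generate_document(8, alpha_short, topics, rng)
```

Because both calls to `generate_document` receive the same `topics`, anything learned about those topic-word distributions from the long documents would also apply to the short texts, which is the transfer mechanism the abstract describes.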

CLC number: TP391.1