A document understanding method for short texts by auxiliary long documents

Journal of Shandong University (Engineering Science), 2018, Vol. 48, Issue (3): 67-74. doi: 10.6040/j.issn.1672-3961.0.2017.402

YAN Yingying^1,2, HUANG Ruizhang^1,2*, WANG Rui^1,2, MA Can^1,2, LIU Bowei^1,2, HUANG Ting^1,2

1. School of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China;
2. Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, Guizhou, China

Received: 2017-08-23 Online: 2018-06-20 Published: 2017-08-23
Corresponding author: HUANG Ruizhang (born 1979), female, from Tianjin, associate professor, Ph.D. Main research interests: data mining and machine learning. E-mail: rzhuang@gzu.edu.cn
About the first author: YAN Yingying (born 1991), female, from Lüliang, Shanxi, master's student. Main research interests: data mining and machine learning. E-mail: yyingy0921@163.com
Funding: National Natural Science Foundation of China (61462011, 61540050); Guizhou University Introduced Talents Research Project (2011015); Guizhou Province Major Applied Basic Research Project (JZ20142001)

Abstract

Abstract: Building on the Dirichlet-multinomial regression (DMR) model, a dual Dirichlet-multinomial regression (DDMR) model is proposed in which auxiliary long documents assist the understanding of short texts. Long documents and short texts from different data sources share a single topic set, while separate Dirichlet priors generate the topic assignments of the long documents and of the short texts. This allows topic knowledge from the long documents to be transferred to the short texts and improves short-text understanding. Experiments show that the DDMR model considerably improves topic discovery on short texts.

Key words: short text understanding, topic model, dual Dirichlet-multinomial regression model
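The abstract states the idea only in words, so the following is a minimal generative sketch under stated assumptions, not the authors' implementation. It assumes the standard DMR form of the prior, where each document's Dirichlet parameters are the exponential of a linear function of its feature vector, and it assumes that the "dual" part means one set of regression weights per source. The names `dmr_alpha`, `generate_corpus`, `lam_long`, `lam_short`, the feature vectors, and all sizes are hypothetical and chosen for illustration.

```python
import numpy as np


def dmr_alpha(features, lam):
    """DMR-style document-specific Dirichlet parameters.

    alpha[d, k] = exp(features[d] . lam[:, k]), so every entry is positive.
    """
    return np.exp(features @ lam)


def generate_corpus(alpha, topic_word, doc_len, rng):
    """Draw one document per row of alpha.

    theta_d ~ Dirichlet(alpha_d); each token draws a topic from theta_d
    and then a word from that topic's row of the shared topic_word matrix.
    """
    n_topics, vocab_size = topic_word.shape
    docs = []
    for alpha_d in alpha:
        theta = rng.dirichlet(alpha_d)
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        words = [int(rng.choice(vocab_size, p=topic_word[z])) for z in topics]
        docs.append(words)
    return docs


rng = np.random.default_rng(0)
n_topics, vocab_size, n_features = 4, 30, 3

# One topic-word matrix shared by both collections (assumption: symmetric prior).
topic_word = rng.dirichlet(np.full(vocab_size, 0.1), size=n_topics)

# Hypothetical per-document features and separate regression weights per source.
x_long = rng.normal(size=(5, n_features))
x_short = rng.normal(size=(8, n_features))
lam_long = rng.normal(scale=0.5, size=(n_features, n_topics))
lam_short = rng.normal(scale=0.5, size=(n_features, n_topics))

# Two different Dirichlet priors over the same topic set.
alpha_long = dmr_alpha(x_long, lam_long)
alpha_short = dmr_alpha(x_short, lam_short)

long_docs = generate_corpus(alpha_long, topic_word, doc_len=200, rng=rng)
short_docs = generate_corpus(alpha_short, topic_word, doc_len=8, rng=rng)
```

Because both collections draw words from the same `topic_word` matrix, the many tokens in the long documents constrain the topics that the sparse short texts are assigned to, which is the transfer effect the abstract describes. Inference (fitting the topics and regression weights from data) is not shown here.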

CLC number: TP391.1

YAN Yingying, HUANG Ruizhang, WANG Rui, MA Can, LIU Bowei, HUANG Ting. A document understanding method for short texts by auxiliary long documents[J]. Journal of Shandong University (Engineering Science), 2018, 48(3): 67-74.


