
山东大学学报(工学版) ›› 2019, Vol. 49 ›› Issue (2): 34-41. doi: 10.6040/j.issn.1672-3961.0.2018.197

• 机器学习与数据挖掘 •

基于word2vec词模型的中文短文本分类方法

高明霞, 李经纬

  1. 北京工业大学信息学部, 北京 100124
  • 收稿日期:2018-05-31 出版日期:2019-04-20 发布日期:2019-04-19
  • 作者简介:高明霞(1973—),女,河北张家口人,工程师,博士,主要研究方向为数据挖掘与知识工程.E-mail: gaomx@bjut.edu.cn
  • 基金资助:
北京市MRI和脑信息重点实验室基金(20160201);数字出版国家重点实验室基金(Q5007013201501);计算机学院院级科研项目(2018JSJKY008)

Chinese short text classification method based on word2vec embedding

Mingxia GAO, Jingwei LI

  1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • Received:2018-05-31 Online:2019-04-20 Published:2019-04-19
  • Supported by:
北京市MRI和脑信息重点实验室基金(20160201);数字出版国家重点实验室基金(Q5007013201501);计算机学院院级科研项目(2018JSJKY008)

摘要:

针对短文本因字数受限而导致文本特征表达能力弱、进而制约分类效果的问题,提出基于word2vec维基百科词模型的中文短文本分类方法(Chinese short text classification method based on embedding trained by word2vec from Wikipedia, CSTC-EWW),并针对新浪爱问4个主题的短文本集进行相关试验。首先基于维基百科语料库训练并获取word2vec词模型,然后建立基于此模型的短文本特征,通过SVM、贝叶斯等经典分类器对短文本进行分类。试验结果表明:本研究提出的方法可以有效进行短文本分类,最好情况下的F-度量值可达到81.8%;和词袋(bag-of-words, BOW)模型结合词频-逆文件频率(term frequency-inverse document frequency, TF-IDF)加权表达特征的短文本分类方法以及同样引入外来维基百科语料扩充特征的短文本分类方法相比,本研究分类效果更好,最好情况下的F-度量提高45.2%。

关键词: 短文本, 中文文本分类, 维基百科, word2vec, 词模型

Abstract:

In short text classification, the weak feature expression caused by the limited number of words was the main factor restricting classification performance. To address this problem, a Chinese short text classification method based on embeddings trained by word2vec from Wikipedia (CSTC-EWW) was proposed, and a series of experiments on short texts covering 4 topics from the iask.com website was carried out. The method first trained the embeddings with word2vec on the Wikipedia corpus, then built short text features based on these embeddings, and finally classified the short texts with Naive Bayes and SVM classifiers. The experimental results showed that CSTC-EWW could effectively classify short texts, with the best F-measure reaching 81.8%; compared with the feature expression of the BOW model weighted by TF-IDF and with a method that also extended features from Wikipedia, the classification results of CSTC-EWW were significantly better, and the F-measure on the car topic increased by up to 45.2%.
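
The pipeline described in the abstract (train word2vec on Wikipedia, embed each short text, classify with Naive Bayes or SVM) can be illustrated with a minimal sketch. It uses gensim and scikit-learn as assumed stand-ins for the paper's actual tooling (the paper reports Naive Bayes and LibSVM results), and the toy corpus, segmentation, and labels below are placeholders rather than the Wikipedia dump or Sina iAsk data.

import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Step 1: train word2vec; in the paper this is done on the segmented Chinese
# Wikipedia corpus, here a tiny toy corpus keeps the sketch runnable.
corpus = [["汽车", "发动机", "保养"], ["健康", "饮食", "运动"],
          ["汽车", "轮胎", "更换"], ["健康", "睡眠", "习惯"]] * 50
w2v = Word2Vec(corpus, vector_size=50, window=3, sample=1e-3, min_count=1)

# Step 2: represent each short text as the mean of its word vectors (embedding-ave).
def embed_ave(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Step 3: classify the embedded short texts (a linear SVM stands in here).
texts = [["发动机", "保养"], ["饮食", "运动"], ["轮胎", "更换"], ["睡眠", "习惯"]]
labels = ["汽车", "健康", "汽车", "健康"]
X = np.vstack([embed_ave(t, w2v) for t in texts])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))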

Key words: short texts, Chinese text classification, Wikipedia, word2vec, embedding

中图分类号: TP391

图1 中文短文本分类流程

图2 借助平均值表达特征

图3 结合TF-IDF加权表达特征
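
图2 and 图3 correspond to the two feature representations evaluated below as embedding-ave and embedding-tfidf. The following is a hedged sketch of both constructions, assuming a gensim KeyedVectors-like object wv and pre-segmented token lists; normalizing the weighted sum by the total TF-IDF weight is an assumption, not necessarily the paper's exact formula.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def embedding_ave(tokens, wv, dim):
    # 图2: unweighted mean of the word vectors of in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def embedding_tfidf(docs_tokens, wv, dim):
    # 图3: each word vector is scaled by the word's TF-IDF weight in that
    # document before averaging.
    vectorizer = TfidfVectorizer(analyzer=lambda toks: toks)  # inputs are token lists
    tfidf = vectorizer.fit_transform(docs_tokens)
    vocab = vectorizer.vocabulary_
    features = np.zeros((len(docs_tokens), dim))
    for i, tokens in enumerate(docs_tokens):
        weighted, total = np.zeros(dim), 0.0
        for t in tokens:
            if t in wv and t in vocab:
                w = tfidf[i, vocab[t]]
                weighted += w * wv[t]
                total += w
        if total > 0:
            features[i] = weighted / total
    return features

# Tiny demo with a plain dict standing in for a trained word-vector model.
wv = {"汽车": np.array([1.0, 0.0]), "保养": np.array([0.5, 0.5])}
print(embedding_tfidf([["汽车", "保养"], ["汽车"]], wv, 2))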

表1 不同覆盖率下的短文本数目

覆盖率/% 家庭与生活 健康与医学 汽车 商业经济 合计
50 10 000 10 000 10 000 10 000 40 000
75 9 717 9 585 9 579 9 710 38 591
100 8 272 7 742 8 064 8 668 32 746

表2 word2vec参数变化对分类效果的影响

词向量维数 窗口大小 采样阈值 特征表示方法 F-度量值/%
200 3 10⁻³ embedding-ave 79.5
200 3 10⁻³ embedding-tfidf 81.3
200 3 10⁻⁵ embedding-ave 77.9
200 3 10⁻⁵ embedding-tfidf 81.1
200 5 10⁻³ embedding-ave 81.1
200 5 10⁻³ embedding-tfidf 80.4
200 5 10⁻⁵ embedding-ave 80.4
200 5 10⁻⁵ embedding-tfidf 81.0
400 3 10⁻³ embedding-ave 79.2
400 3 10⁻³ embedding-tfidf 81.7
400 3 10⁻⁵ embedding-ave 78.3
400 3 10⁻⁵ embedding-tfidf 81.4
400 5 10⁻³ embedding-ave 80.2
400 5 10⁻³ embedding-tfidf 80.4
400 5 10⁻⁵ embedding-ave 79.4
400 5 10⁻⁵ embedding-tfidf 81.2
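
The grid in 表2 (词向量维数 200/400, 窗口大小 3/5, 采样阈值 10⁻³/10⁻⁵) maps directly onto word2vec training parameters. Below is a minimal sketch of running such a grid with gensim; the toy corpus stands in for the segmented Wikipedia dump, and the CBOW/skip-gram choice is left at gensim's default since it is not stated on this page.

from itertools import product
from gensim.models import Word2Vec

# Toy stand-in for the segmented Wikipedia corpus actually used in the paper.
sentences = [["短", "文本", "分类"], ["维基", "百科", "语料"]] * 100

# The grid evaluated in 表2: vector dimension, context window, subsampling threshold.
for dim, window, sample in product([200, 400], [3, 5], [1e-3, 1e-5]):
    model = Word2Vec(sentences, vector_size=dim, window=window,
                     sample=sample, min_count=1)
    model.save(f"wiki_w2v_d{dim}_w{window}_s{sample}.model")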

表3 不同覆盖率下的F-度量值(单位: %)

模型 方法 分类器 覆盖率50% 覆盖率75% 覆盖率100%
模型a embedding-ave Naive Bayes 66.7 66.9 64.4
模型a embedding-ave LibSVM 79.2 79.5 77.5
模型a embedding-tfidf Naive Bayes 62.7 63.7 61.2
模型a embedding-tfidf LibSVM 81.7 81.8 80.6
模型b embedding-ave Naive Bayes 68.9 69.1 66.5
模型b embedding-ave LibSVM 78.3 78.3 76.0
模型b embedding-tfidf Naive Bayes 65.1 65.9 63.2
模型b embedding-tfidf LibSVM 81.4 81.5 79.6

图4 传统特征表示方法与本研究中2种特征表达方式的比较
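
图4 compares the traditional representation, the BOW model weighted by TF-IDF mentioned in the abstract, with the two embedding-based features above. A minimal sketch of that baseline follows, assuming scikit-learn; the pre-segmented example texts and labels are toy placeholders, not the Sina iAsk data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Short texts are assumed to be pre-segmented and joined with spaces.
docs = ["发动机 保养 方法", "饮食 运动 健康", "轮胎 更换 周期", "睡眠 习惯 健康"]
labels = ["汽车", "健康", "汽车", "健康"]

baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(docs, labels)
print(baseline.predict(["汽车 保养"]))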

图5 CSTC-EWW方法与范云杰等人模型的比较

1 刘英涛.短文本分类研究[D].重庆:重庆理工大学, 2016.
LIU Yingtao. Research on short text classification[D]. Chongqing: Chongqing University of Technology, 2016.
2 METZLER D, DUMAIS S, MEEK C. Similarity measures for short segments of text[C]//Proceedings of AAAI Conference on Artificial Intelligence. Berlin, Germany: Springer-Verlag, 2007: 16-27.
3 ZELIKOVITZ S. Transductive learning for short-text classification problem using latent semantic indexing[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2005, 19(2): 143-163.
4 杨超群.基于自身特征的短文本分类研究[D].合肥:合肥工业大学, 2016.
YANG Chaoqun. Research on short text classification based on its own features[D]. Hefei: Hefei University of Technology, 2016.
5 范云杰,刘怀亮.基于维基百科的中文短文本分类研究[D].西安:西安电子科技大学, 2013.
FAN Yunjie, LIU Huailiang. Research on Chinese short text classification based on wikipedia[D]. Xi'an: Xidian University, 2013.
6 刘婧姣,张素智.基于语义的短文本分类算法研究[D].郑州:郑州轻工业学院, 2013.
LIU Jingjiao, ZHANG Suzhi. The study of short text classification algorithm based on semantic[D]. Zhengzhou: Zhengzhou University of Light Industry, 2013.
7 蔡志威,闵华清.基于概念的短文本分类[D].广州:华南理工大学, 2016.
CAI Zhiwei, MIN Huaqing. Concept-based short text classification[D]. Guangzhou: South China University of Technology, 2016.
8 李锐, 张谦, 刘嘉勇. 基于加权word2vec的微博情感分析[J]. 通信技术, 2017, 50 (3): 502- 506.
doi: 10.3969/j.issn.1002-0802.2017.03.021
LI Rui , ZHANG Qian , LIU Jiayong . Microblog sentiment analysis based on weighted word2vec[J]. Communications Technology, 2017, 50 (3): 502- 506.
doi: 10.3969/j.issn.1002-0802.2017.03.021
9 董文.基于LDA和word2vec的推荐算法研究[D].北京:北京邮电大学, 2015.
DONG Wen. Research of recommendation algorithm based on LDA and word2vec[D]. Beijing: Beijing University of Posts and Telecommunications, 2015.
10 闭炳华. 基于word2vec的数字图书馆本体构建技术研究[J]. 现代电子技术, 2016, 39 (15): 90- 94.
BI Binghua . Research on digital library ontology construction technology based on word2vec[J]. Modern Electronics Technique, 2016, 39 (15): 90- 94.
11 赵飞, 周涛, 张良, 等. 维基百科研究综述[J]. 电子科技大学学报, 2010, 39 (3): 321- 334.
doi: 10.3969/j.issn.1001-0548.2010.03.001
ZHAO Fei , ZHOU Tao , ZHANG Liang , et al. Research progress on Wikipedia[J]. Journal of University of Electronic Science and Technology of China, 2010, 39 (3): 321- 334.
doi: 10.3969/j.issn.1001-0548.2010.03.001
12 HINTON G E. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, USA: Erlbaum, 1986: 1-12.
13 BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. The Journal of Machine Learning Research, 2003, 3: 1137-1155.
14 MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
15 熊富林, 邓怡豪, 唐晓晟. word2vec的核心架构及其应用[J]. 南京师范大学学报(工程技术版), 2015, (1): 43- 48.
doi: 10.3969/j.issn.1672-1292.2015.01.008
XIONG Fulin, DENG Yihao, TANG Xiaosheng. The architecture of word2vec and its applications[J]. Journal of Nanjing Normal University(Engineering and Technology Edition), 2015, (1): 43-48.
doi: 10.3969/j.issn.1672-1292.2015.01.008
16 唐明, 朱磊, 邹显春. 基于word2Vec的一种文档向量表示[J]. 计算机科学, 2016, 43 (6): 214- 217.
TANG Ming , ZHU Lei , ZOU Xianchun . Document vector representation based on word2vec[J]. Computer Science, 2016, 43 (6): 214- 217.
17 陆远蓉. 使用数据挖掘工具Weka[J]. 电脑知识与技术, 2008, 1 (6): 14- 16, 19.
LU Yuanrong . Using weka as data mining tool[J]. Computer Knowledge and Technology, 2008, 1 (6): 14- 16, 19.
18 汪海燕, 黎建辉, 杨风雷. 支持向量机理论及算法研究综述[J]. 计算机应用研究, 2014, 31 (5): 1281- 1286.
doi: 10.3969/j.issn.1001-3695.2014.05.001
WANG Haiyan , LI Jianhui , YANG Fenglei . Overview of support vector machine analysis and algorithm[J]. Application Research of Computers, 2014, 31 (5): 1281- 1286.
doi: 10.3969/j.issn.1001-3695.2014.05.001
19 YANG Y. A re-examination of text categorization methods[C]//Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. New York, USA: ACM, 1999: 42-49.
20 奉国和, 郑伟. 国内中文自动分词技术研究综述[J]. 图书情报工作, 2011, 55 (2): 41- 45.
FENG Guohe , ZHENG Wei . Review of Chinese automatic word segmentation[J]. Library and Information Service, 2011, 55 (2): 41- 45.