
山东大学学报(工学版) ›› 2019, Vol. 49 ›› Issue (2): 34-41. doi: 10.6040/j.issn.1672-3961.0.2018.197

• 机器学习与数据挖掘 •

基于word2vec词模型的中文短文本分类方法

高明霞, 李经纬

  1. 北京工业大学信息学部, 北京 100124
  • 收稿日期:2018-05-31 出版日期:2019-04-20 发布日期:2019-04-19
  • 作者简介:高明霞(1973—),女,河北张家口人,工程师,博士,主要研究方向为数据挖掘与知识工程.E-mail: gaomx@bjut.edu.cn
  • 基金资助:
北京市MRI和脑信息重点实验室基金(20160201);数字出版国家重点实验室基金(Q5007013201501);计算机学院院级科研项目(2018JSJKY008)

Chinese short text classification method based on word2vec embedding

Mingxia GAO, Jingwei LI

  1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • Received:2018-05-31 Online:2019-04-20 Published:2019-04-19
  • Supported by:
北京市MRI和脑信息重点实验室基金(20160201);数字出版国家重点实验室基金(Q5007013201501);计算机学院院级科研项目(2018JSJKY008)

摘要:

针对短文本因字数受限而导致文本特征表达能力弱、进而制约分类效果的问题,提出基于word2vec维基百科词模型的中文短文本分类方法(Chinese short text classification method based on embedding trained by word2vec from Wikipedia, CSTC-EWW),并针对新浪爱问4个主题的短文本集进行相关试验。首先基于维基百科语料库训练并获取word2vec词模型,然后建立基于此模型的短文本特征,通过SVM、贝叶斯等经典分类器对短文本进行分类。试验结果表明:本研究提出的方法可以有效进行短文本分类,最好情况下的F-度量值可达到81.8%;和词袋(bag-of-words, BOW)模型结合词频-逆文件频率(term frequency-inverse document frequency, TF-IDF)加权表达特征的短文本分类方法以及同样引入外来维基百科语料扩充特征的短文本分类方法相比,本研究分类效果更好,最好情况下的F-度量提高45.2%。

关键词: 短文本, 中文文本分类, 维基百科, word2vec, 词模型

Abstract:

In short text classification, the weak feature expression caused by the limited number of words was the main factor restricting classification performance. To address this problem, a Chinese short text classification method based on embeddings trained by word2vec from Wikipedia (CSTC-EWW) was proposed, and a series of experiments on short texts covering 4 topics from the iask.com website was carried out. The method first trained the embeddings with word2vec on the Wikipedia corpus, then built short text features based on these embeddings, and finally classified the short texts with Naive Bayes and SVM classifiers. The experimental results showed that CSTC-EWW could effectively classify short texts, with the best F-measure reaching 81.8%; compared with the feature expression of the BOW model weighted by TF-IDF and with a method that also extended features from Wikipedia, the classification results of CSTC-EWW were significantly better, and the F-measure on the car topic increased by up to 45.2%.
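
The pipeline described in the abstract (train word2vec on Wikipedia, embed each short text, classify with Naive Bayes or SVM) can be illustrated with a minimal sketch. It uses gensim and scikit-learn as assumed stand-ins for the paper's actual tooling (the paper reports Naive Bayes and LibSVM results), and the toy corpus, segmentation, and labels below are placeholders rather than the Wikipedia dump or Sina iAsk data.

import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Step 1: train word2vec; in the paper this is done on the segmented Chinese
# Wikipedia corpus, here a tiny toy corpus keeps the sketch runnable.
corpus = [["汽车", "发动机", "保养"], ["健康", "饮食", "运动"],
          ["汽车", "轮胎", "更换"], ["健康", "睡眠", "习惯"]] * 50
w2v = Word2Vec(corpus, vector_size=50, window=3, sample=1e-3, min_count=1)

# Step 2: represent each short text as the mean of its word vectors (embedding-ave).
def embed_ave(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Step 3: classify the embedded short texts (a linear SVM stands in here).
texts = [["发动机", "保养"], ["饮食", "运动"], ["轮胎", "更换"], ["睡眠", "习惯"]]
labels = ["汽车", "健康", "汽车", "健康"]
X = np.vstack([embed_ave(t, w2v) for t in texts])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))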

Key words: short texts, Chinese text classification, Wikipedia, word2vec, embedding

中图分类号: TP391

图1 中文短文本分类流程

图2 借助平均值表达特征

图3 结合TF-IDF加权表达特征
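
图2 and 图3 correspond to the two feature representations evaluated below as embedding-ave and embedding-tfidf. The following is a hedged sketch of both constructions, assuming a gensim KeyedVectors-like object wv and pre-segmented token lists; normalizing the weighted sum by the total TF-IDF weight is an assumption, not necessarily the paper's exact formula.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def embedding_ave(tokens, wv, dim):
    # 图2: unweighted mean of the word vectors of in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def embedding_tfidf(docs_tokens, wv, dim):
    # 图3: each word vector is scaled by the word's TF-IDF weight in that
    # document before averaging.
    vectorizer = TfidfVectorizer(analyzer=lambda toks: toks)  # inputs are token lists
    tfidf = vectorizer.fit_transform(docs_tokens)
    vocab = vectorizer.vocabulary_
    features = np.zeros((len(docs_tokens), dim))
    for i, tokens in enumerate(docs_tokens):
        weighted, total = np.zeros(dim), 0.0
        for t in tokens:
            if t in wv and t in vocab:
                w = tfidf[i, vocab[t]]
                weighted += w * wv[t]
                total += w
        if total > 0:
            features[i] = weighted / total
    return features

# Tiny demo with a plain dict standing in for a trained word-vector model.
wv = {"汽车": np.array([1.0, 0.0]), "保养": np.array([0.5, 0.5])}
print(embedding_tfidf([["汽车", "保养"], ["汽车"]], wv, 2))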

表1 不同覆盖率下的短文本数目

覆盖率/% 家庭与生活 健康与医学 汽车 商业经济 合计
50 10 000 10 000 10 000 10 000 40 000
75 9 717 9 585 9 579 9 710 38 591
100 8 272 7 742 8 064 8 668 32 746

表2 word2vec参数变化对分类效果的影响

词向量维数 窗口大小 采样阈值 特征表示方法 F-度量值/%
200 3 10⁻³ embedding-ave 79.5
200 3 10⁻³ embedding-tfidf 81.3
200 3 10⁻⁵ embedding-ave 77.9
200 3 10⁻⁵ embedding-tfidf 81.1
200 5 10⁻³ embedding-ave 81.1
200 5 10⁻³ embedding-tfidf 80.4
200 5 10⁻⁵ embedding-ave 80.4
200 5 10⁻⁵ embedding-tfidf 81.0
400 3 10⁻³ embedding-ave 79.2
400 3 10⁻³ embedding-tfidf 81.7
400 3 10⁻⁵ embedding-ave 78.3
400 3 10⁻⁵ embedding-tfidf 81.4
400 5 10⁻³ embedding-ave 80.2
400 5 10⁻³ embedding-tfidf 80.4
400 5 10⁻⁵ embedding-ave 79.4
400 5 10⁻⁵ embedding-tfidf 81.2
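
The grid in 表2 (词向量维数 200/400, 窗口大小 3/5, 采样阈值 10⁻³/10⁻⁵) maps directly onto word2vec training parameters. Below is a minimal sketch of running such a grid with gensim; the toy corpus stands in for the segmented Wikipedia dump, and the CBOW/skip-gram choice is left at gensim's default since it is not stated on this page.

from itertools import product
from gensim.models import Word2Vec

# Toy stand-in for the segmented Wikipedia corpus actually used in the paper.
sentences = [["短", "文本", "分类"], ["维基", "百科", "语料"]] * 100

# The grid evaluated in 表2: vector dimension, context window, subsampling threshold.
for dim, window, sample in product([200, 400], [3, 5], [1e-3, 1e-5]):
    model = Word2Vec(sentences, vector_size=dim, window=window,
                     sample=sample, min_count=1)
    model.save(f"wiki_w2v_d{dim}_w{window}_s{sample}.model")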

表3 不同覆盖率下的F-度量值(单位: %)

模型 方法 分类器 覆盖率50% 覆盖率75% 覆盖率100%
模型a embedding-ave Naive Bayes 66.7 66.9 64.4
模型a embedding-ave LibSVM 79.2 79.5 77.5
模型a embedding-tfidf Naive Bayes 62.7 63.7 61.2
模型a embedding-tfidf LibSVM 81.7 81.8 80.6
模型b embedding-ave Naive Bayes 68.9 69.1 66.5
模型b embedding-ave LibSVM 78.3 78.3 76.0
模型b embedding-tfidf Naive Bayes 65.1 65.9 63.2
模型b embedding-tfidf LibSVM 81.4 81.5 79.6

图4 传统特征表示方法与本研究中2种特征表达方式的比较
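
图4 compares the traditional representation, the BOW model weighted by TF-IDF mentioned in the abstract, with the two embedding-based features above. A minimal sketch of that baseline follows, assuming scikit-learn; the pre-segmented example texts and labels are toy placeholders, not the Sina iAsk data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Short texts are assumed to be pre-segmented and joined with spaces.
docs = ["发动机 保养 方法", "饮食 运动 健康", "轮胎 更换 周期", "睡眠 习惯 健康"]
labels = ["汽车", "健康", "汽车", "健康"]

baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(docs, labels)
print(baseline.predict(["汽车 保养"]))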

图5 CSTC-EWW方法与范云杰等人模型的比较

1 刘英涛.短文本分类研究[D].重庆:重庆理工大学, 2016.
LIU Yingtao. Research on short text classification[D]. Chongqing: Chongqing University of Technology, 2016.
2 METZLER D, DUMAIS S, MEEK C. Similarity measures for short segments of text[C]//Proceedings of AAAI Conference on Artificial Intelligence. Berlin, Germany: Springer-Verlag, 2007: 16-27.
3 ZELIKOVITZ S. Transductive learning for short-text classification problem using latent semantic indexing[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2005, 19(2): 143-163.
4 杨超群.基于自身特征的短文本分类研究[D].合肥:合肥工业大学, 2016.
YANG Chaoqun. Research on short text classification based on its own features[D]. Hefei: Hefei University of Technology, 2016.
5 范云杰,刘怀亮.基于维基百科的中文短文本分类研究[D].西安:西安电子科技大学, 2013.
FAN Yunjie, LIU Huailiang. Research on Chinese short text classification based on wikipedia[D]. Xi'an: Xidian University, 2013.
6 刘婧姣,张素智.基于语义的短文本分类算法研究[D].郑州:郑州轻工业学院, 2013.
LIU Jingjiao, ZHANG Suzhi. The study of short text classification algorithm based on semantic[D]. Zhengzhou: Zhengzhou University of Light Industry, 2013.
7 蔡志威,闵华清.基于概念的短文本分类[D].广州:华南理工大学, 2016.
CAI Zhiwei, MIN Huaqing. Concept-based short text classification[D]. Guangzhou: South China University of Technology, 2016.
8 李锐, 张谦, 刘嘉勇. 基于加权word2vec的微博情感分析[J]. 通信技术, 2017, 50 (3): 502- 506.
doi: 10.3969/j.issn.1002-0802.2017.03.021
LI Rui , ZHANG Qian , LIU Jiayong . Microblog sentiment analysis based on weighted word2vec[J]. Communications Technology, 2017, 50 (3): 502- 506.
doi: 10.3969/j.issn.1002-0802.2017.03.021
9 董文.基于LDA和word2vec的推荐算法研究[D].北京:北京邮电大学, 2015.
DONG Wen. Research of recommendation algorithm based on LDA and word2vec[D]. Beijing: Beijing University of Posts and Telecommunications, 2015.
10 闭炳华. 基于word2vec的数字图书馆本体构建技术研究[J]. 现代电子技术, 2016, 39 (15): 90- 94.
BI Binghua . Research on digital library ontology construction technology based on word2vec[J]. Modern Electronics Technique, 2016, 39 (15): 90- 94.
11 赵飞, 周涛, 张良, 等. 维基百科研究综述[J]. 电子科技大学学报, 2010, 39 (3): 321- 334.
doi: 10.3969/j.issn.1001-0548.2010.03.001
ZHAO Fei , ZHOU Tao , ZHANG Liang , et al. Research progress on Wikipedia[J]. Journal of University of Electronic Science and Technology of China, 2010, 39 (3): 321- 334.
doi: 10.3969/j.issn.1001-0548.2010.03.001
12 HINTON G E. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, USA: Erlbaum, 1986: 1-12.
13 BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. The Journal of Machine Learning Research, 2003, 3: 1137-1155.
14 MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
15 熊富林, 邓怡豪, 唐晓晟. word2vec的核心架构及其应用[J]. 南京师范大学学报(工程技术版), 2015, (1): 43- 48.
doi: 10.3969/j.issn.1672-1292.2015.01.008
XIONG Fulin, DENG Yihao, TANG Xiaosheng. The architecture of word2vec and its applications[J]. Journal of Nanjing Normal University(Engineering and Technology Edition), 2015, (1): 43-48.
doi: 10.3969/j.issn.1672-1292.2015.01.008
16 唐明, 朱磊, 邹显春. 基于word2Vec的一种文档向量表示[J]. 计算机科学, 2016, 43 (6): 214- 217.
TANG Ming , ZHU Lei , ZOU Xianchun . Document vector representation based on word2vec[J]. Computer Science, 2016, 43 (6): 214- 217.
17 陆远蓉. 使用数据挖掘工具Weka[J]. 电脑知识与技术, 2008, 1 (6): 14- 16, 19.
LU Yuanrong . Using weka as data mining tool[J]. Computer Knowledge and Technology, 2008, 1 (6): 14- 16, 19.
18 汪海燕, 黎建辉, 杨风雷. 支持向量机理论及算法研究综述[J]. 计算机应用研究, 2014, 31 (5): 1281- 1286.
doi: 10.3969/j.issn.1001-3695.2014.05.001
WANG Haiyan , LI Jianhui , YANG Fenglei . Overview of support vector machine analysis and algorithm[J]. Application Research of Computers, 2014, 31 (5): 1281- 1286.
doi: 10.3969/j.issn.1001-3695.2014.05.001
19 YANG Y. A re-examination of text categorization methods[C]//Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. New York, USA: ACM, 1999: 42-49.
20 奉国和, 郑伟. 国内中文自动分词技术研究综述[J]. 图书情报工作, 2011, 55 (2): 41- 45.
FENG Guohe , ZHENG Wei . Review of Chinese automatic word segmentation[J]. Library and Information Service, 2011, 55 (2): 41- 45.