Journal of Shandong University (Engineering Science), 2019, Vol. 49, Issue (2): 34-41. doi: 10.6040/j.issn.1672-3961.0.2018.197

• Machine Learning & Data Mining •

Chinese short text classification method based on word2vec embedding

Mingxia GAO, Jingwei LI

  1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
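  • Received: 2018-05-31 Online: 2019-04-20 Published: 2019-04-19
  • Supported by:
    Beijing Key Laboratory of MRI and Brain Informatics Fund (20160201); State Key Laboratory of Digital Publishing Fund (Q5007013201501); College of Computer Science college-level research project (2018JSJKY008)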

Abstract:

In short text classification, the limited number of words per text leads to a weak feature representation, which restricts classification performance. To solve this problem, a Chinese short text classification method based on embeddings trained by word2vec on Wikipedia (CSTC-EWW) was proposed, and a series of experiments was conducted on short texts covering 4 topics from the iask.com website. The method first trained word embeddings with word2vec on a Wikipedia corpus. The features of each short text were then built from these embeddings, and Naive Bayes and SVM were used to classify the short texts. The experimental results led to the following conclusions: CSTC-EWW could classify short texts effectively, with the best F-measure reaching 81.8%; compared with the BOW text representation weighted by TF-IDF and with the method of extending features from Wikipedia, the classification results of CSTC-EWW were significantly better, and the F-measure of CSTC-EWW on the car topic could be increased by 45.2%.

Key words: short texts, Chinese text classification, Wikipedia, word2vec, embedding

CLC Number: 

  • TP391

Fig.1 The process of Chinese short text classification

Fig.2 Feature based on average value
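
The averaging scheme named in the Fig.2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the short text has already been segmented into words, that the word2vec vectors are available as a plain dictionary, and that words without a vector are skipped. The names `average_embedding` and `toy_embeddings` are hypothetical, and the English tokens stand in for segmented Chinese words.

```python
from typing import Dict, List, Sequence


def average_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Return the mean vector of the tokens that have an embedding.

    Tokens missing from `embeddings` are skipped (an assumption of this
    sketch). If no token is covered, a zero vector is returned.
    """
    total = [0.0] * dim
    hits = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        hits += 1
    if hits == 0:
        return total
    return [x / hits for x in total]


# Toy 2-dimensional vectors; real word2vec vectors would have 200 or 400 dimensions.
toy_embeddings = {"car": [1.0, 0.0], "maintenance": [0.0, 1.0]}
feature = average_embedding(["car", "maintenance", "unknown"], toy_embeddings, dim=2)
print(feature)
```

The resulting fixed-length vector is what would be handed to a classifier such as Naive Bayes or an SVM.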

Fig.3 Feature based on weight of TF-IDF
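
A TF-IDF-weighted feature of the kind named in the Fig.3 caption might look like the sketch below. Several details are assumptions, because this page does not give the exact formula: term frequency is taken as count divided by text length, IDF as `log(N / df) + 1`, and the weighted sum is normalized by the total weight. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def idf_table(corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Compute IDF per token as log(N / df) + 1 (one common variant, assumed here)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def tfidf_weighted_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    idf: Dict[str, float],
    dim: int,
) -> List[float]:
    """Return the TF-IDF-weighted mean of the covered tokens' vectors."""
    counts = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in counts.items():
        vec = embeddings.get(tok)
        if vec is None or tok not in idf:
            continue  # skip tokens without a vector or an IDF value
        weight = (count / len(tokens)) * idf[tok]
        for i, value in enumerate(vec):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


# Toy corpus: "car" occurs in every text, so it gets a lower IDF than "engine".
corpus = [["car", "engine"], ["car", "loan"]]
idf = idf_table(corpus)
toy_embeddings = {"car": [1.0, 0.0], "engine": [0.0, 1.0]}
feature = tfidf_weighted_embedding(["car", "engine"], toy_embeddings, idf, dim=2)
print(feature)
```

In this toy case the rarer word `engine` pulls the feature vector toward its own embedding, which is the intended effect of the weighting.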

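Table 1 Short text number under different coverage rates

| Coverage rate/% | Family and life | Health and medicine | Car | Business and economy | Total |
|---|---|---|---|---|---|
| 50 | 10 000 | 10 000 | 10 000 | 10 000 | 40 000 |
| 75 | 9 717 | 9 585 | 9 579 | 9 710 | 38 591 |
| 100 | 8 272 | 7 742 | 8 064 | 8 668 | 32 746 |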
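
Table 1 does not spell out how coverage is computed. The sketch below assumes that the coverage rate of a short text is the fraction of its words that have a vector in the trained embedding vocabulary, and that a text is kept when its rate reaches the threshold. Both the definition and the names `coverage_rate` and `filter_by_coverage` are assumptions made for illustration.

```python
from typing import List, Sequence, Set


def coverage_rate(tokens: Sequence[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens found in the embedding vocabulary (assumed definition)."""
    if not tokens:
        return 0.0
    covered = sum(1 for tok in tokens if tok in vocabulary)
    return covered / len(tokens)


def filter_by_coverage(
    texts: Sequence[Sequence[str]],
    vocabulary: Set[str],
    min_rate: float,
) -> List[Sequence[str]]:
    """Keep only the texts whose coverage rate is at least `min_rate`."""
    return [text for text in texts if coverage_rate(text, vocabulary) >= min_rate]


vocabulary = {"car", "engine", "loan"}
texts = [
    ["car", "engine"],       # coverage 1.0
    ["car", "xyz"],          # coverage 0.5
    ["abc", "xyz", "loan"],  # coverage about 0.33
]
for threshold in (0.5, 0.75, 1.0):
    kept = filter_by_coverage(texts, vocabulary, threshold)
    print(threshold, len(kept))
```

Under this reading, raising the threshold can only shrink the data set, which is consistent with the falling totals in Table 1.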

Table 2 Influence of word2vec parameters on the classification effect

| Vector dimension | Window size | Sampling threshold | Feature representation method | F-measure/% |
|---|---|---|---|---|
| 200 | 3 | 10^-3 | embedding-ave | 79.5 |
| 200 | 3 | 10^-3 | embedding-tfidf | 81.3 |
| 200 | 3 | 10^-5 | embedding-ave | 77.9 |
| 200 | 3 | 10^-5 | embedding-tfidf | 81.1 |
| 200 | 5 | 10^-3 | embedding-ave | 81.1 |
| 200 | 5 | 10^-3 | embedding-tfidf | 80.4 |
| 200 | 5 | 10^-5 | embedding-ave | 80.4 |
| 200 | 5 | 10^-5 | embedding-tfidf | 81.0 |
| 400 | 3 | 10^-3 | embedding-ave | 79.2 |
| 400 | 3 | 10^-3 | embedding-tfidf | 81.7 |
| 400 | 3 | 10^-5 | embedding-ave | 78.3 |
| 400 | 3 | 10^-5 | embedding-tfidf | 81.4 |
| 400 | 5 | 10^-3 | embedding-ave | 80.2 |
| 400 | 5 | 10^-3 | embedding-tfidf | 80.4 |
| 400 | 5 | 10^-5 | embedding-ave | 79.4 |
| 400 | 5 | 10^-5 | embedding-tfidf | 81.2 |
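
The 16 rows of Table 2 form a full grid over three word2vec parameters and two feature methods. A sketch of enumerating that grid is shown below. The dictionary keys `vector_size`, `window`, and `sample` mirror the keyword names of gensim's `Word2Vec`; that mapping is an assumption, since this page does not name the toolkit used for training.

```python
from itertools import product

# Parameter values as listed in Table 2.
vector_sizes = (200, 400)
windows = (3, 5)
samples = (1e-3, 1e-5)
feature_methods = ("embedding-ave", "embedding-tfidf")

# One configuration per table row, generated in the same order as the table.
grid = [
    {"vector_size": dim, "window": win, "sample": thr, "feature": method}
    for dim, win, thr, method in product(
        vector_sizes, windows, samples, feature_methods
    )
]

for config in grid[:2]:
    print(config)
print(len(grid), "configurations")
```

Each configuration would correspond to one trained embedding model plus one feature construction step, followed by classification and scoring.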

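Table 3 Influence of coverage rate on the classification effect (unit: %)

| Model | Method | Classifier | Coverage rate 50% | Coverage rate 75% | Coverage rate 100% |
|---|---|---|---|---|---|
| Model a | embedding-ave | Naive Bayes | 66.7 | 66.9 | 64.4 |
| Model a | embedding-ave | LibSVM | 79.2 | 79.5 | 77.5 |
| Model a | embedding-tfidf | Naive Bayes | 62.7 | 63.7 | 61.2 |
| Model a | embedding-tfidf | LibSVM | 81.7 | 81.8 | 80.6 |
| Model b | embedding-ave | Naive Bayes | 68.9 | 69.1 | 66.5 |
| Model b | embedding-ave | LibSVM | 78.3 | 78.3 | 76.0 |
| Model b | embedding-tfidf | Naive Bayes | 65.1 | 65.9 | 63.2 |
| Model b | embedding-tfidf | LibSVM | 81.4 | 81.5 | 79.6 |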
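
The classification scores on this page are reported as F-measures. As a reminder of how such a score is commonly obtained, the sketch below computes the balanced F-measure (F1) from true positive, false positive, and false negative counts for one class. Whether the paper uses exactly this balanced form, and how it averages over the four topics, is not stated here, so both are assumptions. The function name is hypothetical.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy counts for one topic: 8 correct hits, 2 false alarms, 2 misses.
score = f_measure(tp=8, fp=2, fn=2)
print(f"{score * 100:.1f}%")
```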

Fig.4 Comparison between the traditional representation method and the two proposed feature representation methods

Fig.5 Comparison between the model of Fan Yunjie et al. and CSTC-EWW

1 LIU Yingtao. Research on short text classification[D]. Chongqing: Chongqing University of Technology, 2016.
2 METZLER D, DUMAIS S, MEEK C. Similarity measures for short segments of text[C]//Proceedings of AAAI Conference on Artificial Intelligence. Berlin, Heidelberg, Germany: Springer-Verlag, 2007: 16-27.
3 ZELIKOVITZ S, MARQUEZ F. Transductive learning for short-text classification problems using latent semantic indexing[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2005, 19(2): 143-163.
4 YANG Chaoqun. Research on short text classification based on its own features[D]. Hefei: Hefei University of Technology, 2016.
5 FAN Yunjie, LIU Huailiang. Research on Chinese short text classification based on Wikipedia[D]. Xi'an: Xidian University, 2013.
6 LIU Jingjiao, ZHANG Suzhi. The study of short text classification algorithm based on semantic[D]. Zhengzhou: Zhengzhou University of Light Industry, 2013.
7 CAI Zhiwei, MIN Huaqing. Concept-based short text classification[D]. Guangzhou: South China University of Technology, 2016.
8 LI Rui, ZHANG Qian, LIU Jiayong. Microblog sentiment analysis based on weighted word2vec[J]. Communications Technology, 2017, 50(3): 502-506. doi: 10.3969/j.issn.1002-0802.2017.03.021
9 DONG Wen. Research of recommendation algorithm based on LDA and word2vec[D]. Beijing: Beijing University of Posts and Telecommunications, 2015.
10 BI Binghua. Research on digital library ontology construction technology based on word2vec[J]. Modern Electronics Technique, 2016, 39(15): 90-94.
11 ZHAO Fei, ZHOU Tao, ZHANG Liang, et al. Research progress on Wikipedia[J]. Journal of University of Electronic Science and Technology of China, 2010, 39(3): 321-334. doi: 10.3969/j.issn.1001-0548.2010.03.001
12 HINTON G E. Learning distributed representations of concepts[C]//Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, USA: Erlbaum, 1986: 1-12.
13 BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model[J]. The Journal of Machine Learning Research, 2003, 3: 1137-1155.
14 MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
15 XIONG Fulin, DENG Yihao, TANG Xiaosheng. The architecture of word2vec and its applications[J]. Journal of Nanjing Normal University (Engineering and Technology Edition), 2015, (1): 43-48. doi: 10.3969/j.issn.1672-1292.2015.01.008
16 TANG Ming, ZHU Lei, ZOU Xianchun. Document vector representation based on word2vec[J]. Computer Science, 2016, 43(6): 214-217.
17 LU Yuanrong. Using Weka as data mining tool[J]. Computer Knowledge and Technology, 2008, 1(6): 14-16, 19.
18 WANG Haiyan, LI Jianhui, YANG Fenglei. Overview of support vector machine analysis and algorithm[J]. Application Research of Computers, 2014, 31(5): 1281-1286. doi: 10.3969/j.issn.1001-3695.2014.05.001
19 YANG Y. A re-examination of text categorization methods[C]//Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM, 1999: 42-49.
20 FENG Guohe, ZHENG Wei. Review of Chinese automatic word segmentation[J]. Library and Information Service, 2011, 55(2): 41-45.