
Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (3): 40-47. doi: 10.6040/j.issn.1672-3961.0.2017.403


Building of domain sentiment lexicon based on word2vec

LIN Jianghao1,2, ZHOU Yongmei1,2*, YANG Aimin1,2, CHEN Jin1,3

  1. Laboratory for Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China;
  2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China;
  3. International College, Guangdong University of Foreign Studies, Guangzhou 510420, Guangdong, China
  • Received: 2017-08-23 Online: 2018-06-20 Published: 2017-08-23
  • Corresponding author: ZHOU Yongmei (born 1971), female, from Yongzhou, Hunan; master's degree, professor; main research areas: text sentiment analysis and machine learning. E-mail: yongmeizhou@163.com E-mail: lin_hao@foxmail.com
  • About the author: LIN Jianghao (born 1985), male, from Jieyang, Guangdong; master's degree, assistant research fellow; main research areas: natural language processing and text sentiment analysis. E-mail: lin_hao@foxmail.com
  • Supported by:
    Humanities and Social Sciences Project of the Ministry of Education (14YJA740011); Science and Technology Innovation Project of the Department of Education of Guangdong Province (2013KJCX0067); Guangdong Province Philosophy and Social Sciences "Twelfth Five-Year" Plan Project (GD15YTS01); Science and Technology Plan Project of Guangdong Province (2017A040406025); Teaching Reform Project of Guangdong University of Foreign Studies (GWJY2017046)

Abstract: To address the shortcomings of existing domain sentiment lexicons in representing sentiment and semantics, a method for building a domain sentiment lexicon based on word vectors is proposed. A word2vec model was trained on 250 thousand news texts and more than 100 thousand hotel reviews. Eighty sentiment words with obvious sentiment, rich content, and diverse parts of speech were chosen as the seed word set. Using TF-IDF as a measure of word importance, 9 860 candidate domain sentiment words were obtained from the hotel reviews. The semantic similarity between the word vectors of the candidate sentiment words and those of the seed words was then calculated, which mapped the sentiment words into a high-dimensional vector space and yielded a feature vector representation of sentiment words (Senti2vec). Senti2vec was applied to polarity classification of sentiment words and to text sentiment analysis. The experimental results showed that Senti2vec could represent both the meaning and the sentiment of sentiment words. Because the semantic similarity is calculated on a domain-specific corpus, the extracted sentiment features are more domain-specific and are not constrained by the scope of the candidate sentiment word set.

Key words: word2vec, sentiment word, semantic similarity, sentimental feature vector, domain sentiment lexicon

CLC Number:

  • TP391.1
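
The Senti2vec idea described in the abstract can be illustrated with a minimal sketch. It assumes that each dimension of a word's feature vector is the cosine similarity between that word's embedding and the embedding of one seed word; the abstract does not state which similarity measure the authors used, so cosine similarity is an assumption here. The function names `cosine` and `senti2vec`, the toy words, and the 3-dimensional vectors are all hypothetical; in practice the vectors would come from the trained word2vec model and there would be 80 seed words.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def senti2vec(word, seed_words, vectors):
    """Return one similarity score per seed word, in seed order.

    `vectors` maps each word to its embedding; this mapping stands in
    for a trained word2vec model (an assumption of this sketch).
    """
    return [cosine(vectors[word], vectors[seed]) for seed in seed_words]


# Toy embeddings with made-up values, used only to show the shape of the result.
toy_vectors = {
    "clean": [0.9, 0.1, 0.0],      # hypothetical positive seed word
    "dirty": [-0.8, 0.2, 0.1],     # hypothetical negative seed word
    "spotless": [0.8, 0.2, 0.1],   # hypothetical candidate sentiment word
}
seed_words = ["clean", "dirty"]

features = senti2vec("spotless", seed_words, toy_vectors)
print(features)  # two scores: similarity to "clean", then to "dirty"
```

With these toy values the candidate word scores higher against the positive seed than against the negative one, which is the kind of signal a downstream polarity classifier could use. The classifier itself is outside the scope of this sketch.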