
Journal of Shandong University (Engineering Science) ›› 2015, Vol. 45 ›› Issue (6): 7-15. doi: 10.6040/j.issn.1672-3961.2.2015.085

• Machine Learning and Data Mining •

Chinese entity relation extraction based on entity semantic similarity

XU Qing1, DUAN Liguo1, LI Aiping1,2, YIN Guimei3

  1. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, Shanxi, China;
  2. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, Hubei, China;
  3. Department of Computer Science and Technology, Taiyuan Normal University, Taiyuan 030600, Shanxi, China
  • Received: 2015-05-18 Revised: 2015-10-26 Online: 2015-12-20 Published: 2015-05-18
  • Corresponding author: DUAN Liguo (1970-), male, born in Fanshi, Shanxi, associate professor, Ph.D.; main research interest: Chinese information processing. E-mail: tyutdlg@163.com
  • About the author: XU Qing (1990-), male, born in Jincheng, Shanxi, master's student; main research interest: Chinese information processing. E-mail: 627573745@qq.com
  • Supported by:
    Open Project of the State Key Laboratory of Software Engineering, Wuhan University (SKLSE2012-09-30); Natural Science Foundation of Shanxi Province (2013011015-2); Shanxi Province Basic Condition Platform Project (2014091004-0104)

Abstract: To explore the role of semantic similarity in Chinese entity relation extraction, two new features were proposed: the "TongYiCi CiLin" code tree, constructed from the five-level codes of entity words in "TongYiCi CiLin", and the entity semantic similarity tree, constructed from the average semantic similarity between the entity words of a relation instance and all entity words in each relation category. Together with the existing "TongYiCi CiLin" code feature and the entity type feature, four features in total were examined for their effect on extraction performance. In the single-feature experiments, the entity type feature performed best, with F values of 84.9 for subtypes and 83.2 for types. In the combined-feature experiments, the combination of the entity type feature and the "TongYiCi CiLin" code tree performed best, raising the F values for both types and subtypes by 2.5 over the entity type feature alone, whereas combinations of three features lowered performance instead of improving it. The results showed that the "TongYiCi CiLin" code tree is an effective supplement to entity type information, but that too many features introduce redundant information and degrade extraction performance.

Key words: syntax tree, semantic similarity, tree kernel, TongYiCi CiLin, Chinese entity relation extraction

CLC number:

  • TP391.1
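
The two features described in the abstract can be illustrated with a minimal Python sketch. It assumes the 8-character code layout of the extended edition of TongYiCi CiLin (for example `Aa01A01=`), renders the five levels as a bracketed tree, and averages a similarity score per relation category. The bracketed tree format, the function names, the sample codes and category labels, and the prefix-based similarity are all assumptions made for illustration; the paper's own tree construction and similarity formula are not given on this page.

```python
from statistics import mean

# Assumed layout of an 8-character extended CiLin code such as "Aa01A01=":
# level 1 = 1 char, level 2 = 1 char, level 3 = 2 digits,
# level 4 = 1 char, level 5 = 2 digits, followed by a 1-char marker.
LEVEL_SLICES = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]


def split_levels(code):
    """Split a CiLin code into its five level segments."""
    if len(code) < 7:
        raise ValueError(f"expected at least 7 characters, got {code!r}")
    return [code[a:b] for a, b in LEVEL_SLICES]


def code_tree(code, root="CODE"):
    """Render the five levels as a nested bracketed tree (hypothetical format)."""
    levels = split_levels(code)
    tree = levels[-1]
    for segment in reversed(levels[:-1]):
        tree = f"({segment} {tree})"
    return f"({root} {tree})"


def prefix_similarity(code_a, code_b):
    """Stand-in similarity (not the paper's formula): share of leading levels in common."""
    shared = 0
    for x, y in zip(split_levels(code_a), split_levels(code_b)):
        if x != y:
            break
        shared += 1
    return shared / len(LEVEL_SLICES)


def category_averages(entity_code, category_codes):
    """Average similarity between one entity and every entity in each category."""
    return {
        category: mean(prefix_similarity(entity_code, c) for c in codes)
        for category, codes in category_codes.items()
        if codes  # skip empty categories, which have no average
    }


if __name__ == "__main__":
    print(code_tree("Aa01A01="))  # (CODE (A (a (01 (A 01)))))
    # Illustrative codes and category labels only.
    samples = {"category_1": ["Aa01A01=", "Aa01B02="], "category_2": ["Dm01A01="]}
    print(category_averages("Aa01A01=", samples))
```

In a tree-kernel setting, a string such as the one returned by `code_tree` could be attached to the parse tree of a relation instance, and the per-category averages could label the nodes of a similarity tree; both uses are sketched here only as one plausible reading of the abstract.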