
Journal of Shandong University (Engineering Science) ›› 2011, Vol. 41 ›› Issue (6): 18-23.

• Machine Learning and Data Mining •

Study of bilingual words of part-of-speech (POS) disambiguation in the English-Chinese parallel corpus

FENG Min-xuan1, QU Wei-guang2,3*

  1. School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210046, China;
  2. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210046, China;
  3. The Research Center of Information Security and Confidentiality Technology of Jiangsu Province, Nanjing 210097, China
  • Received: 2011-04-15 Online: 2011-12-16 Published: 2011-04-15
  • Corresponding author: QU Wei-guang (1964- ), male, from Yantai, Shandong, professor, Ph.D.; main research interests: natural language processing and artificial intelligence. E-mail: wgqu-nj@163.com
  • About the author: FENG Min-xuan (1978- ), female, from Nanjing, Jiangsu, lecturer, Ph.D.; main research interests: natural language processing and corpus linguistics. E-mail: fennel-2006@163.com
  • Supported by:

    National Natural Science Foundation of China (60773173, 61073119); Natural Science Foundation of Jiangsu Province (BK2010547); Social Science Foundation of Jiangsu Province (10YYB007)

Abstract:

Some words with more than one possible part of speech (POS) occur very frequently yet are still poorly disambiguated by current statistical methods. For such words, highly targeted disambiguation rules were written by hand, and a POS disambiguation method based on these idiosyncratic (word-specific) rules was implemented on a parallel corpus that is not aligned at the word level. Parallel disambiguation algorithms were designed for five typical POS-ambiguous words: the Chinese words 过去 (guoqu), 计划 (jihua), and 与 (yu), and the English words "back" and "so". Evaluated on a large-scale parallel corpus, the algorithms reached an average F-score of 98.45%. The results indicate that such rules are not constrained by context length or by the number of templates, are particularly well suited to bilingual parallel processing, and disambiguate effectively.

Key words: parallel corpus, part-of-speech disambiguation, POS-ambiguous words, automatic recognition, Chinese information processing

CLC Number:

  • TP391.1
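
The abstract describes word-specific rules that use the aligned sentence in the other language, without word alignment, and reports an F-score. The following is only a minimal sketch of that general idea, not the authors' rules or algorithm: the cue characters, the function names `tag_back` and `f_score`, and the POS labels are all hypothetical, and the F-score is assumed to be the balanced F1 measure.

```python
import re

# Hypothetical cues (NOT taken from the paper): characters in the aligned
# Chinese sentence that hint at the POS of the English word "back".
NOUN_CUES = ("背",)     # assumed cue for the body-part noun reading
ADVERB_CUES = ("回",)   # assumed cue for the "return" adverb reading


def tag_back(english: str, chinese: str) -> str:
    """Guess the POS of "back" from cues in the aligned Chinese sentence.

    Only sentence-level alignment is used; no word alignment is required.
    """
    tokens = re.findall(r"[a-z]+", english.lower())
    if "back" not in tokens:
        return "absent"
    if any(cue in chinese for cue in NOUN_CUES):
        return "noun"
    if any(cue in chinese for cue in ADVERB_CUES):
        return "adverb"
    return "unknown"


def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(tag_back("He came back yesterday.", "他昨天回来了。"))
    print(tag_back("My back hurts.", "我的背很疼。"))
    print(f_score(0.99, 0.98))
```

A real rule set would need far more cues per word and some handling of cues that conflict; the sketch only shows why sentence-level alignment can be enough for a rule keyed to one specific word.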