
Journal of Shandong University (Engineering Science) ›› 2024, Vol. 54 ›› Issue (4): 67-75. doi: 10.6040/j.issn.1672-3961.0.2023.112

• Machine Learning and Data Mining •

Chinese-Uyghur cross-lingual named entity recognition by fusing data augmentation and knowledge transfer

GE Yifei1, Azragul1,2*, CHEN Degang1

  1. College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, Xinjiang, China;
  2. National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, China
  • Published: 2024-08-20
  • About the authors: GE Yifei (1998— ), male, from Suqian, Jiangsu, is a master's student whose research focuses on natural language processing. E-mail: 1453259830@qq.com. *Corresponding author: Azragul (1987— ), female, from Urumqi, Xinjiang, Ph.D., is an associate professor and master's supervisor whose research focuses on natural language processing. E-mail: Azragul2010@126.com
  • Supported by: the Special Program for Innovation Environment (Talents and Bases) Construction of the Xinjiang Uygur Autonomous Region - Natural Science Plan (Special Training of Minority Science and Technology Talents) (2022D03001), the National Natural Science Foundation of China (61662081), the National Social Science Foundation of China (14AZD11), and the Young Top Talents Program of Xinjiang Normal University (XJNUQB2022-22)


Abstract: To address the scarcity of labeled data for Uyghur named entity recognition, a zero-shot transfer method for Chinese-Uyghur cross-lingual named entity recognition was proposed. A simple yet effective sequence-label translation scheme translated the source-language training data into the target language while avoiding problems such as word-order changes and uncertain entity spans. On top of the combined source-language and translated data, an entity augmentation method based on similarity computation was introduced, which effectively improved the quality of the generated text and further increased sample diversity. Extensive experiments showed that the augmented data enabled the Chinese minority pre-trained language model (CINO) to better transfer both the language-specific features of the target language and the language-independent features shared across languages: the multilingual data augmentation and cross-lingual knowledge transfer model achieved an F1 score of 86.50%, 7.42% higher than the baseline model, demonstrating the feasibility of Chinese-Uyghur cross-lingual named entity recognition that fuses data augmentation and knowledge transfer.
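To make the two data-construction steps described in the abstract concrete, the Python sketch below shows one plausible reading of them: entity spans are shielded behind placeholders before machine translation, so word-order changes cannot corrupt the labeled spans, and entities are then swapped with their nearest neighbours under cosine similarity to diversify the samples. This is a minimal illustration, not the paper's implementation; translate_fn and embed_fn are hypothetical stand-ins for a Chinese-Uyghur translation backend and a sentence-embedding model.

# Minimal sketch (not the paper's released code). `translate_fn` and `embed_fn`
# are assumed stand-ins for an MT system and a sentence-embedding model; only
# the two augmentation ideas from the abstract are illustrated.
from typing import Callable, List

import numpy as np


def translate_with_placeholders(tokens: List[str], tags: List[str],
                                translate_fn: Callable[[str], str]) -> str:
    """Shield each labeled entity behind a numbered placeholder, translate the
    sentence as a whole, then splice the separately translated entities back.
    Because entities never pass through the sentence-level translation call,
    word-order changes cannot blur their spans."""
    entities: List[str] = []
    parts: List[str] = []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):            # start of an entity span
            j = i + 1
            while j < len(tokens) and tags[j].startswith("I-"):
                j += 1
            entities.append("".join(tokens[i:j]))
            parts.append(f"<E{len(entities) - 1}>")
            i = j
        else:                                    # ordinary token
            parts.append(tokens[i])
            i += 1
    translated = translate_fn("".join(parts))
    for k, entity in enumerate(entities):        # restore translated entities
        translated = translated.replace(f"<E{k}>", translate_fn(entity))
    return translated


def most_similar_entity(entity: str, pool: List[str],
                        embed_fn: Callable[[str], np.ndarray]) -> str:
    """Return the pool entity closest to `entity` under cosine similarity;
    substituting near neighbours keeps augmented sentences fluent while
    increasing sample diversity."""
    v = embed_fn(entity)
    vecs = np.stack([embed_fn(p) for p in pool])
    sims = vecs @ v / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(v))
    return pool[int(np.argmax(sims))]

In a pipeline of the kind the abstract describes, routines like these would be applied to the Chinese training set, and the resulting target-language sentences used to fine-tune CINO.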

Key words: Chinese-Uyghur cross-lingual, named entity recognition, data augmentation, knowledge transfer, CINO

CLC number: TP391