
Journal of Shandong University (Engineering Science) ›› 2024, Vol. 54 ›› Issue (4): 67-75. doi: 10.6040/j.issn.1672-3961.0.2023.112

• Machine Learning and Data Mining •

Chinese-Uyghur cross-lingual named entity recognition fusing data augmentation and knowledge transfer

GE Yifei1, Azragul1,2*, CHEN Degang1

  1. College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, Xinjiang, China;
  2. National Language Resource Monitoring & Research Center of Minority Languages, Beijing 100081, China
  • Published: 2024-08-20
  • About the authors: GE Yifei (1998— ), male, from Suqian, Jiangsu; master's student; main research interest: natural language processing. E-mail: 1453259830@qq.com. *Corresponding author: Azragul (1987— ), female, from Urumqi, Xinjiang; PhD, associate professor and master's supervisor; main research interest: natural language processing. E-mail: Azragul2010@126.com
  • Supported by:
    the Xinjiang Uygur Autonomous Region Special Project for Building an Innovation Environment (Talents and Bases), Natural Science Program (Special Training of Ethnic-Minority Scientific and Technological Talents) (2022D03001); the National Natural Science Foundation of China (61662081); the National Social Science Fund of China (14AZD11); and the Xinjiang Normal University Young Top Talent Project (XJNUQB2022-22)


Abstract: To address the scarcity of data for Uyghur named entity recognition, a zero-shot transfer method for Chinese-Uyghur cross-lingual named entity recognition was proposed. A simple yet effective sequence-labeled translation scheme translated the source-language training data into target-language data, avoiding problems such as word-order changes and uncertain entity spans. Combining the source-language data with the translated data, an entity augmentation method based on similarity computation was introduced, which effectively improved the quality of the generated text and further increased sample diversity. Extensive experiments showed that the augmented data enabled the Chinese minority pre-trained language model (CINO) to better transfer both the language-specific features of the target language and the language-independent features shared across languages: the multilingual data-augmented cross-lingual knowledge transfer model reached an F1 score of 86.50%, an improvement of 7.42% over the baseline model, demonstrating the feasibility of Chinese-Uyghur cross-lingual named entity recognition that fuses data augmentation and knowledge transfer.
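The two data-construction steps summarized above can be illustrated with a short, hedged sketch. The code below is an illustration under stated assumptions, not the authors' implementation: `translate` stands in for any Chinese-to-Uyghur machine translation system, `embed` for any sentence-embedding encoder, and `pool` for an entity inventory built from the combined data; all three names are hypothetical. The first function projects entity labels through translation by wrapping entity spans in index markers, which is one way to sidestep word-order changes and uncertain entity spans; the second swaps an entity for its most similar same-type entity, in the spirit of the similarity-based entity augmentation described in the abstract.

```python
# Illustrative sketch only: `translate`, `embed`, and `pool` are hypothetical
# stand-ins for an MT system, an embedding model, and an entity inventory.
import re
from typing import Callable

import numpy as np


def project_labels(tokens: list[str], tags: list[str],
                   translate: Callable[[str], str]) -> list[tuple[str, str]]:
    """Translate a BIO-tagged sentence; recover entity spans via index markers."""
    pieces, ent_types, i, idx = [], [], 0, 0
    while i < len(tokens):
        if tags[i].startswith("B-"):            # start of an entity span
            j = i + 1
            while j < len(tokens) and tags[j].startswith("I-"):
                j += 1
            pieces.append(f"[[{idx}|" + "".join(tokens[i:j]) + "]]")
            ent_types.append(tags[i][2:])       # entity type, e.g. PER, LOC
            idx, i = idx + 1, j
        else:                                   # ordinary token
            pieces.append(tokens[i])
            i += 1
    # Join without spaces (character-level Chinese input assumed); the
    # markers are assumed to survive translation intact.
    translated = translate("".join(pieces))
    # Read the translated entity surfaces back out of the markers.
    return [(surface, ent_types[int(k)])
            for k, surface in re.findall(r"\[\[(\d+)\|(.+?)\]\]", translated)]


def similar_entity_swap(sentence: str, entity: str, ent_type: str,
                        pool: dict[str, list[str]],
                        embed: Callable[[str], np.ndarray]) -> str:
    """Replace `entity` with its most cosine-similar same-type entity."""
    cands = [e for e in pool.get(ent_type, []) if e != entity]
    if not cands:
        return sentence
    v = embed(entity)
    vecs = [embed(e) for e in cands]
    sims = [float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
            for u in vecs]
    best = cands[int(np.argmax(sims))]
    return sentence.replace(entity, best, 1)
```

Restricting swaps to highly similar entities of the same type is what lets this kind of augmentation raise the quality of the generated text while still diversifying the samples.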

Key words: Chinese-Uyghur cross-lingual, named entity recognition, data augmentation, knowledge transfer, CINO

CLC number: TP391