Boosted equalization ensemble learning algorithm for imbalanced data

doi:10.6040/j.issn.1672-3961.0.2023.116

Abstract

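Abstract: To address the pseudo-balance problem that undersampling techniques exhibit when handling imbalanced data, this paper proposes a boosted equalization ensemble learning algorithm based on undersampling. A new equalization sampling mechanism uses a binning operation to balance the data across their predicted probabilities, producing high-quality training subsets on which classifiers are trained iteratively. During the iterations, each base classifier is adaptively assigned a weight according to its false positive rate and false negative rate on the original data, which keeps poorly performing classifiers from affecting the overall decision and improves the generalization ability of the ensemble model. The new algorithm eliminates pseudo-balance while making majority-class samples easier to identify, thereby reducing the effect of blurred class boundaries on the classification model. Comparative experiments on 18 small data sets and 2 large data sets indicate that the algorithm has advantages for imbalanced data classification.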
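The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact sampling rule or weight formula, so everything specific here is an assumption made for illustration: equal-width bins over the predicted minority-class probability, round-robin selection across bins, and the weight 1 - (FPR + FNR) / 2. The function names `equalized_undersample` and `error_rate_weight` are hypothetical and are not taken from the paper.

```python
import random


def equalized_undersample(probs, n_keep, n_bins=5, seed=0):
    """Pick up to n_keep majority-sample indices spread evenly over probability bins.

    probs: predicted minority-class probability (0..1) of each majority sample.
    Assumption: equal-width bins and round-robin selection, not the paper's rule.
    """
    rng = random.Random(seed)
    # Group sample indices into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[b].append(i)
    for b in bins:
        rng.shuffle(b)
    # Round-robin over bins so each non-empty bin contributes about equally,
    # instead of the densest bin dominating the subset.
    chosen = []
    while len(chosen) < n_keep and any(bins):
        for b in bins:
            if b and len(chosen) < n_keep:
                chosen.append(b.pop())
    return chosen


def error_rate_weight(y_true, y_pred):
    """Assumed weight 1 - (FPR + FNR) / 2; label 1 is the minority class."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return 1.0 - (fpr + fnr) / 2.0


# 90 easy majority samples, 5 borderline, 5 hard: plain random undersampling
# would return mostly easy ones, while the binned draw takes from every bin.
probs = [0.05] * 90 + [0.5] * 5 + [0.95] * 5
subset = equalized_undersample(probs, n_keep=9, n_bins=3)
weight = error_rate_weight([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
```

In this sketch a classifier that errs heavily on either class receives a lower weight regardless of how rare that class is, which is one plausible reading of weighting by both error rates on the original, imbalanced data.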

Keywords: undersampling, class imbalance, imbalanced learning, ensemble learning, imbalanced data classification

CLC number: TP391

BAI Lin, JU Tong, WANG Hao, LEI Mingzhu, PAN Xiaoying. Boosted equalization ensemble learning algorithm for imbalanced data[J]. Journal of Shandong University (Engineering Science), 2024, 54(4): 59-66.
