Journal of Shandong University (Engineering Science), 2024, Vol. 54, Issue (4): 59-66. doi: 10.6040/j.issn.1672-3961.0.2023.116
白琳1,2,俱通1,王浩1,雷明珠1,潘晓英1,2
BAI Lin1,2, JU Tong1, WANG Hao1, LEI Mingzhu1, PAN Xiaoying1,2
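Abstract: To address the pseudo-balance problem that undersampling techniques face when handling imbalanced data, an undersampling-based boosting equalization ensemble learning algorithm is proposed and designed. A new equalization sampling mechanism is adopted: a binning operation coordinates the predicted probabilities of the data to generate high-quality training subsets, which are then used to train classifiers iteratively. During iteration, each base classifier is adaptively assigned a weight based on its false positive rate and false negative rate on the original data, which keeps poorly performing classifiers from affecting the overall decision and improves the generalization ability of the ensemble model. The new algorithm eliminates pseudo-balance while making majority-class samples easier to recognize, thereby reducing the effect of blurred class boundaries on the classification model. Comparative experiments on 18 small datasets and 2 large datasets show that the algorithm has advantages in classifying imbalanced data.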
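The two mechanisms named in the abstract, binned sampling over predicted probabilities and weighting of base classifiers by error rates, can be illustrated with a minimal sketch. The abstract does not give the bin count, the per-bin quota, or the weight formula, so everything below is an assumption made for illustration: the function names `binned_undersample` and `classifier_weight`, the equal-width bins, the equal per-bin quota, and the weight `1 - (FPR + FNR) / 2` are hypothetical and are not the authors' method.

```python
import random


def binned_undersample(probs, n_keep, n_bins=5, seed=0):
    """Pick majority-class indices spread evenly over probability bins.

    probs: predicted minority-class probability for each majority sample,
           assumed to lie in [0, 1] (hypothetical input, e.g. from the
           current ensemble).
    Returns a sorted list of selected indices.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        # Equal-width bins; p == 1.0 falls into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append(i)

    non_empty = [b for b in bins if b]
    if not non_empty:
        return []

    # Assumed rule: the same quota for every non-empty bin, so that easy and
    # hard majority samples are both represented in the training subset.
    quota = max(1, n_keep // len(non_empty))
    chosen = []
    for b in non_empty:
        chosen.extend(rng.sample(b, min(quota, len(b))))
    return sorted(chosen)


def classifier_weight(y_true, y_pred):
    """Hypothetical weight from false positive and false negative rates.

    Labels are 0 (majority/negative) and 1 (minority/positive).
    Uses 1 - (FPR + FNR) / 2, which equals balanced accuracy.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return 1.0 - (fpr + fnr) / 2.0
```

In an iterative scheme of the kind the abstract describes, each round would draw a subset with something like `binned_undersample`, train a base classifier on it together with the minority samples, and weight that classifier's vote by its error rates measured on the original data. A weight of this form penalizes a classifier that labels everything as the majority class, which plain accuracy would not do on imbalanced data.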