
Journal of Shandong University (Engineering Science) ›› 2024, Vol. 54 ›› Issue (4): 59-66. doi: 10.6040/j.issn.1672-3961.0.2023.116

• Machine Learning and Data Mining •

Boosted equalization ensemble learning algorithm for imbalanced data

BAI Lin1,2, JU Tong1, WANG Hao1, LEI Mingzhu1, PAN Xiaoying1,2

  1. School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China;
  2. Shaanxi Province Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an 710121, Shaanxi, China
  • Published: 2024-08-20
  • About the author: BAI Lin (born 1980), female, from Xi'an, Shaanxi; associate professor, master's supervisor, master's degree. Main research interest: intelligent information processing. E-mail: bailin@xupt.edu
  • Funding:
    Key Research and Development Program of Shaanxi Province (2023-YBSF-476); Innovation Fund of Xi'an University of Posts and Telecommunications (CXJJYL2022043)

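Abstract: To address the pseudo-balance problem that undersampling techniques face when handling imbalanced data, a boosted equalization ensemble learning algorithm based on undersampling is proposed. A new equalization sampling mechanism coordinates the predicted probabilities of the data through a binning operation to generate high-quality training subsets, which are then used to train classifiers iteratively. Based on the false positive rate and false negative rate of each base classifier on the original data, weights are assigned to the base classifiers adaptively during iteration, which prevents poorly performing classifiers from affecting the overall decision and improves the generalization ability of the ensemble model. The new algorithm eliminates pseudo-balance while increasing the recognizability of majority-class samples, thereby reducing the influence of blurred class boundaries on the classification model. Comparative experiments on 18 small datasets and 2 large datasets show that the algorithm has advantages in handling imbalanced data classification problems.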
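The abstract names two mechanisms but gives no formulas, so the following is only a minimal Python sketch of how they could look, not the authors' implementation. The function names, the equal-quota-per-bin rule, and the weight formula 1 - (FPR + FNR) / 2 are all illustrative assumptions; labels use 1 for the minority (positive) class and 0 for the majority class.

```python
import random


def binned_undersample(majority_probs, n_keep, n_bins=5, seed=0):
    """Pick roughly n_keep majority-sample indices, spread over probability bins.

    majority_probs[i] is assumed to be the current ensemble's predicted
    probability (in [0, 1]) that majority sample i is a minority sample.
    Equal quota per non-empty bin is an assumption, not the paper's rule.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for idx, p in enumerate(majority_probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[b].append(idx)
    non_empty = [b for b in bins if b]
    if not non_empty:
        return []
    quota = max(1, n_keep // len(non_empty))  # equal share per non-empty bin
    chosen = []
    for b in non_empty:
        chosen.extend(rng.sample(b, min(quota, len(b))))
    return sorted(chosen)


def classifier_weight(y_true, y_pred):
    """Hypothetical weight from FPR and FNR measured on the original data."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    # Assumed rule: lower error rates on both classes give a larger weight.
    return 1.0 - (fpr + fnr) / 2.0


if __name__ == "__main__":
    # FPR = 1/4, FNR = 1/2, so the assumed weight is 1 - 0.375.
    print(classifier_weight([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))  # 0.625
```

In an iterative loop, each round would presumably call something like `binned_undersample` on the majority class, train a base classifier on the selected samples plus all minority samples, and then score it with something like `classifier_weight` before adding it to the weighted vote.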

Key words: undersampling, class imbalance, imbalanced learning, ensemble learning, imbalanced data classification

CLC number: 

  • TP391