Journal of Shandong University (Engineering Science), 2018, Vol. 48, Issue (3): 134-139. doi: 10.6040/j.issn.1672-3961.0.2017.416
WANG Huan (王换), ZHOU Zhongmei (周忠眉)
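Abstract: To synthesize more meaningful new samples in oversampling, a clustering-based oversampling algorithm, ClusteredSMOTE-Boost, is proposed. Noise samples in the minority class are filtered out, and each remaining minority sample serves as a target sample for synthesizing new samples. The whole training set is clustered, and the weight of each target sample and the number of samples synthesized from it are determined by the characteristics of the cluster it falls in. All target samples are then clustered; K nearest neighbors are selected within the cluster containing a target sample, and one of them is chosen at random and combined with the target sample to synthesize a new sample. This keeps new samples as similar as possible to the samples in the target sample's cluster and reduces the boundary complexity introduced by adding samples. Experimental results show that ClusteredSMOTE-Boost clearly outperforms the three classic algorithms SMOTE-Boost, ADASYN-Boost, and BorderlineSMOTE-Boost on every metric.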
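The in-cluster synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `synthesize_in_cluster`, the Euclidean distance, the SMOTE-style linear interpolation, and the fallback for single-sample clusters are all assumptions made for illustration, and the cluster labels are taken as given. Noise filtering, the boosting loop, and the cluster-based weights and synthesis counts are not specified in the abstract and are left out.

```python
import math
import random


def synthesize_in_cluster(samples, labels, target_idx, k, rng):
    """Create one synthetic sample for samples[target_idx], using only
    neighbours that share its cluster label (SMOTE-style interpolation)."""
    target = samples[target_idx]
    # Candidate neighbours: other minority samples in the same cluster.
    same_cluster = [
        s for i, s in enumerate(samples)
        if i != target_idx and labels[i] == labels[target_idx]
    ]
    if not same_cluster:
        # Assumed fallback: a singleton cluster has no neighbour to
        # interpolate with, so return a copy of the target.
        return list(target)
    # Keep the k nearest candidates by Euclidean distance.
    neighbours = sorted(same_cluster, key=lambda s: math.dist(s, target))[:k]
    chosen = rng.choice(neighbours)  # pick one of the k neighbours at random
    gap = rng.random()               # interpolation factor in [0, 1)
    return [t + gap * (c - t) for t, c in zip(target, chosen)]


# Toy minority samples in two well-separated clusters (labels assumed given).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
cluster_labels = [0, 0, 0, 1, 1]
rng = random.Random(0)
new_point = synthesize_in_cluster(minority, cluster_labels, 0, k=2, rng=rng)
```

Because neighbours are drawn only from the target's own cluster, the synthetic point for the first sample stays on a segment inside cluster 0 and cannot land between the two clusters, which is the effect the abstract attributes to in-cluster neighbour selection.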