An over sampling algorithm based on clustering

doi:10.6040/j.issn.1672-3961.0.2017.416

Abstract

Abstract: To synthesize more meaningful new samples in over sampling, a clustering-based over sampling algorithm, ClusteredSMOTE-Boost, is proposed. The algorithm filters out noisy minority class samples and uses each remaining minority class sample as a target sample for synthesizing new samples. The whole training set is clustered, and the weight of each target sample and the number of samples to synthesize from it are determined by the characteristics of the cluster that contains it. All target samples are then clustered; within the cluster containing a target sample, its K nearest neighbors are selected, and one of them is chosen at random and combined with the target sample to synthesize a new sample. New samples therefore stay as similar as possible to the samples in the target sample's cluster, which reduces the boundary complexity introduced by adding samples. Experimental results show that ClusteredSMOTE-Boost clearly outperforms three classical algorithms, SMOTE-Boost, ADASYN-Boost and BorderlineSMOTE-Boost, on every measure.

Key words: over sampling, instance weights, classification, cluster, imbalanced data
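
The in-cluster synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that noise filtering and clustering of the target samples have already been done, so cluster labels are passed in as input. It also uses a fixed number of synthetic samples per target (`per_sample`) where the paper uses a weight-derived count, and it borrows SMOTE-style linear interpolation as the way to combine two samples. The function name `synthesize_in_cluster` and all parameter names are hypothetical.

```python
import math
import random


def synthesize_in_cluster(samples, labels, k=3, per_sample=1, seed=0):
    """Create synthetic points by interpolating each target sample with a
    randomly chosen one of its k nearest neighbours from the same cluster."""
    rng = random.Random(seed)
    synthetic = []
    for i, (x, cluster) in enumerate(zip(samples, labels)):
        # Candidate neighbours: the other target samples in the same cluster.
        mates = [s for j, s in enumerate(samples)
                 if j != i and labels[j] == cluster]
        if not mates:
            continue  # a singleton cluster offers no neighbour to pair with
        neighbours = sorted(mates, key=lambda s: math.dist(x, s))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            # Assumption: SMOTE-style point on the segment from x to nb.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy data: two well-separated clusters of minority class target samples.
targets = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
           [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
cluster_ids = [0, 0, 0, 1, 1, 1]
new_points = synthesize_in_cluster(targets, cluster_ids, k=2, per_sample=2)
```

Because neighbours are drawn only from the target sample's own cluster, every synthetic point in this toy example falls inside the region spanned by that cluster and never on a line between the two clusters, which is the intuition behind the reduced boundary complexity claimed in the abstract.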

CLC number: TP311

WANG Huan, ZHOU Zhongmei. An over sampling algorithm based on clustering[J]. Journal of Shandong University (Engineering Science), 2018, 48(3): 134-139.
