Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (3): 134-139. doi: 10.6040/j.issn.1672-3961.0.2017.416
王换,周忠眉
WANG Huan, ZHOU Zhongmei
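Abstract: To synthesize more meaningful new samples in oversampling, a clustering-based oversampling algorithm, ClusteredSMOTE-Boost, is proposed. Noise samples in the minority class are first filtered out, and each remaining minority sample serves as a target sample for synthesis. The whole training set is then clustered, and the weight of each target sample and the number of samples to synthesize from it are determined by the characteristics of the cluster that contains it. Next, all target samples are clustered; K nearest neighbors are selected within the cluster containing a target sample, and one of them is chosen at random and combined with the target sample to synthesize a new sample. This keeps each new sample as similar as possible to the samples in the target sample's cluster and reduces the boundary complexity introduced by adding samples. Experimental results show that ClusteredSMOTE-Boost clearly outperforms three classic algorithms, SMOTE-Boost, ADASYN-Boost and BorderlineSMOTE-Boost, on every evaluation metric.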
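The synthesis step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the interpolation formula, so the standard SMOTE rule (a random point on the segment between the target sample and the chosen neighbor) is assumed. The function name `synthesize_within_cluster` and the parameters `cluster_ids` and `n_new` are hypothetical; noise filtering, the clustering itself, and the per-sample weights are taken as already computed and passed in.

```python
import numpy as np


def synthesize_within_cluster(X_min, cluster_ids, n_new, k=5, seed=0):
    """Create synthetic minority samples using only same-cluster neighbours.

    X_min       : (n, d) minority (target) samples, assumed already noise-filtered
    cluster_ids : (n,) cluster label of each target sample (from any clusterer)
    n_new       : (n,) number of synthetic samples to create per target sample
    k           : number of nearest same-cluster neighbours to choose from
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    synthetic = []
    for i, x in enumerate(X_min):
        # Candidate neighbours: the other target samples in the same cluster.
        same = np.flatnonzero(cluster_ids == cluster_ids[i])
        same = same[same != i]
        if same.size == 0 or n_new[i] == 0:
            continue  # nothing to interpolate with, or nothing requested
        # Keep the k closest candidates by Euclidean distance.
        dists = np.linalg.norm(X_min[same] - x, axis=1)
        neighbours = same[np.argsort(dists)[:k]]
        for _ in range(int(n_new[i])):
            j = rng.choice(neighbours)  # pick one neighbour at random
            gap = rng.random()          # assumed SMOTE-style gap in [0, 1)
            synthetic.append(x + gap * (X_min[j] - x))
    return np.array(synthetic).reshape(-1, X_min.shape[1])


# Toy data: two well-separated clusters of minority samples.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0]])
cluster_ids = np.array([0, 0, 0, 1, 1])
n_new = np.array([2, 1, 0, 3, 0])
S = synthesize_within_cluster(X_min, cluster_ids, n_new, k=2)
```

Because neighbors are drawn only from the target sample's own cluster, every synthetic point in this toy example lies inside the region spanned by that cluster, and none falls in the gap between the two clusters.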