您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2018, Vol. 48 ›› Issue (3): 134-139.doi: 10.6040/j.issn.1672-3961.0.2017.416

• • 上一篇    下一篇

一种基于聚类的过抽样算法

王换,周忠眉   

  1. 闽南师范大学计算机学院, 福建 漳州 363000
  • 收稿日期:2017-08-24 出版日期:2018-06-20 发布日期:2017-08-24
  • 作者简介:王换(1990— ),女,河南安阳人,硕士研究生,主要研究方向为数据挖掘. E-mail:704807435@qq.com
  • 基金资助:
    国家自然科学基金资助项目(61170129)

An over sampling algorithm based on clustering

WANG Huan, ZHOU Zhongmei   

  1. School of Computer, Minnan Normal University, Zhangzhou 363000, Fujian, China
  • Received:2017-08-24 Online:2018-06-20 Published:2017-08-24

摘要: 在过抽样技术研究中,为了合成较有意义的新样本,提出一种基于聚类的过抽样算法ClusteredSMOTE-Boost。过滤小类的噪声样本,将剩余的每个小类样本作为目标样本参与合成新样本。对整个训练集聚类,根据聚类后目标样本所在簇的特点确定其权重及合成个数。将所有目标样本聚类,在目标样本所在的簇内选取K个近邻,并从中任选一个与目标样本合成新样本,使新样本与目标样本簇内的样本尽量相似,并减少由于添加样本而造成的边界复杂度。试验结果表明,ClusteredSMOTE-Boost算法在各个度量上均明显优于SMOTE-Boost、ADASYN-Boost和BorderlineSMOTE-Boost三种经典算法。

关键词: 过抽样, 样本权重, 聚类, 分类, 不平衡数据

Abstract: In the research of over sampling, in order to generate meaningful new samples, the ClusteredSMOTE-Boost was proposed, which was based on the clustering technique. The algorithm filtered the noisy of minority class samples and took the remaining minority class samples as target samples to synthesize new samples. According to characteristics of the cluster of target samples after clustering determined the weight and the number of the target samples for the whole training set. All target samples were clustered and K-nearest neighbors in the cluster of the target sample were selected, and then a sample from K-nearest neighbors was randomly chosen to synthesize new sample with target sample. Thus, new samples were similar with samples in the target cluster. This method reduced the complexity of the boundary caused by the additional new samples. The experimental results showed that the ClusteredSMOTE-Boost algorithm was superior to the three classical algorithms SMOTE-Boost, ADASYN-Boost, BorderlineSMOTE-Boost on the variety of measures.

Key words: over sampling, instance weights, classification, cluster, imbalanced data

中图分类号: 

  • TP311
[1] WANG S, YAO X. Multi-class imbalance problems: analysis and potential solutions[J]. IEEE Transactions on Systems Man & Cybernetics: Part B, 2012, 42(4):1119-1130.
[2] HE H, GARCIA E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9):1263-1284.
[3] KUMAR M, BHUTANI K, AGGARWAL S. Hybrid model for medical diagnosis using neutrosophic cognitive maps with genetic algorithms[C] //IEEE International Conference on Fuzzy Systems. Istanbul, Turkey: IEEE, 2015:1-7.
[4] SRIVASTAVA A, KUNDU A, SURAL S, et al. Credit card fraud detection using hidden Markov model[J]. IEEE Transactions on Dependable & Secure Computing, 2008, 5(1):37-48.
[5] LI J, FONG S, MOHAMMED S, et al. Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms[J]. Journal of Supercomputing, 2016, 72(10): 3708-3728.
[6] 杨明, 尹军梅, 吉根林. 不平衡数据分类方法综述[J]. 南京师范大学学报(工程技术版), 2008, 8(4): 7-12. YANG Ming, YIN Junmei, JI Genlin. Classification methods on imbalanced data: a survey[J]. Journal of Nanjing Normal University(Engineering and Technology Edition), 2008, 8(4): 7-12.
[7] SUN Z, SONG Q, ZHU X, et al. A novel ensemble method for classifying imbalanced data[J]. Pattern Recognition, 2015, 48(5):1623-1637.
[8] SAEZ J A, LUENGO J, STEFANWSKI J, et al. SMOTE—IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering[J]. Information Sciences, 2015, 291(5):184-203.
[9] RAMENTOL E, CABALLERO Y, BELLO R, et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and under sampling for high imbalanced data-sets using SMOTE and rough sets theory[J]. Knowledge & Information Systems, 2012, 33(2):245-265.
[10] BARUA S, ISLAM M M, YAO X, et al. MWMOTE: majority weighted minority over sampling technique for imbalanced data set learning[J]. IEEE Transactions on Knowledge & Data Engineering, 2014, 26(2):405-425.
[11] BORAL A, CYGAN M, KOCIUMAKA T, et al. A fast branching algorithm for cluster vertex deletion[J]. Theory of Computing Systems, 2016, 58(2):357-376.
[12] FOMIN S, GRIGORIEV D, KOSHEVOY G. Subtraction-free complexity, cluster transformations, and spanning trees[J]. Foundations of Computational Mathematics, 2016, 16(1):1-31.
[13] DAVIES D L, BOULDIN D W. A cluster separation measure[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 1979, 1(2):224.
[14] ZENG H J, HE Q C, CHEN Z, et al. Learning to cluster web search results[C] //International ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, UK: ACM, 2004:210-217.
[15] 胡小生, 张润晶, 钟勇. 一种基于聚类提升的不平衡数据分类算法[J]. 集成技术, 2014(2):35-41. HU Xiaosheng, ZHANG Runjing, ZHONG Yong. A clustering-based enhanced classification algorithm for imbalanced data[J]. Journal of Integration Technology, 2014(2):35-41.
[16] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2011, 16(1):321-357.
[17] HE H, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning[C] //IEEE International Joint Conference on Neural Networks. Hoboken, USA: IEEE, 2008:1322-1328.
[18] HAN H, WANG W Y, MAO B H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning[J]. Lecture Notes in Computer Science, 2005, 3644(5):878-887.
[19] CHAWLA N V, LAZAREVIC A, HALL L O, et al. SMOTEBoost: improving prediction of the minority class in boosting[J]. Lecture Notes in Computer Science, 2003, 2838:107-119.
[20] BUNKHUMPORNPAT C, SINAPIROMSARAN K, LURSINSAP C. Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem[C] //Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2009:475-482.
[1] 李晓辉,刘小飞,孙炜桐,赵毅,董媛,靳引利. 基于车辆与无人机协同的巡检任务分配与路径规划算法[J]. 山东大学学报 (工学版), 2025, 55(5): 101-109.
[2] 陈素根,赵志忠. 融合局部截断距离及小簇合并的密度峰值聚类[J]. 山东大学学报 (工学版), 2025, 55(2): 58-70.
[3] 王梅,宋凯文,刘勇,王志宝,万达. DMKK-means——一种深度多核K-means聚类算法[J]. 山东大学学报 (工学版), 2024, 54(6): 1-7.
[4] 白琳,俱通,王浩,雷明珠,潘晓英. 面向不平衡数据的提升均衡集成学习算法[J]. 山东大学学报 (工学版), 2024, 54(4): 59-66.
[5] 陈晓江,杨晓奇,陈广豪,刘伍颖. 混合BERT和宽度学习的低时间复杂度短文本分类[J]. 山东大学学报 (工学版), 2024, 54(4): 51-58.
[6] 宋辉,张轶哲,张功萱,孟元. 基于类权重和最小化预测熵的测试时集成方法[J]. 山东大学学报 (工学版), 2024, 54(3): 36-43.
[7] 聂秀山,巩蕊,董飞,郭杰,马玉玲. 短视频场景分类方法综述[J]. 山东大学学报 (工学版), 2024, 54(3): 1-11.
[8] 王丽娟,徐晓,丁世飞. 面向密度峰值聚类的高效相似度度量[J]. 山东大学学报 (工学版), 2024, 54(3): 12-21.
[9] 徐金华,罗义凯,李昱燃,李岩. 基于时频分解与深度学习的轨道客流预测[J]. 山东大学学报 (工学版), 2024, 54(2): 60-68.
[10] 马坤,刘筱云,李乐平,纪科,陈贞翔,杨波. 用于意图识别的自适应多标签信息学习模型[J]. 山东大学学报 (工学版), 2024, 54(1): 45-51.
[11] 张鑫,费可可. 基于log鲁棒核岭回归的子空间聚类算法[J]. 山东大学学报 (工学版), 2023, 53(6): 26-34.
[12] 于泓,杜娟,魏琳,张利. 计及行为特征的市场化用户电量数据拟合方法[J]. 山东大学学报 (工学版), 2023, 53(4): 113-119.
[13] 李颖,王建坤. 基于监督图正则化和信息融合的轻度认知障碍分类方法[J]. 山东大学学报 (工学版), 2023, 53(4): 65-73.
[14] 李兆彬,叶军,周浩岩,卢岚,谢立. 变异萤火虫优化的粗糙K-均值聚类算法[J]. 山东大学学报 (工学版), 2023, 53(4): 74-82.
[15] 张喜龙,韩萌,陈志强,武红鑫,李慕航. 动态集成选择的不平衡漂移数据流Boosting分类算法[J]. 山东大学学报 (工学版), 2023, 53(4): 83-92.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 李 侃 . 嵌入式相贯线焊接控制系统开发与实现[J]. 山东大学学报(工学版), 2008, 38(4): 37 -41 .
[2] 来翔 . 用胞映射方法讨论一类MKdV方程[J]. 山东大学学报(工学版), 2006, 36(1): 87 -92 .
[3] 余嘉元1 , 田金亭1 , 朱强忠2 . 计算智能在心理学中的应用[J]. 山东大学学报(工学版), 2009, 39(1): 1 -5 .
[4] 陈瑞,李红伟,田靖. 磁极数对径向磁轴承承载力的影响[J]. 山东大学学报(工学版), 2018, 48(2): 81 -85 .
[5] 王波,王宁生 . 机电装配体拆卸序列的自动生成及组合优化[J]. 山东大学学报(工学版), 2006, 36(2): 52 -57 .
[6] 张英,郎咏梅,赵玉晓,张鉴达,乔鹏,李善评 . 由EGSB厌氧颗粒污泥培养好氧颗粒污泥的工艺探讨[J]. 山东大学学报(工学版), 2006, 36(4): 56 -59 .
[7] Yue Khing Toh1 , XIAO Wendong2 , XIE Lihua1 . 基于无线传感器网络的分散目标跟踪:实际测试平台的开发应用(英文)[J]. 山东大学学报(工学版), 2009, 39(1): 50 -56 .
[8] 孙炜伟,王玉振. 考虑饱和的发电机单机无穷大系统有限增益镇定[J]. 山东大学学报(工学版), 2009, 39(1): 69 -76 .
[9] 孙玉利,李法德,左敦稳,戚美 . 直立分室式流体连续通电加热系统的升温特性[J]. 山东大学学报(工学版), 2006, 36(6): 19 -23 .
[10] 王勇, 谢玉东.

大流量管道煤气的控制技术研究

[J]. 山东大学学报(工学版), 2009, 39(2): 70 -74 .