JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, Vol. 48, Issue (3): 134-139. doi: 10.6040/j.issn.1672-3961.0.2017.416

An over sampling algorithm based on clustering

WANG Huan, ZHOU Zhongmei   

  1. School of Computer, Minnan Normal University, Zhangzhou 363000, Fujian, China
  • Received: 2017-08-24; Online: 2018-06-20; Published: 2017-08-24

Abstract: To generate meaningful new samples in oversampling, a clustering-based algorithm, ClusteredSMOTE-Boost, was proposed. The algorithm filtered noisy samples out of the minority class and took the remaining minority class samples as target samples for synthesizing new samples. After the target samples were clustered, the weights and the number of the target samples over the whole training set were determined from the characteristics of the clusters they belonged to. For each target sample, the K nearest neighbors within its own cluster were selected, and one of them was chosen at random to synthesize a new sample together with the target sample. The new samples were therefore similar to the samples in the target cluster, which reduced the boundary complexity that additional new samples would otherwise introduce. Experimental results showed that ClusteredSMOTE-Boost outperformed three classical algorithms, SMOTE-Boost, ADASYN-Boost, and BorderlineSMOTE-Boost, on a variety of measures.
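
The synthesis step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the noise criterion, the clustering method, or the weighting scheme, so the sketch makes three assumptions. A minority sample is treated as noisy when all of its k nearest neighbors in the training set belong to another class, cluster labels are supplied by the caller from any clustering method, and target samples are drawn uniformly rather than by cluster-derived weights. The function names `filter_noise` and `synthesize` are hypothetical, and the boosting stage is omitted.

```python
import numpy as np


def filter_noise(X, y, minority=1, k=5):
    """Return the minority samples kept as target samples.

    Assumption (not from the abstract): a minority sample is noisy when
    none of its k nearest neighbors in the full training set is minority.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = []
    for i in np.flatnonzero(y == minority):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                       # exclude the sample itself
        nearest = np.argsort(dist)[:k]
        if np.any(y[nearest] == minority):
            keep.append(i)
    return X[np.array(keep, dtype=int)]


def synthesize(targets, labels, n_new, k=5, seed=0):
    """Create n_new samples by interpolating inside clusters.

    targets: (n, d) array of target samples after noise filtering.
    labels:  (n,) cluster label of each target sample (assumed given).
    """
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets, dtype=float)
    labels = np.asarray(labels)

    # Only targets with at least one cluster mate can be interpolated.
    _, inverse, counts = np.unique(labels, return_inverse=True,
                                   return_counts=True)
    eligible = np.flatnonzero(counts[inverse] > 1)
    if eligible.size == 0:
        raise ValueError("every cluster contains a single sample")

    new = np.empty((n_new, targets.shape[1]))
    for j in range(n_new):
        # Assumption: uniform choice of target; the paper uses weights.
        i = rng.choice(eligible)
        mates = np.flatnonzero(labels == labels[i])
        mates = mates[mates != i]
        dist = np.linalg.norm(targets[mates] - targets[i], axis=1)
        neighbors = mates[np.argsort(dist)[:k]]   # K nearest in the cluster
        nb = rng.choice(neighbors)                # random neighbor
        gap = rng.random()                        # position on the segment
        new[j] = targets[i] + gap * (targets[nb] - targets[i])
    return new
```

Because neighbors are searched only inside the target sample's own cluster, each synthetic point lies on a segment between two samples of the same cluster. Plain SMOTE searches neighbors across the whole minority class, so it can place points in the gap between two distant minority clusters; restricting the search is what keeps new samples similar to the target cluster.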

Key words: over sampling, instance weights, classification, cluster, imbalanced data

CLC Number: TP311