非均匀数据的变异系数聚类算法

doi:10.6040/j.issn.1672-3961.0.2017.410

山东大学学报(工学版) ›› 2018, Vol. 48 ›› Issue (3): 140-145.doi: 10.6040/j.issn.1672-3961.0.2017.410

• • 上一篇

非均匀数据的变异系数聚类算法

杨天鹏¹,徐鲲鹏¹,陈黎飞^1,2*

1. 福建师范大学数学与信息学院, 福建福州 350117;2. 数字福建环境监测物联网实验室, 福建福州 350117

收稿日期:2017-08-24 出版日期:2018-06-20 发布日期:2017-08-24
通讯作者: 陈黎飞(1972— ),男,福建长乐人,教授,博导,博士,主要研究方向为统计机器学习,数据挖掘,模式识别.E-mail: clfei@fjnu.edu.cn E-mail:yangplace@163.com
作者简介:杨天鹏(1991— ),男,湖北十堰人,硕士研究生,主要研究方向为数据挖掘.E-mail:yangplace@163.com
基金资助:
国家自然科学基金资助项目(61175123);福建省自然科学基金资助项目(2015J01238);福建师范大学创新团队资助项目(IRTL1704)

Coefficient of variation clustering algorithm for non-uniform data

YANG Tianpeng¹, XU Kunpeng¹, CHEN Lifei^1,2*

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, Fujian, China;
2. Digit Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou 350117, Fujian, China

Received:2017-08-24 Online:2018-06-20 Published:2017-08-24

摘要/Abstract

摘要： 针对现有基于划分的聚类算法无法有效聚类簇大小和簇密度有较大差异的非均匀数据的问题,提出一种基于变异系数聚类算法。从聚类优化目标的角度出发,分析了以K-means为代表的划分聚类算法引发“均匀效应”的成因;提出以变异系数度量非均匀数据的分布散度,并基于变异系数定义一种非均匀数据的相异度公式;基于相异度公式定义了聚类目标优化函数,并根据局部优化方法给出聚类算法过程。在合成和真实数据集上的试验结果表明,与K-means、Verify2、ESSC聚类算法相比,本研究提出的非均匀数据的变异系数聚类算法(coefficient of variation clustering for non-uniform data, CVCN)聚类精度提升5%~40%。

关键词: 基于划分聚类, 非均匀数据, 均匀效应, 聚类, K-means, 变异系数

Abstract: Affected by the “uniform effect”, a problem existed in the partition-based algorithms remained on open and challenging taskdue to handling. To solve this problem, a clustering algorithm based on coefficient of variation was proposed. The “uniform effect” caused by K-means-type partitioning clustering algorithm from the view of clustering optimization was analyzed. Instead of the squared error, a new measure of dispersion for non-uniform data was proposed relied on the coefficient of variation. The clustering objective optimization function was defined using a new non-uniform data dissimilarity formula, which was proposed based on the coefficient of variation. According to the local optimization method, the clustering algorithm process was given. The experimental results on real and synthetic non-uniform datasets showed that the clustering accuracy of CVCN was better than K-means, Verify2, ESSC.

Key words: clustering, partition-based clustering, coefficient of variation, K-means, uniform effect, non-uniform data

中图分类号:

TP311

杨天鹏,徐鲲鹏,陈黎飞. 非均匀数据的变异系数聚类算法[J]. 山东大学学报(工学版), 2018, 48(3): 140-145.

YANG Tianpeng, XU Kunpeng, CHEN Lifei. Coefficient of variation clustering algorithm for non-uniform data[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, 48(3): 140-145.

参考文献

[1] 韩家炜,坎伯,裴健.数据挖掘:概念与技术[M]. 3版. 范明,孟小峰,译.北京: 机械工业出版社, 2012.
[2] BERKHIN P. A survey of clustering data mining techniques[J]. Grouping Multidimensional Data, 2002, 43(1): 25-71.
[3] 孙吉贵.刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1): 48-61. SUN Jigui, LIU Jie, ZHAO Lianyu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1): 48-61.
[4] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review[J]. Acm Computing Surveys, 1999, 31(3): 264-323.
[5] AGGARWAL C C, REDDY C K. Data clustering: algorithms and applications[M]. Boca Raton: CRC press, 2013.
[6] HE H, GARCIA E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9): 1263-1284.
[7] KRAWCZYK B. Learning from imbalanced data: open challenges and future directions[J]. Progress in Artificial Intelligence, 2016, 5(4): 1-12.
[8] HARTIGAN J A, WONG M A. Algorithm as 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society Series C:Applied Statistics, 1979, 28(1): 100-108.
[9] XIONG H, WU J, CHEN J. K-means clustering versus validation measures: a data-distribution perspective[J]. IEEE Transactions on Systems, Man, and Cybernetics: Part B: Cybernetics, 2009, 39(2): 318-331.
[10] WU J, XIONG H, CHEN J. Adapting the right measures for K-means clustering[C] //Proceedings of the the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM,2009: 877-886.
[11] KUMAR C N S, RAO K N, GOVARDHAN A. An empirical comparative study of novel clustering algorithms for class imbalance learning[C] //Proceedings of the Second International Conference on Computer and Communication Technologies(IC3T). Hyderabad, India: Springer India, 2016:181-191.
[12] KUMAR N S, RAO K N, GOVARDHAN A, et al. Undersampled K-means approach for handling imbalanced distributed data[J]. Progress in Artificial Intelligence, 2014, 3(1): 29-38.
[13] LIANG J, BAI L, DANG C, et al. The K-means-type algorithms versus imbalanced data distributions[J]. IEEE Transactions on Fuzzy Systems, 2012, 20(4): 728-745.
[14] MAHAJAN M, NIMBHORKAR P, VARADARAJAN K. The planar K-means problem is NP-hard[J]. Theoretical Computer Science, 2009, 442(8): 274-285.
[15] XU L, JORDAN M I. On convergence properties of the EM algorithm for Gaussian mixtures[J]. Neural Computation, 1996, 8(1): 129-151.
[16] MCLACHLAN G J, KRISHNAN T. The EM Algorithm and Extensions, Second Edition[M]. New York:[s.n.] , 2007.
[17] JAIN A K. Data clustering: 50 years beyond K-means[J]. Pattern Recognition Letters, 2010, 31(8): 651-666.
[18] BROWN C E. Applied multivariate statistics in geohydrology and related sciences[M]. Berlin: Springer, 1998.
[19] EVERITT B. Cambridge dictionary of statistics[M]. Cambridge:Cambridge University Press, 2002.
[20] 齐敏. 模式识别导论[M]. 北京:清华大学出版社, 2009.
[21] ALOISE D, DESHPANDE A, HANSEN P, et al. NP-hardness of Euclidean sum-of-squares clustering[J]. Machine Learning, 2009, 75(2): 245-248.
[22] DENG Z H, CHOI K S, CHUNG F L, et al. Enhanced soft subspace clustering integrating within-cluster and between-cluster information[J]. Pattern Recognition, 2010, 43(3): 767-781.
[23] LI X, CHEN Z,YANG F. Exploring of clustering algorithm on class-imbalanced data[C] //Proceedings of the 8th International Conference on Computer Science & Education(ICCSE). Columbo, Sri Lanka: IEEE, 2013:89-93.
[24] CHEN L, JIANG Q, WANG S. A probability model for projective clustering on high dimensional data[C] //Eighth IEEE International Conference on Data Mining. Pisa, Italy: IEEE Computer Society, 2008:755-760.
[25] STREHL A, GHOSH J. Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J]. Journal of Machine Learning Research, 2002, 3(3): 583-617.
[26] 陈黎飞, 吴涛. 数据挖掘中的特征约简[M]. 北京: 科学出版社, 2016.

相关文章 15

[1]	王换,周忠眉. 一种基于聚类的过抽样算法[J]. 山东大学学报(工学版), 2018, 48(3): 134-139.
[2]	张佩瑞,杨燕,邢焕来,喻琇瑛. 基于核K-means的增量多视图聚类算法[J]. 山东大学学报(工学版), 2018, 48(3): 48-53.
[3]	读习习,刘华锋,景丽萍. 一种融合社交网络的叠加联合聚类推荐模型[J]. 山东大学学报(工学版), 2018, 48(3): 96-102.
[4]	肖苗苗,魏本征,尹义龙. 基于BFOA和K-means的复合入侵检测算法[J]. 山东大学学报(工学版), 2018, 48(3): 115-119.
[5]	庞人铭,王波,叶昊,张海峰,李明亮. 基于PCA相似度和谱聚类相结合的高炉历史数据聚类[J]. 山东大学学报(工学版), 2017, 47(5): 143-149.
[6]	周旺,张晨麟,吴建鑫. 一种基于Hartigan-Wong和Lloyd的定性平衡聚类算法[J]. 山东大学学报(工学版), 2016, 46(5): 37-44.
[7]	吉兴全,韩国正,李可军,傅荣荣,朱仰贺. 基于密度的改进K均值聚类算法在配网区块划分中的应用[J]. 山东大学学报(工学版), 2016, 46(4): 41-46.
[8]	李朔,石宇良. 基于位置社交网络中地点聚类推荐方法[J]. 山东大学学报(工学版), 2016, 46(3): 44-50.
[9]	江峰,杜军威,刘国柱,眭跃飞. 基于加权的K-modes聚类初始中心选择算法[J]. 山东大学学报(工学版), 2016, 46(2): 29-34.
[10]	樊淑炎, 丁世飞. 基于多尺度的改进Graph cut算法[J]. 山东大学学报(工学版), 2016, 46(1): 28-33.
[11]	徐平安,唐雁,石教开,张辉荣. 基于薛定谔方程的K-Means聚类算法[J]. 山东大学学报(工学版), 2016, 46(1): 34-41.
[12]	马相明, 孙霞, 张强. 轮式装载机典型作业工况构建与分析[J]. 山东大学学报(工学版), 2015, 45(5): 82-87.
[13]	朱红, 丁世飞. 变粒度二次聚类方法[J]. 山东大学学报(工学版), 2015, 45(3): 1-6.
[14]	董红斌, 张广江, 逄锦伟, 韩启龙. 一种基于协同进化方法的聚类集成算法[J]. 山东大学学报(工学版), 2015, 45(2): 1-9.
[15]	浩庆波, 牟少敏, 尹传环, 昌腾腾, 崔文斌. 一种基于聚类的快速局部支持向量机算法[J]. 山东大学学报(工学版), 2015, 45(1): 13-18.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed

非均匀数据的变异系数聚类算法

Coefficient of variation clustering algorithm for non-uniform data

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

多维度评价

本文评价

推荐阅读 0