非均匀数据的变异系数聚类算法

doi:10.6040/j.issn.1672-3961.0.2017.410

山东大学学报(工学版) ›› 2018, Vol. 48 ›› Issue (3): 140-145.doi: 10.6040/j.issn.1672-3961.0.2017.410

• • 上一篇

非均匀数据的变异系数聚类算法

杨天鹏¹,徐鲲鹏¹,陈黎飞^1,2*

1. 福建师范大学数学与信息学院, 福建福州 350117;2. 数字福建环境监测物联网实验室, 福建福州 350117

收稿日期:2017-08-24 出版日期:2018-06-20 发布日期:2017-08-24
通讯作者: 陈黎飞(1972— ),男,福建长乐人,教授,博导,博士,主要研究方向为统计机器学习,数据挖掘,模式识别.E-mail: clfei@fjnu.edu.cn E-mail:yangplace@163.com
作者简介:杨天鹏(1991— ),男,湖北十堰人,硕士研究生,主要研究方向为数据挖掘.E-mail:yangplace@163.com
基金资助:
国家自然科学基金资助项目(61175123);福建省自然科学基金资助项目(2015J01238);福建师范大学创新团队资助项目(IRTL1704)

Coefficient of variation clustering algorithm for non-uniform data

YANG Tianpeng¹, XU Kunpeng¹, CHEN Lifei^1,2*

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, Fujian, China;
2. Digit Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou 350117, Fujian, China

Received:2017-08-24 Online:2018-06-20 Published:2017-08-24

摘要/Abstract

摘要： 针对现有基于划分的聚类算法无法有效聚类簇大小和簇密度有较大差异的非均匀数据的问题,提出一种基于变异系数聚类算法。从聚类优化目标的角度出发,分析了以K-means为代表的划分聚类算法引发“均匀效应”的成因;提出以变异系数度量非均匀数据的分布散度,并基于变异系数定义一种非均匀数据的相异度公式;基于相异度公式定义了聚类目标优化函数,并根据局部优化方法给出聚类算法过程。在合成和真实数据集上的试验结果表明,与K-means、Verify2、ESSC聚类算法相比,本研究提出的非均匀数据的变异系数聚类算法(coefficient of variation clustering for non-uniform data, CVCN)聚类精度提升5%~40%。

关键词: 基于划分聚类, 非均匀数据, 均匀效应, 聚类, K-means, 变异系数

Abstract: Affected by the “uniform effect”, a problem existed in the partition-based algorithms remained on open and challenging taskdue to handling. To solve this problem, a clustering algorithm based on coefficient of variation was proposed. The “uniform effect” caused by K-means-type partitioning clustering algorithm from the view of clustering optimization was analyzed. Instead of the squared error, a new measure of dispersion for non-uniform data was proposed relied on the coefficient of variation. The clustering objective optimization function was defined using a new non-uniform data dissimilarity formula, which was proposed based on the coefficient of variation. According to the local optimization method, the clustering algorithm process was given. The experimental results on real and synthetic non-uniform datasets showed that the clustering accuracy of CVCN was better than K-means, Verify2, ESSC.

Key words: clustering, partition-based clustering, coefficient of variation, K-means, uniform effect, non-uniform data

中图分类号:

TP311

杨天鹏,徐鲲鹏,陈黎飞. 非均匀数据的变异系数聚类算法[J]. 山东大学学报(工学版), 2018, 48(3): 140-145.

YANG Tianpeng, XU Kunpeng, CHEN Lifei. Coefficient of variation clustering algorithm for non-uniform data[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2018, 48(3): 140-145.

参考文献

[1] 韩家炜,坎伯,裴健.数据挖掘:概念与技术[M]. 3版. 范明,孟小峰,译.北京: 机械工业出版社, 2012.
[2] BERKHIN P. A survey of clustering data mining techniques[J]. Grouping Multidimensional Data, 2002, 43(1): 25-71.
[3] 孙吉贵.刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1): 48-61. SUN Jigui, LIU Jie, ZHAO Lianyu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1): 48-61.
[4] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review[J]. Acm Computing Surveys, 1999, 31(3): 264-323.
[5] AGGARWAL C C, REDDY C K. Data clustering: algorithms and applications[M]. Boca Raton: CRC press, 2013.
[6] HE H, GARCIA E A. Learning from imbalanced data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9): 1263-1284.
[7] KRAWCZYK B. Learning from imbalanced data: open challenges and future directions[J]. Progress in Artificial Intelligence, 2016, 5(4): 1-12.
[8] HARTIGAN J A, WONG M A. Algorithm as 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society Series C:Applied Statistics, 1979, 28(1): 100-108.
[9] XIONG H, WU J, CHEN J. K-means clustering versus validation measures: a data-distribution perspective[J]. IEEE Transactions on Systems, Man, and Cybernetics: Part B: Cybernetics, 2009, 39(2): 318-331.
[10] WU J, XIONG H, CHEN J. Adapting the right measures for K-means clustering[C] //Proceedings of the the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM,2009: 877-886.
[11] KUMAR C N S, RAO K N, GOVARDHAN A. An empirical comparative study of novel clustering algorithms for class imbalance learning[C] //Proceedings of the Second International Conference on Computer and Communication Technologies(IC3T). Hyderabad, India: Springer India, 2016:181-191.
[12] KUMAR N S, RAO K N, GOVARDHAN A, et al. Undersampled K-means approach for handling imbalanced distributed data[J]. Progress in Artificial Intelligence, 2014, 3(1): 29-38.
[13] LIANG J, BAI L, DANG C, et al. The K-means-type algorithms versus imbalanced data distributions[J]. IEEE Transactions on Fuzzy Systems, 2012, 20(4): 728-745.
[14] MAHAJAN M, NIMBHORKAR P, VARADARAJAN K. The planar K-means problem is NP-hard[J]. Theoretical Computer Science, 2009, 442(8): 274-285.
[15] XU L, JORDAN M I. On convergence properties of the EM algorithm for Gaussian mixtures[J]. Neural Computation, 1996, 8(1): 129-151.
[16] MCLACHLAN G J, KRISHNAN T. The EM Algorithm and Extensions, Second Edition[M]. New York:[s.n.] , 2007.
[17] JAIN A K. Data clustering: 50 years beyond K-means[J]. Pattern Recognition Letters, 2010, 31(8): 651-666.
[18] BROWN C E. Applied multivariate statistics in geohydrology and related sciences[M]. Berlin: Springer, 1998.
[19] EVERITT B. Cambridge dictionary of statistics[M]. Cambridge:Cambridge University Press, 2002.
[20] 齐敏. 模式识别导论[M]. 北京:清华大学出版社, 2009.
[21] ALOISE D, DESHPANDE A, HANSEN P, et al. NP-hardness of Euclidean sum-of-squares clustering[J]. Machine Learning, 2009, 75(2): 245-248.
[22] DENG Z H, CHOI K S, CHUNG F L, et al. Enhanced soft subspace clustering integrating within-cluster and between-cluster information[J]. Pattern Recognition, 2010, 43(3): 767-781.
[23] LI X, CHEN Z,YANG F. Exploring of clustering algorithm on class-imbalanced data[C] //Proceedings of the 8th International Conference on Computer Science & Education(ICCSE). Columbo, Sri Lanka: IEEE, 2013:89-93.
[24] CHEN L, JIANG Q, WANG S. A probability model for projective clustering on high dimensional data[C] //Eighth IEEE International Conference on Data Mining. Pisa, Italy: IEEE Computer Society, 2008:755-760.
[25] STREHL A, GHOSH J. Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J]. Journal of Machine Learning Research, 2002, 3(3): 583-617.
[26] 陈黎飞, 吴涛. 数据挖掘中的特征约简[M]. 北京: 科学出版社, 2016.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed

非均匀数据的变异系数聚类算法

Coefficient of variation clustering algorithm for non-uniform data

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

多维度评价

本文评价

推荐阅读 10

[1]	李晓辉,刘小飞,孙炜桐,赵毅,董媛,靳引利. 基于车辆与无人机协同的巡检任务分配与路径规划算法[J]. 山东大学学报 (工学版), 2025, 55(5): 101-109.
[2]	陈素根,赵志忠. 融合局部截断距离及小簇合并的密度峰值聚类[J]. 山东大学学报 (工学版), 2025, 55(2): 58-70.
[3]	王梅,宋凯文,刘勇,王志宝,万达. DMKK-means——一种深度多核K-means聚类算法[J]. 山东大学学报 (工学版), 2024, 54(6): 1-7.
[4]	王丽娟,徐晓,丁世飞. 面向密度峰值聚类的高效相似度度量[J]. 山东大学学报 (工学版), 2024, 54(3): 12-21.
[5]	张鑫,费可可. 基于log鲁棒核岭回归的子空间聚类算法[J]. 山东大学学报 (工学版), 2023, 53(6): 26-34.
[6]	李兆彬,叶军,周浩岩,卢岚,谢立. 变异萤火虫优化的粗糙K-均值聚类算法[J]. 山东大学学报 (工学版), 2023, 53(4): 74-82.
[7]	侯延琛,赵金东. 任意形状聚类的SPK-means算法[J]. 山东大学学报 (工学版), 2023, 53(2): 87-92.
[8]	程业超,刘惊雷. 自适应图正则的单步子空间聚类[J]. 山东大学学报 (工学版), 2022, 52(2): 57-66.
[9]	卢建云,张蔚,李林. 一种基于动态局部密度和聚类结构的聚类算法[J]. 山东大学学报 (工学版), 2022, 52(2): 118-127.
[10]	孟银凤,杨佳宇,曹付元. 函数型数据的分裂转移式层次聚类算法[J]. 山东大学学报 (工学版), 2022, 52(1): 19-27.
[11]	朱恒东, 马盈仓, 代雪珍. 自适应半监督邻域聚类算法[J]. 山东大学学报 (工学版), 2021, 51(4): 24-34.
[12]	朱昌明,岳闻,王盼红,沈震宇,周日贵. 主动三支聚类下的全局和局部多视角多标签学习算法[J]. 山东大学学报 (工学版), 2021, 51(2): 34-46.
[13]	解子奇,王立宏,李嫚. 块对角子空间聚类中成对约束的主动式学习[J]. 山东大学学报 (工学版), 2021, 51(2): 65-73.
[14]	李蓓,赵松,谢志佳,牛萌. 电动汽车虚拟储能可用容量建模[J]. 山东大学学报 (工学版), 2020, 50(6): 101-111.
[15]	董新宇,陈瀚阅,李家国,孟庆岩,邢世和,张黎明. 基于多方法融合的非监督彩色图像分割[J]. 山东大学学报 (工学版), 2019, 49(2): 96-101.