Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48, Issue (3): 140-145. doi: 10.6040/j.issn.1672-3961.0.2017.410
YANG Tianpeng1, XU Kunpeng1, CHEN Lifei1,2*
Abstract: Existing partition-based clustering algorithms fail to effectively cluster non-uniform data whose clusters differ considerably in size and density. To address this problem, a clustering algorithm based on the coefficient of variation is proposed. Starting from the clustering optimization objective, the cause of the "uniform effect" in partitional clustering algorithms typified by K-means is analyzed; the coefficient of variation is then adopted to measure the dispersion of non-uniform data, and a dissimilarity formula for such data is defined on this basis; from this dissimilarity measure, a clustering objective function is constructed, and a clustering procedure is derived using a local optimization method. Experimental results on synthetic and real-world datasets show that, compared with the K-means, Verify2, and ESSC algorithms, the proposed coefficient of variation clustering for non-uniform data (CVCN) improves clustering accuracy by 5% to 40%.
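The coefficient of variation (CV) at the heart of the proposed approach is simply the standard deviation normalized by the mean, which makes it a scale-free measure of dispersion. The Python sketch below illustrates this idea; note that `cv_dissimilarity` is a hypothetical illustration of a CV-weighted distance for imbalanced clusters, assumed for exposition only, and is not the paper's actual dissimilarity formula.

```python
import numpy as np

def coefficient_of_variation(x):
    """Coefficient of variation: standard deviation divided by the mean.
    Being scale-free, it compares dispersion across clusters whose
    sizes and densities differ widely."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

def cv_dissimilarity(point, cluster):
    """Hypothetical CV-weighted dissimilarity (illustration only):
    per-dimension squared distance to the cluster center, scaled by
    each dimension's coefficient of variation, so dimensions with
    higher relative spread contribute proportionally more."""
    point = np.asarray(point, dtype=float)
    cluster = np.asarray(cluster, dtype=float)
    center = cluster.mean(axis=0)
    cv = cluster.std(axis=0) / np.abs(center)  # per-dimension CV
    return float(np.sum(((point - center) ** 2) * cv))
```

A dissimilarity of this shape could replace the plain squared Euclidean distance inside a K-means-style assignment step, which is one way a partitional algorithm can avoid splitting large sparse clusters while merging small dense ones (the "uniform effect" discussed above).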