您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报 (工学版) ›› 2024, Vol. 54 ›› Issue (3): 12-21.doi: 10.6040/j.issn.1672-3961.0.2023.107

• 机器学习与数据挖掘 • 上一篇    下一篇

面向密度峰值聚类的高效相似度度量

王丽娟1,2,徐晓1*,丁世飞1   

  1. 1.中国矿业大学计算机科学技术学院, 江苏 徐州 221116;2.徐州工业职业技术学院信息工程学院, 江苏 徐州 221114
  • 发布日期:2024-06-28
  • 作者简介:王丽娟(1981— )女,江苏赣榆人,博士研究生,主要研究方向为机器学习和聚类分析. E-mail:327732566@qq.com. *通信作者简介:徐晓(1992— )女,江苏海门人,讲师,硕士生导师,博士,主要研究方向为数据挖掘和机器学习. E-mail:xu_xiao@cumt.edu.cn
  • 基金资助:
    国家自然科学基金资助项目(62206296);中央高校基本科研业务费专项资金资助项目(2022QN1095);江苏省高等职业院校专业带头人高端研修资助项目(2022GRFX063)

Efficient similarity measure for density peaks clustering

WANG Lijuan1,2, XU Xiao1*, DING Shifei1   

  1. 1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China;
    2. School of Information Engineering, Xuzhou College of Industrial Technology, Xuzhou 221114, Jiangsu, China
  • Published:2024-06-28

摘要: 针对密度峰值聚类(density peaks clustering, DPC)计算复杂度高的问题,提出一种面向密度峰值聚类的高效相似度度量(efficient similarity measure, ESM)法,通过仅度量最近邻之间的相似度构建不完全相似度矩阵。最近邻的选择基于一个随机第三方数据对象,无需另外引入参数。基于ESM法构建相似度矩阵,提出一种改进的高效密度峰值聚类(efficient density peaks clustering, EDPC)算法,在保持准确率的同时提高DPC识别聚类中心的效率。理论分析和试验结果表明,ESM法通过减少一定不相似的相似度,可以有效提高DPC及其改进算法基于K最近邻的密度峰值聚类(density peaks clustering based on K-nearest neighbors, DPC-KNN)和模糊加权K最近邻密度峰值聚类(fuzzy weighted K-nearest neighbors density peaks clustering, FKNN-DPC)的计算效率,具有较强的可扩展性。

关键词: 密度峰值聚类, 聚类中心, 相似度矩阵, 计算复杂度, 大规模数据集

中图分类号: 

  • TP391
[1] CHEN J G, PHILIP S Y. A domain adaptive density clustering algorithm for data with varying density distribution[J]. IEEE Transactions on Knowledge and Data Engineering, 2021, 33(6):2310-2321.
[2] 贾洪杰, 丁世飞, 史忠植. 求解大规模谱聚类的近似加权核k-means算法[J]. 软件学报, 2015, 26(11):2836-2846. JIA Hongjie, DING Shifei, SHI Zhongzhi. Approximate weighted kernel k-means for large-scale spectral clustering[J]. Journal of Software, 2015, 26(11):2836-2846.
[3] YAN X Q, YE Y D, QIU X Y, et al. Synergetic information bottleneck for joint multi-view and ensemble clustering[J]. Information Fusion, 2020, 56:15-27.
[4] XIE D Y, GAO Q X, WANG Q Q, et al. Adaptive latent similarity learning for multi-view clustering[J]. Neural Networks, 2020, 121:409-418.
[5] QU H, MA T, TONG X Y, et al. Clustering by centroid drift and boundary shrinkage[J]. Pattern Recognition, 2022, 129:108745.
[6] BARANWAL M, SALAPAKA S. Clustering and supervisory voltage control in power systems[J]. International Journal of Electrical Power & Energy Systems, 2019, 109:641-651.
[7] POTHULA K, SMYRNOVA D, SCHRÖDER G. Clustering cryo-EM images of helical protein polymers for helical reconstructions[J]. Ultramicroscopy, 2019, 203:132-138.
[8] SHI Y C, OTTO C, JAIN A. Face clustering: representation and pairwise constraints[J]. IEEE Transactions on Information Forensics and Security, 2018, 13(7):1626-1640.
[9] LLOBELL F, VIGNEAU E, QANNARI E. Clustering datasets by means of CLUSTATIS with identification of atypical datasets. Application to sensometrics[J]. Food Quality and Preference, 2019, 75:97-104.
[10] 史倩玉, 梁吉业, 赵兴旺. 一种不完备混合数据集成聚类算法[J]. 计算机研究与发展, 2016, 53(9):1979-1989. SHI Qianyu, LIANG Jiye, ZHAO Xingwang. A clustering ensemble algorithm for incomplete mixed data[J]. Journal of Computer Research and Development, 2016, 53(9):1979-1989.
[11] CHEN Y W, TANG S Y, BOUGUILA N, et al. A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data[J]. Pattern Recognition, 2018, 83:375-387.
[12] BRYANT A, CIOS K. RNN-DBSCAN: a density-based clustering algorithm using reverse nearest neighbor density estimates[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(6):1109-1121.
[13] RODRÍGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191):1492-1496.
[14] DU M J, DING S F, JIA H J. Study on density peaks clustering based on K-nearest neighbors and principal component analysis[J]. Knowledge-Based Systems, 2016, 99:135-145.
[15] XIE J Y, GAO H C, XIE W X, et al. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J]. Information Sciences, 2016, 354:19-40.
[16] LIU R, WANG H, YU X M. Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Information Sciences, 2018, 450:200-226.
[17] DING S F, DU W, XU X, et al. An improved density peaks clustering algorithm based on natural neighbor with a merging strategy[J]. Information Sciences, 2023, 624:252-276.
[18] WANG M, MIN F, ZHANG Z H, et al. Active learning through density clustering[J]. Expert Systems with Applications, 2017, 85:305-317.
[19] XU J, WANG G Y, LI T R, et al. Fat node leading tree for data stream clustering with density peaks[J]. Knowledge-Based Systems, 2017, 120:99-117.
[20] XU J, WANG G Y, DENG W H. DenPEHC:density peak based efficient hierarchical clustering[J]. Infor-mation Sciences, 2016, 373:200-218.
[21] WU B, WILAMOWSKI B. A fast density and grid based clustering method for data with arbitrary shapes and noise[J]. IEEE Transactions on Industrial Informatics, 2017, 13(4):1620-1628.
[22] 巩树凤, 张岩峰. EDDPC:一种高效的分布式密度中心聚类算法[J]. 计算机研究与发展, 2016, 53(6):1400-1409. GONG Shufeng, ZHANG Yanfeng. EDDPC: an efficient distributed density peaks clustering algorithm[J]. Journal of Computer Research and Development, 2016, 53(6):1400-1409.
[23] XU X, DING S F, DU M J, et al. DPCG: an efficient density peaks clustering algorithm based on grid[J]. International Journal of Machine Learning and Cybernetics, 2018, 9(5):743-754.
[24] 徐晓, 丁世飞, 孙统风, 等. 基于网格筛选的大规模密度峰值聚类算法[J]. 计算机研究与发展, 2018, 55(11):2419-2429. XU Xiao, DING Shifei, SUN Tongfeng, et al. Large-scale density peaks clustering algorithm based on sparse screening[J]. Journal of Computer Research and Development, 2018, 55(11):2419-2429.
[25] 谢娟英, 高红超, 谢维信. K近邻优化的密度峰值快速搜索聚类算法[J]. 中国科学:信息科学, 2016, 46(2):258-280. XIE Juanying, GAO Hongchao, XIE Weixin. K-nearest neighbors optimized clustering algorithm by fast search and finding the density peaks of a dataset[J]. Scientia Sinica Informationis, 2016, 46(2):258-280.
[26] XU T F, JIANG J H. A graph adaptive density peaks clustering algorithm for automatic centroid selection and effective aggregation[J]. Expert Systems with Applications, 2022, 195:116539.
[27] ZHANG Y F, CHEN S M, YU G. Efficient distributed density peaks for clustering large data sets in MapReduce[J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(12):3218-3230.
[28] MEHMOOD R, ZHANG G Z, BIE R F, et al. Clustering by fast search and find of density peaks via heat diffusion[J]. Neurocomputing, 2016, 208:210-217.
[29] DING S F, LI C, XU X, et al. A sampling-based density peaks clustering algorithm for large-scale data[J]. Pattern Recognition, 2023, 136:109238.
[30] RASOOL Z, ZHOU R, CHEN L, et al. Index-based solutions for efficient density peak clustering[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 34(5):2212-2226.
[31] JIANG J, YAN X, YU Z T, et al. A Chinese expert disambiguation method based on semi-supervised graph clustering[J]. International Journal of Machine Learning & Cybernetics, 2015, 6(2):197-204.
[1] 陈素根,赵志忠. 融合局部截断距离及小簇合并的密度峰值聚类[J]. 山东大学学报 (工学版), 2025, 55(2): 58-70.
[2] 张佩瑞,杨燕,邢焕来,喻琇瑛. 基于核K-means的增量多视图聚类算法[J]. 山东大学学报(工学版), 2018, 48(3): 48-53.
[3] 田枫,刘卓炫,尚福华,沈旭昆,王梅,王浩畅. 基于语境相关图传播的图像标注改善方法[J]. 山东大学学报(工学版), 2016, 46(5): 1-6.
[4] 徐平安,唐雁,石教开,张辉荣. 基于薛定谔方程的K-Means聚类算法[J]. 山东大学学报(工学版), 2016, 46(1): 34-41.
[5] 翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 张永花,王安玲,刘福平 . 低频非均匀电磁波在导电界面的反射相角[J]. 山东大学学报(工学版), 2006, 36(2): 22 -25 .
[2] 孔祥臻,刘延俊,王勇,赵秀华 . 气动比例阀的死区补偿与仿真[J]. 山东大学学报(工学版), 2006, 36(1): 99 -102 .
[3] 来翔 . 用胞映射方法讨论一类MKdV方程[J]. 山东大学学报(工学版), 2006, 36(1): 87 -92 .
[4] 余嘉元1 , 田金亭1 , 朱强忠2 . 计算智能在心理学中的应用[J]. 山东大学学报(工学版), 2009, 39(1): 1 -5 .
[5] 陈瑞,李红伟,田靖. 磁极数对径向磁轴承承载力的影响[J]. 山东大学学报(工学版), 2018, 48(2): 81 -85 .
[6] 季涛,高旭,孙同景,薛永端,徐丙垠 . 铁路10 kV自闭/贯通线路故障行波特征分析[J]. 山东大学学报(工学版), 2006, 36(2): 111 -116 .
[7] 秦通,孙丰荣*,王丽梅,王庆浩,李新彩. 基于极大圆盘引导的形状插值实现三维表面重建[J]. 山东大学学报(工学版), 2010, 40(3): 1 -5 .
[8] 孙殿柱,朱昌志,李延瑞 . 散乱点云边界特征快速提取算法[J]. 山东大学学报(工学版), 2009, 39(1): 84 -86 .
[9] 胡天亮,李鹏,张承瑞,左毅 . 基于VHDL的正交编码脉冲电路解码计数器设计[J]. 山东大学学报(工学版), 2008, 38(3): 10 -13 .
[10] 卜德云 张道强. 自适应谱聚类算法研究[J]. 山东大学学报(工学版), 2009, 39(5): 22 -26 .