一种基于Hartigan-Wong和Lloyd的定性平衡聚类算法

doi:10.6040/j.issn.1672-3961.1.2016.031

摘要/Abstract

摘要： 基于传统的Hartigan-Wong聚类算法会产生不平衡聚类结果的缺点,提出一种新的聚类算法Charl,这种算法会改进聚类结果的平衡性但不要求绝对平衡。结合Lloyd算法和Hartigan-Wong算法的思想,Charl算法采用一种自适应性的动态调整策略来调整平衡程度。跟Lloyd算法一样,Charl算法以批处理的方式更新中心,所以具有计算高效的性质。在13个数据集上进行的试验表明,Charl方法不仅产生了平衡的聚类结果,并且同时得到了比Lloyd算法更低的代价函数值和更好的聚类性能(聚类准确率、归一化互信息、聚类时间等)。这种定性平衡聚类算法也明显优于严格平衡的聚类算法。

关键词: 定性平衡, Lloyd, 平衡聚类, Hartigan-Wong, 机器学习

Abstract: The traditional Hartigan-Wong clustering algorithm could cause the unbalanced clustering problem. To solve this problem, Charl which is a novel qualitative balanced clustering method was proposed to improve the balance level while the absolute balance was not required. Charl combined ideas from both the Lloyds method and the Hartigan-Wong method, Charl proposed an adaptive tuning strategy to tune the balance level. This algorithm was a batch processing method, which shared the efficiency benefits of the Lloyds method. Experiments on 13 benchmark datasets showed that Charl not only produced more balanced output groups, but also achieved lower cost values and higher clustering performances(in terms of accuracy, normal mutual information and time cost)than the Lloyds method. This qualitative balancing method also outperformed the quantitative balanced clustering method by a large margin.

Key words: qualitative balancing, Hartigan-Wong, balanced clustering, Lloyd, machine learning

中图分类号:

TP391

周旺,张晨麟,吴建鑫. 一种基于Hartigan-Wong和Lloyd的定性平衡聚类算法[J]. 山东大学学报(工学版), 2016, 46(5): 37-44.

ZHOU Wang, ZHANG Chenlin, WU Jianxin. Qualitative balanced clustering algorithm based on Hartigan-Wong and Lloyd[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2016, 46(5): 37-44.

参考文献

[1] LLOYD S. Least squares quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2):129-137.
[2] BISHOP C M. Pattern recognition and machine learning[M]. Singapore: Springer, 2006.
[3] 贺玲, 吴玲达, 蔡益朝. 数据挖掘中的聚类算法综述[J]. 计算机应用研究, 2007, 24(1):10-13. HE Ling, WU Lingda, CAI Yichao. Survey of clustering algorithms in data mining[J]. Application Research of Computers, 2007, 24(1):10-13.
[4] 孙吉贵,刘杰,赵连宇.聚类算法研究[J]. 软件学报, 2008,19(1): 48-61. SUN Jigui, LIU Jie, ZHAO Lianyu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1):48-61.
[5] CHEN X, DENG C. Large scale spectral clustering with landmark-based representation[C] //Proceedings of the 25th AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI, 2011:313-318.
[6] HUANG J, NIE F, HUANG H. Spectral rotation versus k-means in spectral clustering [C] //Proceedings of the 27th AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI, 2013: 431-437.
[7] 周林, 平西建, 徐森, 等. 基于谱聚类的聚类集成算法[J]. 自动化学报, 2012, 38(8):1335-1342. ZHOU Lin, PING Xijian, XU Sen, et al. Cluster ensemble based on spectral clustering[J]. Acta Automatica Sinica, 2012, 38(8):1335-1342.
[8] WU J, LIU H, XIONG H, et al. A theoretic framework of k-means-based consensus clustering[C] //Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013:1799-1805.
[9] XU R, WUNSCH D. Survey of clustering algorithms[J]. IEEE Transactions on Neural Networks, 2005, 16(3):645-678.
[10] 王守强,朱大铭,史士英. K-means聚类问题的改进近似算法[J].山东大学学报(工学版), 2011, 41(4):125-132. WANG Shouqiang, ZHU Daming, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University(Engineering Science), 2011, 41(4):125-132.
[11] 贾洪杰,丁世飞,史忠植.求解大规模谱聚类的近似加权K-means算法[J].软件学报, 2015, 26(11):2836-2846. JIA Hongjie, DING Shifei, SHI Zhongzhi. Approximate weighted kernel K-means for large-scale spectral clustering[J]. Journal of Software, 2015, 26(11):2836- 2846.
[12] 雷小锋, 谢昆青, 林帆,等. 一种基于 K-Means 局部最优性的高效聚类算法[J]. 软件学报, 2008, 19(7): 1683-1692. LEI Xiaofeng, XIE Kunqing, LIN Fan, et al. An efficient clustering algorithm based on local optimality of K-means[J]. Journal of Software, 2008, 19(7):1683-1692.
[13] LI F, PERONA P. A bayesian hierarchical model for learning natural scene categories[C] //Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005:524-531.
[14] JURIE F, TRIGGS B. Creating efficient codebooks for visual recognition[C] //Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005:604-610.
[15] WU J, REHG J M. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel[C] //Proceedings of the 12th Inter- national Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 630-637.
[16] WU J, TAN W C, REHG J M. Efficient and effective visual codebook generation using additive kernels[J]. Journal of Machine Learning Research, 2011,12(9):3097-3118.
[17] BANERJEE A, GHOSH J. Scalable clustering algorithms with balancing constraints[J]. Data Mining and Knowledge Discovery, 2006, 13(3):365-395.
[18] ZHU S, WANG D, LI T. Data clustering with size constraints[J]. Knowledge-Based Systems, 2010, 23(8):883-889.
[19] HARTIGAN J A, WONG M A. Algorithm AS 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society(Series C, Applied Statistics), 1979, 28(1):100-108.
[20] TELGARSKY M, VATTANI A. Hartigans Method: k-means clustering without Voronoi[C] //Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: JMLR, 2010:820-827.
[21] SLONIM N, AHARONI E, CRAMMER K. Hartigans k-means versus Lloyds k-means: is it time for a change?[C] //Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013:1677-1684.
[22] PAPADIMITRIOU C H, STEIGLITZ K. Combinatorial optimization: algorithms and complexity[M]. New York, USA: Dover Publications, 1998.
[23] CAI D, HE X, HAN J. Document clustering using locality preserving indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12):1624-1637.
[24] RAND W M. Objective criteria for the evaluation of clustering methods[J]. Journal of the American Statistical Association, 1971, 66(336):846-850.
[25] JEGOU H, DOUZE M, SCHMID C. Product quantization for nearest neighbor search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(1):117-128.

相关文章 15

[1]	祝明,石承龙,吕潘,刘现荣,孙驰,陈建城,范宏运. 基于优化长短时记忆网络的深基坑变形预测方法及其工程应用[J]. 山东大学学报 (工学版), 2025, 55(3): 141-148.
[2]	常新功,苏敏惠,周志刚. 基于进化集成的图神经网络解释方法[J]. 山东大学学报 (工学版), 2024, 54(4): 1-12.
[3]	乔慧妍,段学龙,解驰皓,赵冬慧,马玉玲. 基于异常点检测的心理健康辅助诊断方法[J]. 山东大学学报 (工学版), 2024, 54(4): 76-85.
[4]	刘新,刘冬兰,付婷,王勇,常英贤,姚洪磊,罗昕,王睿,张昊. 基于联邦学习的时间序列预测算法[J]. 山东大学学报 (工学版), 2024, 54(3): 55-63.
[5]	岳仁峰,张嘉琦,刘勇,范学忠,李琮琮,孔令鑫. 基于颜色和纹理特征的立体车库锈蚀检测技术[J]. 山东大学学报 (工学版), 2024, 54(3): 64-69.
[6]	陈成,董永权,贾瑞,刘源. 基于交互序列特征相关性的可解释知识追踪[J]. 山东大学学报 (工学版), 2024, 54(1): 100-108.
[7]	卞小曼,王小琴,蓝如师,刘振丙,罗笑南. 基于相似性保持和判别性分析的快速视频哈希算法[J]. 山东大学学报 (工学版), 2023, 53(6): 63-69.
[8]	李鸿钊,张庆松,刘人太,陈新,辛勤,石乐乐. 浅埋地铁车站施工期地表变形风险预警[J]. 山东大学学报 (工学版), 2023, 53(6): 82-91.
[9]	袁高腾,周晓峰,郭宏乐. 基于特征选择算法的ECG信号分类[J]. 山东大学学报 (工学版), 2022, 52(4): 38-44.
[10]	聂秀山,马玉玲,乔慧妍,郭杰,崔超然,于志云,刘兴波,尹义龙. 任务粒度视角下的学生成绩预测研究综述[J]. 山东大学学报 (工学版), 2022, 52(2): 1-14.
[11]	孙鸿昌,周风余,单明珠,翟文文,牛兰强. 基于模式划分的空调能耗混合填补方法[J]. 山东大学学报 (工学版), 2022, 52(1): 9-18.
[12]	袁高腾,刘毅慧,黄伟,胡兵. 基于Gabor特征的乳腺肿瘤MR图像分类识别模型[J]. 山东大学学报 (工学版), 2020, 50(3): 15-23.
[13]	高铭壑,张莹,张蓉蓉,黄子豪,黄琳焱,李繁菀,张昕,王彦浩. 基于预测数据特征的空气质量预测方法[J]. 山东大学学报 (工学版), 2020, 50(2): 91-99.
[14]	张大鹏,刘雅军,张伟,沈芬,杨建盛. 基于异质集成学习的虚假评论检测[J]. 山东大学学报 (工学版), 2020, 50(2): 1-9.
[15]	刘玉田, 孙润稼, 王洪涛, 顾雪平. 人工智能在电力系统恢复中的应用综述[J]. 山东大学学报 (工学版), 2019, 49(5): 1-8.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed