您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2016, Vol. 46 ›› Issue (5): 37-44.doi: 10.6040/j.issn.1672-3961.1.2016.031

• • 上一篇    下一篇

一种基于Hartigan-Wong和Lloyd的定性平衡聚类算法

周旺,张晨麟,吴建鑫   

  1. 南京大学计算机软件新技术国家重点试验室, 江苏 南京 210046
  • 收稿日期:2016-03-01 出版日期:2016-10-20 发布日期:2016-03-01
  • 作者简介:周旺(1992— ),男,福建福州人,硕士研究生,主要研究方向为计算机视觉. E-mail: zhouw@lamda.nju.edu.cn
  • 基金资助:
    国家自然科学基金优秀青年科学基金资助项目(61422203)、中央高校基本科研业务费专项资金资助项目(20620140498)

Qualitative balanced clustering algorithm based on Hartigan-Wong and Lloyd

ZHOU Wang, ZHANG Chenlin, WU Jianxin   

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, Jiangsu, China
  • Received:2016-03-01 Online:2016-10-20 Published:2016-03-01

摘要: 基于传统的Hartigan-Wong聚类算法会产生不平衡聚类结果的缺点,提出一种新的聚类算法Charl,这种算法会改进聚类结果的平衡性但不要求绝对平衡。 结合Lloyd算法和Hartigan-Wong算法的思想,Charl算法采用一种自适应性的动态调整策略来调整平衡程度。跟Lloyd算法一样,Charl算法以批处理的方式更新中心,所以具有计算高效的性质。在13个数据集上进行的试验表明,Charl方法不仅产生了平衡的聚类结果,并且同时得到了比Lloyd算法更低的代价函数值和更好的聚类性能(聚类准确率、归一化互信息、聚类时间等)。这种定性平衡聚类算法也明显优于严格平衡的聚类算法。

关键词: 定性平衡, Lloyd, 平衡聚类, Hartigan-Wong, 机器学习

Abstract: The traditional Hartigan-Wong clustering algorithm could cause the unbalanced clustering problem. To solve this problem, Charl which is a novel qualitative balanced clustering method was proposed to improve the balance level while the absolute balance was not required. Charl combined ideas from both the Lloyds method and the Hartigan-Wong method, Charl proposed an adaptive tuning strategy to tune the balance level. This algorithm was a batch processing method, which shared the efficiency benefits of the Lloyds method. Experiments on 13 benchmark datasets showed that Charl not only produced more balanced output groups, but also achieved lower cost values and higher clustering performances(in terms of accuracy, normal mutual information and time cost)than the Lloyds method. This qualitative balancing method also outperformed the quantitative balanced clustering method by a large margin.

Key words: qualitative balancing, Hartigan-Wong, balanced clustering, Lloyd, machine learning

中图分类号: 

  • TP391
[1] LLOYD S. Least squares quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2):129-137.
[2] BISHOP C M. Pattern recognition and machine learning[M]. Singapore: Springer, 2006.
[3] 贺玲, 吴玲达, 蔡益朝. 数据挖掘中的聚类算法综述[J]. 计算机应用研究, 2007, 24(1):10-13. HE Ling, WU Lingda, CAI Yichao. Survey of clustering algorithms in data mining[J]. Application Research of Computers, 2007, 24(1):10-13.
[4] 孙吉贵,刘杰,赵连宇.聚类算法研究[J]. 软件学报, 2008,19(1): 48-61. SUN Jigui, LIU Jie, ZHAO Lianyu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1):48-61.
[5] CHEN X, DENG C. Large scale spectral clustering with landmark-based representation[C] //Proceedings of the 25th AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI, 2011:313-318.
[6] HUANG J, NIE F, HUANG H. Spectral rotation versus k-means in spectral clustering [C] //Proceedings of the 27th AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI, 2013: 431-437.
[7] 周林, 平西建, 徐森, 等. 基于谱聚类的聚类集成算法[J]. 自动化学报, 2012, 38(8):1335-1342. ZHOU Lin, PING Xijian, XU Sen, et al. Cluster ensemble based on spectral clustering[J]. Acta Automatica Sinica, 2012, 38(8):1335-1342.
[8] WU J, LIU H, XIONG H, et al. A theoretic framework of k-means-based consensus clustering[C] //Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013:1799-1805.
[9] XU R, WUNSCH D. Survey of clustering algorithms[J]. IEEE Transactions on Neural Networks, 2005, 16(3):645-678.
[10] 王守强,朱大铭,史士英. K-means聚类问题的改进近似算法[J].山东大学学报(工学版), 2011, 41(4):125-132. WANG Shouqiang, ZHU Daming, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University(Engineering Science), 2011, 41(4):125-132.
[11] 贾洪杰,丁世飞,史忠植.求解大规模谱聚类的近似加权K-means算法[J].软件学报, 2015, 26(11):2836-2846. JIA Hongjie, DING Shifei, SHI Zhongzhi. Approximate weighted kernel K-means for large-scale spectral clustering[J]. Journal of Software, 2015, 26(11):2836- 2846.
[12] 雷小锋, 谢昆青, 林帆,等. 一种基于 K-Means 局部最优性的高效聚类算法[J]. 软件学报, 2008, 19(7): 1683-1692. LEI Xiaofeng, XIE Kunqing, LIN Fan, et al. An efficient clustering algorithm based on local optimality of K-means[J]. Journal of Software, 2008, 19(7):1683-1692.
[13] LI F, PERONA P. A bayesian hierarchical model for learning natural scene categories[C] //Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005:524-531.
[14] JURIE F, TRIGGS B. Creating efficient codebooks for visual recognition[C] //Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005:604-610.
[15] WU J, REHG J M. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel[C] //Proceedings of the 12th Inter- national Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 630-637.
[16] WU J, TAN W C, REHG J M. Efficient and effective visual codebook generation using additive kernels[J]. Journal of Machine Learning Research, 2011,12(9):3097-3118.
[17] BANERJEE A, GHOSH J. Scalable clustering algorithms with balancing constraints[J]. Data Mining and Knowledge Discovery, 2006, 13(3):365-395.
[18] ZHU S, WANG D, LI T. Data clustering with size constraints[J]. Knowledge-Based Systems, 2010, 23(8):883-889.
[19] HARTIGAN J A, WONG M A. Algorithm AS 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society(Series C, Applied Statistics), 1979, 28(1):100-108.
[20] TELGARSKY M, VATTANI A. Hartigans Method: k-means clustering without Voronoi[C] //Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: JMLR, 2010:820-827.
[21] SLONIM N, AHARONI E, CRAMMER K. Hartigans k-means versus Lloyds k-means: is it time for a change?[C] //Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013:1677-1684.
[22] PAPADIMITRIOU C H, STEIGLITZ K. Combinatorial optimization: algorithms and complexity[M]. New York, USA: Dover Publications, 1998.
[23] CAI D, HE X, HAN J. Document clustering using locality preserving indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12):1624-1637.
[24] RAND W M. Objective criteria for the evaluation of clustering methods[J]. Journal of the American Statistical Association, 1971, 66(336):846-850.
[25] JEGOU H, DOUZE M, SCHMID C. Product quantization for nearest neighbor search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(1):117-128.
[1] 张冕,黄颖,梅海艺,郭毓. 基于Kinect的配电作业机器人智能人机交互方法[J]. 山东大学学报 (工学版), 2018, 48(5): 103-108.
[2] 刘洋,刘博,王峰. 基于Parameter Server框架的大数据挖掘优化算法[J]. 山东大学学报(工学版), 2017, 47(4): 1-6.
[3] 魏波,张文生,李元香,夏学文,吕敬钦. 一种选择特征的稀疏在线学习算法[J]. 山东大学学报(工学版), 2017, 47(1): 22-27.
[4] 孟令恒,丁世飞. 基于单静态图像的深度感知模型[J]. 山东大学学报(工学版), 2016, 46(3): 37-43.
[5] 刘杰, 杨鹏, 吕文生, 刘阿古达木, 刘俊秀. 基于气象因素的PM2.5质量浓度预测模型[J]. 山东大学学报(工学版), 2015, 45(6): 76-83.
[6] 郑毅, 朱成璋. 基于深度信念网络的PM2.5预测[J]. 山东大学学报(工学版), 2014, 44(6): 19-25.
[7] 谢琳1,殷熙尧2,李凡长3,吴佳3. 一种逆归结学习表示[J]. 山东大学学报(工学版), 2013, 43(4): 46-50.
[8] 何雪英1,2, 秦伟1, 尹义龙1*, 赵联征1,乔昊3. 基于机器学习的视频指纹识别[J]. 山东大学学报(工学版), 2011, 41(4): 29-33.
[9] 梁春林1,彭凌西2*. 基于免疫网络的无监督式分类算法[J]. 山东大学学报(工学版), 2010, 40(5): 82-86.
[10] 郭茂祖 邹权 李文滨 韩英鹏. 生物信息学中的学习问题[J]. 山东大学学报(工学版), 2009, 39(3): 1-6.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 程代展,李志强. 非线性系统线性化综述(英文)[J]. 山东大学学报(工学版), 2009, 39(2): 26 -36 .
[2] 王勇, 谢玉东.

大流量管道煤气的控制技术研究

[J]. 山东大学学报(工学版), 2009, 39(2): 70 -74 .
[3] 刘新1 ,宋思利1 ,王新洪2 . 石墨配比对钨极氩弧熔敷层TiC增强相含量及分布形态的影响[J]. 山东大学学报(工学版), 2009, 39(2): 98 -100 .
[4] 田芳1,张颖欣2,张礼3,侯秀萍3,裘南畹3. 新型金属氧化物薄膜气敏元件基材料的开发[J]. 山东大学学报(工学版), 2009, 39(2): 104 -107 .
[5] 陈华鑫, 陈拴发, 王秉纲. 基质沥青老化行为与老化机理[J]. 山东大学学报(工学版), 2009, 39(2): 125 -130 .
[6] 赵延风1,2, 王正中1,2 ,芦琴1,祝晗英3 . 梯形明渠水跃共轭水深的直接计算方法[J]. 山东大学学报(工学版), 2009, 39(2): 131 -136 .
[7] 李士进,王声特,黄乐平. 基于正反向异质性的遥感图像变化检测[J]. 山东大学学报(工学版), 2018, 48(3): 1 -9 .
[8] 赵科军 王新军 刘洋 仇一泓. 基于结构化覆盖网的连续 top-k 联接查询算法[J]. 山东大学学报(工学版), 2009, 39(5): 32 -37 .
[9] 赵治广,王登杰,田云飞 . 基于灰色理论的路基沉降研究[J]. 山东大学学报(工学版), 2007, 37(3): 86 -88 .
[10] 姚占勇,商庆森,赵之仲,贾朝霞 . 界面条件对半刚性沥青路面结构应力分布的影响[J]. 山东大学学报(工学版), 2007, 37(3): 93 -99 .