A qualitative balanced clustering algorithm based on Hartigan-Wong and Lloyd
Journal of Shandong University (Engineering Science), 2016, Vol. 46, Issue (5): 37-44. DOI: 10.6040/j.issn.1672-3961.1.2016.031

### Cite This Article

ZHOU Wang, ZHANG Chenlin, WU Jianxin. Qualitative balanced clustering algorithm based on Hartigan-Wong and Lloyd[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2016, 46(5): 37-44. DOI: 10.6040/j.issn.1672-3961.1.2016.031.


Qualitative balanced clustering algorithm based on Hartigan-Wong and Lloyd
ZHOU Wang, ZHANG Chenlin, WU Jianxin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, Jiangsu, China
Abstract: The traditional Hartigan-Wong clustering algorithm can produce unbalanced clusters. To solve this problem, Charl, a novel qualitative balanced clustering method, was proposed to improve the balance level without requiring absolute balance. Charl combines ideas from both Lloyd's method and the Hartigan-Wong method, and introduces an adaptive tuning strategy to adjust the balance level. The algorithm works in batch mode and therefore shares the efficiency benefits of Lloyd's method. Experiments on 13 benchmark datasets showed that Charl not only produced more balanced output groups, but also achieved lower cost values and better clustering performance (in terms of accuracy, normalized mutual information and time cost) than Lloyd's method. This qualitative balancing method also outperformed the quantitative balanced clustering method by a large margin.
Key words: balanced clustering    qualitative balancing    Hartigan-Wong    Lloyd    machine learning
0 Introduction

1 The Hartigan-Wong and Lloyd Algorithms

1.1 The Lloyd Algorithm

The K-means algorithm minimizes the cost function

 $J=\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{K}r_{ik}\left\|x_{i}-\mu_{k}\right\|^{2}.$ (1)

 $r_{ij}=\begin{cases}1, & \text{if } j=\underset{1\le k\le K}{\arg\min}\left\|x_{i}-\mu_{k}\right\|^{2};\\0, & \text{otherwise}.\end{cases}$ (2)
 $\mu_{k}=\frac{\sum\limits_{i=1}^{n}r_{ik}x_{i}}{\sum\limits_{i=1}^{n}r_{ik}}=\frac{\sum\limits_{i=1}^{n}r_{ik}x_{i}}{n_{k}}.$ (3)

The Lloyd algorithm works in batch mode, alternating two phases to determine the values of $r_{ik}$ and $\mu_k$.

The Lloyd algorithm is simple and effective, but its output is often unbalanced: for different $k$ $(1\le k\le K)$, $n_k$ may fluctuate over a wide range[14-16].
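The two alternating phases can be expressed in a few lines of NumPy (a minimal sketch, not the paper's implementation; keeping the old center for an emptied cluster is our own choice):

```python
import numpy as np

def lloyd(X, K, T=100, seed=0):
    """Minimal sketch of Lloyd's batch K-means (Eqs. (1)-(3))."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)  # initial centers
    for _ in range(T):
        # Assignment step, Eq. (2): each point goes to its nearest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Update step, Eq. (3): each center becomes the mean of its members
        # (an emptied cluster keeps its old center -- our own choice).
        new_mu = np.array([X[labels == k].mean(0) if (labels == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu
```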

1.2 The Hartigan-Wong Algorithm

 $r_{ij}=\begin{cases}1, & \text{if } j=\underset{1\le k\le K}{\arg\min}\frac{n_{k}}{n_{k}+1}\left\|x_{i}-\mu_{k}\right\|^{2};\\0, & \text{otherwise}.\end{cases}$ (4)

 $\left\|x_{i}-\mu_{1}\right\|^{2}=10,\ n_{1}=5.$ (5)
 $\left\|x_{i}-\mu_{2}\right\|^{2}=9,\ n_{2}=17.$ (6)
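The numbers in Eqs. (5) and (6) illustrate the effect of the weighting in Eq. (4): the point $x_i$ is closer to $\mu_2$ (9 < 10), but cluster 2 is much larger ($n_2=17$). A short script makes the comparison explicit (indices are 0-based, so cluster 1 is index 0):

```python
# Comparing the Lloyd rule (Eq. (2)) with the Hartigan-Wong rule (Eq. (4))
# on the example of Eqs. (5)-(6). Indices are 0-based: cluster 1 -> index 0.
d2 = [10, 9]   # squared distances ||x_i - mu_k||^2
n = [5, 17]    # current cluster sizes n_k
lloyd_choice = min(range(2), key=lambda k: d2[k])
hw_choice = min(range(2), key=lambda k: n[k] / (n[k] + 1) * d2[k])
print(lloyd_choice)  # 1: Lloyd assigns to the nearer but larger cluster 2
print(hw_choice)     # 0: weighted costs are 5/6*10 = 8.33 < 17/18*9 = 8.5
```

So the size weighting pushes the point toward the smaller cluster, which is exactly the balancing effect Charl builds on.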

2 The Charl Algorithm

2.1 A New Tuning Mechanism

 $r_{ij}=\begin{cases}1, & \text{if } j=\underset{1\le k\le K}{\arg\min}\left(\frac{n_{k}}{n_{k}+1}\right)^{p}\left\|x_{i}-\mu_{k}\right\|^{2};\\0, & \text{otherwise}.\end{cases}$ (7)
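The exponent $p$ interpolates between the two rules: with $p=0$, Eq. (7) reduces to the Lloyd rule (Eq. (2)); with $p=1$ it recovers the Hartigan-Wong rule (Eq. (4)). A minimal sketch, reusing the numbers of Eqs. (5)-(6) (`charl_assign` is our own illustrative name):

```python
def charl_assign(d2, n, p):
    """Assignment rule of Eq. (7): argmin_k (n_k/(n_k+1))^p * ||x_i - mu_k||^2."""
    return min(range(len(d2)), key=lambda k: (n[k] / (n[k] + 1)) ** p * d2[k])

d2 = [10, 9]   # squared distances from Eqs. (5)-(6)
n = [5, 17]    # cluster sizes from Eqs. (5)-(6)
print(charl_assign(d2, n, 0))  # 1: p=0 recovers the Lloyd rule (cluster 2)
print(charl_assign(d2, n, 1))  # 0: p=1 recovers the Hartigan-Wong rule (cluster 1)
```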

2.2 How Improper Assignments Arise

 $j=\underset{1\le k\le K}{\arg\min}\left(\frac{n_{k}}{n_{k}+1}\right)^{p}\left\|x_{i}-\mu_{k}\right\|^{2},$ (8)
 $\left\|x_{i}-\mu_{j}\right\|^{2}>\left\|x_{i}-\mu_{s}\right\|^{2}.$ (9)

 $\left(\frac{n_{j}}{n_{j}+1}\right)^{p}\left\|x_{i}-\mu_{j}\right\|^{2}\le\left(\frac{n_{s}}{n_{s}+1}\right)^{p}\left\|x_{i}-\mu_{s}\right\|^{2}.$ (10)

 $\frac{n_{s}\left(n_{j}+1\right)}{\left(n_{s}+1\right)n_{j}}\ge\left\|\frac{x_{i}-\mu_{j}}{x_{i}-\mu_{s}}\right\|^{\frac{2}{p}}>1.$ (11)

 $\rho=\left\|\frac{x_{i}-\mu_{j}}{x_{i}-\mu_{s}}\right\|^{\frac{2}{p}},$ (12)

 $n_{s}\left(n_{j}-\rho n_{j}+1\right)\ge\rho n_{j}>0.$ (13)

 $1<\rho<1+\frac{1}{n_{j}},$ (14)
 $n_{j}<\rho n_{j}<n_{j}+1,$ (15)
 $\frac{n_{s}}{n_{j}}\ge\frac{\rho}{n_{j}+1-\rho n_{j}}>1.$ (16)
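With $p=1$ and the numbers of Eqs. (5)-(6) (taking $j$ as the penalized winner and $s$ as the nearer center, per Eq. (9)), the chain of conditions can be checked numerically:

```python
# Numeric check of Eqs. (12)-(16) with p = 1 and the numbers of Eqs. (5)-(6).
p = 1
dj2, ds2 = 10, 9          # ||x_i - mu_j||^2 and ||x_i - mu_s||^2
nj, ns = 5, 17
rho = (dj2 / ds2) ** (1 / p)   # Eq. (12): the ratio of squared norms to the power 1/p
assert 1 < rho < 1 + 1 / nj                      # Eq. (14): rho = 10/9 < 1.2
assert ns / nj >= rho / (nj + 1 - rho * nj) > 1  # Eq. (16): 3.4 >= 2.5 > 1
```

Both conditions hold, so this assignment of $x_i$ to the smaller cluster is an improper one: it increases the cost function $J$ even though it improves balance.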

2.3 Reducing Improper Assignments

(1) First, compute $\rho$ according to Eq. (12); this computation does not depend on $t$.

(2) If there exists an $s$ $(1\le s\le K)$ that satisfies both Eq. (14) and Eq. (16), then this reassignment of $x_i$ increases the cost function value $J$.

(1) According to Eqs. (8) and (9), as $p$ increases, the Charl algorithm more strongly favors assigning the sample $x_i$ to clusters with fewer members, but the risk of increasing the cost function also grows. When a more balanced clustering result is needed, $p$ can be set to a larger value; otherwise, a smaller value can be used.

(2) According to Eq. (13), as $p$ increases, $\rho$ decreases, so Eq. (14) becomes easier to satisfy. A larger $p$ therefore makes improper assignments more likely.

(3) When the clustering result is close to uniform, $\frac{n_{s}}{n_{j}}\approx 1$ for all $s$ $(1\le s\le K)$. In this case the clustering does not need to become more balanced, and $p$ can be set to a small value.

(4) If Eq. (14) holds, then the term $\frac{\rho}{n_{j}+1-\rho n_{j}}$ in Eq. (16) depends on how close $\rho n_j$ is to $n_j+1$. The closer $\rho n_j$ is to $n_j+1$, the more likely an improper assignment becomes.

(5) Finally, it is worth emphasizing that although the assignment of an individual sample may increase $J$, the overall cost function value $J$ can still decrease after all samples have been assigned within one iteration.

2.4 Balance Metric and Adaptive Tuning

 $\sigma=\sqrt{\frac{1}{K}\sum\limits_{k=1}^{K}\left(n_{k}-\frac{n}{K}\right)^{2}}.$ (17)

1: Input: $x_1, x_2, \ldots, x_n$, $K$, $T_{\max}$.

2: Randomly select $K$ points as the initial centers $\mu_k$ $(1\le k\le K)$.

3: $t$→0; compute $J_t$ using Eq. (1).

4: $p$→1.

5: While the number of iterations has not reached $T_{\max}$, iterate:

6: Update $r_{ik}$ $(1\le i\le n, 1\le k\le K)$ using Eq. (7).

7: Update the cluster centers $\mu_k$ using Eq. (3).

8: Recompute the cluster sizes $n_k$ and compute the standard deviation

 $\sigma_{t}=\sqrt{\frac{1}{K}\sum\limits_{k=1}^{K}\left(n_{k}-\frac{n}{K}\right)^{2}}.$

10: If $\sigma_t>\frac{n}{5K}$, then $p$→1.25$p$.

11: Otherwise $p$→max($p$/1.25, 1).

12: $t$→$t$+1; compute $J_t$ using Eq. (1).

13: If $J_t\ge J_{t-1}$, terminate the iteration.

Because the Charl algorithm uses batch updates, it enjoys the same computational efficiency as the Lloyd algorithm. In each iteration, $p$ is adaptively adjusted according to whether the clustering result is balanced. Through this strategy, the Charl algorithm tends to produce more balanced clusters. The iteration terminates when either of the following two conditions is met:

(1) The Charl algorithm reaches the maximum number of iterations.

(2) The cost function value $J$ increases after an iteration.
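The pseudocode above can be sketched as follows (a minimal NumPy sketch, not the authors' implementation; two assumptions are ours: the initial sizes are set to $n/K$ so the first assignment matches Lloyd's rule, and sizes are clipped to at least 1 so the weight in Eq. (7) stays positive if a cluster empties):

```python
import numpy as np

def charl(X, K, T_max=100, seed=0):
    """Minimal sketch of the Charl pseudocode (steps 1-13 above)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X[rng.choice(n, K, replace=False)].astype(float)   # step 2
    sizes = np.full(K, n / K)      # assumed initial sizes: uniform
    p = 1.0                                                  # step 4
    J_prev = np.inf
    for _ in range(T_max):                                   # step 5
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = ((sizes / (sizes + 1)) ** p * d2).argmin(1)  # Eq. (7), step 6
        for k in range(K):                                   # Eq. (3), step 7
            if (labels == k).any():
                mu[k] = X[labels == k].mean(0)
        sizes = np.maximum(np.bincount(labels, minlength=K), 1)  # step 8
        sigma = np.sqrt(((sizes - n / K) ** 2).mean())       # Eq. (17)
        p = 1.25 * p if sigma > n / (5 * K) else max(p / 1.25, 1.0)  # steps 10-11
        J = ((X - mu[labels]) ** 2).sum()                    # Eq. (1), step 12
        if J >= J_prev:                                      # step 13
            break
        J_prev = J
    return labels
```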

2.5 Comparison of Charl with Quantitative Balanced Clustering

The Quant algorithm is a typical quantitative balanced clustering algorithm[10]; we now compare it with Charl. Quant can require that the final clustering be completely balanced, i.e., $n_k\ge\left\lceil \frac{n}{K} \right\rceil, 1\le k\le K$.

3 Experiments

3.1 Datasets, Evaluation Metrics and Settings

3.2 Clustering Results of the Charl Algorithm

 $\frac{{{b}_{\text{Lloyd}}}-{{b}_{\text{Charl}}}}{{{b}_{\text{Lloyd}}}}.$

3.3 Cost Function Values of the Charl Algorithm

3.4 Clustering Performance of the Charl Algorithm

The Charl algorithm not only produces more balanced clustering results, but also achieves good clustering performance as measured by ACC, NMI and ARI. Table 3 reports the average ACC, NMI and ARI values over 100 runs of the Charl, Lloyd and Quant algorithms.

3.5 Practical Application of the Charl Algorithm

As a balanced clustering algorithm, Charl can be applied in many scenarios. Nearest-neighbor search is a fundamental and important problem in computer vision. When the data dimensionality is high, computing exact nearest neighbors is too expensive, so approximate algorithms are commonly used to find approximate nearest neighbors. Product Quantization[25] is a popular algorithm for approximate nearest-neighbor search.
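For context, the core of Product Quantization can be sketched as follows: each vector is split into $m$ sub-vectors, and each part is quantized with its own small K-means codebook (a minimal sketch, not the reference implementation of [25]; `pq_encode` and its parameters are our own illustrative names):

```python
import numpy as np

def pq_encode(X, m=2, k=4, iters=20, seed=0):
    """Minimal Product Quantization sketch: one small codebook per subspace.

    X has shape (n, d) with d divisible by m; returns m codebooks of k
    centroids each and, for every vector, its m codeword indices.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    parts = np.split(X, m, axis=1)        # m sub-matrices of shape (n, d/m)
    codebooks, codes = [], []
    for P in parts:
        C = P[rng.choice(n, k, replace=False)].astype(float)
        for _ in range(iters):            # plain Lloyd iterations per subspace
            a = ((P[:, None] - C[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (a == j).any():
                    C[j] = P[a == j].mean(0)
        codebooks.append(C)
        codes.append(a)
    return codebooks, np.stack(codes, 1)  # codes: (n, m) codeword indices
```

Because cluster sizes inside each codebook directly affect the quantization quality, a balanced clustering algorithm such as Charl is a natural fit for building the per-subspace codebooks.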

4 Conclusions and Future Work

[1] LLOYD S. Least squares quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137. DOI:10.1109/TIT.1982.1056489

[2] BISHOP C M. Pattern recognition and machine learning[M]. Singapore: Springer, 2006.

[3] HE Ling, WU Lingda, CAI Yichao. Survey of clustering algorithms in data mining[J]. Application Research of Computers, 2007, 24(1): 10-13.

[4] SUN Jigui, LIU Jie, ZHAO Lianyu. Clustering algorithms research[J]. Journal of Software, 2008, 19(1): 48-61. DOI:10.3724/SP.J.1001.2008.00048

[5] CHEN X, DENG C. Large scale spectral clustering with landmark-based representation[C]//Proceedings of the 25th AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI, 2011: 313-318.

[6] HUANG J, NIE F, HUANG H. Spectral rotation versus k-means in spectral clustering[C]//Proceedings of the 27th AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI, 2013: 431-437.

[7] ZHOU Lin, PING Xijian, XU Sen, et al. Cluster ensemble based on spectral clustering[J]. Acta Automatica Sinica, 2012, 38(8): 1335-1342. DOI:10.3724/SP.J.1004.2012.01335

[8] WU J, LIU H, XIONG H, et al. A theoretic framework of k-means-based consensus clustering[C]//Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013: 1799-1805.

[9] XU R, WUNSCH D. Survey of clustering algorithms[J]. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678. DOI:10.1109/TNN.2005.845141

[10] WANG Shouqiang, ZHU Daming, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University (Engineering Science), 2011, 41(4): 125-132.

[11] JIA Hongjie, DING Shifei, SHI Zhongzhi. Approximate weighted kernel K-means for large-scale spectral clustering[J]. Journal of Software, 2015, 26(11): 2836-2846.

[12] LEI Xiaofeng, XIE Kunqing, LIN Fan, et al. An efficient clustering algorithm based on local optimality of K-means[J]. Journal of Software, 2008, 19(7): 1683-1692. DOI:10.3724/SP.J.1001.2008.01683

[13] LI F, PERONA P. A Bayesian hierarchical model for learning natural scene categories[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005: 524-531.

[14] JURIE F, TRIGGS B. Creating efficient codebooks for visual recognition[C]//Proceedings of the 10th IEEE International Conference on Computer Vision. Beijing, China: IEEE, 2005: 604-610.

[15] WU J, REHG J M. Beyond the Euclidean distance: creating effective visual codebooks using the histogram intersection kernel[C]//Proceedings of the 12th International Conference on Computer Vision. Kyoto, Japan: IEEE, 2009: 630-637.

[16] WU J, TAN W C, REHG J M. Efficient and effective visual codebook generation using additive kernels[J]. Journal of Machine Learning Research, 2011, 12(9): 3097-3118.

[17] BANERJEE A, GHOSH J. Scalable clustering algorithms with balancing constraints[J]. Data Mining and Knowledge Discovery, 2006, 13(3): 365-395. DOI:10.1007/s10618-006-0040-z

[18] ZHU S, WANG D, LI T. Data clustering with size constraints[J]. Knowledge-Based Systems, 2010, 23(8): 883-889. DOI:10.1016/j.knosys.2010.06.003

[19] HARTIGAN J A, WONG M A. Algorithm AS 136: a K-means clustering algorithm[J]. Journal of the Royal Statistical Society (Series C, Applied Statistics), 1979, 28(1): 100-108.

[20] TELGARSKY M, VATTANI A. Hartigan's method: K-means clustering without Voronoi[C]//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: JMLR, 2010: 820-827.

[21] SLONIM N, AHARONI E, CRAMMER K. Hartigan's K-means versus Lloyd's K-means: is it time for a change?[C]//Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China: AAAI, 2013: 1677-1684.

[22] PAPADIMITRIOU C H, STEIGLITZ K. Combinatorial optimization: algorithms and complexity[M]. New York, USA: Dover Publications, 1998.

[23] CAI D, HE X, HAN J. Document clustering using locality preserving indexing[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(12): 1624-1637. DOI:10.1109/TKDE.2005.198

[24] RAND W M. Objective criteria for the evaluation of clustering methods[J]. Journal of the American Statistical Association, 1971, 66(336): 846-850. DOI:10.1080/01621459.1971.10482356

[25] JEGOU H, DOUZE M, SCHMID C. Product quantization for nearest neighbor search[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(1): 117-128. DOI:10.1109/TPAMI.2010.57