• 机器学习与数据挖掘 •

### 基于加权的K-modes聚类初始中心选择算法

1. 1. 青岛科技大学信息科学与技术学院, 山东 青岛 266061;2. 中国科学院计算技术研究所, 北京 100190
• 收稿日期:2015-04-15 出版日期:2016-04-20 发布日期:2015-04-15
• 作者简介:江峰(1978— ),男,江西彭泽人,副教授, 博士,主要研究方向为数据挖掘,粗糙集.E-mail: jiangkong@163.net
• 基金资助:
国家自然科学基金资助项目(60802042,61273180);山东省自然科学基金资助项目(ZR2011FQ005);山东省高等学校科技计划资助项目(J11LG05)

### A weight-based initial centers selection algorithm for K-modes clustering

JIANG Feng1, DU Junwei1, LIU Guozhu1, SUI Yuefei2

1. 1. College of Information Science &
Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong, China;
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
• Received:2015-04-15 Online:2016-04-20 Published:2015-04-15

Abstract: The current initialization methods for K-modes clustering do not consider the case in which various attributes have different significances. To solve this problem, a weighted density and weighted overlap distance-based initial center selection algorithm(called Ini-Weight)was proposed. In algorithm Ini-Weight, initial centers were selected by calculating the density of each object and the distance between any two objects. In Ini-Weight, when calculating the density of each object and the distance between any two objects, different weights were assigned to different attributes according to the significance of each attribute. Finally, Ini-Weight was compared with the current methods on UCI data sets. The results showed that Ini-Weight algorithm could effectively distinguish different attributes and improve the accuracy for selecting initial centers.

• TP181
 [1] HAN J W, KAMBER M. Data mining: concepts and techniques[M]. 2nd edition. Burlington, USA: Morgan Kaufmann Publishers, 2006.[2] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review[J]. ACM Computing Surveys, 1999, 31(3):264-323.[3] 陈文强, 林琛, 陈珂,等. 基于GraphLab的分布式近邻传播聚类算法[J]. 山东大学学报(工学版), 2013, 43(5):13-18. CHEN Wenqiang, LIN Chen, CHEN Ke, et al. Distributed affinity propagation clustering algorithm based on GraphLab[J]. Journal of Shandong University(Engineering Science), 2013, 43(5):13-18.[4] ANDERBERG M R. Cluster analysis for applications[M]. New York: Academic Press, 1973.[5] BAI L, LIANG J Y, SUI C, et al. Fast global K-means clustering based on local geometrical information[J]. Information Sciences, 2013, 245:168-180.[6] HUANG Z X. Extensions to the K-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3):283-304.[7] WU S, JIANG Q S, HUANG Z X. A new initialization method for clustering categorical data[C] //Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Nanjing, China: Springer, 2007:972-980.[8] CAO F Y, LIANG J Y, BAI L. A new initialization method for categorical data clustering[J]. Expert Systems with Application, 2009, 36(7):10223-10228.[9] BAI L, LIANG J Y, DANG C Y. An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data[J]. Knowledge-Based Systems, 2011, 24(6):785-795.[10] KHAN S S, AHMAD A. Cluster center initialization algorithm for K-modes clustering[J]. Expert Systems with Applications, 2013, 40(18):7444-7456.[11] BRADLEY P S, FAYYAD U M. Refining initial points for K-means clustering[C] //Proceedings of the 15th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann, 1998:91-99.[12] CAO F Y, LIANG J Y, JIANG G. An initialization method for the K-means algorithm using neighborhood model[J]. Computers and Mathematics with Applications, 2009, 58(3):474-483.[13] 王守强, 朱大铭, 史士英. K-means聚类问题的改进近似算法[J]. 山东大学学报(工学版), 2011, 41(4):125-132. WANG Shouqiang, ZHU D M, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University(Engineering Science), 2011, 41(4):125-132.[14] 夏战国, 万玲, 蔡世玉,等. 一种面向入侵检测的半监督聚类算法[J]. 山东大学学报(工学版), 2012, 42(6):1-7. XIA Zhanguo, WAN Ling, CAI Shiyu, et al. A semi-supervised clustering algorithm oriented to intrusion detection[J]. Journal of Shandong University(Engineering Science), 2012, 42(6):1-7.[15] PAWLAK Z. Rough sets: theoretical aspects of reasoning about data[M]. Dordrecht, Netherlands: Kluwer Academic Publishers, 1991.[16] SHANNON C E. The mathematical theory of communication[J]. Bell System Technical Journal, 1948, 27(3-4):373-423.[17] D(¨overU)NTSCH I, GEDIGA G. Uncertainty measures of rough set prediction[J]. Artificial Intelligence, 1998, 106(1):109-137.[18] 苗夺谦, 胡桂荣. 知识约简的一种启发式算法[J]. 计算机研究与发展, 1999, 36(6):681-684. MIAO Duoqian, HU Guirong. A heuristic algorithm for reduction of knowledge[J]. Computer Research and Development, 1999, 36(6):681-684.[19] LIANG J Y, SHI Z Z. The information entropy, rough entropy and knowledge granulation in rough set theory[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2004, 12(1):37-46.[20] JIANG F, SUI Y F, ZHOU L. A relative decision entropy-based feature selection approach[J]. Pattern Recognition, 2015, 48(7):2151-2163.[21] 江峰, 王莎莎, 杜军威,等. 基于近似决策熵的属性约简[J]. 控制与决策, 2015, 30(1):65-70. JIANG Feng, WANG Shasha, DU Junwei, et al. Attribute reduction based on approximation decision entropy[J]. Control and Decision, 2015, 30(1):65-70.[22] 梁吉业, 白亮, 曹付元. 基于新的距离度量的K-modes聚类算法[J]. 计算机研究与发展, 2010, 47(10):1749-1755. LIANG Jiye, BAI Liang, CAO Fuyuan. K-modes clustering algorithm based on a new distance measure[J]. Journal of Computer Research and Development, 2010, 47(10):1749-1755.[23] 徐章艳, 刘作鹏, 杨炳儒,等. 一个复杂度为max(O(|C||U|), O(|C|2|U/C|))的快速属性约简算法[J]. 计算机学报, 2006, 29(3):391-399. XU Zhangyan, LIU Zuopeng, YANG Bingru, et al. A quick attribute reduction algorithm with complexity of max(O(|C||U|), O(|C|2|U/C|))[J]. Chinese Journal of Computers, 2006, 29(3):391-399.[24] HUANG Z X, NG M K. A fuzzy K-modes algorithm for clustering categorical data[J]. IEEE Transactions on Fuzzy Systems, 1999, 7(4):446-452.
 [1] 景运革,李天瑞. 基于知识粒度的增量约简算法[J]. 山东大学学报(工学版), 2016, 46(1): 1-9. [2] 高峰1,迟春梅2. 决策表中属性的重排[J]. 山东大学学报(工学版), 2013, 43(5): 6-12. [3] 樊伟. 一种多粒度粗糙区间模糊集方法[J]. 山东大学学报(工学版), 2013, 43(1): 63-68. [4] 陈玉明,吴克寿,谢荣生. 基于相对知识粒度的决策表约简[J]. 山东大学学报(工学版), 2012, 42(6): 8-12. [5] 李慧1,2,胡云1,3,李存华1. 基于粗糙集理论的瓦斯灾害信息特征提取技术[J]. 山东大学学报(工学版), 2012, 42(5): 91-95. [6] 杨习贝1,2,黄佳玲1,周君仪3,杨静宇2. 不完备系统中基于特征相容块的粗糙集[J]. 山东大学学报(工学版), 2012, 42(5): 1-6. [7] 吴克寿,陈玉明,曾志强. 基于邻域关系的决策表约简[J]. 山东大学学报(工学版), 2012, 42(2): 7-10. [8] 翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28. [9] 焦吉成,高学东,王元璞,赵传领 . 关系积理论及属性约简算法[J]. 山东大学学报(工学版), 2008, 38(2): 112-116 . [10] 李成栋,雷红,史开泉 . 一种基于粗集的模糊系统设计方法[J]. 山东大学学报(工学版), 2006, 36(4): 73-80 . [11] 孟健,崔明辉,史开泉 . 变异粗集在集对分析中的应用[J]. 山东大学学报(工学版), 2006, 36(3): 65-68 . [12] 李秀红,张东升 . (α,β)-粗糙集模型中阈值的确定与解释[J]. 山东大学学报(工学版), 2006, 36(2): 81-85 . [13] 管延勇,胡海清,王洪凯 . α-粗糙集模型中的不可分辨关系[J]. 山东大学学报(工学版), 2006, 36(1): 75-80 .
Viewed
Full text

Abstract

Cited

Shared
Discussed
 [1] 李可,刘常春,李同磊 . 一种改进的最大互信息医学图像配准算法[J]. 山东大学学报(工学版), 2006, 36(2): 107 -110 . [2] 岳远征. 远离平衡态玻璃的弛豫[J]. 山东大学学报(工学版), 2009, 39(5): 1 -20 . [3] 王勇, 谢玉东. 大流量管道煤气的控制技术研究[J]. 山东大学学报(工学版), 2009, 39(2): 70 -74 . [4] 李辉平, 赵国群, 张雷, 贺连芳. 超高强度钢板热冲压及模内淬火工艺的发展现状[J]. 山东大学学报(工学版), 2010, 40(3): 69 -74 . [5] 刘新1 ，宋思利1 ，王新洪2 . 石墨配比对钨极氩弧熔敷层TiC增强相含量及分布形态的影响[J]. 山东大学学报(工学版), 2009, 39(2): 98 -100 . [6] 李士进,王声特,黄乐平. 基于正反向异质性的遥感图像变化检测[J]. 山东大学学报(工学版), 2018, 48(3): 1 -9 . [7] 赵治广,王登杰,田云飞 . 基于灰色理论的路基沉降研究[J]. 山东大学学报(工学版), 2007, 37(3): 86 -88 . [8] 王学平,王登杰,孙英明,董磊 . 免棱镜全站仪在桥梁检测中的应用[J]. 山东大学学报(工学版), 2007, 37(3): 105 -108 . [9] 孟健, 李贻斌, 李彬. 四足机器人跳跃步态控制方法[J]. 山东大学学报(工学版), 2015, 45(3): 28 -34 . [10] 张光庆,孔凡玉,李大兴, . Koblitz曲线上抵抗简单功耗分析的有效算法[J]. 山东大学学报(工学版), 2007, 37(3): 78 -80 .