您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2016, Vol. 46 ›› Issue (2): 29-34.doi: 10.6040/j.issn.1672-3961.0.2015.101

• 机器学习与数据挖掘 • 上一篇    下一篇

基于加权的K-modes聚类初始中心选择算法

江峰1,杜军威1,刘国柱1,眭跃飞2   

  1. 1. 青岛科技大学信息科学与技术学院, 山东 青岛 266061;2. 中国科学院计算技术研究所, 北京 100190
  • 收稿日期:2015-04-15 出版日期:2016-04-20 发布日期:2015-04-15
  • 作者简介:江峰(1978— ),男,江西彭泽人,副教授, 博士,主要研究方向为数据挖掘,粗糙集.E-mail: jiangkong@163.net
  • 基金资助:
    国家自然科学基金资助项目(60802042,61273180);山东省自然科学基金资助项目(ZR2011FQ005);山东省高等学校科技计划资助项目(J11LG05)

A weight-based initial centers selection algorithm for K-modes clustering

JIANG Feng1, DU Junwei1, LIU Guozhu1, SUI Yuefei2   

  1. 1. College of Information Science &
    Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2015-04-15 Online:2016-04-20 Published:2015-04-15

摘要: 针对现有的K-modes聚类初始化方法没有考虑不同的属性具有不同的重要性这一问题,提出一种基于加权密度与加权重叠距离的初始中心选择算法Ini-Weight。Ini-Weight算法通过计算每个对象的密度以及对象之间的距离来选择初始中心。在计算对象的密度以及对象的距离时,Ini-Weight算法根据每个属性的重要性为不同的属性赋予不同的权值。最后,在UCI数据集上将Ini-Weight与现有的方法进行了比较,结果表明,Ini-Weight算法可以有效地区分不同的属性,而且提高了初始中心选择的准确性。

关键词: 初始中心, 加权重叠距离, 加权平均密度, 粗糙集, K-modes聚类

Abstract: The current initialization methods for K-modes clustering do not consider the case in which various attributes have different significances. To solve this problem, a weighted density and weighted overlap distance-based initial center selection algorithm(called Ini-Weight)was proposed. In algorithm Ini-Weight, initial centers were selected by calculating the density of each object and the distance between any two objects. In Ini-Weight, when calculating the density of each object and the distance between any two objects, different weights were assigned to different attributes according to the significance of each attribute. Finally, Ini-Weight was compared with the current methods on UCI data sets. The results showed that Ini-Weight algorithm could effectively distinguish different attributes and improve the accuracy for selecting initial centers.

Key words: initial centers, weighted overlap distance, rough sets, K-modes clustering, weighted average density

中图分类号: 

  • TP181
[1] HAN J W, KAMBER M. Data mining: concepts and techniques[M]. 2nd edition. Burlington, USA: Morgan Kaufmann Publishers, 2006.
[2] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review[J]. ACM Computing Surveys, 1999, 31(3):264-323.
[3] 陈文强, 林琛, 陈珂,等. 基于GraphLab的分布式近邻传播聚类算法[J]. 山东大学学报(工学版), 2013, 43(5):13-18. CHEN Wenqiang, LIN Chen, CHEN Ke, et al. Distributed affinity propagation clustering algorithm based on GraphLab[J]. Journal of Shandong University(Engineering Science), 2013, 43(5):13-18.
[4] ANDERBERG M R. Cluster analysis for applications[M]. New York: Academic Press, 1973.
[5] BAI L, LIANG J Y, SUI C, et al. Fast global K-means clustering based on local geometrical information[J]. Information Sciences, 2013, 245:168-180.
[6] HUANG Z X. Extensions to the K-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3):283-304.
[7] WU S, JIANG Q S, HUANG Z X. A new initialization method for clustering categorical data[C] //Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Nanjing, China: Springer, 2007:972-980.
[8] CAO F Y, LIANG J Y, BAI L. A new initialization method for categorical data clustering[J]. Expert Systems with Application, 2009, 36(7):10223-10228.
[9] BAI L, LIANG J Y, DANG C Y. An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data[J]. Knowledge-Based Systems, 2011, 24(6):785-795.
[10] KHAN S S, AHMAD A. Cluster center initialization algorithm for K-modes clustering[J]. Expert Systems with Applications, 2013, 40(18):7444-7456.
[11] BRADLEY P S, FAYYAD U M. Refining initial points for K-means clustering[C] //Proceedings of the 15th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann, 1998:91-99.
[12] CAO F Y, LIANG J Y, JIANG G. An initialization method for the K-means algorithm using neighborhood model[J]. Computers and Mathematics with Applications, 2009, 58(3):474-483.
[13] 王守强, 朱大铭, 史士英. K-means聚类问题的改进近似算法[J]. 山东大学学报(工学版), 2011, 41(4):125-132. WANG Shouqiang, ZHU D M, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University(Engineering Science), 2011, 41(4):125-132.
[14] 夏战国, 万玲, 蔡世玉,等. 一种面向入侵检测的半监督聚类算法[J]. 山东大学学报(工学版), 2012, 42(6):1-7. XIA Zhanguo, WAN Ling, CAI Shiyu, et al. A semi-supervised clustering algorithm oriented to intrusion detection[J]. Journal of Shandong University(Engineering Science), 2012, 42(6):1-7.
[15] PAWLAK Z. Rough sets: theoretical aspects of reasoning about data[M]. Dordrecht, Netherlands: Kluwer Academic Publishers, 1991.
[16] SHANNON C E. The mathematical theory of communication[J]. Bell System Technical Journal, 1948, 27(3-4):373-423.
[17] D(¨overU)NTSCH I, GEDIGA G. Uncertainty measures of rough set prediction[J]. Artificial Intelligence, 1998, 106(1):109-137.
[18] 苗夺谦, 胡桂荣. 知识约简的一种启发式算法[J]. 计算机研究与发展, 1999, 36(6):681-684. MIAO Duoqian, HU Guirong. A heuristic algorithm for reduction of knowledge[J]. Computer Research and Development, 1999, 36(6):681-684.
[19] LIANG J Y, SHI Z Z. The information entropy, rough entropy and knowledge granulation in rough set theory[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2004, 12(1):37-46.
[20] JIANG F, SUI Y F, ZHOU L. A relative decision entropy-based feature selection approach[J]. Pattern Recognition, 2015, 48(7):2151-2163.
[21] 江峰, 王莎莎, 杜军威,等. 基于近似决策熵的属性约简[J]. 控制与决策, 2015, 30(1):65-70. JIANG Feng, WANG Shasha, DU Junwei, et al. Attribute reduction based on approximation decision entropy[J]. Control and Decision, 2015, 30(1):65-70.
[22] 梁吉业, 白亮, 曹付元. 基于新的距离度量的K-modes聚类算法[J]. 计算机研究与发展, 2010, 47(10):1749-1755. LIANG Jiye, BAI Liang, CAO Fuyuan. K-modes clustering algorithm based on a new distance measure[J]. Journal of Computer Research and Development, 2010, 47(10):1749-1755.
[23] 徐章艳, 刘作鹏, 杨炳儒,等. 一个复杂度为max(O(|C||U|), O(|C|2|U/C|))的快速属性约简算法[J]. 计算机学报, 2006, 29(3):391-399. XU Zhangyan, LIU Zuopeng, YANG Bingru, et al. A quick attribute reduction algorithm with complexity of max(O(|C||U|), O(|C|2|U/C|))[J]. Chinese Journal of Computers, 2006, 29(3):391-399.
[24] HUANG Z X, NG M K. A fuzzy K-modes algorithm for clustering categorical data[J]. IEEE Transactions on Fuzzy Systems, 1999, 7(4):446-452.
[1] 景运革,李天瑞. 基于知识粒度的增量约简算法[J]. 山东大学学报(工学版), 2016, 46(1): 1-9.
[2] 高峰1,迟春梅2. 决策表中属性的重排[J]. 山东大学学报(工学版), 2013, 43(5): 6-12.
[3] 樊伟. 一种多粒度粗糙区间模糊集方法[J]. 山东大学学报(工学版), 2013, 43(1): 63-68.
[4] 陈玉明,吴克寿,谢荣生. 基于相对知识粒度的决策表约简[J]. 山东大学学报(工学版), 2012, 42(6): 8-12.
[5] 李慧1,2,胡云1,3,李存华1. 基于粗糙集理论的瓦斯灾害信息特征提取技术[J]. 山东大学学报(工学版), 2012, 42(5): 91-95.
[6] 杨习贝1,2,黄佳玲1,周君仪3,杨静宇2. 不完备系统中基于特征相容块的粗糙集[J]. 山东大学学报(工学版), 2012, 42(5): 1-6.
[7] 吴克寿,陈玉明,曾志强. 基于邻域关系的决策表约简[J]. 山东大学学报(工学版), 2012, 42(2): 7-10.
[8] 翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28.
[9] 焦吉成,高学东,王元璞,赵传领 . 关系积理论及属性约简算法[J]. 山东大学学报(工学版), 2008, 38(2): 112-116 .
[10] 李成栋,雷红,史开泉 . 一种基于粗集的模糊系统设计方法[J]. 山东大学学报(工学版), 2006, 36(4): 73-80 .
[11] 孟健,崔明辉,史开泉 . 变异粗集在集对分析中的应用[J]. 山东大学学报(工学版), 2006, 36(3): 65-68 .
[12] 李秀红,张东升 . (α,β)-粗糙集模型中阈值的确定与解释[J]. 山东大学学报(工学版), 2006, 36(2): 81-85 .
[13] 管延勇,胡海清,王洪凯 . α-粗糙集模型中的不可分辨关系[J]. 山东大学学报(工学版), 2006, 36(1): 75-80 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 李可,刘常春,李同磊 . 一种改进的最大互信息医学图像配准算法[J]. 山东大学学报(工学版), 2006, 36(2): 107 -110 .
[2] 岳远征. 远离平衡态玻璃的弛豫[J]. 山东大学学报(工学版), 2009, 39(5): 1 -20 .
[3] 王勇, 谢玉东.

大流量管道煤气的控制技术研究

[J]. 山东大学学报(工学版), 2009, 39(2): 70 -74 .
[4] 李辉平, 赵国群, 张雷, 贺连芳. 超高强度钢板热冲压及模内淬火工艺的发展现状[J]. 山东大学学报(工学版), 2010, 40(3): 69 -74 .
[5] 刘新1 ,宋思利1 ,王新洪2 . 石墨配比对钨极氩弧熔敷层TiC增强相含量及分布形态的影响[J]. 山东大学学报(工学版), 2009, 39(2): 98 -100 .
[6] 李士进,王声特,黄乐平. 基于正反向异质性的遥感图像变化检测[J]. 山东大学学报(工学版), 2018, 48(3): 1 -9 .
[7] 赵治广,王登杰,田云飞 . 基于灰色理论的路基沉降研究[J]. 山东大学学报(工学版), 2007, 37(3): 86 -88 .
[8] 王学平,王登杰,孙英明,董磊 . 免棱镜全站仪在桥梁检测中的应用[J]. 山东大学学报(工学版), 2007, 37(3): 105 -108 .
[9] 孟健, 李贻斌, 李彬. 四足机器人跳跃步态控制方法[J]. 山东大学学报(工学版), 2015, 45(3): 28 -34 .
[10] 张光庆,孔凡玉,李大兴, . Koblitz曲线上抵抗简单功耗分析的有效算法[J]. 山东大学学报(工学版), 2007, 37(3): 78 -80 .