基于加权的K-modes聚类初始中心选择算法

doi:10.6040/j.issn.1672-3961.0.2015.101

山东大学学报(工学版) ›› 2016, Vol. 46 ›› Issue (2): 29-34.doi: 10.6040/j.issn.1672-3961.0.2015.101

基于加权的K-modes聚类初始中心选择算法

江峰¹,杜军威¹,刘国柱¹,眭跃飞²

1. 青岛科技大学信息科学与技术学院, 山东青岛 266061;2. 中国科学院计算技术研究所, 北京 100190

收稿日期:2015-04-15 出版日期:2016-04-20 发布日期:2015-04-15
作者简介:江峰(1978— ),男,江西彭泽人,副教授, 博士,主要研究方向为数据挖掘,粗糙集.E-mail: jiangkong@163.net
基金资助:
国家自然科学基金资助项目(60802042,61273180);山东省自然科学基金资助项目(ZR2011FQ005);山东省高等学校科技计划资助项目(J11LG05)

A weight-based initial centers selection algorithm for K-modes clustering

JIANG Feng¹, DU Junwei¹, LIU Guozhu¹, SUI Yuefei²

1. College of Information Science &
Technology, Qingdao University of Science and Technology, Qingdao 266061, Shandong, China;
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

Received:2015-04-15 Online:2016-04-20 Published:2015-04-15

摘要/Abstract

摘要： 针对现有的K-modes聚类初始化方法没有考虑不同的属性具有不同的重要性这一问题,提出一种基于加权密度与加权重叠距离的初始中心选择算法Ini_-Weight。Ini_-Weight算法通过计算每个对象的密度以及对象之间的距离来选择初始中心。在计算对象的密度以及对象的距离时,Ini_-Weight算法根据每个属性的重要性为不同的属性赋予不同的权值。最后,在UCI数据集上将Ini_-Weight与现有的方法进行了比较,结果表明,Ini_-Weight算法可以有效地区分不同的属性,而且提高了初始中心选择的准确性。

关键词: 初始中心, 加权重叠距离, 加权平均密度, 粗糙集, K-modes聚类

Abstract: The current initialization methods for K-modes clustering do not consider the case in which various attributes have different significances. To solve this problem, a weighted density and weighted overlap distance-based initial center selection algorithm(called Ini_-Weight)was proposed. In algorithm Ini_-Weight, initial centers were selected by calculating the density of each object and the distance between any two objects. In Ini_-Weight, when calculating the density of each object and the distance between any two objects, different weights were assigned to different attributes according to the significance of each attribute. Finally, Ini_-Weight was compared with the current methods on UCI data sets. The results showed that Ini_-Weight algorithm could effectively distinguish different attributes and improve the accuracy for selecting initial centers.

Key words: initial centers, weighted overlap distance, rough sets, K-modes clustering, weighted average density

中图分类号:

TP181

江峰,杜军威,刘国柱,眭跃飞. 基于加权的K-modes聚类初始中心选择算法[J]. 山东大学学报(工学版), 2016, 46(2): 29-34.

JIANG Feng, DU Junwei, LIU Guozhu, SUI Yuefei. A weight-based initial centers selection algorithm for K-modes clustering[J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2016, 46(2): 29-34.

参考文献

[1] HAN J W, KAMBER M. Data mining: concepts and techniques[M]. 2nd edition. Burlington, USA: Morgan Kaufmann Publishers, 2006.
[2] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review[J]. ACM Computing Surveys, 1999, 31(3):264-323.
[3] 陈文强, 林琛, 陈珂,等. 基于GraphLab的分布式近邻传播聚类算法[J]. 山东大学学报(工学版), 2013, 43(5):13-18. CHEN Wenqiang, LIN Chen, CHEN Ke, et al. Distributed affinity propagation clustering algorithm based on GraphLab[J]. Journal of Shandong University(Engineering Science), 2013, 43(5):13-18.
[4] ANDERBERG M R. Cluster analysis for applications[M]. New York: Academic Press, 1973.
[5] BAI L, LIANG J Y, SUI C, et al. Fast global K-means clustering based on local geometrical information[J]. Information Sciences, 2013, 245:168-180.
[6] HUANG Z X. Extensions to the K-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3):283-304.
[7] WU S, JIANG Q S, HUANG Z X. A new initialization method for clustering categorical data[C] //Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Nanjing, China: Springer, 2007:972-980.
[8] CAO F Y, LIANG J Y, BAI L. A new initialization method for categorical data clustering[J]. Expert Systems with Application, 2009, 36(7):10223-10228.
[9] BAI L, LIANG J Y, DANG C Y. An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data[J]. Knowledge-Based Systems, 2011, 24(6):785-795.
[10] KHAN S S, AHMAD A. Cluster center initialization algorithm for K-modes clustering[J]. Expert Systems with Applications, 2013, 40(18):7444-7456.
[11] BRADLEY P S, FAYYAD U M. Refining initial points for K-means clustering[C] //Proceedings of the 15th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann, 1998:91-99.
[12] CAO F Y, LIANG J Y, JIANG G. An initialization method for the K-means algorithm using neighborhood model[J]. Computers and Mathematics with Applications, 2009, 58(3):474-483.
[13] 王守强, 朱大铭, 史士英. K-means聚类问题的改进近似算法[J]. 山东大学学报(工学版), 2011, 41(4):125-132. WANG Shouqiang, ZHU D M, SHI Shiying. Improved approximation algorithm for the K-means clustering problem[J]. Journal of Shandong University(Engineering Science), 2011, 41(4):125-132.
[14] 夏战国, 万玲, 蔡世玉,等. 一种面向入侵检测的半监督聚类算法[J]. 山东大学学报(工学版), 2012, 42(6):1-7. XIA Zhanguo, WAN Ling, CAI Shiyu, et al. A semi-supervised clustering algorithm oriented to intrusion detection[J]. Journal of Shandong University(Engineering Science), 2012, 42(6):1-7.
[15] PAWLAK Z. Rough sets: theoretical aspects of reasoning about data[M]. Dordrecht, Netherlands: Kluwer Academic Publishers, 1991.
[16] SHANNON C E. The mathematical theory of communication[J]. Bell System Technical Journal, 1948, 27(3-4):373-423.
[17] D(¨overU)NTSCH I, GEDIGA G. Uncertainty measures of rough set prediction[J]. Artificial Intelligence, 1998, 106(1):109-137.
[18] 苗夺谦, 胡桂荣. 知识约简的一种启发式算法[J]. 计算机研究与发展, 1999, 36(6):681-684. MIAO Duoqian, HU Guirong. A heuristic algorithm for reduction of knowledge[J]. Computer Research and Development, 1999, 36(6):681-684.
[19] LIANG J Y, SHI Z Z. The information entropy, rough entropy and knowledge granulation in rough set theory[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2004, 12(1):37-46.
[20] JIANG F, SUI Y F, ZHOU L. A relative decision entropy-based feature selection approach[J]. Pattern Recognition, 2015, 48(7):2151-2163.
[21] 江峰, 王莎莎, 杜军威,等. 基于近似决策熵的属性约简[J]. 控制与决策, 2015, 30(1):65-70. JIANG Feng, WANG Shasha, DU Junwei, et al. Attribute reduction based on approximation decision entropy[J]. Control and Decision, 2015, 30(1):65-70.
[22] 梁吉业, 白亮, 曹付元. 基于新的距离度量的K-modes聚类算法[J]. 计算机研究与发展, 2010, 47(10):1749-1755. LIANG Jiye, BAI Liang, CAO Fuyuan. K-modes clustering algorithm based on a new distance measure[J]. Journal of Computer Research and Development, 2010, 47(10):1749-1755.
[23] 徐章艳, 刘作鹏, 杨炳儒,等. 一个复杂度为max(O(|C||U|), O(|C|2|U/C|))的快速属性约简算法[J]. 计算机学报, 2006, 29(3):391-399. XU Zhangyan, LIU Zuopeng, YANG Bingru, et al. A quick attribute reduction algorithm with complexity of max(O(|C||U|), O(|C|2|U/C|))[J]. Chinese Journal of Computers, 2006, 29(3):391-399.
[24] HUANG Z X, NG M K. A fuzzy K-modes algorithm for clustering categorical data[J]. IEEE Transactions on Fuzzy Systems, 1999, 7(4):446-452.

多维度评价

Viewed

Full text

Abstract

Cited

Shared

Discussed

基于加权的K-modes聚类初始中心选择算法

A weight-based initial centers selection algorithm for K-modes clustering

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 13

多维度评价

本文评价

推荐阅读 10

[1]	景运革,李天瑞. 基于知识粒度的增量约简算法[J]. 山东大学学报(工学版), 2016, 46(1): 1-9.
[2]	高峰1,迟春梅2. 决策表中属性的重排[J]. 山东大学学报(工学版), 2013, 43(5): 6-12.
[3]	樊伟. 一种多粒度粗糙区间模糊集方法[J]. 山东大学学报(工学版), 2013, 43(1): 63-68.
[4]	陈玉明,吴克寿,谢荣生. 基于相对知识粒度的决策表约简[J]. 山东大学学报(工学版), 2012, 42(6): 8-12.
[5]	李慧1,2,胡云1,3,李存华1. 基于粗糙集理论的瓦斯灾害信息特征提取技术[J]. 山东大学学报(工学版), 2012, 42(5): 91-95.
[6]	杨习贝1,2,黄佳玲1,周君仪3,杨静宇2. 不完备系统中基于特征相容块的粗糙集[J]. 山东大学学报(工学版), 2012, 42(5): 1-6.
[7]	吴克寿,陈玉明,曾志强. 基于邻域关系的决策表约简[J]. 山东大学学报(工学版), 2012, 42(2): 7-10.
[8]	翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28.
[9]	焦吉成,高学东,王元璞,赵传领 . 关系积理论及属性约简算法[J]. 山东大学学报(工学版), 2008, 38(2): 112-116 .
[10]	李成栋,雷红,史开泉 . 一种基于粗集的模糊系统设计方法[J]. 山东大学学报(工学版), 2006, 36(4): 73-80 .
[11]	孟健,崔明辉,史开泉 . 变异粗集在集对分析中的应用[J]. 山东大学学报(工学版), 2006, 36(3): 65-68 .
[12]	李秀红,张东升 . (α,β)-粗糙集模型中阈值的确定与解释[J]. 山东大学学报(工学版), 2006, 36(2): 81-85 .
[13]	管延勇,胡海清,王洪凯 . α-粗糙集模型中的不可分辨关系[J]. 山东大学学报(工学版), 2006, 36(1): 75-80 .