您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2015, Vol. 45 ›› Issue (3): 7-14.doi: 10.6040/j.issn.1672-3961.1.2014.182

• 机器学习与数据挖掘 • 上一篇    下一篇

一种基于密度差异的离群点检测算法

辛丽玲, 何威, 于剑, 贾彩燕   

  1. 北京交通大学计算机与信息技术学院, 北京 100044
  • 收稿日期:2014-03-26 修回日期:2015-03-17 出版日期:2015-06-20 发布日期:2014-03-26
  • 作者简介:辛丽玲(1988- ),女,山西大同人,硕士研究生,主要研究方向为数据挖掘与离群点检测. E-mail:12120466@bjtu.edu.cn
  • 基金资助:
    北京市科委资助项目(Z131110002813118);河北省教育厅青年基金资助项目(2011143);河北省自然科学基金资助项目(F2013205192)

An outlier detection algorithm based on density difference

XIN Liling, HE Wei, YU Jian, JIA Caiyan   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received:2014-03-26 Revised:2015-03-17 Online:2015-06-20 Published:2014-03-26

摘要: 在MMOD算法的基础上提出一种改进算法IMMOD,该算法考虑各属性的差异对离群点检测的影响,通过引入信息熵来确定属性的重要程度以量化权重向量,进而采用加权距离计算各数据点相异性。此外,在处理高维数据时,确定次要属性后采用属性约简方法,在保证时间效率的同时提高检测精度。理论分析和试验结果表明IMMOD算法参数少、检测准确性高,能很好地适用于高维数据,整体性能优于同类算法。

关键词: 属性约简, IMMOD(improved MMOD), 加权距离, 信息熵, 离群点检测, MMOD(mountain method based outlier detection)

Abstract: An improved algorithm IMMOD was proposed based on the algorithm MMOD, which considered the difference among different attributes and improved the accuracy of detection. The algorithm introduced entropy to confirm the significance of attribute. The weight of attribute determined by the entropy was used to calculate the weighted distance between objects. In addition, determining and reducing the secondary attributes could guarantee the computational complexity and improve the precision on the high dimensional datasets. The theoretical analysis and the empirical study both showed that the IMMOD could be applied on high dimensional datasets well with a few of parameters and high accuracy, which was better than other algorithms.

Key words: information entropy, outlier detection, attribute reduction, weighted distance, MMOD, IMMOD

中图分类号: 

  • TP391
[1] HAWKINS D M.Identification of outliers[M].London:Chapman and Hall, 1980:1-1.
[2] KLEINBAUM D, KUPPER L, MULLER K, et al. Applied regression analysis and other multivariable methods[M]. 3 ed. California:Duxbury Press, 1998:228-237.
[3] CHEN Yixin, DANG Xin.Outlier detection with the kernelized spatial depth function[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2):288-305.
[4] KNORR E M, NG R T.Algorithms for mining distance-based outliers in large datasets[C]//Proceedings of the 24rd International Conference on Very Large Data Bases.San Francisco, USA:Morgan Kaufmann, 1998:392-403.
[5] ANGIULLI F, BASTA S, PIZZUTI C. Distance-based detection and prediction of outliers[J].IEEE Transactions on Knowledge and Data Engineering, 2006, 18(2):145-160.
[6] BREITENBACH M,GRUDIC G Z.Clustering through ranking on manifolds[C]//In Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany:ACM, 2005:73-80.
[7] HE Zengyou, XU Xiaofei, DENG Shengchun.Discovering cluster based local outliers[J]. Pattern Recognition Letters, 2003, 24:1641-1650.
[8] GUPTA K K, NATH B, KOTAGIRI R. Layered approach using conditional random fields for intrusion detection[J].IEEE Transactions on Dependable and Secure Computing, 2010, 7(1):35-49.
[9] BREUNIG M M, KRIEGEL H P, NG R T, et al.Identifying density-based local outliers[C]// In Proceedings of the 2000 ACM SIGMOD international conference on Management of data.New York, USA:ACM, 2000:93-104.
[10] LATECKI L, LAZAREVIC A, POKRAJAC D. Outlier detection with kernel density functions[C]//Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition.Leipzig, Germany:Springer, 2007:61-75.
[11] 何威.基于数据密度估计的聚类与离群点检测研究[D].北京:北京交通大学,2011. HE Wei. Data density based clustering and outlier detection[D].Beijing:Beijing Jiaotong University, 2011.
[12] MARKUS Goldstein. An expectation maximization based local outlier detection algorithm[C]//Proceedings of the 21st International Conference on Pattern Recognition. Tsukuba, Japan:IEEE, 2012:2282-2285.
[13] RALF O.A fuzzy vector valued KNN-algorithm for automatic outlier detection[J].Applied Soft Computing, 2009, 9:1263-1272.
[14] PARZEN E.On the estimate of a probability density function and mode[J].Annals Mathematical Statistics, 1962, 33(3):1065-1076.
[15] TERRELL G R, SCOTT D W.Variable kernel density estimation[J].The Annals of Statistics, 1992, 20(3):1236-1265.
[16] KRIEGEL H P, KRGER P, SCHUBERT E, et al.Local outlier probabilities[C]//Proceeding of the 18th ACM Conference on Information and Knowledge Management.Hong Kong, China:ACM, 2009:1649-1652.
[17] YAGER R R,FILEV D P.Approximate clustering via the mountain method[J].IEEE Transactions on Systems,Man and Cybernetics, 1994, 24(8):1279-1284.
[18] CHIU S L.Fuzzy model identification based on cluster estimation[J].Journal of Intelligent Fuzzy Systems, 1994, 2:267-278.
[19] YANG M S, WU K L.A similarity-based robust clustering method[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(4):434-448.
[20] MANOR L Z, PERONA P. Self-tuning spectral clustering[C]// Proceedings of the 18th Annual Conference on Neural Information Processing Systems. Cambridge, UK:MIT Press, 2004:1601-1608.
[21] SHANNON C E.The mathematical theory of communication[J].Bell System Technical Journal, 1948, 27(3-4):373-423.
[22] MLLER E, SCHIFFER M, SEIDL T.Statistical selection of relevant subspace projections for outlier ranking[C]// Proceedings of the 27th International Conference on Data Engineering. Hannover, Germany:IEEE, 2011:434-445.
[23] 胡彩平,秦小麟.一种基于密度的局部离群点检测算法DLOF[J].计算机研究与发展,2010,47(12):2110-2116. HU Caiping, QIN Xiaolin. A density-based local outlier detecting algorithm[J].Journal of Computer Research and Development, 2010, 47(12):2110-2116.
[24] LAZAREVIC A,KUMAR V.Feature bagging for outlier detection[C]//Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, USA:ACM, 2005:157-166.
[25] VAIDYA P M.An optimal algorithm for the all-nearest-neighbors problem[C]//27th Annual Symposium Foundations of Computer Science. Toronto,Canada:IEEE,1986:117-122.
[1] 吴建萍,姜斌,刘剑慰. 基于小波包信息熵和小波神经网络的异步电机故障诊断[J]. 山东大学学报(工学版), 2017, 47(5): 223-228.
[2] 林耀进,张佳,林梦雷,王娟. 一种基于模糊信息熵的协同过滤推荐方法[J]. 山东大学学报(工学版), 2016, 46(5): 13-20.
[3] 张佳,林耀进,林梦雷,刘景华,李慧宗. 基于信息熵的协同过滤算法[J]. 山东大学学报(工学版), 2016, 46(2): 43-50.
[4] 景运革,李天瑞. 基于知识粒度的增量约简算法[J]. 山东大学学报(工学版), 2016, 46(1): 1-9.
[5] 潘盼1,王熙照2,翟俊海2. 基于有序决策树的改进归纳算法[J]. 山东大学学报(工学版), 2014, 44(1): 41-44.
[6] 施珺,朱敏. 一种基于灰色系统和支持向量机的预测优化模型[J]. 山东大学学报(工学版), 2012, 42(5): 7-11.
[7] 李慧1,2,胡云1,3,李存华1. 基于粗糙集理论的瓦斯灾害信息特征提取技术[J]. 山东大学学报(工学版), 2012, 42(5): 91-95.
[8] 胡云1,2,李慧1,施珺1,蔡虹1. 基于属性约简和相对熵的离群点检测算法[J]. 山东大学学报(工学版), 2011, 41(6): 31-36.
[9] 郭剑毅1,2,雷春雅1,余正涛1,2,苏磊1,2,赵君1,田维1. 基于信息熵的半监督领域实体关系抽取研究[J]. 山东大学学报(工学版), 2011, 41(4): 7-12.
[10] 黄添强1,2,陈智文1. 基于双向运动矢量的数字视频篡改鉴定[J]. 山东大学学报(工学版), 2011, 41(4): 13-19.
[11] 翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28.
[12] 雷小锋1,庄伟1,程宇1,丁世飞1,谢昆青2. OPHCLUS:基于序关系保持的层次聚类算法[J]. 山东大学学报(工学版), 2010, 40(5): 48-55.
[13] 罗玉盘 商琳. 基于多粒度周期模式的时序离群点检测算法[J]. 山东大学学报(工学版), 2009, 39(3): 11-15.
[14] 管延勇,胡海清,王洪凯 . α-粗糙集模型中的不可分辨关系[J]. 山东大学学报(工学版), 2006, 36(1): 75-80 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!