您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2015, Vol. 45 ›› Issue (3): 7-14.doi: 10.6040/j.issn.1672-3961.1.2014.182

• 机器学习与数据挖掘 • 上一篇    下一篇

一种基于密度差异的离群点检测算法

辛丽玲, 何威, 于剑, 贾彩燕   

  1. 北京交通大学计算机与信息技术学院, 北京 100044
  • 收稿日期:2014-03-26 修回日期:2015-03-17 出版日期:2015-06-20 发布日期:2014-03-26
  • 作者简介:辛丽玲(1988- ),女,山西大同人,硕士研究生,主要研究方向为数据挖掘与离群点检测. E-mail:12120466@bjtu.edu.cn
  • 基金资助:
    北京市科委资助项目(Z131110002813118);河北省教育厅青年基金资助项目(2011143);河北省自然科学基金资助项目(F2013205192)

An outlier detection algorithm based on density difference

XIN Liling, HE Wei, YU Jian, JIA Caiyan   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received:2014-03-26 Revised:2015-03-17 Online:2015-06-20 Published:2014-03-26

摘要: 在MMOD算法的基础上提出一种改进算法IMMOD,该算法考虑各属性的差异对离群点检测的影响,通过引入信息熵来确定属性的重要程度以量化权重向量,进而采用加权距离计算各数据点相异性。此外,在处理高维数据时,确定次要属性后采用属性约简方法,在保证时间效率的同时提高检测精度。理论分析和试验结果表明IMMOD算法参数少、检测准确性高,能很好地适用于高维数据,整体性能优于同类算法。

关键词: 属性约简, IMMOD(improved MMOD), 加权距离, 信息熵, 离群点检测, MMOD(mountain method based outlier detection)

Abstract: An improved algorithm IMMOD was proposed based on the algorithm MMOD, which considered the difference among different attributes and improved the accuracy of detection. The algorithm introduced entropy to confirm the significance of attribute. The weight of attribute determined by the entropy was used to calculate the weighted distance between objects. In addition, determining and reducing the secondary attributes could guarantee the computational complexity and improve the precision on the high dimensional datasets. The theoretical analysis and the empirical study both showed that the IMMOD could be applied on high dimensional datasets well with a few of parameters and high accuracy, which was better than other algorithms.

Key words: information entropy, outlier detection, attribute reduction, weighted distance, MMOD, IMMOD

中图分类号: 

  • TP391
[1] HAWKINS D M.Identification of outliers[M].London:Chapman and Hall, 1980:1-1.
[2] KLEINBAUM D, KUPPER L, MULLER K, et al. Applied regression analysis and other multivariable methods[M]. 3 ed. California:Duxbury Press, 1998:228-237.
[3] CHEN Yixin, DANG Xin.Outlier detection with the kernelized spatial depth function[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(2):288-305.
[4] KNORR E M, NG R T.Algorithms for mining distance-based outliers in large datasets[C]//Proceedings of the 24rd International Conference on Very Large Data Bases.San Francisco, USA:Morgan Kaufmann, 1998:392-403.
[5] ANGIULLI F, BASTA S, PIZZUTI C. Distance-based detection and prediction of outliers[J].IEEE Transactions on Knowledge and Data Engineering, 2006, 18(2):145-160.
[6] BREITENBACH M,GRUDIC G Z.Clustering through ranking on manifolds[C]//In Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany:ACM, 2005:73-80.
[7] HE Zengyou, XU Xiaofei, DENG Shengchun.Discovering cluster based local outliers[J]. Pattern Recognition Letters, 2003, 24:1641-1650.
[8] GUPTA K K, NATH B, KOTAGIRI R. Layered approach using conditional random fields for intrusion detection[J].IEEE Transactions on Dependable and Secure Computing, 2010, 7(1):35-49.
[9] BREUNIG M M, KRIEGEL H P, NG R T, et al.Identifying density-based local outliers[C]// In Proceedings of the 2000 ACM SIGMOD international conference on Management of data.New York, USA:ACM, 2000:93-104.
[10] LATECKI L, LAZAREVIC A, POKRAJAC D. Outlier detection with kernel density functions[C]//Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition.Leipzig, Germany:Springer, 2007:61-75.
[11] 何威.基于数据密度估计的聚类与离群点检测研究[D].北京:北京交通大学,2011. HE Wei. Data density based clustering and outlier detection[D].Beijing:Beijing Jiaotong University, 2011.
[12] MARKUS Goldstein. An expectation maximization based local outlier detection algorithm[C]//Proceedings of the 21st International Conference on Pattern Recognition. Tsukuba, Japan:IEEE, 2012:2282-2285.
[13] RALF O.A fuzzy vector valued KNN-algorithm for automatic outlier detection[J].Applied Soft Computing, 2009, 9:1263-1272.
[14] PARZEN E.On the estimate of a probability density function and mode[J].Annals Mathematical Statistics, 1962, 33(3):1065-1076.
[15] TERRELL G R, SCOTT D W.Variable kernel density estimation[J].The Annals of Statistics, 1992, 20(3):1236-1265.
[16] KRIEGEL H P, KRGER P, SCHUBERT E, et al.Local outlier probabilities[C]//Proceeding of the 18th ACM Conference on Information and Knowledge Management.Hong Kong, China:ACM, 2009:1649-1652.
[17] YAGER R R,FILEV D P.Approximate clustering via the mountain method[J].IEEE Transactions on Systems,Man and Cybernetics, 1994, 24(8):1279-1284.
[18] CHIU S L.Fuzzy model identification based on cluster estimation[J].Journal of Intelligent Fuzzy Systems, 1994, 2:267-278.
[19] YANG M S, WU K L.A similarity-based robust clustering method[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(4):434-448.
[20] MANOR L Z, PERONA P. Self-tuning spectral clustering[C]// Proceedings of the 18th Annual Conference on Neural Information Processing Systems. Cambridge, UK:MIT Press, 2004:1601-1608.
[21] SHANNON C E.The mathematical theory of communication[J].Bell System Technical Journal, 1948, 27(3-4):373-423.
[22] MLLER E, SCHIFFER M, SEIDL T.Statistical selection of relevant subspace projections for outlier ranking[C]// Proceedings of the 27th International Conference on Data Engineering. Hannover, Germany:IEEE, 2011:434-445.
[23] 胡彩平,秦小麟.一种基于密度的局部离群点检测算法DLOF[J].计算机研究与发展,2010,47(12):2110-2116. HU Caiping, QIN Xiaolin. A density-based local outlier detecting algorithm[J].Journal of Computer Research and Development, 2010, 47(12):2110-2116.
[24] LAZAREVIC A,KUMAR V.Feature bagging for outlier detection[C]//Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York, USA:ACM, 2005:157-166.
[25] VAIDYA P M.An optimal algorithm for the all-nearest-neighbors problem[C]//27th Annual Symposium Foundations of Computer Science. Toronto,Canada:IEEE,1986:117-122.
[1] 邱利芹,王磊,于越,孙雅慧. 知识粒度视角下区间值决策信息系统的增量式属性约简[J]. 山东大学学报 (工学版), 2025, 55(6): 45-57.
[2] 陈宝国,邓明,陈金林. 基于权重邻域熵的数值型信息系统属性约简算法[J]. 山东大学学报 (工学版), 2024, 54(1): 33-44.
[3] 季雨瑄,叶军,杨震宇,敖家欣,王磊. 结合分辨矩阵改进的邻域粗糙集属性约简算法[J]. 山东大学学报 (工学版), 2022, 52(4): 99-109.
[4] 郭茂林,包崇明,周丽华,丁涛,孔兵. 基于TOPSIS的异质网络影响力最大化[J]. 山东大学学报 (工学版), 2022, 52(2): 31-40.
[5] 吴建萍,姜斌,刘剑慰. 基于小波包信息熵和小波神经网络的异步电机故障诊断[J]. 山东大学学报(工学版), 2017, 47(5): 223-228.
[6] 林耀进,张佳,林梦雷,王娟. 一种基于模糊信息熵的协同过滤推荐方法[J]. 山东大学学报(工学版), 2016, 46(5): 13-20.
[7] 张佳,林耀进,林梦雷,刘景华,李慧宗. 基于信息熵的协同过滤算法[J]. 山东大学学报(工学版), 2016, 46(2): 43-50.
[8] 景运革,李天瑞. 基于知识粒度的增量约简算法[J]. 山东大学学报(工学版), 2016, 46(1): 1-9.
[9] 潘盼1,王熙照2,翟俊海2. 基于有序决策树的改进归纳算法[J]. 山东大学学报(工学版), 2014, 44(1): 41-44.
[10] 李慧1,2,胡云1,3,李存华1. 基于粗糙集理论的瓦斯灾害信息特征提取技术[J]. 山东大学学报(工学版), 2012, 42(5): 91-95.
[11] 施珺,朱敏. 一种基于灰色系统和支持向量机的预测优化模型[J]. 山东大学学报(工学版), 2012, 42(5): 7-11.
[12] 胡云1,2,李慧1,施珺1,蔡虹1. 基于属性约简和相对熵的离群点检测算法[J]. 山东大学学报(工学版), 2011, 41(6): 31-36.
[13] 翟俊海,高原原,王熙照,陈俊芬. 基于划分子集的属性约简算法[J]. 山东大学学报(工学版), 2011, 41(4): 24-28.
[14] 黄添强1,2,陈智文1. 基于双向运动矢量的数字视频篡改鉴定[J]. 山东大学学报(工学版), 2011, 41(4): 13-19.
[15] 郭剑毅1,2,雷春雅1,余正涛1,2,苏磊1,2,赵君1,田维1. 基于信息熵的半监督领域实体关系抽取研究[J]. 山东大学学报(工学版), 2011, 41(4): 7-12.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 陈瑞,李红伟,田靖. 磁极数对径向磁轴承承载力的影响[J]. 山东大学学报(工学版), 2018, 48(2): 81 -85 .
[2] 王静,李玉江,张晓瑾, 毕研俊,陈位锁 . 粉煤灰去除水中活性紫KN-B[J]. 山东大学学报(工学版), 2006, 36(6): 100 -103 .
[3] 关小军,韩振强,申孝民,麻晓飞,刘运腾 . 09CuPTiRE钢动态再结晶的热模拟实验与有限元模拟[J]. 山东大学学报(工学版), 2006, 36(5): 17 -20 .
[4] 孔维涛,张庆范,张承慧 . 基于DSP的空间矢量脉宽调制(SVPWM)的实现[J]. 山东大学学报(工学版), 2008, 38(3): 81 -84 .
[5] 田芳1,张颖欣2,张礼3,侯秀萍3,裘南畹3. 新型金属氧化物薄膜气敏元件基材料的开发[J]. 山东大学学报(工学版), 2009, 39(2): 104 -107 .
[6] 郑桂兰,关瑞芳,隋 肃,李建权,李国忠 . 反应型反光型道路标线涂料识别效果研究[J]. 山东大学学报(工学版), 2007, 37(1): 86 -89 .
[7] 张迎春 王佐勋 王桂娟. 基于神经网络控制器的高压电缆测温系统[J]. 山东大学学报(工学版), 2009, 39(5): 62 -67 .
[8] 高厚磊 田佳 杜强 武志刚 刘淑敏. 能源开发新技术——分布式发电[J]. 山东大学学报(工学版), 2009, 39(5): 106 -110 .
[9] 侯燕,杨猛. 高效解决复杂拓扑问题的显式界面追踪算法[J]. 山东大学学报(工学版), 2016, 46(4): 15 -20 .
[10] 高明 史月涛 王妮妮 孙奉仲 平亚明. 侧风环境下自然通风湿式冷却塔周向进风变化规律[J]. 山东大学学报(工学版), 2009, 39(3): 154 -158 .