
Journal of Shandong University (Engineering Science), 2018, Vol. 48, Issue (6): 44-55. doi: 10.6040/j.issn.1672-3961.0.2018.198

• Machine Learning and Data Mining •

An adaptive ensemble classification method based on deep attribute weighting for data stream

Yao LI, Zhihai WANG*, Yan'ge SUN, Wei ZHANG

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2018-05-25 Online: 2018-12-20 Published: 2018-12-26
  • Contact: Zhihai WANG E-mail: 16120396@bjtu.edu.cn; zhhwang@bjtu.edu.cn
  • About the author: Yao LI (born 1993), male, from Huangshan, Anhui, is a master's student whose main research interests are data mining and machine learning. E-mail: 16120396@bjtu.edu.cn
  • Supported by:
    Beijing Natural Science Foundation (4182052); National Natural Science Foundation of China (61672086); National Natural Science Foundation of China (61702030); National Natural Science Foundation of China (61771058)

Abstract:

Most existing ensemble classification algorithms for data streams do not consider the importance of historical data when evaluating base classifiers, and they ignore interference from irrelevant and noisy attributes. To address these problems, an adaptive ensemble classification method based on deep attribute weighting for data streams (EMDAW) is proposed, which combines multiple naive Bayes models built with deep attribute weighting. Within each data block, the contribution of each attribute value to the class attribute is analyzed, and the learned local attribute weights are applied to the individual attribute values to reduce interference from noisy data. When a base classifier is evaluated, the importance of historical data is weighed against that of the most recent data. The base classifiers are combined by a voting strategy that uses both each classifier's confidence on the test instance and a weight derived from its classification accuracy, in order to improve overall classification performance. In comparative experiments against classical algorithms on multiple benchmark datasets, the proposed algorithm shows some advantage in classification accuracy and in adaptability to concept drift.

Key words: data stream, ensemble classification, deep attribute weighting, concept drift, adaptive
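
The combined voting idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's actual formulation, which is not given on this page: the function names `update_weight` and `combined_vote`, the blending factor `alpha`, and the linear blending rule are all assumptions made for illustration. Each base classifier contributes its posterior probability for a class (its confidence on the test instance) scaled by an accuracy-based weight that mixes its historical weight with its accuracy on the newest data block.

```python
from collections import defaultdict


def update_weight(old_weight, block_accuracy, alpha=0.5):
    """Blend a classifier's historical weight with its accuracy on the
    newest data block. The linear blend and alpha are assumptions,
    not the rule used in the paper."""
    return alpha * old_weight + (1.0 - alpha) * block_accuracy


def combined_vote(posteriors, weights):
    """Return the class with the largest weighted confidence.

    posteriors: one dict per base classifier, mapping class label to the
                classifier's posterior probability for the test instance.
    weights:    one accuracy-based weight per base classifier.
    """
    scores = defaultdict(float)
    for probs, weight in zip(posteriors, weights):
        for label, prob in probs.items():
            scores[label] += weight * prob
    return max(scores, key=scores.get)


# Hypothetical example with two base classifiers and two classes.
weights = [update_weight(0.80, 0.60), update_weight(0.70, 0.90)]
posteriors = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}]
prediction = combined_vote(posteriors, weights)
```

In this sketch a classifier that is very confident on the test instance can outvote a more heavily weighted but less confident one, which is the intended effect of combining confidence with accuracy-based weights.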

CLC Number:

  • TP391

Fig. 1

Several different types of concept drift

Fig. 2

Structure of attribute-weighted naive Bayes
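
For orientation, the general decision rule of attribute-weighted naive Bayes, as commonly stated in the literature, is sketched below. The notation is chosen here and is not taken from the paper: $C$ is the set of class labels, $a_i$ is the value of the $i$-th of $m$ attributes in instance $x$, and $w_i$ is the weight of attribute $i$. The paper's finer-grained weighting of individual attribute values and its weight-learning procedure are not given on this page, so they are not reproduced.

```latex
\hat{c}(x) = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{m} P(a_i \mid c)^{w_i}
```

Setting every $w_i = 1$ recovers standard naive Bayes, while a weight close to 0 suppresses the influence of an irrelevant or noisy attribute.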

Fig. 3

Framework of the adaptive ensemble classification algorithm for data streams based on deep attribute weighting

Table 1

Characteristics of the datasets

Dataset      Instances  Attributes  Classes  Noise/%  Drifts  Drift type
HYP          1 000 000  10          2        5        1       Incremental drift
SEA          1 000 000  3           4        10       9       Sudden drift
LEDM         1 000 000  24          10       10       3       Mixed drift
LEDND        1 000 000  24          10       20       0
Cover type   581 000    53          7                         Unknown
Electricity  45 000     7           2                         Unknown
Poker        1 000 000  10          10                        Unknown
Spam         9 342      500         2                         Unknown

Table 2

Comparison of the classification accuracy of several base classifiers

Unit: %
Classifier  HYP    SEAF   LEDm   LEDnd  Cover type  Electricity  Poker  Spam
DAW         72.34  84.34  67.44  51.57  82.36       79.12        83.44  84.25
NB          77.48  84.86  67.14  51.27  66.04       77.88        59.46  80.25
HOT         75.46  85.78  67.22  51.13  74.93       77.12        83.36  78.09

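Fig. 4

Classification accuracy of different algorithms for different numbers of ensemble classifiers

Fig. 5

Average classification accuracy of the proposed algorithm under different parameters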

Table 3

Classification accuracy on each dataset for different data block sizes

Unit: %
Dataset      Data block size
             500    750    1 000  1 250  1 500  1 750  2 000
HYP          84.27  85.39  85.97  86.19  86.29  86.60  86.59
SEAF         82.80  83.56  84.25  84.95  85.19  85.44  85.53
Electricity  77.14  77.27  79.33  78.69  78.12  78.50  78.51
Cover type   83.47  83.93  84.25  81.76  81.58  81.03  79.15

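Table 4

Classification accuracy on each dataset for different values of the parameter k in the ensemble strategy

Unit: %
Dataset      k=50   k=100  k=150  k=200
HYP          84.33  85.97  85.50  85.52
SEAF         83.80  84.71  84.83  84.33
Electricity  77.35  79.35  78.53  78.62
Cover type   84.04  84.25  84.02  83.84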

Table 5

Average training time per data block for different classification algorithms

Unit: ms
Dataset      AWE    AUE2   DDM      NB   Oza    DWM      NSE    EMDAW
HYP          239.1  156.2  1 333.6  0.2  104.5  4 384.2  331.6  460.4
SEAF         87.0   42.1   32.1     0.1  378.8  1 246.2  262.5  358.1
LEDM         230.1  150.3  101.3    0.2  124.6  108.6    534.6  125.1
LEDND        230.6  150.6  120.2    0.2  132.6  120.5    834.6  142.3
Cover type   296.6  133.2  349.4    0.8  447.3  63.5     640.0  260.9
Electricity  290.1  180.6  85.3     0.4  408.8  40.2     173.3  318.6
Poker        173.5  155.7  42.8     0.2  314.6  49.2     762.8  544.4
Spam         750.6  669.5  191.6    3.6  810.5  24.6     152.2  197.9

Table 6

Average classification accuracy of different classification algorithms

Unit: %
Dataset      AWE    AUE2   DDM    NB     Oza    DWM    NSE    EMDAW
HYP          82.45  83.54  76.54  77.48  83.05  82.93  84.43  85.97
SEAF         84.05  86.77  84.95  84.86  85.04  85.38  83.01  84.71
LEDM         67.08  67.58  66.70  67.14  67.55  67.12  62.86  67.21
LEDND        51.27  51.26  51.18  51.27  51.23  51.26  47.16  50.57
Cover type   81.70  84.05  74.36  66.04  80.52  77.29  79.70  84.25
Electricity  77.67  78.21  76.25  77.88  77.34  76.69  76.70  79.35
Poker        53.87  66.88  62.14  59.46  65.19  60.72  53.73  62.60
Spam         74.86  72.23  77.25  80.25  78.92  80.26  68.78  81.37

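Fig. 6

Average classification accuracy of the algorithms for different data block sizes

Fig. 7

Classification accuracy on the SEA dataset

Fig. 8

Classification accuracy on the HYP dataset

Fig. 9

Classification accuracy on the Electricity dataset

Fig. 10

Classification accuracy on the LEDm dataset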
