Hierarchical cost-sensitive decision tree and its application in mobile phone replacement prediction

doi:10.6040/j.issn.1672-3961.2.2015.190

Abstract

Abstract: In mobile phone user data sets there is a severe imbalance between non-replacement users and replacement users. Traditional data mining methods pursue overall accuracy when handling imbalanced data, which leads to low prediction accuracy for replacement users. To address this problem, a replacement prediction method based on a hierarchical cost-sensitive decision tree is proposed. First, rough set theory is used to reduce the attributes of the original data set and to compute the importance of each attribute. The attributes are then partitioned into blocks according to their importance to build a hierarchical structure. Finally, a cost-sensitive decision tree, constructed with the Gini index and misclassification cost as its splitting criterion, serves as the base classifier at each level. Three simulation experiments on customer data from a telecom operator show that the hierarchical cost-sensitive decision tree achieves good results both on the original imbalanced user data set and on the balanced user data set obtained by under-sampling.

Key words: hierarchical structure, decision tree, cost-sensitive, imbalanced data, mobile phone replacement prediction
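
The abstract names the Gini index and misclassification cost as the splitting criterion but does not give the formula. The following is only a minimal sketch of one plausible way to combine the two, not the authors' method. The cost matrix `COST`, the mixing weight `alpha`, the label coding (1 = replacement user, 0 = non-replacement user), and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical misclassification costs (not from the paper):
# COST[true_label][predicted_label]; 1 = replacement user, 0 = non-replacement user.
COST = {0: {0: 0.0, 1: 1.0}, 1: {0: 5.0, 1: 0.0}}


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def min_expected_cost(labels, cost=COST):
    """Average cost per sample when the node predicts its cheapest label."""
    n = len(labels)
    if n == 0:
        return 0.0
    return min(sum(cost[y][pred] for y in labels) for pred in cost) / n


def split_score(left, right, alpha=0.5, cost=COST):
    """Size-weighted mix of Gini index and normalized cost; lower is better."""
    n = len(left) + len(right)
    max_cost = max(max(row.values()) for row in cost.values())
    score = 0.0
    for child in (left, right):
        mixed = (alpha * gini(child)
                 + (1 - alpha) * min_expected_cost(child, cost) / max_cost)
        score += len(child) / n * mixed
    return score


def best_threshold(values, labels, alpha=0.5, cost=COST):
    """Return (threshold, score) with the lowest split score on one numeric attribute."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        s = split_score(left, right, alpha, cost)
        if s < best[1]:
            best = (t, s)
    return best
```

With these assumed costs, a node holding three non-replacement users and one replacement user is cheaper to label as "replacement" (total cost 3) than as "non-replacement" (total cost 5), which is how a cost term can pull the tree toward the minority class even when overall accuracy would favor the majority.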

CLC number: TP391

XIONG Bingyan, WANG Guoyin, DENG Weibin. Hierarchical cost-sensitive decision tree and its application in mobile phone replacement prediction[J]. Journal of Shandong University (Engineering Science), 2015, 45(5): 36-42.

