Software defect detection based on multiple random undersampling and the POSS method

doi:10.6040/j.issn.1672-3961.0.2016.304

Journal of Shandong University (Engineering Science), 2017, Vol. 47, Issue (1): 15-21.

FANG Hao, LI Yun^*

College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Received: 2016-07-22 Online: 2017-02-20 Published: 2016-07-22
Corresponding author: LI Yun (born 1974), male, from Wangjiang, Anhui; professor, Ph.D.; main research interests: machine learning and pattern recognition. E-mail: liyun@njupt.edu.cn
About the author: FANG Hao (born 1989), male, from Suqian, Jiangsu; master's student; main research interest: feature selection. E-mail: 15150662912@163.com
Funding: Natural Science Foundation of Jiangsu Province (BK20131378, BK20140885); Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (15206)

Abstract

Abstract: Class imbalance in software defect data limits classifier performance. To address this problem, the POSS (Pareto optimization for subset selection) feature selection algorithm and random undersampling are introduced into software defect prediction, and a support vector machine (SVM) is used to build the prediction model. The experimental results show that multiple rounds of random undersampling effectively address the imbalance in software defect data, while the POSS method treats subset selection as a bi-objective optimization problem, which improves classification accuracy. The proposed method outperforms the Relief, Fisher, and MI (mutual information) feature selection algorithms.

Key words: class imbalance, data sampling, feature selection, software defect prediction
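The two ingredients named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the toy loss, the iteration count, and the sample data are all hypothetical, and the abstract does not say how the balanced subsets, the feature selector, and the SVM are combined. `random_undersample` draws one balanced subset (calling it repeatedly gives the "multiple" undersampling), and `poss` is a simplified Pareto-style search that minimises the pair (loss, subset size) under a size budget `k`.

```python
import math
import random


def random_undersample(labels, rng):
    """Indices of a balanced subset: every minority sample plus an equally
    sized random draw (without replacement) from the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


def weakly_dominates(a, b):
    """True if objective pair a is no worse than b in both objectives."""
    return a[0] <= b[0] and a[1] <= b[1]


def poss(n_features, k, loss, iterations, rng):
    """Search for a subset of at most k features with small loss by
    minimising the pair (loss, subset size) and keeping a Pareto archive."""

    def objectives(subset):
        size = len(subset)
        # Empty or oversized subsets are treated as infeasible.
        first = math.inf if size == 0 or size >= 2 * k else loss(subset)
        return (first, size)

    empty = frozenset()
    archive = {empty: objectives(empty)}
    for _ in range(iterations):
        parent = rng.choice(sorted(archive, key=sorted))
        # Flip each feature's membership independently with probability 1/n.
        child = frozenset(
            j for j in range(n_features)
            if (j in parent) != (rng.random() < 1.0 / n_features)
        )
        score = objectives(child)
        if any(weakly_dominates(v, score) and v != score
               for v in archive.values()):
            continue  # strictly dominated by an archived subset: discard
        # Drop archived subsets that the child weakly dominates, then add it.
        archive = {s: v for s, v in archive.items()
                   if not weakly_dominates(score, v)}
        archive[child] = score
    feasible = [s for s in archive if len(s) <= k]
    return sorted(min(feasible, key=lambda s: archive[s][0]))


rng = random.Random(0)
labels = [0] * 8 + [1] * 2       # hypothetical data: 8 clean, 2 defective modules
balanced_sets = [random_undersample(labels, rng) for _ in range(5)]

relevant = {1, 4}                # hypothetical "useful" software metrics
toy_loss = lambda subset: len(relevant - subset)  # stand-in for a real criterion
selected = poss(n_features=6, k=2, loss=toy_loss, iterations=2000, rng=rng)
```

In a real pipeline the toy loss would be replaced by a criterion measured on the balanced subsets, and a classifier such as an SVM would then be trained on the selected features; those choices are left open here because the abstract does not specify them.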

CLC number: TP391

FANG Hao, LI Yun. Random undersampling and POSS method for software defect prediction[J]. Journal of Shandong University (Engineering Science), 2017, 47(1): 15-21.


