Journal of Shandong University (Engineering Science), 2017, Vol. 47, Issue (1): 15-21. doi: 10.6040/j.issn.1672-3961.0.2016.304

Random undersampling and POSS method for software defect prediction

FANG Hao, LI Yun*

  1. College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China
  • Received: 2016-07-22 Online: 2017-02-20 Published: 2016-07-22
  • Corresponding author: LI Yun (b. 1974), male, from Wangjiang, Anhui; professor, Ph.D.; main research interests: machine learning and pattern recognition. E-mail: liyun@njupt.edu.cn
  • About the author: FANG Hao (b. 1989), male, from Suqian, Jiangsu; master's student; main research interest: feature selection. E-mail: 15150662912@163.com
  • Funding:
    Natural Science Foundation of Jiangsu Province (BK20131378, BK20140885); Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (15206)

Abstract: Class imbalance in software defect data limits classifier performance. To address this problem, the POSS (Pareto optimization for subset selection) feature selection algorithm and random undersampling are applied to software defect prediction, and a support vector machine (SVM) is used to build the prediction model. The experimental results show that multiple rounds of random undersampling effectively address the imbalance in software defect data, while the POSS method treats subset selection as a bi-objective optimization problem, which improves classification accuracy. The proposed method outperforms the Relief, Fisher, and MI (mutual information) feature selection algorithms.

Key words: class imbalance, data sampling, feature selection, software defect prediction
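
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the undersampled subsets and POSS are combined, nor which loss POSS minimizes, so the sketch shows each piece separately. The function names `balanced_subsets` and `poss`, the parameters, the toy data, and the stand-in `toy_loss` are all assumptions made for illustration. The POSS loop follows the generally published outline of the algorithm (archive of non-dominated solutions, bit-flip mutation with probability 1/n, size bound of 2k) and is not taken from this paper. The SVM training step is omitted.

```python
import random


def balanced_subsets(labels, rounds, seed=0):
    """Draw `rounds` balanced index subsets by random undersampling.

    Every subset keeps all minority-class indices plus an equally sized
    random sample (without replacement) of the majority-class indices.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return [sorted(minority + rng.sample(majority, len(minority)))
            for _ in range(rounds)]


def poss(n, k, loss, iterations, seed=0):
    """Select at most k of n features by Pareto optimization.

    Two objectives are minimized together: f1 = loss(s) and f2 = |s|, where
    s is a tuple of 0/1 flags. `loss` is supplied by the caller.
    """
    rng = random.Random(seed)
    inf = float("inf")

    def objectives(s):
        size = sum(s)
        # Empty or oversized (>= 2k) subsets get an infinite first objective.
        return (inf if size == 0 or size >= 2 * k else loss(s)), size

    start = (0,) * n
    archive = {start: objectives(start)}  # mutually non-dominated solutions
    for _ in range(iterations):
        parent = rng.choice(list(archive))
        # Mutation: flip each bit independently with probability 1/n.
        child = tuple(1 - b if rng.random() < 1.0 / n else b for b in parent)
        if child in archive:
            continue
        c1, c2 = objectives(child)
        # Discard the child if some archived solution strictly dominates it.
        if any(a1 <= c1 and a2 <= c2 and (a1 < c1 or a2 < c2)
               for a1, a2 in archive.values()):
            continue
        # Otherwise drop every archived solution the child weakly dominates.
        archive = {s: f for s, f in archive.items()
                   if not (c1 <= f[0] and c2 <= f[1])}
        archive[child] = (c1, c2)
    feasible = [(f1, s) for s, (f1, size) in archive.items() if 0 < size <= k]
    return min(feasible)[1] if feasible else start


# Toy usage with made-up data: 3 defective modules among 12, and a stand-in
# loss that counts how many of the "useful" features 1 and 3 are left out.
labels = [1, 1, 1] + [0] * 9
subsets = balanced_subsets(labels, rounds=5)


def toy_loss(s):
    return sum(1 for j in (1, 3) if not s[j])


best = poss(n=6, k=2, loss=toy_loss, iterations=2000)
```

In a real pipeline the stand-in loss would be replaced by a measure of how poorly a classifier performs on the selected features, and one plausible arrangement (again an assumption) is to run feature selection and train one SVM per balanced subset, then aggregate the results.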

CLC number: TP391