Journal of Shandong University (Engineering Science), 2017, Vol. 47, Issue (1): 15-21. doi: 10.6040/j.issn.1672-3961.0.2016.304
FANG Hao (方昊), LI Yun (李云)*
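Abstract: Class imbalance in software defect data limits classifier performance. To address this, the POSS (Pareto optimization for subset selection) feature selection algorithm and random under-sampling are introduced into software defect detection, and a support vector machine (SVM) is used to build the prediction model. Experimental results show that repeated random under-sampling effectively handles the imbalance in software defect data, while POSS optimizes the target subset in two directions at once, which improves classification accuracy. The results are better than those of the Relief, Fisher, and MI (mutual information) feature selection algorithms.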
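The pipeline in the abstract can be illustrated with a minimal sketch. It assumes the usual shape of Pareto optimization for subset selection: a small archive of non-dominated feature masks, bit-flip mutation, and two objectives (loss and subset size). Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the toy data, the `2k` size cap, the iteration count, and above all the nearest-centroid error that stands in for the SVM the authors use. The sketch also draws a single balanced sample, whereas the abstract describes repeated under-sampling.

```python
import random


def random_undersample(X, y, rng):
    """Balance a binary dataset by randomly discarding majority-class rows."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]


def centroid_error(X, y, mask):
    """Stand-in loss (assumption): training error of a nearest-centroid rule
    on the features flagged in `mask`. The paper trains an SVM instead."""
    cols = [j for j, bit in enumerate(mask) if bit]
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]

    def predict(x):
        dist = {label: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
                for label, cen in centroids.items()}
        return min(dist, key=dist.get)

    return sum(predict(x) != t for x, t in zip(X, y)) / len(y)


def poss(n_features, k, loss, iterations, rng):
    """Search for at most k features, minimizing (loss, subset size) jointly."""
    def objectives(mask):
        size = sum(mask)
        # Empty or oversized subsets (size >= 2k) count as infeasible.
        f1 = float("inf") if size == 0 or size >= 2 * k else loss(mask)
        return (f1, size)

    def weakly_dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1]

    empty = (0,) * n_features
    population = {empty: objectives(empty)}
    for _ in range(iterations):
        parent = rng.choice(list(population))
        # Mutation: flip each bit independently with probability 1/n.
        child = tuple(1 - b if rng.random() < 1.0 / n_features else b
                      for b in parent)
        score = objectives(child)
        # Reject the child if an archived solution strictly dominates it.
        if any(weakly_dominates(s, score) and s != score
               for s in population.values()):
            continue
        # Otherwise drop everything the child weakly dominates, then keep it.
        population = {m: s for m, s in population.items()
                      if not weakly_dominates(score, s)}
        population[child] = score
    feasible = [m for m, s in population.items() if 0 < s[1] <= k]
    return min(feasible, key=lambda m: population[m]) if feasible else empty


rng = random.Random(0)
# Toy data (assumption): 8 defective modules out of 40; only feature 0
# tracks the label, features 1-4 are label-independent filler.
y = [1 if i < 8 else 0 for i in range(40)]
X = [[y[i]] + [(7 * i + 3 * j) % 5 for j in range(1, 5)] for i in range(40)]
Xb, yb = random_undersample(X, y, rng)
selected = poss(5, 2, lambda m: centroid_error(Xb, yb, m), 2000, rng)
print(sum(yb), len(yb), selected)
```

Keeping subset size as a second objective, rather than a hard constraint, lets the search hold several trade-offs between accuracy and compactness at once and pick the best mask of size at most `k` only at the end. To move closer to the setup in the abstract, one would swap `centroid_error` for a cross-validated SVM error and repeat the under-sampling step several times.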