
Journal of Shandong University (Engineering Science) ›› 2011, Vol. 41 ›› Issue (6): 1-6.

• Machine Learning and Data Mining •

  • About the author: LI Guo-he (1965- ), male, born in Pinghe, Fujian; professor, Ph.D., doctoral supervisor. His main research interests include artificial intelligence and knowledge discovery. E-mail: ligh@cup.edu.cn
  • Supported by:

    the National High-Tech Research and Development Program of China (863 Program) (2009AA062802); the National Natural Science Foundation of China (60473125); the CNPC Innovation Fund in Petroleum Science and Technology for Young and Middle-aged Researchers (05E7013); and a sub-project of a National Science and Technology Major Project (G5800-08-ZS-WX)

A method of feature selection for continuous attributes

LI Guo-he1,2, YUE Xiang1,2, LI Xue3, WU Wei-jiang1,2, LI Hong-qi1   

  1. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China;
    2. PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China;
    3. School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane 4072, Australia
  • Received: 2011-04-15 Online: 2011-12-16 Published: 2011-04-15


Abstract:

Feature selection is a data reduction method that strongly influences the efficiency and effectiveness of machine learning. Based on the distribution of objects and their class labels, the continuous feature space was partitioned into subspaces, each with a clear boundary and a single class label. Each subspace was then projected onto every feature, and the discriminating power of each feature, for the current subspace against all subspaces with different class labels, was estimated in a statistically meaningful way. From these estimates a discrimination-evaluation matrix was constructed, by which the features were ranked from the strongest classifying power to the weakest. An information-gain function over feature subsets was then defined and, following the ranking, features were added one by one to determine the optimal feature subset. Experiments on data sets from the UCI (University of California, Irvine) repository showed that the selected feature subsets improved both the efficiency and the classification accuracy of machine learning, demonstrating the feasibility of the proposed feature selection method.
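The abstract does not give the authors' exact evaluation matrix or gain function, so the pipeline it describes (rank features by how cleanly they carve the space into class-pure regions, then greedily grow a subset while an information-gain measure still improves) can only be sketched here under simplified assumptions. In the sketch below, `separability`, `info_gain`, and the rounding granularity used to form subspaces are illustrative stand-ins, not the paper's definitions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def separability(values, labels):
    """Score one continuous feature by how cleanly sorting objects along it
    yields single-class runs: fewer label changes between neighbors means
    cleaner subspaces, hence a higher score in [0, 1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    changes = sum(1 for a, b in zip(order, order[1:]) if labels[a] != labels[b])
    return 1.0 - changes / max(len(values) - 1, 1)

def info_gain(data, labels, subset):
    """Information gain of grouping objects by their (coarsely rounded)
    value tuple over the chosen features -- a crude stand-in for the
    paper's subset discrimination measure."""
    if not subset:
        return 0.0
    groups = {}
    for row, y in zip(data, labels):
        key = tuple(round(row[j], 1) for j in subset)
        groups.setdefault(key, []).append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_features(data, labels, eps=1e-9):
    """Rank features by separability, then add them one by one, keeping a
    feature only while the subset's information gain keeps improving."""
    m = len(data[0])
    ranked = sorted(range(m),
                    key=lambda j: separability([r[j] for r in data], labels),
                    reverse=True)
    chosen, best = [], 0.0
    for j in ranked:
        gain = info_gain(data, labels, chosen + [j])
        if gain > best + eps:
            chosen.append(j)
            best = gain
    return chosen
```

On a toy set where the first feature separates the classes and the second is noise, `select_features` keeps only feature 0; the greedy stop mirrors the abstract's idea of adding ranked features only while the subset's discriminating power grows.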

Key words: data reduction, feature selection, continuous attributes, decision table

CLC number: 

  • TP181