Journal of Shandong University (Engineering Science) ›› 2013, Vol. 43 ›› Issue (1): 22-27.

Web spam detection based on SMOTE and random forests

FANG Xiao-nan1,2, ZHANG Hua-xiang1,2*, GAO Shuang1,2

  1. School of Information Science & Engineering, Shandong Normal University, Jinan 250014, China;
  2. Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014, China
  • Received: 2012-12-05 Online: 2013-02-20 Published: 2012-12-05
  • Corresponding author: ZHANG Hua-xiang (1966- ), male, born in Jining, Shandong; professor and doctoral supervisor. Research interests: machine learning, pattern recognition, and Web mining. E-mail: huaxzhang@163.com
  • About the author: FANG Xiao-nan (1979- ), male, born in Dezhou, Shandong; lecturer and doctoral candidate. Research interests: machine learning and Web mining. E-mail: franknan@126.com


Web spam refers to techniques that cause a page to rank higher in search engine results than it deserves, which seriously degrades the quality of search results. Because Web spam datasets are severely imbalanced, this study proposes first balancing the dataset with the SMOTE oversampling method and then training the classifier with the random forests algorithm. Comparative experiments against common single classifiers and ensemble learning classifiers show that the SMOTE+RF method performs notably well. The important parameters of the method are tuned based on the experimental results, and the reasons why the AUC value improves after applying SMOTE are analyzed. Experiments on the WEBSPAM UK2007 dataset show that the method markedly improves classification performance, with an AUC value exceeding the best result of the Web Spam Challenge 2008.

Key words: search engine spamming, Web spam, ensemble learning, random forests, SMOTE
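
The balancing step described in the abstract can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' implementation, which the abstract does not specify. The function names `smote` and `balance`, the toy feature vectors, and the default `k=5` are assumptions made for illustration, and the sketch picks base samples at random rather than iterating over every minority sample as the original SMOTE formulation does. Each synthetic sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    n_needed = max(n_majority - len(minority), 0)
    new = smote(minority, n_needed, k=k, seed=seed)
    return X + new, y + [minority_label] * len(new)


# Hypothetical toy data: label 1 (spam) is the minority class.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.3],
     [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))
```

The balanced set `X_bal`, `y_bal` would then be passed to a random forest learner, for example scikit-learn's `RandomForestClassifier`, and evaluated by AUC on held-out data that has not been oversampled.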


