
Web spam detection based on SMOTE and random forests

FANG Xiao-nan1,2, ZHANG Hua-xiang1,2*, GAO Shuang1,2   

  1. School of Information Science & Engineering, Shandong Normal University, Jinan 250014, China;
    2. Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014, China
  • Received: 2012-12-05 Online: 2013-02-20 Published: 2012-12-05


Web spam refers to actions intended to mislead search engines into ranking some pages higher than they deserve, which can significantly degrade the quality of search results. Because the Web spam dataset is severely imbalanced, the SMOTE oversampling method was used to balance the dataset, and classifiers were then trained with the random forests algorithm. Experimental comparison with conventional single classifiers and ensemble learning classifiers showed that the SMOTE+RF method performed better. The important parameters of the method were tuned based on experimental results, and the reasons why the AUC value improved after applying SMOTE were also analyzed. Experimental results on the WEBSPAM-UK2007 dataset showed that the method markedly improved classifier performance, with an AUC value exceeding the best result of the Web Spam Challenge 2008.
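The oversampling step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `balance`, the toy data, and the choice of `k=5` are assumptions made for illustration, and the base sample is picked at random rather than by iterating over every minority sample. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    new_points = smote(minority, max(n_needed, 0), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data (hypothetical): 8 "normal" hosts, 3 "spam" hosts, 2 features each.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4],
     [0.4, 0.2], [0.1, 0.4], [0.3, 0.1], [0.4, 0.4],
     [0.8, 0.9], [0.9, 0.8], [1.0, 1.0]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = balance(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))  # 16 8 8
```

In a pipeline like the one the abstract describes, the balanced training set would then be passed to a random forest classifier and evaluated by AUC on untouched test data. That step is not shown here, and oversampling should be applied to the training split only.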

Key words: search engine spamming, Web spam, ensemble learning, random forests, SMOTE

CLC Number: 

  • TP391