JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE) ›› 2013, Vol. 43 ›› Issue (1): 22-27.

• Articles •

Web spam detection based on SMOTE and random forests

FANG Xiao-nan1,2, ZHANG Hua-xiang1,2*, GAO Shuang1,2   

  1. School of Information Science & Engineering, Shandong Normal University, Jinan 250014, China;
    2. Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014, China
  • Received: 2012-12-05 Online: 2013-02-20 Published: 2012-12-05

Abstract:

Web spam refers to techniques intended to mislead search engines into ranking some pages higher than they deserve, which can significantly degrade the quality of search results. Because the Web spam dataset is severely imbalanced, the oversampling method SMOTE was used to balance the dataset, and classifiers were then trained with the random forests (RF) algorithm. Experimental comparison with conventional single classifiers and ensemble learning classifiers showed that the SMOTE+RF method performed better. The important parameters of the method were tuned based on the experimental results, and the reasons why the AUC value improved after applying SMOTE were analyzed. Experimental results on the WEBSPAM-UK2007 dataset showed that the method markedly improved classifier performance, with an AUC value that could exceed the best result of the Web Spam Challenge 2008.
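
The oversampling step and the AUC metric named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `auc`, the toy feature vectors, and the values of `k` and `n_new` are all assumptions chosen for illustration, and the random forest training step is omitted. SMOTE creates each synthetic minority sample by interpolating between a minority sample and one of its k nearest minority neighbours; AUC is computed here as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch only; k and n_new are not the paper's tuned values.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest neighbours of x among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels: 1 = spam, 0 = normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 4 spam hosts vs. 20 normal hosts,
# so 16 synthetic spam samples balance the two classes.
spam = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_spam = smote(spam, n_new=16, k=3)
balanced_spam = spam + new_spam
```

In a full pipeline, the balanced training set would be passed to a random forest classifier, and `auc` would be evaluated on held-out data that has not been oversampled.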

Key words: search engine spamming, Web spam, ensemble learning, random forests, SMOTE

CLC Number: 

  • TP391