Journal of Shandong University (Engineering Science) ›› 2020, Vol. 50 ›› Issue (2): 118-128. doi: 10.6040/j.issn.1672-3961.0.2019.043

• Machine Learning & Data Mining •

Semantic analysis and vectorization for intelligent detection of big data cross-site scripting attacks

Haijun ZHANG1, Yinghui CHEN2,*

  1. School of Computer, Jiaying University, Meizhou 514015, Guangdong, China
    2. School of Mathematics, Jiaying University, Meizhou 514015, Guangdong, China
  • Received: 2019-01-29  Online: 2020-04-20  Published: 2020-04-16
  • Contact: Yinghui CHEN  E-mail: nihaoba_456@163.com
  • Supported by:
    National Natural Science Foundation of China (61171141); National Natural Science Foundation of China (61573145); Key Project of the Natural Science Foundation of Guangdong Province (2014B010104001); Key Project of the Natural Science Foundation of Guangdong Province (2015A030308018); Key Research Base Project for Humanities and Social Sciences Jointly Built by Guangdong Province and Its Cities in Regular Higher Education Institutions (18KYKT11); Key Project of the Natural Science Foundation of Jiaying University, Guangdong (2017KJZ02)

Abstract:

The access traffic corpus big data were word-vectorized using the methods of semantic scenario analysis and vectorization, realizing intelligent detection of cross-site scripting (XSS) attacks oriented to big data. Natural language processing methods were applied for data acquisition, data cleaning, data sampling, feature extraction and other data preprocessing. A neural-network-based word vectorization algorithm was then used to produce the word-vector big data. Through theoretical analysis and derivation, intelligent detection algorithms based on variants of long short-term memory (LSTM) networks with different numbers of layers were implemented. Repeated tests with different hyperparameters produced extensive results, including a highest recognition rate of 0.999 5, a minimum recognition rate of 0.264 3, an average recognition rate of 99.88%, a variance of 0, a standard deviation of 0.000 4, and curve diagrams of recognition rate change, loss error change, cosine proximity change of word vector samples, and mean absolute error change. The results showed that the algorithm achieved high recognition rates, strong stability and excellent overall performance.
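As a hypothetical illustration of the preprocessing stage described in the abstract (not the authors' exact code; the function names and delimiter set are assumptions), an access-traffic payload can be split into word-like tokens and indexed into a vocabulary before word vectorization:

```python
import re

def tokenize_payload(payload: str) -> list[str]:
    """Split an HTTP request payload into word-like tokens.

    HTML tags, attribute names, and JavaScript identifiers become
    separate tokens so a later word-vectorization step can embed them.
    """
    # Split on characters that delimit words in URLs/HTML/JS.
    tokens = re.split(r"[<>/=\"'()&?;:,\s]+", payload.lower())
    return [t for t in tokens if t]

def build_vocab(corpus: list[list[str]]) -> dict[str, int]:
    """Map each distinct token to an integer id (0 reserved for padding)."""
    vocab: dict[str, int] = {}
    for sentence in corpus:
        for tok in sentence:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

payload = '<script>alert("xss")</script>'
tokens = tokenize_payload(payload)
# tokens -> ['script', 'alert', 'xss', 'script']
vocab = build_vocab([tokens])
```

The integer ids produced this way would feed a neural word-vectorization model (e.g. a Word2Vec-style embedding, as suggested by reference [12]) whose output vectors form the input sequences for the LSTM detectors.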

Key words: web intrusion detection, cross-site scripting, natural language processing, deep long short-term memory network, big data

CLC Number: 

  • TP309.2

Fig.1

Schematic diagram of intelligent detection of big data cross-site scripting attacks based on semantic scenario analysis and vectorization
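The detection model in the framework above stacks LSTM layers. The gating arithmetic of a single LSTM cell step can be sketched in pure Python using the textbook equations (this is a generic formulation, not the paper's implementation; the weights below are illustrative assumptions):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step for scalar input/state (textbook equations).

    i = sigma(W_i [x, h]), f = sigma(W_f [x, h]), o = sigma(W_o [x, h]),
    g = tanh(W_g [x, h]); c = f*c_prev + i*g; h = o*tanh(c).
    W maps gate name -> (w_x, w_h, b).
    """
    gates = {}
    for name in ("i", "f", "o", "g"):
        w_x, w_h, b = W[name]
        z = w_x * x + w_h * h_prev + b
        gates[name] = math.tanh(z) if name == "g" else sigmoid(z)
    c = gates["f"] * c_prev + gates["i"] * gates["g"]  # new cell state
    h = gates["o"] * math.tanh(c)                      # new hidden state
    return h, c

# Illustrative weights (assumed, not from the paper).
W = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0),
     "o": (0.4, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a short input sequence of word-vector features
    h, c = lstm_step(x, h, c, W)
```

In the deep variants studied in the paper, several such layers are stacked so the hidden sequence of one layer becomes the input sequence of the next, with a final classification layer deciding attack vs. normal traffic.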

Table 1

Recognition rates for class Ⅰ based on different learning rates μ

No.  μ=0.001  μ=0.01  μ=0.1
1 0.988 5 0.982 9 0.278 4
2 0.994 1 0.994 4 0.273 7
3 0.994 3 0.994 8 0.278 7
4 0.994 8 0.995 3 0.264 3
5 0.995 2 0.995 5 0.280 1
6 0.995 5 0.995 7 0.279 7
7 0.995 7 0.995 6 0.278 9
8 0.995 9 0.995 8 0.281 1
9 0.996 0 0.995 8 0.279 6
10 0.996 1 0.996 1 0.278 3
11 0.996 3 0.996 1 0.278 7
12 0.996 4 0.996 0 0.282 5
13 0.996 5 0.995 7 0.279 1
14 0.996 5 0.996 0 0.279 2
15 0.996 7 0.996 1 0.275 1
16 0.996 6 0.996 4 0.279 0
17 0.995 7 0.996 5 0.279 0
18 0.996 3 0.996 5 0.279 6
19 0.996 5 0.996 6 0.270 7
20 0.996 7 0.996 8 0.280 4

Fig.2

Curve diagram of recognition rates for class Ⅰ based on different μ

Table 2

Recognition rates for class Ⅱ based on different μ

No.  μ=0.001  μ=0.01  μ=0.1
1 0.995 2 0.993 8 0.956 8
2 0.997 4 0.997 5 0.969 4
3 0.997 9 0.997 6 0.927 5
4 0.998 3 0.998 1 0.830 9
5 0.998 6 0.998 4 0.831 0
6 0.998 8 0.998 4 0.831 0
7 0.998 9 0.998 8 0.831 1
8 0.999 1 0.999 0 0.830 9
9 0.999 2 0.999 0 0.831 1
10 0.999 2 0.999 0 0.831 1
11 0.999 3 0.999 1 0.830 8
12 0.999 3 0.999 2 0.831 1
13 0.999 3 0.999 3 0.831 3
14 0.999 3 0.999 3 0.831 1
15 0.999 5 0.999 3 0.831 2
16 0.999 5 0.999 4 0.831 0
17 0.999 4 0.999 5 0.830 9
18 0.999 5 0.999 3 0.831 2
19 0.999 4 0.999 4 0.831 1
20 0.999 5 0.999 5 0.831 0

Table 3

Recognition rates for class Ⅰ based on different batch sizes

No.  BatchSize=50  BatchSize=100  BatchSize=500
1 0.993 2 0.993 5 0.982 9
2 0.993 4 0.994 2 0.994 4
3 0.993 9 0.994 6 0.994 8
4 0.994 2 0.994 8 0.995 3
5 0.993 8 0.994 9 0.995 5
6 0.994 5 0.994 6 0.995 6
7 0.994 5 0.994 6 0.995 6
8 0.994 5 0.993 3 0.995 8
9 0.994 5 0.994 2 0.995 8
10 0.994 7 0.994 6 0.996 1
11 0.994 4 0.993 4 0.996 1
12 0.994 6 0.994 9 0.996 0
13 0.994 6 0.994 9 0.995 7
14 0.994 5 0.995 2 0.996 0
15 0.994 7 0.995 3 0.996 1
16 0.994 7 0.995 1 0.996 4
17 0.994 9 0.995 2 0.996 5
18 0.994 9 0.995 3 0.996 5
19 0.994 9 0.994 9 0.996 6
20 0.995 2 0.994 7 0.996 8

Table 4

Recognition rates for class Ⅱ based on different batch sizes

No.  BatchSize=50  BatchSize=100  BatchSize=500
1 0.994 9 0.993 8 0.992 9
2 0.997 5 0.997 5 0.996 9
3 0.997 9 0.997 6 0.997 3
4 0.998 3 0.998 1 0.997 6
5 0.998 7 0.998 4 0.998 0
6 0.998 9 0.998 4 0.998 3
7 0.999 0 0.998 8 0.998 4
8 0.999 1 0.999 0 0.998 5
9 0.999 1 0.999 0 0.998 8
10 0.999 1 0.999 0 0.998 8
11 0.999 2 0.999 1 0.999 0
12 0.999 2 0.999 2 0.999 1
13 0.999 3 0.999 3 0.999 0
14 0.999 4 0.999 3 0.999 2
15 0.999 3 0.999 3 0.999 3
16 0.999 4 0.999 4 0.999 1
17 0.999 4 0.999 5 0.999 3
18 0.999 4 0.999 3 0.999 3
19 0.999 3 0.999 4 0.999 3
20 0.999 4 0.999 5 0.999 3

Table 5

Recognition rates for class Ⅰ based on different numbers of neurons

No.  64 neurons  128 neurons  256 neurons
1 0.993 2 0.991 1 0.991 8
2 0.993 4 0.993 6 0.993 5
3 0.993 9 0.993 9 0.993 5
4 0.994 2 0.992 6 0.993 2
5 0.993 8 0.993 9 0.993 4
6 0.994 5 0.994 0 0.992 7
7 0.994 5 0.994 1 0.993 5
8 0.994 5 0.994 5 0.993 8
9 0.993 8 0.994 1 0.993 9
10 0.994 7 0.994 2 0.994 0
11 0.994 4 0.994 3 0.994 1
12 0.994 6 0.994 3 0.994 3
13 0.994 6 0.993 8 0.991 6
14 0.994 5 0.991 4 0.993 4
15 0.994 7 0.994 1 0.993 8
16 0.994 7 0.994 2 0.993 8
17 0.994 9 0.994 4 0.993 5
18 0.994 9 0.994 5 0.992 3
19 0.994 9 0.994 5 0.993 7
20 0.995 2 0.994 7 0.993 3

Table 6

Recognition rates for class Ⅱ based on different numbers of neurons

No.  64 neurons  128 neurons  256 neurons
1 0.993 8 0.996 8 0.993 9
2 0.997 5 0.996 6 0.994 0
3 0.997 5 0.996 6 0.994 0
4 0.998 1 0.997 2 0.995 4
5 0.998 4 0.997 1 0.994 0
6 0.998 4 0.996 8 0.994 8
7 0.998 8 0.996 8 0.994 9
8 0.999 0 0.996 8 0.994 9
9 0.999 0 0.995 0 0.995 0
10 0.999 0 0.996 0 0.995 0
11 0.999 1 0.995 0 0.994 9
12 0.999 2 0.995 0 0.994 8
13 0.999 3 0.997 1 0.994 8
14 0.999 3 0.996 7 0.994 8
15 0.999 3 0.994 3 0.994 8
16 0.999 4 0.996 6 0.994 7
17 0.999 5 0.996 6 0.994 8
18 0.999 3 0.996 9 0.994 8
19 0.999 4 0.997 0 0.994 8
20 0.999 5 0.997 0 0.995 0

Table 7

Average recognition rate, variance and standard deviation for class Ⅰ based on different μ (%)

μ  Average recognition rate  Variance  Standard deviation
0.1 99.551 5 0.000 3 0.182 1
1 99.523 0 0.000 8 0.296 1
10 27.780 5 0.001 6 0.411 6
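The summary statistics in Table 7 can be reproduced from the per-round rates in Table 1. For example, using the μ=0.001 column of Table 1 (which appears to correspond to the first row of Table 7, with rates converted to percentages) and the sample standard deviation:

```python
from statistics import mean, stdev

# Recognition rates for class I at learning rate mu = 0.001 (Table 1).
rates = [0.9885, 0.9941, 0.9943, 0.9948, 0.9952, 0.9955, 0.9957,
         0.9959, 0.9960, 0.9961, 0.9963, 0.9964, 0.9965, 0.9965,
         0.9967, 0.9966, 0.9957, 0.9963, 0.9965, 0.9967]
pct = [100 * r for r in rates]  # convert to %, as in Table 7

avg = mean(pct)   # average recognition rate, %
sd = stdev(pct)   # sample standard deviation, %
# avg -> 99.5515, sd -> 0.1821 (matching the first row of Table 7)
```

Note that `stdev` (the sample standard deviation, dividing by n-1) matches the tabulated 0.182 1, which suggests the paper reports sample rather than population statistics.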

Table 8

Average recognition rate, variance and standard deviation for class Ⅱ based on different μ (%)

μ  Average recognition rate  Variance  Standard deviation
0.1 99.883 0 0.000 1 0.102 9
1 99.864 5 0.000 2 0.128 8
10 84.907 5 0.188 8 4.457 9

Table 9

Average recognition rate, variance and standard deviation for class Ⅰ based on different batch sizes (%)

BatchSize  Average recognition rate  Variance  Standard deviation
5 000 99.443 0 0 0.050 6
10 000 99.461 0 0 0.060 8
50 000 99.522 5 0.000 8 0.296 0

Table 10

Average recognition rate, variance and standard deviation for class Ⅱ based on different batch sizes (%)

BatchSize  Average recognition rate  Variance  Standard deviation
5 000 99.879 0 0.000 1 0.105 5
10 000 99.864 5 0.000 2 0.128 8
20 000 99.837 0 0.000 2 0.147 3

Table 11

Average recognition rate, variance and standard deviation for class Ⅰ based on different numbers of neurons (%)

Number of neurons  Average recognition rate  Variance  Standard deviation
6 400 99.443 0 0 0.050 6
12 800 99.381 0 0.000 1 0.098 2
25 600 99.335 5 0.000 1 0.072 7

Table 12

Average recognition rate, variance and standard deviation for class Ⅱ based on different numbers of neurons (%)

Number of neurons  Average recognition rate  Variance  Standard deviation
6 400 99.864 5 0.000 2 0.128 8
12 800 99.640 0 0.000 1 0.085 8
25 600 99.470 5 0.000 0 0.040 2

Fig.3

Bar chart of average recognition rates for classes Ⅰ and Ⅱ based on different μ

Fig.4

Bar chart of standard deviations for classes Ⅰ and Ⅱ based on different μ

Fig.5

Curve diagram of recognition rate change

Fig.6

Curve diagram of loss error change

Fig.7

Curve diagram of cosine proximity change

Fig.8

Curve diagram of mean absolute error change

1 NAIR Akhil. Prevention of cross site scripting (XSS) and securing web application at client side[J]. International Journal of Emerging Technology and Computer Science, 2018, 3(2): 83-86.
2 RODRIGUEZ G E, BENAVIDES D E, TORRES J, et al. Cookie scout: an analytic model for prevention of cross-site scripting (XSS) using a cookie classifier[C]//Proceedings of the International Conference on Information Technology & Systems. Berlin, Germany: Springer Cham Press, 2018: 497-507.
3 XU K, GUO S, CAO N, et al. ECGLens: interactive visual exploration of large scale ECG data for arrhythmia detection[C]//Proceedings of the ACM CHI Conference on Human Factors in Computing Systems. Chicago, USA: ACM Press, 2018: 1-12.
4 KAHNG M , ANDREWS P Y , KALRO A , et al. ActiVis: visual exploration of industry-scale deep neural network models[J]. IEEE Trans. Visualization and Computer Graphics, 2018, 24 (1): 88- 97.
doi: 10.1109/TVCG.2017.2744718
5 LIU M , SHI J , CAO K , et al. Analyzing the training processes of deep generative models[J]. IEEE Trans. Visualization and Computer Graphics, 2018, 24 (1): 77- 87.
doi: 10.1109/TVCG.2017.2744938
6 ZHANG Haijun , XIAO Nanfeng . Parallel implementation of multilayered neural networks based on Map-Reduce on cloud computing clusters[J]. Soft Computing, 2016, 20 (4): 1471- 1483.
doi: 10.1007/s00500-015-1599-3
7 LI Yuanzhi, LIANG Yingyu. Learning overparameterized neural networks via stochastic gradient descent on structured data[EB/OL]. (2018-08-03)[2018-08-20]. https://arxiv.org/abs/1808.01204.
8 ALLEN-ZHU Zeyuan, LI Yuanzhi, SONG Zhao. On the convergence rate of training recurrent neural networks[J/OL]. arXiv: 1810.12065v4(2018-10-29)[2019-05-27]. https://arxiv.org/abs/1810.12065.
9 ZHANG Haijun , ZHANG Nan , XIAO Nanfeng . Fire detection and identification method based on visual attention mechanism[J]. Optik, 2015, 126 (6): 5011- 5018.
10 CHEN Minmin, JEFFREY Pennington, SAMUEL S S. Dynamical isometry and a mean field theory of RNNs: gating enables signal propagation in recurrent neural networks[EB/OL]. (2018-06-14)[2019-02-08]. http://proceedings.mlr.press/v80/chen18i.html.
11 ANDROS Tjandra, SAKRIANI Sakti, SATOSHI Nakamura. Tensor decomposition for compressing recurrent neural network[EB/OL]. (2018-02-28)[2018-05-08]. https://arxiv.org/abs/1802.10410.
12 CHEN Qufei, MARINA Sokolova. Word2Vec and Doc2Vec in unsupervised sentiment analysis of clinical discharge summaries[EB/OL]. (2018-05-01)[2018-05-01]. https://arxiv.org/abs/1805.00352.
13 DL4J. Word2Vec, Doc2Vec & GloVe: neural word embeddings for natural language processing[EB/OL]. (2018-03-01)[2018-06-05]. https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec.
14 RINA Panigrahy, SUSHANT Sachdeva, ZHANG Qiuyi. Convergence results for neural networks via electrodynamics[J/OL]. arXiv: 1702.00458v5(2017-02-01)[2018-12-04]. https://arxiv.org/abs/1702.00458.
15 BORDERS Florian, BERTHIER Tess, JORIO L D, et al. Iteratively unveiling new regions of interest in deep learning models[EB/OL]. (2018-04-11)[2018-06-11]. https://openreview.net/forum?id=rJz89iiiM.
16 KINDERMANS P J, KRISTOF T S, MAXIMILIAN Alber, et al. Learning how to explain neural networks: patternnet and pattern attribution[EB/OL]. (2017-05-16)[2017-10-24]. https://arxiv.org/abs/1705.05598.
17 CHOO Jaegul , LIU Shixia . Visual analytics for explainable deep learning[J]. Computer Graphics and Applications IEEE, 2018, 38 (4): 84- 92.
18 SMILKOV Daniel, THORAT Nikhil, KIM Been, et al. Smoothgrad: removing noise by adding noise[J/OL]. arXiv: 1706.03825v1(2017-06-12)[2017-06-12]. https://arxiv.org/abs/1706.03825.
19 CHEN H , CHIANG R H L , STOREY V C . Business intelligence and analytics: From big data to big impact[J]. MIS Quarterly, 2012, 36 (4): 1165- 1188.
doi: 10.2307/41703503
20 KWON O , LEE N , SHIN B . Data quality management, data usage experience and acquisition intention of big data analytics[J]. International Journal of Information Management, 2014, 34 (3): 387- 394.
doi: 10.1016/j.ijinfomgt.2014.02.002
21 TechAmerica Foundation, Federal Big Data Commission. Demystifying big data: a practical guide to transforming the business of government[EB/OL]. (2012-10-01)[2012-10-05]. http://www.techamerica.org/Docs/fileManager.cfm?f=techamerica-bigdatareport-final.pdf.
22 TRIGUERO Isaac , PERALTA Daniel , BACARDIT Jaume , et al. MRPR: a MapReduce solution for prototype reduction in big data classification[J]. Neurocomputing, 2015, 150 (1): 331- 345.