
Journal of Shandong University (Engineering Science), 2020, Vol. 50, Issue 2: 118-128. doi: 10.6040/j.issn.1672-3961.0.2019.043

• Machine Learning and Data Mining •

Semantic analysis and vectorization for intelligent detection of big data cross-site scripting attacks

Haijun ZHANG1, Yinghui CHEN2,*

  1. School of Computer, Jiaying University, Meizhou 514015, Guangdong, China
    2. School of Mathematics, Jiaying University, Meizhou 514015, Guangdong, China
  • Received: 2019-01-29  Online: 2020-04-20  Published: 2020-04-16
  • Contact: Yinghui CHEN  E-mail: nihaoba_456@163.com
  • About the first author: ZHANG Haijun (1978—), male, born in Ganzhou, Jiangxi; lecturer, Ph.D.; his research interests include intelligent computing, intelligent control, pattern recognition, and deep learning. E-mail: nihaoba_456@163.com
  • Supported by:
    the National Natural Science Foundation of China (61171141, 61573145); the Key Program of the Natural Science Foundation of Guangdong Province (2014B010104001, 2015A030308018); the Key Research Base of Humanities and Social Sciences Jointly Built by Guangdong Province and Municipalities at Regular Higher Education Institutions (18KYKT11); and the Key Program of the Natural Science Foundation of Jiaying University, Guangdong (2017KJZ02)



Abstract:

Based on semantic scenario analysis and vectorization, the big data of an access-traffic corpus were converted into word vectors, realizing intelligent detection of cross-site scripting (XSS) attacks in big data. Natural language processing methods were used for data preprocessing, including data acquisition, data cleaning, data sampling and feature extraction. A neural-network-based word vectorization algorithm was designed to produce the word-vector big data, and, through theoretical analysis and derivation, intelligent detection algorithms based on long short-term memory (LSTM) networks of several different depths were implemented. Repeated experiments with different hyperparameters yielded a maximum recognition rate of 0.999 5, a minimum recognition rate of 0.264 3, a mean recognition rate of 99.88%, a variance of 0 and a standard deviation of 0.000 4, together with curves showing the evolution of the recognition rate, the loss, the cosine proximity of word-vector samples and the mean absolute error. The results showed that the algorithm offered high recognition rates, strong stability and excellent overall performance.

Key words: web intrusion detection, cross-site scripting, natural language processing, deep long short-term memory network, big data

CLC number:

  • TP309.2
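The preprocessing pipeline named in the abstract — data acquisition, cleaning, sampling and feature extraction over an access-traffic corpus — starts by turning raw request strings into tokens that a word-vectorization model can consume. A minimal Python sketch of such a tokenizer follows; the function name and the cleaning rules (layered URL-decoding, lower-casing, splitting off markup symbols) are illustrative assumptions, not taken from the paper:

```python
import re
from urllib.parse import unquote

def tokenize_request(raw: str) -> list:
    """Normalize and tokenize one HTTP request string for corpus building."""
    s = raw
    for _ in range(3):                 # undo layered %-encoding
        decoded = unquote(s)
        if decoded == s:
            break
        s = decoded
    s = s.lower()
    # split into word-like tokens and single punctuation symbols,
    # so markup such as <script> survives as '<', 'script', '>'
    return re.findall(r"[a-z0-9_]+|[<>()=/;'\"]", s)

payload = "/search?q=%3Cscript%3Ealert(1)%3C/script%3E"
print(tokenize_request(payload))
```

Keeping symbols such as `<`, `script` and `>` as separate tokens is what lets a downstream model learn that the `<script>` pattern is characteristic of XSS payloads.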

Figure 1

Schematic of intelligent detection of cross-site scripting attacks in big data based on semantic scenario analysis and vectorization

Table 1

Recognition rates of Class Ⅰ for different learning rates μ

No.  μ=0.001  μ=0.01  μ=0.1
1 0.988 5 0.982 9 0.278 4
2 0.994 1 0.994 4 0.273 7
3 0.994 3 0.994 8 0.278 7
4 0.994 8 0.995 3 0.264 3
5 0.995 2 0.995 5 0.280 1
6 0.995 5 0.995 7 0.279 7
7 0.995 7 0.995 6 0.278 9
8 0.995 9 0.995 8 0.281 1
9 0.996 0 0.995 8 0.279 6
10 0.996 1 0.996 1 0.278 3
11 0.996 3 0.996 1 0.278 7
12 0.996 4 0.996 0 0.282 5
13 0.996 5 0.995 7 0.279 1
14 0.996 5 0.996 0 0.279 2
15 0.996 7 0.996 1 0.275 1
16 0.996 6 0.996 4 0.279 0
17 0.995 7 0.996 5 0.279 0
18 0.996 3 0.996 5 0.279 6
19 0.996 5 0.996 6 0.270 7
20 0.996 7 0.996 8 0.280 4

Figure 2

Recognition rate curves of Class Ⅰ for different learning rates μ

Table 2

Recognition rates of Class Ⅱ for different learning rates μ

No.  μ=0.001  μ=0.01  μ=0.1
1 0.995 2 0.993 8 0.956 8
2 0.997 4 0.997 5 0.969 4
3 0.997 9 0.997 6 0.927 5
4 0.998 3 0.998 1 0.830 9
5 0.998 6 0.998 4 0.831 0
6 0.998 8 0.998 4 0.831 0
7 0.998 9 0.998 8 0.831 1
8 0.999 1 0.999 0 0.830 9
9 0.999 2 0.999 0 0.831 1
10 0.999 2 0.999 0 0.831 1
11 0.999 3 0.999 1 0.830 8
12 0.999 3 0.999 2 0.831 1
13 0.999 3 0.999 3 0.831 3
14 0.999 3 0.999 3 0.831 1
15 0.999 5 0.999 3 0.831 2
16 0.999 5 0.999 4 0.831 0
17 0.999 4 0.999 5 0.830 9
18 0.999 5 0.999 3 0.831 2
19 0.999 4 0.999 4 0.831 1
20 0.999 5 0.999 5 0.831 0
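In Tables 1 and 2 the columns for the largest learning rate collapse to much lower recognition rates while the smaller rates converge smoothly — the usual symptom of a gradient-descent step size that is too large. The sweep protocol itself (train once per setting, record the resulting accuracy) can be sketched with a stand-in model; the logistic-regression classifier and synthetic data below are hypothetical placeholders for the paper's LSTM and corpus:

```python
import numpy as np

def train_logreg(lr: float, steps: int = 200, seed: int = 0) -> float:
    """Train a tiny logistic-regression classifier with full-batch
    gradient descent and return its final training accuracy."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(400, 8))
    w_true = rng.normal(size=8)
    y = (X @ w_true > 0).astype(float)      # linearly separable labels
    w = np.zeros(8)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient of the log-loss
    return float(((X @ w > 0) == y).mean())

# sweep the learning rate, as in Tables 1 and 2
for lr in (0.001, 0.01, 0.1):
    print(lr, round(train_logreg(lr), 4))
```

On this convex toy problem all three step sizes happen to converge; the point is the sweep-and-record protocol, not the particular numbers.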

Table 3

Recognition rates of Class Ⅰ for different batch sizes

No.  BatchSize=50  BatchSize=100  BatchSize=500
1 0.993 2 0.993 5 0.982 9
2 0.993 4 0.994 2 0.994 4
3 0.993 9 0.994 6 0.994 8
4 0.994 2 0.994 8 0.995 3
5 0.993 8 0.994 9 0.995 5
6 0.994 5 0.994 6 0.995 6
7 0.994 5 0.994 6 0.995 6
8 0.994 5 0.993 3 0.995 8
9 0.994 5 0.994 2 0.995 8
10 0.994 7 0.994 6 0.996 1
11 0.994 4 0.993 4 0.996 1
12 0.994 6 0.994 9 0.996 0
13 0.994 6 0.994 9 0.995 7
14 0.994 5 0.995 2 0.996 0
15 0.994 7 0.995 3 0.996 1
16 0.994 7 0.995 1 0.996 4
17 0.994 9 0.995 2 0.996 5
18 0.994 9 0.995 3 0.996 5
19 0.994 9 0.994 9 0.996 6
20 0.995 2 0.994 7 0.996 8

Table 4

Recognition rates of Class Ⅱ for different batch sizes

No.  BatchSize=50  BatchSize=100  BatchSize=500
1 0.994 9 0.993 8 0.992 9
2 0.997 5 0.997 5 0.996 9
3 0.997 9 0.997 6 0.997 3
4 0.998 3 0.998 1 0.997 6
5 0.998 7 0.998 4 0.998 0
6 0.998 9 0.998 4 0.998 3
7 0.999 0 0.998 8 0.998 4
8 0.999 1 0.999 0 0.998 5
9 0.999 1 0.999 0 0.998 8
10 0.999 1 0.999 0 0.998 8
11 0.999 2 0.999 1 0.999 0
12 0.999 2 0.999 2 0.999 1
13 0.999 3 0.999 3 0.999 0
14 0.999 4 0.999 3 0.999 2
15 0.999 3 0.999 3 0.999 3
16 0.999 4 0.999 4 0.999 1
17 0.999 4 0.999 5 0.999 3
18 0.999 4 0.999 3 0.999 3
19 0.999 3 0.999 4 0.999 3
20 0.999 4 0.999 5 0.999 3

Table 5

Recognition rates of Class Ⅰ for different numbers of neurons

No.  neurons=64  neurons=128  neurons=256
1 0.993 2 0.991 1 0.991 8
2 0.993 4 0.993 6 0.993 5
3 0.993 9 0.993 9 0.993 5
4 0.994 2 0.992 6 0.993 2
5 0.993 8 0.993 9 0.993 4
6 0.994 5 0.994 0 0.992 7
7 0.994 5 0.994 1 0.993 5
8 0.994 5 0.994 5 0.993 8
9 0.993 8 0.994 1 0.993 9
10 0.994 7 0.994 2 0.994 0
11 0.994 4 0.994 3 0.994 1
12 0.994 6 0.994 3 0.994 3
13 0.994 6 0.993 8 0.991 6
14 0.994 5 0.991 4 0.993 4
15 0.994 7 0.994 1 0.993 8
16 0.994 7 0.994 2 0.993 8
17 0.994 9 0.994 4 0.993 5
18 0.994 9 0.994 5 0.992 3
19 0.994 9 0.994 5 0.993 7
20 0.995 2 0.994 7 0.993 3

Table 6

Recognition rates of Class Ⅱ for different numbers of neurons

No.  neurons=64  neurons=128  neurons=256
1 0.993 8 0.996 8 0.993 9
2 0.997 5 0.996 6 0.994 0
3 0.997 5 0.996 6 0.994 0
4 0.998 1 0.997 2 0.995 4
5 0.998 4 0.997 1 0.994 0
6 0.998 4 0.996 8 0.994 8
7 0.998 8 0.996 8 0.994 9
8 0.999 0 0.996 8 0.994 9
9 0.999 0 0.995 0 0.995 0
10 0.999 0 0.996 0 0.995 0
11 0.999 1 0.995 0 0.994 9
12 0.999 2 0.995 0 0.994 8
13 0.999 3 0.997 1 0.994 8
14 0.999 3 0.996 7 0.994 8
15 0.999 3 0.994 3 0.994 8
16 0.999 4 0.996 6 0.994 7
17 0.999 5 0.996 6 0.994 8
18 0.999 3 0.996 9 0.994 8
19 0.999 4 0.997 0 0.994 8
20 0.999 5 0.997 0 0.995 0
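Tables 5 and 6 vary the number of hidden units ("neurons") in the LSTM layers. For orientation, the recurrence that a single LSTM cell computes — the building block the detection networks stack to different depths — can be written out in NumPy; the dimensions and random weights below are toy placeholders, not trained values:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Gate order in the stacked matrices: input, forget, candidate, output."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    g = np.tanh(z[2*H:3*H])                 # cell candidate
    o = 1.0 / (1.0 + np.exp(-z[3*H:]))      # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 8, 64                      # input dimension, hidden units ("neurons")
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for t in range(5):                # run five steps of a toy input sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape)
```

The hidden size H is the quantity swept in Tables 5 and 6; note that doubling it quadruples the size of the recurrent weight matrix U.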

Table 7

Mean recognition rate, variance and standard deviation of Class Ⅰ for different learning rates μ

Learning rate  Mean recognition rate/%  Variance  Standard deviation
0.1 99.551 5 0.000 3 0.182 1
1 99.523 0 0.000 8 0.296 1
10 27.780 5 0.001 6 0.411 6

Table 8

Mean recognition rate, variance and standard deviation of Class Ⅱ for different learning rates μ

Learning rate  Mean recognition rate/%  Variance  Standard deviation
0.1 99.883 0 0.000 1 0.102 9
1 99.864 5 0.000 2 0.128 8
10 84.907 5 0.188 8 4.457 9

Table 9

Mean recognition rate, variance and standard deviation of Class Ⅰ for different batch sizes

BatchSize  Mean recognition rate/%  Variance  Standard deviation
5 000 99.443 0 0 0.050 6
10 000 99.461 0 0 0.060 8
50 000 99.522 5 0.000 8 0.296 0

Table 10

Mean recognition rate, variance and standard deviation of Class Ⅱ for different batch sizes

BatchSize  Mean recognition rate/%  Variance  Standard deviation
5 000 99.879 0 0.000 1 0.105 5
10 000 99.864 5 0.000 2 0.128 8
20 000 99.837 0 0.000 2 0.147 3

Table 11

Mean recognition rate, variance and standard deviation of Class Ⅰ for different numbers of neurons

Neurons  Mean recognition rate/%  Variance  Standard deviation
6 400 99.443 0 0 0.050 6
12 800 99.381 0 0.000 1 0.098 2
25 600 99.335 5 0.000 1 0.072 7

Table 12

Mean recognition rate, variance and standard deviation of Class Ⅱ for different numbers of neurons

Neurons  Mean recognition rate/%  Variance  Standard deviation
6 400 99.864 5 0.000 2 0.128 8
12 800 99.640 0 0.000 1 0.085 8
25 600 99.470 5 0.000 0 0.040 2
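Tables 7-12 summarize each 20-run column by its mean recognition rate (as a percentage), variance and standard deviation. The computation, applied here to the first column of Table 2, can be sketched as follows; note how rounding to four decimal places collapses the tiny variance to 0, which is presumably how the abstract's "variance of 0" arises:

```python
import statistics

def summarize(rates):
    """Mean (as a percentage), variance and standard deviation of a
    run of recognition rates: the three columns of Tables 7-12."""
    mean_pct = 100 * sum(rates) / len(rates)
    var = statistics.pvariance(rates)
    std = statistics.pstdev(rates)
    return round(mean_pct, 4), round(var, 4), round(std, 4)

# the 20 recognition rates from the first column of Table 2 (μ = 0.001)
rates = [0.9952, 0.9974, 0.9979, 0.9983, 0.9986, 0.9988, 0.9989,
         0.9991, 0.9992, 0.9992, 0.9993, 0.9993, 0.9993, 0.9993,
         0.9995, 0.9995, 0.9994, 0.9995, 0.9994, 0.9995]
print(summarize(rates))   # mean is 99.883%; variance rounds to 0
```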

Figure 3

Bar chart of mean recognition rates of Classes Ⅰ and Ⅱ for different learning rates μ

Figure 4

Bar chart of standard deviations of Classes Ⅰ and Ⅱ for different learning rates μ

Figure 5

Recognition rate curves

Figure 6

Loss curves

Figure 7

Cosine proximity curves

Figure 8

Mean absolute error curves
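Figures 7 and 8 plot two standard training metrics: the cosine proximity between word-vector samples and the mean absolute error of the predictions. Both are simple enough to state directly; a self-contained sketch of the two definitions:

```python
import math

def cosine_proximity(a, b):
    """Cosine similarity between two vectors (the metric of Figure 7)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_absolute_error(y_true, y_pred):
    """MAE between labels and predictions (the metric of Figure 8)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(cosine_proximity([1.0, 0.0], [1.0, 0.0]))   # identical vectors -> 1.0
print(mean_absolute_error([1, 0, 1], [0.9, 0.2, 0.8]))
```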
