
Journal of Shandong University (Engineering Science) ›› 2018, Vol. 48 ›› Issue (3): 54-59. doi: 10.6040/j.issn.1672-3961.0.2017.414


K-NN algorithm for big data based on HBase and SimHash

WANG Tingting a,b, ZHAI Junhai a,b*, ZHANG Mingyang a,b, HAO Pu a,b

  1. a. Key Laboratory of Machine Learning and Computational Intelligence of Hebei Province; b. College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei, China
  • Received: 2017-08-29  Online: 2018-06-20  Published: 2017-08-29
  • Corresponding author: ZHAI Junhai (1964— ), male, from Yixian, Hebei, China; PhD, professor; main research interests: machine learning and data mining. E-mail: mczjh@126.com
  • First author: WANG Tingting (1991— ), female, from Langfang, Hebei, China; M.S. candidate; main research interests: cloud computing and big data processing. E-mail: 479064019@qq.com
  • Supported by the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Research Program of Hebei University (799207217071), and the Postgraduate Innovation Funding Project of Hebei University (X2016059)


Abstract: To address the high computational complexity of K-nearest neighbors (K-NN) classification on big data, a K-NN classification algorithm based on HBase and SimHash was proposed. The SimHash algorithm mapped the big data set from the original space into the Hamming space, yielding a set of hash signatures. Pairs of rowkeys and values were stored in an HBase table, where the rowkey was an instance's hash signature and the value was its class label. For a test instance, its hash signature was used as the rowkey to retrieve from HBase the values of all instances stored under that key, and the test instance's class was obtained by majority voting over these values. The proposed algorithm was compared experimentally with MapReduce-based K-NN and Spark-based K-NN in terms of running time and testing accuracy. The results showed that, while preserving classification performance, the proposed algorithm ran far faster than the other two methods.

Key words: big data, classification algorithms, SimHash, K-nearest neighbors, HBase

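The store-and-vote scheme described in the abstract can be sketched in a few lines. This is a minimal illustration only: a plain Python dict stands in for the HBase table, MD5 for the per-token hash, and the tokens are unweighted — none of these choices come from the paper itself.

```python
import hashlib
from collections import Counter

def simhash(tokens, bits=64):
    """Map a token set to a SimHash signature in Hamming space: hash each
    token, accumulate per-bit votes (+1 if the bit is set, -1 otherwise),
    and take the sign of each accumulator as the signature bit."""
    acc = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    sig = 0
    for i in range(bits):
        if acc[i] > 0:
            sig |= 1 << i
    return sig

# A dict stands in for the HBase table: rowkey (signature) -> class labels.
table = {}

def put(tokens, label):
    """Store a training instance: rowkey = hash signature, value = class."""
    table.setdefault(simhash(tokens), []).append(label)

def classify(tokens):
    """Retrieve all values stored under the test instance's rowkey and
    return the majority class, as in the abstract's voting step."""
    values = table.get(simhash(tokens), [])
    return Counter(values).most_common(1)[0][0] if values else None
```

Note that only exact rowkey matches are retrieved, so the classification step reduces to a constant-time key lookup plus a vote, which is where the claimed speedup over distance-based MapReduce/Spark K-NN comes from.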

CLC number: TP181