JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE) ›› 2018, Vol. 48 ›› Issue (3): 54-59.doi: 10.6040/j.issn.1672-3961.0.2017.414

Previous Articles     Next Articles

K-NN algorithm for big data based on HBase and SimHash

WANG Tingtinga,b, ZHAI Junhaia,b*, ZHANG Mingyanga,b, HAO Pua,b   

  1. a. Key Lab. of Machine Learning and Computational Intelligence;
    b. College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei, China
  • Received:2017-08-29 Online:2018-06-20 Published:2017-08-29

Abstract: Aiming at solving the problem of high computational complexity of K-NN(K-nearest neighbors)algorithm in big data scenarios, based on HBase and SimHash, a K-NN algorithm for big data classification was proposed. The big data sets were mapped from the original space into the Hamming space, and the sets of hash codes were obtained. The pairs of rowkeys and values were stored in HBase database; the rowkeys were the hash codes of instances; the values were the classes of instances. For testing instances, the values of instances which had same rowkeys were selected from HBase database, and the labels of testing instances were obtained by majority voting with the values. The proposed algorithm was experimentally compared with MapReduce-based K-NN and Spark-based K-NN on the running time and testing accuracy. The experimental results showed that the running time of the proposed algorithm was much lower than the times of the MapReduce-based K-NN and Spark-based K-NN in the case of classification performance preservation.

Key words: big data, K-nearest neighbors, HBase, SimHash, classification algorithms

CLC Number: 

  • TP181
[1] COVER T M, HART P E. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1):21-27.
[2] SAVCHENKO A V. Maximum-likelihood approximate nearest neighbor method in real-time image recognition[J]. Pattern Recognition, 2017, 61:459-469.
[3] BHATEJA A K, SHARMA S, CHAUDHURY S, et al. Iris recognition based on sparse representation and k-nearest subspace with genetic algorithm [J]. Pattern Recognition Letters, 2016, 73:13-18.
[4] 朱庆生, 唐汇, 冯骥. 一种基于自然最近邻的离群检测算法[J]. 计算机科学, 2014, 41(3):276-278. ZHU Qingsheng, TANG Hui, FENG Ji. Outlier dectection algorithm based on natural nearest neighbor [J]. Computer Science, 2014, 41(3):276-278.
[5] 肖春宝, 冯大政. 基于K近邻一致性的特征匹配内点选择算法[J]. 计算机科学, 2016, 43(1):290-293. XIAO Chunbao, FENG Dazheng. Inlier selection algorithm for feature matchiing based on K nearest neighbor consistence[J]. Computer Science, 2016, 43(1):290-293.
[6] NIKOLOPOULOS K I, BABAI M Z, BOZOS K. Forecasting supply chain sporadic demand with nearest neighbor approaches[J]. International Journal of Production Economics, 2016, 177:139-148.
[7] CHEN H L, YANG B, WANG G, et al. A novel bankruptcy prediction model based on an adaptive fuzzy K-nearest neighbor method[J]. Knowledge-Based Systems, 2011, 24(8):1348-1359.
[8] SINGH A, DAMIR B, DEEP K, et al. Calibration of nearest neighbors model for avalanche forecasting[J]. Cold Regions Science and Technology, 2015, 109:33-42.
[9] LEE Y H, WEI C P, CHENG T H, et al. Nearest-neighbor-based approach to time-series classification[J]. Decision Support Systems, 2012, 53(1):207-217.
[10] LTIFI H, BENMOHAMED E, KOLSKI C, et al. Enhanced visual data mining process for dynamic decision-making[J]. Knowledge-Based Systems, 2016, 112:166-181.
[11] WU X D, KUMAR V, QUINLAN J R, et al. Top 10 algorithms in data mining[J]. Knowledge & Information Systems, 2007, 14(1):1-37.
[12] ZHAI J H, WANG X Z. PANG X H. Voting-based instance selection from large data sets with mapreduce and random weight networks[J]. Information Sciences, 2016, 367:1066-1077.
[13] ZHAI J H, LI T, WANG X Z. A cross-selection instance algorithm[J]. Journal of Intelligent & Fuzzy Systems, 2016, 30(2):717-728.
[14] INDYK P, MOTWANI R. Approximate nearest neighbors: towards removing the curse of dimensionality[C] //Proceedings of the thirtieth annual ACM symposium on Theory of computing. Dallas, USA: ACM, 1998:604-613.
[15] ÁLVAR A G, DÍEZ-PASTOR J F, RODÍGUEZ J J, et al. Instance selection of linear complexity for big data[J]. Knowledge-Based Systems, 2016, 107:83-95.
[16] TRIGUEROI, PERALTA D, BACARDIT J, et al. MRPR: a MapReduce solution for prototype reduction in big data classification[J]. Neurocomputing, 2015, 150(Part A):331-345.
[17] GU X G, ZHANG Y D, ZHANG L, et al. An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features[J]. Signal Processing, 2013, 93(8):2244-2255.
[18] WANG J, LIU W, KUMAR S, et al. Learning to hash for indexing big data-a survey[J]. Proceedings of the IEEE, 2016, 104(1):34-57.
[19] 闫永刚, 马廷淮, 王建. K-NN分类算法的MapReduce并行化实现[J]. 南京航空航天大学学报, 2013, 45(4):550-555. YAN Yonggang, MA Tinghuai, WANG Jian. Parallel implementing K-NN classification algorithm using mapreduce programming mode[J]. Journal of Nanjing University of Aeronautics & Astronautics, 2013, 45(4):550-555.
[20] MAILLO J, RAMIREZ S, TRIGUERO I, et al. K-NN-IS: an iterative spark-based design of the K-nearest neighbors classifier for big data[J]. Knowledge-Based Systems, 2016, 117:3-15.
[21] 翟俊海, 王婷婷, 张明阳, 等. 2种加速K-近邻方法的实验比较[J]. 河北大学学报(自然科学版), 2016, 36(6):650-656. ZHAI Junhai, WANG Tingting, ZHANG Mingyang, et al. Experimental comparison of two acceleration approaches for K-nearest neighbor[J]. Journal of Hebei University(Natural Science Edition), 2016, 36(6):650-656.
[22] 黄宜华, 苗凯翔. 深入理解大数据-大数据处理与编程实践[M]. 北京: 机械工业出版社, 2014.
[23] 李武军, 周志华. 大数据哈希学习:现状与趋势[J]. 科学通报, 2015, 60(5-6):485-490. LI Wujun, ZHOU Zhihua. Learning to hash for big data: Current status and future trends[J]. Chinese Science Bulletin, 2015, 60(5-6):485-490.
[1] LIU Yang, LIU Bo, WANG Feng. Optimization algorithm for big data mining based on parameter server framework [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(4): 1-6.
[2] WEI Bo, ZHANG Wensheng, LI Yuanxiang, XIA Xuewen, LYU Jingqin. A sparse online learning algorithm for feature selection [J]. JOURNAL OF SHANDONG UNIVERSITY (ENGINEERING SCIENCE), 2017, 47(1): 22-27.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!