山东大学学报(工学版) ›› 2018, Vol. 48 ›› Issue (3): 54-59.doi: 10.6040/j.issn.1672-3961.0.2017.414
王婷婷a,b,翟俊海a,b,张明阳a,b*,郝璞a,b
WANG Tingtinga,b, ZHAI Junhaia,b*, ZHANG Mingyanga,b, HAO Pua,b
摘要: 针对大数据K-近邻(K-nearest neighbors, K-NN)计算复杂度高的问题,提出一种基于HBase和SimHash的大数据K-近邻分类算法。利用SimHash算法将大数据集从原空间映射到Hamming空间,得到哈希签名值集合;将样例的行键与值的二元对存储到HBase数据库中,行健(rowkey)为样例的哈希签名值,值(value)为样例的类别;对于测试样例,以其哈希签名值作为健rowkey,从HBase数据库中获取所有样例的value,通过对这些values进行多数投票,即可以得到测试样例的类别。与基于MapReduce的K-NN和基于Spark的K-NN在运行时间和测试精度两方面进行试验比较。试验结果显示,在保持分类能力的前提下,提出的算法的运行时间远远低于其他两种方法。
中图分类号:
[1] COVER T M, HART P E. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 1967, 13(1):21-27. [2] SAVCHENKO A V. Maximum-likelihood approximate nearest neighbor method in real-time image recognition[J]. Pattern Recognition, 2017, 61:459-469. [3] BHATEJA A K, SHARMA S, CHAUDHURY S, et al. Iris recognition based on sparse representation and k-nearest subspace with genetic algorithm [J]. Pattern Recognition Letters, 2016, 73:13-18. [4] 朱庆生, 唐汇, 冯骥. 一种基于自然最近邻的离群检测算法[J]. 计算机科学, 2014, 41(3):276-278. ZHU Qingsheng, TANG Hui, FENG Ji. Outlier dectection algorithm based on natural nearest neighbor [J]. Computer Science, 2014, 41(3):276-278. [5] 肖春宝, 冯大政. 基于K近邻一致性的特征匹配内点选择算法[J]. 计算机科学, 2016, 43(1):290-293. XIAO Chunbao, FENG Dazheng. Inlier selection algorithm for feature matchiing based on K nearest neighbor consistence[J]. Computer Science, 2016, 43(1):290-293. [6] NIKOLOPOULOS K I, BABAI M Z, BOZOS K. Forecasting supply chain sporadic demand with nearest neighbor approaches[J]. International Journal of Production Economics, 2016, 177:139-148. [7] CHEN H L, YANG B, WANG G, et al. A novel bankruptcy prediction model based on an adaptive fuzzy K-nearest neighbor method[J]. Knowledge-Based Systems, 2011, 24(8):1348-1359. [8] SINGH A, DAMIR B, DEEP K, et al. Calibration of nearest neighbors model for avalanche forecasting[J]. Cold Regions Science and Technology, 2015, 109:33-42. [9] LEE Y H, WEI C P, CHENG T H, et al. Nearest-neighbor-based approach to time-series classification[J]. Decision Support Systems, 2012, 53(1):207-217. [10] LTIFI H, BENMOHAMED E, KOLSKI C, et al. Enhanced visual data mining process for dynamic decision-making[J]. Knowledge-Based Systems, 2016, 112:166-181. [11] WU X D, KUMAR V, QUINLAN J R, et al. Top 10 algorithms in data mining[J]. Knowledge & Information Systems, 2007, 14(1):1-37. [12] ZHAI J H, WANG X Z. PANG X H. Voting-based instance selection from large data sets with mapreduce and random weight networks[J]. Information Sciences, 2016, 367:1066-1077. [13] ZHAI J H, LI T, WANG X Z. A cross-selection instance algorithm[J]. Journal of Intelligent & Fuzzy Systems, 2016, 30(2):717-728. [14] INDYK P, MOTWANI R. Approximate nearest neighbors: towards removing the curse of dimensionality[C] //Proceedings of the thirtieth annual ACM symposium on Theory of computing. Dallas, USA: ACM, 1998:604-613. [15] ÁLVAR A G, DÍEZ-PASTOR J F, RODÍGUEZ J J, et al. Instance selection of linear complexity for big data[J]. Knowledge-Based Systems, 2016, 107:83-95. [16] TRIGUEROI, PERALTA D, BACARDIT J, et al. MRPR: a MapReduce solution for prototype reduction in big data classification[J]. Neurocomputing, 2015, 150(Part A):331-345. [17] GU X G, ZHANG Y D, ZHANG L, et al. An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features[J]. Signal Processing, 2013, 93(8):2244-2255. [18] WANG J, LIU W, KUMAR S, et al. Learning to hash for indexing big data-a survey[J]. Proceedings of the IEEE, 2016, 104(1):34-57. [19] 闫永刚, 马廷淮, 王建. K-NN分类算法的MapReduce并行化实现[J]. 南京航空航天大学学报, 2013, 45(4):550-555. YAN Yonggang, MA Tinghuai, WANG Jian. Parallel implementing K-NN classification algorithm using mapreduce programming mode[J]. Journal of Nanjing University of Aeronautics & Astronautics, 2013, 45(4):550-555. [20] MAILLO J, RAMIREZ S, TRIGUERO I, et al. K-NN-IS: an iterative spark-based design of the K-nearest neighbors classifier for big data[J]. Knowledge-Based Systems, 2016, 117:3-15. [21] 翟俊海, 王婷婷, 张明阳, 等. 2种加速K-近邻方法的实验比较[J]. 河北大学学报(自然科学版), 2016, 36(6):650-656. ZHAI Junhai, WANG Tingting, ZHANG Mingyang, et al. Experimental comparison of two acceleration approaches for K-nearest neighbor[J]. Journal of Hebei University(Natural Science Edition), 2016, 36(6):650-656. [22] 黄宜华, 苗凯翔. 深入理解大数据-大数据处理与编程实践[M]. 北京: 机械工业出版社, 2014. [23] 李武军, 周志华. 大数据哈希学习:现状与趋势[J]. 科学通报, 2015, 60(5-6):485-490. LI Wujun, ZHOU Zhihua. Learning to hash for big data: Current status and future trends[J]. Chinese Science Bulletin, 2015, 60(5-6):485-490. |
[1] | 刘洋,刘博,王峰. 基于Parameter Server框架的大数据挖掘优化算法[J]. 山东大学学报(工学版), 2017, 47(4): 1-6. |
[2] | 魏波,张文生,李元香,夏学文,吕敬钦. 一种选择特征的稀疏在线学习算法[J]. 山东大学学报(工学版), 2017, 47(1): 22-27. |
[3] | 王梅,曾昭虎,孙莺萁,杨二龙,宋考平. 基于输入K-近邻的正则化路径上SVR贝叶斯组合[J]. 山东大学学报(工学版), 2016, 46(6): 8-14. |
|