您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2015, Vol. 45 ›› Issue (5): 22-28.doi: 10.6040/j.issn.1672-3961.0.2014.367

• 机器学习与数据挖掘 • 上一篇    下一篇

不确定性传播算法的MapReduce并行化实现

何东之, 张吉沣, 赵鹏飞   

  1. 北京工业大学软件学院, 北京 100124
  • 收稿日期:2014-12-16 修回日期:2015-06-10 出版日期:2018-10-20 发布日期:2014-12-16
  • 作者简介:何东之(1970-),男,山东临沂人,副教授,博士,主要研究方向为嵌入式系统和新型人机交互.E-mail:victor@bjut.edu.cn
  • 基金资助:
    北京市教委基金资助项目(PXM2011_014204_09_000232)

Parallel implementing probabilistic spreading algorithm using MapReduce programming mode

HE Dongzhi, ZHANG Jifeng, ZHAO Pengfei   

  1. School of Software Engineering, Beijing University of Technology, Beijing 100124, China
  • Received:2014-12-16 Revised:2015-06-10 Online:2018-10-20 Published:2014-12-16

摘要: 为了克服单机串行不确定性传播算法处理大规模数据集的局限,采用MapReduce编程模型对算法进行并行化实现。将单机算法按照算法流程进行拆分,每一步对应一个MapReduce程序。每一步的输入及输出数据都存储在Hadoop分布式文件系统上。用命中率对比并行化的不确定性传播算法与全局排名算法的性能。对比不同数据量、不同节点数时并行化的不确定性传播算法的加速比。试验结果表明,不确定性传播算法MapReduce并行化后部署在Hadoop集群上运行,命中率显著高于全局排名算法,且有着较好的并行性,扩大了单机算法所能处理的数据规模且提高了算法的运算速度。

关键词: 云计算平台, 分布式, 二分网络, 不确定性传播算法, MapReduce

Abstract: In order to overcome the limitations of the serial probabilistic spreading algorithm in dealing with large-scale dataset, a parallelization of the algorithm was put forth by using MapReduce. The complex computing tasks were decomposed into a series of MapReduce job flow for distributed parallel processing on Hadoop. The input and output data of every step were stored in the Hadoop distributed file system. Hit ratio was used to compare the parallelizable probabilistic spreading algorithm versus the global ranking method performance. Speedups of the parallelizable algorithm were compared while the amount of data and the number of nodes was different. Experiment results showed that the probabilistic spreading algorithm based on MapReduce had good parallelism and had higher hit ratio than the global ranking method. Data scale that can be handled by the serial algorithm was expanded, and the operation speed of the algorithm was raised.

Key words: cloud computing paltform, probabilistic spreading algorithm, MapReduce, distributed, bipartite network

中图分类号: 

  • TP311
[1] 王国霞,刘贺平.个性化推荐系统综述[J].计算机工程与应用, 2012, 48(7):66-76. WANG Guoxia, LIU Heping. Survey of personalized recommendation system[J]. Computer Engineering and Applications, 2012, 48(7):66-76.
[2] LÜ Linyuan, MEDO Matúš, CHI Hoyeung, et al. Recommender systems[J]. Physics Reports, 2012, 519(1):1-49.
[3] 朱郁筱, 吕琳媛. 推荐系统评价指标综述[J]. 电子科技大学学报, 2012, 41(2):163-175. ZHU Yuxiao, LÜ Linyuan. Evaluation metrics for recommender systems[J]. Journal of University of Electronic Science and Technology of China, 2012, 41(2):163-175.
[4] ZHOU Tao, REN Jie, MEDO Matúš, et al. Bipartite network projection and personal recommendation[J]. Physical Review E, 2007, 76(4):046115.
[5] LATAPY Matthieu, MAGNIEN Clémence, DELVECCHIO Nathalie. Basic notions for the analysis of large two-mode networks[J]. Social Networks, 2008, 30(1):31-48.
[6] 吴亚晶, 张鹏, 狄增如, 等. 二分网络研究[J]. 复杂系统与复杂性科学, 2010, 7(1):1-12. WU Yajing, ZHANG Peng, DI Zengru, et al. Study on bipartite networks[J]. Complex Systems and Complexity Science, 2010, 7(1):1-12.
[7] 刘亮亮, 曹菡, 韩亚楠. 基于群体动力学的协同过滤算法及应用[J]. 计算机应用研究, 2014, 31(12):3603-3605, 3612. LIU Liangliang, CAO Han, HAN Yanan. Algorithm and application of collaborative filtering based on group dynamics[J]. Application Research of Computers, 2014, 31(12):3603-3605, 3612.
[8] KOREN Yehuda, BELL Robert. Recommender systems handbook[M]. New York: Springer, 2011:145-186.
[9] ADOMAVICIUS G, TUZHILIN A. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions[J]. Knowledge and Data Engineering, 2005, 17(6):734-749.
[10] 陈吉荣, 乐嘉锦. 基于 Hadoop 生态系统的大数据解决方案综述[J]. 计算机工程与科学, 2013, 35(10):25-35. CHEN Jirong, LE Jiajin. Reviewing the big data solution based on hadoop ecosystem[J]. Computer Engineering & Science, 2013, 35(10):25-35.
[11] 冯汉超, 周凯东. 分布式系统下大数据存储结构优化研究[J]. 河北工程大学学报:自然科学版, 2014, 31(4):69-73. FENG Hanchao, ZHOU Kaidong. Research on optimizing big data storage structure in distributed system[J]. Journal of Hebei University of Engineering:Natural Science Edition, 2014, 31(4):69-73.
[12] 刘飞. 基于云计算的分布式存储系统的研究和应用[D]. 西安:西安工业大学, 2012. LIU Fei. Research and application of distributed storage system based on cloud computing[D]. Xi'an:Xi'an Technological University, 2012.
[13] BORTHAKUR D. HDFS architecture guide[J]. Hadoop Apache Project, 2008:53.
[14] 黄文依, 王劲松, 林胜. HDFS可视化操作研究与实现[J]. 天津理工大学学报, 2012, 28(1):31-34. HANG Wenyi, WANG Jinsong, LIN Sheng. Research and implementation of HDFS visualized operation[J]. Journal of Tianjin University of Technology, 2012, 28(1):31-34.
[15] KAVULYA S, TAN J, GANDHI R, et al. An analysis of traces from a production mapreduce cluster[C]//Processdings of Cluster, Cloud and Grid Computing (CCGrid). Melbourne:IEEE, 2010:94-103.
[16] LIN Wenhui, LEI Zhenming, LIU Jun, et al. MapReduce optimization algorithm based on machine learning in heterogeneous cloud environment[J]. The Journal of China Universities of Posts and Telecommunications, 2013, 20(6):77-121.
[17] 应毅,刘亚军. MapReduce并行计算技术发展综述[J]. 计算机系统应用, 2014,23(4):1-6,11. YING Yi, LIU Yajun. Survey of developments of MapReduce parallel computing technology[J]. Computer Systems & Applications, 2014, 23(4):1-6,11.
[18] DITTRICH J, QUIANRUIZ J A. Efficient big data processing in hadoop MapReduce[J]. Proceedings of the VLDB Endowment, 2012, 5(12):2014-2015.
[19] 赵卫中,马慧芳,傅燕翔,等.基于云计算平台Hadoop的并行k-means聚类算法设计研究[J].计算机科学, 2011(10):166-168. ZHAO Weizhong, MA Huifang, FU Yanxiang, et al. Research on parallel k-means algorithm design based on hadoop platform[J]. Computer Science, 2011(10):166-168.
[20] HASSANZADEH O, KEMENTSIETSIDIS A, KIMEL-FELD B, et al. Next generation data analytics at IBM research[J]. Proceedings of the VLDB Endowment, 2013, 6(11):1174-1175.
[21] BORTHAKUR D, GRAY J, SARMA J S, et al. Apache hadoop goes realtime at facebook[C]//Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2011:1071-1080.
[22] LEE H, KWON J. Similar user clustering based on movielens data set[J]. Advanced Science and Technology Letters, 2014, 51:32-35.
[1] 秦利国,何潇,周东华. 一种时延多智能体系统的分布式编队[J]. 山东大学学报(工学版), 2017, 47(5): 79-88.
[2] 刘洋,刘博,王峰. 基于Parameter Server框架的大数据挖掘优化算法[J]. 山东大学学报(工学版), 2017, 47(4): 1-6.
[3] 李庆冬,骆伟超,叶瑛歆,张承瑞,胡天亮. 基于本体的机床设备资源共享机制[J]. 山东大学学报(工学版), 2017, 47(3): 130-138.
[4] 范德斌,邓长寿,袁斯昊,谭旭杰,董小刚. 基于MapReduce模型的分布式粒子群算法[J]. 山东大学学报(工学版), 2016, 46(6): 23-30.
[5] 孙一冰,付敏跃,王炳昌,张焕水. 大规模动态系统的分布式状态估计算法[J]. 山东大学学报(工学版), 2016, 46(6): 62-68.
[6] 陈宏兴, 周风余, 田天, 姜志飞, 陈竹敏. 服务机器人云计算平台SOA接口层模型设计[J]. 山东大学学报(工学版), 2015, 45(4): 31-39.
[7] 陈继明,孙名妤,游聚娟,康忠健. 基于子空间细菌群体趋药性算法的含分布式电源的配电网无功优化[J]. 山东大学学报(工学版), 2014, 44(2): 49-54.
[8] 陈文强1,林琛1,2,陈珂3,陈锦秀1,邹权1,2*. 基于GraphLab的分布式近邻传播聚类算法[J]. 山东大学学报(工学版), 2013, 43(5): 13-18.
[9] 张伶卫,万文强. 基于云计算平台的代价敏感集成学习算法研究[J]. 山东大学学报(工学版), 2012, 42(4): 19-23.
[10] 石海东,胡冬梅,王晓东. 分布式天线系统中信道时延扩展的统计分析[J]. 山东大学学报(工学版), 2012, 42(1): 133-142.
[11] 孙甲冰1,2,张承进1*. 有丢包的随机不确定参数系统的最优融合滤波[J]. 山东大学学报(工学版), 2011, 41(6): 59-65.
[12] 高厚磊 田佳 杜强 武志刚 刘淑敏. 能源开发新技术——分布式发电[J]. 山东大学学报(工学版), 2009, 39(5): 106-110.
[13] 李政,王晓东,卜智勇 . 分布式天线系统中选择合并的衰落统计分析[J]. 山东大学学报(工学版), 2007, 37(5): 83-88 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 肖乔, 裴继红, 王荔霞, 龚志成. 基于多通道Gabor滤波模糊融合的遥感图像舰船检测[J]. 山东大学学报(工学版), 2015, 45(5): 29 -35 .
[2] 熊冰妍, 王国胤, 邓维斌. 分级式代价敏感决策树及其在手机换机预测中的应用[J]. 山东大学学报(工学版), 2015, 45(5): 36 -42 .
[3] 李新玉, 徐桂云, 任世锦, 杨茂云. 基于鉴别流形的不相关稀疏投影非负矩阵分解[J]. 山东大学学报(工学版), 2015, 45(5): 1 -12 .
[4] 王晓初, 王士同, 包芳. 基于概率密度分布一致约束的最小最大概率机图像分类算法[J]. 山东大学学报(工学版), 2015, 45(5): 13 -21 .