
山东大学学报(工学版) ›› 2017, Vol. 47 ›› Issue (4): 1-6. doi: 10.6040/j.issn.1672-3961.0.2016.339


An optimization algorithm for big data mining based on the Parameter Server framework

LIU Yang1, LIU Bo2, WANG Feng1

1. Institute of Cloud Computing and Big Data, Henan University of Economics and Law, Zhengzhou 450046, Henan, China; 2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
• Received: 2016-09-03  Online: 2017-08-20  Published: 2016-09-03
• About the author: LIU Yang (1980- ), male, from Fangcheng, Henan; lecturer, Ph.D.; his main research interests include computer architecture, machine learning, and big data. E-mail: liuyang@huel.edu.cn
• Supported by the Key Science and Technology Research Program of Henan Province (162102210096, 152102210088, 142102210090) and the Key Scientific Research Project of Higher Education Institutions of Henan Province (18A520014)

Optimization algorithm for big data mining based on parameter server framework

LIU Yang1, LIU Bo2, WANG Feng1   

1. Institute of Cloud Computing and Big Data, Henan University of Economics and Law, Zhengzhou 450046, Henan, China;
    2. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
  • Received:2016-09-03 Online:2017-08-20 Published:2016-09-03

Abstract: Motivated by the real-time requirements of big data mining and the diversity of data samples, an optimized training algorithm for machine learning models in big data mining was proposed. The iterative computation of current algorithms was analyzed, and the iteration was divided into a coarse-tuning phase and a fine-tuning phase according to the change of the model vector. It was observed that in the fine-tuning phase the vast majority of samples had negligible influence on the result, so the gradients of such samples need not be recomputed; the results of the previous iteration could be reused directly, reducing the computation load and improving efficiency. Experimental results showed that, in a distributed cluster environment, the algorithm reduced the computation of model training by about 35% while the accuracy of the trained model remained within the normal range, effectively improving the real-time performance of big data mining.

Key words: optimization algorithm, distributed system, big data, sample diversity, machine learning

Abstract: Traditional machine learning algorithms for small data were not applicable to big data mining. An optimization algorithm for machine learning and big data mining was proposed. The iterative computation of machine learning algorithms was divided into two phases according to the change of the model vector. Based on the observation that most samples contributed little to the model update during the iteration, the computation load of machine learning algorithms could be reduced by reusing the iterative computing results of such samples. The experimental results showed that the proposed method could reduce the computation load by about 35%, with little effect on the prediction accuracy of the trained model.
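The two-phase, gradient-reuse idea described in the abstract can be sketched as follows. This is a minimal single-machine illustration only, assuming a logistic-regression objective; the thresholds `fine_tune_delta` and `skip_eps` are hypothetical stand-ins for the paper's coarse/fine-tuning criterion, and the Parameter Server distribution across workers is not reproduced here.

```python
import numpy as np

def train_with_gradient_reuse(X, y, epochs=30, lr=0.1,
                              fine_tune_delta=1e-3, skip_eps=1e-4):
    """Two-phase SGD sketch: once the per-epoch model change falls below
    fine_tune_delta (the fine-tuning phase), samples whose last gradient
    was tiny are skipped and their cached gradient is reused."""
    n, d = X.shape
    w = np.zeros(d)
    cached = np.zeros((n, d))      # last computed gradient per sample
    fine_phase = False
    skipped_total = 0
    for _ in range(epochs):
        w_old = w.copy()
        for i in range(n):
            if fine_phase and np.abs(cached[i]).max() < skip_eps:
                g = cached[i]      # reuse previous result, skip recomputation
                skipped_total += 1
            else:
                # logistic-regression gradient for sample i
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))
                g = (p - y[i]) * X[i]
                cached[i] = g
            w -= lr * g
        if np.linalg.norm(w - w_old) < fine_tune_delta:
            fine_phase = True      # coarse tuning has converged
    return w, skipped_total
```

In a Parameter Server deployment, the per-sample gradient cache would live on the workers that own those samples, so the skip decision adds no extra communication with the server nodes.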

Key words: big data, sample diversity, machine learning, distributed system, optimization

CLC number: TU457
[1] 张引,陈敏,廖小飞. 大数据应用的现状与展望[J]. 计算机研究与发展, 2013, 50(S2):216-233. ZHANG Yin, CHEN Min, LIAO Xiaofei. Big data applications: a survey[J]. Journal of Computer Research and Development, 2013, 50(S2):216-233.
[2] 王元卓,靳小龙,程学旗. 网络大数据:现状与展望[J]. 计算机学报,2013,36(6):1125-1138. WANG Yuanzhuo, JIN Xiaolong, CHENG Xueqi. Network big data: present and future[J]. Chinese Journal of Computers, 2013, 36(6):1125-1138.
[3] 张蕾,章毅. 大数据分析的无限深度神经网络方法[J]. 计算机研究与发展,2016,53(1):68-79. ZHANG Lei, ZHANG Yi. Big data analysis by infinite deep neural networks[J].Journal of Computer Research and Development, 2016, 53(1):68-79.
[4] 耿丽娟,李星毅. 用于大数据分类的KNN算法研究[J]. 计算机应用研究,2014, 31(5):1342-1344. GENG Lijuan, LI Xingyi. Improvements of KNN algorithm for big data classification[J]. Application Research of Computers, 2014, 31(5):1342-1344.
[5] 刘红岩,陈剑,陈国青. 数据挖掘中的数据分类算法综述[J].清华大学学报(自然科学版),2002,42(6):727-730. LIU Hongyan, CHEN Jian, CHEN Guoqing. Review of classification algorithms for data mining[J]. Journal of Tsinghua University(Science & Technology), 2002, 42(6):727-730.
[6] 何清,李宁,罗文娟,等. 大数据下的机器学习算法综述[J]. 模式识别与人工智能,2014,27(4):327-336. HE Qing, LI Ning, LUO Wenjuan, et al. A survey of machine learning algorithms for big data[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(4):327-336.
[7] 吴启晖,邱俊飞,丁国如. 面向频谱大数据处理的机器学习方法[J].数据采集与处理,2015,30(4):703-713. WU Qihui, QIU Junfei, DING Guoru. Machine learning methods for big spectrum data processing[J]. Journal of Data Acquisition and Processing, 2015, 30(4):703-713.
[8] 程学旗,靳小龙,王元卓. 大数据系统和分析技术综述[J]. 软件学报,2014,25(9):1889-1908. CHENG Xueqi, JIN Xiaolong, WANG Yuanzhuo. Survey on big data system and analytic technology[J]. Journal of Software, 2014, 25(9):1889-1908.
[9] 郭迟,刘经南,方媛,等. 位置大数据的价值提取与协同挖掘方法[J]. 软件学报,2014, 25(4):713-730. GUO Chi, LIU Jingnan, FANG Yuan, et al. Value extraction and collaborative mining methods for location big data[J]. Journal of Software, 2014, 25(4):713-730.
[10] 陈国良,毛睿,陆克中. 大数据并行计算框架[J]. 科学通报,2015,60:566-569. CHEN Guoliang, MAO Rui, LU Kezhong. Parallel computing framework for big data[J]. Chinese Science Bulletin, 2015, 60:566-569.
[11] YUAN Jinhui, GAO Fei, HO Qirong, et al. LightLDA: big topic models on modest computer clusters[C] //Proceedings of the 24th International Conference on World Wide Web. Florence, Italy: ACM, 2015: 1351-1361.
[12] KUMAR Abhimanu, BEUTEL Alex, HO Qirong, et al. Fugue: slow-worker-agnostic distributed learning for big models on big data[C] //Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. Reykjavik, Iceland: JMLR, 2014:531-539.
[13] LIU Ji, WRIGHT S J, RE Christopher, et al. An asynchronous parallel stochastic coordinate descent algorithm[J]. Journal of Machine Learning Research, 2015, 16(1):285-322.
[14] HSIEH C J, YU H F, DHILLON I S. PASSCoDe: parallel asynchronous stochastic dual coordinate descent[C] //Proceedings of the 32nd International Conference on Machine Learning. Lille, France: ACM, 2015: 2370-2379.
[15] CHU Chengtao, KIM Sangkyun, LIN Yian, et al. Map-reduce for machine learning on multicore[C] //20th Annual Conference on Neural Information Processing Systems Vancouver. British Columbia, Canada: MIT Press, 2006:281-288.
[16] POWER Russell, LI Jinyang. Piccolo: building fast, distributed programs with partitioned tables[C] //9th USENIX Symposium on Operating Systems Design and Implementation. Vancouver, Canada: USENIX, 2010: 293-306.
[17] CHILIMBI Trishul, SUZUE Yutaka, APACIBLE Johnson, et al. Project adam: building an efficient and scalable deep learning training system[C] //11th USENIX Symposium on Operating Systems Design and Implementation. Broomfield, USA: USENIX, 2014: 571-582.
[18] XING Eric P, HO Qirong, DAI Wei, et al. Petuum: a new platform for distributed machine learning on big data[C] //Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, NSW, Australia: ACM, 2015: 1335-1344.
[19] LI Mu, ANDERSEN David G, PARK Jun Woo, et al. Scaling distributed machine learning with the parameter server[C] //11th USENIX Symposium on Operating Systems Design and Implementation. Broomfield, USA: USENIX, 2014:583-598.
[20] LI Mu, ANDERSEN David G, SMOLA Alexander J, et al. Communication efficient distributed machine learning with the parameter server[C] //28th Annual Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2014: 19-27.
[21] HO Qirong, CIPAR James, CUI Henggang, et al. More effective distributed ML via a stale synchronous parallel parameter server[C] //27th Annual Conference on Neural Information Processing Systems. Lake Tahoe, United States: MIT Press, 2013: 1223-1231.
[22] LANGFORD John, SMOLA Alexander J, ZINKEVICH Martin. Slow learners are fast[C] //23rd Annual Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2009: 2331-2339.
[23] ZINKEVICH Martin A, WEIMER Markus, SMOLA Alex, et al. Parallelized stochastic gradient descent[C] //24th Annual Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2010: 2595-2603.
[24] LEWIS David D, YANG Yiming, ROSE Tony G, et al. RCV1: a new benchmark collection for text categorization research[J]. Journal of Machine Learning Research, 2004, 5:361-397.