您的位置:山东大学 -> 科技期刊社 -> 《山东大学学报(工学版)》

山东大学学报(工学版) ›› 2016, Vol. 46 ›› Issue (1): 15-21.doi: 10.6040/j.issn.1672-3961.1.2015.172

• 机器学习与数据挖掘 • 上一篇    下一篇

基于Map/Reduce的时间序列相似性搜索算法

王会青,孙宏伟,张建辉   

  1. 太原理工大学计算机科学与技术学院, 山西 太原 030024
  • 收稿日期:2015-05-12 出版日期:2016-02-20 发布日期:2015-05-12
  • 作者简介:王会青(1978- ),女,山西太原人,副教授,硕士研究生导师,主要研究方向为数据库和智能信息处理,数据挖掘与机器学习.E-mail:1013208257@qq.com
  • 基金资助:
    国家自然科学基金青年科学基金资助项目(61402318);高等学校博士学科点专项科研基金资助项目(20131402120009);山西省科技攻关资助项目(20130313012-2);太原理工大学校青年团队资助项目(2013T049)

Time series similarity searching algorithm based on Map/Reduce

WANG Huiqing, SUN Hongwei, ZHANG Jianhui   

  1. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, Shanxi, China
  • Received:2015-05-12 Online:2016-02-20 Published:2015-05-12

摘要: 将并行计算的策略引入到时间序列处理中,提出基于Map/Reduce的时间序列相似性搜索算法,充分利用云计算可进行大规模计算和数据处理的特点,有效降低了时间序列相似性搜索中运算量,简化了计算过程。该算法在心电图数据集上进行相似性搜索,分别进行PAA下界过滤和DTW距离的计算,验证运算时间和并行加速比随节点变化的情况,与传统的单机运算相比,有效地提高了时间序列挖掘效率。

关键词: 时间序列挖掘, 并行计算, 动态时间弯曲距离, 相似性搜索, 下界算法

Abstract: The strategy of parallel computing was introduced into time series processing, and time series similarity searching algorithm based on Map/Reduce was proposed. The proposed algorithm could make use of the features of cloud computing to take large-scale computing and data processing, and could efficiently reduce the large calculation and simplify the computing process of time series similarity searching. The proposed algorithm was adopted on electrocardiograph dataset to complete similarity searching with piecewise aggregate approximation lower bound and dynamic time warping distance, which verified the effect of nodes changing on operation time and parallel speed up. Compared with the traditional one running on single PC, the proposed algorithm improved the efficiency of time series mining effectively.

Key words: parallel computing, similarity searching, time series mining, lower bound algorithm, dynamic time warping distance

中图分类号: 

  • TP311
[1] BHADURI K, ZHU Q, OZA N C, et al. Fast and flexible multivariate time series subsequence search[C] //Proc of the 2010 IEEE Int l Conf on Data Mining. Sydney, Australia: IEEE, 2010:48-57.
[2] BERNDT D J, CLIFFORD J. Using dynamic time warping to find patterns in time series[C] //KDD Workshop. Seattle, USA:Springer US, 1994:359-370.
[3] KEOGH E, RATANAMAHATANA C A. Exact indexing of dynamic time warping[J]. Knowledge and Information Systems, 2005(7):358-386.
[4] ZHANG Y, GLASS J. A piecewise aggregate approximation lower-bound estimate for posteriorgram-based dynamic time warping[C] //Interspeech. Florence,Italy:IEEE, 2011:1909-1912.
[5] 丁静,杨善林,罗贺,等. 云计算环境下的数据挖掘服务模式[J]. 计算机科学, 2012, 39(6):217-219. DING Jing, YANG Shanlin, LUO He, et al. Data mining service model in cloud computing environment[J]. Computer Science, 2012, 39(6):217-219.
[6] 杨来,史忠植,梁帆,等. 基于Hadoop云平台的并行数据挖掘方法[J].系统仿真学报,2013,25(5):936-944. YANG Lai, SHI Zhongzhi, LIANG Fan, et al. Parallel approach in data mining based on hadoop cloud platform[J]. Journal of System Simulation, 2013, 25(5):936-944.
[7] 程苗. 基于云计算的Web数据挖掘[J]. 计算机科学,2011,38(10):146-149. CHENG Miao. Web data mining based on cloud computing[J]. Computer Science, 2011, 38(10):146-149.
[8] 陈光鹏,杨育彬,高阳,等. 一种基于Map/Reduce的频繁闭项集挖掘算法[J]. 模式识别与人工智能,2012,25(2):220-224. CHEN Guangpeng, YANG Yubin, GAO Yang, et al. Closed frequent itemset mining based on mapreduce[J]. PR&AI, 2015, 25(2):220-224.
[9] 田国会,许亚雄. 云机器人:概念、架构与关键技术研究综述[J]. 山东大学学报(工学版),2014,44(6):47-54. TIAN Guohui, XU Yaxiong. Cloud robotics: concept, architectures and key technologies[J]. Journal of Shandong University(Engineering Science), 2014, 44(6):47-54.
[10] ZHU Q, WANG X Y, KEOGH E, et al. An efficient and effective similarity measure to enable data mining of petroglyphs[J]. Data Mining and Knowledge Discovery, 2011, 23(1):91-127.
[11] PAPAPETROU P, ATHITSOS V, POTAMIAS M, et al. Embedding-based subsequence matching in time-series databases[J]. ACM Transactions on Database Systems, 2011, 36(3):17.
[12] 周大镯,吴晓丽,闫红灿. 一种高效的多变量时间序列相似查询算[J]. 计算机应用,2008,28(10):2541-2543. ZHOU Dazhuo, WU Xiaoli, YAN Hongcan. An efficient similarity search for multivariate time series[J]. Computer Applications, 2008, 28(10):2541-2543.
[13] FU AWC, KEOGH E, LAU LYH, et al. Scaling and time warping in time series querying[J]. The VLDB Journal, 2008, 17:899-921.
[14] MU B, YAN J L. Efficient time series lower bounding technique[J]. Computer Engineering and Applications, 2009, 45(11):168-171.
[15] BEGUM N,KEOGH E. Rare time series motif discovery from unbounded streams[J]. Proceedings of the VLDB Endowment, 2014, 8(2):149-160.
[16] ZHOU D Z, JIANG W B, LI M Q. Efficient clustering algorithm for multivariate time series[J]. Computer Engineering and Applications, 2010, 46(1):137-139.
[17] THANAWIN Rakthanmanon, BILSON Campana, ABDULLAH Mueen, et al. Searching and mining trillions of time series subsequences under dynamic time warping[C] //KDD'12-18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA:IEEE, 2012:262-270.
[18] WANG X, MUEEN A, DING H, et al. Experimental comparison of representation methods and distance measures foe time series data[J]. Data Mining and Knowledge Discovery, 2013, 26(2):275-309.
[19] SHOKOOHI-YEKTA M, WANG J, KEOGH E. On the non-trivial generalization of dynamic time warping to the multi-dimensional case[C] //SDM Vancouver,Canada:Society for Industrial and Applied Mathematics, 2015:289-297.
[20] SAKURAI Y, YOSHIKAWA M, FALOUTSOS C. FTW: fast similarity search under the time warping distance[C] //PODS. Baltimore, Maryland:Sprinter US, 2005:326-337.
[21] BANKÓ Z, ABONYI J. Correlation based dynamic time warping of multivariate time series[J]. Expert Systems with Applications, 2012, 39(17):12814-12823.
[22] 李正欣,张凤鸣,李克武,等.一种支持DTW距离的多元时间序列索引结构[J]. 软件学报,2014,25(3):560-575. LI Zhengxin, ZHANG Fengming, LI Kewu, et al. Index structure for multivariate time series under DTW distance metric[J]. Journal of Software, 2014, 25(3):560-575.
[23] HYUNJIN Yoon, KIYOUNG Yang, CYRUS Shahabi. Feature subset selection and feature ranking for multivariate time series[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9):1186-1198.
[24] 李伟卫,赵航,张阳,等. 基于MapReduce的海量数据挖掘技术研究[J].计算机工程与应用,2013,49(20):112-117. LI Weiwei, ZHAO Hang, ZHANG Yang, et al. Research on massive data mining based on mapreduce[J]. Computer Engineering and Applications, 2013, 49(20):112-117.
[25] 何东之,张吉沣,赵鹏飞. 不确定传播算法的MapReduce并行化实现[J]. 山东大学学报(工学版),2015,44(5):22-28. HE Dongzhi, ZHANG Jifeng, ZHAO Pengfei. Parallel implementing probabilistic spreading algorithm using MapReduce programming mode[J]. Journal of Shandong University(Engineering Science), 2015, 44(5):22-28.
[26] AGRAWAL R, FALOUTSOS C, SWAMI A. Efficient similarity search in sequence databases[C] //Proc of the 4th Int'l Conf on Foundations of Data Organization and Algorithms. Chicoga, USA:Springer Berlin Heidelberg, 1993:69-84.
[1] 任立波, 尚立宝, 闫日雄, 何海澜, 赵红霞, 韩吉田. 脉冲鼓泡床内鼓泡和颗粒混合特性的CFD-DEM数值模拟[J]. 山东大学学报(工学版), 2015, 45(2): 62-66.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 程代展,李志强. 非线性系统线性化综述(英文)[J]. 山东大学学报(工学版), 2009, 39(2): 26 -36 .
[2] 王勇, 谢玉东.

大流量管道煤气的控制技术研究

[J]. 山东大学学报(工学版), 2009, 39(2): 70 -74 .
[3] 刘新1 ,宋思利1 ,王新洪2 . 石墨配比对钨极氩弧熔敷层TiC增强相含量及分布形态的影响[J]. 山东大学学报(工学版), 2009, 39(2): 98 -100 .
[4] 田芳1,张颖欣2,张礼3,侯秀萍3,裘南畹3. 新型金属氧化物薄膜气敏元件基材料的开发[J]. 山东大学学报(工学版), 2009, 39(2): 104 -107 .
[5] 陈华鑫, 陈拴发, 王秉纲. 基质沥青老化行为与老化机理[J]. 山东大学学报(工学版), 2009, 39(2): 125 -130 .
[6] 赵延风1,2, 王正中1,2 ,芦琴1,祝晗英3 . 梯形明渠水跃共轭水深的直接计算方法[J]. 山东大学学报(工学版), 2009, 39(2): 131 -136 .
[7] 李士进,王声特,黄乐平. 基于正反向异质性的遥感图像变化检测[J]. 山东大学学报(工学版), 2018, 48(3): 1 -9 .
[8] 赵科军 王新军 刘洋 仇一泓. 基于结构化覆盖网的连续 top-k 联接查询算法[J]. 山东大学学报(工学版), 2009, 39(5): 32 -37 .
[9] 赵治广,王登杰,田云飞 . 基于灰色理论的路基沉降研究[J]. 山东大学学报(工学版), 2007, 37(3): 86 -88 .
[10] 姚占勇,商庆森,赵之仲,贾朝霞 . 界面条件对半刚性沥青路面结构应力分布的影响[J]. 山东大学学报(工学版), 2007, 37(3): 93 -99 .