﻿ 基于Map/Reduce的时间序列相似性搜索算法
 «上一篇 下一篇»
 山东大学学报(工学版)  2016, Vol. 46 Issue (1): 15-21  DOI: 10.6040/j.issn.1672-3961.1.2015.172 0

### 引用本文

WANG Huiqing, SUN Hongwei, ZHANG Jianhui. Time series similarity searching algorithm based on Map/Reduce[J]. Journal of Shandong University(Engineering Science), 2016, 46(1): 15-21. DOI: 10.6040/j.issn.1672-3961.1.2015.172.

### 文章历史

Time series similarity searching algorithm based on Map/Reduce
WANG Huiqing, SUN Hongwei, ZHANG Jianhui
College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, Shanxi, China
Abstract: The strategy of parallel computing was introduced into time series processing, and time series similarity searching algorithm based on Map/Reduce was proposed. The proposed algorithm could make use of the features of cloud computing to take large-scale computing and data processing, and could efficiently reduce the large calculation and simplify the computing process of time series similarity searching. The proposed algorithm was adopted on electrocardiograph dataset to complete similarity searching with piecewise aggregate approximation lower bound and dynamic time warping distance, which verified the effect of nodes changing on operation time and parallel speed up. Compared with the traditional one running on single PC, the proposed algorithm improved the efficiency of time series mining effectively.
Key words: parallel computing    time series mining    similarity searching    dynamic time warping distance    lower bound algorithm
0 引言

1 基于Map/Reduce的时间序列相似性搜索

1.1 时间序列相似性搜索算法

1.1.1 子序列搜索

 $T={t_1},{t_2},\ldots,{t_m},$ (1)

 ${T_{i,k}}={t_i},{t_{i+1}},\ldots,{t_{i+k - 1}},1 \le i \le m - k+1,$ (2)

 图1 子序列搜索 Fig. 1 Subsequence search

(1)对原始序列进行归一化预处理,使搜索序列与参考序列具有可搜索性;

(2)使用多维索引框架构建滑动窗口,取得原始搜索序列的子序列,并为其建立索引;

(3)使用下界算法进行过滤,将符合要求的子序列存入候选集中;

(4)最后遍历最终候选集,求每一条子序列与参考序列的DTW距离,得到最终搜索结果。子序列搜索的实现代码如下:

SubsequenceMatching(Q,C,d)

for each subsequence S of C (FRM)

if LB-SW(S,Q)>d (EA)//EA表示

Early Abandon算法

break;

else LB-SW(S,Q)≤d (EA)

Add S to Candidate Set H;

end

end

for each Hi of H

if DTW(Hi,Q)<Dmin(EA)

Dmin=DTW(Hi,Q)

end

end

1.1.2 DTW距离

1.1.3 下界算法

 ${U_i}=\max \left\{ {{c_{i - r}},\ldots,{c_{i+r}}} \right\},$ (3)
 ${L_i}=\min \left\{ {{c_{i - r}},\ldots,{c_{i+r}}} \right\},$ (4)

 图2 LB-Keogh边界曲线 Fig. 2 The boundary curve of LB-Keogh

 ${D_{{\rm{lb\_Keogh}}}}\left({C,Q} \right)==\sqrt {\mathop \sum \limits_n^{i=1} \left\{ {\begin{array}{*{20}{c}} {{{\left({{q_i} - {u_i}} \right)}^2},}\\ {{{\left({{l_i} - {q_i}} \right)}^2},}\\ {0,} \end{array}} \right.} \begin{array}{*{20}{c}} {{\rm{if}}{q_i} > {u_i},}\\ {{\rm{if}}{q_i} <{l_i},}\\ {{\rm{otherwise}}{\rm{.}}} \end{array}$ (5)

ZHANG Y和GLASS J在LB-Keogh算法的基础上提出了LB-PAA算法[4]。LB-PAA下界算法采用点对累积近似(piecewise aggregate approximation,PAA)表示法[23],分别表示原始序列与上下边界曲线,求取相应的下界距离。LB-PAA算法降低了原始序列的维度,简化了相似性搜索的运算过程,同时也减弱了部分噪声对结果的影响。

1.2 Map/Reduce下的时间序列相似性搜索算法

MapReduce机制分为Map和Reduce两个步骤。Map的工作是分解,把需要并行处理的数据拆分成大量数据片段,为分布式集群中的每一台计算机分配一个数据片段分别进行处理;而Reduce的工作与Map正好相反,负责将Map分开的数据在运算完成后再次整合到一起,并将最终汇总的结果输出[24, 25]

1.2.1 LB-PAA下界算法的并行处理

(1)Map阶段

①通过DistributedCache类将查询序列T中的所有子序列Ci按照节点性能分配给各节点,并将参考序列Q及下界阈值d发送到各节点。

②对每一节点,使用公式(5)计算其分配到的子序列与参考序列的下界距离Dlb-PAA(C,Q),节点内串行运算。

③每一节点中比较所有子序列下界距离Dlb-PAA(Ci,Q)与阈值d的大小,将满足条件的子序列写入该节点对应矩阵Zi

④将所有节点计算得到的Zi矩阵输出。

(2)Reduce阶段

①从Mapper处获得中间结果(即Zi矩阵)。

②汇总所有临时矩阵中的值,合并成最终结果矩阵Z

③把最终矩阵Z(即候选集)输出给主函数。

1.2.2 DTW距离计算的并行处理

(1)Map阶段

①通过DistributedCache类将候选集Z中的所有序列zi按照节点性能分配给各节点,并将参考序列Q发送到各节点;

②在每一节点内,使用DTW距离计算公式计算得出所有序列与参考序列的DDTW(zi,Q),求出其中的最小值;

③将DTW距离最小的序列及其DTW距离输出。

(2)Reduce阶段

①从Mapper处获得中间结果;

②汇总所有Mapper的临时结果,求出其中的最小值;

③把最终结果(所有子序列中DTW距离最小,即相似度最高的序列,及其DTW距离)输出给主函数。

2 试验与分析

2.1 试验一

 图3 不同数量级下界过滤运算时间及加速比 Fig. 3 Calculating time and speedup rate of lower bound filter on different order of magnitudes

2.2 试验二

 图4 不同数量级DTW距离运算时间及加速比 Fig. 4 Calculating time and speedup rate of DTW distance on different order of magnitudes

2.3 试验分析

3 结语

 [1] BHADURI K, ZHU Q, OZA N C, et al. Fast and flexible multivariate time series subsequence search[C]//Proc of the 2010 IEEE Int l Conf on Data Mining. Sydney, Australia: IEEE, 2010:48-57.(1) [2] BERNDT D J, CLIFFORD J. Using dynamic time warping to find patterns in time series[C]//KDD Workshop. Seattle, USA:Springer US, 1994:359-370.(1) [3] KEOGH E, RATANAMAHATANA C A. Exact indexing of dynamic time warping[J]. Knowledge and Information Systems, 2005(7):358-386.(2) [4] ZHANG Y, GLASS J. A piecewise aggregate approximation lower-bound estimate for posteriorgram-based dynamic time warping[C]//Interspeech. Florence,Italy:IEEE, 2011:1909-1912.(2) [5] 丁静,杨善林,罗贺,等. 云计算环境下的数据挖掘服务模式[J]. 计算机科学, 2012, 39(6):217-219.DING Jing, YANG Shanlin, LUO He, et al. Data mining service model in cloud computing environment[J]. Computer Science, 2012, 39(6):217-219.(1) [6] 杨来,史忠植,梁帆,等. 基于Hadoop云平台的并行数据挖掘方法[J].系统仿真学报,2013,25(5):936-944.YANG Lai, SHI Zhongzhi, LIANG Fan, et al. Parallel approach in data mining based on hadoop cloud platform[J]. Journal of System Simulation, 2013, 25(5):936-944.(1) [7] 程苗. 基于云计算的Web数据挖掘[J]. 计算机科学,2011,38(10):146-149.CHENG Miao. Web data mining based on cloud computing[J]. Computer Science, 2011, 38(10):146-149.(1) [8] 陈光鹏,杨育彬,高阳,等. 一种基于Map/Reduce的频繁闭项集挖掘算法[J]. 模式识别与人工智能,2012,25(2):220-224.CHEN Guangpeng, YANG Yubin, GAO Yang, et al. Closed frequent itemset mining based on mapreduce[J]. PR＆AI, 2015, 25(2):220-224.(1) [9] 田国会,许亚雄. 云机器人:概念、架构与关键技术研究综述[J]. 山东大学学报(工学版),2014,44(6):47-54.TIAN Guohui, XU Yaxiong. Cloud robotics: concept, architectures and key technologies[J]. Journal of Shandong University(Engineering Science), 2014, 44(6):47-54.(1) [10] ZHU Q, WANG X Y, KEOGH E, et al. An efficient and effective similarity measure to enable data mining of petroglyphs[J]. Data Mining and Knowledge Discovery, 2011, 23(1):91-127.(1) [11] PAPAPETROU P, ATHITSOS V, POTAMIAS M, et al. Embedding-based subsequence matching in time-series databases[J]. ACM Transactions on Database Systems, 2011, 36(3):17.(1) [12] 周大镯,吴晓丽,闫红灿. 一种高效的多变量时间序列相似查询算[J]. 计算机应用,2008,28(10):2541-2543.ZHOU Dazhuo, WU Xiaoli, YAN Hongcan. An efficient similarity search for multivariate time series[J]. Computer Applications, 2008, 28(10):2541-2543.(1) [13] FU AWC, KEOGH E, LAU LYH, et al. Scaling and time warping in time series querying[J]. The VLDB Journal, 2008, 17:899-921.(1) [14] MU B, YAN J L. Efficient time series lower bounding technique[J]. Computer Engineering and Applications, 2009, 45(11):168-171.(1) [15] BEGUM N,KEOGH E. Rare time series motif discovery from unbounded streams[J]. Proceedings of the VLDB Endowment, 2014, 8(2):149-160.(1) [16] ZHOU D Z, JIANG W B, LI M Q. Efficient clustering algorithm for multivariate time series[J]. Computer Engineering and Applications, 2010, 46(1):137-139.(1) [17] THANAWIN Rakthanmanon, BILSON Campana, ABDULLAH Mueen, et al. Searching and mining trillions of time series subsequences under dynamic time warping[C]//KDD′12-18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA:IEEE, 2012:262-270.(1) [18] WANG X, MUEEN A, DING H, et al. Experimental comparison of representation methods and distance measures foe time series data[J]. Data Mining and Knowledge Discovery, 2013, 26(2):275-309.(1) [19] SHOKOOHI-YEKTA M, WANG J, KEOGH E. On the non-trivial generalization of dynamic time warping to the multi-dimensional case[C]//SDM Vancouver,Canada:Society for Industrial and Applied Mathematics, 2015:289-297.(1) [20] SAKURAI Y, YOSHIKAWA M, FALOUTSOS C. FTW: fast similarity search under the time warping distance[C]//PODS. Baltimore, Maryland:Sprinter US, 2005:326-337.(1) [21] BANK Z, ABONYI J. Correlation based dynamic time warping of multivariate time series[J]. Expert Systems with Applications, 2012, 39(17):12814-12823.(1) [22] 李正欣,张凤鸣,李克武,等.一种支持DTW距离的多元时间序列索引结构[J]. 软件学报,2014,25(3):560-575.LI Zhengxin, ZHANG Fengming, LI Kewu, et al. Index structure for multivariate time series under DTW distance metric[J]. Journal of Software, 2014, 25(3):560-575.(1) [23] HYUNJIN Yoon, KIYOUNG Yang, CYRUS Shahabi. Feature subset selection and feature ranking for multivariate time series[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9):1186-1198.(1) [24] 李伟卫,赵航,张阳,等. 基于MapReduce的海量数据挖掘技术研究[J].计算机工程与应用,2013,49(20):112-117.LI Weiwei, ZHAO Hang, ZHANG Yang, et al. Research on massive data mining based on mapreduce[J]. Computer Engineering and Applications, 2013, 49(20):112-117.(1) [25] 何东之,张吉沣,赵鹏飞. 不确定传播算法的MapReduce并行化实现[J]. 山东大学学报(工学版),2015,44(5):22-28.HE Dongzhi, ZHANG Jifeng, ZHAO Pengfei. Parallel implementing probabilistic spreading algorithm using MapReduce programming mode[J]. Journal of Shandong University(Engineering Science), 2015, 44(5):22-28.(1) [26] AGRAWAL R, FALOUTSOS C, SWAMI A. Efficient similarity search in sequence databases[C]//Proc of the 4th Int′l Conf on Foundations of Data Organization and Algorithms. Chicoga, USA:Springer Berlin Heidelberg, 1993:69-84.(1)