On Similarity-Based Queries For Time Series Data
Davood Rafiei
Department of Computer Science, University of Toronto
E-mail: [email protected]
[...] of searching the index multiple times and each time applying a single transformation, to search the index only once and apply a collection of transformations simultaneously to [...] real data show that the new algorithm for simultaneously [...] transformation at a time. We also examine the possibility of composing transformations in a query or of rewriting a query expression such that the resulting query can be efficiently evaluated.

[Figure 1 plots: six panels of daily index closings; surviving bottom-row panel titles: "COMPV 94/06/15", "DECL 94/06/15", "normalized and 19-day MV".]
1. Introduction

Time-series data are of growing importance in many new database applications, such as data mining or data warehousing. A time series is a sequence of real numbers, each number representing a value at a time point. For example, the sequence could represent stock or commodity prices, sales, exchange rates, weather data, biomedical measurements, etc.

We are often interested in similarity queries on time-series data [3, 2]. For example, we may want to find stocks that behave in approximately the same way (or approximately the opposite way, for hedging); or products that had similar selling patterns during the last year; or years when the temperature patterns in two regions of the world were similar. In queries of this type, approximate, rather than exact, matching is required.

A simple approach to determine a possible similarity be[...]

Figure 1. On the top from left to right, daily closings of the Dow Jones 65 Composite Volume (COMPV) index, the NYSE Volume (NYV) index, and both put together, normalized and smoothed using a 9-day moving average. On the bottom from left to right, again daily closings of the COMPV index, the NYSE Declining Issues (DECL) index, and both put together, normalized and smoothed using a 19-day moving average.

Example 1.1 Figure 1 shows daily closings of three indices: Dow Jones 65 Composite Volume (COMPV), NYSE Volume (NYV) and NYSE Declining Issues (DECL). It is difficult to see any similarity between these sequences. The Euclidean distance between closes of COMPV and NYV is 2873 and that between COMPV and DECL is 12939. On the other hand, if we normalize (this operation is described in Section 3) closes of COMPV and NYV
and compare their 9-day moving averages, they look similar. The Euclidean distance between 9-day moving averages of normalized closes of COMPV and NYV is less than 3. Similarly, if we normalize the closes of COMPV and DECL and compare their 19-day moving averages, they also look similar. In fact, the 19-day moving average is the shortest moving average that reduces the Euclidean distance between normalized closes of COMPV and DECL to less than 3.

Moving averages are widely used in stock data analysis (for example, see [5]). Their primary use is to smooth out short-term fluctuations and depict the underlying trend of a stock. Given two sequences to be compared, we usually do not know what moving average can make them similar. There can be several moving averages that reduce the distance between two sequences to less than a threshold. We are often interested in the shortest moving average, mainly because it leaves more details to the distance computation process. (Although it seems that if two sequences are similar w.r.t. an [...]-day moving average, they should be similar w.r.t. an [...]-day moving average, this is not true in general; a counterexample can be found in the extended version of this paper [10].) Moving averages can be formulated as linear transformations over the Fourier representation of a time sequence [12].

Figure 2. The daily closing price of Pacific Gas and Electric Co. (PCG) and that of Plum Creek Timber Co. (PCL), both starting from 94/11/02 for 128 days, represented in normal forms and their momenta. [Panel titles: "PCG 941102", "PCL 941102", "Momentum of PCG", "Momentum of PCL".]

Example 1.2 Figure 2 shows in normal form the daily closing prices of stocks of Pacific Gas and Electric Co. (PCG) and Plum Creek Timber Co. (PCL), both starting from November 2nd, 1994 for 128 days. One way to compare the change rates of two stocks is to compare their "momenta", which are obtained for every stock by subtracting the price at time t from the price at time t+1 (or, in general, t+n for some n). The Euclidean distance between the two momenta is 13.01. The series representing the price of PCG has a spike on February 3rd while the series of PCL has a spike on February 8th. No value is recorded for February 4th, 5th and 6th. If we shift the momentum of PCG two days to the right, the spikes will overlap and the Euclidean distance will reduce to 5.65.

The momentum of a sequence describes the rate at which its value (such as the price in the preceding example) is rising or falling, and it is seen as a measure of strength behind upward or downward movements. On the other hand, shifting a sequence horizontally before comparing it to another sequence removes any possible delay between the two sequences, which can arise, for example, in the stock market domain because of different reactions of two stocks to the same piece of news, or because of recording errors. Both momentum and shifting can be formulated as linear transformations over the Fourier representation of a sequence (see the extended version of this paper for details [10]). In general, there can be several possible linear transformations (time shifts, for example) to be applied to sequences, and each transformation can either reduce or increase the distance between sequences. However, for every pair of sequences we are usually interested in finding transformations that reduce the distance between them to a minimum.

In this paper, we propose a fast algorithm to process queries that specify more than one transformation as the basis for similarity. The idea is, instead of processing a single transformation at a time, to process a collection of them at once. To achieve this goal, we construct a minimum bound[...] minimum bounding rectangle for transformations can be applied to a multidimensional index constructed on sequences, thus reducing the number of searches over the index to one. Our experiments show that this algorithm performs much better than both sequentially scanning all sequences and traversing the index using one transformation at a time. We also examine the possibility of composing transformations in a query or of rewriting a query expression such that the resulting query can be efficiently evaluated.

The organization of the rest of the paper is as follows. In the next section we review the related work. The benefits of using transformations for expressing similarity queries are discussed in Section 3. In Section 4 we propose algorithms for fast processing of queries that express similarity in terms of multiple transformations. Section 5 contains experiments that show the effectiveness of our algorithms. Section 6 is the conclusion.

2. Related Work

An indexing technique for the fast retrieval of similar time sequences is proposed by Agrawal et al. [1]. The idea is to use the Discrete Fourier Transform (DFT) to map time sequences (stored in a database) into the frequency domain.
The DFT of a time sequence $x = [x_t]$, $t = 0, \ldots, n-1$, denoted by $X$, is given by

$$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi t f / n}, \qquad f = 0, 1, \ldots, n-1 \qquad (1)$$

where $j = \sqrt{-1}$ is the imaginary unit. Keeping only the first k Fourier coefficients, each sequence becomes a point in a k-dimensional feature space. To allow a fast retrieval, the authors keep the first k Fourier coefficients of a sequence in an R-tree index. In an earlier work [13], as a major improvement over this technique, we show that the last few Fourier coefficients of a sequence are as important as the first few coefficients due to the symmetry property of DFT. We also show that using the symmetry property improves the search time of the index by more than a factor of 2 without increasing its dimensionality.

In another work [12], we use this indexing method and propose techniques for retrieving similar time sequences whose differences can be removed by a linear transformation such as moving average, time scaling and inverting. In this paper, we generalize our earlier work and allow queries that express similarity in terms of multiple transformations. Our work here can be seen as an efficient implementation of a special case of the query language described by Jagadish [...]

[...] for [...] maps $X$ to $a * X + b$. This class of transformations can easily express operations such as moving average, momentum, time shift, etc. The expressions of these transformations and more can be found elsewhere [12, 11].

3.1. Transformations - Normal Form

An efficient way to compare two time sequences is to compare their normal forms. Given a time sequence $x$ of mean $\mu$ and standard deviation $\sigma$, the transformation $x_t \mapsto (x_t - \mu)/\sigma$ applied to $x$ gives its normal form. Due to the linearity property of DFT, the same transformation is applicable to the Fourier representation of a sequence.

Although it is not required by the algorithms given in this paper, we assume time sequences are normalized and, for every sequence, its normal form along with its mean and standard deviation are stored in a relation. This is mainly because of efficiency (as is noted by Goldin et al. [7]) and the following two attractive properties of normal form sequences, which are not mentioned by Goldin et al. [7].

1. It minimizes the Euclidean distance with respect to the scalar shift, i.e. $D(x - s_x, y - s_y)$ has its minimum when $s_x$ and $s_y$ respectively are chosen to be the means of $x$ and $y$.
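The normal form and the scalar-shift property above can be checked numerically. The following is a minimal sketch, assuming NumPy and the population standard deviation; the function names, the random-walk data and the choice of test shifts are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def normal_form(x):
    """Return (normal form, mean, standard deviation) of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()  # population std is an assumption
    if sigma == 0:
        raise ValueError("a constant sequence has no normal form")
    return (x - mu) / sigma, mu, sigma

def shifted_distance(x, y, sx, sy):
    """Euclidean distance between x - sx and y - sy (scalar shifts)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm((x - sx) - (y - sy)))

# Hypothetical random-walk "prices", for illustration only.
rng = np.random.default_rng(0)
x = 50 + np.cumsum(rng.normal(size=128))
y = 20 + np.cumsum(rng.normal(size=128))

nx, mu_x, sigma_x = normal_form(x)

# Distance when both sequences are shifted by their own means ...
best = shifted_distance(x, y, x.mean(), y.mean())
# ... compared with a few other shifts of x.
others = [shifted_distance(x, y, x.mean() + d, y.mean())
          for d in (-5.0, -1.0, 1.0, 5.0)]
```

Shifting by the means is never worse than any of the other shifts tried here, which is what the property predicts; storing the mean and standard deviation alongside the normal form lets the original sequence be recovered.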
[...] real vectors under a safety constraint.

[...] to apply an "s-day shift" followed by an "m-day moving average", for [...] and [...], to a sequence.

We claim that queries expressed in terms of such a sequence of transformations also benefit from the algorithms given in this paper. We show this by giving a method to translate any query expression that uses a sequence of transformations into one that uses only a set of transformations. The resulting query can then be processed using the same technique that we present for multiple transformations.

Given transformations $t_1 = (a_1, b_1)$ and $t_2 = (a_2, b_2)$, for example respectively corresponding to "2-day shift" and "10-day moving average", suppose we want to apply $t_1$ followed by $t_2$, which we denote by $t_2(t_1)$, to sequence $X$. We can construct the new transformation as follows:

$$t_2(t_1(X)) = a_2 * (a_1 * X + b_1) + b_2 = (a_2 * a_1) * X + (a_2 * b_1 + b_2) \qquad (3)$$

[...] equivalently can be expressed as [...]

[...] of IBM. A solution for processing this query is to scan the whole stocks relation, compute the m-day moving average for the closing price of every stock, and determine if the resulting sequence is within distance [...] of the m-day moving average of the close of IBM. The distance predicate needs to be checked for all possible transformations. We refer to this algorithm as the sequential-scan method. The cost of this algorithm includes one scan of the whole relation and computing the distance predicate [...] times.

Another approach is, for every transformation t [...], to apply t to the index built on the first few Fourier coefficients of the closing price and do a range query on the new index [12]. The union of these results for all t [...] gives the query answer. We call this algorithm ST-index, where ST stands for 'a Single Transformation at a time'. The cost of this algorithm includes traversing the index [...] times. Next, we describe a new algorithm that requires a single scan of the index and performs much better than both the sequential-scan and ST-[...]
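The sequential-scan baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions: it works in the time domain with a plain (non-circular) moving average, skips normalization, and uses hypothetical names (`moving_average`, `sequential_scan`, `relation`, `eps`) and synthetic data that do not come from the paper.

```python
import numpy as np

def moving_average(x, m):
    """m-day moving average over full windows (output length len(x) - m + 1)."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(m) / m, mode="valid")

def sequential_scan(relation, query, window_lengths, eps):
    """For each stored sequence, list the window lengths m whose m-day moving
    average is at Euclidean distance less than eps from the query's."""
    answers = {}
    for name, seq in relation.items():       # one scan of the whole relation
        hits = []
        for m in window_lengths:             # one distance test per transformation
            d = np.linalg.norm(moving_average(seq, m) - moving_average(query, m))
            if d < eps:
                hits.append(m)
        if hits:
            answers[name] = hits
    return answers

# Synthetic data, for illustration only.
t = np.arange(128)
query = np.sin(2 * np.pi * t / 32)
relation = {
    "close": query + 0.01 * np.cos(t),  # tiny perturbation of the query
    "far": query + 5.0,                 # constant offset, never within eps
}
answers = sequential_scan(relation, query, [5, 10, 20], eps=0.5)
```

The nested loop makes the cost structure visible: every sequence is read once, and the distance predicate is evaluated for each sequence and each transformation, which is the work the index-based methods try to avoid.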
Figure 3. The second DFT coefficients of m-day moving averages (for [...]) and their decompositions into mult-MBR and add-MBR. [Plot axes: abs(F2) and angle(F2).]

Figure 4. A data rectangle before and after being transformed. [Plot axes: abs(F2) and angle(F2).]

To develop an algorithm for answering Query 1, suppose an R-tree index is available on sequences. We can apply the transformation rectangle to every data rectangle in the index and construct a new index on the fly. The new index is constructed one index rectangle at a time, and each time the new rectangle is checked to see whether it intersects the query region. This process retrieves a set of candidate data items that includes all qualifying data items plus some false positives. The last step of the algorithm removes false positives by applying every member of the transformation set to every candidate data item and selecting data items that intersect the query region. We can write the search algorithm more formally as follows:

[...] to every (point) entry of [...] and check if the resulting rectangle intersects [...]. If so, the entry is a candidate.

5. For every candidate entry, retrieve its full database record, apply all transformations inside [...] to the sequence, and determine transformations that reduce the Euclidean distance between the data sequence and the query sequence to less than [...].

This algorithm is guaranteed not to miss any qualifying sequence (the proof is given in the extended version of this paper [10]). We can develop similar algorithms for efficiently processing spatial join and nearest neighbor queries. In a spatial join query, we apply the transformation MBR to [...] of stocks that have similar closing prices with respect to an m-day moving average for some [...]. Having an R-tree index for the closing prices, we can use any well-known spatial join algorithm for R-tree and change the join condition such that the transformation rectangle is applied to both data rectangles involved in the join before testing them for a possible overlap. Similarly, in a nearest neighbor query, as we walk down the tree, we apply the transformation MBR to all entries of the node we visit. We can then use any kind of metric (such as MINDIST or MINMAXDIST [15]) to prune the search.

4.2. Performance Improvement

A potential problem with the MT-index algorithm is that if transformations form several clusters, or a few of them are spread all over the space, then the minimum bounding rectangle of transformations will cover a large area. This MBR, when applied to a data rectangle, can easily make the data rectangle intersect the query region. This can reduce the filtering power of the index dramatically. A solution for this
problem is to allow more than one transformation rectangle. As the number of MBRs goes up, the area of each MBR gets smaller, and as a result the filtering power of the MBR increases; but, on the other hand, the same index needs to be traversed several times. In the worst case, the number of MBRs is the same as the number of transformations, i.e. every MBR includes only one transformation point. In such a case, both ST-index and MT-index perform exactly the same.

Now the question is how we should optimally choose MBRs for a given set of transformations such that the cost of Algorithm 1 (in terms of the number of disk accesses) becomes minimum. One solution is to estimate the cost for any possible set of MBRs and choose the set with minimum cost. A first attempt in estimating the cost for a given set of MBRs is to use the total area of MBRs. However, the total area is minimum if every MBR includes only one transformation point, i.e. the ST-index algorithm is used. Another approach for estimating the cost of a given set of MBRs is to apply MBRs to a fixed data rectangle, say a unit square, then compute the total area of the resulting data rectangles. According to this estimate, the best performance should be obtained using only one transformation rectangle.

However, our experiments showed that using one transformation rectangle did not necessarily give the best performance. The worst performance for MT-index, which is close to that of ST-index, is when we pack two clusters of transformations into one rectangle. A solution to avoid this problem is to use a cluster detection algorithm (such as CURE [8]) and avoid packing two clusters into one rectangle.

4.3. Ordering Assumption on Transformations

So far, we have made no assumption on any possible ordering among transformations. In this section, we define a notion of ordering among transformations and show that it can be quite useful in guiding the search process more effectively.

Definition 1 We call [...] an ordering of [...] w.r.t. value domain dom and distance function D if [...]

[...] need to apply all scale factors to sequences. Instead, we need to find the largest scale factor that makes the distance predicate true. Suppose [...] is such a scale factor. One way to find it is to do a binary search on the set of scale factors. Definition 1 easily implies that the distance predicate is true for all scale factors less than it.

We can use the binary search technique in all three algorithms described earlier. In the case of the sequential-scan method, we still need to scan the whole stocks relation. However, the number of sequence comparisons drops to [...]. Similarly, in the case of the MT-index algorithm, the number of disk accesses will still be the same, but the number of comparisons for every candidate sequence drops to [...]. The ordering assumption reduces the number of index traversals for ST-index to [...].

On the other hand, the ordering assumption does not hold in general. There are useful transformations that are not ordered w.r.t. time sequences and the Euclidean distance. For example, we can show that no ordering is possible for a set of moving averages w.r.t. time sequences and the Euclidean distance (see the extended version of this paper for a proof [10]).

5. Experimental Results

We implemented both ST-index and MT-index on top of Norbert Beckmann's Version 2 implementation of the R*-tree [4]. We ran experiments on both stock prices data obtained from the ftp site "ftp.ai.mit.edu/pub/stocks/results" and synthetic data. All our experiments were conducted on a 168 MHz Ultrasparc station. The stock prices database consisted of 1068 stocks and, for each stock, its daily closing prices for 128 days. Each synthetic sequence was in the form of [...] where [...] and [...] is a uniformly distributed random number in the range [...].

For every time series, we first transformed it to the normal form for reasons described in Section 3.1, and then we found its Fourier coefficients. Since the mean of a normal form series is zero by definition, the first Fourier coefficient is always zero, so we can throw it away. For every sequence, we stored the magnitudes and the angles of the second and
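The preprocessing described in this paragraph (normal form, then DFT, then dropping the zero first coefficient and keeping magnitudes and angles) can be sketched as below. This is an illustration under assumptions: the number of kept coefficients `k=2` is a guess, since the sentence above is cut off, the unitary scaling is one common DFT convention, and `dft_features` and the random-walk data are hypothetical.

```python
import numpy as np

def dft_features(x, k=2):
    """Magnitudes and angles of the k Fourier coefficients that follow the
    first one, computed on the normal form of x. Also returns the first
    coefficient so that callers can confirm it is (numerically) zero."""
    x = np.asarray(x, dtype=float)
    normal = (x - x.mean()) / x.std()          # normal form (Section 3.1)
    X = np.fft.fft(normal, norm="ortho")       # unitary DFT, 1/sqrt(n) scaling
    kept = X[1:1 + k]                          # skip the first coefficient
    return np.abs(kept), np.angle(kept), X[0]

# Hypothetical 128-day random-walk price series, for illustration only.
rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(size=128))
magnitudes, angles, first = dft_features(prices)
```

Storing coefficients in polar form (magnitude and angle) matches the abs(F2) and angle(F2) axes used in the paper's figures; each sequence then becomes a low-dimensional point that can be placed in a multidimensional index.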
[...] threshold.

Figure 5. Time per query varying the number of sequences for range queries. [Plot: running time (seconds) against number of sequences, with curves for sequential-scan, ST-index and MT-index.]

Figure 5 shows the running time of Query 1 using three algorithms: sequential-scan, ST-index, and MT-index. In the experiment, we set the number of transformations fixed to 16, but we varied the number of sequences from 500 [...] shows that MT-index performs better than both ST-index and sequential-scan.

Figure 6 shows the running time of Query 1 again using [...]

Figure 6. Time per query varying the number of transformations for range queries. [Plot: running time against number of transformations.]

[...]ble performance. To show this, we ran Query 1 using the MT-index algorithm on real stock prices data, but this time we varied the number of transformations per MBR from one to its maximum. The transformation set consisted of m-day moving averages for [...]. We equally partitioned [...] subsequent transformations and built an MBR for each partition. As is shown in Figure 7, despite the fact that collecting all transformations in one rectangle resulted in the minimum number of disk accesses, it did not necessarily give us the best performance, mainly because of the increased number of false positives.

[Figure 7 plots: two panels, the left one showing # of disk accesses, both against # of transformations per MBR, with series "mv6-mv29" and "their inverted forms".]
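The partitioning experiment above, which groups consecutive transformations and builds one MBR per group, can be sketched as follows. This is a rough illustration under several assumptions: each m-day moving average is represented only by the magnitude and angle of the second DFT coefficient of a trailing, circular averaging kernel of length 128; the window lengths 6 to 29 are read off the "mv6-mv29" plot legend; and the function names are hypothetical.

```python
import numpy as np

def ma_second_coefficient(m, n=128):
    """(magnitude, angle) of the second DFT coefficient (index 1) of an
    m-day moving-average kernel of length n."""
    kernel = np.zeros(n)
    kernel[:m] = 1.0 / m                  # trailing average, an assumption
    c = np.fft.fft(kernel)[1]
    return float(abs(c)), float(np.angle(c))

def partition_into_mbrs(window_lengths, per_mbr, n=128):
    """Group consecutive window lengths, per_mbr at a time, and return one
    bounding rectangle (mag_lo, mag_hi, ang_lo, ang_hi) per group."""
    points = np.array([ma_second_coefficient(m, n) for m in window_lengths])
    mbrs = []
    for start in range(0, len(points), per_mbr):
        group = points[start:start + per_mbr]
        mbrs.append((group[:, 0].min(), group[:, 0].max(),
                     group[:, 1].min(), group[:, 1].max()))
    return mbrs

lengths = list(range(6, 30))              # 24 moving averages
mbrs = partition_into_mbrs(lengths, per_mbr=8)
single = partition_into_mbrs(lengths, per_mbr=24)
```

With `per_mbr=1` every rectangle degenerates to a point, which corresponds to processing one transformation at a time, while `per_mbr=24` yields a single rectangle covering the whole set; intermediate values trade the number of index traversals against the area of each rectangle.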