Kshape Short
ABSTRACT

able iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. An extensive experimental evaluation against partitional, hierarchical, and spectral clustering methods, with the most competitive distance measures, showed the robustness of k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and efficient clustering approach for time series with broad applications.

1. INTRODUCTION

Temporal, or sequential, data mining deals with problems where data are naturally organized in sequences [28]. We refer to such data sequences as time-series sequences if they contain explicit information about timing (e.g., stock, audio, speech, and video) or if an ordering on values can be inferred (e.g., streams and handwriting). Large volumes of time-series sequences appear in almost every discipline, including astronomy, biology, meteorology, medicine, finance, robotics, engineering, and others [1, 5, 21, 23, 29, 43, 59, 62]. The ubiquity of time series has generated a substantial interest in querying [2, 38, 39, 41, 52, 61, 65], indexing [8, 11, 34, 35, 37, 63], classification [30, 47, 58, 70], clustering [36, 45, 54, 69, 71], and modeling [3, 31, 68] of such data.

Among all techniques applied to time-series data, clustering is the most widely used as it does not rely on costly human supervision or time-consuming annotation of data. With clustering, we can identify and summarize interesting patterns and correlations in the underlying data [27]. In the last few decades, clustering of time-series sequences has received significant attention [4, 14, 21, 40, 51, 54, 56, 69, 71], not only as a powerful stand-alone exploratory method, but also as a preprocessing step or subroutine for other tasks.

The original version of this paper was published in ACM SIGMOD 2015 [53].

Figure 1: ECG sequence examples and types of alignments for the two classes of the ECGFiveDays dataset [1].

Most time-series analysis methods, including clustering, critically depend on the choice of distance measure. A key issue when comparing two sequences is how to handle the variety of distortions, as we will discuss, that are characteristic of the sequences. To illustrate this point, consider the ECGFiveDays dataset [1], with ECG sequences recorded for the same patient on two different days. While the sequences seem similar overall, they exhibit patterns that belong in one of two distinct classes (see Figure 1): Class A is characterized by a sharp rise, a drop, and another gradual increase, while Class B is characterized by a gradual increase, a drop, and another gradual increase. Ideally, a shape-based clustering method should generate a partition similar to the classes shown in Figure 1, where sequences exhibiting similar patterns are placed into the same cluster based on their shape similarity, regardless of differences in amplitude and phase. As the notion of shape cannot be precisely defined, dozens of distance measures have been proposed [9, 10, 12, 16, 18, 46, 64] to offer invariances to multiple inherent distortions in the data. However, it has been shown that distance measures offering invariances to amplitude and phase perform exceptionally well [15, 66] and, hence, such measures are used for shape-based clustering [44, 50, 54, 69].

Due to these difficulties and the different needs for invariances from one domain to another, more attention has been given to the creation of new distance measures rather than to the creation of new clustering algorithms. It is generally believed that the choice of distance measure is more important than the clustering algorithm itself [6]. As a consequence, time-series clustering relies mostly on classic clustering methods, either by replacing the default distance measure with one that is more appropriate for time series, or by transforming time series into "flat" data so that existing clustering algorithms can be directly used [67]. However, the choice of clustering method can affect: (i) accuracy, as every method expresses homogeneity and separation of clusters
$$
R_k(\vec{x}, \vec{y}) =
\begin{cases}
\sum_{l=1}^{m-k} x_{l+k} \cdot y_l, & k \geq 0 \\
R_{-k}(\vec{y}, \vec{x}), & k < 0
\end{cases}
\qquad (5)
$$

$\vec{x}$ with respect to $\vec{y}$ is then $\vec{x}_{(s)}$, where $s = w - m$.

$$
NCC_q(\vec{x}, \vec{y}) =
\begin{cases}
\dfrac{CC_w(\vec{x}, \vec{y})}{m}, & q = \text{``b''} \; (NCC_b) \\[6pt]
\dfrac{CC_w(\vec{x}, \vec{y})}{m - |w - m|}, & q = \text{``u''} \; (NCC_u) \\[6pt]
\dfrac{CC_w(\vec{x}, \vec{y})}{\sqrt{R_0(\vec{x}, \vec{x}) \cdot R_0(\vec{y}, \vec{y})}}, & q = \text{``c''} \; (NCC_c)
\end{cases}
\qquad (6)
$$
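To make Equation 6 concrete, the three normalizations can be sketched with NumPy's `np.correlate`, which produces the full cross-correlation sequence $CC_w$ for $w = 1, \dots, 2m-1$. The function name `ncc` and the naive $O(m^2)$ computation are ours, for illustration only:

```python
import numpy as np

def ncc(x, y, q="c"):
    """Cross-correlation normalizations NCC_b, NCC_u, NCC_c of Eq. 6 (naive version)."""
    m = len(x)
    cc = np.correlate(x, y, mode="full")          # CC_w for w = 1..2m-1
    if q == "b":                                  # biased estimator
        return cc / m
    if q == "u":                                  # unbiased estimator
        w = np.arange(1, 2 * m)
        return cc / (m - np.abs(w - m))
    # coefficient normalization: geometric mean of the lag-0 autocorrelations
    return cc / np.sqrt(np.dot(x, x) * np.dot(y, y))
```

By the Cauchy-Schwarz inequality, the coefficient normalization keeps all values within $[-1, 1]$, which is what makes $NCC_c$ suitable for comparing arbitrary pairs of sequences.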
Beyond the cross-correlation normalizations, time series might also require normalization to remove inherent distortions. Figure 3 illustrates how the cross-correlation normalizations for two sequences $\vec{x}$ and $\vec{y}$ of length $m = 1024$ are affected by time-series normalizations. Independently of the normalization applied to $CC_w(\vec{x}, \vec{y})$, the produced sequence will have length 2047. Initially, in Figure 3a, we remove differences in amplitude by z-normalizing $\vec{x}$ and $\vec{y}$ in order to show that they are aligned and, hence, no shifting is required. If $CC_w(\vec{x}, \vec{y})$ is maximized for $w \in [1025, 2047]$ (or $w \in [1, 1023]$), one of $\vec{x}$ or $\vec{y}$ should be shifted by $i - 1024$ to the right (or $1024 - i$ to the left). Otherwise, if $w = 1024$, $\vec{x}$ and $\vec{y}$ are properly aligned, which is what we expect in our example. Figure 3b shows that if we do not z-normalize $\vec{x}$ and $\vec{y}$, and we use the biased estimator, then $NCC_b$ is maximized at $w = 1797$, which indicates a shifting of a sequence to the left $1797 - 1024 = 773$ times. If we z-normalize $\vec{x}$ and $\vec{y}$, and use the unbiased estimator, then $NCC_u$ is maximized at $w = 1694$, which indicates a shifting of a sequence to the right $1694 - 1024 = 670$ times (Figure 3c). Finally, if we z-normalize $\vec{x}$ and $\vec{y}$, and use the coefficient normalization, then $NCC_c$ is maximized at $w = 1024$, which indicates that no shifting is required (Figure 3d).

Figure 3: Time-series and cross-correlation normalizations.

As illustrated by the example, normalizations of the data and the cross-correlation measure can have a significant impact on the cross-correlation sequence produced, which makes the creation of a distance measure a non-trivial task. Furthermore, as in Figure 3, cross-correlation sequences produced by pairwise comparisons of multiple time series will differ in amplitude based on the normalizations. Thus, a normalization that produces values within a specified range should be used to meaningfully compare such sequences.

Shape-based distance (SBD): To devise a shape-based distance measure, and based on the previous discussion, we use the coefficient normalization that gives values between $-1$ and $1$, regardless of the data normalization. Coefficient normalization divides the cross-correlation sequence by the geometric mean of the autocorrelations of the individual sequences. After normalization of the sequence, we detect the position $w$ where $NCC_c(\vec{x}, \vec{y})$ is maximized and we derive the following distance measure:

$$
SBD(\vec{x}, \vec{y}) = 1 - \max_w \left( \frac{CC_w(\vec{x}, \vec{y})}{\sqrt{R_0(\vec{x}, \vec{x}) \cdot R_0(\vec{y}, \vec{y})}} \right)
\qquad (7)
$$

which takes values between 0 and 2, with 0 indicating perfect similarity for time-series sequences.

Up to now we have addressed shift invariance. For scaling invariance, we transform each sequence $\vec{x}$ into $\vec{x}' = \frac{\vec{x} - \mu}{\sigma}$, so that its mean $\mu$ is zero and its standard deviation $\sigma$ is one.

Efficient computation of SBD: From Equation 4, the computation of $CC_w(\vec{x}, \vec{y})$ for all values of $w$ requires $O(m^2)$ time, where $m$ is the time-series length. The convolution theorem [32] states that the convolution of two time series can be computed as the Inverse Discrete Fourier Transform (IDFT) of the product of their individual Discrete Fourier Transforms (DFT). Cross-correlation is then computed as the convolution of two time series if one sequence is first reversed in time, $\vec{x}(t) = \vec{x}(-t)$ [32], which equals taking the complex conjugate in the frequency domain. However, DFT and IDFT still require $O(m^2)$ time. By using a Fast Fourier Transform (FFT) algorithm [13], the time reduces to $O(m \log(m))$. Data and cross-correlation normalizations can also be efficiently computed; thus, the overall time complexity of SBD remains $O(m \log(m))$. Moreover, recursive algorithms compute an FFT by dividing it into pieces of power-of-two size [20]. Therefore, to further improve the performance of the FFT computation, when $CC(\vec{x}, \vec{y})$ is not an exact power of two, we pad $\vec{x}$ and $\vec{y}$ with zeros to reach the next power-of-two length after $2m - 1$.

This section described effective cross-correlation and data normalizations to derive a shape-based distance measure. Importantly, we also discussed how the cross-correlation distance measure can be efficiently computed. Our experiments show that SBD is highly competitive, achieving similar results to cDTW and DTW while being orders of magnitude faster. We now turn to the critical problem of extracting a centroid for a cluster, to represent the cluster data consistently with the above shape-based distance measure.

3.2 Time-Series Shape Extraction

Many time-series tasks rely on methods that summarize a set of time series by only one sequence, often referred to as an average sequence or, in the context of clustering, as a centroid. The extraction of meaningful centroids is a challenging task that critically depends on the choice of distance measure. We now show how to determine such centroids for time-series clustering for the SBD distance measure, to capture shared characteristics of the underlying data.
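The efficient SBD computation of Section 3.1 (coefficient normalization, zero-padding to the next power of two, and FFT-based cross-correlation) can be sketched as follows; the function names `znorm` and `sbd` are ours, and this is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def znorm(x):
    """z-normalize for scaling invariance: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def sbd(x, y):
    """Shape-based distance (Eq. 7) in O(m log m) via FFT cross-correlation."""
    m = len(x)
    fft_size = 1 << (2 * m - 2).bit_length()          # next power of two >= 2m - 1
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) *
                      np.conj(np.fft.rfft(y, fft_size)), fft_size)
    # reorder circular lags into CC_w for w = 1..2m-1 (lags -(m-1)..(m-1))
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))
    ncc = cc / np.sqrt(np.dot(x, x) * np.dot(y, y))   # coefficient normalization
    w = int(np.argmax(ncc))
    return 1.0 - ncc[w], w - (m - 1)                  # distance in [0, 2], optimal shift
```

A pairwise comparison then reduces to `sbd(znorm(x), znorm(y))`; the returned shift can be used to align the two sequences before further processing, as in the centroid computation discussed next.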
Figure 5: Ranking of distance measures based on the average of their ranks across datasets. The wiggly line connects all measures that do not perform statistically differently according to the Nemenyi test.

Figure 6: Ranking of k-means variants based on the average of their ranks across datasets. The wiggly line connects all techniques that do not perform statistically differently according to the Nemenyi test.

ants: (i) k-means with DTW as distance measure and the DBA method for centroid computation (k-DBA) [54] and (ii) k-means with a distance measure offering pairwise scaling and shifting of time series and computation of the spectral norm of a matrix for centroid computation (KSC) [69]. As non-scalable methods, among partitional methods we consider the Partitioning Around Medoids (PAM+Dist) implementation of the k-medoids algorithm [33]. Among hierarchical methods, we use agglomerative hierarchical clustering with single (H-S+Dist), average (H-A+Dist), and complete (H-C+Dist) linkage criteria [33]. Finally, among spectral methods, we consider the popular normalized spectral clustering method (S+Dist) [49]. Overall, we compared k-Shape against 20 clustering approaches.

Metrics: We compute CPU time utilization and report time ratios for our comparisons. We use the one-nearest-neighbor classification accuracy to evaluate the distance measures and the Rand Index [57] to evaluate clustering accuracy.

Statistical analysis: We use the Friedman test [19] followed by the post-hoc Nemenyi test [48] for comparison of multiple algorithms over multiple datasets, and we report statistically significant results with a 95% confidence level.

5. EXPERIMENTAL RESULTS

We now provide highlights of the detailed experimental evaluation in [53]. First, we evaluate SBD against the state-of-the-art distance measures. Then, we compare k-Shape against scalable and non-scalable clustering approaches.

Evaluation of SBD: All distance measures, including SBD, outperform ED with statistical significance. The difference in accuracy between SBD and DTW is in most cases negligible: SBD performs at least as well as DTW in 30 datasets. Considering the constrained versions of DTW, we observe that SBD performs similarly to or better than cDTWopt and cDTW5 in 22 and 18 datasets, respectively. To better understand the performance of SBD in comparison with cDTWopt and cDTW5, we evaluate the significance of their differences in accuracy when considered all together. Figure 5 shows the average rank across datasets of each distance measure. cDTWopt is the top measure, with an average rank of 1.96, meaning that cDTWopt performed best in the majority of the datasets. The Friedman test rejects the null hypothesis that all measures behave similarly, and, hence, we proceed with a post-hoc Nemenyi test, to evaluate the significance of the differences in the ranks. The wiggly line in the figure connects all measures that do not perform statistically differently according to the Nemenyi test. We observe that the ranks of cDTWopt, cDTW5, and SBD do not present a significant difference, and ED, which is ranked last, is significantly worse than the others. In terms of efficiency, SBD is only 4.4x slower than ED and remains one order of magnitude faster than cDTWopt and cDTW5. In conclusion, SBD is a very efficient, parameter-free distance measure that significantly outperforms ED and achieves similar results to both constrained and unconstrained versions of DTW.

Evaluation of k-Shape Against Other Scalable Methods: Figure 6 shows the average rank across datasets of each k-means variant. k-Shape is the top technique, with an average rank of 1.89, meaning that k-Shape was best in the majority of the datasets. The Friedman test rejects that all algorithms behave similarly, so we proceed with a post-hoc Nemenyi test, to evaluate the significance of the differences in the ranks. We observe that the ranks of KSC, k-DBA, and k-AVG+ED do not present a statistically significant difference, whereas k-Shape, which is ranked first, is significantly better than the others. Modifying k-means with inappropriate distance measures or centroid computation methods might lead to unexpected results. In terms of efficiency, k-Shape is one order of magnitude faster than KSC, two orders of magnitude faster than k-DBA, and one order of magnitude slower than k-AVG+ED.

Evaluation of k-Shape Against Non-Scalable Methods: To show the robustness of k-Shape in terms of accuracy beyond scalable approaches, we now ignore scalability and compare k-Shape against hierarchical, spectral, and k-medoids methods. Among all existing state-of-the-art methods that use ED or cDTW5 as distance measures, only partitional methods perform similarly to or better than k-AVG+ED. In particular, PAM+cDTW5 is the only method that outperforms k-AVG+ED. Figure 7 shows that k-Shape, PAM+SBD, PAM+cDTW5, and S+SBD (i.e., all methods outperforming k-AVG+ED) do not present a significant difference in accuracy, whereas k-AVG+ED, which is ranked last, is significantly worse than the others.

In short, our experimental evaluation suggests that SBD is as competitive as state-of-the-art measures, such as cDTW and DTW, but faster, and k-Shape is the only method that is both accurate and efficient. In [53], we provide further details on these findings and on the performance of hierarchical and spectral methods as well.

6. CONCLUSIONS

We presented k-Shape, a partitional clustering algorithm that preserves the shapes of time series. k-Shape compares time series efficiently and computes centroids effectively under the scaling and shift invariances. We have identified many interesting directions for future work. For example, k-Shape currently operates over a single time-series representation and cannot handle multiple representations. Considering that several transformations (e.g., smoothing) can reduce noise and eliminate outliers in time series, an extension of k-Shape to leverage characteristics from multiple representations can significantly improve its accuracy. Another future direction is to explore the usefulness of k-Shape as a "subroutine" of other methods. For example, nearest centroid classifiers rely on effective clustering of time series and subsequent extraction of centroids for the clusters.