Kshape Short
ABSTRACT

able iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. An extensive experimental evaluation against partitional, hierarchical, and spectral clustering methods, with the most competitive distance measures, showed the robustness of k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and efficient clustering approach for time series with broad applications.

1. INTRODUCTION

Temporal, or sequential, data mining deals with problems where data are naturally organized in sequences [28]. We refer to such data sequences as time-series sequences if they contain explicit information about timing (e.g., stock, audio, speech, and video) or if an ordering on values can be inferred (e.g., streams and handwriting). Large volumes of time-series sequences appear in almost every discipline, including astronomy, biology, meteorology, medicine, finance, robotics, engineering, and others [1, 5, 21, 23, 29, 43, 59, 62]. The ubiquity of time series has generated a substantial interest in querying [2, 38, 39, 41, 52, 61, 65], indexing [8, 11, 34, 35, 37, 63], classification [30, 47, 58, 70], clustering [36, 45, 54, 69, 71], and modeling [3, 31, 68] of such data.

Among all techniques applied to time-series data, clustering is the most widely used as it does not rely on costly human supervision or time-consuming annotation of data. With clustering, we can identify and summarize interesting patterns and correlations in the underlying data [27]. In the last few decades, clustering of time-series sequences has received significant attention [4, 14, 21, 40, 51, 54, 56, 69, 71], not only as a powerful stand-alone exploratory method, but also as a preprocessing step or subroutine for other tasks.

The original version of this paper was published in ACM SIGMOD 2015 [53].

Figure 1: ECG sequence examples and types of alignments for the two classes of the ECGFiveDays dataset [1].

Most time-series analysis methods, including clustering, critically depend on the choice of distance measure. A key issue when comparing two sequences is how to handle the variety of distortions, as we will discuss, that are characteristic of the sequences. To illustrate this point, consider the ECGFiveDays dataset [1], with ECG sequences recorded for the same patient on two different days. While the sequences seem similar overall, they exhibit patterns that belong in one of two distinct classes (see Figure 1): Class A is characterized by a sharp rise, a drop, and another gradual increase, while Class B is characterized by a gradual increase, a drop, and another gradual increase. Ideally, a shape-based clustering method should generate a partition similar to the classes shown in Figure 1, where sequences exhibiting similar patterns are placed into the same cluster based on their shape similarity, regardless of differences in amplitude and phase. As the notion of shape cannot be precisely defined, dozens of distance measures have been proposed [9, 10, 12, 16, 18, 46, 64] to offer invariances to multiple inherent distortions in the data. However, it has been shown that distance measures offering invariances to amplitude and phase perform exceptionally well [15, 66] and, hence, such measures are used for shape-based clustering [44, 50, 54, 69].

Due to these difficulties and the different needs for invariances from one domain to another, more attention has been given to the creation of new distance measures rather than to the creation of new clustering algorithms. It is generally believed that the choice of distance measure is more important than the clustering algorithm itself [6]. As a consequence, time-series clustering relies mostly on classic clustering methods, either by replacing the default distance measure with one that is more appropriate for time series, or by transforming time series into "flat" data so that existing clustering algorithms can be directly used [67]. However, the choice of clustering method can affect: (i) accuracy, as every method expresses homogeneity and separation of clusters
$$
R_k(\vec{x}, \vec{y}) =
\begin{cases}
\sum_{l=1}^{m-k} x_{l+k} \cdot y_l, & k \geq 0 \\
R_{-k}(\vec{y}, \vec{x}), & k < 0
\end{cases}
\qquad (5)
$$

$\vec{x}$ with respect to $\vec{y}$ is then $\vec{x}_{(s)}$, where $s = w - m$.

$$
NCC_q(\vec{x}, \vec{y}) =
\begin{cases}
\dfrac{CC_w(\vec{x}, \vec{y})}{m}, & q = \text{``b''} \; (NCC_b) \\[6pt]
\dfrac{CC_w(\vec{x}, \vec{y})}{m - |w - m|}, & q = \text{``u''} \; (NCC_u) \\[6pt]
\dfrac{CC_w(\vec{x}, \vec{y})}{\sqrt{R_0(\vec{x}, \vec{x}) \cdot R_0(\vec{y}, \vec{y})}}, & q = \text{``c''} \; (NCC_c)
\end{cases}
\qquad (6)
$$
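To make Equation 6 concrete, the three normalizations can be sketched with NumPy's `np.correlate`, which produces the full cross-correlation sequence $CC_w$ for $w = 1, \dots, 2m-1$. The function name `ncc` and the naive $O(m^2)$ computation are ours, for illustration only:

```python
import numpy as np

def ncc(x, y, q="c"):
    """Cross-correlation normalizations NCC_b, NCC_u, NCC_c of Eq. 6 (naive version)."""
    m = len(x)
    cc = np.correlate(x, y, mode="full")          # CC_w for w = 1..2m-1
    if q == "b":                                  # biased estimator
        return cc / m
    if q == "u":                                  # unbiased estimator
        w = np.arange(1, 2 * m)
        return cc / (m - np.abs(w - m))
    # coefficient normalization: geometric mean of the lag-0 autocorrelations
    return cc / np.sqrt(np.dot(x, x) * np.dot(y, y))
```

By the Cauchy-Schwarz inequality, the coefficient normalization keeps all values within $[-1, 1]$, which is what makes $NCC_c$ suitable for comparing arbitrary pairs of sequences.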
Beyond the cross-correlation normalizations, time series might also require normalization to remove inherent distortions. Figure 3 illustrates how the cross-correlation normalizations for two sequences $\vec{x}$ and $\vec{y}$ of length $m = 1024$ are affected by time-series normalizations. Independently of the normalization applied to $CC_w(\vec{x}, \vec{y})$, the produced sequence will have length 2047. Initially, in Figure 3a, we remove differences in amplitude by z-normalizing $\vec{x}$ and $\vec{y}$ in order to show that they are aligned and, hence, no shifting is required. If $CC_w(\vec{x}, \vec{y})$ is maximized for $w \in [1025, 2047]$ (or $w \in [1, 1023]$), one of $\vec{x}$ or $\vec{y}$ should be shifted by $i - 1024$ to the right (or $1024 - i$ to the left). Otherwise, if $w = 1024$, $\vec{x}$ and $\vec{y}$ are properly aligned, which is what we expect in our example. Figure 3b shows that if we do not z-normalize $\vec{x}$ and $\vec{y}$, and we use the biased estimator, then $NCC_b$ is maximized at $w = 1797$, which indicates a shifting of a sequence to the left $1797 - 1024 = 773$ times. If we z-normalize $\vec{x}$ and $\vec{y}$, and use the unbiased estimator, then $NCC_u$ is maximized at $w = 1694$, which indicates a shifting of a sequence to the right $1694 - 1024 = 670$ times (Figure 3c). Finally, if we z-normalize $\vec{x}$ and $\vec{y}$, and use the coefficient normalization, then $NCC_c$ is maximized at $w = 1024$, which indicates that no shifting is required (Figure 3d).

Figure 3: Time-series and cross-correlation normalizations.

As illustrated by the example, normalizations of the data and the cross-correlation measure can have a significant impact on the cross-correlation sequence produced, which makes the creation of a distance measure a non-trivial task. Furthermore, as in Figure 3, cross-correlation sequences produced by pairwise comparisons of multiple time series will differ in amplitude based on the normalizations. Thus, a normalization that produces values within a specified range should be used to meaningfully compare such sequences.

Shape-based distance (SBD): To devise a shape-based distance measure, and based on the previous discussion, we use the coefficient normalization that gives values between $-1$ and $1$, regardless of the data normalization. Coefficient normalization divides the cross-correlation sequence by the geometric mean of the autocorrelations of the individual sequences. After normalization of the sequence, we detect the position $w$ where $NCC_c(\vec{x}, \vec{y})$ is maximized and we derive the following distance measure:

$$
SBD(\vec{x}, \vec{y}) = 1 - \max_w \left( \frac{CC_w(\vec{x}, \vec{y})}{\sqrt{R_0(\vec{x}, \vec{x}) \cdot R_0(\vec{y}, \vec{y})}} \right)
\qquad (7)
$$

which takes values between 0 and 2, with 0 indicating perfect similarity for time-series sequences.

Up to now we have addressed shift invariance. For scaling invariance, we transform each sequence $\vec{x}$ into $\vec{x}' = \frac{\vec{x} - \mu}{\sigma}$, so that its mean $\mu$ is zero and its standard deviation $\sigma$ is one.

Efficient computation of SBD: From Equation 4, the computation of $CC_w(\vec{x}, \vec{y})$ for all values of $w$ requires $O(m^2)$ time, where $m$ is the time-series length. The convolution theorem [32] states that the convolution of two time series can be computed as the Inverse Discrete Fourier Transform (IDFT) of the product of their individual Discrete Fourier Transforms (DFT). Cross-correlation is then computed as the convolution of two time series if one sequence is first reversed in time, $\vec{x}(t) = \vec{x}(-t)$ [32], which equals taking the complex conjugate in the frequency domain. However, DFT and IDFT still require $O(m^2)$ time. By using a Fast Fourier Transform (FFT) algorithm [13], the time reduces to $O(m \log(m))$. Data and cross-correlation normalizations can also be efficiently computed; thus, the overall time complexity of SBD remains $O(m \log(m))$. Moreover, recursive algorithms compute an FFT by dividing it into pieces of power-of-two size [20]. Therefore, to further improve the performance of the FFT computation, when $CC(\vec{x}, \vec{y})$ is not an exact power of two, we pad $\vec{x}$ and $\vec{y}$ with zeros to reach the next power-of-two length after $2m - 1$.

This section described effective cross-correlation and data normalizations to derive a shape-based distance measure. Importantly, we also discussed how the cross-correlation distance measure can be efficiently computed. Our experiments show that SBD is highly competitive, achieving similar results to cDTW and DTW while being orders of magnitude faster. We now turn to the critical problem of extracting a centroid for a cluster, to represent the cluster data consistently with the above shape-based distance measure.

3.2 Time-Series Shape Extraction

Many time-series tasks rely on methods that summarize a set of time series by only one sequence, often referred to as an average sequence or, in the context of clustering, as a centroid. The extraction of meaningful centroids is a challenging task that critically depends on the choice of distance measure. We now show how to determine such centroids for time-series clustering for the SBD distance measure, to capture shared characteristics of the underlying data.
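The efficient SBD computation of Section 3.1 (coefficient normalization, zero-padding to the next power of two, and FFT-based cross-correlation) can be sketched as follows; the function names `znorm` and `sbd` are ours, and this is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def znorm(x):
    """z-normalize for scaling invariance: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

def sbd(x, y):
    """Shape-based distance (Eq. 7) in O(m log m) via FFT cross-correlation."""
    m = len(x)
    fft_size = 1 << (2 * m - 2).bit_length()          # next power of two >= 2m - 1
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) *
                      np.conj(np.fft.rfft(y, fft_size)), fft_size)
    # reorder circular lags into CC_w for w = 1..2m-1 (lags -(m-1)..(m-1))
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))
    ncc = cc / np.sqrt(np.dot(x, x) * np.dot(y, y))   # coefficient normalization
    w = int(np.argmax(ncc))
    return 1.0 - ncc[w], w - (m - 1)                  # distance in [0, 2], optimal shift
```

A pairwise comparison then reduces to `sbd(znorm(x), znorm(y))`; the returned shift can be used to align the two sequences before further processing, as in the centroid computation discussed next.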
Figure 5: Ranking of distance measures based on the average of their ranks across datasets. The wiggly line connects all measures that do not perform statistically differently according to the Nemenyi test.

Figure 6: Ranking of k-means variants based on the average of their ranks across datasets. The wiggly line connects all techniques that do not perform statistically differently according to the Nemenyi test.

ants: (i) k-means with DTW as distance measure and the DBA method for centroid computation (k-DBA) [54] and (ii) k-means with a distance measure offering pairwise scaling and shifting of time series and computation of the spectral norm of a matrix for centroid computation (KSC) [69]. As non-scalable methods, among partitional methods we consider the Partitioning Around Medoids (PAM+Dist) implementation of the k-medoids algorithm [33]. Among hierarchical methods, we use agglomerative hierarchical clustering with single (H-S+Dist), average (H-A+Dist), and complete (H-C+Dist) linkage criteria [33]. Finally, among spectral methods, we consider the popular normalized spectral clustering method (S+Dist) [49]. Overall, we compared k-Shape against 20 clustering approaches.

Metrics: We compute CPU time utilization and report time ratios for our comparisons. We use the one-nearest-neighbor classification accuracy to evaluate the distance measures and the Rand Index [57] to evaluate clustering accuracy.

Statistical analysis: We use the Friedman test [19] followed by the post-hoc Nemenyi test [48] for comparison of multiple algorithms over multiple datasets, and we report statistically significant results with a 95% confidence level.

5. EXPERIMENTAL RESULTS

We now provide highlights of the detailed experimental evaluation in [53]. First, we evaluate SBD against the state-of-the-art distance measures. Then, we compare k-Shape against scalable and non-scalable clustering approaches.

Evaluation of SBD: All distance measures, including SBD, outperform ED with statistical significance. The difference in accuracy between SBD and DTW is in most cases negligible: SBD performs at least as well as DTW in 30 datasets. Considering the constrained versions of DTW, we observe that SBD performs similarly to or better than cDTWopt and cDTW5 in 22 and 18 datasets, respectively. To better understand the performance of SBD in comparison with cDTWopt and cDTW5, we evaluate the significance of their differences in accuracy when considered all together. Figure 5 shows the average rank across datasets of each distance measure. cDTWopt is the top measure, with an average rank of 1.96, meaning that cDTWopt performed best in the majority of the datasets. The Friedman test rejects the null hypothesis that all measures behave similarly, and, hence, we proceed with a post-hoc Nemenyi test, to evaluate the significance of the differences in the ranks. The wiggly line in the figure connects all measures that do not perform statistically differently according to the Nemenyi test. We observe that the ranks of cDTWopt, cDTW5, and SBD do not present a significant difference, and ED, which is ranked last, is significantly worse than the others. In terms of efficiency, SBD is only 4.4x slower than ED and remains one order of magnitude faster than cDTWopt and cDTW5. In conclusion, SBD is a very efficient, parameter-free distance measure that significantly outperforms ED and achieves similar results to both constrained and unconstrained versions of DTW.

Evaluation of k-Shape Against Other Scalable Methods: Figure 6 shows the average rank across datasets of each k-means variant. k-Shape is the top technique, with an average rank of 1.89, meaning that k-Shape was best in the majority of the datasets. The Friedman test rejects that all algorithms behave similarly, so we proceed with a post-hoc Nemenyi test, to evaluate the significance of the differences in the ranks. We observe that the ranks of KSC, k-DBA, and k-AVG+ED do not present a statistically significant difference, whereas k-Shape, which is ranked first, is significantly better than the others. Modifying k-means with inappropriate distance measures or centroid computation methods might lead to unexpected results. In terms of efficiency, k-Shape is one order of magnitude faster than KSC, two orders of magnitude faster than k-DBA, and one order of magnitude slower than k-AVG+ED.

Evaluation of k-Shape Against Non-Scalable Methods: To show the robustness of k-Shape in terms of accuracy beyond scalable approaches, we now ignore scalability and compare k-Shape against hierarchical, spectral, and k-medoids methods. Among all existing state-of-the-art methods that use ED or cDTW5 as distance measures, only partitional methods perform similarly to or better than k-AVG+ED. In particular, PAM+cDTW5 is the only method that outperforms k-AVG+ED. Figure 7 shows that k-Shape, PAM+SBD, PAM+cDTW5, and S+SBD (i.e., all methods outperforming k-AVG+ED) do not present a significant difference in accuracy, whereas k-AVG+ED, which is ranked last, is significantly worse than the others.

In short, our experimental evaluation suggests that SBD is as competitive as state-of-the-art measures, such as cDTW and DTW, but faster, and k-Shape is the only method that is both accurate and efficient. In [53], we provide further details on these findings and on the performance of hierarchical and spectral methods as well.

6. CONCLUSIONS

We presented k-Shape, a partitional clustering algorithm that preserves the shapes of time series. k-Shape compares time series efficiently and computes centroids effectively under the scaling and shift invariances. We have identified many interesting directions for future work. For example, k-Shape currently operates over a single time-series representation and cannot handle multiple representations. Considering that several transformations (e.g., smoothing) can reduce noise and eliminate outliers in time series, an extension of k-Shape to leverage characteristics from multiple representations can significantly improve its accuracy. Another future direction is to explore the usefulness of k-Shape as a "subroutine" of other methods. For example, nearest centroid classifiers rely on effective clustering of time series and subsequent extraction of centroids for the clusters.