Cluster and Calendar Based Visualization of Time Series Data
Abstract

A new method is presented to get insight into univariate time series data. The problem addressed here is how to identify patterns and trends on multiple time scales (days, weeks, seasons) simultaneously. The solution presented is to cluster similar daily data patterns, and to visualize the average patterns as graphs and the corresponding days on a calendar. This presentation provides a quick insight into both standard and exceptional patterns. Furthermore, it is well suited to interactive exploration. Two applications, numbers of employees present and energy consumption, are presented.

1 Introduction

Time series data are ubiquitous. The aim of time series analysis is to obtain insight into phenomena, to discover repetitive patterns and trends, and to predict the future. We focus here on the analysis of univariate time series data. Suppose we have collected energy consumption or air pollution data at short time intervals during one year; how can we then extract information from these data?

In the next section we discuss the problem and consider various solutions. Current methods fall short in the analysis of time series data at the various time scales, such as years, weeks, and days. Our new approach is based on a combination of two methods: the use of cluster analysis (section 3) and the visualization of the result on a calendar (section 4). Several applications are presented. In section 5 the strengths and limitations are discussed.

2 Background

Time series data consist of a sequence of N pairs (y_i, t_i), i = 1, ..., N, where y_i is the measured value of a quantity at time t_i. They are the simplest type of data to be analyzed, much simpler than for instance flow data, which consist of a mix of scalar and vector quantities at a multi-dimensional grid. Visualization is trivial: just draw a graph. So what's the problem?

The first is that N can be very large. For instance, measurement at 10-minute intervals during a year yields 52,560 values. The second is that repetitive data patterns often have different time scales. For our applications we usually distinguish three time scales: seasons, weeks, and days. Human activities can vary strongly for these time scales, and hence also the related measured quantities. The third is that clear a priori hypotheses are rarely available. Hence, the user wants to have an overview first; subsequently he may want to zoom in on data and detect peculiar patterns or subsequences, and so on.

How can we analyze time series data? The first approach is to use mathematical models. A well-known method is the ARIMA model of Box and Jenkins [1]. This stochastic model can be used to predict future values, and for an expert, its coefficients give some insight into its time-dependent behavior. But in general the multi-scale aspect is not addressed and, as the price of the very high compression, details are lost.

Transformation from the time domain to a scale space directly addresses the multiple scales that are present in the data. Fourier transforms, wavelet transforms, and fractal analysis [2] are conceivable approaches. They are most suited when the dominant frequencies or time scales are unknown. However, for the type of applications discussed here it is often known a priori that patterns will have a scale of days or weeks, hence such methods are too general. Furthermore, the result after transformation, defined over a frequency or scale-space domain, is much harder to interpret.

Another approach is to use the dependency on time scales explicitly, by considering the data as two-dimensional, for instance as f(day, hour). The data can then be displayed as so-called fingerprints. The days and hours are mapped on different axes, and the data are visualized via color [3]. In addition to color, the third dimension can be used to display the data, yielding a mountain landscape. As an example, figure 1 shows the power demand data of a research facility (in this case ECN). Such images show all data simultaneously. Seasonal trends can be discerned, as well as the typical day pattern. Yet, the variation over the week is harder to discern and the day patterns of Saturdays and Sundays are obscured. Furthermore, in order to see the trends smoothing had to be used, but this eliminates fine details.

[Figure 1: Total kW consumption of ECN, shown as a mountain landscape. Axes: days (1 Jan. to 17 Dec.), hours (0:00 to 24:00), and power in kW (0 to 2000).]

A simple way to get an overview is to average the data. For instance, temperature data over a year can be displayed as a graph of the average daily temperature, combined with a grey-shaded band to show the variation over the day [4]. However, if the data follow a weekly pattern, this technique is less useful, and any pattern within a day is not shown.

This can be overcome by showing multiple graphs. We could show the average day pattern for each month, for each day of the week, and so on. However, information is lost here too. As an example, many data patterns on holidays show strong similarities to data patterns on Sundays. If the data for each weekday are averaged separately, the holidays will disturb the results. To get more precise information and to eliminate cross-over of the various effects, we could make graphs for combinations of time scales, ranging from Sundays in January to Saturdays in December. As a result, the number of graphs to investigate becomes overwhelmingly large, and the difficulty arises how to combine graphs properly and how to extract information.

Let's take a step back. What do we want? We want to elucidate which standard day patterns occur, and how they are distributed over the year and over the week. Furthermore, we want to detect days with patterns that strongly deviate from these standard patterns. If we use multiple graphs, as suggested before, it is implicitly assumed that there is a fixed relation between the distribution of patterns over the months and weekdays. In general, this assumption cannot be tested a priori. An alternative is to drop this assumption, and let the analysis tool decide which daily patterns are similar and show their distribution over the year. This is the basis of our approach: cluster analysis, combined with a calendar based visualization.

3 Cluster analysis

Our aim is to merge similar day patterns into clusters, such that the day patterns within a cluster are more similar to each other than to the day patterns in other clusters. Each cluster contains an average day pattern. To this end, a simple and straightforward bottom-up clustering algorithm suffices [5]. We split the time series data into a sequence of M day patterns. Each day pattern Y_j, j = 1, ..., M consists of a sequence of pairs (y_i, t_i), i = 1, ..., N, where y_i denotes the measured value
and t_i the time that has elapsed since midnight.

We start with M clusters, each cluster containing one day pattern. Next, we compute the mutual differences between all clusters, and merge the two clusters which are most similar into a new cluster. As a result, we have M − 1 active clusters. This step of merging small clusters into larger clusters is repeated until a single cluster results, which contains the average of all day patterns. To speed up the clustering procedure, the calculated differences between clusters are stored in a table, which only has to be updated for new clusters. The result of this algorithm is a binary tree of 2M − 1 clusters.

Various distance measures can be used. Suppose that we have two day patterns y_i and z_i, i = 1, ..., N. A simple measure is the average geometric distance, or root-mean-square distance:

    d_{rms} = \sqrt{ \sum_{i=1}^{N} (y_i - z_i)^2 / N }.

[...] means that we consider two patterns as equal if they are the [...]

Now that we have grouped the day patterns, how can we get insight into the result? A standard way to display the result of clustering is the use of dendrograms, as shown in figure 2. The bottom row shows the initial elements; each next row shows how two clusters are combined. This works fine if the number of elements is small. For more than, say, a hundred clusters such images are much harder to grasp. Figure 3 shows a full clustering tree for 365 day patterns, which was generated by the well-known graph visualization package dot [6] without additional directives for the lay-out. Such an automatic tool yields the best result when the user does not supply lay-out directives that constrain the search for an optimal lay-out. Additional lay-out directives must be used to generate a dendrogram, with the days in their original order on the same row, and with each new cluster on a next row. This yields a highly cluttered image.

[Figure 2: Dendrogram for seven elements, numbered 1 to 7.]

[Figure 3: Full clustering tree for 365 day patterns, generated with dot. Leaves are labeled with dates (day/month), internal nodes with cluster numbers.]

We have developed a combined representation of daily patterns and clusters. Patterns are shown as graphs, clusters are shown on a calendar. Colors indicate corresponding clusters and patterns. As an example, figure 4 shows a result of a cluster analysis of time series data on the number of employees present at ECN. The seven most significant clusters are shown. On the right, the average value per cluster is shown as a colored graph; on the left, each day in the calendar is colored according to the cluster to which it belongs.
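The steps of section 3 and the calendar coloring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout of a cluster, and the sample data are hypothetical, and it assumes that two clusters are compared via the root-mean-square distance between their average day patterns.

```python
import math


def split_into_days(values, samples_per_day):
    """Cut a flat series into consecutive day patterns of equal length."""
    if len(values) % samples_per_day != 0:
        raise ValueError("series must cover a whole number of days")
    return [values[i:i + samples_per_day]
            for i in range(0, len(values), samples_per_day)]


def rms_distance(y, z):
    """Root-mean-square distance between two equally long patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, z)) / len(y))


def cluster_days(days, distance=rms_distance):
    """Bottom-up clustering of M day patterns into a tree of 2M - 1 nodes.

    Nodes 0..M-1 are single days; every later node merges two earlier ones.
    Assumption: clusters are compared via their average patterns.
    """
    nodes = [{"mean": list(d), "days": [i], "children": None}
             for i, d in enumerate(days)]
    active = list(range(len(nodes)))
    # Table of distances between all pairs of active clusters.
    table = {(a, b): distance(nodes[a]["mean"], nodes[b]["mean"])
             for i, a in enumerate(active) for b in active[i + 1:]}
    while len(active) > 1:
        a, b = min(table, key=table.get)          # most similar pair
        na, nb = nodes[a], nodes[b]
        wa, wb = len(na["days"]), len(nb["days"])
        mean = [(u * wa + v * wb) / (wa + wb)
                for u, v in zip(na["mean"], nb["mean"])]
        new = len(nodes)
        nodes.append({"mean": mean, "days": na["days"] + nb["days"],
                      "children": (a, b)})
        # Only entries that involve the new cluster need to be computed.
        active = [c for c in active if c not in (a, b)]
        table = {pair: d for pair, d in table.items()
                 if a not in pair and b not in pair}
        for c in active:
            table[(c, new)] = distance(nodes[c]["mean"], mean)
        active.append(new)
    return nodes


def top_clusters(nodes, k):
    """Undo the last k - 1 merges, leaving (at most) k clusters."""
    current = [len(nodes) - 1]                    # start at the root
    while len(current) < k:
        newest = max(current)
        if nodes[newest]["children"] is None:     # only single days left
            break
        current.remove(newest)
        current.extend(nodes[newest]["children"])
    return sorted(current)


def label_days(nodes, cluster_ids):
    """Map each day index to the cluster it belongs to (its calendar color)."""
    return {day: cid for cid in cluster_ids for day in nodes[cid]["days"]}


# Made-up data: six "days" of four samples each, not taken from the paper.
series = [0, 10, 10, 0,  0, 11, 9, 0,  0, 1, 1, 0,
          0, 10, 11, 0,  0, 0, 1, 0,  0, 9, 10, 0]
days = split_into_days(series, samples_per_day=4)
tree = cluster_days(days)
labels = label_days(tree, top_clusters(tree, k=2))
print(len(tree), labels)
```

With measurements at 10-minute intervals, `samples_per_day` would be 144. Because nodes are numbered in order of creation, undoing the most recent merges first yields the coarsest decomposition into k clusters; whether the original tool selects its "most significant" clusters in exactly this way is an assumption of this sketch.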
[Figure 4: Cluster viewer, number of employees present at ECN in 1997. Left: calendar with the days colored by cluster. Right: graphs of the number of employees for 5/12/1997, 31/12/1997, and clusters 710, 718, 719, 721, and 722.]
Several conclusions can be drawn from this image. We see that:

• Office hours are followed strictly. Most people arrive between 8:30 and 9:00 am, and leave between 4:00 and 5:00 pm. Furthermore, in the morning the number of employees present is slightly higher than in the afternoon;

• On Fridays and in the summer fewer people are present (cluster 722);

• On Fridays in the summer even fewer people are present (cluster 718);

• In the weekend and at holidays only very few people are working (cluster 710): security and fire brigade;

• Holidays in the Netherlands in 1997 were January 1st, March 28th, March 31st, April 30th, May 5th, May 8th, May 19th, December 25th and 26th;

• School vacations are visible in Spring (May 3rd to May 11th), in Autumn (October 11th to October 19th), and in Winter (December 21st to December 31st);

• Many people take a day off after a holiday (cluster 721);

• On December 5th many people left at 4:00 pm. Dutch people will immediately know the explanation: on this day we celebrate Santa Claus and are allowed to leave earlier!

We see that for this distribution of patterns quite plausible explanations exist. The advantage of clustering is that none of these explanations have to be inserted a priori, such as separating working days and holidays, and all effects are elucidated automatically. The combined representation of average graphs and clusters enables a user to quantify these effects easily. Another strong point is that standard patterns (cluster 719) as well as exceptional patterns (December 5th) are detected automatically.

4.2 Interaction

For effective data exploration, user interaction is as important as presentation. The combination of cluster analysis with a calendar representation provides good opportunities for interaction. We have embedded our presentation in an interactive system for the analysis of time series data, such that the user can interact with the image presented to him (such as figure 4) in many ways.

Selection of the data to be displayed can be done easily. Initially, no days are selected for display. The user can toggle days for display via point-and-click on a single day, on the label of a month, or on the label of the year. All days are then displayed as separate graphs. The user can point-and-click on a graph, upon which the corresponding day on the calendar is highlighted. Exceptional patterns are thus easy to locate.

When the user selects a day, a typical next question is which other days have a similar pattern. This is where the cluster analysis comes in. The user can select a day, and ask for more similar days via a single button press. The system determines the parent cluster, shows the average graph of this cluster and highlights the days within the cluster via color. This step can be repeated and reversed, so that the user can interactively enlarge and shrink the cluster to be displayed. Also, the user can select other days, and inspect several clusters simultaneously.

In addition to this bottom-up approach, the user can show clusters top-down. The user can select the number of clusters to be displayed, upon which the system generates a partitioning of the year as shown in figure 4. Via two more/less buttons, the user can add and remove clusters, until a meaningful decomposition is made.

The full clustering process itself is done initially and later on request of the user. Our non-optimized version takes about 5 seconds on a PC with a Pentium 100 MHz processor, which is quite acceptable for interactive use. The clustering tree is stored and re-used upon each query. Reclustering has to be done if the user wants to use a different distance measure. As an additional option, the user can reduce or enlarge the time interval upon which the comparison has to be made. For instance, if he finds in a graph a strange peak occurring between 9:00 am and 10:00 am, he can select this interval graphically, and ask for a reclustering using only this time interval and the d_{sh} measure. As a result, all days with a similar peak in this time interval are clustered.

Many standard options are further provided for the display of the graphs. The user can zoom in and out, show the standard deviation for a cluster, and show each member of the cluster individually. Smoothing, with different filters and a user-controllable width, can be applied, which is useful if noisy data have to be processed. Clusters can be generated from these smoothed data. Some straightforward additional options could be fitted easily within our framework. The use of the following distance measure:

    d_{mn} = 5000 | y_{mon}/6 - z_{mon}/6 | + 1200 | y_{mon}/3 - z_{mon}/3 | + 400 | y_{mon}/3 - z_{mon}/3 | + | y_{day} - z_{day} |,

where y_{mon} and y_{day} are the number of the month and the day respectively, gives a balanced clustering of the year in half-years, quarters of a year, months, etcetera. This enables a user to view standard averages and slow trends, using the same methods for interaction as with the content based clustering. With a slightly modified measure, a separation into weekdays can also be made.

Also, a simplified clustering method was implemented: starting at a selected day, all other days are added one after another in order of their distance to this initial day. This option is useful to determine stepwise whether certain patterns are exceptions or not, again using the same methods for interaction.

4.3 Application

The background of our interest in time series data is the liberalization of the energy markets. In the Netherlands, customers with very high energy consumption have recently been allowed to choose their gas and electricity supplier and negotiate a tailor-made tariff. Other customers will follow in the next few years. This will strongly enhance competition between the energy distribution companies, which will have to transform themselves from utility companies into market-oriented companies. Insight into consumption patterns is essential for the segmentation of their customer markets. But customers themselves also need insight into their energy consumption patterns in order to lower consumption, avoid peak rates, and negotiate a lower tariff.

Our aim is to develop methods, techniques, and tools that enable customers to analyze their energy consumption patterns easily and effectively. We started with a study of the electricity consumption at ECN itself. After collection of data several analysis and visualization methods were tried. The time series data analysis tool described here proved to be very helpful.

Figure 5 shows a cluster analysis of the power consumption. The five main clusters are shown here. During weekends power consumption was fairly constant. Furthermore, four clusters with about the same patterns but different plateau levels emerge. The correlation with the seasons is clearly visible. Finally, in the morning of February 4th a high peak demand occurred.

[Figure 5: Cluster viewer, power consumption (kW) of ECN in 1997. Left: calendar with the days colored by cluster. Right: graphs of the power consumption for 4/2/1997 and clusters 706, 714, 720, 722, and 723.]

5 Discussion

We have presented a new method for the exploration and analysis of extensive time series data. The combination of cluster analysis and calendar based visualization turned out to be highly effective. Images such as figures 4 and 5 can be generated almost effortlessly, and provide a good insight.

The next step to be made is the extension to the interactive visualization and analysis of several variables simultaneously. This enables a user to study correlations between variables, either manually or automatically. Detected correlations can lead the user in the direction of a suitable
model. Model parameters can subsequently be estimated by a regression method, and a statistical analysis of the model residuals will indicate the validity of the model. Adopting this procedure in the study of ECN energy consumption, a linear model was identified which could accurately predict the power consumption from the sunlight intensity and the number of employees [7]. We used different packages for this; integration of such methods in a single tool would be highly effective.

In conclusion, we think that our cluster and calendar based analysis is a useful method to explore and visualize large quantities of univariate time series data, and provides a sound basis for a general analysis tool.

References

[4] Tufte, E.R. The Visual Display of Quantitative Information. Graphics Press, 1983.

[5] Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.

[6] Gansner, E.R., Koutsofios, E., North, S., and Vo, K-P. A Technique for Drawing Directed Graphs. IEEE Transactions on Software Engineering 19 (3), pp. 214-230, 1993.

[7] Selow, E.R. van, Wijk, J.J. van, Jehee, J.N.T. Identification and Visualization of Energy Consumption Patterns. In: Proceedings of DistribuTECH DA/DSM Europe, Pennwell, London, October 1998.
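As an illustration of the simplified clustering method and the time-interval selection of section 4.2, the following Python sketch orders all other days by their distance to a selected day, optionally comparing only a part of the day. It is a sketch under stated assumptions, not the tool's actual code: the function names and the sample data are hypothetical, and the root-mean-square distance is used in place of the d_{sh} measure mentioned in the text.

```python
import math


def rms_distance(y, z, start=0, stop=None):
    """RMS distance over samples start..stop-1 (the whole day by default)."""
    ys, zs = y[start:stop], z[start:stop]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ys, zs)) / len(ys))


def days_by_similarity(days, selected, start=0, stop=None):
    """Indices of all other days, the most similar to `selected` first."""
    reference = days[selected]
    others = [i for i in range(len(days)) if i != selected]
    return sorted(others,
                  key=lambda i: rms_distance(reference, days[i], start, stop))


# Made-up day patterns of four samples each; day 0 has a peak at sample 2.
days = [[1, 1, 5, 1], [1, 1, 1, 1], [4, 4, 5, 4], [1, 1, 4, 1]]
whole_day = days_by_similarity(days, selected=0)
peak_only = days_by_similarity(days, selected=0, start=2, stop=3)
print(whole_day, peak_only)
```

Restricting the comparison to the samples around the peak changes the ranking: a day with the same peak but a different base level moves to the front, which is the effect the interval selection is meant to achieve.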