Visual Analytics for Data Scientists
Natalia Andrienko • Gennady Andrienko
Georg Fuchs • Aidan Slingsby • Cagatay Turkay
Stefan Wrobel
Natalia Andrienko · Gennady Andrienko
Fraunhofer Institute Intelligent Analysis and Information Systems IAIS
Schloss Birlinghoven, Sankt Augustin, Germany
Department of Computer Science, City, University of London
Northampton Square, London, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To our families, friends, colleagues and
partners
Preface
At the same time, the use of visual representations in data analysis has become a widely accessible and seemingly easy activity due to the appearance of the Python, R, and JavaScript languages and the proliferation of open-access packages with code for data processing, analysis, modelling, and visualisation. The creation and execution of analytical workflows involving both computations and visualisations are supported by the Jupyter Notebook application, and there are myriads of analytical notebooks created by various people and published on the Web. These notebooks are often taken up by others, who adapt them to their needs or use them as examples of what is possible.
While this is a very positive development, it has a downside. The notebooks are often created or adapted by people who have little idea of how to choose appropriate visualisation techniques and design correct and effective visualisations of the data they deal with, and who also lack a good understanding of why, when, and how visualisations should be used in analysis. Some visualisations occurring in the publicly accessible example notebooks may look impressive and convincing to non-specialists, but, in fact, they may communicate spurious patterns in inadequate ways. Those who view these visualisations and think of doing the same for their data and tasks often lack the knowledge that would enable critical assessment and understanding of the suitability of the techniques. Other notebooks include only basic graphics of little analytical value, even though better ways exist for representing the relevant information.
Besides poor visualisation literacy, another trait detrimental to analysis is a propensity to uncritically trust computers and to take the outcome of a single run of an analysis algorithm, with default parameter settings or with settings previously used by someone else, as the final result. Naive analysts may not realise that a slight change in the data or parameters can sometimes significantly change the result; therefore, they may not bother to examine the reaction of the algorithm to such changes and to check the results of several runs for consistency. More experienced and critically minded analysts, who usually take the trouble to evaluate and compare what they get from computers, may tend to rely solely on statistical measures rather than trying to gain better understanding with the help of visualisations.
Visual analytics has not only generated a body of knowledge on how to create meaningful visualisations and how to use them effectively in data analysis together with computer operations but also developed a philosophy that should underlie analytical activity. Its main principles are the primacy of human understanding and reasoning and awareness of the weaknesses of computers, which cannot see, understand, and think, and thus need to be led and controlled by humans. Both the knowledge and the philosophy should be transferred to practitioners to help them do better analyses and come to valid conclusions. This textbook is our attempt to do that.
In this book, we do not aim to present the latest results of visual analytics research or the most innovative and advanced techniques and methods. Instead, we present the main principles and describe the techniques and approaches that are ready to be put into practice, which means that they, first, have proved their utility and, second, can be reproduced with moderate effort. We put emphasis on describing examples of analyses, in which we explain the need for and the use of visualisations.
This book is a result of our collaboration with many different groups of partners.
We are thankful to
1 https://www.city.ac.uk/study/courses/postgraduate/data-science-msc
2 www.iais.fraunhofer.de/en.html
3 https://www.iais.fraunhofer.de/en/institute/departments/knowledge-discovery-en.html
4 www.city.ac.uk
5 www.gicentre.net
Acknowledgements
Writing of the book was financially supported by the German Priority Research Pro-
gram SPP 1894 on Volunteered Geographic Information, EU projects Track&Know
and SoBigData++, EU SESAR project TAPAS, and Fraunhofer Cluster of Excel-
lence on “Cognitive Internet Technologies”.
Contents

2 General Concepts
2.1 Subjects of analysis
2.2 Structure of an analysis subject
2.3 Using data to understand a subject
2.3.1 Distribution
2.3.2 Patterns and outliers
2.3.3 Patterns in different kinds of distributions
2.3.4 Co-distributions
2.3.5 Spatialisation
2.4 Concluding remarks
14 Conclusion
14.1 What you have learned about visual analytics
14.2 Visual analytics way of thinking
14.3 Examples in this book
14.4 Example: devising an analytical workflow for understanding team tactics in football
14.4.1 Data description and problem statement
14.4.2 Devising the approach
14.4.3 Choosing methods and tools
14.4.4 Implementing the analysis plan
14.4.5 Conclusion
14.5 Final remarks
Glossary
References
Index
However, visual analytics is not only the science and conduct of careful and ef-
fective use of computational techniques, but it is, first and foremost, the science of
human analytical reasoning, which does not necessarily require the involvement
of sophisticated computing but does require appropriate representation of informa-
tion to the human, so that it can be used in the reasoning. Visual representation is
acknowledged to be the best for this purpose, and it is the task of computers to gen-
erate these representations from available data and results of computations. In the
following, we present and discuss an example of an analysis process in which visual
representations inform human reasoning.
Some readers of this book may have only a vague idea of what visual analytics is and may not understand why and how it can be useful for data scientists. Others may believe that visualisation can be good for communicating ideas and presenting analysis results but may not think of visual displays as tools for doing analysis. To help both categories of readers grasp the basic idea of visual analytics, we shall start with an example showing visual analytics approaches at work. We shall then discuss this example and outline in a more general way where and how visual analytics fits into data science workflows. The example comes from the IEEE VAST Challenge 2011 [1]. Although the data are synthetic, they were carefully constructed to resemble real data as much as possible. A good feature of these data is that they contain an interesting and even dramatic story, while similar real data may be either uninteresting or unavailable. Let’s dive into the story.
puters, and cellular phones. The second one contains map information for the entire
metropolitan area. The map dataset contains a satellite image with labelled high-
ways, hospitals, important landmarks, and water bodies (Fig. 1.1). There are also
supplemental tables for population statistics and observed weather data.
We need to acknowledge that the available data do not directly represent disease
occurrences; they just contain texts that may mention disease symptoms. We should
not assume that the locations and times specified in the microblog records men-
tioning disease symptoms are the actual locations and times of disease occurrences.
People may write about their health condition not necessarily immediately after
getting sick and not necessarily from the location where they first felt some health
problems. We should also keep in mind that not everyone who gets sick would send
1.2 A motivating example: Investigating an epidemic outbreak 7
a message about it, whereas some people may send more than one message. People
may also write about someone else being sick. Besides, messages mentioning dis-
ease symptoms may appear not only during the time of epidemic outbreak but also at
any other time. These specifics have the following implications for the investigation
we need to do:
• We should keep in mind that the distribution of the microblog posts can give us
only a rough approximation of the distribution of the disease cases.
• The epidemic may manifest itself as patterns of increased temporal frequency and spatial density of disease-mentioning messages. This is what we shall try to find.
Hence, among all data records contained in the dataset, we need to identify the
subset of the data that are related to the epidemic. This subset has two major char-
acteristics: first, the texts of the messages include disease-related terms; second, the
temporal frequency of posting such messages is notably higher than usual.
First we need to select the data that are potentially relevant to the analysis goals,
that is, the messages mentioning health disorders. The task description lists some
symptoms that were observed: fever, chills, sweats, aches and pains, fatigue, cough-
ing, breathing difficulty, nausea and vomiting, diarrhoea, and enlarged lymph nodes.
These keywords can be used in a query for extracting potentially relevant data
records.
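Such a keyword-based selection can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool used in the example: the message texts and the exact term list below are made up.

```python
import re

# Illustrative subset of the symptom keywords from the task description
SYMPTOM_TERMS = {"fever", "chills", "sweats", "fatigue", "coughing",
                 "nausea", "vomiting", "diarrhoea"}

def mentions_symptoms(text, terms=SYMPTOM_TERMS):
    """Return True if the message text contains any of the given terms."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(terms)

# Hypothetical microblog messages
messages = [
    "woke up with fever and chills, staying home",
    "great fried chicken at the mall today",
    "nausea all morning, can't keep anything down",
]
# Keep only the potentially relevant records
relevant = [m for m in messages if mentions_symptoms(m)]
```

A real query tool would also match word stems and phrases; exact whole-word matching is the simplest baseline.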
We perform querying in an interactive way. We start by putting the keywords from the task description in the query condition. After the query selects a subset of messages that include any of these keywords, we apply a tool that extracts the most frequent terms from these messages (excluding so-called “stop words” such as articles, prepositions, and pronouns) and creates a visual display called a text cloud, or word cloud (Fig. 1.2), using font size to represent word frequencies. In this display, we find other disease-related terms (e.g., flu, stomach, sick, doctor) that occur in the selected messages together with the terms that have been used in the query condition. We extend the query condition by adding these terms; the query extracts additional messages; in response, the word cloud display is updated to show the frequent words and word combinations from the extended subset of messages. We also find that some frequently used words shown in the word cloud are irrelevant (e.g., come, case, today, day, night), add them to the list of stop words, and make the word cloud update after the exclusion of these words (Fig. 1.3).
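The term frequencies behind such a word cloud can be computed with a simple counter. The sketch below, with made-up messages and a tiny hand-picked stop-word list, mirrors the interactive loop described above: the analyst-spotted irrelevant words (come, case, today, day, night) have been added to the stop words.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on",
              "i", "my", "is", "it", "with", "at", "bad", "high",
              # irrelevant frequent words added interactively, as in the example
              "come", "case", "today", "day", "night"}

def term_frequencies(messages, stop_words=STOP_WORDS):
    """Count word occurrences across messages, excluding stop words.
    The resulting counts would drive the font sizes in a word cloud."""
    counts = Counter()
    for text in messages:
        counts.update(w for w in re.findall(r"[a-z]+", text.lower())
                      if w not in stop_words)
    return counts

msgs = ["fever and chills today", "high fever, bad cough", "cough at night"]
freqs = term_frequencies(msgs)
```

Here `freqs` maps each remaining term to its frequency; a plotting library (e.g., the `wordcloud` package) would then scale font sizes by these counts.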
Fig. 1.2: Frequent terms extracted from the messages satisfying filter conditions.

Fig. 1.3: The word cloud display has been updated in response to changing the query condition and extending the list of stop words.

Fig. 1.4: Frequent words appearing in the messages containing the terms ‘chicken’ and ‘flu’.

Now we notice word combinations that appear irrelevant to the epidemic: “chicken flu” and “fried chicken flu” (Fig. 1.3). We apply another query to the selected subset of messages, which selects only the messages containing the terms ‘chicken’ and ‘flu’. The word cloud changes as can be seen in Fig. 1.4. We also compare the temporal frequency distribution of all messages containing some disease-related terms with that of the messages containing the terms ‘chicken’ and ‘flu’. For this purpose, we use an interactive filter-aware time histogram, as in Fig. 1.5. The upper image shows the state of the time histogram after selecting the subset of messages containing any of the disease-related terms. Each bar corresponds to one day. The whole bar height is proportional to the total number of messages posted on that day, whereas the dark segment represents the number of messages satisfying the current filter condition, i.e., containing any of the disease-related terms. We see that the frequency of such messages notably increases in the last three days. This corresponds to the statement in the task description: “During the last few days, health professionals at local hospitals have noticed a dramatic increase in reported illnesses”.
Fig. 1.5: Top: The time histogram shows the temporal frequency distribution of the messages containing disease-related terms. Bottom: The time histogram shows the temporal frequency distribution of the messages containing the terms ‘chicken’ and ‘flu’.
The lower image in Fig. 1.5 shows the state of the time histogram after selecting the messages containing the words ‘chicken’ and ‘flu’. Since the dark segments were small and hardly visible (because of the low proportion of the selected messages among all messages), we have changed the vertical scale of the histogram using an interactive focusing operation. We see that the messages related to the chicken flu are distributed more evenly throughout the time period covered by the data. The highest frequency of such messages was attained on the seventh day, i.e., long before the increase in the number of disease-related messages. This indicates that the messages mentioning the chicken flu are indeed irrelevant to the analysis task and should be filtered out. So, we exclude these messages from further consideration.
The remaining set consists of 79,579 messages, which is 7.8% of the original set of 1,023,077 messages.
The temporal histogram (Fig. 1.5, top) shows us that the epidemic happened in the last three days, which are represented by the three rightmost bars of the histogram. More specifically, 59,761 of the 79,579 disease-related messages (75%) were posted in the last three days. However, we want to identify the time of the epidemic start more precisely.
We use a time histogram with hourly temporal resolution, i.e., each bar corresponds to a time interval of one hour (Fig. 1.6). There we see that the temporal frequency of the disease-related messages increased starting from 1 o’clock on May 18, then a very sharp increase happened at 9 o’clock of the same day, and a high peak occurred at 18 o’clock of that day. In the remaining days, the frequency was stably high except for drops in the night times (between 0 and 2 o’clock), which gives us some evidence that the frequency increase observed at 1 o’clock on May 18 is indeed due to the start of the epidemic outbreak.
Fig. 1.6: A time histogram shows the temporal frequency distribution of the disease-
related messages in the last 5 days by hourly intervals.
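The binning behind such an hourly time histogram is straightforward: each timestamp is truncated to the start of its hour and the occurrences are counted. A minimal sketch with hypothetical posting times:

```python
from collections import Counter
from datetime import datetime

def hourly_histogram(timestamps):
    """Bin posting times into one-hour intervals; the bin counts are the
    bar heights of an hourly time histogram."""
    return Counter(t.replace(minute=0, second=0, microsecond=0)
                   for t in timestamps)

# Hypothetical posting times on the morning of May 18
posts = [datetime(2011, 5, 18, 1, 5), datetime(2011, 5, 18, 1, 40),
         datetime(2011, 5, 18, 9, 12), datetime(2011, 5, 18, 9, 55),
         datetime(2011, 5, 18, 9, 59)]
hist = hourly_histogram(posts)
```

With real data one would plot `hist` as a bar chart over consecutive hours; the choice of bin width (day vs. hour) is exactly what distinguishes Fig. 1.5 from Fig. 1.6.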
To analyse the spatial distribution of the outbreak-related messages (i.e., the mes-
sages mentioning disease symptoms that were posted in the last 3 days), we use a dot
map (Fig. 1.7, top) in which the messages are represented by dots (small circles) in
yellow. We observe quite prominent spatial patterns, namely, spatial clusters, which
appear as areas with high density of the circle symbols. Please note that in this
and following maps of the message distributions we adjust the transparency of the
symbols so that the patterns are best visible.
Fig. 1.7: Top: The dot map shows the spatial distribution of the epidemic-related
messages. Bottom: The dot map shows the spatial distribution of the disease-
unrelated messages.
Fig. 1.8: The bar diagrams drawn within district boundaries show the ratios of the numbers of the disease-related messages in the three days of the epidemic to the average daily numbers of the messages posted in the districts before the epidemic outbreak.
To check whether the high density of the outbreak-related messages in the centre is
due to the high spread of the disease in this area or due to the usual high message
posting activity, we perform some calculations based on the available data. Using the
boundaries of the city districts (visible in Fig. 1.7, bottom), we compute the average
daily number of the messages posted in each district before the beginning of the
outbreak. We also compute the number of disease-related messages posted in each
of the three days of the epidemic. From these numbers, we compute the ratios of
the numbers of the epidemic-related messages to the average daily message counts.
The computed numbers are represented by bar diagrams in the map in Fig. 1.8.
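The normalisation step can be sketched as follows. All district names and counts below are invented for illustration; only the computation itself follows the description above.

```python
def outbreak_ratios(daily_counts_before, outbreak_counts):
    """For each district, divide the per-day counts of disease-related
    messages during the outbreak by the district's average daily message
    count before the outbreak."""
    ratios = {}
    for district, before in daily_counts_before.items():
        avg = sum(before) / len(before)
        ratios[district] = [round(c / avg, 2)
                            for c in outbreak_counts[district]]
    return ratios

# Hypothetical numbers: daily message counts before the outbreak,
# and disease-related message counts on the three outbreak days
before = {"Downtown": [200, 220, 210], "Uptown": [100, 90, 110]}
outbreak = {"Downtown": [630, 420, 315], "Uptown": [110, 120, 100]}
r = outbreak_ratios(before, outbreak)
```

A ratio well above 1 signals more disease-related traffic than the district's usual posting activity explains, which is exactly the check performed with Fig. 1.8.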
The yellow, red, and cyan bars in Fig. 1.8 correspond to the first, second, and third day of the outbreak, respectively.

Fig. 1.9: The red dots are put in a space-time cube, where the horizontal plane represents the geographic space and the vertical dimension represents time, according to the spatial locations and posting times of the outbreak-related messages. The cube thus shows the spatio-temporal distribution of the messages.
We see that one of the two central districts, called Downtown, has notably higher relative numbers of disease-related messages on the first day of the outbreak than the other districts, except the ones on the east and southeast. Hence, it can be concluded that this district was indeed hit by the outbreak on the first day. The other central district, called Uptown (northeast of Downtown), has only a slightly higher relative number of outbreak-related messages than other districts. However, this district covers a relatively small part of the dense cluster of disease-related messages. Hence, we can conclude that we see the cluster in the city centre because this area was hit by the outbreak and not just because it usually has high message posting activity.
The bar diagram map in Fig. 1.8 already indicates that the spatial distribution of the outbreak-related messages was not the same during the three days of the epidemic. To see the evolution of the spatial distribution in more detail, we use a space-time cube (STC) display (Fig. 1.9). It is a perspective view of a 3D scene where the horizontal plane represents the geographic space and the vertical dimension represents time going in the direction from the bottom to the top. The epidemic-related messages are represented by dots (in red, drawn with a low degree of opacity) positioned in the cube according to their spatial locations and posting times.
The observed gaps along the vertical dimension of the STC (i.e., time intervals of
low density of the dots) correspond to the night drops in the message numbers ob-
served earlier in a time histogram (Fig. 1.6). These gaps separate the three days of
the outbreak.
We see that three very dense spatio-temporal clusters of messages, i.e., very high concentrations of messages in space and time, emerged on the first day of the epidemic. We use a dot map to see better the spatial footprints of the clusters (Fig. 1.10, top). It appears that the disease might have originated from these three places, or, even more probably, these were areas visited by many people on the first day of the outbreak. Relatively high message density was also observed to the east of the three central clusters. By the end of day 1 and during day 2, the spatial spread of the messages increased; in particular, the density of the messages increased in the southwest of the city. On the third day, multiple spatially compact clusters emerged. The map in Fig. 1.10, bottom, shows that these clusters are located around hospitals, which indicates that ill people came to hospitals.
Ill people might have posted messages concerning their health condition multiple times. To see how the disease spread, it is reasonable to look only at the distribution of the messages where disease symptoms were mentioned for the first time. To separate such messages from the rest, we apply another transformation to the data. Each record contains an identifier of the person who posted the message. We link the disease-mentioning messages of each person into a chronological sequence. There are 27,446 such sequences, and this is the number of individuals who supposedly got sick (37% of the 73,928 distinct individuals’ identifiers occurring in the dataset). The lengths of the sequences vary from 1 to 6. Now we take only the first message from each sequence and look at the spatio-temporal distribution of these messages using a space-time cube display, as in Fig. 1.11. The distribution differs from that of all outbreak-related messages (Fig. 1.9). Most notably, we don’t see the hospital-centred clusters on the third day. The three very dense clusters that emerged in the city centre on the first day dissolved on the second day. We see a zone of increased message density stretching from the centre to the east on the first day and another zone that formed on the second day in the southwest of the city. Based on these observations, we conclude that the disease started in the centre and spread to the east on the first day. On the second day, the outbreak hit the southwestern part of the city. In the city centre, the frequency of the disease cases remained quite high during the second and third days. Beyond the observed clusters, the remaining messages were scattered over the whole territory.

Fig. 1.10: Top: Three dense clusters that emerged in the city centre on the first day of the outbreak. Bottom: The distribution of the disease-related messages on the third day of the outbreak.
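The first-mention transformation described above (group records by person, order chronologically, keep the earliest) can be sketched like this; the record structure `(person_id, timestamp, text)` and the sample values are assumptions for illustration.

```python
def first_mentions(records):
    """Keep, for each person, only their chronologically first
    disease-mentioning record. Records are (person_id, timestamp, text)."""
    by_person = {}
    for pid, ts, text in records:
        # Replace the stored record if this one is earlier
        if pid not in by_person or ts < by_person[pid][1]:
            by_person[pid] = (pid, ts, text)
    return list(by_person.values())

# Hypothetical records with integer timestamps
records = [("u1", 5, "fever again"),
           ("u1", 2, "got a fever"),
           ("u2", 3, "stomach ache")]
firsts = first_mentions(records)
```

With the challenge data, the number of such first mentions (27,446) estimates the number of individuals who got sick, and their space-time distribution feeds the cube in Fig. 1.11.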
Fig. 1.11: The space-time cube shows the spatio-temporal distribution of the mes-
sages where people mention disease symptoms for the first time.
If we consider the first two days of the epidemic, when the disease was spread-
ing (we can disregard the third day when the messages mostly concentrated around
the hospitals), we basically see two major areas highly affected by the outbreak:
the centre and east of the city and the southwest. The latter area was affected later
than the former. We need to understand why it was so. From the weather data pro-
vided together with the messages, we learn that on May 18 (i.e., on the first day
of the outbreak), there was wind from the west and on the next day from the west
and northwest. This could explain the propagation of the disease to the east but not
to the southwest. We wonder whether the disease symptoms were the same in the
two areas. The illustrations in Fig. 1.12 show that this is not so. We have used an
interactive spatial filtering tool for selecting the messages from the central-eastern
area and from the southwestern area and looked at the corresponding frequent keywords in the word cloud display. The most frequent keywords in the centre and on the east were chills, fever, cough, headache, and other words indicating flu-like symptoms. In the southwestern area, the most frequent symptoms were stomach disorders. Hence, these two areas were hit by different diseases, which probably have different transmission mechanisms. Most likely, the flu-like disease was transmitted from the centre to the east by the western wind, but this does not apply to the stomach disorders.
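The comparison just described combines a spatial filter with keyword counting. A minimal sketch, in which the rectangular areas, coordinates, and message texts are all invented (a real spatial filter would use arbitrary polygons on the map):

```python
import re
from collections import Counter

def in_box(x, y, box):
    """box = (xmin, ymin, xmax, ymax); True if the point lies inside."""
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def frequent_words(texts):
    """Count word frequencies in the given message texts."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z]+", t.lower()))
    return counts

# Hypothetical georeferenced messages: (x, y, text)
msgs = [(2, 2, "fever chills"), (2, 3, "fever headache"),
        (9, 1, "stomach cramps"), (8, 2, "vomiting stomach")]
central_east = (0, 0, 5, 5)  # assumed bounding box, central-eastern area
southwest = (6, 0, 10, 5)    # assumed bounding box, southwestern area

ce_words = frequent_words(t for x, y, t in msgs if in_box(x, y, central_east))
sw_words = frequent_words(t for x, y, t in msgs if in_box(x, y, southwest))
```

Comparing `ce_words` and `sw_words` reveals flu-like terms in one area and digestive terms in the other, which is the evidence for two distinct diseases.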
Fig. 1.12: Top: Selecting messages from two outbreak-affected areas by a spatial
filter. Middle and bottom: The frequent keywords that occurred in the messages in
these two areas.
On the map, we can observe that the dense message clusters in the southwest are aligned along a river; hence, the stomach disease could be transmitted by the water flow in the river.
Could both diseases have a common origin? On the upper map in Fig. 1.7, the cen-
tral and southwestern clusters seem to emanate from a point where a motorway
represented by a thick dark red line crosses the river. It is probable that there was a
common reason for both diseases. We come to a hypothesis that some event might have happened on or near the motorway bridge before the 18th of May, causing toxic or infectious substances to be discharged into the air and the river. Such an event might have left traces in the microblog messages. To check this, we apply spatial
filtering to select the area around the bridge and temporal filtering to select the day
before the outbreak started. The word cloud display (Fig. 1.13) indicates that a truck
accident occurred in this place causing a fire and spilling of cargo. Evidently, the fire
produced some toxic gas that contaminated the air, and the spilled cargo contained
some toxic substance that contaminated the water.
Fig. 1.13: The frequent words and combinations from the messages posted near the
motorway bridge on May 17.
What is the tendency in the outbreak development? Does the disease continue spreading? Are any actions required to stop the spread, or do the health professionals mainly need to help people who already got sick? To answer these questions,
we first look at the time histogram of the frequencies of the messages that mention
disease symptoms for the first time (Fig. 1.14). The histogram bars are divided into
segments of two colours, red for the messages mentioning digestive disorders and
blue for the remaining messages. We see that the overall frequency of the messages
gradually decreases, meaning that the outbreak goes down. This also indicates that
the disease, most likely, is not transmitted from person to person; otherwise, we
would observe an increasing rather than decreasing trend.
Fig. 1.14: The time histogram of the frequencies of the first mentioning of the dis-
ease symptoms. The red bar segments correspond to the messages mentioning di-
gestive disorders and the blue segments to the remaining messages.
Fig. 1.15: The spatio-temporal distribution of the messages mentioning health prob-
lems for the first time. The messages mentioning digestive disorders are represented
in red and the remaining messages in blue.
Looking separately at the red and blue segments, we can notice that the frequencies
of the new messages mentioning digestive disorders (red) are much lower on the
third day than on the second day, whereas the frequencies of the messages mention-
ing flu-like symptoms are almost as high as on the second day, which means that
the morbidity rate does not decrease as fast as desired. We propagate the red-blue colouring also to the space-time cube (Fig. 1.15). We see that new mentions of digestive disorders appear mainly in the southwest, as on the second day, which means that the water remains contaminated. The new mentions of the flu-like
symptoms are scattered everywhere but the highest concentration is in the centre. It
may mean that some traces of contamination still remain in this area, and it would
be good to clean it somehow. It is also reasonable to warn the population about the
risks of contact with the water in the river.
river flow towards the southwest. Many people in the city centre and on the east
inhaled toxic particles on the next day after the accident and got ill. The main symp-
toms were chills, fever, cough, headache, sweats, and shortness of breath, similar to flu. Unfortunately, since the city centre is the busiest and most crowded area, many people were affected. The toxic spill in the river also had sad consequences, which were mainly observed on May 19. It affected people who were on the river banks and, possibly, had direct contact with the water. Toxic particles somehow got into their stomachs and led to disorders of the digestive system. The morbidity rate of the digestive system disease notably decreased on May 20. New cases of people
feeling flu-like symptoms continued to appear on the second and third day after the
accident. The morbidity rate decreases quite slowly, calling for some measures to
clean the territory. Fortunately, there is no evidence that the disease can be transmit-
ted through personal contacts; therefore, there is no need to isolate affected people
from others and examine everyone who was in contact with them.
We have reconstructed the story and answered the questions of the challenge by
means of analytic reasoning, which is the principal component of any analysis. To
be able to reason, we need to ingest information into our brain. What is the best way
to do this? Can we just read all data records? Even if we could read the available
1,023,077 records in a reasonable time, would this help us to understand what was
going on? It seems doubtful. Throughout the whole process of analysis, we used
visual representations, simply speaking, pictures.
You have probably heard the idiom “A picture is worth a thousand words”1. In our example, a picture can be worth more than a million records. Instead of reading the records, we could catch useful information in just one look. Pictures were the main
sources providing material for our reasoning. This material was put in such a form
that could be very efficiently transferred to our brain using the great power of our
vision. What our vision does is not just transmitting pixels. Psychological studies
show that human vision involves abstraction [26]. Seeing actually means subcon-
sciously constructing patterns and extracting high-level features, and it is these pat-
terns and features that we use as material for our reasoning. Moreover, perceiving
patterns and features inevitably triggers our reasoning. Hence, whenever a task can-
not be fulfilled by routine computer processing but requires human reasoning, visual
representations can effectively convey relevant information to the human mind. Of
course, the visual representations need to be carefully and skilfully constructed to
be really effective and by no means mislead the viewers by conveying false patterns.
It is one of the goals of this book to explain how to construct such representations.
1 https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_words
Although one picture can be worth a million records, a single picture may not be
sufficient for solving a non-trivial problem. Thus, our analysis consisted of multiple
steps:
• data preparation, in which we selected the subset of potentially relevant records;
• analysis of the temporal distribution of the records, in which we identified the
start time of the epidemic;
• analysis of the spatial distribution, in which we identified the most affected areas;
• verification of observed patterns, in which we checked that the high density
of the disease-related messages was not simply proportional to the usual density
of the messages;
• analysis of the spatio-temporal distribution, in which we identified how the out-
break evolved, discovered differences between the temporal patterns in the two most
affected areas, and came to a hypothesis that they could be affected by different
diseases;
• comparison between the texts of the messages posted in the two most affected
areas, in which we confirmed our hypothesis of the existence of two different
diseases;
• reasoning about the disease transmission mechanisms, in which we related the
observed patterns to the context information concerning the weather (the wind)
and the geographic features (the river);
• hypothesising about a common source of the two diseases based on the observa-
tion of the spatial patterns;
• finding relevant information for explaining the reasons for the epidemic outbreak;
• putting our findings together into a story that gives answers to the questions of
the challenge.
In each step of this process, we used visual aids for our thinking. Besides, we
used various operations: data selection, spatial and temporal filtering, extraction
of frequent terms, derivation of secondary data, such as district-based aggregates,
construction of record sequences, and extraction of sequence heads. Throughout the pro-
cess, we continuously interacted with our computer, which performed these oper-
ations upon our request and also produced the visual representations that we used
for our reasoning. The whole process corresponds to the definition of visual ana-
lytics: “the science of analytical reasoning facilitated by interactive visual inter-
faces” [133, p. 4]. An essential idea of visual analytics is to combine the power of
human reasoning with the power of computational processing for solving complex
problems.
At the same time, our analysis process also corresponds well to what data scien-
tists usually do: select and process data, explore the data to identify patterns, verify
the patterns, develop models, and communicate results. These activities cannot be
done without human reasoning, and, as we showed and discussed earlier, visual
representations can be of great utility. Perhaps the main difference between visual
analytics and data science is that they focus on different aspects of the analytical
process, which is performed by a joint effort of the human and the computer. Data
science focuses on the computer side, on computational processing and derivation
of computer models (here and further throughout the book, we use the term com-
puter model to refer to any kind of model that is meant to be executed by computers,
typically for the purpose of prediction). Visual analytics focuses on the human side,
on reasoning and derivation of knowledge.
Visual analytics considers the knowledge generated by the human to be an essential
result of the analytical process, irrespective of whether a computer model is built or
not. This knowledge can also be seen as a kind of model of the subject that has been
analysed. It is a mental model2, that is, a representation of the subject in the mind of
the human analyst. In our example, we have constructed such a model in the process
of the analysis. It can not only explain what happened but also predict how the situ-
ation will develop and tell what actions should be taken. Hence, it may not always
be necessary to develop a computer model, yet whenever a computer model is re-
quired, visual analytics can greatly help in building it. Moreover, a computer model
cannot be appropriately used without human knowledge of what it is supposed to do
and when and how to apply it. This knowledge of the computer model is, like the
knowledge of the phenomenon that is analysed and modelled, an important outcome
of the analytical process. Therefore, the scope of visual analytics includes not only
derivation of mental models (i.e., new knowledge represented in the analyst’s mind)
but also conscious development of computer models that are well considered and
well understood.
As you see, visual analytics aptly complements data science, and, moreover, it is
instrumental for doing good data science, because non-trivial and non-routine anal-
ysis tasks require joint efforts of computers and humans. Visual analytics provides
a way to perform data science so that the power of human vision and reasoning is
effectively utilised.
As we already mentioned, visual analytics has been defined as “the science of ana-
lytical reasoning facilitated by interactive visual interfaces” [133, p. 4]. This defini-
tion emphasises a certain kind of activity (analytical reasoning) and a certain tech-
2 https://en.wikipedia.org/wiki/Mental_model
Humans:
• flexible and inventive, can deal with new situations and problems
• can associate diverse information pieces and “see the forest for the trees”
• can solve problems that are hard to formalise
• can cope with incomplete/inconsistent information
• can see and recognise things that are hard to compute or formalise
Computers:
• can handle huge amounts of data
• can do fast search
• can perform fast data processing
• can interlink to extend their capacities
• can render high quality graphics
So, it is sensible to put these great capabilities together and let them work jointly.
This requires communication between the human and the computer, and the most
convenient way for the human is to do this through an interactive visual interface.
The book “Mastering the Information Age: Solving Problems with Visual Analytics”
[79] says: “The visual analytics process combines automatic and visual analysis
methods with a tight coupling through human interaction in order to gain knowledge
from data”. As a schematic representation of this statement, Figure 1.16 shows how
a human and a computer work together to analyse data and generate knowledge.
The computer performs various kinds of automated data processing and derives
some artefacts, such as transformed data, results of queries and calculations, sta-
tistical summaries, patterns, or models. The computer also produces visualisations
enabling the human to perceive original data as well as any further data and infor-
mation derived by means of computational processing. The human uses the infor-
mation perceived for reasoning and knowledge construction. The human determines
and controls what the computer does by selecting data subsets to work on, choos-
ing suitable methods, and setting parameters for processing. Based on the current
knowledge, the human may refine what the computer has produced, for example,
discard some artefacts as uninteresting and apply further processing to interesting
stuff, or partition the input data into subsets to be processed separately.
Fig. 1.16: Schematic representation of the visual analytic activity in which human
cognition is combined with computational processing. Adapted from [78, 79]
During this activity, the human constantly uses and enriches the knowledge exist-
ing in the mind. The human begins the analysis having some prior knowledge and
constructs new knowledge as the analysis goes on. The activities of the humans
are supported by interactive visual representations created by the computers, and
the computers also help the humans by handling data and deriving various kinds
what we want to understand is a piece of the objectively existing world, but what we
analyse is data, which are recorded observations and measurements of some aspects
of this real world piece. We use the term “subject” to refer to the thing we want to
understand. Hence, in this case, the subject of our analysis is the piece of the real
world. The data that we use may be called the object of the analysis, because we
apply analytical operations to the items (records) of the data.
However, we may also want to understand the data as such: the structure of the
records, the meanings of the fields and relationships between them, the accuracy, de-
gree of detail, amount, scope, completeness, etc. Moreover, understanding of data is
absolutely necessary for performing valid analysis and gaining correct understand-
ing of the thing that is reflected in the data. When we seek understanding of data as
such, the data are our analysis subject. In this case, the analysis is applied directly to
the subject; in other words, the subject and object of the analysis coincide.
The goal of analysing a real-world phenomenon (here and further on, the word “phe-
nomenon” stands for the longer phrase “thing or phenomenon”) may be not only to
gain understanding but also to create a computer model of this phenomenon. The
purpose of a computer model is to characterise the phenomenon and predict poten-
tial observations beyond the available data. Naturally, an analyst strives to create
a good model: correct, complete, prediction-capable, accurate, and so on. To un-
derstand how good a computer model is (in terms of various criteria) and in what
situations it is valid to use this model, it is necessary to analyse its properties and
behaviour. The model becomes the subject of the analysis. For analysing a model,
the analyst uses specific data reflecting its performance, such as correspondences
between inputs, parameter settings, and outputs. These data will be the object of the
analysis.
It is unusual that the main goal of analysis is to understand properties of a data set or
behaviour of a computer model, so that the data set or the model is the only subject
of the analysis. More commonly, the overall goal of analysis is to understand and,
possibly, model a real-world phenomenon. The phenomenon is thus the primary
subject of the analysis. The expected outcome of the analysis is understanding, or
knowledge, of the phenomenon1. However, to achieve the overall goal, it is necessary
to understand not only the phenomenon but also the available data and the computer
model if it is built.
Generally, analysis is a multi-step process with the steps focusing on different sub-
jects. At the initial stage of the process, the subject is the data. The goal is to under-
stand properties of the data and assess how well the data represent the phenomenon,
which refers both to the data quality and the nature of the phenomenon. At the next
stage, the subject is the phenomenon reflected in the data, and the goal is to ob-
tain knowledge and, possibly, derive a computer model of the phenomenon. Once a
model is built, it becomes the analysis subject. Visual analytics approaches can be
applied to all three subjects: data, phenomena, and computer models.
1 In this book, the terms knowledge (of something) and understanding (of something) are used
interchangeably as synonyms.
In our investigation of the microblog data in Chapter 1, the primary subject of the
analysis was the epidemic outbreak, as we aimed at understanding and characteris-
ing the outbreak. With respect to this overall goal, the microblog messages were the
object of our analysis. However, at the initial stage, it was necessary to select a rele-
vant portion of the data. We made a preliminary selection based on our background
knowledge, that is, we extracted texts including particular keywords. To understand
whether all these texts were relevant, we investigated the words that frequently oc-
curred in the selected texts and detected frequent occurrences of the phrase “fried
chicken flu”. We suspected that the texts including this phrase were irrelevant to the
outbreak. To be sure, we compared the times of the “fried chicken flu” messages
with the temporal distribution of the remaining selected messages. At this stage,
the subject of our analysis was the data. In the following steps, the subject was the
real-world phenomenon, that is, the epidemic outbreak. Although we did not build
a computer model in our example, it was possible to build a model predicting the
further spread of the flu-like disease or the numbers of people expected to come to
hospitals in the following days. In this case, it would be necessary to study the predictive
behaviour of this model, and the model would be the analysis subject.
Fig. 2.1: The block diagram represents a general data science workflow. The grey
boxes specify the analysis subjects dealt with at different stages of the data science
process. Visual analytics approaches can be applied to each of the three subjects.
2 https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
Fig. 2.2: A schematic representation of the structure of the subject in the epidemic
outbreak analysis (Chapter 1).
tween time instances, which are the distances between their positions within the
cycles. Similarly to time, values of numeric measures have intrinsic distance and
order relationships among them but there are no cycle-based relationships.
Intrinsic relationships, e.g., ordering between time moments or differences between
numeric values, are usually not represented in data explicitly. When an explicit rep-
resentation is needed, it can be obtained by means of routine calculations. For ex-
ample, distances between values of a numeric attribute or between time instants can
be determined by simple subtraction, and there are formulas for determining spatial
distances between locations based on their coordinates.
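As a minimal illustration of both kinds of calculation, here is a Python sketch; the data values and function names are invented for this example, and the geographic distance uses the standard haversine formula, which approximates the Earth as a sphere.

```python
import math

def value_distance(a, b):
    """Distance between two values of a numeric attribute: simple subtraction."""
    return abs(a - b)

def geo_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (in km) between two geographic locations,
    computed from their coordinates with the haversine formula."""
    r = 6371.0  # mean Earth radius in kilometres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(round(value_distance(37.2, 39.5), 1))        # → 2.3
print(geo_distance_km(50.74, 7.10, 51.51, -0.13))  # Bonn to London, roughly 500 km
```

For planar coordinates, the ordinary Euclidean formula would replace the haversine computation.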
Discrete entities and values of qualitative attributes have no intrinsic relationships
among them, that is, relationships that exist by nature and thus cannot be absent.
This does not mean, however, that they may not have any relationships. In a specific
set of discrete entities, there may be specific relationships, such as social relation-
ships between people and ‘reply’ relationships between messages. Such specific
relationships may be explicitly defined in data or inferred from data. For example,
social relationships between people can sometimes be inferred from data portraying
people’s movements based on repeated simultaneous appearances of two or more
individuals at the same places.
2.3.1 Distribution
3 https://www.merriam-webster.com/dictionary/distribution
4 https://www.oxfordlearnersdictionaries.com/definition/english/distribution?q=distribution
• distribution of values of attributes over space, such as the distribution of the num-
bers of the disease-related messages over the city districts (base: space, overlay:
attribute values);
• distribution of values of attributes over time, such as the distribution of the pro-
portions of the disease-related messages over the days (base: time, overlay: at-
tribute values);
• distribution of one sort of entities over a set of entities of another sort, such as the
distribution of the keywords over the messages (base: entities, overlay: entities);
• distribution of specific relationships over a set of entities, such as the distribution
of friendship relationships over the set of the microblog users or the distribu-
tion of the ‘reply’ and ‘re-post’ relationships over the set of the messages (base:
entities, overlay: relationships).
In a distribution, there are two aspects to consider:
• correspondence: associations between elements of the base and the correspond-
ing elements of the overlay;
• variation: similarity-difference relationships between the elements of the overlay
corresponding to different elements of the base.
If we imagine the overlay lying on top of the base, the correspondence consists
of the “vertical” relationships between the base and the overlay, and the variation
consists of the “horizontal” relationships within the overlay. These concepts are
schematically illustrated in Fig. 2.3.
Variation
Overlay
Correspondence
Base
Fig. 2.3: A schematic illustration of the concept of distribution. The dots represent
elements of two components and the arrows represent relationships between ele-
ments.
While being widely used in the context of data analysis, the term ‘pattern’ has no ex-
plicit common definition. We propose the following working definition of a pattern:
explained (why do they exist?). Verified patterns can be incorporated in the model
that is being built in the analyst’s mind and, possibly, in the computer.
There may be items that cannot be put together with any other items because they
are very different from everything else. Such items are called outliers. Detecting
outliers is an important activity in studying relationships. In visual displays, outliers
are often manifested as isolated visual marks (such as dots) that are not integrated in
any shape formed by other marks. The analyst needs to understand whether outliers
are errors in data, or they represent exceptional but real cases. For building a general
computer model, outliers may need to be removed, unless they are data errors that
can be corrected.
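Outlier candidates can also be flagged computationally, complementing their visual detection. The following is a generic sketch (not a method prescribed here) that marks values lying far from the mean in units of the standard deviation; the data are invented.

```python
import math

def zscore_outliers(values, threshold=3.0):
    """Indices of values lying more than `threshold` standard deviations
    from the mean - one simple, common way to flag outlier candidates."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]

values = [12, 14, 13, 15, 14, 13, 12, 95]  # the last value is suspicious
print(zscore_outliers(values, threshold=2.0))  # → [7]
```

Whatever method is used, the flagged items still require the human judgement described above: are they data errors or exceptional real cases?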
As noted in Section 2.3.1 and schematically illustrated in Fig. 2.3, two aspects are
considered in exploring distributions, correspondence and variation. We focus on
the correspondence when we look at which or how many elements of one component
are associated with particular elements or groups of elements of the other compo-
nent, for example, how many messages were posted in the city centre and which
keywords occurred in these messages. We focus on the variation when we look at
the differences between the elements of one component associated with different el-
ements or groups of elements of the other component. An example is the differences
in the number and contents (keywords) of the messages posted in different areas in
the city. A pattern that can be found in a distribution may refer to the correspon-
dence, or variation, or to both.
Frequency distributions. The most basic and common characteristic of a distribu-
tion is the frequency of the occurrences of distinct elements of one component over
the elements of the other. This characteristic is called frequency distribution. In our
example analysis in Chapter 1, we looked at the frequencies of the occurrences of
distinct keywords in sets of messages. To observe and explore this distribution, we
represented it visually in the form of a word cloud.
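Computing such a frequency distribution of keywords over a set of messages takes only a few lines; this hypothetical sketch uses Python's standard library, with invented messages (a real analysis would also remove stop words such as "and").

```python
import re
from collections import Counter

# A few hypothetical microblog messages
messages = [
    "feeling sick, fever and chills",
    "bad fever, staying home today",
    "chills and fever again, this flu is awful",
]

# Tokenise every message and count how often each distinct word occurs
counts = Counter(word for msg in messages
                 for word in re.findall(r"[a-z]+", msg.lower()))

# The most frequent terms are the ones a word cloud would draw largest
print(counts.most_common(3))
```

In a word cloud, each word's font size would then be scaled by its count.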
Frequency distributions of values of numeric attributes are commonly represented
by frequency histograms, as in Fig. 2.4, in which the values are grouped into in-
tervals, often called “bins”. The general shape of a histogram tells us what values
are more usual and thus expectable and what values occur rarely. Thus, the upper
two histograms in Fig. 2.4 represent distributions in which medium values are more
usual than small and high values, and the lower two histograms exhibit the preva-
lence of lower values. Still, there are also shape differences between the upper two
histograms, as well as between the lower two histograms. On the top left, the range
of frequently occurring values is wider, and the frequencies gradually decrease to-
wards the left and right sides. On the top right, there is a narrow interval the values
Fig. 2.4: The frequency distributions of four distinct numeric attributes are repre-
sented graphically by frequency histograms.
from which are almost twice as frequent as their closest neighbours on the left and
on the right. In the lower row, the range of frequent values on the left is much nar-
rower than on the right, and the frequency decrease is steeper. Please note that, when
speaking about the differences between the frequencies of different values, we de-
scribe variation patterns, more specifically, patterns of frequency variation.
A histogram may also reveal the presence of outliers. Thus, in the lower histograms
in Fig. 2.4, the very short bars at the right edges represent several outliers.
In exploring a frequency distribution of numeric values with the use of a histogram,
it is important to remember that the histogram shape may change depending on
the binning, that is, the division of the value range into intervals. Thus, Figure 2.5
demonstrates four histograms representing the same distribution, but each histogram
has a different number of bins, which makes the shapes differ. None of these his-
tograms is wrong or better than the others, and it is pointless to seek an optimal binning.
To gain a better understanding of a distribution, it is reasonable to try several
variants of binning and observe whether and how the shape changes.
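The effect of binning can be reproduced in a few lines of code; this sketch bins the same synthetic values in two ways, so the resulting count sequences (the histogram shapes) differ, although the underlying distribution is the same.

```python
import random

def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # put the maximum into the last bin
        counts[i] += 1
    return counts

random.seed(1)
values = [random.gauss(50, 10) for _ in range(1000)]  # a synthetic numeric attribute

# The same distribution, binned two different ways: the shapes differ,
# yet neither variant is "wrong"
print(histogram(values, 5))
print(histogram(values, 20))
```

Plotting libraries hide this loop but expose the bin count as a parameter, which is exactly the parameter worth varying.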
Temporal distributions. The intrinsic properties of time are ordering and distance
relationships between elements, which are time instants (moments). Besides, tempo-
ral cycles may be relevant for many application domains and analysis tasks. Some of
the types of possible patterns in temporal distributions emerge due to these intrinsic
properties. For a component distributed over time, an important concept is change,
which is a difference between an element related to some time moment and another
Fig. 2.5: Dependency of the visible pattern on the sizes of the histogram bins.
element related to an earlier time moment. The concept of change is involved in the
meanings of many types of temporal distribution patterns. Thus, a constancy pattern
means absence of change, and a fluctuation pattern means frequent changes. When
a component distributed over time has ordering relationships between elements, a
change may be either increase or decrease. A trend pattern means multiple increases
or multiple decreases following one another along the sequence of time moments.
Since the concept of change refers to the variation in a distribution (Section 2.3.1),
all these kinds of patterns can be categorised as variation patterns, more specifically,
as patterns of temporal variation.
Changes of a set of time-based discrete entities, which can be called events, con-
sist of disappearances of existing entities and appearances of new entities. Apart
from or instead of the existence of particular entities, analysts are often interested
in the total number of entities existing at different times, which can be called the
temporal frequency of the entities (events); see Figs. 1.5 and 1.6. When we con-
sider the changes of the temporal frequency over time, we are looking for variation
patterns of the temporal frequency. The temporal frequency can be expressed as a
time-based numeric measure whose possible changes are increase and decrease, like
for any numeric attribute. It should be noted that the absence of increase or decrease
in the number of entities does not necessarily mean the absence of changes in the
set of entities, where some or even all entities might have been replaced by other
entities. In particular, this is always true for instant events, that is, entities that only
exist at the moment of their appearance or entities whose life duration is negligibly
short or not relevant to the analysis goals. Thus, the microblog message posts in
our example in Chapter 1 are considered as instant events. Examples of non-instant
events, i.e., events with duration, are the epidemic outbreak, which lasted for at least
three days, according to the data, the bad health condition of each affected person,
the truck accident, and the resulting traffic congestion.
A pattern may include other patterns. Thus, a peak pattern consists of an increasing
trend pattern followed by a decreasing trend pattern.
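The notions of temporal frequency, trend, and peak can be made concrete in code; the event times below are invented for illustration.

```python
from collections import Counter

# Hypothetical day numbers on which instant events (e.g. message posts) occurred
event_days = [18] * 3 + [19] * 40 + [20] * 25 + [21] * 12 + [22] * 5

# Temporal frequency: the number of events per time unit (here, per day)
counts = Counter(event_days)
freq = [counts[d] for d in range(18, 23)]
print(freq)  # → [3, 40, 25, 12, 5]

# A peak pattern: an increasing trend followed by a decreasing trend
top = max(range(len(freq)), key=lambda i: freq[i])
is_peak = (0 < top < len(freq) - 1
           and all(freq[i] < freq[i + 1] for i in range(top))
           and all(freq[i] > freq[i + 1] for i in range(top, len(freq) - 1)))
print(is_peak)  # → True
```

A time histogram such as the one in Fig. 1.5 is essentially a visual display of the `freq` sequence computed here.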
Due to the ordering and distance relationships between time moments, a periodicity
pattern may take place, which means repeated appearance of particular entities, val-
ues, changes, or patterns at (approximately) equal distances in time. A periodicity
pattern may be related to time cycles; in this case, repeated appearances of particular
entities, values, changes, or patterns have the same or close positions in consecutive
temporal cycles, as shown in Fig. 2.6.
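Periodicity can be detected computationally as well as visually. A common generic approach (not specific to any tool discussed here) is to correlate a time series with a shifted copy of itself at different lags and look for the lag where the correlation peaks; the synthetic series below has an artificial 24-step (daily) cycle.

```python
import math

def autocorrelation(series, lag):
    """Pearson correlation of a series with itself shifted by `lag` steps."""
    n = len(series) - lag
    x, y = series[:n], series[lag:]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A synthetic hourly event count with a daily (24-step) cycle
series = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(240)]

# The lag with the highest autocorrelation suggests the period
best = max(range(2, 40), key=lambda lag: autocorrelation(series, lag))
print(best)  # → 24
```

Displaying the same series wrapped around the cycle length, as in Fig. 2.6, makes such a period directly visible.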
Spatial distributions. The intrinsic property of space is the existence of distance
relationships among the elements, which are spatial locations. Based on distances,
neighbourhood relationships can be defined. Neighbourhood usually means small
distance between locations, but, when the space is geographic, two close locations
may not be considered as neighbours when there is a barrier between them, such as
a river, a mountain range, or a state border.
Space can be a base for discrete entities and for attribute values. For discrete en-
tities, possible types of spatial distribution patterns refer to the distances between
the spatial positions of the entities. A distribution of entities in space can be char-
acterised in terms of spatial density, which is conceptually similar to the temporal
frequency and means the number of entities located in a part of space of a certain
fixed size. Figure 2.7 demonstrates examples of some types of spatial distribution
patterns using the microblog data from the VAST Challenge 2011 considered in
Chapter 1.
Fig. 2.7: Types of patterns observable in the spatial distribution of the microblog
posts in the VAST Challenge 2011 (Chapter 1).
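Spatial density can be estimated by partitioning space into equal-sized cells and counting the entities per cell, in direct analogy to temporal frequency; the point coordinates below are invented for illustration.

```python
from collections import Counter

# Invented point locations (x, y) of entities in some area
points = [(1.2, 3.4), (1.3, 3.5), (1.1, 3.3), (7.8, 8.1), (7.9, 8.0), (4.0, 0.5)]

cell = 1.0  # side length of a square grid cell

# Spatial density: the number of entities falling into each grid cell
density = Counter((int(x // cell), int(y // cell)) for x, y in points)

# The densest cells correspond to spatial concentration (cluster) patterns
print(density.most_common(2))  # → [((1, 3), 3), ((7, 8), 2)]
```

Varying the cell size changes the visible pattern, just as the bin width changes the shape of a frequency histogram.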
Fig. 2.8: Illustration of some types of patterns that can exist in a spatial distribution
of numeric attribute values. Left: A distribution with multiple spatial patterns. Right:
The overall pattern is an increasing trend from the centre to the periphery. It is
disrupted in some parts by a few local patterns.
2.3.4 Co-distributions
Fig. 2.9: A scatterplot presents the distribution of entities over an abstract space
made of all possible combinations of values of two numeric attributes.
Distances in such a space represent degrees of similarity or relatedness: the more
similar or related two items are, the closer they are in the space.
In the case of multiple numeric attributes, artificial spaces satisfying this re-
quirement can be built using dimensionality reduction methods (Section 4.4), which
are capable of representing combinations of attribute values by positions in a space with
two or three dimensions. This technique is called embedding or projection, and the
space so created is called the embedding space or projection space. Ideally, all dis-
tances in a projection space should be proportional to the corresponding combined
differences between the values of the multiple attributes; however, this condition is
typically impossible to fulfil. The dimensionality reduction methods strive to min-
imise the distortions of the distances in the projection space, but it should never
be forgotten that distortions are inevitable. Hence, distances in a projection space
should be treated as approximate. One should look for general patterns and outliers
but avoid making judgements based on distances between individual items.
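As an illustration of how such a projection can be computed, here is a sketch of classical multidimensional scaling using only NumPy. Real analyses typically rely on library implementations (of MDS, PCA, t-SNE, and the like); the data are invented, and for this tiny example the projection happens to preserve the distances almost exactly, which is not the case in general.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Project items with pairwise distances `dist` into k dimensions
    (classical multidimensional scaling)."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    b = -0.5 * j @ (dist ** 2) @ j          # double-centred squared distances
    vals, vecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]        # keep the k largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Four hypothetical items described by five numeric attributes
data = np.array([[1., 2., 0., 4., 1.],
                 [1., 2., 1., 4., 1.],
                 [8., 9., 7., 0., 6.],
                 [8., 9., 8., 0., 6.]])

# Pairwise Euclidean distances in the original 5-dimensional space
d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)

xy = classical_mds(d, k=2)

# Distances in the 2D projection approximate (and, in general, distort)
# the original distances
d2 = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
print(np.round(d2, 2))
```

Plotting the rows of `xy` as dots yields a projection plot of the kind discussed next.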
A projection can be visualised in a display that looks similar to a scatterplot and can
be called a projection plot. Unlike in a scatterplot, the dimensions of a projection plot
have no specific meanings. How, then, can one understand the meanings of spatial
patterns exhibited by a projection plot? This requires additional visual and inter-
active techniques, as, for example, shown in Fig. 2.10. Here, a projection plot (on
the left) has been constructed based on the combinations of values of 16 attributes
expressing the percentages of different age groups in the population of administrative
districts. These combinations are represented by lines on a parallel coordinates plot
(on the right), where the parallel axes correspond to the attributes. For each dot in
the projection plot, there is a corresponding line in the parallel coordinates plot.
The projection display is visually linked to the parallel coordinates plot by colour
propagation. For this purpose, continuous colouring is applied to the background of
the projection display. As a result, each position in the display receives a particular
colour, and close positions receive similar colours. The background colour of each
dot in the projection display is propagated to the corresponding line in the parallel
coordinates plot. This allows us to understand what kinds of value combinations
correspond to differently coloured parts of the projection display. The association
between the two displays is additionally supported by highlighting of correspond-
ing visual elements in both displays in response to interactive selection of elements
in one of them. In Fig. 2.10, bottom, a group of closely located dots has been se-
lected in the projection display. The selected dots and the corresponding lines in the
parallel coordinates plot are marked in black.
Another approach to studying interrelations between multiple components is demon-
strated in Fig. 2.11. The approach involves multiple displays representing the com-
ponents and interactive selection of items in one of the displays that causes simulta-
neous marking (highlighting) of corresponding parts in all displays. The example in
Fig. 2.11 involves multiple numeric attributes represented by frequency histograms.
The analyst can select high, low, or medium values of one attribute and see what val-
ues of the other attributes correspond to these. In principle, this idea can be extended
Fig. 2.11: Interrelations among multiple numeric attributes can be studied using
interactive histogram displays. Here, the analyst has selected a sequence of bars in
the right part of the first histogram (top left). The black segments indicate where the
selected data fit in each histogram.
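The computation behind such linked highlighting is simple: determine which records satisfy the selection on one attribute, then count the selected records within each bin of every other histogram. The following is a hypothetical sketch with two synthetic, correlated attributes.

```python
import random

random.seed(7)
# Two hypothetical numeric attributes over the same 500 records
a = [random.gauss(50, 10) for _ in range(500)]
b = [x * 0.5 + random.gauss(0, 5) for x in a]   # b is correlated with a

def bin_counts(values, selected, lo, hi, n_bins):
    """Per-bin totals and per-bin counts of the selected records."""
    width = (hi - lo) / n_bins
    total, picked = [0] * n_bins, [0] * n_bins
    for v, s in zip(values, selected):
        i = min(max(int((v - lo) / width), 0), n_bins - 1)
        total[i] += 1
        if s:
            picked[i] += 1
    return total, picked

# Brush: select the records with high values of attribute a
selected = [v >= 60 for v in a]

# See where the selection falls in the histogram of attribute b:
# the highlighted ("black") segments concentrate in the high-value bins of b
total, picked = bin_counts(b, selected, min(b), max(b), 10)
print(picked)
```

In an interactive tool, the `picked` counts would be redrawn as the dark bar segments every time the brush changes.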
2.3.5 Spatialisation
In the previous section we have seen how an artificial space can be constructed from
combinations of values of multiple numeric attributes using a dimensionality reduc-
tion algorithm. The idea of creating artificial spaces and spatial distributions that
represent non-spatial relationships can be extended to other kinds of data. The pro-
cess is called spatialisation. In general terms, spatialisation can refer to the use of
spatial metaphors to make sense of an abstract concept. In visualisation, “spatialisa-
tion” refers to arranging visual objects within the display space in such a way that
the distances between them reflect the degree of similarity or relatedness between
the data items they represent. Spatialisation exploits the innate capability of humans
to perceive multiple things located closely as being united in a larger shape or struc-
ture, that is, in some pattern. It also exploits the intuitive perception of spatially
close things to be more related than distant things.
Dimensionality reduction (also known as data embedding) algorithms are a class of
tools that can be used for spatialisation. Some of these algorithms can be applied
only to data consisting of combinations of multiple numeric values; to use them for
another kind of data, we need to transform the data into a suitable form.
Other methods can be applied to a previously created matrix of pairwise distances
between items that need to be spatialised, irrespective of where the distances come
from. Hence, for any kind of items, we need to define a suitable way to express the
degree of their similarity or relatedness by a numeric value that can be interpreted as
a distance. The principle is: the smaller the distance value, the greater the similarity
or the stronger the relationship, and the distance is zero when two items
are identical. A procedure that measures the degree of similarity or relatedness for
pairs of items is called a distance function.
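As an illustration, here are two standard distance functions: a Euclidean distance for items described by numeric attribute vectors, and a Jaccard distance for items described by keyword sets. These are generic textbook functions, not necessarily the ones used in the examples below; both return zero for identical items.

```python
import math

def euclidean(p, q):
    """Distance between items described by numeric attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def jaccard_distance(s, t):
    """Distance between items described by keyword sets:
    1 - |intersection| / |union|, which is 0 for identical sets."""
    if not s and not t:
        return 0.0
    return 1.0 - len(s & t) / len(s | t)

print(euclidean([0, 0], [3, 4]))                          # prints: 5.0
print(jaccard_distance({"cat", "dog"}, {"cat", "dog"}))   # prints: 0.0
print(jaccard_distance({"cat", "dog"}, {"fish"}))         # prints: 1.0
```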
Examples of applying spatialisation to different kinds of items can be seen in
Fig. 2.12. The examples originate from a paper [28]. In the upper image, the spa-
tialisation has been applied to different versions of the Wikipedia article “Choco-
late”, which was edited many times. In the middle image, the dots in the artificial
space correspond to frames of a surveillance video, and in the lower image, to spa-
tial distributions of precipitation over the territory of the USA. In all cases, the
items represent states of some dynamic (that is, temporally evolving) phenomena
or processes. The similarities between the states were measured by means of dis-
tance functions suitable for the respective data types, i.e., texts in the Wikipedia
example and images in the other two examples. The states are represented by dots
arranged in the projection spaces according to the distances between them. The dots
are connected by curved line segments according to the chronological order of the
respective states. The relative times of the states are represented by shading of the
dots and connecting line segments from light (earlier times) to dark (later times).
In these displays, small distances between consecutively connected points indicate
small changes while large distances correspond to large changes.
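The projections in Fig. 2.12 were produced with the methods of [28]; purely as an illustration of how a projection can be computed from nothing but a matrix of pairwise distances, the following sketch implements classical multidimensional scaling (MDS) with numpy (assumed to be available). The four "states" and their distance matrix are invented.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Place items in a dims-dimensional space so that their pairwise
    Euclidean distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dims]      # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Invented example: four "states"; 0 and 1 are similar, 2 and 3 are
# similar, and the two pairs are far apart from each other.
D = np.array([[0, 1, 5, 5],
              [1, 0, 5, 5],
              [5, 5, 0, 1],
              [5, 5, 1, 0]], dtype=float)
P = classical_mds(D)
# In the projection, similar states end up close together.
assert np.linalg.norm(P[0] - P[1]) < np.linalg.norm(P[0] - P[2])
```

Connecting the projected points in chronological order, as in Fig. 2.12, would then reveal whether consecutive states change gradually (short links) or abruptly (long links).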
In the visualisation of the editing process of the Wikipedia article, it is possible
to see a kind of “war” between two authors, in which two versions of the article
repeatedly replaced each other. In the visualisation of the surveillance video, we
see a large cluster of dots representing the states when nothing happened while the
remaining dots correspond to movements of pedestrians in front of the camera. In
the lower image, long alignments of dots indicate long-term trends in the evolution
of the spatial distribution of precipitation. The shape of the overall curve indicates
a cyclic pattern in the temporal evolution of the precipitation distribution: by the
end of the year, the distribution returns to states that are similar to those at the
beginning of the year. In these examples, interpretable patterns are created not only
by arrangements of marks in display spaces but also by shapes of curves connecting
the marks.
Generally, spatialisation is a powerful tool that can be used in analysing various
kinds of relationships when there is a reasonable way to represent the strengths of
these relationships numerically.
Fig. 2.12: Examples of the use of spatialisation for representing temporal evolution
of dynamic phenomena and processes [28]. Top: editing of the Wikipedia article
“Chocolate”. Middle: a surveillance video of a street. Bottom: spatial distribution of
precipitation.
2.4 Concluding remarks
The process of non-trivial data analysis and problem solving essentially involves
human reasoning. The goal of analysis is to gain understanding and, possibly, to
build a computer model of some subject, such as a phenomenon existing in the real
world. A non-trivial subject can be seen as a complex system including multiple
components linked by relationships. These relationships are reflected in data, which
specify correspondences between elements of different components.
Relationships between components can be understood by studying distributions of
elements of some components over elements of other components. To proceed from
correspondences between individual elements to understanding of relationships be-
tween components as wholes, one needs to find and interpret patterns in the distri-
butions. A pattern consists of multiple correspondences that can be perceived and/or
represented together as a single object. The power of human vision gives humans
the capability to notice various patterns in visual representations of distributions.
Unlike computers, humans do not require a precise definition of patterns to search
for. Human perception is especially prone to seeing patterns in spatial
distributions. Therefore, it can be helpful to create artificial spaces in which vari-
ous non-spatial relationships are represented by relationships of spatial distance and
neighbourhood.
The types of patterns that may exist in distributions and the meanings of these
patterns depend on the types of components whose relationships are analysed. In
the following chapters, we shall consider classes of phenomena involving different
types of components.
At the beginning of this chapter, we stated that visual analytics is relevant to different
steps of the data science workflow. Visual analytics can be seen as a way to fulfil
the principles of good data science, which can be summarised as follows:
• You should always remember that the human, not the machine, is the leading
force of the data science process.
• You need to see and understand your data before trying to do something with it.
• You should not blindly accept what the machine gives you; look, understand, think,
experiment: what will happen if I change something?
• While analysing and modelling, you should also think about presenting your re-
sults and how you obtained them to other people. You must be able to explain
and justify the analytical process and its outcomes.
Given these principles, the relevance of visual analytics to data science becomes
clear. Visual analytics assumes that the human analyst interacts with the machine,
not just views what it has produced. The analyst tries different parameter settings,
different computational methods, different data transformations, feature selections,
data divisions, etc. The analyst provides background knowledge and evaluation
feedback when the algorithms are designed to accept these. The analyst takes care
to explain and justify the analytical procedure and its outcomes to others.
In the following chapters, we shall demonstrate how visual analytics as an approach,
and also as a particular attitude to the analysis process and a discipline of mind, can
help you to do good data science.
Chapter 3
Principles of Interactive Visualisation
Abstract We introduce the basic principles and rules of the visual representation
of information. Any visualisation involves so-called visual variables, such as posi-
tion along an axis, size, colour hue and lightness, and shape of a graphical element.
The variables differ by their perceptual properties, and it is important to choose
appropriate variables depending on the properties of the data they are intended to
represent. We discuss the commonly used types of plots, graphs, charts, and other
displays telling what kinds of information each of them is suited for and what vi-
sual variables it involves. We explain the necessity of supporting the use of visual
representations by tools for interacting with the displays and the information they
represent and introduce the common interaction techniques. We also discuss the
limitations of visualisation and interaction, substantiating the need to combine them
with computational operations for data processing and analysis.
The definition of visual analytics emphasises that visual interfaces need to be
interactive [133]. This is because visualisation and interaction complement each
other in supporting human analytical reasoning. Visualisation is the best way to
supply information to the human mind, while interaction helps to obtain additional
information that is not immediately visible and to make use of computational
operations.
Visualisation has always been used for studying phenomena taking place in the
geographic space. It is therefore not surprising that the fundamentals of visual rep-
resentation of information were first formulated by cartographers. Statistical graph-
ics, which appeared in the field of statistics, focuses on representing quantitative
information. The research field of information visualisation emerged as computer
technologies created opportunities for automated generation of visual displays from
data.
Interaction is the way of getting more information from a visual display than is
possible by mere viewing. It is also the “glue” that enables the tight coupling of
computational and visualisation techniques. It can enable the analyst to interactively
request more contextual data on-demand. It can also enable the analyst to make a
selection that feeds into the next analytical step, for instance, run the computational
method on a portion of the data or change a parameter. It enables analysts to inves-
tigate different ways of configuring computational methods and alternative models
for more informed analysis. The increasing feasibility of highly interactive analyti-
cal interfaces is driven by advances in computation speed and software frameworks
for their implementation, particularly on the web.
In the following sections, we shall introduce and discuss the main principles and
techniques of data visualisation and interaction.
3.2 Visualisation
A well-designed visualisation can present data in a way that helps analysts to un-
derstand the data and the phenomenon that the data reflect and promote analytical
reasoning and knowledge building. Statistical graphics is a branch of statistics that
advocates the use of graphics to help characterise statistical properties of data. This
is often illustrated by the well-known Anscombe’s Quartet1. It contains four differ-
ent datasets consisting of pairs of numeric values. These datasets are characterised
by exactly equal summary statistics that are commonly used for summarising dis-
tributions – mean, median, standard deviation, r² – suggesting that the distributions
are comparable. However, the four scatterplots of these data in Fig. 3.1 show that
this is not the case. This carefully-constructed example illustrates that high-level
statistical summaries can be misleading, and that visualisation can provide a richer
summary in an interpretable form.
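Because Anscombe's values are published, this is easy to verify in pure Python; the helper below computes the Pearson correlation from scratch.

```python
import statistics as st

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Anscombe's quartet: datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(st.mean(x), 1), round(st.mean(y), 2), round(pearson_r(x, y), 3))
```

All four rows show mean(x) = 9, mean(y) ≈ 7.5 and r ≈ 0.816, although Fig. 3.1 shows four very different point patterns.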
In the Anscombe’s Quartet, the datasets are small enough to represent every data
item by an individual visual mark. In most real-world datasets there are too many
data items to plot each of them separately. In these cases, computational methods
are required for summarising the data into a manageable set of items that can be
visually represented. One possible way to present a visual summary of a set of
numeric values is a box plot, which is built based on descriptive statistics, namely,
the quartiles of the distribution. An example of a box plot is shown in Fig. 3.2, top
right. The main part of such a plot is a rectangle (i.e., a box) whose two opposite
sides represent the lower and upper quartiles, and a line drawn inside the box parallel
to these sides represents the median.
1 https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Fig. 3.1: These four datasets have identical summary statistics indicating that they
probably have similar distributions but the scatterplots indicate that their distributions
are very different.
A box plot may also have lines extending
from the box indicating variability outside the upper and lower quartiles. These
lines resemble whiskers; therefore, a plot containing such “whiskers” is also called
a box-and-whisker plot. Outliers (determined by means of statistical measures) may
be plotted as individual points; in this case, the “whiskers” show the data range
excluding the outliers.
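A sketch of how the parts of a box plot can be computed; here we assume the common Tukey convention, in which the whiskers extend to the most extreme values lying within 1.5 × IQR of the quartiles (the tip values are invented).

```python
import statistics as st

def box_plot_stats(values, k=1.5):
    """Quartiles, whisker limits, and outliers of a set of numeric values,
    using the common Tukey rule: whiskers extend to the most extreme
    values lying within k * IQR of the quartiles."""
    q1, median, q3 = st.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if lo <= v <= hi]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": min(inside), "whisker_high": max(inside),
            "outliers": [v for v in values if v < lo or v > hi]}

tips = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 10.0]  # invented
print(box_plot_stats(tips))
# The 10.0 tip lies beyond the upper whisker and is reported as an outlier.
```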
The example box plot in Fig. 3.2, top right, summarises the distribution of tips in a
restaurant2. For comparison, the same data are shown in a dot plot on the top left,
where the individual values are represented by dots positioned along a vertical axis.
Horizontal jittering is used to separate multiple dots representing identical or very
similar values and thus having the same vertical position. The same distribution is
also represented by a violin plot in the lower part of Fig. 3.2. The latter can be seen
as a variant of a frequency histogram drawn in a particular way – symmetric and
smoothed.
When data are differentiated by categories (e.g., by gender, occupation, or age
group), it is reasonable to compare the distributions of values of numeric attributes
for these categories. For example, in Fig. 3.3, the box plots summarise the tip
distributions for the lunch and dinner (top) and for the female and male customers
(bottom). The same distributions are represented in more detail in dot plots.
2 https://www.kaggle.com/ranjeetjain3/seaborn-tips-dataset
Fig. 3.2: A box plot (top) and a violin plot (bottom) represent the distribution of the
tip amounts.
All visualisations in Figs. 3.1 to 3.3 represent numbers, either original values of
numeric attributes or summary statistics, by positions along axes; in other words,
they employ the visual variable ‘position’. The different types of display utilise
this variable in different ways and in application to different graphical elements
(visual marks): dots in a dot plot, box plot components (box boundaries, median
line, and whiskers), and curved outlines in a violin plot. Let us consider the whole
set of visual variables and see examples of different ways of using them in the most
common types of visual display.
3.2 Visualisation 55
Fig. 3.3: Comparison of the distributions of the tips for the lunch and dinner (top)
and for the female and male customers (bottom).
Bertin identified two groups of visual variables. The variable ‘position’, corresponding
to the two planar dimensions, determines the placement of marks within the display
space. The remaining visual variables, which were called ‘retinal’
by Bertin, include ‘size’, ‘value’ (lightness), ‘colour’ (hue), ‘texture’, ‘orientation’,
and ‘shape’. The retinal variables are used to represent data through the appearance
of marks.
Figure 3.4 demonstrates the visual variables introduced by Bertin and shows how
each of them can be applied to three types of visual marks, points, lines, and areas.
Table 3.1 illustrates the concepts of visual marks and visual variables by exam-
ples.
Fig. 3.4: Jacques Bertin’s visual variables and how they are applied to three types
of visual marks: points, lines, and areas. Source: [31].
The visual variables differ in their perceptual properties. Values of some variables
can be arranged in a certain logical order. Thus, values of the variable ‘position’
are ordered in the direction of the axis, values of the variable ‘size’ can be ordered
from smaller to bigger, and values of the variable ‘value’ (i.e., lightness) can be
ordered from lighter to darker. For values of other variables, such as colour hue or
shape, there is no logical, intuitively understandable complete ordering, but it may
sometimes be possible to construct an ordered set of selected values. For example,
it is possible to construct a set of textures differing by their density, which can be
ordered from sparser to denser. Values of the variable ‘orientation’ can be arranged
in a cycle from 0° to 360°, which is the same as 0°.
Some of the variables with ordered values also allow judgement of the amount of
difference, or the distance, between any two values. Thus, we can judge the dis-
tance between two positions and the difference between the lengths of two bars or
between the sizes of two squares. However, it is much harder to estimate how much
lighter or darker one shade of grey is from another. Values of the variable ‘size’
allow one to judge not only differences but also ratios; e.g., we can see that one bar
is twice as long as another bar. Judgements of ratios are also possible for positions
when the axis origin is specified; in this case, the ratio of the distances from the
axis origin can be estimated. Bertin did not distinguish between the perception of
amounts of difference, or distances, and the perception of ratios, but applied the same
term ‘quantitative’ to the variables allowing numeric assessments of the differences between the
values.
Fig. 3.5: Selectivity test: Can attention be focused on one value of the variable,
excluding other variables and values? Association test: Can marks with the same
value of the visual variable be perceived simultaneously?
Apart from the ordering, distance, and ratio relationships between values of visual
variables, important perceptual properties are selectivity and associativity, that is,
how easily distinct values of a variable can be perceptually discerned and how eas-
ily multiple marks with the same value of a variable can be perceived all together
and differentiated from the remaining marks. The image in Fig. 3.5 can help you
to understand what these properties mean. To check the selectivity of a visual vari-
able, look at the image and try to focus your attention on one value of the variable,
excluding other variables and values. For example, try to find all letters on the left
half of the image (‘position’), all small letters (‘size’), all pink letters (‘colour’), all
letters K (‘shape’). If such marks are easy to find, the variable is selective.
To check the associativity of a visual variable, try to perceive simultaneously all
marks with the same value of the visual variable. For example, try to see simulta-
neously all letters located on the left half of the image (‘position’), all small letters
(‘size’), all pink letters (‘colour’), all letters K (‘shape’). If such marks are easy to
perceive together as a group, the variable is associative.
The perceptual properties of Bertin’s visual variables are summarised in Fig. 3.6.
Please note that the variable ‘value’ is claimed to be quantitative, which contradicts
what was stated by Bertin. The reason is that the source of the image in Fig. 3.6 is
a web site focusing on map design3 . The variable ‘value’ is extensively used in car-
tography for representing quantitative information, in particular, in choropleth maps.
This is because the variable ‘position’ is used for showing geographic locations, and
it is impossible to apply the variable ‘size’ to geographic areas for showing associ-
ated numeric values without distorting geographic information. Therefore, cartogra-
phy has to employ variables that may not be perfectly suited to representing the type
of information that needs to be shown. Besides, as will be discussed later on, some
limitations of visual variables can be compensated by interactive operations.
3 https://www.axismaps.com/guide/general/visual-variables/
The set of visual variables originally introduced by Bertin was later extended by
other researchers. Among others, they introduced dynamic visual variables, such
as motion and flicker [60], which can be used in displays shown on a computer
screen. The “new” variables are not used as actively as the “old” ones from Bertin.
An exception is, perhaps, display animation, which is quite popular. It can be said
that an animated display employs the visual variable ‘display time’ whose values
are the time moments or intervals at which different states of the display (frames)
are presented. Obviously, ‘display time’ is ordered and allows quantitative judge-
ments.
A frequently encountered extension of the original set of visual variables is the use
of the third display dimension in addition to the two planar dimensions considered
by Bertin. It is used in perspective representations of 3D displays on a computer
screen, such as the space-time cube in Section 1.2.7, and also in virtual 3D envi-
ronments. These displays involve three instances of the visual variable ‘position’
(certainly, the other visual variables can be used as well, like in two-dimensional
displays).
As we have seen, values of some visual variables are linked by relationships of or-
dering, distance, and/or quantitative difference (i.e., ratio). As we discussed in the
previous chapter, the same types of relationships can exist between elements of com-
ponents of an analysis subject and, consequently, between corresponding elements
of data reflecting this subject. The fundamental rule of visualisation is to strive to
represent components of data by visual variables so that the relationships existing
between the values of the visual variables correspond to the relationships existing
between the elements of the data components. For example, values of a numeric at-
tribute should be represented using the variable ‘position’ or ‘size’, whereas ‘value’
is less suitable, and ‘colour’ and ‘shape’ are completely unsuitable.
Bertin characterised the perceptual properties of visual variables in terms of the
levels of organisation: associative (the lowest level), selective, ordered, and quan-
titative (the highest level). The general principle of data presentation, according to
Bertin, is: “The visual variables must have a level of organisation at least equal
to that of the components which they represent”. The level of organisation, also
known as scale, of a data component may be nominal, ordinal, or numeric. Asso-
ciative and selective variables correspond to the nominal scale, ordered variables to
the ordinal scale, and quantitative variables to the numeric scale. If, for example,
the goal is to represent values of a numeric attribute graphically so that a viewer can
extract ratios from the visualisation, for example, to see immediately that one value
is twice another, one must choose a visual variable with quantitative organisation,
that is, either ‘position’ or ‘size’.
Another consideration in choosing visual variables for representing data concerns
the “length” of the visual variables. The length is the number of categories or steps
that can be distinguished visually, for example, discernibly different colours or light-
ness levels. A visual variable should have a length equal to or greater than the num-
ber of distinct values in the component that it represents. If the length of the variable
is insufficient, the observer will perceive some of the different data values as being
identical.
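For instance, if a numeric attribute must be shown with a visual variable offering only k discernible steps (say, k lightness levels), its values have to be grouped into k classes; a minimal equal-interval classing sketch (the density values are invented):

```python
def classify(values, k):
    """Map numeric values to k classes (e.g., k discernible lightness
    levels) using equal-interval classing."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1       # guard against zero width
    return [min(int((v - lo) / width), k - 1) for v in values]

population_density = [3, 15, 42, 77, 120, 250, 940]   # invented values
print(classify(population_density, 5))  # prints: [0, 0, 0, 0, 0, 1, 4]
```

With five steps, the seven distinct values occupy only three classes, so several different data values become visually identical.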
Tables 3.2 and 3.3 summarise perceptual properties of visual variables, as proposed
by Bertin [31] and updated by Green [60] from a psychologist’s perspective and by
MacEachren [93] as a cartographer. Further details and discussion on this can be
found in [17], section 4.3.
While the visual variable ‘display time’ used in animated displays seems to have
a high level of organisation, its expressive power is strongly reduced by the fact that
the viewer cannot see all information at once. Comparison between display states
and extraction of patterns rely on the viewer’s memory, which has limited capacity.
Due to the short duration of presenting each state, the viewer’s attention can miss
important changes; this phenomenon is known as “change blindness”4. However,
animation is not always a poor choice [52]. Animation may be good for repre-
senting continuously developing phenomena and processes, when the changes from
one state to another are gradual rather than abrupt or chaotic (although abrupt and
chaotic changes can be noticed, the viewer will hardly be able to link them into
any kinds of patterns). Animation can help viewers perceive the general trend of
the development and may also be good for revealing coherent changes at multiple
locations or coherent movements of multiple objects. Generally, animation may be
suitable for supporting an overall view of how something develops over time, but it
is much less supportive for tasks requiring attention to details and detection of even
small changes.
The position in the display, or the dimensions of the display, are the most important
visual variables having the greatest expressive power regarding both the organisa-
tion level and the length, which is limited only by the display size and resolution.
Accordingly, the display space is a valuable resource for visualisation. To increase the capacity of
the display space, different arrangements of marks or complex visual objects com-
prising multiple marks are frequently used in constructing visual displays, including
the following:
• Space partitioning: often called juxtaposition, or “small multiples”, this tech-
nique is used for comparing pieces of information related to different locations,
times, objects, categories, attributes, etc., or for showing alternative views of the
same data. For example, the display space in Fig. 2.4 is partitioned to accommo-
date four frequency histograms of different attributes. In Fig. 3.2, top, the display
space is divided between two alternative views, and in Fig. 3.3, we see space di-
vision providing alternative views and enabling comparison of categories.
• Space embedding: used for creating complex representations. A widely known
example is diagrams on a map: a diagram has its internal space, which is embed-
ded in the space of the map (see Fig. 1.8).
• Space sharing, a.k.a. superposition, or overlay: representation of two or more
data components in a common display space, e.g., multiple information layers
on a map, multiple comparable attributes or multiple time series in the same
coordinate frame, etc. The lower part of Fig. 3.2 contains an example of a violin
plot and a box plot sharing the same display space.
4 https://en.wikipedia.org/wiki/Change_blindness
• Fusing data components: use of two or three display dimensions for represent-
ing a larger number of data components, as introduced in Section 2.3.5 and will
be discussed in more detail in Section 4.4. An example can be seen in Fig. 2.10,
where the 2D display space is used to represent combinations of values of multi-
ple attributes.
As we mentioned in Section 2.3.5, human vision is especially well adapted to per-
ceiving spatial patterns, such as groups of close objects, linear arrangements, and
other shapes composed of multiple objects. Regarding visual displays, these ca-
pabilities are supported by the strong associative power of the variable ‘position’.
Hence, this variable should be the primary choice for representing distributions and
co-distributions. Many examples of displays representing distributions in Chapters 1
and 2 employ the variable ‘position’. Temporal distributions are portrayed by time
histograms, where times are represented by positions along the x-axis. Spatial distri-
butions are viewed using maps, where spatial locations of objects are represented by
positions on the maps. Frequency distributions are shown in frequency histograms,
where the positions on the x-axis correspond to attribute values. For spatio-temporal
distributions, space-time cubes are used, where the positions in the display represent
spatial locations and times. A scatterplot represents a co-distribution of two numeric
attributes; the positions in the display correspond to value pairs. Similarly, a projec-
tion plot representing a result of spatialisation supports perception of spatial patterns
made by positions of marks corresponding to complex data items.
It is important to understand that perception and interpretation of spatial patterns
formed by visual marks in a display makes sense only when the distances between
the positions of the marks represent relevant relationships between the correspond-
ing data items. If it is not so, any spatial patterns that can occasionally emerge in
such a display may be misleading and need to be ignored. Let us consider, for exam-
ple, a dataset consisting of texts that include various keywords. In the distribution
of the keywords over the texts, both the base (the set of texts) and the overlay (i.e.,
the keywords occurring in the texts) consist of discrete entities that are not linked by
any intrinsic relationships. According to Bertin’s principle “The visual variables
must have a level of organisation at least equal to that of the components which they
represent”, the visual variable ‘position’ can be used to represent such components.
However, the distances between the positions assigned to the elements will not cor-
respond to any intrinsic relationships between the elements, and any possible spatial
patterns that are based on distances, such as spatial clustering, density, or sparsity,
will have no meaning.
A visualisation technique that is often used for representing frequency distributions
of keywords in texts is the word cloud. We used such displays in Chapter 1; see, for
example, Fig. 1.2. In this visualisation, word frequencies are represented using the
visual variable ‘size’ (namely, font size). In our examples in Chapter 1, the words
were arranged in the order of decreasing frequency, which means that the visual
variable ‘position’ was also used for representing the word frequencies. This is an
example of a redundant use of two visual variables to represent the same infor-
mation for improving display readability or convenience of use. In this case, the
variable ‘position’ conveys meaningful information (frequency rank), and the dis-
tances thus also have meanings (differences in the frequency ranks). Hence, spatial
patterns, such as sequences of closely positioned words, are meaningful as well.
However, in the general word cloud technique and its widely available implementations,
the words are usually not arranged in any meaningful order. Rather, so-called
space-filling layouts are applied to pack more words into the limited display space.
In such layouts, the spatial positions of marks and the spatial relationships between
them have no meaning; therefore, viewers should ignore any
spatial patterns that may emerge.
Hence, we have discussed two fundamentally different approaches to utilising the
display space. One approach is to employ the visual variable ‘position’ for
representing elements of data components and/or relationships between them. It is especially
appropriate to use this variable for representing data components with intrinsic dis-
tance relationships between elements. ‘Position’ is also suitable for representing
sets of data items linked by other kinds of relationships that can be expressed as dis-
tances (e.g., social links between people). The other approach is to fill the display
space with visual marks so that their positions do not represent any semantically
meaningful information. In fact, the visual variable ‘position’ is not utilised in this
case, and any patterns formed by positions of marks in the display need to be ig-
nored.
There is also a third approach: create a spatial layout that represents some kind
of relationships between items (e.g., hierarchical) but does not exploit spatial dis-
tances between marks for this purpose. In this case, meaningful spatial patterns in
the display can be formed by configurations of visual marks. An example is a nested
layout, in which smaller marks are put inside bigger ones to represent part-whole
relationships.
In the next section, we shall consider the standard, widely used visualisation tech-
niques, which use the display space and the other visual variables in different ways
for representing different kinds of information.
In this section, we list commonly used visualisations. They are easy to find in a variety of implementations, including Python, R, JavaScript, and Java libraries.
64 3 Principles of Interactive Visualisation
Bar chart: Each bar represents a discrete entity or category, and the length of the bar
represents a numerical value that relates to that entity or category. Where categories
are ordinal (i.e., have an intrinsic order), one would usually put the bars in this order,
but there are reasons why they might be ordered differently, for example, in the order
of decreasing or increasing bar lengths.
Pie chart: can represent the same information as a bar chart when the categories are parts of some whole. While bar lengths can be compared more precisely than the angular sizes of pie sectors, the latter better support estimating the fractions of the whole covered by the components.
Dot plot: depicts a distribution of numeric values along an axis by representing each
value by a dot. When the density of the dots is very high, overlapping of dots can be
decreased by jittering in the dimension perpendicular to the axis, as in Figs. 3.2 and
3.3. The use of semi-transparent dots is helpful, but the maximum opacity may be
reached very quickly. Interactively altering transparency is useful.
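The jittering idea can be sketched without any plotting library: each dot keeps its exact coordinate on the value axis, and only the perpendicular (non-meaningful) dimension absorbs a small random offset. The function name and offset range below are illustrative, not taken from any particular package.

```python
import random

def jitter_offsets(values, spread=0.4, seed=42):
    """Return a random perpendicular offset for each value.

    The offsets are drawn uniformly from [-spread, +spread]; the
    value-axis coordinates themselves stay untouched, so jittering
    never distorts the meaningful positions of the dots.
    """
    rng = random.Random(seed)
    return [rng.uniform(-spread, spread) for _ in values]

temperatures = [8.1, 8.3, 8.3, 8.4, 9.0, 9.1, 9.1, 9.2]
offsets = jitter_offsets(temperatures)
# Each (value, offset) pair would be drawn as one semi-transparent dot.
points = list(zip(temperatures, offsets))
```

A fixed seed makes the layout reproducible, which matters when a view is redrawn during interaction: the dots should not jump to new random positions on every refresh.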
Line graph, or line plot, or line chart: represents a series of numeric values by a
sequence of points positioned along an axis with the distances from the axis propor-
tional to the values; the consecutive points in the sequence are connected by lines.
Unlike a bar chart, a line chart emphasises that the data refer to some ordered and,
usually, continuous base, such as time (see time graph). An example of data with
a non-temporal base may be data describing the dependency of the electric conduc-
tivity of some material on its temperature.
Heat map, or heatmap: Representation of variation of values of a numeric attribute
along an axis or over a 2D display space by encoding the values by colours or de-
grees of lightness. The colouring or shading is applied to uniform elements (pixels)
into which the display space is divided. A typical example is cells of a matrix. The
heatmap technique can be applied, in particular, on a cartographic map, in which
the space is divided into uniform compartments by means of a regular rectangular
or hexagonal grid. Examples of heatmaps can be seen in Fig. 4.10.
Histogram (frequency histogram) (Fig. 2.4): depicts a distribution of numeric attribute values by binning the data along an axis and depicting frequencies with bar lengths. Altering the bin size is a useful interaction (Fig. 2.5). Because aggregation is used, it scales better than dot plots. Bars can be divided into segments to show frequencies for subsets of data items, e.g., according to some categories or classes identified by hues of the segments.
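The effect of altering the bin size can be reproduced with plain counting; the helper below is a minimal sketch of the aggregation behind a histogram, not the API of any specific charting library.

```python
import math
from collections import Counter

def bin_counts(values, bin_size, origin=0.0):
    """Count how many values fall into each bin of width bin_size.

    Returns a dict mapping the lower edge of each bin to its frequency;
    drawing a bar per entry, with length proportional to the count,
    yields a frequency histogram.
    """
    counts = Counter()
    for v in values:
        edge = origin + math.floor((v - origin) / bin_size) * bin_size
        counts[edge] += 1
    return dict(counts)

values = [1.2, 1.9, 2.1, 2.2, 3.7, 4.0, 4.1, 9.5]
small_bins = bin_counts(values, bin_size=1)   # fine-grained view, many bars
large_bins = bin_counts(values, bin_size=5)   # coarse view, fewer bars
```

Recomputing the counts with a different `bin_size` is exactly the interaction discussed above: the data do not change, only the aggregation granularity does.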
Violin plot (Fig. 3.2, bottom): The idea is similar to frequency histogram: one axis
represents the range of attribute values, and the other display dimension is used
for representing corresponding frequencies by distances from the axis. Instead of
discrete bars with their lengths, the frequencies are represented by the distances
between two continuous curves symmetric with respect to the attribute axis.
Box plot, a.k.a. box-and-whiskers plot (Figs. 3.2, 3.3): summarises a distribution
of numeric values by depicting the quartiles, which are particular values dividing the
ordered sequence of the values occurring in the data into four approximately equal
parts. The graph is drawn along an axis representing the value range. The lower
(first) and upper (third) quartiles (Q1 and Q3) are represented by the positions of the
two sides of the box and the median (i.e., the second quartile, Q2) by the position of
the line inside the box. The whiskers can connect the box sides to several possible
alternative values, such as the minimum and maximum of the data, the 9th percentile
and the 91st percentile, the 2nd percentile and the 98th percentile, or the lower fence
(= Q1 − 1.5 · IQR) and the upper fence (= Q3 + 1.5 · IQR), where IQR = Q3 − Q1 is
the interquartile range. When a whisker does not reach the minimum or maximum,
the graph may also include dots to depict outliers, as in the box plot superposed on
the violin plot in Fig. 3.2, bottom.
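The quantities drawn in a box plot can be computed directly; the sketch below uses the lower/upper fences as whisker limits. Note that quartile conventions differ between libraries, so the exact numbers may vary slightly from a given implementation.

```python
import statistics

def box_plot_summary(values):
    """Compute the elements drawn in a box-and-whiskers plot.

    The box spans Q1..Q3 with a line at the median; here the whiskers
    end at the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR, and values beyond
    the fences are reported as outliers (drawn as individual dots).
    """
    q1, q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"Q1": q1, "median": q2, "Q3": q3, "IQR": iqr,
            "fences": (lower_fence, upper_fence), "outliers": outliers}

summary = box_plot_summary([2, 3, 3, 4, 4, 4, 5, 5, 6, 30])
```

For this small sample the extreme value 30 falls beyond the upper fence, so a plotting library using this convention would end the whisker at 6 and draw 30 as a separate outlier dot.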
Time graph or time plot: a line chart representing time-referenced numeric data. In this case, the axis along which the points are placed represents time.
2D time chart: represents time-referenced values of an attribute regarding the
linear and cyclic components of time or two time cycles. It is a matrix with one
dimension corresponding to one time cycle (e.g., hours of a day) and the other to the
linear component consisting of repetitions of this cycle (e.g., multiple consecutive
days) or to another time cycle, e.g., days of a week. The values of the attribute are
represented by marks in the cells of the matrix.
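The rearrangement behind a 2D time chart is a simple reshape of the linear series into a cycle-by-repetition matrix. The sketch below assumes the series starts at the beginning of a cycle and covers whole cycles; the function name is illustrative.

```python
def to_time_matrix(values, cycle_length=12):
    """Arrange a linear time series into a matrix of cycles.

    Row i holds all values for position i within the cycle (e.g. month
    i of each year); column j holds one full repetition of the cycle
    (e.g. year j). Assumes len(values) is a multiple of cycle_length.
    """
    n_cycles = len(values) // cycle_length
    return [[values[j * cycle_length + i] for j in range(n_cycles)]
            for i in range(cycle_length)]

# Two years of monthly values (illustrative numbers, not real data)
series = list(range(24))
matrix = to_time_matrix(series)
```

Reading along a row compares the same month across years; reading down a column compares the months within one year, which is exactly what the two display dimensions of the chart support.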
Timeline view: shows occurrences of some events or activities over time [117]. One
of the display dimensions (most often horizontal) represents time, and the events
are represented by bars or other symbols placed along the time axis according to
the times when the events happened. The other display dimension is used for dis-
tinguishing between events or activities that occur in parallel or overlap in time. A
Gantt chart, which is commonly used for portraying activities in project manage-
ment, is a form of timeline in which the bars representing the activities may be con-
nected by lines or arrows representing relationships, such as dependencies.
Time histogram (Figs. 1.5, 1.6, 2.6 top): depicts a distribution of discrete entities
(events) over time. Contains an axis representing times and a sequence of bars cor-
responding to time intervals; the bar lengths are proportional to the numbers of the
entities or events that existed or occurred within the time intervals. The bars can be
divided into distinctly coloured segments corresponding to different categories of
the entities or events (Fig. 1.14). A 2D time histogram (Fig. 2.6, bottom) portrays
a temporal distribution with respect to a recurring time cycle. Similarly to a 2D time chart, it is a matrix with one dimension corresponding to a time cycle (e.g., hours
of a day) and the other to a sequence of repetitions of this cycle (e.g., multiple consecutive days) or to another time cycle, e.g., days of a week. The counts of entities
or events are represented by proportional sizes of bars or other kind of marks in the
cells of the matrix.
Dot map: depicts geographic distributions of discrete entities, which are represented
by dots positioned on a map according to the spatial locations of the entities (e.g.,
as in Fig. 1.7). It can be seen as a 2D version of a dot plot. Because of the nature of
geographical data, occlusion is often a significant problem.
Choropleth map (Fig. 2.8, Fig. 3.9, top): portrays the spatial distribution of attribute
values referring to discrete geographical areas, such as units of a territory division.
It is common to use the visual variable ‘colour’ (i.e., hue) to represent values of a
categorical (qualitative) attribute and the variable ‘value’ (i.e., lightness) to represent values of a numeric attribute. A common criticism is that the size of an area, while having little to do with the data, affects its visual salience and may introduce bias. Diagram maps or area cartograms (see later) are possible alternatives.
Diagram map, or chart map (Fig. 1.8): represents a spatial distribution of values
of one or several numeric attributes referring to particular geographic locations or
areas. The values are represented by proportional sizes of symbols, such as circles
or rectangles, or components of diagrams, such as bars of bar charts or sectors of
pie charts. When the values refer to areas, the symbols or diagrams are drawn inside
the areas.
Area cartogram: A map-like display where 2D positions, shapes, and sizes
of geographic areas are distorted in order to represent by the sizes some numeric
values. An example is a population cartogram, in which areas of countries or regions
are sized proportionally to their population. Such representations may be beneficial
for portraying phenomena related to population, such as the poverty rate.
Scatterplot (Fig. 2.9): has two perpendicular axes representing value ranges of two
numeric attributes. Pairs of attribute values are represented by dots in this coordinate
system. See Section 2.3.4 for a discussion of possible patterns in a scatterplot and
their interpretations.
Scatterplot matrix: A tabular arrangement of multiple scatterplots representing dif-
ferent attribute pairs. Often, the diagonal is used to depict the distributions of the
individual attributes, e.g., by frequency histograms.
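The panel layout of a scatterplot matrix is just an enumeration of attribute pairs plus the diagonal; the sketch below only builds that layout, leaving the actual drawing to whatever plotting library is used. The function name is illustrative.

```python
def scatterplot_matrix_layout(attributes):
    """Enumerate the panels of a scatterplot matrix.

    Returns (diagonal, off_diagonal): each diagonal cell (i, i) shows
    the distribution of a single attribute (e.g. a histogram); each
    off-diagonal cell (i, j) shows a scatterplot of attribute i
    against attribute j.
    """
    diagonal = [(i, i, a) for i, a in enumerate(attributes)]
    off_diagonal = [(i, j, (attributes[i], attributes[j]))
                    for i in range(len(attributes))
                    for j in range(len(attributes)) if i != j]
    return diagonal, off_diagonal

diag, off = scatterplot_matrix_layout(["age", "income", "height"])
```

For k attributes this yields k diagonal panels and k·(k-1) scatterplots, which is why the technique stops scaling beyond a dozen or so attributes.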
Although there are many standard visualisations, you will not always find one that is suitable for your data and analysis goals. You may need to design your own displays, and you will want them to be effective and not misleading. Based on the literature and our own experience, we can suggest the following principles for creating good data visualisations:
1. Utilise space first: Taking into account the analysis tasks, represent the most
relevant data components by display dimensions (variable ‘position’) or suitable
spatial layouts depending on the intrinsic properties of the components or task-
relevant relationships between data items.
2. Respect properties: Properties of the visual variables must be consistent with
the properties of the value domains of the data components.
3. Respect semantics: Data and corresponding phenomena may have characteris-
tics and relationships that are not explicitly specified but supposed to be known
to everybody or to domain specialists. Domain knowledge and common sense
should be used in creation of visual displays.
4. Enable seeing the whole: Design the visualisation so that it facilitates gaining an
overview of all data items that are shown. Care about completeness (all relevant
items should be present in the display while overlapping and visual clutter should
be avoided) and possibilities for associating multiple items in groups, which is a prerequisite for extracting patterns.
Fig. 3.7: Bad design: Pie charts, either plain (left) or fancy 3D (right), are not applicable to data lacking part-of relationships.
(e.g., prevailing direction of the wind or traffic movement) would be preferable over
the diverging colour scale, however.
Fig. 3.8: Good design: Line chart (left) shows the 10-year dynamics of € to $ exchange rates. Bad design: Radar chart (right) representing the same data delivers a false cyclic pattern.
Fig. 3.9: Two maps show the same attribute (proportion of female population in
London wards) with a diverging colour scale (top), which is appropriate for these
data, and a rainbow colour scale (bottom), which is not appropriate.
the overall pattern, if any, and exceptions, or abnormalities. A good visual represen-
tation helps to confirm expected and discover unexpected patterns.
Understanding the distribution: Visualisation is often much more informative and
understandable than numerical summaries of data. For numeric data, visual repre-
sentations of distributions can effectively indicate the type of the distribution, e.g.
normal, skewed, bimodal, logarithmic, as well as reveal outliers. Spatial and tem-
poral distributions cannot be adequately expressed through any numeric summaries
and require visual representations for understanding.
Seeing relationships: Visual representation of multiple data items that respects in-
trinsic properties of the data (Section 3.2.5) can enable the viewer to perceive im-
mediately a multitude of relationships between all these items, such as their relative
positions in space and/or time. Computational extraction and expression of these
relationships would require more time, and the results would be much harder to
interpret.
Comparison: A well-designed graphical encoding facilitates comparisons much
more effectively than a table of numbers. Gleicher et al. [55] identify three visual
methods of comparison: juxtaposition, superposition and explicit encoding. Aligned
juxtaposition places visual representations next to each other applying the strategy
of display space partitioning (Section 3.2.3). Juxtaposition can be good for com-
parison of overall patterns in two or more distributions. Superposition facilitates
comparison by overlaying things to be compared in the same display space (space
sharing strategy), such as overlaying multiple annual trends for different years. This
technique can be good for detailed exploration of commonalities and differences.
Explicit encoding of distinctions is where a comparison metric is calculated (e.g.,
difference) and visually encoded. For example, original numeric values can be com-
pared to some baseline such as the value for a particular year or a global or local
average. This technique, however, should be used as a complement rather than re-
placement to showing the original data.
Discovering the unexpected: Faithful visual encoding of data as they are (letting
the data speak for themselves [58]) can help identify patterns and properties of the
data that we did not necessarily expect, whereas numeric summaries and measures are often based on certain assumptions and may be meaningless or misleading when these
assumptions do not hold. For example, the statistical mean and standard deviation
are informative only for a normal distribution.
In using visualisation methods, their limitations need to be taken into account. Some
of the limitations are related to the increasing volumes of data. It is generally recog-
nised that visualisation alone is usually not sufficient for data analyses in real-world
3.3 Interaction
Although a picture may be worth a thousand words, a single static picture is in most
cases insufficient for a valid analysis and for understanding of a complex subject.
It is usual that an analyst needs to see different aspects or parts of data and look at
the data from different perspectives. This means that the analyst needs to interact
with the data and with the system that generates visual displays of the data: select
data components and subsets for viewing, select and tune visualisation techniques,
transform the views, transform the data, and so on. We can distinguish the following
types of interaction with graphical representations:
1. Changing data representation.
2. Focusing and getting details.
3. Data transformation.
4. Data selection and filtering.
5. Finding corresponding information pieces in multiple views.
Changing visual representation is useful for looking at the same data from differ-
ent task-specific perspectives. Let us illustrate this with an example of a time series of 21 years of monthly temperature data collected at the Tegel airport in Berlin, Ger-
many. Weather measurements are usually done at high temporal resolution (e.g.,
every hour), but for analysing long-term patterns the data are aggregated, e.g.,
into monthly averages of the daily minimum, maximum and average temperatures.
In this illustrative example, we consider a single attribute, TNM, which represents
monthly averages of the daily average temperatures. Figure 3.10 shows this time se-
ries on a time plot. The time plot clearly shows annual repetition of similar seasonal
patterns. Accordingly, it is useful to change the representation and look at the same
data with regard to two components of time, the months of a year and the sequence
of the years. This view is shown in Fig. 3.11 by 2D time charts where the vertical
dimension represents the months of a year and the horizontal dimension represents
the sequence of the years. The charts employ three variants of rendering: horizontal
and vertical bars (left and right) for enabling comparisons across the years and the
months, respectively, and squares (centre) for enabling comparisons across months
and years.
Fig. 3.10: Time plot of monthly averages of average daily temperatures at the Tegel
airport (Berlin, Germany).
Fig. 3.11: The same data as in Fig. 3.10 are represented, with regard to time cycles,
in a 2D time chart with columns for the years 1991-2011 and rows for the months
01-12 (January to December). Three variants of rendering emphasise annual (left),
overall (centre) and monthly (right) patterns.
In these two figures (3.10 and 3.11), we applied the most expressive visual variables
‘position’ and ‘size’ for representing the TNM dynamics over time. These visual
variables work well when the display space has sufficient size, i.e., provides enough
pixels for distinguishing small values from large. They may be less suitable when
the display space is small, e.g. as in charts on a chart map. In this case, it may be
better to represent the attribute values by means of a colour scale combining the
colour value and hue, as shown in Fig. 3.12. Theoretically, it could be sufficient
to use only the variable ‘value’, encoding the attribute values linearly by shades
from light to dark. However, such encoding is sensitive to outliers: the maximum
darkness is assigned to one or a few very high values, whereas the shades assigned
to the bulk of the values are hardly distinguishable. Besides, when some value has
a special meaning, such as the temperature 0°, it is advisable to encode the values using a diverging colour scale with distinct colour hues representing values below
and above the special value. In the first chart in Fig. 3.12, shades of blue are used for
the values below zero and shades of red for the values above zero, while the values
around zero are shown in light yellow.
Please note that the attribute values are not directly translated into colours or shades,
but the value range of the attribute is divided into intervals, and the colours are as-
signed to the intervals. Such intervals are called classes or class intervals in cartog-
raphy. This approach may work better than a direct continuous encoding, because human eyes
are not able to distinguish small variations of shade. Another advantage is the possi-
bility to reduce the impact of outliers. The division into class intervals can be done
in many different ways using task-relevant criteria. Figure 3.12 demonstrates differ-
ent variants of division applied to the same data. The first two are human-defined
and the remaining two were created automatically. In particular, a division that cre-
ates approximately equal classes (in terms of the number of data items per class) is
called equal-frequency intervals.
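Equal-frequency intervals can be computed from the sorted data with quantile breaks. The sketch below returns the inner break values; together with the minimum and maximum they delimit the classes. The function name is illustrative, and the temperature values are invented for demonstration.

```python
def equal_frequency_breaks(values, n_classes):
    """Compute class-interval breaks so that each class holds
    approximately the same number of data items (quantile breaks).

    Returns the n_classes - 1 inner break values. Ties in the data
    can make the resulting class sizes only approximately equal.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_classes] for k in range(1, n_classes)]

temperatures = [-3, -1, 0, 2, 4, 5, 7, 9, 12, 15, 17, 19]
breaks = equal_frequency_breaks(temperatures, n_classes=4)
```

Each of the four classes then receives three of the twelve values, so every colour in the display represents roughly the same amount of data, which reduces the impact of outliers on the colour scale.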
Fig. 3.12: 2D time charts, also called “mosaic plots”, represent the same data as
in Figure 3.11. Different divisions of the attribute values into class intervals are
applied, two variants of human-defined breaks and automatic divisions into 5 and 7
equal-size classes, i.e., intervals containing approximately equal numbers of values.
So far, we have looked at the temporal distribution of the attribute values. Some
tasks may require consideration of the frequency distribution of the values. A fre-
quency histogram is suitable for supporting such tasks (Figure 3.13). Similarly to
what was discussed in Section 2.3.3, we observe that some patterns either appear or
disappear depending on the selected bin size. For example, the bi-modal character
of the distribution with two peaks is clearly visible in the variants with small bins,
but this pattern disappears in the variant with large bins in the bottom-right image.
However, the latter shows a pattern of decreasing frequencies as the attribute values
increase, which is not so clearly visible in the other histograms.
A frequency distribution of numeric values can also be visualised in a way that is
less dependent on the bin size, as shown in Fig. 3.14. This representation is called
Fig. 3.13: Frequency histograms of the temperature data with the bin sizes equal to
0.5, 1, 2, 3, 4, and 5 degrees Celsius, correspondingly (left to right, top to bottom).
cumulative frequency curve. The x-axis of the display represents the value range
[min, max] of the attribute, as in a histogram. Each position on the x-axis represents
a certain value v of the attribute, min ≤ v ≤ max. The corresponding vertical position
is proportional to the cumulative frequency of all values from the interval [min, v].
It can be seen that the shape of the curve does not change when the division of the x-axis into bins changes.
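The curve is determined by the data alone, without any bin size; a minimal sketch of the computation (with an illustrative function name):

```python
def cumulative_frequency_curve(values):
    """Return the points (v, F(v)) of the cumulative frequency curve,
    where F(v) is the fraction of values less than or equal to v.

    Unlike a histogram, this representation needs no bin size: sorting
    the data fully determines the curve.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

curve = cumulative_frequency_curve([3, 1, 4, 1, 5])
```

The vertical grid lines in Fig. 3.14 only change how the curve is read, not its shape, because the curve itself is independent of any division of the axis.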
Fig. 3.14: Cumulative frequency curves for the same data as in Fig. 3.13. Grey
vertical lines are spaced by steps of 1◦ (left) and 2◦ Celsius (right).
The interactions changing the visual representation of data demonstrated in this sec-
tion are suitable for looking at data under analysis from different perspectives. In the
illustrative example, we explored the data from the perspectives of linear time, cyclic
time, and the frequency of the values. To adjust visual representations to properties
of the data, properties of the display medium, and analysis tasks, display modifi-
cation operations may be applied, such as changes of display size and aspect ratio, scales, colours, class intervals, and other parameters of a visualisation.
It is now a standard that graphical displays provide access to exact data values
upon pointing at visual marks. Mark-related information is often shown in a popup
window (Figure 3.15). It may include not only the data values but also additional
information, such as references, details of calculation, etc. Popup windows can also
contain graphical representations, e.g., showing more details or additional related
data.
Fig. 3.15: A popup window activated by mouse hovering shows attribute values and
details of their processing.
Other operations that are used for seeing more details are zooming and panning,
which allow a contiguous part of the display to be enlarged, so that the data it con-
tains can be presented in an increased visual resolution. An example is shown in
Fig. 3.16, where a fragment of a long time series is shown in higher detail than
in Fig. 3.10. There are implementations employing focus+context techniques, in
which a portion of data is shown in maximal detail, while the remaining data are
also presented in the display but shown in a less detailed manner. Examples are time
intervals before/after a selected interval, spatial information around a selected area,
etc.
In representing values of a numeric attribute by means of a colour scale (i.e., by
shades of one or two colour hues), colour re-scaling can be used to increase vi-
sual discrimination of values in a particular range. One possible approach is to use
the full colour scale for the selected range and hide the values beyond this range
(Figure 3.17, top right). Another approach may be called “visual comparison”: it
converts a sequential colour scale to a diverging one by introducing a user-selected
reference value (Figure 3.17, bottom), so that the colours show the differences of
all values in the display to the reference value. The reference value can be specified
explicitly as a number, or implicitly as a value associated with a selected reference
object. This operation can be performed, for example, by clicking on the reference
object on a map or another graphical display. These techniques were introduced
for choropleth maps in interactive cartography and geovisualisation [7]. In a sim-
Fig. 3.16: Time plot shown in Fig. 3.10 has been zoomed in to show the data for two selected years.
ilar way, re-scaling can be applied to other visual variables, for example, ‘size’ in
bar charts. Another possibility for interactive colour re-scaling is converting from a
continuous to a discretised representation of the data by introducing class intervals;
see Figure 3.12 and the corresponding discussion.
Ordering is an interactive operation that is often applied to matrices and tables.
An example is shown in Fig. 3.18, where ordering of the rows by the values of
one of the presented attributes reveals the relationships of this attribute with other
attributes. It may be useful to apply reordering to axes in parallel coordinates plots,
components of a scatter-plot matrix, table columns, slices of pie charts, and other
kinds of display components.
In the process of exploring data with the use of visual displays, different motives for
transforming the data may arise: to get a clearer view, to simplify the display or the
data, to reduce the amount of data, to disregard excessive details and facilitate ab-
straction, and others. Some data transformation operations may be intertwined with
operations for display modification. Thus, the division of a range of real numbers
into class intervals applied in the 2D time charts in Fig. 3.12 and in the choropleth
maps in Fig 3.9 transforms the data from a continuous range of real values to a finite
sequence of classes. This kind of transformation is called discretisation.
Logarithmic transformation can be applied to values of a numeric attribute when
most of the values are small, but there are a few high values. After such transfor-
mation, the differences between the small values can be represented visually with
higher expressiveness; however, the representation will be harder to interpret than a
display of the original values.
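The trade-off can be seen in a few lines; the data below are invented to show one extreme outlier among mostly small values.

```python
import math

def log_transform(values):
    """Apply a base-10 logarithmic transformation to positive values.

    Small values are spread apart while a few very large values are
    pulled in, so they no longer dominate the display; the cost is
    that the shown numbers are in log units, which are harder to read
    back than the original values.
    """
    return [math.log10(v) for v in values]

# Mostly small values with one extreme outlier
raw = [1, 2, 5, 10, 10000]
transformed = log_transform(raw)
```

In the original scale, the value 10000 would compress the first four values into a tiny portion of the axis; after the transformation they occupy most of the range 0 to 4.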
Fig. 3.18: Ethnic structure of the population of the London wards. The rows (each
having one pixel height) are ordered by the proportions of the Asian population.
series of the differences to the previous month values. To take into account the peri-
odicity of the variation, it is useful to perform comparison to the values for the same
month a year earlier, as on the right image of Fig. 3.19.
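Both transformations are differencing with different lags; the sketch below uses invented monthly values with a repeating seasonal shape, so the lag-12 differences isolate the year-to-year change.

```python
def differences(series, lag=1):
    """Transform a time series into differences to the value lag
    steps earlier.

    For monthly data, lag=1 gives month-to-month changes, while
    lag=12 compares each value to the same month one year earlier,
    which removes the seasonal cycle. The result is shorter than the
    input by lag items.
    """
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

monthly = [2, 4, 8, 3, 2, 5, 9, 4, 3, 6, 10, 5,   # year 1 (invented)
           3, 5, 9, 4, 3, 6, 10, 5, 4, 7, 11, 6]  # year 2 (invented)
month_over_month = differences(monthly, lag=1)
year_over_year = differences(monthly, lag=12)
```

Here every value in year 2 is exactly one unit above the same month of year 1, so the lag-12 series is constant even though the lag-1 series still oscillates with the season.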
Fig. 3.19: Time plots obtained by transforming the data presented in Fig. 3.10 into
the differences to the previous month and previous year values.
Data selection (also called querying) and/or data filtering are necessary opera-
tions for detailed exploration of particular portions of data. Selection means taking
a portion of data in order to visualise this portion only. Filtering means temporar-
ily removing the data that are currently out of interest from already existing visual
displays. For both selection and filtering, the analyst needs to specify the properties
of the data of interest. Selection or filtering can be based on the following crite-
ria:
• attribute values (attribute-based filter);
• position in time, either along the time line or in a temporal cycle (temporal filter);
• position in space (spatial filter);
• references to particular entities (entity-based filter);
• relationships to other data (filter of related data).
Examples of different filtering operations suitable for spatio-temporal data can be
found in the monograph [10], section 4.2 entitled “Interactive filtering”. Interactive
filtering is usually enabled by UI elements, such as sliders, or can be done directly
through a data display, e.g., by encircling a part of the display containing the data of
interest. For informed selection or filtering, it is appropriate to use visual displays of
the data distribution, such as histograms of value frequencies and time histograms.
Spatial filtering is convenient to do through a map display showing the spatial dis-
tribution of the data.
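The filter types above all reduce to predicates over data records and can be combined by conjunction. The sketch below uses invented ward records; field names and values are illustrative only.

```python
# Each record: (ward name, region, proportion female, median age).
# All values are invented for illustration.
records = [
    ("Abbey",    "East",  0.51, 32),
    ("Bromley",  "South", 0.49, 41),
    ("Camden",   "North", 0.52, 36),
    ("Dagenham", "East",  0.50, 38),
]

# Attribute-based filter: keep wards with a female proportion above 0.5
attribute_filtered = [r for r in records if r[2] > 0.5]

# Spatial filter (reduced here to a region attribute): keep the East
spatial_filtered = [r for r in records if r[1] == "East"]

# Filters combine by conjunction: only records satisfying all
# conditions remain visible in the displays
combined = [r for r in records if r[2] > 0.5 and r[1] == "East"]
```

An interactive UI element such as a slider merely edits the threshold in such a predicate and re-evaluates it; the filtering logic itself stays this simple.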
The result of data selection or filtering is that only data satisfying query conditions
are shown. It is often desirable to see how a specific part of the data relates to other
data. This task is supported by highlighting the data subset of interest while keep-
ing the remaining data visible; hence, the highlighted items can be viewed in the
context of all currently active data (which may have been previously selected or
filtered). A common operation for making a group of data items highlighted in a
display is brushing, which is typically performed by dragging the mouse cursor
over an area in the display space. Items can also be highlighted by clicking on the
visual marks representing them. When two or more data displays exist simultane-
ously, highlighting usually happens in all of them in a coordinated manner affecting
the appearance of the visual marks representing the same subset of data items in
all these displays. Such simultaneous highlighting of corresponding items is essen-
tial for the use of multiple complementary displays, as it facilitates linking distinct
pieces of information visible in these displays. The use of different colours or styles
for representing highlighted items makes it possible to have several subsets high-
lighted simultaneously. These subsets can thus be compared to each other and to the
remaining data.
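Coordinated highlighting can be reduced to one shared selection that all views consult when styling their marks. The class and method names below are a minimal sketch, not the API of any visualisation toolkit.

```python
class LinkedSelection:
    """Minimal sketch of coordinated highlighting: several views share
    one set of selected item identifiers, and each view asks it how to
    style a mark, so the same items light up everywhere."""

    def __init__(self):
        self.selected = set()

    def brush(self, item_ids):
        """Add brushed items (e.g. those under a dragged region)."""
        self.selected.update(item_ids)

    def clear(self):
        self.selected.clear()

    def style(self, item_id):
        """Every linked view renders a given item the same way."""
        return "highlight" if item_id in self.selected else "normal"

selection = LinkedSelection()
selection.brush(["ward_12", "ward_47"])   # e.g. bars brushed in a histogram
map_style = selection.style("ward_12")    # queried by the map view
chart_style = selection.style("ward_03")  # queried by another histogram
```

Because the views share the selection rather than copies of it, a brush in any one display immediately changes the mark styles in all of them, which is what makes linking distinct pieces of information possible.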
Fig. 3.20: Selection of a subset of London wards by brushing. Top left: selection of
the wards belonging to the three rightmost bars of the frequency histogram for the
proportion of the female population. Top right: the bars representing the selected
items are highlighted in the histogram. Bottom left: the selected wards are high-
lighted in the map. Bottom right: bar segments corresponding to the selected wards
are highlighted in a histogram for the median age.
To a large extent, interaction techniques are designed to compensate for the limitations of visualisation: the impossibility of showing many data components in a single display (compensated by coordinated multiple views, CMV), the impossibility of showing large amounts of data in sufficient detail (compensated by techniques for focusing and obtaining details), and the difficulty of finding particular information (compensated by tools for data selection and filtering). However, the existence of these tools and the necessity to apply them complicates the use of visualisations and the process of data analysis. The
analyst needs to keep in mind what tools are available, for what purposes they are
used, and how they are operated. Substantial parts of the analyst’s attention and time
have to be given to technical operations at the cost of cognitive activities.
A big problem of visual exploration with a strong emphasis on the use of interactions is the lack of a systematic approach to the analysis. It is hard to ensure that the entire
subject is comprehensively studied in sufficient detail when the analysis is done
through interactive selection of particular data portions and aspects for viewing. It
is hard to check whether everything that is essential has received sufficient attention and thought from the analyst. Without a systematic approach, interactive exploration
mostly relies on serendipity. Although serendipity can have a role, a strong reliance
on it is deeply unsatisfactory within any scientific endeavour. For a more systematic
interactive exploration, it is useful to keep track of the inspected data components
and performed operations and document the visual data analysis process, which
helps to identify data components and portions that have not yet been considered.
Besides, it facilitates the externalisation of the analysts’ reasoning processes.
The use of interaction operations may be problematic for displays showing large
amounts of data, because updating such displays in response to every user’s action
may be too slow for comfortable and efficient analysis without unwanted interrup-
tions. As we discussed in the previous section, this problem can be alleviated by
controlling when, where, and what interaction events and reactions are allowed.
However, tools and operations for making such settings also complicate the use of
visual tools and take a certain amount of the analyst’s attention and time.
It is always necessary to remember that interaction consumes the most valuable resource – the analyst’s time. Therefore, the use of interaction should be well
justified. Whenever possible, computations should be used, either on request or pro-
actively, preparing answers to potentially emerging questions.
In this chapter, we have considered the major principles of creating analytical visu-
alisations and the purposes and methods to interact with them. To design an effec-
tive and efficient visualisation that facilitates data analysis, it is necessary to under-
stand these principles and apply them carefully. However, we have provided only
a short and quite superficial overview of this exciting discipline. We recommend that interested readers consult two books: “Visualization Analysis and Design” by T. Munzner [103], which focuses on the principles of visualisation, and “Interactive Visual Data Analysis” by C. Tominski and H. Schumann [135], which focuses on designing and using visualisations for data analysis. Our book has a different focus. It does
not discuss the design of new visualisations but explains how to utilise the existing
visualisation methods that have proved to be useful in analytical workflows together
with other kinds of techniques for data processing and analysis.
It is necessary to understand both the benefits and unavoidable costs of interactive
visualisations. Whenever possible (i.e., for routine tasks, where human judgement
and reasoning are not necessary), computations need to be performed instead of
visual analysis, saving time and cognitive efforts of human analysts.
Chapter 4
Computational Techniques in Visual
Analytics
conditions, compute descriptive statistics, arrange and group data items, transform
data, derive new data components from existing ones, detect data pieces with spe-
cific properties, generate predictive models, and many others. Computational meth-
ods of data processing and analysis are increasingly important due to our desire to
extract meaning from ever-larger datasets.
It is hardly feasible to describe in one book chapter how each of the existing com-
putational techniques can be used in analysis. What we are going to do is to look
at the whole mass of these techniques from the visual analytics perspective. This
reveals two aspects relevant to combining computing with visualisation and interac-
tion:
• Visualisation and interaction can support appropriate use of computational meth-
ods.
• Computational methods can enable effective presentation of voluminous and/or
complex information to human analysts, mitigating the limitations of visualisa-
tion and interaction techniques.
Correct use of computational methods requires, first of all, the selection of the right methods depending on the nature and properties of the data under analysis and the current analysis task. Then, the data need to be appropriately prepared for the application of the chosen method. Data preparation also requires thorough investigation and
good understanding of the data properties, which can hardly be achieved without
the help of visualisation. While the next chapter of this book is fully dedicated to
investigating and processing data, here we would like to draw your attention to
two features of data that can be uncovered by means of visualisation: existence
of outliers and dissimilar subsets.
Outliers can greatly affect the work of computational methods and skew their results
making them useless or misleading. It is usually necessary to remove outliers, but it
may also be sensible to investigate and understand their impact on the computation
results, and visualisation is a suitable tool for this task.
When a dataset consists of highly dissimilar subsets, it may be a good idea to
treat them differently by involving different data components in the analysis and/or
applying different parameter settings or even different analysis methods. This ap-
proach can be especially recommended in using methods for computational mod-
elling. For example, in studying and modelling the variation of the prices for housing
or accommodation, it may be reasonable to treat residential, touristic, and business areas separately, as the respective prices can be affected by different factors.
4.1 Preliminary notes 91
After running a computational method, the analyst needs to evaluate its results. For
many methods, there are numeric measures of the quality of their results. However,
these measures are not sufficient for understanding whether the results are meaningful, i.e., can be interpreted and do not contradict the analyst’s knowledge, and whether they are useful for reasoning and drawing conclusions or for performing further steps in the analysis. Whenever human understanding is required, there is a need for visualisation. Besides, it is quite usual that the quality of method results is not the same for all parts of the input data. This variation remains unnoticed
when only overall quality measures are taken into account, whereas visualisation
can reveal the variation and help the analyst to find a way to improve the result qual-
ity. Examples of this use of visualisation will be demonstrated in the chapter about
visual analytics approaches to building computer models (Chapter 13).
Most of the computational methods have parameters that need to be set for running
them. The outputs of the methods can vary depending on the chosen parameter
values. It is often not quite clear in advance what settings can be suitable for the data
at hand and the analysis goal. Even when the analyst’s knowledge and experience
suggest sensible choices, it is reasonable to perform sensitivity analysis by testing
the robustness of the method results to slight variations of the settings for either
gaining more trust and certainty or understanding the degree of uncertainty.
Besides parameters, method outcomes can be affected by variations of input data
due to taking different samples or selecting different combinations of attributes. In
any case, a method should be run several times with modified parameter settings and/or input data, and the outcomes of the different runs need to be compared.
Another source of possible differences in analysis results is the existence of alterna-
tive methods suitable for the same kind of data and the same tasks. When there are
no serious reasons for preferring one method over the others, a diligent analyst will
try different methods and compare their results.
Hence, good practices of data analysis involve comparisons of results of different
runs of the same or different methods. Consistency between the results indicates
their validity, whereas inconsistencies require investigation, explanation, and mak-
ing a justified choice or drawing a justified overall conclusion. For comparing re-
sults, it is usual to use various statistical measures. However, these measures are
typically not sufficient for supporting understanding and substantiated judgements,
as they can tell how much difference exists between two results but cannot tell what
is different. This is where visualisation can greatly help.
The role of visualisation is to present outcomes of different executions of computa-
tional methods so that analysts can see where they diverge and in what manner and
look for possible reasons. Interaction techniques provide access to details or addi-
tional relevant information. In this role, interactive visualisation can be used together
with any kind of computational method. While the types of visual displays that are
suitable for this purpose depend on the type and properties of method outputs, there
is a general principle: the visualisation should link the outputs to the input data, so
that the analyst can see how the same input data have been dealt with by different
runs of computational methods. Further in this chapter, we shall demonstrate the
application of this principle by example of clustering.
Some computational methods are designed to work in an iterative manner allow-
ing, in principle, inspection of the course of method execution and the intermediate
results of the processing. Visualisation can be of great help in performing such in-
spections. Analysts can better understand the data processing and method training,
which adds transparency to the approach and increases the trust in its results.
Based on our discussion, the tasks of visualisation in supporting appropriate use of
computational methods can be summarised as follows:
• Before applying computational methods: enable investigation of data properties
in order to
– choose suitable computational methods and make appropriate parameter set-
tings;
– detect outliers, which may need to be removed;
– detect disparities between parts of the data, which may require different ap-
proaches to analysis.
• During applying computational methods:
– inspect how the method uses and processes the input data;
– investigate the intermediate structures constructed by the method;
– examine the current state of the model being built by the method and understand whether it develops in the right direction.
• After applying computational methods:
– enable evaluation of method results for understanding their meaningfulness
and usefulness and seeing the variation of the quality across the input data set;
– enable comparisons of different results obtained by varying the input data or
values of method parameters, or by choosing alternative methods.
What we are going to discuss further in this chapter is the computational techniques
that are commonly used in visual analytics workflows for enabling effective visual
presentation of voluminous and/or complex information to human analysts. These
techniques are employed when there is no meaningful way to create a visualisation
showing each and every data item in detail: either it is impossible technically, or
such a visualisation would be incomprehensible.
One possible approach to dealing with huge data amounts is to visualise only small
pieces of data. Such pieces can be extracted from the bulk either in response to
queries with explicitly specified conditions or by applying data mining techniques
that search for combinations of data items having particular properties or frequently
occurring in the dataset. This approach is taken when only such data pieces are of
interest and there is no need to get a “big picture” of the whole dataset. Techniques that are used for extracting data pieces support data analysis but do not play
the special role of supporting visualisation.
Visualisation can be supported by computational techniques that organise and/or
summarise information so that it becomes better suited for visual representation and
human perception and comprehension. The main goal is to enable an overall view
of the whole bulk of information. Such an overall view is achieved owing to a high
degree of abstraction and omission of many details. The details, when needed, can
be obtained by means of interaction.
There are two major approaches to creating an overall view of a dataset:
• Spatialisation: position the data items in an artificial space with two or three
dimensions according to a certain principle, usually according to their similarity
or relatedness. The resulting spatial arrangement of the data items can be visually
represented in a two-dimensional plot or in a three-dimensional plot that can be
interactively turned for viewing from different perspectives.
• Grouping: organise the data items in groups according to their similarity. The
resulting groups can be treated as units, i.e., each group is a single object. The
groups are characterised by statistical summaries of characteristics of their ele-
ments. The statistical summaries can be represented visually.
Spatialisation can be achieved by means of the class of computational techniques
called data embedding or, more frequently, dimensionality reduction or dimension
reduction. These methods, generally, represent data items by points (i.e., combina-
tions of coordinates) in an abstract “space” with a chosen number of dimensions,
which can be 2 or 3, in particular. One category of data embedding methods applies
some transformations to the original components of the data, called “dimensions”,
in order to derive a smaller number of components by which the data items can be
described so that substantial differences between them are preserved. This is where
the term “dimension reduction” comes from.
Another category of data embedding methods uses numeric measures of the similarity or relatedness between any two data items. Such a measure is expected to
be zero when two items are identical or indistinguishable according to chosen cri-
teria of similarity, and, as the similarity decreases, the value of the measure must
increase. A measure fulfilling this principle is called a distance. A mathematical formula or an algorithm for determining the distance between two given items is called a distance function.
Fig. 4.1: The classes of computational methods discussed in this chapter; the labels in the scheme are ‘Spatialisation’, ‘Grouping’, and ‘Reproducing distances’.
In the remainder of this chapter, we shall discuss the classes of techniques that are
commonly employed in visual analytics approaches for supporting creation of an
overall view of a dataset. The main classes are data embedding and clustering, the
methods from which are used for spatialisation and grouping, respectively. As both
classes of methods use distance functions, we shall discuss distance functions as
well. For defining appropriate distance functions, it is often necessary to perform
feature selection, which means finding a small non-redundant subset of data com-
ponents that is sufficient for representing substantial differences between data items
(data components are called “features” or “variables” in statistics and data mining).
Therefore, the chapter includes a section about feature selection.
We also decided to write a special section about topic modelling. This class of meth-
ods was originally created specifically for analysing text data. However, several vi-
sual analytics researchers have demonstrated in recent years that topic modelling
4.2 Distance functions 95
can have a much wider area of applicability than just texts. In essence, topic modelling is a kind of dimension reduction. We discuss it in a separate section because
it needs to be described using specific terms, such as documents, words, and topics,
which are not relevant to the other dimension reduction methods. The uses of topic
modelling for non-textual data are still new; therefore, we shall first describe the
traditional use and then show how it translates to other data.
The scheme in Fig. 4.1 shows the classes of computational methods discussed in
the remainder of this chapter. The arrows indicate the use of methods. A solid arrow
means that methods of one class require the use of methods of another class. Thus,
methods for distance-reproducing data embedding and for clustering require the use
of distance functions. A dashed arrow means that methods of one class may use
methods of another class. Thus, distance functions can be defined using results of
feature selection or dimension-transforming data embedding.
Please keep in mind that each of these classes of methods is used not only for sup-
porting visualisation but also for other purposes; moreover, most of these methods
are primarily created for other purposes but can be used for visualisation as well.
In particular, feature selection, dimensionality reduction, and clustering can be used
for building computer models, and distance functions can be used for searching by
similarity.
There exist many distance functions differing in the types of data they can be applied
to and in the approaches to expressing dissimilarities. In the subsections of this
section, we shall group the distance functions according to the data types.
When two data items differ in values of a single numeric attribute, the arithmetic
difference between the two values (the sign being ignored) can serve as a measure of
the dissimilarity between the data items. This does not apply, however, to attributes
with cyclic value domain, such as spatial orientation or time of the day. Take, for
example, the spatial orientation, which is an angle with respect to a chosen reference
direction, such as the geographical north. It is usually measured in degrees, and 0° is the same as 360°. For the time of the day, the time 00:00 (midnight) is the same as 24:00. When we need to determine the difference between, for example, 15° and 350°, straightforward subtraction of one of them from the other will give the
result 335°, which is wrong; the correct result is just 25°. Similarly, the difference between 23:45 and 00:15 is not 23½ hours but only half an hour.
Hence, determining differences between values of an attribute with a cyclic value
domain requires the following special approach: if the result of subtracting one value
from another exceeds the half of the cycle length, it needs to be subtracted from the
cycle length. Thus, the cycle length of the spatial orientation is 360◦ and the half of
it is 180◦ . The subtraction result for 15◦ and 350◦ is 335◦ . It is greater than 180◦ and
therefore needs to be subtracted from 360◦ to yield the final result 25◦ . Similarly, the
cycle length of the time of the day is 24 hours and its half is 12 hours; hence, 23½ hours needs to be subtracted from 24 hours to yield the final result of half an hour.
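This rule can be sketched in a few lines of Python (the function name cyclic_diff is our own illustrative choice, not part of any library):

```python
def cyclic_diff(a, b, cycle_length):
    """Difference between two values of an attribute with a cyclic value domain."""
    d = abs(a - b) % cycle_length
    # If the plain difference exceeds half of the cycle, measure the other way round.
    if d > cycle_length / 2:
        d = cycle_length - d
    return d

# The examples from the text:
print(cyclic_diff(15, 350, 360))     # 25 (degrees), not 335
print(cyclic_diff(23.75, 0.25, 24))  # 0.5 (hours): 23:45 vs. 00:15
```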
When data items differ in more than one numeric attribute, the dissimilarity can
be expressed by combining in some way the arithmetic differences between the
values of each attribute. The most popular method for combining multiple arithmetic
differences is the Euclidean distance, which can be interpreted as the straight line
distance between two points in space. For n numeric attributes and two data items
(points) denoted x and y, the formula of the Euclidean distance is
\sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}, \qquad (4.1)

where x_i and y_i are the values of the i-th attribute for the items x and y, respectively.
Another popular distance function is the Manhattan distance, which is simply the sum of the absolute per-attribute differences.
The Minkowski distance1 generalises the possible approaches to combining multi-
ple differences in the formula
\sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \qquad (4.2)

where p \geq 1 is a parameter: p = 1 yields the Manhattan distance and p = 2 the Euclidean distance.
1 https://en.wikipedia.org/wiki/Minkowski_distance
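A minimal Python sketch of formula (4.2) (the function name minkowski is ours):

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 2))  # 5.0 -- the Euclidean distance, formula (4.1)
print(minkowski(x, y, 1))  # 7.0 -- the Manhattan distance
```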
Fig. 4.2: The impact of the parameter p of the Minkowski distance on defining the
neighbourhood of a point. The blue outlines encircle the areas where the distances
to the intersection point of the axes are below the same threshold value.
2 https://en.wikipedia.org/wiki/Mahalanobis_distance
3 https://en.wikipedia.org/wiki/Cosine_similarity
The functions of cosine similarity and angular distance are often used in analysis of
texts. Thus, when texts are compared regarding the frequencies of occurring terms
or keywords, these functions can assess the similarity or dissimilarity irrespective
of the text lengths. Generally, this approach can be useful in application to numeric
attributes whose values depend on the sizes of the entities they refer to. For example,
in census data, values of many attributes depend on the population of the census
districts.
Even more generally, distance functions that disregard absolute values of attributes
are useful when data items need to be compared in terms of their structure irre-
spective of quantitative differences. Such distance functions can be called scale-
independent. Apart from the angular distance, a scale-independent distance func-
tion can be defined based on the correlation between two combinations of values
of multiple attributes. As with the cosine similarity, the correlation values, which
range from -1 (opposite) through 0 (unrelated) to 1 (absolutely correlated), need to
be transformed to distances in the range [0, max]. Scale-independent functions are
applied when the attributes have similar semantics and can be meaningfully com-
pared. For example, in analysing census data, a correlation-based distance function
can be applied to a group of attributes reporting the numbers of the inhabitants by
age intervals, or by types of occupation, or by social status, but it would not be
reasonable to apply such a function to a mixture of attributes from these different
groups, or to the mean age, mean income, and mean distance to the work taken
together.
Like the cosine similarity, a correlation coefficient needs to be treated as a measure
of similarity and not as a distance. It equals 1 when two combinations of values
are fully correlated, which is treated as the highest similarity. The coefficient value
of −1 means that two combinations are opposite. The correlation-based distance
is defined by subtracting the correlation coefficient from 1; hence, the value of 0
means the maximal similarity.
There are many formulas for expressing the correlation. The most popular is the
Pearson correlation, which measures the degree of a linear relationship between
two value combinations. Using the same notation as before, the Pearson correlation
is calculated according to the formula
\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (4.4)

Here, \bar{x} and \bar{y} are the arithmetic means of the groups of attribute values belonging to the data items x and y, respectively.
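A sketch of the correlation-based distance in Python (pearson_distance is our own illustrative name; in practice one would rather call a statistics library):

```python
import math

def pearson_distance(x, y):
    """1 minus the Pearson correlation (formula 4.4):
    0 means fully correlated, 2 means opposite."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

print(pearson_distance([1, 2, 3], [2, 4, 6]))  # ~0.0: same profile, different scale
print(pearson_distance([1, 2, 3], [6, 4, 2]))  # ~2.0: opposite profiles
```

Note that the first pair of value combinations is at distance (almost) zero although the absolute values differ, which is exactly the scale-independent behaviour described above.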
In dealing with multidimensional data, you need to be aware of the problem known
as the curse of dimensionality4 . When the number of data dimensions (attributes)
is very high, the volume of the space comprising these dimensions is huge. The
available data are very sparse in this space, and there is very little difference in
4 https://en.wikipedia.org/wiki/Curse_of_dimensionality
the distances between different data items. In such a case, it is necessary to apply
feature selection (Section 4.3) and/or dimensionality reduction (Section 4.4). There is no clear definition of what number of dimensions should be considered high; it is acknowledged to depend on the number of available data items. When the number of dimensions exceeds the number of data items, the data should definitely be treated as high-dimensional. However, for analyses involving assessment of similarities or distances, it is recognised that the curse of dimensionality begins well before the number of dimensions exceeds the number of data items.
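The concentration of distances can be illustrated by a small experiment with random points (a hedged sketch; the exact numbers depend on the random seed, but the qualitative effect does not):

```python
import random

def distance_spread(n_points, n_dims, seed=0):
    """Relative spread (max minus min, divided by min) of the pairwise
    Euclidean distances among random points in the unit hypercube."""
    rnd = random.Random(seed)
    pts = [[rnd.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

# The relative spread of the distances shrinks as the dimensionality grows:
print(distance_spread(50, 2))     # large relative spread in 2 dimensions
print(distance_spread(50, 1000))  # much smaller spread in 1000 dimensions
```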
4.2.2 Distributions
5 https://en.wikipedia.org/wiki/Histogram
The sets of per-bin statistics representing two distributions can, in principle, be com-
pared in the same ways as combinations of values of multiple numeric attributes, as
discussed in the previous section. Distributions are typically compared using scale-independent distance functions, which allow one to disregard quantities and compare the distribution profiles; examples are the angular distance and the correlation distance. Besides, distributions can also be compared using the chi-square statistic, the Bhattacharyya distance, or the Kullback-Leibler divergence. The respective formulas can be easily found in the literature (e.g., [29]) or on the web.
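As an illustration, an angular (cosine-based) distance applied to the per-bin counts disregards the totals and compares only the profiles (cosine_distance is our own name for this sketch):

```python
import math

def cosine_distance(p, q):
    """1 minus the cosine of the angle between two histograms (per-bin counts)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norm

h1 = [10, 30, 40, 20]   # a histogram with four bins
h2 = [1, 3, 4, 2]       # the same profile, ten times fewer items
h3 = [40, 30, 20, 10]   # a different profile
print(cosine_distance(h1, h2))                            # ~0: scale is disregarded
print(cosine_distance(h1, h3) > cosine_distance(h1, h2))  # True
```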
In representing distributions by histograms, be mindful of the curse of dimension-
ality mentioned in the previous section. Since you can usually choose how many
bins to create, make sure that the number of bins is smaller than the number of
distributions you need to analyse, or apply feature selection (Section 4.3) and/or
dimensionality reduction (Section 4.4) to the histograms.
6 https://en.wikipedia.org/wiki/Dynamic_time_warping
the speed of the speech. In research on human-computer interaction, DTW has been
applied for comparison of EEG signals of different individuals who performed the
same task on a computer at their own speed [154]. In general, DTW can be useful
for comparing processes unfolding with different speeds. DTW has also been used
for comparing trajectories of moving objects following similar routes.
Fig. 4.3: A schematic illustration of the point matching by the Dynamic Time Warp-
ing algorithm. Source: [154].
While DTW has become quite popular, it has been argued that the results of
using DTW for dissimilarity assessment do not substantially differ from using
the Euclidean or Manhattan distance after re-sampling of sequences to the same
length [113].
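For reference, the textbook dynamic-programming formulation of DTW can be sketched as follows (this is the basic variant, not the specific implementation used in [154]):

```python
def dtw_distance(s, t):
    """Basic Dynamic Time Warping distance between two numeric sequences,
    computed by dynamic programming over all monotone point matchings."""
    inf = float("inf")
    n, m = len(s), len(t)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three possible partial matchings.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# A sequence and a "slowed-down" copy of itself are at DTW distance zero:
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]))  # 0.0
```

The zero distance in the example shows how DTW compensates for differences in speed: each element of the shorter sequence is matched to several elements of the longer one.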
Comparison of time series may be greatly affected by noise (i.e., random fluctua-
tions) in the data. Application of a smoothing method (see Section 5.3.2) can reduce
the noisiness and thus help to compare the general patterns of the temporal variation
rather than minor details. When time series are long and may involve several vari-
ation patterns, such as an overall trend and seasonal or cyclic fluctuation, it makes
sense to decompose the time series into components7 and apply a chosen distance
function to the components that are relevant to the analysis task. If two or more
components need to be involved, you can compute separately the distance for each
component and then take the average or a weighted average of the distances.
For periodic time series, it is also possible to use the Fourier transform8 , which rep-
resents a time series as a sum of sinusoidal components, i.e., sine and cosine multi-
plied by some coefficients, which are called Fourier coefficients. Then, the distance
between two time series can be expressed as the distance between the combinations
of their Fourier coefficients using one of the functions suitable for multiple numeric
attributes (Section 4.2.1).
7 https://en.wikipedia.org/wiki/Decomposition_of_time_series
8 https://en.wikipedia.org/wiki/Fourier_series
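A minimal sketch of this idea with a pure-Python discrete Fourier transform (the function names are ours; for real data one would use an FFT library):

```python
import cmath
import math

def fourier_coeffs(series, k):
    """The first k complex Fourier (DFT) coefficients of a time series."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(series)) / n
            for f in range(k)]

def fourier_distance(s1, s2, k=4):
    """Euclidean distance between the first k Fourier coefficients of two series."""
    c1, c2 = fourier_coeffs(s1, k), fourier_coeffs(s2, k)
    return sum(abs(a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

a = [math.sin(2 * math.pi * t / 12) for t in range(24)]        # period 12
b = [math.sin(2 * math.pi * t / 12 + 0.1) for t in range(24)]  # slightly shifted
c = [math.sin(2 * math.pi * t / 6) for t in range(24)]         # period 6
print(fourier_distance(a, b) < fourier_distance(a, c))  # True: a is closer to b
```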
The distance between data items described by multiple categorical attributes (i.e.,
attributes with categorical values) is obtained from the per-attribute distances, which
are expressed by some numbers and combined in a single number by summing or
applying the formulas of the Minkowski distance or angular distance that are used
for numeric attributes (see Section 4.2.1). The question is how to measure the simi-
larity or distance between two categorical values of a single attribute.
The simplest approach is to assume that the distance between any two distinct values
is the same, usually 1. Then, if the per-attribute distances are combined by summing,
the total distance will equal the number of attributes in which two data items differ,
i.e., the number of mismatches between two combinations of categorical values.
This approach to expressing the per-attribute distances is equivalent to replacing
each of the categorical attributes by k “dummy” numeric attributes, where k is the
number of values of the categorical attribute. Each “dummy” attribute corresponds
to one categorical value and has two values, 0 and 1. When a data item has this
categorical value of the original attribute, the value of the “dummy” attribute is 1,
otherwise it is 0. After such a transformation of the data, it is possible to compute the
Minkowski or angular distances in the same way as for usual numeric data.
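Both the mismatch count and the “dummy”-attribute transformation can be sketched as follows (the function names are ours):

```python
def mismatch_distance(item1, item2, attributes):
    """Number of categorical attributes in which two data items differ."""
    return sum(item1[a] != item2[a] for a in attributes)

def to_dummies(item, value_sets):
    """0/1 'dummy' vector with one position per (attribute, value) pair."""
    return [1 if item[a] == v else 0
            for a, values in value_sets.items() for v in values]

a = {"colour": "red", "shape": "round"}
b = {"colour": "blue", "shape": "round"}
print(mismatch_distance(a, b, ["colour", "shape"]))  # 1: they differ in colour only

value_sets = {"colour": ["blue", "red"], "shape": ["round", "square"]}
print(to_dummies(a, value_sets))  # [0, 1, 1, 0]
print(to_dummies(b, value_sets))  # [1, 0, 1, 0]
```

Note that the Manhattan distance over the dummy vectors counts every mismatching attribute twice (a 1 turns into a 0 and a 0 into a 1), so it equals twice the mismatch count.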
A more sophisticated approach takes into account the frequencies of different val-
ues, assuming that rarely occurring values should be considered as more distant
from all others than values occurring frequently. The idea is to compute the proba-
bility p of each value, i.e., its frequency divided by the total number of data items,
and use 1 − p or 1/p as the distance of this value to any other, or, equivalently, as
the value of the “dummy” numeric attribute corresponding to this value instead of
the value 1.
There are also much more sophisticated measures of similarity or distance. A com-
parative testing of a variety of similarity measures [37] showed that none of them
was always superior or inferior to the others, but some measures, such as Goodall’s and Lin’s similarity measures (both have been implemented in R), performed consistently better on a large number of test datasets.
When you have mixed data, i.e., both numeric and categorical attributes, you can
take one of the following approaches:
• Convert the categorical attributes into numeric by introducing “dummy” at-
tributes as described above. Then deal with the transformed data as with usual
numeric attributes.
• Compute separately the distances for the numeric and for the categorical at-
tributes and combine them into a single measure, for example, by computing
a weighted average (w1 · d1 + w2 · d2 )/(w1 + w2 ), where d1 and d2 are two partial
distances and w1 and w2 are the weights given to the two groups of attributes.
The weights can be proportional to the numbers of the attributes, or you can have
a different idea concerning the relative importance of each group.
• Use a distance function that was specially created for mixed data, such as Gower’s distance [59]. When deciding to use such a function, it is good to understand how it works.
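The weighted-average option can be sketched as follows (mixed_distance is our own illustrative name; in practice the numeric attributes should first be scaled to comparable value ranges):

```python
def mixed_distance(x, y, numeric, categorical, w_num=None, w_cat=None):
    """Weighted average of a numeric (Manhattan) and a categorical (mismatch)
    distance: (w1*d1 + w2*d2) / (w1 + w2)."""
    d_num = sum(abs(x[a] - y[a]) for a in numeric)
    d_cat = sum(x[a] != y[a] for a in categorical)
    # Default weights: proportional to the numbers of attributes in each group.
    w1 = len(numeric) if w_num is None else w_num
    w2 = len(categorical) if w_cat is None else w_cat
    return (w1 * d_num + w2 * d_cat) / (w1 + w2)

x = {"age": 30, "income": 3.0, "occupation": "teacher"}
y = {"age": 40, "income": 3.0, "occupation": "driver"}
print(mixed_distance(x, y, ["age", "income"], ["occupation"]))  # (2*10 + 1*1)/3 = 7.0
```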
4.2.5 Sets
The Jaccard index, or Jaccard similarity coefficient9 , is used for comparing sets,
for example, sets of students attending different classes. It is defined as the size of
the set intersection divided by the size of the set union. Hence, when the sets are
identical, the result is 1, and when they have no common elements, the result is 0.
To convert this measure of similarity into a distance function, you can subtract the
result from 1.
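A sketch in Python (jaccard_distance is our own name):

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard index: identical sets give 0, disjoint sets give 1."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are treated as identical
    return 1 - len(a & b) / len(a | b)

maths = {"ann", "bob", "carol"}
physics = {"bob", "carol", "dave"}
print(jaccard_distance(maths, physics))  # 0.5: two common students out of four
```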
4.2.6 Sequences
9 https://en.wikipedia.org/wiki/Jaccard_index
4.2.7 Graphs
10 https://en.wikipedia.org/wiki/Centrality
To find clusters of objects in space, a clustering tool needs to use the spatial dis-
tances between the objects. The spatial distance is easy to compute when the ob-
jects can be treated as points in the space. When the spatial positions of the points
are specified by coordinates in an orthogonal coordinate system, the spatial dis-
tance is computed as the straight-line distance, i.e., the Euclidean distance. This
applies, for example, to positions of players and a ball on a sports game field where
an internal coordinate system is defined. When the spatial positions are specified
by geographic coordinates, i.e., longitudes and latitudes, the distance is calculated
as the great-circle distance11 (the formulas are quite complicated and therefore not
shown here). Please note that coordinates in geographically referenced data can also
be specified in an orthogonal coordinate system obtained after applying one of the
methods for map projection12 . Hence, you should be aware of the kind of coordi-
nates you have in your data in order to choose the right method of computing the
spatial distances.
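As an illustration, one widely used great-circle formula, the haversine formula, can be sketched as follows (a spherical-Earth approximation; the function name is ours):

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) positions given
    in degrees, using the haversine formula on a spherical Earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# London to Berlin, roughly 930 km on the spherical approximation:
print(great_circle_km(51.5074, -0.1278, 52.5200, 13.4050))
```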
For some analyses, it is not the straight-line or great-circle distances between spatial positions that are important but the lengths of the paths taken to get from one position to another, or the duration of travelling. There are web services that can compute path lengths and travel durations through the street network.
For determining the distances between spatial objects that consist of more than one point, different approaches can be taken depending on the kind of objects and the specifics of the analysis that needs to be done. The possible approaches include taking the distance between the centres of the objects, or between the closest points of their boundaries or outlines, or the Hausdorff distance13, which is the longest of all distances from a point of one object to the closest point of the other object.
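For illustration, the Hausdorff distance between two point sets (e.g., sampled object outlines) can be sketched naively from this definition; the quadratic-time implementation below is meant for small sets only:

```python
import math

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets: the largest
    distance from any point of one set to the closest point of the other."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))
```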
Finding distances between linear objects, such as paths of movement, may involve
matching points from the two lines, similar to matching points from two time series
by the Dynamic Time Warping distance function discussed in Section 4.2.3. In fact,
DTW can be applied for calculating distances between lines in space. It is also pos-
sible to do point matching using DTW or another suitable algorithm [121] and then
compute the mean of the distances between the corresponding points. Please note
that almost all of the existing point matching algorithms, including DTW, match the
first and the last points of one line to, respectively, the first and the last points of the
other line. If you want to find clusters of largely overlapping paths that may differ
in their lengths, you may need another algorithm, such as the “route similarity” dis-
tance function [10, p.147, Algorithm 5.4] or the point matching algorithm proposed
for analysis of route variability [23].
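The DTW-based approach sketched above, matching the points of two polylines and then averaging the distances between matched points, might look as follows (a naive implementation for small lines; note that, as stated, the first and last points of the two lines are always matched to each other):

```python
import math

def dtw_matching(path_a, path_b):
    """Dynamic Time Warping between two polylines (lists of (x, y) points).
    Returns the matched index pairs along the optimal warping path."""
    n, m = len(path_a), len(path_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # backtrack from the end to recover the matched pairs
    i, j, pairs = n, m, []
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return pairs[::-1]

def mean_matched_distance(path_a, path_b):
    """Mean spatial distance between DTW-matched points of two lines."""
    pairs = dtw_matching(path_a, path_b)
    return sum(math.dist(path_a[i], path_b[j]) for i, j in pairs) / len(pairs)
```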
11 https://en.wikipedia.org/wiki/Great-circle_distance
12 https://en.wikipedia.org/wiki/Map_projection
13 https://en.wikipedia.org/wiki/Hausdorff_distance
The distance in time between two time moments is the time difference between
them, which is determined by subtracting the earlier moment from the later one.
The distance between two time intervals can be determined, following the approach
of the Hausdorff distance, as the maximum of the distances between the start times
of the intervals and between their end times. For finding clusters of events in time,
it may be reasonable to define the distance as 0 if the intervals overlap and as the
difference between the start time of the later interval and the end time of the earlier
interval otherwise.
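The interval distance defined above can be sketched as a small function (intervals are assumed to be given as pairs of datetime values; returning seconds and treating touching intervals as overlapping are our own choices):

```python
from datetime import datetime

def interval_distance(start1, end1, start2, end2):
    """Temporal distance between two intervals for event clustering:
    0 if the intervals overlap, otherwise the gap in seconds between the
    end of the earlier interval and the start of the later one."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    if latest_start <= earliest_end:  # the intervals overlap (or touch)
        return 0.0
    return (latest_start - earliest_end).total_seconds()
```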
What if you need to find clusters of objects in space and time? In this case, you
need a distance function that combines the distance in space and the distance in
time. Formally, you could combine the two distances by applying the formula of the
Euclidean or Minkowski distance. This idea may not be especially good, however,
because it will be very hard to figure out what the result means. A more meaningful
approach would be to set an explicit rule of trading time for space, or vice versa.
For example, knowing the properties of your data, you can decide that the temporal distance of 5 minutes should be treated as equivalent to the spatial distance of 200 metres. On this basis, you can transform the temporal distances into spatial ones, which can then be meaningfully combined with the “normal” spatial distances.
It may be more intuitive to take a slightly different perspective: think of when two objects or events should be treated as neighbours in space and time. For example, your judgement may be that two neighbouring objects should be no more than k metres away from each other in space and separated by no more than m minutes in time. In essence, this is the same as deciding that m minutes is worth k metres, which gives you an opportunity to transform the temporal distances into spatial ones.
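A sketch of such a combined space-time distance, assuming positions in metres within an orthogonal coordinate system and times in minutes. Both the exchange rate (200 m per 5 minutes, i.e. 40 m per minute) and the additive way of combining the components are the analyst's choices, not fixed rules:

```python
import math

def spatio_temporal_distance(p1, t1, p2, t2, metres_per_minute=40.0):
    """Combined space-time distance: the temporal gap (in minutes) is
    converted into an equivalent spatial distance and added to the
    Euclidean distance between the positions (in metres)."""
    spatial = math.dist(p1, p2)                  # metres
    temporal = abs(t2 - t1) * metres_per_minute  # minutes converted to metres
    return spatial + temporal
```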
This approach can be applied not only to time but also to other attributes you may
wish to involve in determining distances. For example, for finding traffic jams, i.e.,
spatio-temporal concentrations of vehicles moving very slowly in a common direc-
tion, it is appropriate to have a distance function combining the distance in space,
the distance in time, and the difference in the spatial orientation [14]. This can be
done by choosing appropriate distance thresholds for defining neighbourhoods in
space, time, and orientation.
When you wish to use Euclidean or Minkowski distance for computing distances
between combinations of values of multiple attributes, you should take care that the
attributes are comparable. The reason is that the function you use combines per-
attribute differences between values, and the magnitudes of these differences can
differ greatly. For example, attributes characterising countries include population,
gross domestic product per capita, and life expectancy. The per-attribute differences
may be millions for the first attribute, thousands for the second, and quite small numbers for the third. Obviously, the contributions of these differences to the combined distance will be extremely unequal. To give all differences equal treatment,
the attributes need to be transformed to a common scale. Such a transformation is
commonly called normalisation.
One possible way of doing this is to re-scale the values of all attributes to the same range, usually from 0 to 1. This may be done by subtracting the minimal attribute value from each value and dividing the result by the difference between the maximal and minimal values of this attribute. This may not be a good idea when
the distribution of values of an attribute is very much skewed towards small or large
values, or when there are outliers. In such a case, most of the transformed values
will be very close to either 0 or 1, and the differences between them will be tiny.
In fact, the “effective scale” of the attribute (i.e., the interval containing the bulk
of the values) will not be comparable to the scales of other attributes whose values
are distributed differently. Hence, this approach may not always fit the purpose.
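A minimal sketch of this min-max re-scaling, followed by a tiny illustration of the skewness problem: a single outlier squeezes all the other transformed values into a narrow band near 0 (the sample numbers are made up for illustration):

```python
def min_max_normalise(values):
    """Re-scale values to the range [0, 1] by subtracting the minimum
    and dividing by the range (maximum minus minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# One outlier dominates the range: the other values become nearly indistinguishable.
population_millions = [1.2, 2.5, 0.8, 3.1, 1350.0]
scaled = min_max_normalise(population_millions)
```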
4.3 Feature selection
The terms “feature” used in data mining and machine learning and “variable” used
in statistics have the same meaning as the term “attribute” used in visualisation
literature and in our book. Feature selection (or variable selection, or attribute se-
lection) means selecting from a large number of attributes a smaller subset that is
sufficient for a certain purpose. The term “feature selection” appears in the literature
and materials available on the web predominantly in relation to constructing predictive models. It is typically explained that the purpose of feature selection is to be able to construct a better model using fewer computational resources. Accordingly,
the existing methods for feature selection are designed to be used in the context of
modelling. Some of them may exclude features from a model one by one and check
how this affects the model’s performance. Others may assess the importance of in-
put features for predicting the model output using certain numeric measures, such as
correlation or mutual information. These methods have no relation to visualisation
or analytical reasoning, and we shall not discuss them here.
Apart from modelling, feature selection may also be needed for supporting visual
analysis by means of computational methods, such as clustering and data embed-
ding. Specifically, a typical task is to select attributes to be used in a distance func-
tion (Section 4.2). One of the things to care about is possible correlations between
attributes. All distance functions involving multiple attributes implicitly assume that all attributes are independent of each other, but this is rarely so in real data. When
several attributes are highly related, their joint effect on the result of a distance func-
tion may dominate the contributions of the remaining attributes.
For example, consider demographic data with 18 attributes expressing the proportions of different age ranges in the population: 0-4 years old, 5-7 years old, 8-9, 10-14,
and so on, up to 90 years and more. There are strong positive correlations between
the proportions of the lowest six age ranges (i.e., from 0-4 to 16-17) and between
the proportions of the highest six age ranges (i.e., from 45-59 to 90 and more). This
can be seen in the visualisation of the pairwise linear correlations between the attributes (more precisely, the values of Pearson’s correlation coefficient) shown
in Fig. 4.4. Each column and each row in this triangular layout corresponds to one
attribute, which is denoted by a label. The blue and red bars in the cells represent,
attribute from the group because the correlation between these two attributes is not
high while their relationships to the other attributes are quite different. This can be
seen by comparing the columns corresponding to the age ranges 0-4 and 16-17. The
range 0-4 is negatively correlated with the ages starting from 45-59 whereas the
range 16-17 has slight positive correlations with the upper age ranges.
The attributes in the second group (i.e., from 45-59 to 90 and over) have quite similar
relationships with the remaining attributes; hence, it may be sufficient to select one
of them. The best candidate is the range 75-84, which has the highest correlations
with the remaining five attributes in the group.
Fig. 4.5: A matrix of correlations between attributes. The ordering of the rows and
columns reveals clusters of correlated attributes. Source: [106].
Figure 4.5 shows an example of visualising pairwise correlations for a much larger
set of attributes [106]. This is a matrix the rows and columns of which correspond
to attributes. Since the attributes are numerous, the cells of the matrix are tiny. The
green shading represents positive correlation values exceeding a chosen threshold;
in this example, it is 0.8. The rows and columns of the matrix have been arranged
in such an order (by means of a hierarchical clustering algorithm) that the rows
and columns of correlated attributes are put close to each other. With this ordering,
groups of correlated attributes are manifested by dark rectangular areas in the ma-
trix. Some groups are marked and labelled in the figure. Please note that the groups
labelled A1 and B1 have high correlations with the groups labelled A and B, re-
spectively. This means that A and A1 belong together, as well as B and B1, but the
matrix reordering algorithm failed to put them together.
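A simplified stand-in for this kind of analysis can be sketched in Python: compute the pairwise Pearson correlations of the attribute columns and group together attributes whose correlation exceeds the threshold by transitive closure (the union-find grouping below replaces the figure's clustering-based matrix reordering; the 0.8 threshold follows the example above):

```python
import numpy as np

def correlated_groups(data, threshold=0.8):
    """Group attributes (columns of `data`) whose pairwise Pearson
    correlation exceeds `threshold`, using transitive closure via
    a simple union-find structure."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] > threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```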
Apart from impacts of correlated attributes, another thing to care about is the num-
ber of attributes that are used for distance computation. One reason why the number
of selected attributes should not be large is the curse of dimensionality (see Sec-
tion 4.2.1): when the number of attributes is very large, the distances between data
items tend to be nearly the same. Another reason, which is often ignored in the liter-
ature focused on computational methods for analysis and modelling, is the necessity
to understand outcomes of computational techniques: what do neighbours in a data
embedding or members of the same cluster have in common? How do they differ
from others? It may not be easy to find answers to such questions when the attributes
are many.
Hence, when the aim of using a computational method is to empower your analytical
reasoning, it may be a bad idea to throw all available attributes into the method and
see what happens. For example, if you use all existing demographic attributes for clustering districts, the result will most probably not be insightful. What makes more
sense is to decompose your analysis into steps in which you deal with subsets of
semantically related attributes. In the example with demographic data, you may fo-
cus separately on the age structure, education, occupation, health-related attributes,
and so on. In each step, you obtain a piece of knowledge concerning some aspect
of the phenomenon under study, and at the end you synthesise these pieces into a
comprehensive whole.
4.4 Data embedding
Data embedding, also called data projection, usually means representing data items
by points in an abstract metric space, that is, a set of possible locations with some
non-negative numeric measure of the distance between any two locations. Each
point receives a certain location in this space. The distances between the points are
supposed to represent certain relationships between the corresponding data items,
most often, relationships of similarity, so that stronger relationships (such as higher
similarity) are represented by smaller distances. The space that is used for data em-
bedding may be either continuous or discrete. In a continuous space, there is an
infinite number of locations within any chosen non-zero distance from any loca-
tion. In practice, analysts usually strive to embed data items in a continuous Eu-
clidean space, which means that Euclidean (straight-line) distances, as defined by
the formula 4.1 in Section 4.2.1, exist between the locations. While an abstract Eu-
clidean space may, in principle, have any finite number of dimensions, Euclidean
spaces with two or at most three dimensions are of particular interest due to the
possibility of visualisation. The most popular methods for embedding data in continuous Euclidean spaces include Principal Component Analysis (PCA)14, Multidimensional Scaling (MDS)15, Sammon mapping16, and t-SNE17.
14 https://en.wikipedia.org/wiki/Principal_component_analysis
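A minimal PCA embedding into two dimensions can be sketched with centring and singular value decomposition (for illustration only; library implementations additionally offer attribute scaling and explained-variance diagnostics):

```python
import numpy as np

def pca_embed(data, n_components=2):
    """Project data items (rows of `data`) into a low-dimensional
    Euclidean space using PCA: centre the data, then project it onto
    the leading principal directions obtained by SVD."""
    centred = data - data.mean(axis=0)
    # rows of vt are the principal directions, ordered by explained variance
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T
```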
As we stated earlier (Section 4.1.2), the value of data embedding for visual analytics
is that it can be used for creating spatialisations of data. The purpose of spatialisa-
tion is to enable a human observer to associate multiple objects into patterns. The
fundamental idea is that important patterns (combinations of relationships) existing
in the data should translate into spatial patterns that emerge between visual repre-
sentations of data items (Section 2.3.5). In Section 2.3.3, it is explained that spatial
patterns, such as density, sparsity, and spatial concentration (cluster), emerge from
distance relationships between objects located in space.
When human eyes see a set of points or other objects located on a plane or in the
three-dimensional space, the human brain instinctively judges the straight-line dis-
tances (i.e., Euclidean distances) between them. Objects are deemed close or distant
based on these straight-line distances, and it is quite hard to change these intuitive
15 https://en.wikipedia.org/wiki/Multidimensional_scaling
16 https://en.wikipedia.org/wiki/Sammon_mapping
17 https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
18 https://en.wikipedia.org/wiki/Self-organising_map
Fig. 4.6: Halos around points in a projection plot represent the cumulative distor-
tions of the original distances of the corresponding data items to the other items. The
radii are proportional to the absolute values of the distortions. White colour means
that the other points in the projection are mostly farther away than they should be,
and grey means the opposite direction of the distance distortion. Source: [127].
For a discrete embedding space, such as a result of the SOM method, a common
way to show the distortion is to visualise the distances between data items assigned
to neighbouring locations (or neurons, or nodes), which are usually represented as
cells of a regular grid. A common approach to visualising the distances is to sep-
arate the representations of the locations in the display by spaces and paint these
spaces according to the average distances between the data items assigned to the ad-
jacent locations. In a case of using SOM, the average distances between data items
corresponding to adjacent SOM nodes are usually represented in the form of a ma-
trix, which is called U-matrix (unified distance matrix) [84]. When discrete locations
(nodes) are represented by regular hexagons, the spaces between them can also have
the hexagonal shape, as shown in Fig. 4.8. In the example in Fig. 4.8, right, we see
light and dark zones in the SOM space. In the light zones, there is high similarity be-
tween the groups of the data items put in the neighbouring nodes. In the dark zones,
the groups of data items in the neighbouring nodes are quite dissimilar.
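For a rectangular grid, the average distances between the data items of adjacent nodes, which a U-matrix-style display would paint as the spaces between cells, can be sketched as follows (the input data structure, a dict mapping grid positions to arrays of item vectors, is our own assumption):

```python
import numpy as np

def u_matrix_values(grid_items, rows, cols):
    """Average distance between the data items assigned to each pair of
    horizontally or vertically adjacent grid nodes. `grid_items[(r, c)]`
    is an array of item vectors assigned to node (r, c). Returns a dict
    mapping pairs of adjacent nodes to the average inter-item distance."""
    values = {}
    for r in range(rows):
        for c in range(cols):
            for r2, c2 in ((r, c + 1), (r + 1, c)):  # right and lower neighbours
                if r2 < rows and c2 < cols:
                    a, b = grid_items.get((r, c)), grid_items.get((r2, c2))
                    if a is None or b is None or len(a) == 0 or len(b) == 0:
                        continue
                    # all pairwise distances between the two groups of items
                    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
                    values[((r, c), (r2, c2))] = float(dists.mean())
    return values
```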
Fig. 4.7: After interactive selection of a dot in the projection plot, the distortions of
the distances from the respective data item to all other data items are represented by
shifting the dots closer to or farther away from the selected dot. The white and grey lines show the directions and distances of the shifts. Source: [127].
Fig. 4.8: A result of the SOM method is represented as a grid with hexagonal cells.
The cells with the coloured circles in them represent the neurons, or nodes, of the
neural network. The sizes of the circles depict the numbers of the data items in the
nodes. The remaining hexagons are, essentially, spaces between the node cells. The
shading of these spaces represents the distances between the data items in the nodes
separated by the spaces. Source: [64].
19 https://en.wikipedia.org/wiki/Isomap
than others. Sparsely filled regions correspond to highly varied data items. A nearly uniform density of the dots throughout the plot area signifies large variance among the data items and an absence of clusters of similar items.
When data items can be meaningfully ordered, in particular, by the times the data
correspond to, data spatialisation can be enhanced by connecting dots (or other vi-
sual marks) by lines according to their order. This idea is employed in the Time
Curve visualisation technique demonstrated in Fig. 2.12. The connecting lines can
show small and large changes along the order, as well as re-occurrences of proper-
ties that occurred earlier in the order. However, you should keep in mind that the
shapes of the connecting lines between the dots do not convey any meaning. They
may be straight or curved, with higher or lower curvature, but do not try to interpret
these shapes.
Hence, the properties of the overall spatial distribution of visual marks in a data
spatialisation can give us useful information concerning relationships within a set
of data items. However, the interpretation of the spatial patterns requires caution
due to the unavoidable distortions of the distances. All observations and judgements
should be treated as approximate and potentially partly erroneous. For example, a
spatial cluster of dots may mostly correspond to a group of highly similar or strongly
related data items, but it may also include a few dots corresponding to less similar or
related items. Even when neighbouring dots correspond indeed to similar or related
items, you should not assume that smaller distances necessarily represent higher
similarity or stronger relationships. Beyond small neighbourhoods, the distances
in the embedding space should not be used for any estimations of the relationship
strengths or amounts of dissimilarity.
The spatial patterns observable in an embedding display give us highly abstract in-
formation about the distribution of relationships over the set of data items. A major
problem of such a display is that we do not know what data item(s) each visual mark
in the display stands for. It is insufficient to just label the marks; it is important to
know not the names but the characteristics of the data items. Some implementations
of visual tools for data spatialisation allow interactive selection of dots or groups
of dots in the data embedding display and show in response information about se-
lected data items. For example, in Fig. 2.12, images corresponding to selected dots
are shown beside the projection plot. In Fig. 4.9, information about attribute val-
ues is shown for two selected groups of dots and one singular dot. The density
plots (smoothed histograms) shown in a popup window enable comparisons of the
frequency distributions of the attribute values in the two groups and in the whole
dataset. The density plots are also shown in larger sizes in the lower right section of
the display. Moreover, the grey-scale thumbnails appearing in the same section left
to the attribute names show the distributions of the values of these attributes over
the area of the projection plot. Larger images appear upon clicking on the thumb-
nails.
Visualisation of the distributions of attribute values over an embedding space can
also be done for discrete embedding, such as SOM. Particularly for SOM, the term
Fig. 4.9: Two groups of dots and one singular dot are selected in a projection plot for
viewing and comparison. The corresponding mean values of attributes and density
plots of the value distributions are shown in a popup window. Larger density plots
are also shown in the lower right section of the display. This section also includes
small squares showing the distributions of the values of the different attributes over
the plot area as grey-scale heatmaps. Source: [127].
Fig. 4.10: SOM component planes, in which the heatmap technique is employed to
show the distribution of values of individual attributes over the discrete space of a
SOM. Source: [34].
pairs of opposite corners of a square. The colours for all other positions within the square are obtained by interpolation between the four colours. While simple linear
interpolation has been used in Fig. 2.10, the interpolation in Fig. 4.12 is done based
on an ellipsoid model. In this model, two dimensions correspond to the colour hues,
which are interpolated between the corners of the square, and the third dimension
corresponds to the colour lightness, which reaches maximum in the centre of the
square. For all other positions, the lightness values are obtained as vertical coor-
dinates of the corresponding points on the surface of the ellipsoid, which are cal-
culated using trigonometric functions. Compared to linear interpolation applied in
Fig. 2.10, ellipsoid-based interpolation generates lighter colours in the inner part of
the colour space. Hence, when you see a coloured visual mark in another display,
the colour lightness suggests in what part of the embedding space the corresponding
data item is positioned.
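The simple linear (bilinear) interpolation mentioned in connection with Fig. 2.10 can be sketched as below; the ellipsoid-based model of Fig. 4.12 additionally modulates the lightness with trigonometric functions and is not reproduced here:

```python
def bilinear_colour(x, y, corners):
    """Colour for position (x, y) in the unit square, obtained by bilinear
    interpolation between four corner colours. `corners` maps the keys
    (0,0), (1,0), (0,1), (1,1) to RGB triples with components in [0, 1]."""
    c00, c10 = corners[(0, 0)], corners[(1, 0)]
    c01, c11 = corners[(0, 1)], corners[(1, 1)]
    return tuple(
        (1 - x) * (1 - y) * c00[k] + x * (1 - y) * c10[k]
        + (1 - x) * y * c01[k] + x * y * c11[k]
        for k in range(3)
    )
```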
Fig. 4.11: Colours assigned to nodes of a SOM, as shown in Fig. 4.8, are used in
a parallel coordinates plot for colouring the lines representing the average attribute
values for the data items included in the SOM nodes. Source: [64].
The approach involving generation of a 2D colour scale can also be used for assign-
ing colours to clusters, as will be demonstrated in the next section.
4.5 Clustering
Clustering is a class of computational methods that group data items by their similarity. An example is customer segmentation in retail, where customers may be
grouped by the similarity of the products they buy. The groups of similar data
items are called “clusters”. As in data embedding, similarities between data items
are represented as distances, which are defined using suitable distance functions as
discussed in Section 4.2. Some clustering methods use in-built distance functions,
most often the Euclidean or Manhattan distance, for computing distances between
data items. Other methods allow external definition of the distances, which can be
supplied to the methods in the form of a distance matrix. There are also implementa-
tions of clustering methods that can call an external distance function for computing
distances when required.
Three types of clustering methods are frequently used in data science: partition-based, density-based, and hierarchical clustering. These method types differ not
just algorithmically but conceptually, as they assume different meanings of the term
“cluster”.
Partition-based clustering divides the set of items into subsets such that items within a subset are more similar to one another than to items in other subsets. These subsets are
called “clusters”. Each data item is put in some cluster, so that its average distance
to the other members of this cluster (or, alternatively, its distance to the centre of
this cluster) is smaller than its average distance to the members (or the distance to
the centre) of any other cluster. However, an item can be closer (more similar) to
some members of other clusters than to the nearest members of its cluster. Many
such cases can be seen in the example on the left of Fig. 4.13. The most popular
method for partition-based clustering is k-means20 .
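A plain k-means sketch in Python (for determinism, the centres are naively initialised with the first k items; real implementations use random restarts or k-means++ initialisation):

```python
import numpy as np

def k_means(data, k, n_iter=100):
    """Plain k-means: repeatedly assign each item to the nearest centre
    and move each centre to the mean of its members, until the centres
    stop changing. Returns the cluster label of each item."""
    centres = data[:k].astype(float)  # naive deterministic initialisation
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        # distance from every item to every centre
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        new_centres = np.array([data[labels == j].mean(axis=0)
                                if np.any(labels == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels
```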
Density-based clustering involves the concept of neighbourhood. Two data items
are treated as neighbours if the distance between them is below some chosen thresh-
old. When an item has many neighbours, it means that the density of the data around
it is high. Density-based clustering methods aim at finding dense groups of data
20 https://en.wikipedia.org/wiki/K-means_clustering
Fig. 4.13: Conceptual differences between the partition-based (left) and density-
based (right) types of clustering are demonstrated by clustering the same set of
points in the geographical space according to the spatial distances between them.
The number of the resulting clusters is the same in both images. The cluster mem-
bership is signified by the dot colours (the colours are used independently in each
image). On the left, the clusters are enclosed in convex hulls for better visibility.
items, i.e., such groups in which each item has sufficiently many neighbours. These
dense groups are called “clusters”. Items that don’t have enough neighbours are
not included in any cluster. All these items are jointly called “noise”. Unlike in
partition-based clustering, items that are put in clusters by density-based cluster-
ing cannot be closer to members of other clusters than to neighbouring members
of their own clusters; however, the distances between non-neighbouring members
of the same cluster may be quite high. The differences between clusters created by
partition-based and density-based clustering are demonstrated in Fig. 4.13 by the example of clustering a set of points distributed in the geographic space according to the spatial distances between the points. In both images, dots are coloured according to their cluster membership. The grey dots in the right image are the “noise”,
according to the density-based clustering. On the left, the clusters resulting from
partition-based clustering are enclosed in convex hulls to be better visible on the
map. The representative method for density-based clustering is DBSCAN21 .
21 https://en.wikipedia.org/wiki/DBSCAN
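A minimal DBSCAN sketch following this neighbourhood-based definition; `eps` is the distance threshold and `min_pts` the required number of neighbours (counting the item itself), and items reachable from no core point are labelled -1, i.e. noise:

```python
import numpy as np

def dbscan(data, eps, min_pts):
    """Minimal DBSCAN: items with at least `min_pts` neighbours within
    `eps` are core points; clusters grow from core points through their
    neighbourhoods; items reachable from no core point are noise (-1)."""
    n = len(data)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # expand a new cluster from the unvisited core point i
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:  # only core points spread the cluster further
                    queue.extend(neighbours[j])
        cluster += 1
    return labels
```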
Fig. 4.14: Data (six items A-F, left) and a dendrogram representing their hierarchical clustering (right).
At each level of the hierarchy, the clusters do not overlap and jointly include all
data items, which is analogous to results of partition-based clustering. Hence, by
cutting the hierarchy (or the dendrogram representing it) at some chosen level and
discarding the levels below, we obtain clusters similar to what can be obtained with
partition-based clustering. Thus, if we cut the dendrogram in Fig. 4.14 at level 1
(i.e., the highest), we obtain two clusters {A, B} and {C, D, E, F}, and cutting at the
next level gives us {A, B}, {C}, and {D, E, F}.
If the cut is done farther down the hierarchy, a possible result can be that many
clusters include singular data items or very few data items. If the members of such
very small clusters are treated as “noise”, the result will be similar to that of density-
based clustering. Hence, hierarchical clustering can be seen as a kind of “hybrid” of
partition-based and density-based clustering. For example, cutting the hierarchy in
Fig. 4.14 at level 3 gives us two “dense” clusters {A, B} and {E, F}, while the items
C and D can be treated as “noise”.
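Cutting the hierarchy at a chosen height can be sketched with a naive average-linkage agglomerative procedure (for small datasets only; stopping the merging once the smallest inter-cluster distance exceeds the cut height is equivalent to cutting the dendrogram at that height):

```python
import math

def hierarchical_cluster(points, cut_height):
    """Naive average-linkage agglomerative clustering: repeatedly merge
    the two closest clusters, stopping once the smallest inter-cluster
    distance exceeds `cut_height`, which is equivalent to cutting the
    dendrogram at that height. Returns clusters as lists of item indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        return sum(math.dist(points[i], points[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = min(((avg_dist(a, b), ia, ib)
                    for ia, a in enumerate(clusters)
                    for ib, b in enumerate(clusters) if ia < ib),
                   key=lambda t: t[0])
        if best[0] > cut_height:
            break  # the cut height is reached: stop merging
        _, ia, ib = best
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters
```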
When you use a dendrogram representing results of hierarchical clustering, you
should keep in mind that only the distances along the hierarchy are informative:
they are proportional to the distances between the clusters. The distances in the
perpendicular dimension are not meaningful and should be ignored when you in-
terpret the clustering results. Particularly, when a dendrogram has a vertical layout
22 https://en.wikipedia.org/wiki/Hierarchical_clustering
of the levels, as in Fig. 4.14, it is the heights of the branches that are meaningful, not the horizontal distances between the branches or between the nodes at the bottom of
the hierarchy. For example, the nodes C and D are very close in the dendrogram
in Fig. 4.14 but quite distant from each other in the data space. A dendrogram can
also be drawn using a horizontal layout, e.g., with the highest level on the left and
the lowest on the right. In this case, the horizontal distances are meaningful and the
vertical are not.
This discussion of the principal differences between the types of clustering methods
should help you to understand what type of clustering you need to apply in each
particular situation. When you want to divide your dataset into subsets consisting of
homogeneous (more or less) items, you apply partition-based clustering. When you
want to identify groups of highly similar items and assess the degree of variability
in the data, you apply density-based clustering. In a sense, hierarchical clustering combines the benefits of partition-based and density-based clustering: a dendrogram
can show the variability and groups of similar items and also allow division into
internally homogeneous subsets. However, practical use of these benefits is possible
only for relatively small datasets. One limitation of hierarchical clustering is the long computation time. Another limitation, which is even more relevant in the context of visual analysis, is the impossibility of representing a very large dendrogram in a visual display and examining and interpreting it.
After you have run some clustering tool and obtained clusters, you need to interpret
and compare these clusters in terms of the characteristics of the data items they
consist of. To see these characteristics, you need to use suitable visual displays,
in which the clusters need to be somehow distinguished. A typical approach is to
assign a distinctive colour to each cluster and use these colours to represent cluster
affiliations of data items or subsets of data items. If the data items are not very
numerous, you can use displays showing them individually. In case of larger data,
you will have to visualise aggregated characteristics. Aggregated data visualisations
may be clearer and more informative also for smaller datasets. Let us illustrate these
approaches by the example of clustering the London wards according to the values of several attributes representing proportions of population with different levels of qualification.
Figures 4.15, 4.16, and 4.17 demonstrate different degrees of aggregation in visualising the results of partition-based clustering, which was done according to the values of multiple numeric attributes. The result consists of four clusters.
Each cluster has been assigned a unique colour. Figure 4.15 demonstrates a parallel
coordinates plot where the parallel axes correspond to the attributes that were in-
volved in the clustering. Each individual data item is represented by a line in this
plot, and the cluster membership of this data item is represented by the line colour.
In Fig. 4.16, the same data are represented by a set of histograms, each corresponding
to one attribute. The bars of the histograms are divided into coloured segments with
the heights proportional to the numbers of the data items belonging to the four clus-
ters. Unlike the parallel coordinates plot, the histograms are free from over-plotting
and visual clutter, and it is easier to see and compare the ranges of the attribute val-
ues and the most frequent values in the clusters. The bar charts in Fig. 4.17 show the
characteristics of the clusters in the most aggregated form. In the upper display, the
mean values of the attributes for the clusters are represented by the bar length. To
facilitate comparison of the cluster profiles, the bar charts in the lower display show
the means of the attribute values transformed to z-scores, i.e., the differences from the
attribute's mean divided by the attribute's standard deviation.
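The z-score cluster profiles shown in the lower bar charts of Fig. 4.17 can be computed along these lines (a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def cluster_profiles(X, labels):
    """Per-cluster mean z-scores of the attributes: the cluster 'profiles'
    that the aggregated bar charts display."""
    std = X.std(axis=0)
    std[std == 0] = 1  # avoid division by zero for constant attributes
    Z = (X - X.mean(axis=0)) / std
    return {c: Z[labels == c].mean(axis=0) for c in np.unique(labels)}
```

A positive value in a profile means that the cluster's mean for that attribute lies above the overall mean, a negative value below it.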
Of course, displays such as those in Figs. 4.15, 4.16, and 4.17 are suitable for representing
clustering results when the clustering is done according to the values of multiple nu-
meric attributes. If the type of data used for clustering is different, you will need to
choose other visualisation techniques, such as time series plots and time histograms,
maps of spatial distributions, or other methods suitable for the data type.
Having understood what the clusters mean, you may wish to see how they are re-
lated to other components of the data. In our example, the London wards have been
grouped according to the similarity of the qualification profiles of the population.
It is interesting to see how these different groups of qualification profiles are dis-
tributed over the territory of London. For this purpose, we can use a map with the
wards painted in the colours of their clusters, as shown in Fig. 4.18, top. Looking
at this map and the bar charts in Fig. 4.17 or histograms in Fig. 4.16 in parallel, we
can observe that the wards with high proportions of highly qualified people (painted
in purple) are mostly located in the centre, whereas wards having high proportions
of people with low to medium levels of qualification as well as schoolchildren and
students 16-17 years old (green) are on the periphery, particularly on the east and
southeast.
Fig. 4.18: Cluster colours can be propagated to various displays representing the
objects that have been clustered, such as a map (top) or a frequency histogram of
values of a numeric attribute (bottom).
To relate the qualification profiles to the mean age of the ward inhabitants, we can
create a frequency histogram for the attribute ‘mean age’ and represent the clusters
in the same way as it was done in Fig. 4.16 for the attributes involved in the cluster-
ing: the bars of the histogram are divided into coloured segments proportionally to
the counts of the members of the different clusters in the corresponding intervals of
the values of ‘mean age’, as shown in Fig. 4.18, bottom. In this histogram we see,
for example, that the wards from the red cluster have mostly low to medium values
of the mean age, while the mean ages in the wards from the green cluster are mostly
medium to high.
Fig. 4.19: The use of data embedding for assigning colour to clusters. Left: 2D
embedding of summarised cluster profiles. Right: Cluster colours are picked from a
continuous colour scale spread over the embedding space.
Hence, the use of cluster colours in various visual displays can reveal the relation-
ships between the data components that were involved in defining the clusters and
the other components of the data. It is very beneficial, especially when the clus-
ters are more numerous than in our example, to assign colours to the clusters in a
meaningful way, so that similarity of colours signifies similarity of the clusters. This
can be achieved by applying data embedding techniques to summary characteristics
of the clusters (such as the mean values of the attributes involved in the cluster-
ing) and spreading a continuous colour scale over the embedding space as shown in
Fig. 4.19 (which is similar to Fig. 2.10, where this technique was applied to original
data rather than clusters). Each cluster gets a particular position in the embedding
space and can be assigned the colour corresponding to this position.
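One way to implement the colour assignment of Fig. 4.19 — embedding the cluster summaries in 2D and spreading a colour scale over the embedding space — is sketched below. MDS is only one possible embedding method, and mapping the x and y coordinates to the red and green channels is a simplistic stand-in for a proper continuous 2D colour scale:

```python
import numpy as np
from sklearn.manifold import MDS

def colours_from_embedding(centroids, random_state=0):
    """Project cluster centroids to 2D and derive a colour per cluster
    from its position, so that similar clusters get similar colours."""
    pos = MDS(n_components=2, random_state=random_state).fit_transform(centroids)
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    span = np.where(hi - lo == 0, 1, hi - lo)
    unit = (pos - lo) / span              # rescale positions to [0, 1]
    return [(r, g, 0.5) for r, g in unit]  # RGB triples; blue fixed at 0.5
```

Clusters whose centroids are close in the embedding space receive similar RGB triples, which is the property exploited in the figures.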
When applying partition-based or hierarchical clustering, you can usually decide
how many clusters you wish to have. For practical needs, you will
typically not generate too many clusters. With density-based clustering, it is hard to
predict how many dense groups of data items the algorithm will find, and it is not
unusual to obtain a large number of clusters. It may not be possible to give each cluster a
sufficiently distinct colour. In this case, it is even more important to assign similar
colours to similar clusters. Although you will not differentiate individual clusters in
the displays where the colours are used, you will see the characteristics and distri-
butions of groups of similar clusters. To see individual differences within a group,
you may investigate this group separately by applying secondary embedding and
colouring only to the clusters of this group.
edge. Hence, you need to perform clustering several times with different parameter
settings until you feel that you understand the data and the corresponding aspect of
the phenomenon well enough.
Let us consider partition-based clustering (the most frequently used type
of clustering) using the running example with the London wards that we used before
(Figures 4.15 to 4.18). The result of the partitioning into 4 clusters told us that the
categories representing lower qualification levels (1 and 2), absence of qualifica-
tion, and apprenticeship are interrelated in the sense that the proportions of these
categories are often either simultaneously low or simultaneously high. The highest
qualification level (4) is, in a sense, opposite to all the others: its proportion is high
where the proportions of the other levels are low, and vice versa. We have learned
that there is a group of wards where the number of people having the highest qual-
ification is much above the average, and these wards are mostly in the city centre.
We have also discovered that many wards have higher than average propor-
tions of people with lower qualification levels and without qualification, and these
wards are mostly located on the eastern and southeastern periphery of the city. We
can also notice that the wards with high proportions of highly qualified people have
low proportions of schoolchildren and students 16-17 years old, and we notice the
existence of wards distinguished by high proportions of people with “other qualifi-
cations” (this category includes qualifications that have no official levels in the UK,
in particular, qualifications obtained in foreign countries).
Let us see what we can gain from increasing the number of clusters from 4 to 5. The
bar charts in Fig. 4.20, top, confirm the knowledge we gained earlier. Additionally,
we discover the existence of spatially scattered wards with very high (relative to
the average) proportions of level 3 qualifications and of students aged 18 and over.
However, this group of wards is quite small (23 wards), as can be seen from
the legend in Fig. 4.20, bottom. A large group of wards (a cluster with very pale,
almost white colour) has close-to-average proportions of all qualification and study
categories.
Increasing the cluster number to 6 reveals a rather small group of wards, mostly
located compactly in the centre, with high proportions of level 4 qualifications and
students aged 18 and over, and also relatively high proportions of level 3 and
other qualifications (Fig. 4.21). Further increases of the cluster number reveal only
small variations of the main profiles we have discovered earlier, so we see no
reason to continue the process.
If we compare the clustering results in terms of the average silhouette scores, we
shall see that the average scores for 4, 5, and 6 clusters are 0.2788, 0.2916, and
0.2522, respectively; hence, the result with 5 clusters appears slightly better than
the others. The average silhouette scores are computed from the individual silhou-
ette scores of the cluster members, which may range from -1 to 1. Positive values
indicate that a data item is closer to the members of its own cluster (regarding the
mean distance) than to the members of any other cluster, and a negative value means
that a data item is closer to members of some other cluster. The closer the individual
value is to 1, the better, and higher average scores for clusters are meant to indicate
better clustering results.

Fig. 4.20: The same dataset as in Figs. 4.15 to 4.18 has been partitioned into 5
clusters. Top: the cluster summaries in terms of the average z-scores of the attribute
values. Bottom: the spatial distribution of the clusters.
In our case, the result with 5 clusters is better than those with 4 and 6; however, the
average score for cluster 4 with 23 members (bright pink in Fig. 4.20) is very low, only
0.1679. The other results also include clusters with very low scores: cluster 2
(yellow) in the result with 4 clusters has an average score of 0.1716, and cluster
4 (dark purple) in the result with 6 clusters has an average score of only 0.0964.
Increasing the number of clusters further does not improve the quality in terms of
the silhouette scores; however, when we decrease the number of clusters to 3, the
overall average silhouette score increases to 0.3369, with the minimum of the cluster
averages being 0.3128. Hence, according to this measure of the cluster quality, the
result with 3 clusters is the best.
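The comparison of clustering results by average and per-cluster silhouette scores can be reproduced along these lines, using scikit-learn on synthetic data (the scores will, of course, not match the values reported above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[:100] += 2  # give the data some group structure so the scores differ

for k in (3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    overall = silhouette_score(X, labels)     # average over all data items
    per_item = silhouette_samples(X, labels)  # individual scores in [-1, 1]
    worst = min(per_item[labels == c].mean() for c in range(k))
    print(f"k={k}: average={overall:.4f}, lowest cluster average={worst:.4f}")
```

Besides the averages, the `per_item` array is exactly what the histograms in Fig. 4.22 summarise.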
Fig. 4.21: The same dataset as in the previous figures has been partitioned into 6
clusters. Top: the cluster summaries in terms of the average z-scores of the attribute
values. Bottom: the spatial distribution of the clusters.
Apart from the average scores, we can also compare the distributions of the individ-
ual scores using histograms, as in Fig. 4.22. Indeed, the scores for 3 clusters look
better: there are more high values and fewer negative values.
The better silhouette scores for the result with 3 clusters do not mean, however,
that we should ignore all other results. First, multiple results allow us to check the
consistency of our findings. Thus, the result with 3 clusters (Fig. 4.23) tells us that
the proportions of lower qualification categories, along with no qualification and
younger students, are interrelated and opposite to level 4 qualifications. When we
increase the number of clusters, we still see the same relationships, which increases
our confidence in this finding. Second, increasing the number of clusters reveals
subsets of the data whose profiles differ in some ways from the others. When these
subsets are big enough, seeing their differences increases and refines our understanding
of the data and the phenomenon. Thus, we have observed the variety of
qualification profiles lying between the extreme high and low qualification profiles
and found that the profiles with all proportions close to the average occur in a large
part of the wards.

Fig. 4.22: Frequency histograms of the individual silhouette scores of the data items
for the results with 3, 4, and 5 clusters.
What we wanted to demonstrate with this example is that the main role of clustering
in studying a phenomenon is not to merely produce good clusters but to provide
valuable knowledge. In this respect, important criteria of cluster goodness are how
meaningful and informative they are. Therefore, you should not rely only on nu-
meric measures of cluster quality. On the other hand, you should not neglect such
measures either, as they can tell you whether you can treat cluster summaries as
valid and significant.

Fig. 4.23: The same dataset as in the previous figures has been partitioned into 3
clusters. This result is the best of all in terms of the average silhouette scores.

For example, the lowest histogram of the silhouette scores in
Fig. 4.22 tells us that the bright pink cluster has many (relative to its size) members
with negative scores. These members are closer to members of other clusters than
to the co-members in their cluster, which means that the summary profile of this
cluster in Fig. 4.20 should not be trusted.
As we explained in Section 4.1.2, the role of clustering in visual analytics work-
flows is to enable effective visual representation of large and/or complex data by
organising the data into groups, which are then characterised by their statistical
summaries. In our example, we characterised the clusters by the profiles consist-
ing of the mean values of the attributes, which is a typical approach. Such profiles
can be seen as cluster centres, usually called centroids in statistics and data mining.
When the nature of the data does not permit computing some kind of “average ob-
ject” from cluster members, clusters can be represented by their medoids. These are
the cluster members having the smallest sums of the distances to the other cluster
members.

Fig. 4.24: The frequency distribution of the distances of the cluster members to the
respective cluster centroids for the result with three clusters presented in Fig. 4.23.
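A medoid, as defined above, can be found by a brute-force search over the pairwise distances (Euclidean distance assumed here for illustration; any distance function would do):

```python
import numpy as np

def medoid(members):
    """Return the cluster member with the smallest sum of distances
    to all the other members of the cluster."""
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
    return members[d.sum(axis=1).argmin()]
```

Unlike a centroid, the result is always an actual data item, which is what makes medoids usable when "average objects" cannot be computed.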
Apart from summarising cluster members, it is also appropriate to gain awareness of
the within-cluster variation. When cluster centroids can be obtained, a possible ap-
proach is to look at the distribution of the distances from the cluster members to the
centroids (or medoids, when centroids cannot be obtained). For example, the seg-
mented frequency histogram in Fig. 4.24 shows the distribution of the member-to-
centroid distances for the clustering result with three clusters. The histogram reveals
the existence of several outliers, i.e., data items that are quite distant from the cen-
troids of their clusters. Such outliers may affect the cluster summaries. Therefore, it
is useful in such cases to filter the outliers out, re-compute the cluster summaries,
and check if they have changed significantly. In our example, while the mean at-
tribute values indeed slightly change, the overall profile patterns remain the same.
If your analysis aims not only at finding the most typical patterns but also detect-
ing and inspecting anomalies, the items distant from the cluster centroids deserve a
closer look.
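The outlier check described above — inspecting member-to-centroid distances and re-computing the summary without the most distant items — might be sketched as follows; the 95% quantile threshold is an arbitrary illustrative choice:

```python
import numpy as np

def distances_to_centroid(members):
    """Euclidean distance of each cluster member to the cluster centroid."""
    centroid = members.mean(axis=0)
    return np.linalg.norm(members - centroid, axis=1)

def summary_without_outliers(members, quantile=0.95):
    """Drop the most distant members and re-compute the cluster summary.
    Returns the recomputed summary and the filtered-out items."""
    d = distances_to_centroid(members)
    keep = d <= np.quantile(d, quantile)
    return members[keep].mean(axis=0), members[~keep]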
Fig. 4.25: An enlarged fragment of the map in Fig. 4.13 (right) demonstrates that
density-based clusters may have complex shapes.
clusters is not helpful either, because the members of such clusters vary greatly
and have nothing in common.
When you are not happy with the current results, you run the clustering algorithm
again with different parameter settings. To reduce the fraction of noise, you
loosen the density conditions by either increasing the distance threshold (neighbour-
hood radius) or decreasing the required minimal number of neighbours (it is better
to avoid changing both parameters simultaneously). To decrease the internal vari-
ance within clusters, you tighten the density conditions by making the opposite
changes to the parameter values.
However, it may happen that most of the clusters you have obtained are nice, while
there are just a few clusters you are not happy with. Re-running the clustering with
other settings may ruin the good clusters. To preserve the good clusters and improve
the bad ones, you can do progressive clustering: you separate the good clusters
from the rest of the data and apply clustering with tighter parameter settings to the
rest.
Generally speaking, progressive clustering is the application of clustering to a subset
of data chosen based on the result of a previous clustering. The purpose is
to obtain a refined representation and, hence, a refined understanding of this data
subset. For this purpose, the subset clustering is done with different parameter set-
tings or even with a different distance function than the whole set. The use of dif-
ferent distance functions may be especially helpful when you deal with complex
objects characterised by heterogeneous properties, such as trajectories of moving
objects [116].
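Progressive clustering can be sketched as follows, using DBSCAN from scikit-learn as an example density-based method. Which clusters count as "good" is the analyst's judgement, so it is passed in explicitly here:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def progressive_clustering(X, first_labels, good, **refine_params):
    """Keep the clusters judged good and re-run DBSCAN with different
    (typically tighter) parameter settings on the remaining items only."""
    rest = ~np.isin(first_labels, list(good))  # items outside the good clusters
    refined = DBSCAN(**refine_params).fit_predict(X[rest])
    return rest, refined

# toy data: two tight groups of points
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1]])
first = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# keep cluster 0 as-is; re-cluster the rest with a tighter radius
rest, refined = progressive_clustering(X, first, good={0}, eps=0.3, min_samples=3)
```

The same function could take a different distance metric for the refinement step, which is the variant discussed above for complex objects such as trajectories.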
Summarising this section, we emphasise again that clustering is an analytical pro-
cess rather than a single application of a clustering algorithm to data. In this pro-
cess, you do clustering multiple times with different parameter settings, different
data subsets, different distance functions, and, perhaps, different clustering algo-
rithms (e.g., apply partition-based clustering to density-based clusters). You need
to create appropriate visual representations of cluster summaries and properties for
interpreting and comparing clustering results and gradually building up your knowl-
edge about the data.
multiple sets of clusters, you are interested in keeping the cluster colours consistent
between the results of the different runs, so that similarity or dissimilarity of colours
corresponds to similarity or dissimilarity of clusters across the different clustering
results. How can this be achieved?
Most of the data embedding algorithms are non-deterministic with respect to the
absolute positions given to data items in the embedding space. If you apply an em-
bedding algorithm several times to the same data with the same parameter settings,
the results will, most probably, look different because the data will be put in dif-
ferent positions. However, the relative placements of the data items with respect to
other data items will be consistent between the results. In fact, one result can be
transformed to another one through rotation and/or flipping, plus, possibly, a little
bit of scaling.
When you have two sets of clusters (obtained from different runs of a clustering
algorithm) such that most of the clusters from one set have corresponding identical
or very similar clusters in the other set, you can, in principle, find such transforma-
tions of the embedding of one set of clusters so that the clusters that have matches in
the other set get the positions as close as possible to the positions of their matches
in the embedding of the second set. However, we do not know any method to find
such transformations automatically. What can be done instead is to let the embed-
ding tool run many times producing many variants of embedding. This does not
take much time because the number of clusters is usually quite small. Having many
variants, you can choose the one that is the most similar to the embedding you used
for assigning colours to the earlier obtained clusters. The degree of similarity can
be expressed numerically as the sum of weighted distances between the positions of
the matching clusters, the distances being weighted by the cluster sizes [13].
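The selection of the embedding variant by the sum of size-weighted distances [13] might be implemented as follows; this sketch assumes that the clusters in each variant have already been matched to the reference, i.e., are listed in the same order:

```python
import numpy as np

def embedding_mismatch(positions, reference, sizes):
    """Sum of cluster-size-weighted distances between the new positions
    of the matched clusters and their reference positions."""
    d = np.linalg.norm(positions - reference, axis=1)
    return float((d * sizes).sum())

def best_variant(variants, reference, sizes):
    """Among several embedding runs, choose the one closest to the
    reference layout used for the earlier colour assignment."""
    return min(variants, key=lambda p: embedding_mismatch(p, reference, sizes))
```

Weighting by cluster size means that a displaced large cluster penalises a variant more than a displaced small one.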
Fig. 4.27: Cluster colours obtained through the joint embedding of the cluster cen-
troids (Fig. 4.26) are used for painting London wards on a map.
An even better idea may be to apply embedding to the centroids (or summaries, or
other representatives) of the clusters resulting both from the previous run and from
the new run, i.e., two sets of clusters are taken together. In this case, there is no
need to match clusters from two runs; similar clusters will be placed together in the
embedding. The embedding tool can be run multiple times to produce embedding
variants, from which one chooses the variant where the positions of the earlier obtained
clusters are closest to the positions they had before the last run of clustering.
The positions (and, respectively, the colours) will not be exactly the same as before,
but the changes are likely to be small. You will, however, need to re-assign the new
colours to the previous clusters and get used to the colour change.
The idea of joint embedding can be applied not only to the results of the two last runs
of clustering, but the results of all previous runs can be involved as well. Apart from
consistent colour assignment, the benefit of joint embedding is showing similarities
and differences between clusters from different runs. This is illustrated in Fig. 4.26,
where data embedding has been applied to the centroids of the clusters resulting
from 6 runs of partition-based clustering that produced from 3 to 8 clusters. The
clustering was done on the same data referring to the London wards that we used
in the previous illustrations. On the left, each cluster is represented by a dot in the
embedding space coloured according to the clustering run the cluster results from:
the colours from dark blue to red correspond to the runs with 3 to 8 resulting clusters.
On the right, a continuous colour scale is spread over the embedding space. It can be
used for assigning consistent, similarity-based colours to the clusters. In Fig. 4.27,
these colours are used for painting the wards on six maps representing results of the
six clustering runs. The maps are easy to compare due to the high consistency of the
colours.
The extensive discussion of the use of clustering presented in this section should
be considered more generally. In analytical workflows, almost any method for com-
puter analysis needs to be applied in an iterative way with varying parameter settings
and/or modifying the input data. The purpose of running a method repeatedly is not
only (and not so much) to obtain the best possible results but to incrementally build
up the human's knowledge about the data and the phenomenon reflected in the data.
Results of different runs need to be interpreted, assessed, and compared. All these
operations, which require the involvement of human cognitive abilities, need to be
supported by visualisation. In this section, we showed and discussed how visuali-
sation can support the process of using clustering for data analysis. This provides
an example for the use of other kinds of computational methods. Perhaps the tech-
niques that we used for supporting clustering cannot be directly applied to the results
of another kind of computational method. However, this section demonstrates what
needs to be taken care of in the process of analysis with the use of computational
techniques.
4.6 Topic modelling

We have mentioned in Section 4.4.3 that the methods known as “topic modelling”,
which have been developed for text analysis, are, in fact, methods of data embed-
ding; more specifically, they belong to the category of dimensions-transforming
methods. The new dimensions generated by the topic modelling methods are called
“topics”. Each “topic” is defined as a composition of weights of the original at-
tributes. Usually, the number of “topics” to be generated is specified as a parameter
of the method. Each data item is described by a combination of weights of the “top-
ics”. In classical applications of topic modelling methods, the original attributes are
frequencies of various keywords occurring in texts. The “topics” are thus combina-
tions of weights, or probabilities, of the keywords. A topic can be represented by a
list of keywords having high probabilities.
Beyond the classical applications, topic modelling methods can be applied to other
data that can be treated as “texts” in a very general sense, i.e., as sequences or com-
binations of items taken from a finite set, which can be treated as a “vocabulary”.
For example, the “vocabulary” can consist of the names of the streets in a city, and
trajectories of vehicles can be represented as texts consisting of the names of the
streets passed on the way [41]. This approach, however, ignores the direction of the
movement, treating street segments going from A to B and from B to A as equivalent,
unless segments in opposite directions are encoded differently. Another example is a
“vocabulary” consisting of possible actions, such as operations performed by users
of a computer system, and “texts” consisting of action sequences [40]. In this case,
application of topic modelling will ignore the order in which the actions were per-
formed and take into account only the co-occurrence of different actions in one
sequence.
Moreover, it is possible, in principle, to apply topic modelling to any data with a
large number of positively-valued attributes having compatible meanings. For ex-
ample, these may be data describing purchases of various kinds of products by cus-
tomers of an e-shop, or combinations of educational modules chosen by students. In
these examples, the products or modules play the role of “words”, and their combi-
nations can be treated as “texts”.
Since the methods for topic modelling have been originally developed for text doc-
uments, they are usually described using text-specific terminology. A more precise
term to refer to this class of methods is probabilistic topic modelling [32], because
these methods define a topic as a probability distribution over a given vocabulary,
i.e., a set of keywords. In application to texts, the keywords with high probabilities
represent the semantic contents of the topics. Apart from the topic-keywords dis-
tributions, topic modelling methods also produce document-topics distributions
consisting of the probability of each document being related to each topic.
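The two outputs — topic-keywords and document-topics distributions — can be obtained, for example, with the LDA implementation in scikit-learn. A minimal sketch on a toy corpus (the corpus and topic count are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "dog tail bark dog",
    "cat purr cat whiskers",
    "dog bark cat",
]  # toy corpus

counts = CountVectorizer().fit_transform(docs)  # keyword frequency matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # document-topics distribution
topic_words = lda.components_           # topic-keywords weights
```

Each row of `doc_topics` sums to 1 and describes one document as a mixture of the two topics; each row of `topic_words` can be sorted to list the high-probability keywords that represent a topic.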
It needs to be borne in mind that topics are defined based on frequent co-occurrences
of keywords in texts. For example, a topic represented by the keywords “dog, tail,
bark” can be derived only when these keywords frequently co-occur in the same text
documents. This means that topic modelling methods may fail to derive meaningful
topics from collections of very short texts [155], such as messages consisting of one
or a few sentences posted in online social media. A single message may contain only
one or two informative keywords; hence, the total number of co-occurrences may
be insufficient for deriving valid relationships between the words. To deal with this
problem, short texts are aggregated into larger pseudo-documents [150]. This can be
done based on the temporal proximity of the creation of the texts, commonality of
message tags, and/or authorship of the texts. When geographic locations of the text
creation are known, as in the case of the georeferenced tweets in the introductory
example (Chapter 1), the texts can also be aggregated based on the proximity of
their locations.
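Aggregation into pseudo-documents can be as simple as grouping by a chosen key; here, hypothetically, by author, but temporal or spatial proximity could serve as the grouping key in the same way:

```python
from collections import defaultdict

def aggregate_by_author(messages):
    """Merge short messages of the same author into one pseudo-document.
    `messages` is a list of (author, text) pairs."""
    groups = defaultdict(list)
    for author, text in messages:
        groups[author].append(text)
    return {author: " ".join(texts) for author, texts in groups.items()}
```

The resulting pseudo-documents contain enough keyword co-occurrences for topic modelling to derive valid relationships between the words.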
Let us extrapolate this problem and the solution to another application. Sets of prod-
ucts purchased simultaneously by individual customers may be too small for iden-
tifying groups of related products based on separate purchases. It may be reason-
able to aggregate multiple purchases of different customers based on their temporal
proximity, occurrence of products from the same category, and locations of the cus-
tomers.
One of the most popular algorithms used for topic modelling is Latent Dirichlet Al-
location (LDA) [33]. It is implemented in multiple open software libraries, computa-
tionally efficient, and known to be more general than some other methods, e.g., Latent
Semantic Analysis (LSA) [48] and Probabilistic Latent Semantic Analysis (PLSA) [69]. A
shortcoming of the method is its inconsistency across multiple runs: being proba-
bilistic, the algorithm does not guarantee that the same results will be produced for the
same input data and parameters. Nonnegative Matrix Factorization (NMF) [108] is
a deterministic method, which generates consistent results over multiple runs when
applied to the same data. The drawback is its polynomial complexity, which limits the
applicability of the method to large data sets.
Most of the topic modelling methods require the number of topics to derive to be
specified by the user, for whom it may be hard to estimate how many meaningful and
distinct topics exist. This is similar to specifying the number of clusters in partition-
based clustering, and the same approach is taken to deal with the problem: run topic
modelling several times and compare the results.
Another problem is that many topic modelling algorithms are non-deterministic and
produce slightly different results in each run even if the number of topics is the
same [43]. Therefore, better understanding of the contents of a text corpus can be
gained by examining and comparing results of multiple runs of topic modelling, or,
in other words, an ensemble of topic models [40].
Topics generated in different runs of an algorithm can be compared with the help of
joint embedding, as, for example, shown in Fig. 4.28. We have earlier applied the
same approach to results of several runs of partition-based clustering (Fig. 4.26). In
the embedding space, similar topics are located close to each other and dissimilar
ones are distant from each other. A cluster of spatially close topics indicates the exis-
tence of an archetype topic that is discovered by multiple runs of a topic modelling
method. In other words, it is the same topic represented slightly differently in dif-
ferent models. The fact that it was found more than once indicates its significance
and trustworthiness. Topics that are scattered over a wide area in the plot may be
computational artefacts rather than representations of really existing topics. If there
are a few topics that are distant from others, the analyst can check their trustworthi-
ness through additional runs of the algorithm, incrementing the number of topics in
each run. The occurrence of similar topics in the additionally obtained topic models
is evidence of the topics' significance. Topics that remain far from all others can be
ignored. In this iterative and interactive way, the analyst can gain understanding of
what and how many topics exist.
To obtain a good set of valid and distinct topics suitable for further analysis or for
reporting, the analyst can select a representative topic from each group of close
topics. The analyst can define the groups using the topic embedding display, e.g., by
encircling clusters of dots. A suitable representative of a group is the medoid, i.e.,
the topic having the smallest sum of distances to all other topics in the group [40].
Here we mean not the distances between the dots in the embedding space but the
distances in the multidimensional data space, in which the topics are represented by
weights or probabilities of multiple components of the original data.
In principle, a suitable number of topics can be determined automatically based
on some quality measures [27]. However, what is optimal in terms of statistical
indicators is not necessarily the most meaningful and useful to a human. An analyst
can gain better understanding and knowledge of the data from seeing and exploring
the “topic space” with dense and sparse areas, clusters, and outliers.
The most significant difference between topic modelling and clustering is that the
common methods for clustering assign each data item to a single cluster whereas
topic modelling can associate a data item with several relevant topics. A basic
premise of topic modelling is that any document may concern several topics, pos-
sibly, in different proportions; i.e., each document is modelled as a topic mixture.
Topics are identified based on term co-occurrences. Even if there are no documents
solely concerned with some topic A but there are documents where A is combined
with various other topics B, C, D..., a topic modelling algorithm is quite likely to
find topic A, because the co-occurrence frequencies of the terms pertinent to A will
be high while their co-occurrences with the terms specific to B, C, D... will be rela-
tively less frequent.
Another difference is that clustering works based on distances between data items
defined by some distance function. In the case of multidimensional data, the “curse of
dimensionality” (Section 4.2.1) makes the division into clusters quite arbitrary and
unstable with respect to slight changes of parameters. Unlike clustering, topic modelling techniques
are designed to deal with high-dimensional and sparse data, being based on term co-
occurrences within data items (documents) rather than distances between the data
items.
Topic modelling produces two kinds of outputs: first, the topics as such and, second,
the profiles of the input data items in terms of the topics. Each of these outputs is
valuable for data analysis.
Topics are defined by weights, or probabilities, of the input data dimensions, that is,
keywords in text analysis or, more generally, attributes with non-negative numeric
values in other applications. A topic is represented by the combination of keywords
or attributes having high weights. This means that there are many data items
in which these keywords co-occur or these attributes simultaneously have high
values. This, in turn, means that these keywords or these attributes are semantically
related. Hence, one possible purpose of using topic modelling is revealing semantic
relatedness of data dimensions.
As we explained earlier, topic modelling is a specific sort of dimensionality
reduction. It transforms highly multidimensional data into profiles consisting of topic
weights. Since there are far fewer topics than original data dimensions, the
dimensionality is significantly reduced. A nice property of the transformed data is
their interpretability: knowing the meanings of the topics, it is not difficult to
understand the profile of each item. Like the results of any dimensionality reduction method,
results of topic modelling can be used for clustering or data embedding. The use of
2D data embedding is the most common approach to visualisation of results of topic
modelling. The data items (particularly, documents) are represented by points in the
embedding space arranged according to the similarities of their profiles in terms of
the topics.
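As a minimal sketch of such an embedding, one can project the document-topic profiles with PCA (computed via SVD); in practice, non-linear methods such as t-SNE or UMAP are often preferred, but the idea of placing items with similar profiles close together is the same.

```python
import numpy as np

def project_2d(profiles):
    """PCA via SVD: project topic profiles onto their two main
    directions of variation."""
    X = np.asarray(profiles, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

profiles = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.2, 0.0],
                     [0.1, 0.1, 0.8],
                     [0.0, 0.2, 0.8]])
xy = project_2d(profiles)
# documents dominated by the same topic land close together in the plot
print(xy.round(2))
```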
Fig. 4.29: A matrix display supporting interpretation of topics derived from se-
quences of actions. The rows correspond to the topics and the columns to the actions.
The colour hues correspond to different categories of the actions. The saturation of
the colours in the matrix cells encodes the weights of the actions for the topics.
Source: [40].
The use of topic modelling in data analysis requires understanding of the meanings
of the topics. In case of topics derived from texts, the meanings can be adequately
represented by combinations of a few keywords having high probabilities, such as
“environment, air, pollution” and “space, shuttle, launch”. It is usually not difficult
for a human analyst to guess the meanings of topics represented in this way. The same
applies to other cases where the meanings of the data dimensions are well known to
the analyst. Examples are types of products sold or names of educational modules.
However, this approach may not work well when the meanings of the dimensions
are not immediately clear and/or more than two or three dimensions can have high
weights in the topic definitions. In such cases, interpretation of the topics requires
visualisation support. Since topic definitions are multidimensional numeric data, the
visualisation methods designed for this kind of data can be used for this purpose.
These methods are presented in Chapter 6. An example of a visual display support-
ing topic interpretation is shown in Fig. 4.29.
4.7 Conclusion
There are lots of techniques for computational processing and analysis of data of
different types. This chapter discusses only a few classes of such techniques that
are most often used in combination with interactive visualisation in visual analytics
activities, as represented schematically in Fig. 1.16. All computational techniques
produce some derived data, which need to be interpreted by the analyst and used
in the following analysis and reasoning. To enable perception and understanding by
the human, the derived data need to be represented visually. Obviously, the choice
of computational techniques depends on the type and characteristics of the original
data, and the choice of the visualisation techniques for representing the outputs of
the computational processing depends on the type and characteristics of the derived
data.
In the following chapters, we shall discuss and exemplify applications of computa-
tional techniques (together with visualisations) to different types of data. Apart from
the common classes of techniques discussed in this chapter, there will be examples
of applying more specific techniques suitable for particular data types.
Part II
Visual Analytics along the Data Science
Workflow
In Chapter 2 (Section 2.1), we argued that different stages of a typical data science
workflow require understanding of three different subjects: the data under analysis,
the real-world phenomenon reflected in the data, and the model that is being built.
Visual analytics approaches, which are designed to support human reasoning and en-
able understanding of various kinds of subjects, can thus be very helpful at all stages
of the data science workflow. In the following chapters, we shall consider how vi-
sual analytics can help in exploring, understanding, and preparing data (Chapter 5),
using data to understand phenomena that involve different kinds of components and
relationships (Chapters 6 to 12), and using the acquired understanding of the data
and phenomena in building mathematical or computer models (Chapter 13).
Chapter 5
Visual Analytics for Investigating and Processing
Data
Abstract In this chapter, we discuss how visual analytics techniques can support
you in investigating and understanding the properties of your data and in conduct-
ing common data processing tasks. We consider several examples of possible prob-
lems in data and how they may manifest in visual representations, discuss where
and why data quality issues can appear, and introduce a number of elementary data
processing operations and computational techniques that, in combination with
visualisation, help to understand data characteristics and detect abnormalities.
In a perfectly organised world, all data that land on the desk of a data analyst are
collected and verified carefully, documented thoroughly, and cleaned from any oc-
casional problems. The reality often differs from this description, unfortunately. We
have seen many data sets that were collected by different people and organisations
using different equipment, methods, and protocols. The data are often represented in
different formats using different notations, making their fusion a challenging task.
Erroneous and missing values are very common in any data.
In this chapter, we shall write about data sets in general, irrespective of their
specifics and representation. The following chapters will address different types of
data in detail. So, we shall talk here about a general dataset consisting of multi-
ple data items. Data items are composed of fields, which may contain values of
attributes, references to entities, places, or times, or to items in another dataset. All
data items in a dataset have homogeneous structure, i.e., consist of the same number
of fields having the same meaning and containing the same kind of information. The
fields usually have names. The contents of the fields are called the values of these
fields. Some fields in a dataset may be empty. The absence of a value in a field may
have different meanings: either no value exists, or some value exists in principle
but could not be determined. Knowing the meaning of the field can help understand
what the absence of a value means. When this is not clear and not described in
metadata, it is necessary to obtain additional information about the dataset, e.g., by
contacting the data collector.
In many data sets, dummy values, like 999 or −1, have special meanings, such
as “missing value”, “anything else” (e.g., when values in a field are categories or
classes), or “error”. Sometimes, different dummy values have the same meaning in
a single data set, if it was prepared by different people or at different times. For
example, both “n/a” and “-” may mean “not applicable”. It may also happen that
the same dummy value has different meanings. A series of data records may lack
consistency in measurement units (e.g. metres or kilometres), formatting (decimal
dot or decimal comma), and representation of dates and times. If a data set has been
collected over a long period of time, consistency may be lacking due to changes in
equipment, policies, daily routines, or personal habits and preferences. Data items
representing times may be inconsistent due to wrong time zones (e.g. if a tourist did
not set a correct time zone in the photo camera after moving to a different continent)
or ignoring switches to/from daylight saving time. Textual data components may be
misspelled, contain abbreviations and jargon, texts in different languages, etc. The
same meanings may be expressed using synonyms, thus complicating processing
and analysis.
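A first step towards spotting such problems is to tally candidate dummy values in each field; the list of candidates below is an assumption and should be adapted to the dataset at hand.

```python
from collections import Counter

DUMMIES = {"999", "-1", "n/a", "-", ""}  # candidate dummy values (assumed list)

def suspicious_values(column):
    """Tally occurrences of candidate dummy values in a field, so that
    their intended meanings can be checked against the metadata."""
    return Counter(str(v).strip().lower() for v in column
                   if str(v).strip().lower() in DUMMIES)

column = ["12.5", "999", "13.1", "n/a", "999", "-", "14.0"]
print(suspicious_values(column))
```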
During processing, field values may lose their precision. It may be a result of insuf-
ficiently careful transformation of the data format (e.g., from a spreadsheet to a text
file) or an attempt to decrease the size of a data file. Figure 5.1 shows the impact
of rounding geographic coordinates from 5 to 2 decimals. The dots are rendered
with 70% transparency to enable assessment of the densities. By comparing the
two maps, one can see that, as a consequence of the displacement of the points, some
real patterns disappear while artefacts, or fake patterns, emerge.
Fig. 5.1: The same set of 3,083 points (Twitter messages posted in London) is dis-
played with precision of 5 and 2 decimals (left and right).
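The magnitude of such displacements is easy to estimate. The sketch below uses a simple spherical approximation (about 111.32 km per degree of latitude); at London's latitude, rounding to 2 decimals can move a point by roughly 650 metres.

```python
import math

def max_rounding_error_m(lat_deg, decimals=2):
    """Worst-case displacement (metres) from rounding lat/lon coordinates,
    using a simple spherical approximation."""
    deg_lat_m = 111_320.0                    # metres per degree of latitude
    deg_lon_m = deg_lat_m * math.cos(math.radians(lat_deg))
    half_step = 0.5 * 10 ** (-decimals)      # rounding moves a value at most half a step
    return math.hypot(half_step * deg_lat_m, half_step * deg_lon_m)

# rounding London coordinates to 2 decimals: roughly 650 m displacement
print(round(max_rounding_error_m(51.5, 2)))
```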
Another possible reason for precision loss is striving to protect sensitive data, for
example, to hide exact locations visited by individuals and preclude in this way
identification of the individuals and their activities. Location data may also be in-
accurate due to the data collection procedures employed. For example, if points
describing locations of mobile phones originate from a mobile phone network, their
coordinates may represent the positions of the cell tower antennas rather than actual
positions of the phones.
Missing data, i.e., absence of valid values in fields, may be occasional (e.g. no value
for a specific age category, no measurement in a single location) or systematic (for
example, no observation during weekends due to absent staff). Sometimes, missing
values may indicate absence of something (e.g., absence of crime incidents in a
given place), which can be appropriately represented by the value 0.
In contrast to missing values, data sets sometimes contain duplicates. For example,
if it is expected to have a single data record per minute, two or more records with
the same temporal reference are duplicates. Duplicates may have the same field
contents, but they may also differ. In such cases, it is necessary to determine which
of the records has higher validity.
The notes and examples from this section indicate the need to investigate and under-
stand data properties before starting any analysis. As explained in Section 2.1 and
schematically illustrated in Fig. 2.1, investigation of data properties and preparation
of the data for analysis is an essential stage of the data science process.
We refer the interested reader to the paper [62], which introduces a general taxonomy
of problems in data. Although it specifically focuses on temporal data, a great part
of the material is also valid for other kinds of data. A recent paper [91] considers
possible data problems for the most commonly used data types, such as graphs,
images and videos, trajectories of moving objects, etc. In this book, we give specific
recommendations for different types of data in the further chapters. The following
section considers data quality problems at a more general level and provides generic
recommendations for dealing with outliers and missing data irrespective of data
types.
There are a number of visual and computational tools that help us investigate
new datasets that we encounter. We have discussed these tools in Chapters 2
and 3. In this chapter, we revisit some of them in the context of conducting
an exploration into data properties.
The first thing you will need to do when starting a data science project is to, as
most data scientists would say, “play with the data”. Through this interaction with
the data, you get familiarised with them and get some understanding of how the
fields represent the phenomenon you want to study. You need to question critically
whether the data provide an effective and suitable representation of the phenomenon
and determine if you need to get additional data from other sources or perform
some data transformations. You also investigate whether the data have sufficient
quality and, if not, what needs to be done to bring them to a state that is suitable for
any further analysis. Data investigation, assessment, and preparation for analysis
(often referred to as data wrangling1) is a phase of utmost value and, like it or not,
one of the most time-consuming, labour-intensive parts of the data science process.
One of the interviewees from Kandel et al.’s study that looked into data analysis
practices at enterprises [77] expresses this view very nicely: "I spend more than half
of the (my) time integrating, cleansing and transforming data without doing any
actual analysis. Most of the time I’m lucky if I get to do any analysis..."
An important point to make clear here is that your emphasis in this phase of the anal-
ysis process is on the data itself, i.e., the data is the subject of the analysis (Fig. 2.1
and Section 2.1). Of course, you always need to keep the underlying phenomena in
your mind, as certain decisions are only viable when you put them in the context of
what the data refer to. For instance, a suspiciously high value within rainfall
measurements might mean a malfunctioning device but could also be the result of a
rare weather event that is of utmost interest.
It is typical to begin the investigation of the data properties by considering the
descriptive statistics2. Unlike inferential statistics, which is designed for inferring
properties of a population by testing hypotheses and computing estimates3, descriptive
statistics measures are calculated based on the given sample of data and reflect
properties of this particular set. The most common measures are those that estimate
the central tendency, i.e., where the central, average value of observations falls, and
dispersion, i.e., how much the values vary. Common statistics for central tendency
are the mean – the sum of all the values divided by the number of data items; the
median – the middle value that separates the half with higher values from the half
with lower values in the dataset; and the mode – the most frequent value within the
records, especially useful when the data are categorical. The dispersion is estimated using the
standard deviation, the interquartile range (IQR), which is the difference between
the values of the upper and lower quartiles, or simply the difference between the
maximum and minimum.
1 https://en.wikipedia.org/wiki/Data_wrangling
2 https://en.wikipedia.org/wiki/Descriptive_statistics
3 https://en.wikipedia.org/wiki/Statistical_inference
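All of these measures can be obtained with a few lines of code, for example with Python's standard statistics module (note that conventions for computing quartiles vary between libraries):

```python
import statistics

values = [12, 15, 15, 18, 22, 25, 31, 31, 31, 90]  # note the extreme value 90

mean = statistics.mean(values)       # central tendency: mean
median = statistics.median(values)   # robust to the extreme value
mode = statistics.mode(values)       # most frequent value
spread = statistics.stdev(values)    # dispersion: sample standard deviation
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                        # interquartile range (IQR)
value_range = max(values) - min(values)
print(mean, median, mode, iqr, value_range)  # → 29 23.5 31 16.0 78
```

Note how the single extreme value pulls the mean well above the median, which is a first hint that the distribution deserves closer inspection.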
Introduced in John Tukey’s seminal book on exploratory data analysis [138],
the box plot technique (Sections 3.2.1 and 3.2.4, Figures 3.2 and 3.3) visually dis-
plays most of the descriptive statistics. However, as noted in Section 3.2.7, both the
box plot and the descriptive statistics it portrays may be insufficient and even in-
appropriate for characterising a set of numeric values when the distribution is not
unimodal. While the statistics and the box plot provide a compact summary, it is
necessary to see the data in more detail. Thus, we have started Chapter 3 with an
example of the “Anscombe’s Quartet”: a series of small data sets with very different
patterns, but exactly the same statistical indicators. Another example of this kind is
the “datasaurus dozen” (Fig. 5.2).
Fig. 5.2: The datasaurus dozen [97] – a good reminder of why you always need to
inspect your data visually before performing any further analysis. In this set of scat-
terplots representing pairwise relations with the same x-mean, y-mean, x-standard
deviation, y-standard deviation and correlation, each visualisation tells us a
completely different story.
Fig. 5.3: Box plots combined with dot plots portray value distributions of 11 numeric
attributes; attribute-8 stands out and requires further investigation.
Box plots combined with dots are suitable only if the amount of data is rather small.
For large data volumes, it is more appropriate to investigate distributions of attribute
values using frequency histograms (see Section 2.3.3). Different distribution prop-
erties exhibit themselves in particular histogram shapes and features. Various his-
togram shapes and possible interpretations of their meanings are discussed exten-
sively in statistical literature. Many examples of possible meanings of the shapes
are explained in the book [130]; we reproduce and summarise them in Table 5.1.
5.2.2 Outliers
Outliers4 are data items that differ significantly from others. In Section 2.3.2, we
said that outliers do not join patterns formed by groups of other items because they
4 https://en.wikipedia.org/wiki/Outlier
Table 5.1: Possible shapes of frequency histograms and interpretations of their meanings (summarised from [130])
Shape Interpretation
Normal. A common pattern is the bell–shaped curve known as the “nor-
mal distribution”. In a normal distribution, points are as likely to occur
on one side of the average as on the other. Be aware, however, that other
distributions look similar to the normal distribution. Statistical calcula-
tions must be used to prove a normal distribution.
Don’t let the name “normal” confuse you. The outputs of many pro-
cesses—perhaps even a majority of them—do not form normal distri-
butions, but that does not mean anything is wrong with those processes.
For example, many processes have a natural limit on one side and will
produce skewed distributions. This is normal — meaning typical — for
those processes, even if the distribution isn’t called “normal”!
Skewed. The skewed distribution is asymmetrical because a natural
limit prevents outcomes on one side. The distribution’s peak is off cen-
ter toward the limit and a tail stretches away from it. For example, a dis-
tribution of analyses of a very pure product would be skewed, because
the product cannot be more than 100 percent pure. Other examples of
natural limits are holes that cannot be smaller than the diameter of the
drill bit or call-handling times that cannot be less than zero. These
distributions are called right- or left-skewed according to the direction of
the tail.
Double-peaked or bimodal. The bimodal distribution looks like the
back of a two-humped camel. The outcomes of two processes with dif-
ferent distributions are combined in one set of data. For example, a
distribution of production data from a two-shift operation might be bi-
modal, if each shift produces a different distribution of results. Stratifi-
cation often reveals this problem.
Plateau. The plateau might be called a “multimodal distribution”. Sev-
eral processes with normal distributions are combined. Because there
are many peaks close together, the top of the distribution resembles a
plateau.
Edge peak. The edge peak distribution looks like the normal distribu-
tion except that it has a large peak at one tail. Usually this is caused by
faulty construction of the histogram, with data lumped together into a
group labeled “greater than...”
Comb. In a comb distribution, the bars are alternately tall and short.
This distribution often results from rounded-off data and/or an incor-
rectly constructed histogram. For example, temperature data rounded
off to the nearest 0.2 degree would show a comb shape if the bar width
for the histogram were 0.1 degree.
Truncated or heart-cut. The truncated distribution looks like a normal
distribution with the tails cut off. The supplier might be producing a
normal distribution of material and then relying on inspection to sep-
arate what is within specification limits from what is out of spec. The
resulting shipments to the customer from inside the specifications are
the heart cut.
Dog food. The dog food distribution is missing something—results near
the average. If a customer receives this kind of distribution, someone
else is receiving a heart cut, and the customer is left with the “dog
food,” the odds and ends left over after the master’s meal. Even though
what the customer receives is within specifications, the product falls
into two clusters: one near the upper specification limit and one near
the lower specification limit. This variation often causes problems in
the customer’s process.
are dissimilar to everything else. Particularly, outliers among numeric values are
manifested in histograms as isolated bars located far away from others and in dot
plots and scatterplots as isolated dots distant from the bulk of the dots. As noted in
Section 2.3.2, it is necessary to understand whether outliers are errors in the data that
need to be removed or whether they represent exceptional but real cases that require
focused investigation.
For numeric attributes, statisticians have developed a number of computational tech-
niques that, assuming a given distribution of the values, indicate if some values are
outliers. The type of the value distribution may be understood from the shape of a
frequency histogram (see Table 5.1). For example, if a distribution is normal5 (ex-
hibited by a histogram having a bell shape, as in the top row of Table 5.1), the values
can be checked for outlierness by computing their distances to the quartiles of the
value set. If Q1 and Q3 denote respectively the lower (first) and upper (third) quar-
tiles and IQR = Q3 − Q1 is the interquartile range, then an outlier can be defined
as any observation outside the range [Q1 − k · IQR, Q3 + k · IQR] for some con-
stant k > 1. According to John Tukey, who proposed this test, k = 1.5 indicates an
“outlier”, and k = 3 indicates that a value is “far out”.
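Tukey's test is straightforward to implement; the sketch below uses the quartile convention of Python's statistics module, which may give slightly different fences than other quartile definitions:

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; with k = 1.5 Tukey calls
    them "outliers", with k = 3 "far out"."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 24]
print(tukey_outliers(data))        # 24 is an outlier ...
print(tukey_outliers(data, k=3))   # ... but not "far out"
```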
A data item consisting of values of multiple attributes can be an outlier although the
value of each individual attribute may not be an outlier with respect to the other val-
ues of this attribute. Such multidimensional outliers can be detected using a suitable
distance function (Section 4.2) taking care of data normalisation and standardisation
when necessary (Section 4.2.9). The outlierness can be judged based on the distri-
bution of the distances of the data items to the average or central item, such as the
centroid or medoid of the dataset.
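A minimal sketch of this idea, assuming Euclidean distances to the centroid and Tukey's fences applied to the resulting distance distribution:

```python
import numpy as np

def centroid_outliers(X, k=1.5):
    """Multidimensional outliers: apply Tukey's fences to the distribution
    of each item's Euclidean distance to the centroid. Assumes the
    attributes are already normalised to comparable scales."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    q1, q3 = np.percentile(d, [25, 75])
    return np.where(d > q3 + k * (q3 - q1))[0]

# the value 5 is not an outlier for either attribute on its own,
# but the combination (5, 5) lies far from the bulk of the items
X = [[2, 3], [3, 2], [3, 4], [4, 3], [3, 3],
     [2, 2], [4, 4], [2, 4], [4, 2], [5, 5]]
print(centroid_outliers(X))  # → [9]
```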
After outliers are identified, they need to be examined by a human analyst for mak-
ing an informed decision whether to treat them as mistakes or as correct data signi-
fying interesting properties of the phenomenon reflected in the data. Suspicious data
may originate from procedures of data collection (e.g. by selection of an inappropri-
ate sample of a population), from hardware or software tools used for data collec-
tion (incorrect measurements, wrong units etc.), or from mistakes inevitably made
by humans in the course of manual recording. The ways to understand whether data
are plausible or suspicious are usually domain- and application-specific and require
expert knowledge. It is important that a knowledgeable person looks at different
data components from different perspectives and in different combinations, using
a variety of potentially useful visual representations and corresponding interactive
controls from those described in Chapter 3.
Outliers regarded as errors need to be removed from the data or corrected, if such
a possibility exists. It should be remembered that removal or correction of outliers
may change the descriptive statistics substantially; therefore, they need to be re-
calculated. Moreover, visual representations of the value distributions, such as fre-
quency histograms and scatterplots, may also change. It can happen that some data
5 https://en.wikipedia.org/wiki/Normal_distribution
items that were earlier not identified as outliers, either statistically or visually, will
now stand far apart from the others. Hence, the procedure of outlier detection needs
to be repeated each time after some outliers have been detected and removed or
corrected.
All considerations so far referred to global outliers that differ from all other items
in a data set. In a distribution of some values or value combinations over a base with
distances (see Section 2.3.1), such as space or time, a local outlier is an element
of the overlay (i.e., a value or combination) that is substantially dissimilar from the
overlay elements in its neighbourhood. An example is shown in Fig. 5.4, where
the value in one of the London wards is much higher than everywhere else in the
surrounding area. A map representation of values in comparison to a reference value
(e.g. the average) enables identification of local outliers.
The concept of neighbourhood can be defined not only for distribution bases with
inherently existing distances between the elements (positions) but also in any set of
items for which a suitable distance function can be specified (see Section 4.2). The
neighbourhood of an item positioned in a base, in terms of the specified distance
function, is defined as all positions in the base whose distances from the position of
the given item are below a certain chosen threshold. The distance function can be
used for data spatialisation by means of embedding (Section 4.4). The distribution
of the data items in the embedding space can be represented by marks positioned
in a projection plot, similarly to the circle marks on the map in Fig. 5.4. Local
outliers will be manifested by marks dissimilar from the neighbouring marks in the
embedding space.
When a data set is large, it may be difficult to detect local outliers in a distribution by
purely visual inspection. To support the detection computationally, the differences
of the data items – elements of the distribution overlay – from their neighbours in
the base need to be expressed numerically. Again, a suitable distance function is
needed for this purpose, that is, the function must be applicable to the elements
of the distribution overlay. It may be the same function as is used for the base of
the distribution, if the elements of the base and the overlay are of the same kind;
otherwise, a different distance function is required. Using the distance function,
the dissimilarities of all overlay elements to their neighbours are computed and the
frequency distribution of these dissimilarities is examined. High-valued outliers of
this frequency distribution potentially correspond to local outliers of the original
distribution. The corresponding data items and their neighbourhoods in the original
distribution need to be specially inspected.
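For an ordered base such as time, this procedure can be sketched as follows; the neighbourhood radius and the use of the absolute difference from the neighbourhood mean as the dissimilarity score are illustrative choices:

```python
import statistics

def local_outliers(values, radius=2, k=1.5):
    """Local outliers over an ordered base (e.g. time): score each element
    by its dissimilarity to its neighbours (absolute difference from the
    neighbourhood mean), then apply Tukey's fences to the distribution
    of these dissimilarity scores."""
    n = len(values)
    scores = []
    for i, v in enumerate(values):
        nbrs = [values[j]
                for j in range(max(0, i - radius), min(n, i + radius + 1))
                if j != i]
        scores.append(abs(v - statistics.mean(nbrs)))
    q1, _, q3 = statistics.quantiles(scores, n=4)
    hi = q3 + k * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > hi]

# 18 is an ordinary value globally (it occurs again later) but a clear
# local outlier at position 4 of this steadily growing series
series = [1, 2, 3, 4, 18, 6, 7, 8, 9, 10,
          11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print(local_outliers(series))  # → [4]
```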
Apart from global and local outliers, there may also be unexpected values, i.e., val-
ues that usually do not occur in a given place and/or time and/or combination of
conditions. For example, a zero count of mobile phone calls in a business area on
a weekday afternoon appears suspicious and needs to be investigated. It may hap-
pen due to a problem with the cell network infrastructure, or a public holiday, or an
emergency situation, or be just an error in the data. Detection and inspection of such
cases requires involvement of domain knowledge.
Missing data is another important aspect of data quality that always requires a thor-
ough investigation. Unfortunately, many visual representations, by their very design,
cannot indicate that some data are missing; therefore, an analyst has no chance to
detect the problem. Imagine a scatterplot displaying 1,000 data items with attributes
a1 and a2. If 20 records have missing values of a1, 10 records miss values of a2,
and 5 of them miss both values, only 1000 − 25 = 975 dots will be plotted. When
you see this plot, you cannot guess that 25 dots are missing. Like this scatterplot,
other common visualisations cannot help you to detect that some data are missing,
because they show the data that are present but do not show what is absent.
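Counting the missing values explicitly is therefore a necessary complement to such plots; a sketch with a toy record structure:

```python
records = [
    {"a1": 3.2, "a2": 1.1},
    {"a1": None, "a2": 2.4},
    {"a1": 5.0, "a2": None},
    {"a1": None, "a2": None},
    {"a1": 2.7, "a2": 0.9},
]

miss_a1 = sum(r["a1"] is None for r in records)
miss_a2 = sum(r["a2"] is None for r in records)
miss_both = sum(r["a1"] is None and r["a2"] is None for r in records)
complete = sum(r["a1"] is not None and r["a2"] is not None for r in records)

# a scatterplot of a1 vs a2 would silently show only the complete records
print(miss_a1, miss_a2, miss_both, complete)  # → 2 2 1 2
```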
Apart from the existence of data items that miss values in some fields, it may happen
that entire data items (records) that are supposed to be present in a dataset are absent.
Missing records can occur in data that are collected and recorded in some regular
way, for example, at equally spaced time moments or in nodes of a spatial grid.
When some recordings have been skipped, e.g., due to a failure of the data recording
equipment, the dataset will miss respective data items.
It is usually not a big problem for analysis when a few data items miss some field
values or a few records are missing in a dataset. A problematic situation arises when
a large portion of the data misses values in the fields that are important for the
analysis and understanding of the subject, or when a large fraction of the supposed-
to-be-there items are absent. In such cases, it is necessary to evaluate whether the
remaining data are acceptable for the analysis. For the data to be acceptable, the
following conditions must be fulfilled:
• The amount of the data must be sufficient for the analysis and modelling, i.e.,
contain a sufficient number of instances for making valid inferences and conclu-
sions.
• The coverage of the data must be sufficient for representing the analysis subject.
This requirement may concern the area in space for which the data are available,
the period in time, the age range of the individuals described by the data, and the
like.
• The missing data must be randomly distributed over the supposed full dataset, so
that you can treat the missing data as a randomly chosen sample from the whole
data. Otherwise, the available data will be biased and thus not suitable for a valid
analysis.
It is easy to count how many data records miss values in analysis-relevant fields and,
hence, how many records remain for the analysis. This can be done, in particular,
using functions of a database. It is also not difficult to obtain the ranges of field
values, or value lists for categorical fields, for assessing the coverage. The spatial
coverage can be examined using a map showing the bounding box of the available
data items. A more accurate judgement can be supported by dividing the bounding
box by a grid and representing the counts of the data items in the grid cells.
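A sketch of such grid-based counting, assuming point coordinates and a square grid cell size chosen by the analyst:

```python
from collections import Counter

def grid_counts(points, cell=1.0):
    """Counts of data items per cell of a regular grid laid over the
    bounding box, for assessing spatial coverage."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

points = [(0.2, 0.3), (0.7, 0.9), (1.4, 0.1), (3.8, 3.9)]
counts = grid_counts(points)
print(counts)  # cells without any points reveal gaps in the coverage
```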
It may seem quite tricky to check whether the missing data constitute a random
sample of the supposed whole dataset, which does not exist in reality. However, the
requirement of the missing data to be random is equivalent to the requirement that
the distributions of the missing and available data over analysis-relevant bases are
similar, i.e., contain the same patterns. This, in turn, means that the proportion of the
missing data is uniform over the base, i.e., is nearly constant in any part of the base.
Thus, when data include references to time moments or intervals, the proportions of
missing data should be approximately the same in different time intervals. The same
requirement applies to different locations or areas in space when data are space-
referenced, to different groups of entities when data refer to entities, and to different
intervals of numeric attribute values when relationships of this attribute to other data
components need to be studied.
Hence, investigation of the data suitability is done by comparing distributions of the
available and missing data, or by investigating the distributions of the proportions
of the missing data over different bases. This is where visualisation is of great help,
because, as we explained in Chapter 2, the key value of visualisation in data analysis
arises from its ability to display various kinds of distributions. The general approach
is to partition the base of a distribution into subsets, such as intervals, compart-
ments, groups of entities, etc., and obtain counts of the available data items for each
partition, as well as for combinations of partitions from different bases.
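A minimal sketch of this approach with pandas; the column names and toy values below are assumptions for illustration:

```python
import pandas as pd

# Toy records with a time reference and a value that may be missing
df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "value": [1.0, None, 2.0, 3.0, None, None],
})

# Proportion of missing values in each partition (here: day) of the base.
# Similar proportions across partitions are compatible with randomly
# missing data; strongly differing proportions indicate a biased pattern.
missing_share = df["value"].isna().groupby(df["day"]).mean()
```

In this toy table the proportions range from 0 to 1, a clearly non-uniform pattern that would warn against treating the gaps as random.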
Let us consider an example of a dataset describing mobile phone calls in Ivory
Coast provided by the telecommunications company Orange for the “Data for De-
velopment” challenge [35]. The dataset includes records about mobile phone calls
made during 20 weeks (140 days) in 2011-2012. Each record includes the date and
time of the call and the position of the caller specified as a reference to one of 1231
cell phone antennas in the country. It is supposed that the dataset contains data about
all calls performed within the given period.
In this dataset, the absence of records about phone calls can be detected by consid-
ering the counts of the recorded calls by the network cells and time intervals. We
therefore aggregate the data into daily counts by the cells. Each cell receives a time
series of the count values. Zero or very low values in this time series may indicate
that all or a large part of data records for this cell and this day are missing, because
our common sense tells us that people make phone calls every day. However, it is
daunting to look at the individual time series of each of the 1231 cells. We shall
instead look at the overall temporal patterns made by all time series.
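The aggregation into daily counts per cell can be sketched with pandas; the toy records below stand in for the actual call data:

```python
import pandas as pd

# Toy call records: antenna cell identifier and call timestamp
calls = pd.DataFrame({
    "cell": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2011-12-05 08:00", "2011-12-05 09:30", "2011-12-06 10:00",
        "2011-12-05 11:00", "2011-12-07 12:00",
    ]),
})

# One time series of daily call counts per cell; days without any
# record for a cell become zeros, i.e., candidates for missing data
daily = (calls
         .groupby(["cell", calls["time"].dt.date])
         .size()
         .unstack(fill_value=0))
```

Each row of `daily` is the time series of one cell, ready to be drawn as a line in a time graph.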
In the upper time graph in Fig. 5.5, the individual time series are represented by thin
grey lines; the thick black line connects the mean daily values. There were several
days with unusually high counts of calls in many locations, especially on January 1,
2012. These peaks correspond to major holidays and are therefore plausible. At the
right edge of the time graph, we see three drops of the mean line, which correspond
to 3 days in April, specifically, 10th, 15th, and 19th. On these days, the call counts
were everywhere substantially lower than usual, which is an indicator of missing
data. When we look at the mean line more attentively, we notice that similar drops
occurred also at other times. The time graph in the middle of Fig. 5.5 shows the
individual time series of a few randomly selected cells. It can be seen that there
were multiple days with no records. Moreover, there were periods of missing data
extending over multiple days or even weeks.
To investigate the temporal distribution of the missing records more thoroughly,
we generate a segmented time histogram, as shown at the bottom of Fig. 5.5. Each
bar corresponds to one day, and the full bar length represents all 1231 cells. The
bar segments represent the numbers of the cells whose daily counts of the phone
calls fall into the class intervals 0-5, 5-10, 10-20, 20-50, 50-100, 100-200, 200-500,
Fig. 5.5: The time graph at the top represents the time series of the daily call counts
for 1231 cells. The time graph in the middle portrays the time series for 4 selected
cells. The time histogram at the bottom shows the proportions of the cells with the
daily call counts falling into given class intervals. Missing data are manifested as
zero values of the call counts.
500-1000, and 1000 and more. The interval 0-5 is represented by dark blue, and light
blue corresponds to the interval 5-10. The segments of these colours mean that no or
very few calls were reflected in the data. As it is unlikely that nobody made phone
calls in the corresponding cells for entire days, most probably, the records about
the phone calls that were actually made are missing. The time histogram shows us
multiple irregular occurrences of days when very many cells had no or very few
records, and we also see an increasing trend in the number of the cells with no
records starting from the beginning of the period until the last three weeks.
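The assignment of the daily counts to the class intervals can be done computationally, for instance with pandas; the toy counts below and the subset of intervals are assumptions for illustration:

```python
import pandas as pd

# Daily call counts of a few cells on one day (toy values)
counts = pd.Series([0, 3, 7, 25, 180, 950])

# Class intervals as in the segmented histogram; -1 makes the first
# interval include zero counts
bins = [-1, 5, 10, 20, 50, 100, 200, 500, 1000]
labels = ["0-5", "5-10", "10-20", "20-50", "50-100",
          "100-200", "200-500", "500-1000"]
classes = pd.cut(counts, bins=bins, labels=labels)

# Number of cells per class: the segment lengths of one histogram bar
segment_sizes = classes.value_counts().sort_index()
```

Computing `segment_sizes` for every day yields exactly the data behind the segmented time histogram.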
The temporal patterns we have observed signify that the lacunae in the recordings
of the calls are not randomly scattered over the dataset. The height of the blue-
coloured part of the segmented time histogram (Fig. 5.5, bottom) is not uniform,
which means that the temporal distributions of the numbers of the cells with and
without data differ. Hence, it can be concluded that the dataset is not suitable for
revealing and studying long-term patterns of the calling behaviour on the overall
territory.
Still, some parts of the dataset may be suitable for smaller-scale analyses focusing
on particular areas and/or time intervals. To find such parts, it is necessary to investi-
gate the spatio-temporal distribution of the gaps in the recordings. This can be done
using a chart map as shown in Fig. 5.6. The map contains 2D time charts positioned
at the centres of the phone network cells. The columns of the charts correspond to
the 7 days of the week and the rows to the sequence of the weeks. The daily counts
of the recorded phone calls are represented by colours. The dark blue colour means
absence of recordings.
Obviously, the map with 1231 diagrams on it suffers from overplotting and occlu-
sions. While it is easy to notice many dark blue rectangles signifying absence of data
in the respective places, it is necessary to apply zooming and panning operations to
see more details for smaller areas. In Fig. 5.7, there is an enlarged map fragment
showing a business area of one of the major cities in the country, Abidjan. The
majority of the phone cells in this area lack recorded data for the initial 17 weeks, and
only the last 3 weeks of data are present. Some other cells have data from the
beginning of the 20-week period, but there are 2- or 3-week-long gaps in the recordings
before the last 3-week interval. Hence, in this part of the territory, the data can only
be suitable for analysing the phone calling activities of the people during the last 3
weeks. The consideration of the whole time period of 20 weeks is only possible for
the relatively small area in the southwest of the country (Fig. 5.6), where days with
missing recordings seem to occur sporadically and infrequently.
This example shows that missing data may occur systematically rather than occa-
sionally. The presence of non-random patterns in the distribution of missing data
signifies that the available dataset in its full extent is not acceptable for analysis, but
it may contain parts suitable for analysis tasks with smaller scope. Such parts, if they
exist, can be found by investigating the distributions of the missing and available
data. Our last example demonstrated visual exploration of the spatial and spatio-
temporal distributions. Many examples of approaches to investigation of other kinds
Fig. 5.6: A chart map representing the spatio-temporal distribution of the gaps in the
data recording. The 2D time charts positioned at the centres of the phone network
cells show the local temporal distributions of the phone call counts. The counts are
represented by colours; dark blue means absence of recordings.
of distributions occur throughout the book, and many of these approaches can be
used for investigating various distributions of missing data.
When cases of missing values or records occur occasionally and rarely, it is some-
times possible to estimate the likely values based on some assumed model of the
data distribution. For example, assuming a smooth continuous distribution of at-
tribute values over a base with distances, such as time or space, estimates for miss-
ing values can be derived (imputed) from known values in the neighbourhood of the
missing values. Imputation of missing values in a numeric time series can also be
done based on a periodic pattern observed in the distribution of the available val-
ues [61]. Assuming a certain interrelation between two or more numeric attributes,
Fig. 5.7: An enlarged fragment of the map from Fig. 5.6 showing a part of the
country’s largest city Abidjan.
it may be possible to impute a missing value of one of them from known values
of the other attributes. Imputation of missing values usually requires creation of a
mathematical or computer model representing the distribution or interrelation. The
model is built using the available data.
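For instance, assuming a smooth continuous variation over time, missing values in a regularly sampled numeric series can be imputed by linear interpolation between the known neighbours; a minimal pandas sketch with toy values:

```python
import numpy as np
import pandas as pd

# A regular numeric time series with gaps (NaN marks missing values)
series = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# Linear interpolation assumes a smooth continuous variation over time;
# each missing value is imputed from its known neighbours
imputed = series.interpolate(method="linear")
```

More elaborate models (e.g., exploiting periodic patterns) follow the same scheme: fit a model to the available values, then read the imputed values off the model.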
The ways in which data are collected and processed may have a large impact on the
patterns emerging in data distributions. Let us consider the example demonstrated in
Fig. 5.8. The time histogram portrays data that are supposed to reflect the spread of
an epidemic in some country. The bars in the histogram represent the daily numbers
of the registered disease cases. We want to analyse how the epidemic was evolving
and expect that the data tell us how many people got sick each day. However, the
shape of the histogram indicates that the data possibly do not conform to our expec-
tations. The strangest feature is the drop to a negative number on the last day. Quite
suspicious are also two very high peaks and a day with no registered cases.
To understand what these features mean, we need to investigate how the data were
collected. We find relevant information and learn that the data come from two groups
of clinics: state clinics, which report the disease cases every day, and private clinics,
Fig. 5.8: The abrupt peaks and drops in a time histogram showing the number of
reported cases of an epidemic disease emerge due to specifics of the data collection
and integration.
which report their data from time to time. The two high peaks in the daily counts
emerged due to the arrival of two batches of data from the private clinics. The records
from these batches were not assigned to the days in which the cases were registered
but to the days in which the data arrived. The gap with the zero daily count near the
right edge of the histogram corresponds to delayed reporting rather than absence of
disease cases. The bar next to the gap is approximately twice as long as the bar
before the gap and the following two bars because it represents the registered cases
from two days. The negative count at the end reflects a correction of the data: some
of the earlier registered disease cases turned out to be wrongly classified, i.e., the
people did not have this particular disease.
Hence, the temporal distribution of the disease cases is distorted in the data by arte-
facts caused by data collection specifics. These data thus cannot be immediately
used for the analysis. If we cannot get better data, we need to make corrections in
these data based on some assumptions. One assumption we can make is that the num-
bers of the disease cases identified in the private clinics are distributed over time in
the same way as the disease cases identified in the state clinics. Under this assump-
tion, the data records from the two batches that came from the private clinics can be
spread over the previous days proportionally to the counts of the cases in the state
clinics. Another assumption is that the fraction of the wrongly classified cases is
approximately the same on each day. Based on this assumption, we can compute the
ratio of the absolute value of the negative count to the total number of disease cases
and decrease each daily count by the resulting fraction. The gap with zero records
can be filled by dividing the next day's count between the two days. After all these
corrections, we can hope that the data give an approximate account of how the epidemic
was developing.
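With toy numbers, the two corrections can be sketched as follows; the counts, the batch size, and the correction value are invented for illustration:

```python
# Toy daily case counts reported by the state clinics
state = [10, 20, 30, 40]

# A batch of 50 cases from the private clinics arrived on the last day
# but was actually registered over all four days: spread it over the
# days proportionally to the state-clinic counts
batch = 50
total = sum(state)
spread = [batch * c / total for c in state]
combined = [s + x for s, x in zip(state, spread)]

# A correction of -15 cases: assume the fraction of wrongly classified
# cases is the same every day and scale all counts down uniformly
correction = 15
fraction = correction / sum(combined)
adjusted = [c * (1 - fraction) for c in combined]
```

After both steps, the total of `adjusted` equals the combined count minus the correction, while the daily proportions of the state-clinic data are preserved.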
Generally, it happens quite often that problems in data are caused by insufficiently
careful integration of data coming from different sources. Thus, when data come
from spatially distributed collecting units (such as clinics, radars, or other kinds
of sensors) that observe certain areas around them, an integrated dataset may
have spatial gaps due to missing data from some of the units, as demonstrated in
Fig. 5.9.
Fig. 5.9: Spatial coverage gaps in a dataset obtained by integration of records from
multiple radars.
When the “areas of responsibility” of different collecting units overlap, it may hap-
pen that the same information is reflected two or more times in the data. It is not
always easy to understand that two data items are meant to represent the same thing,
because these items are not necessarily identical. There may be discrepancies due to
measurement errors or differences in the times when the measurements were taken
or recorded. Specific problems arise when the data refer to some entities represented
by identifiers. Sometimes, different collecting units may use the same identifiers for
referring to different entities. It is also possible that different identifiers are used
for referring to the same entity. Detecting such problems usually requires domain
knowledge: you need to know the nature of the things or phenomena reflected in the
data in order to understand what good data should look like and be able to uncover
abnormalities.
Let us consider the example shown in Fig. 5.10. Here, spatial positions of two mov-
ing objects (namely, two cars) were recorded independently by two positioning de-
vices. When the data from multiple positioning devices are put together in a single
Fig. 5.10: When two simultaneously moving objects are denoted in data by the same
identifier, connecting the positions in the temporal order produces a false trajectory
with a zigzagged shape and unrealistically distant moves between consecutive time
steps. The two images show the appearance of such a “trajectory” on a map (left)
and in a space-time cube (right).
dataset, the outputs of different devices are distinguished by attaching unique iden-
tifiers to their records. It somehow happened that identical identifiers were attached
to the records coming from these two devices. Trajectories of moving objects are
reconstructed by arranging records with coinciding identifiers in the chronological
order. When this procedure was applied to the records having the same identifiers
but originating from distinct devices, the positions of the different moving objects
were mixed, which produced a false trajectory with an unrealistic zigzagged shape.
We can easily recognise the abnormality of this shape because we have background
knowledge of how cars can move.
When you have a very large dataset with positions of many moving objects, it may
be hard to detect such abnormalities from a visual display showing all trajectories.
Fortunately, we also have other background knowledge about car movement; par-
ticularly, we know what speed values can be reached by a car. We can do a simple
computation of the speed in each position by dividing the spatial distance to the next
position by the length of the time interval between the positions. In a false trajec-
tory composed of positions of two or more cars, temporally consecutive points are
distant in space but close in time; therefore, the computed speed values will be unre-
alistically high. High values of the computed speed can also result from occasional
erroneously measured positions. To distinguish trajectories with occasional errors
from trajectories composed of positions of multiple objects, we can inspect the dis-
tribution of the counts of the positions with too high speed values per trajectory.
After data properties are investigated and understood, some kind of data process-
ing may be required to make the data suitable for further analysis. In this section,
we give a brief overview of the major data processing operations. Not all of them
can benefit from using visualisations; however, there are operations in which a hu-
man analyst needs to make informed decisions, and visual displays are required for
conveying relevant information to the human. We shall consider such operations in
more detail.
Over the last decade, a number of commercial and free tools for data cleaning
have been developed. Most of them apply a combination of visual and computational
techniques for identifying data problems and fixing them. Representative examples
of such tools are Trifacta data wrangler6 and OpenRefine (formerly Google Re-
fine)7 . These tools focus mostly on tabular data, supporting such operations as data
cleaning in columns containing numeric values (detecting and replacing dummy
values, ensuring formatting consistency, detecting outliers), dates and times (fixing
formatting issues, adjusting time zones, changing temporal resolution), and texts
(fixing misspelled words, abbreviations, etc.). Table columns are usually consid-
ered independently. Histograms are commonly used for assessing distributions of
6 https://www.trifacta.com/
7 https://openrefine.org/
numeric attributes. Interactive query tools are used for inspecting outliers and fixing
errors.
In discussing the problem of missing data (Section 5.2.3), we mentioned that it may
be possible to impute missing values based on some assumed model of the data
distribution. Data modelling may be helpful not only for reconstruction of missing
data but also for detection and correction of errors in data and for adaptation of the
data sampling to the requirements of the following analysis.
As an example, let us consider map matching8 of GPS-tracked mobility data. GPS
positioning results may contain errors and/or temporal gaps between recorded po-
sitions (i.e., missing data), especially when the visibility of satellites is restricted.
For supporting transportation applications, it is often necessary to snap GPS coor-
dinates to streets or roads. A straightforward geometry-based approach with finding
the nearest street segment for each point in the data may not work well. For example,
an erroneous position may be located closer to a street lying away from the movement
path than to the street where a vehicle was actually moving. A more appropriate
but computationally costly approach is to compute a plausible route through the
street network that matches the majority of the tracked positions, and then snap the
positions to the segments of this route.
This transformation provides several benefits. First, erroneously recorded positions
lying off the streets or on wrong streets are replaced by plausible corrected posi-
tions. Second, after linking the resulting positions by lines, it becomes possible to
determine the likely intermediate positions between the recorded positions. This al-
lows one to increase the temporal and spatial resolution of the data (i.e., generate
more data records with smaller temporal and spatial distances between them) and/or
to create a dataset with regular temporal sampling (i.e., a constant time interval be-
tween consecutive records).
This example demonstrates the following general idea: from a set of discrete data
records, derive a mathematical or computer model representing the variation or dis-
tribution of the data as a continuous function, and use this model for (a) filling gaps
in the data, i.e., reconstructing missing values, (b) replacing outlying values (possi-
ble errors) by values given by the model, and (c) creating a dataset with regular
spacing (distances) between data items within some distribution base with distances,
such as time or space.
8 https://en.wikipedia.org/wiki/Map_matching
After cleaning data from errors and outliers and/or reconstructing missing values,
field values may still need to be improved for facilitating analysis. If data items
include values of multiple attributes, it may be useful to bring them to comparable
ranges. For this purpose, normalisation and scaling techniques can be applied, see
Section 4.2.9.
Another potentially useful operation is binning, or discretisation, which transforms
a numeric attribute with a continuous value domain into an attribute with ordered
categorical values. We have mentioned the concept of discretisation in Chapter 3
(particularly, in Section 3.3.3) in relation to defining classes of values of a nu-
meric attribute for visual encoding in choropleth maps (Fig. 3.9) and 2D time charts
(Fig. 3.12). There exists a wide range of approaches to discretisation. The most pop-
ular are equal intervals (having the same widths of the bins but a varying number of
data items in each bin) and equal quantiles, or equal size (having the same amount
of data items in each bin, but varying bin widths). Discretisation can be applied
either to raw data or to results of normalisation. For example, some analysis tasks
may benefit from equal interval binning with the interval length (i.e., the bin width)
equal to the attribute’s standard deviation.
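The two discretisation approaches can be contrasted with pandas; the toy values and the number of bins are assumptions for illustration:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal intervals: same bin width, varying number of items per bin
equal_width = pd.cut(values, bins=3)
width_counts = equal_width.value_counts().sort_index().tolist()

# Equal quantiles: same number of items per bin, varying bin width
equal_size = pd.qcut(values, q=3)
size_counts = equal_size.value_counts().sort_index().tolist()
```

With the single outlying value 100, equal intervals crowd almost all items into one bin, whereas equal quantiles balance the bins at the cost of very unequal widths; this is exactly the trade-off an analyst should inspect visually.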
It should always be kept in mind that the way in which data are discretised affects
the distribution of the resulting values. Figures 2.5 and 3.13 demonstrate how the
choice of the bin widths in a frequency histogram can affect the visible patterns
of a frequency distribution. Figure 3.12 demonstrates how the pattern of a temporal
distribution can depend on the definition of attribute value classes. Discretisation can
also affect spatial distribution patterns, particularly, patterns visible on a choropleth
map.
When data consist of multiple components, it may be useful or necessary for anal-
ysis to create additional components by synthesising the existing ones. A simple
example is assessing the overall magnitudes of numeric time series based on the
maximal, average, or total value. This creates a new data component, namely, a nu-
meric attribute, such that each time series is characterised by a single value of this
attribute derived from the multiple values originally present in this time series. Even
in this simple example, it can be useful to involve human reasoning supported by
visual analytics approaches, so that only relevant components are taken into account
in the synthesis process. Thus, an analyst may want to assess time series of traffic
counts according to the values attained in rush hours of working days. This requires
careful selection of days that were working (excluding holidays, special events etc.)
and hours of intensive traffic (taking into account seasonal changes, daylight sav-
ing time, particular traffic patterns on Fridays, and other possible sources of varia-
tion).
Another example is determining the dominant attribute (i.e., the one with the high-
est value) among multiple attributes. Obviously, the original attributes need to be
comparable either by their nature (e.g. proportions of different ethnic groups in the
total population of geographic areas) or as a result of data standardisation, such as
transformation to z-scores, see Section 4.2.9. Even a simple operation of finding
the attribute with the highest value in real-world data may require a human expert
to make informed decisions: What to do when all values are almost equal? What
amount of difference between the highest value and the next one is sufficient? How
to handle cases when all attributes have very small values? To make such deci-
sions and obtain meaningful results of the synthesis, the analyst needs to analyse
the distributions of the attribute values and their differences and to see the effects of
possible decisions on the outcomes. This activity requires the use of visual displays
and interaction operations [8].
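The decisions listed above can be encoded as explicit thresholds, which an analyst would tune while inspecting the value distributions; a sketch with assumed threshold values:

```python
def dominant(values, names, min_lead=0.05, min_value=0.1):
    """Name of the dominant attribute, or None when no clear dominance
    exists. min_lead (required gap between the two highest values) and
    min_value (required magnitude of the highest value) are assumed
    thresholds that an analyst would tune interactively."""
    ranked = sorted(zip(values, names), reverse=True)
    (top, top_name), (second, _) = ranked[0], ranked[1]
    if top < min_value or top - second < min_lead:
        return None
    return top_name

groups = ["A", "B", "C"]
d1 = dominant([0.6, 0.3, 0.1], groups)     # clear dominance
d2 = dominant([0.34, 0.33, 0.33], groups)  # nearly equal values
d3 = dominant([0.05, 0.03, 0.02], groups)  # all values too small
```

Visual inspection of how many items fall into the "None" category for different threshold settings is precisely where interactive displays help.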
A need for synthesising multiple attributes can arise in the context of multi-criteria
decision analysis [95, 75], where decision options are described by a number of at-
tributes used as criteria. Criteria whose low values are preferred to higher values
are called cost criteria, and those whose high values are preferred are called benefit
criteria. There may also be criteria whose average values are desirable. More gener-
ally, one can define a certain utility function assigning degrees of utility to different
values of a criterion. By means of this function, the original attributes describing the
options are transformed into a group of comparable attributes whose values are the
utility degrees. These derived attributes are then used for comparing the options and
choosing the best of them.
It happens very rarely that there exists a decision option with the highest possible
utility degrees for all criteria. Usually, each option has its advantages and disadvan-
tages, and it is necessary to select an option whose advantages outweigh
the disadvantages. A typical approach is to combine the values of the transformed
criteria into a single utility value and then select one or a few options with the high-
est combined values. The criteria may differ in their importance; these differences
are represented by assigning numeric weights to the criteria. Having the criteria-
wise utility degrees and the weights of the criteria, the overall utility score of each
option can be calculated as a weighted linear combination of the individual utility
degrees for the criteria. There are also more sophisticated methods for combining
multiple utility degrees, but all of them use numeric weights, which are supposed to
be assigned to the criteria by a human expert.
A problem lurks here: it is usually hard for a human expert to express the relative
importance of criteria in exact numbers. Should the weight of this criterion be 0.3
or 0.33? It may be tricky to decide, yet a slight change of the criteria weights
may have a significant impact on the scores and ranks of the options. To make a
reasonable final choice, the analyst needs to be aware of the consequences of small
changes of the criteria weights. Such awareness can be gained by performing sensi-
tivity analysis of the option ranks with respect to the criteria weights. Like any analysis
process where human reasoning and judgement are critical, it should be supported by
visual analytics techniques, as, for example, proposed in paper [16]. The analyst can
interactively change any of the weights and immediately observe the changes of the
option scores and ranks, which are shown in a visual display. There is also a more
automated and thus more scalable way of analysis. The analyst specifies the vari-
ation ranges of the criteria weights and the steps by which to change the weights. In
response, a computational tool calculates the option scores and ranks for all possible
combinations of the weights taken from the specified ranges. The summary statistics
of the resulting scores and ranks (minimum, maximum, mean, median, quartiles) are
then represented visually, so that the analyst can find one or more options having ro-
bustly high scores. Instead of the summary statistics, frequency histograms could be
used. Their advantage is a more detailed representation of the distributions. Thus, a
bimodal distribution (which would be very strange to encounter in this context) can
be recognised from a histogram but not from the summary statistics.
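The automated variant can be sketched as follows; the utility degrees, the weight range, and the step are invented for illustration:

```python
# Criteria-wise utility degrees of three options for two criteria
options = {"A": [0.9, 0.2], "B": [0.6, 0.6], "C": [0.1, 0.9]}

# Vary the weight of the first criterion over a range with a fixed
# step; the second weight is the complement, so the weights sum to 1
ranks_per_option = {name: [] for name in options}
for w1 in [0.3, 0.4, 0.5, 0.6, 0.7]:
    scores = {name: w1 * u[0] + (1 - w1) * u[1]
              for name, u in options.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    for rank, name in enumerate(ordered, start=1):
        ranks_per_option[name].append(rank)

# An option with a robustly high score keeps a good (low) rank for
# every tested weight combination
worst_rank = {name: max(r) for name, r in ranks_per_option.items()}
```

In this toy setting, options A and C reach the best rank for some weights but drop to the worst rank for others, while B never falls below the second rank: B is the robust choice that the summary statistics (or histograms) of the ranks would reveal.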
Synthesis of multiple attributes into integrated scores or ranks is quite a widespread
procedure, which is applied in evaluating students, universities, countries, living
conditions, and many other things. In all cases, weighting is applied for expressing
the relative importance of different attributes. It is often useful to investigate how
the final scores depend on these weights.
To summarise, synthesis of multiple data components often requires making certain
decisions and analysing the consequences of different possible choices. This re-
quires involvement of human reasoning and judgement; therefore, the use of visual
analytics techniques is highly appropriate. Their primary role is to allow the analyst
to make changes in the settings and inputs of the synthesis method and observe the
effects of these changes on the synthesis results. In complex cases, when there are
too many possible settings to be tested, the analysis of the sensitivity of the results to
the settings is supported by combining automated calculations of a large number of
possible outcomes with visualisation of the distributions of these outcomes.
Data required for analysis may be originally available in several parts contained in
different sources. These parts need to be integrated. For example, one database table
describes salaries of individuals over time, and another table lists employees per
department. If a goal of analysis is to study the dynamics of salaries per department,
it is necessary to integrate these two tables. Standard database operations12,13 can be
12 https://en.wikipedia.org/wiki/Join_(SQL)
13 https://en.wikipedia.org/wiki/Merge_(SQL)
applied for this purpose. Before linking the data, it is necessary to check whether
there is a correspondence between the identities of the individuals specified in the
two tables.
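Such a correspondence check can be performed, for example, with a pandas outer join; the toy tables below are assumptions for illustration:

```python
import pandas as pd

salaries = pd.DataFrame({"employee": ["ann", "bob", "eve"],
                         "salary": [3000, 3200, 4100]})
departments = pd.DataFrame({"employee": ["ann", "bob", "jim"],
                            "department": ["R&D", "Sales", "R&D"]})

# An outer join with indicator=True shows which identities match:
# rows marked left_only or right_only reveal correspondence problems
# to be resolved before the tables are actually integrated
check = salaries.merge(departments, on="employee",
                       how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "employee"].tolist()
```

Here "eve" appears only in the salary table and "jim" only in the department table; such mismatches need to be investigated before the integrated dataset is used.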
There may be more complex cases when data integration is not limited to a simple
key-based join. For example, sophisticated spatial queries need to be used for
combining population data with land use information in order to assess more accu-
rately the distribution of the population density. The integration enables exclusion
of unpopulated areas (e.g. water bodies, agricultural lands etc.) from the areas for
which population counts are specified. This approach is called dasymetric mapping.
Data integration includes not only operations on pre-existing database tables or data
files but may involve the use of external services for obtaining additional data. Data
wrangling software (e.g. OpenRefine) can automate the process of obtaining geo-
graphic coordinates based on addresses by means of online services for geo-coding.
Similarly, named entity recognition services can be used for extracting names of
places, persons, organisations, etc. from texts. Language detection tools may help
to determine the languages of given texts. Sentiment analysis can be applied for
assessing emotions in texts.
Open linked data technologies can support data integration technically. All these
tools extend available data by deriving additional attributes for the existing data
items. However, such new attributes need to be checked for possible problems sim-
ilarly to the original data.
Fig. 5.12: A two-level aggregation procedure applied to primary votes for two can-
didates can give the victory to a candidate supported by fewer voters than the other
candidate.
There may be many possible reasons for data reduction and selection:
• some available data are not relevant;
• some available data are outdated;
• data are too numerous to be handled by the analysis tools;
• data are excessively detailed;
• data are excessively precise, while a lower precision is sufficient for analysis.
Data filtering is the tool for removing irrelevant or outdated data from the dataset
to be used in the analysis. As described in Section 3.3.3, filtering can remove data
items with unsuitable attribute values, or data referring to times or spatial locations
beyond the period or area of interest, or aged data that may not be valid anymore.
Filtering can also be used for selecting manageable portions of data in order to test
whether the planned approach to the analysis will work as expected, or to do the
analysis portion-wise. In this case, it is necessary to take care that the selection is
reasonable according to the domain specifics as well as common sense. For exam-
ple, when data portions are selected based on time references, it may not be very
meaningful simply to take continuous time intervals containing equal numbers of
records. It may be more appropriate to take time intervals including the same num-
ber of full days or weeks, or to do the selection based on the positions in the time
cycles (e.g., analyse separately mornings and evenings). When the data represent
some activities, such as journeys, or events that happen during some time, such as
public events, it is reasonable to ensure that the selection does not contain unfin-
ished activities or events. Sensible decisions on how to select or divide data need to
be taken by a human expert, who needs to be supported by visualisations showing
the consequences of different possible decisions.
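The time-based selections discussed above can be sketched with pandas; the hourly event records below are hypothetical, as are the chosen morning and evening hours:

```python
import pandas as pd

# Hypothetical event records: hourly values spanning two and a half days.
ts = pd.date_range("2024-03-01 06:00", periods=60, freq="h")
events = pd.DataFrame({"time": ts, "value": range(60)})

# Portion-wise selection by full calendar days rather than by equal
# record counts: keep only days for which all 24 hours are present.
events["day"] = events["time"].dt.normalize()
full_days = events.groupby("day").filter(lambda g: len(g) == 24)

# Selection based on the position in the daily cycle,
# e.g., analysing mornings and evenings separately.
mornings = events[events["time"].dt.hour.between(6, 11)]
evenings = events[events["time"].dt.hour.between(18, 23)]
```

Here only the complete middle day survives the full-day filter; the first and last days are truncated by the recording interval and would bias any per-day statistics.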
5.4 Concluding remarks
In this chapter, we argued that data properties need to be studied and understood
with the help of visual displays showing distributions of data items with respect to
different components: value domains of attributes, time line and time cycles, space,
and sets of entities. Human analysts use these displays to check if the distributions
appear consistent with their background knowledge and, if inconsistencies are noticed, understand their reasons. We demonstrated how visual displays can be helpful
in detecting different problems in data quality. Visualisation of distributions can re-
veal outliers, unrealistic items and combinations of items, gaps in coverage of a
distribution base, biases, and wrong references to entities, time moments, or spatial
locations. Such problems may sometimes make the data unsuitable for a valid anal-
ysis, but even when they are not that severe, they need to be taken into account in
performing the analysis.
Data suitability for the intended analysis depends not only on the quality but also on other features, such as regularity, resolution, comparability of field values, structure, and presence of required kinds of information. To become suitable for the analysis, data may need to undergo preparatory operations, some of which may require involvement of human knowledge, interpretation, and reasoned decision making. In all
such cases, visual displays are necessary to inform the humans, support their cogni-
tive activities, and evaluate and compare the outcomes of possible alternative deci-
sions.
In the following chapters, we shall mainly focus on the use of visualisations for
studying and understanding various kinds of real-world phenomena, but we shall
also discuss the structure and properties of data reflecting these kinds of phenomena
and mention possible specific problems in such data.
• Does a box plot convey sufficient information about properties of a set of numeric
values?
• What features in histogram shapes require special attention in the context of ex-
ploring data quality?
• What is a local outlier in a numeric time series and how can it be detected?
• When some data are missing in a dataset, what needs to be checked for judging
whether the available data are suitable for a valid analysis? How can this be
checked?
• What data transformations can alter properties of data distributions? How to
avoid such alterations?
Chapter 6
Visual Analytics for Understanding Multiple
Attributes
Abstract One very common challenge that every data scientist has to deal with is to make sense of data sets with many attributes, where “many” can sometimes be tens, sometimes hundreds, and even thousands. Whether your goal is to do exploratory analysis on the relationships between the attributes, or to build models of the underlying phenomena, working with many dimensions is not trivial. The high number of attributes is a barrier against using some of the standard visual representations: just try to imagine a scatterplot matrix where you want to look at the pairwise distributions of 100 variables. Moreover, any computational method that you
apply produces results that are challenging to interpret. Even linear regression, one
of the easiest models to understand, becomes quite complex if you need to investi-
gate the interactions between hundreds of variables. This chapter will discuss how
not to get lost in these high-dimensional spaces and how visual analytics techniques
can help you navigate your way through.
You might think that you know your city well and you have a good idea which
neighbourhoods are similar and which are different. Let’s assume that you live in
London, in a neighbourhood called Hackney, which is full of bars and coffee shops
and trendy looking youngsters who are likely early in their careers or still studying.
You could probably tell how Hackney is different to Westminster, which is full of
governmental buildings and businesses and people who live there are pretty wealthy.
Here, you are distinguishing these two boroughs by the dominant kinds of businesses and by the age and average (likely) income of their residents. These are relevant observations, but these aspects are only a small portion of the different characteristics
one can think of. What if I ask you to think of 72 different characteristics of the
London boroughs all at the same time, including the “number of cars per household
1 https://data.london.gov.uk/dataset/london-borough-profiles
2 http://data.london.gov.uk/dataset/subjective-personal-well-being-borough
6.1 Motivating example
Fig. 6.1: Sammon mapping applied to the 33 London boroughs using data that de-
scribe the boroughs in terms of 69 numerical features.
the lowest happiness score, while the differences among the remaining boroughs are not very high. Kensington and Chelsea, which lies opposite to the City of London, has the highest happiness score, and Hackney, lying on the other side of the cluster, has the second lowest score after the City of London, though not very different from the scores of the remaining boroughs. Generally, we can see that the relative arrangement of the
boroughs in the projection plot is consistent with the values of the happiness score.
This may mean that either this attribute has high impact on the result or that it is
correlated with the other “emotional” attributes.
This example only scratches the surface of what we could be doing in such an analy-
sis. It would be appropriate to investigate whether the attributes are interrelated and
whether some of them are non-informative or more informative. We have seen that,
depending on which aspects you consider (e.g., our manual selection of “emotional”
attributes), you get very different observations. Think of all possible combinations
that can be taken from the 69 attributes – the full space of possibilities is practically
intractable. The question is which combinations are meaningful and worth consider-
ing. This section now looks at visual analytics techniques to help us work with such
high-dimensional data sets and enable us to answer a wider variety of questions, and
eventually reach deeper insights.
Fig. 6.2: Sammon mapping applied to the 33 London boroughs using only 4 at-
tributes that relate to the emotional state of the residents.
Fig. 6.3: The same result as in Figure 6.2 with the colour indicating the “Happiness
Score” values, darker green meaning higher reported happiness.
The data sets that we deal with here are those in which each data record has several attributes. Depending on the context, the number of attributes can nowadays easily reach hundreds and even thousands. Most of the methods available to data scientists are not geared for such high numbers of attributes; e.g., even simple linear regression models become hard to interpret once you try to build them with
6.2 Specifics of multivariate data
The typical high-level analytical goals in high-dimensional data analysis include:
• Investigating relationships: As discussed above, there often are inherent rela-
tionships between the attributes. For instance, consider the variables we used in
the example in Fig. 6.1. It is not unusual to expect some degree of dependency between the survey responses for the “happiness” and “life satisfaction” scores. One
fundamental task therefore is to discover and study the relationships between the
attributes, which is important for understanding and modelling the phenomenon
these attributes describe.
• Feature selection: One natural follow-up task to the above is to filter the at-
tributes for removing redundancies and selecting what is important. The selection
requires ranking the features according to different criteria.
• Formulating and modelling multi-variate relationships: It may be necessary
not only to discover and understand the existing relationships between the at-
tributes but also to formally represent and model them. This can be done by
building mathematical or computer models, such as linear regression or support
vector machines. Such a model can eventually serve an analyst as a predictor,
classifier, or an explainer.
• Studying data distribution in the multidimensional space: Basically, the goal
is to understand the variation of the multi-attribute characteristics (i.e., combina-
tions of attribute values) over a set of items described by these attributes. This
goal can be re-phrased in the following way: imagining a multidimensional space
of all possible value combinations, the goal is to understand which parts of the
space are more populated than others and which are empty, whether there are
groupings of close items in this space and isolated items distant from all oth-
ers (outliers), whether there exists a concentration of items in some part of the
space and a decreasing density trend outwards from this part. Basically, analysts
may be interested in finding and studying the same kinds of distribution patterns
as in a usual two-dimensional space; see Section 2.3.3 and examples of spatial
distribution patterns in Fig. 2.7.
Here, we organise the visual analysis techniques according to the high-level capa-
bilities they provide within the course of an analysis process.
6.4 Visual Analytics Techniques
Small multiples of attribute profiles: When you have data with multiple attributes,
it often makes sense to begin with considering the value frequency distribution of
each of them. This can be done by creating multiple frequency histograms, as in the
example shown in Fig. 2.11. That example demonstrates that it may be possible not
only to examine the value distribution of each individual attribute but also, by means
of interaction, explore the interrelations between the attributes. The interaction tech-
niques needed for supporting this exploration are selection and highlighting of data
subsets, as described in Section 3.3.4. The idea is that differently coloured segments
of the histogram bars show the value distribution in the selected data subset. Another example of the use of this approach can be found in the paper by Krause et al. [85].
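The numbers behind such linked histograms can be sketched as follows; the data and the selection condition are invented, and shared bin edges ensure that the subset segments align with the bars for the whole data set:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 items described by 3 attributes.
data = rng.normal(size=(200, 3))
# An interactive selection, e.g., brushing high values of attribute 0.
selected = data[:, 0] > 1.0

for j in range(3):
    # Shared bin edges so that subset segments align with the full bars.
    edges = np.histogram_bin_edges(data[:, j], bins=10)
    whole, _ = np.histogram(data[:, j], bins=edges)
    subset, _ = np.histogram(data[selected, j], bins=edges)
    # In a display, `subset` would be drawn as a differently coloured
    # segment inside each bar of `whole`.
    assert (subset <= whole).all()
```

Comparing the `subset` counts against the `whole` counts per attribute is exactly what the coloured bar segments convey visually.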
Attribute Spaces: One can easily imagine that even using the most space-efficient
visual representation for each dimension would not scale (in terms of the screen
space or perceptually) once we start working with very high numbers of dimensions. A promising approach is the creation of so-called attribute spaces [140]. The main idea is that the value distributions of the attributes are characterised by a number of numeric features, such as skewness, kurtosis, and other descriptive statistics. This creates, in addition to the data space, where the data items are represented by points positioned according to their attribute values, an attribute space, where the attributes are represented by points positioned according to their distributional features. For this attribute space, one can use the same visualisation techniques as
for the data space. For example, one can create a scatterplot where two attribute fea-
tures form the axes and the attributes are represented by dots positioned according
to the values of these features.
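A minimal sketch of constructing such an attribute space with pandas, on synthetic data; the attribute names and the skewness threshold are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical data space: 100 items described by 6 attributes.
data = pd.DataFrame(rng.normal(size=(100, 6)),
                    columns=[f"attr_{j}" for j in range(6)])
data["attr_5"] = rng.exponential(size=100)   # one clearly skewed attribute

# Attribute space: one point per attribute, positioned by its
# distributional features (here skewness and kurtosis).
attr_space = pd.DataFrame({
    "skewness": data.skew(),
    "kurtosis": data.kurt(),
})

# Select attributes with high skewness, as one would by brushing
# the corresponding region of the attribute-space scatterplot.
highly_skewed = attr_space[attr_space["skewness"] > 1.0].index.tolist()
```

The `attr_space` table has one row per attribute, so it can be plotted and brushed with exactly the same scatterplot machinery as the data space.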
In the example in Fig. 6.4a, we see a visualisation of two dimensions of the attribute
space (left, yellow background) with more than 300 attributes. The data we are
looking at here is from a longitudinal study of cognitive ageing that involved 82
participants with 373 numeric attributes derived from the imaging of the brain using
various modalities and from the results of cognitive capability tests [140]. The axes
of the scatterplot correspond to the skewness and kurtosis. For instance, the attributes
that exhibit high values in both skewness and kurtosis are in the top right quadrant.
The scatterplot gives you a mechanism to select a subset of attributes with particular
values of these distributional features and visualise the data with regard to these
attributes or perform further analysis on this selected subset.
Another key application using attribute spaces is to interactively investigate the dif-
ferences within subsets of the data records across all the attributes. To do this, one
can select a subset of records, calculate all the descriptors for the subset, for in-
stance looking at the standard deviation of the data within the selected sample, and
then compare the re-computed metrics to the metrics computed using all the avail-
able data records. Once the difference information is visualised, it reveals in what
ways the attribute value distributions differ when a subset is considered. Figure 6.4c
Fig. 6.4: Dual analysis framework, where visualisations of data items and attributes
have blue and yellow backgrounds, respectively. a) A scatterplot of the attributes re-
garding their skewness and kurtosis values. b) A scatterplot of the data items (which
describe people) regarding the attributes ‘Age’ and ‘Education’. A group of older
people with lower educational levels is selected. c) A deviation plot shows how
the mean and standard deviation values of the different attributes change when the
selection in (b) is made. Source: [140]
demonstrates how the values of the mean µ̄ and the standard deviation σ̄ change when a subset of the participants who are older and have a lower level of education is selected, as shown in Fig. 6.4b. The plot in Fig. 6.4c can be called a deviation plot. Here, the centre refers to zero deviation in µ̄ and zero deviation in σ̄. We observe, for instance, certain subsets of attributes that have remarkably high values for the selected subset. These attributes (which are marked in Fig. 6.4c) are potential candidates for being discriminatory features for the older and less educated sub-cohort.
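The quantities underlying such a deviation plot can be computed as follows; the cohort data here are synthetic stand-ins for the attributes of the study, with invented names and thresholds:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical cohort: 82 people with a few attributes
# (stand-ins for the 373 attributes of the actual study).
data = pd.DataFrame({
    "age": rng.integers(55, 85, size=82),
    "education_years": rng.integers(8, 20, size=82),
    "test_score": rng.normal(100, 15, size=82),
})

# Selection in the spirit of Fig. 6.4b: older, less educated participants.
subset = data[(data["age"] > 70) & (data["education_years"] < 12)]

# Deviation-plot coordinates: the change of the mean and of the standard
# deviation per attribute when moving from the whole cohort to the subset.
deviations = pd.DataFrame({
    "d_mean": subset.mean() - data.mean(),
    "d_std": subset.std() - data.std(),
})
```

Each attribute becomes one point in the (d_mean, d_std) plane; attributes far from the origin are candidates for discriminating the selected sub-cohort.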
can be seen in Fig. 6.5. Examples of potentially useful ranking criteria are the correlation coefficient, the number of outliers according to statistical measures of outlierness, or the uniformity of the point distribution over the whole 2D space of
the plot. These characteristics of the joint 2D distributions can be combined with
the distributional characteristics of the individual attributes. An analyst can sort the
pairs according to some of the characteristics, choose those that are of potential in-
terest, and focus the analysis on those. Hence, ranking of attributes and attribute
pairs by various distributional features can help you to navigate the huge attribute
space more effectively.
Fig. 6.5: Rank-by-feature framework interface for ranking histograms and scatter-
plots. Source: [122].
Fig. 6.6: Various scagnostic measures to characterise spatial patterns of dot distri-
bution observed in 2D scatterplots [152].
Fig. 6.7: A correlation network constructed with the 69 numeric attributes charac-
terising the London boroughs that we have in the motivating example in Section 6.1.
Each node represents one of the attributes. The edges represent strong correlations
between attributes, i.e., where ρ > 0.75.
Section 6.1. We calculate the Pearson correlation coefficient ρ between all attribute
pairs, construct a network with 69 nodes corresponding to the 69 numeric attributes
and connecting edges between those nodes where there is a strong positive correlation between the attributes; specifically, we take ρ > 0.75 in this example. As a result, we get a network with disconnected sub-components, as seen in Fig. 6.7.
Each component represents a small group of attributes with strong interrelationships
among them.
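This procedure can be sketched with pandas and a small union-find, using synthetic attributes in place of the borough data; the attribute names and the group structure are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
# Hypothetical attributes: two strongly correlated groups plus one
# independent attribute.
base_a = rng.normal(size=n)
base_b = rng.normal(size=n)
data = pd.DataFrame({
    "happiness": base_a + 0.1 * rng.normal(size=n),
    "life_satisfaction": base_a + 0.1 * rng.normal(size=n),
    "income": base_b + 0.1 * rng.normal(size=n),
    "house_price": base_b + 0.1 * rng.normal(size=n),
    "rainfall": rng.normal(size=n),
})

corr = data.corr(method="pearson")
cols = list(corr.columns)

# Union-find over the attributes; an edge connects a pair with rho > 0.75.
parent = {c: c for c in cols}
def find(c):
    while parent[c] != c:
        parent[c] = parent[parent[c]]   # path halving
        c = parent[c]
    return c

for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.75:
            parent[find(a)] = find(b)

# Each connected component is a group of strongly interrelated attributes.
components = {}
for c in cols:
    components.setdefault(find(c), []).append(c)
groups = sorted(sorted(g) for g in components.values())
print(groups)
```

A graph library could equally be used for the component search; the union-find keeps the sketch dependency-free.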
Employing machine learning methods: The machine learning literature offers sev-
eral effective tools to work with high-dimensional data sets, from reducing the di-
mensionality (as considered in Section 4.4) to multivariate modelling. The survey by
Endert et al. [51] is a good source to read about the integration of machine learning
techniques into visual analytics approaches. Particularly, dimensions-transforming
data embedding techniques (discussed in Section 4.4.3) can be helpful for reveal-
ing relationships among attributes. The most widely adopted method is the Principal
Component Analysis (PCA), which represents variation in data in terms of a smaller
number of artificial attributes, called principal components, derived as weighted lin-
ear combinations of the original attributes. The original attributes that have high
weights in one and the same principal component can be considered as somehow
related. It is a bit like the clusters of attributes we saw in the correlation network in
the previous section.
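A minimal sketch of PCA via SVD on synthetic data, illustrating how attributes with high weights in the same principal component can be spotted; the data construction is an assumption made for the example:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
# Hypothetical data in which attributes 0 and 1 vary together,
# while attribute 2 is independent.
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.05 * rng.normal(size=n),
    shared + 0.05 * rng.normal(size=n),
    rng.normal(size=n),
])

# PCA via SVD of the centred and standardised data matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt   # rows: principal components; columns: original attributes

# Attributes with large absolute weights in the same component
# can be considered as somehow related.
first_pc = np.abs(loadings[0])
```

Here the first component is dominated by attributes 0 and 1, echoing the role of the connected components in the correlation network.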
A problem with using PCA or similar techniques for high-dimensional data is that
the number of principal components to consider may become too large. A possible
approach to deal with this problem is to use subspace clustering [86]. The main
idea of subspace clustering is to find subsets of data that form dense clusters in
spaces constructed from subsets of the whole set of dimensions. For example, a
dense cluster of data points can exist in a space of k particular attributes taken from
a larger set of n attributes, but these points may be scattered with respect to any
of the remaining n − k dimensions4 . In other words, there may be a subset of data
items that are similar in terms of particular attributes but not similar in terms of the
other attributes. The goal of the subspace clustering algorithms, which are developed
in machine learning, is to find subsets of attributes such that there are dense and
sufficiently big clusters of data items with similar combinations of values of these
attributes. The resulting data clusters may exist in distinct subspaces. The
groups of attributes forming these subspaces are locally related with regard to certain
subsets of data. These local structures can be explored to gain some understanding
of the data distribution and the relationships among the attributes. To generate a
visual representation of the subspaces and the respective data distributions, one can
apply PCA or another dimensionality reduction technique.
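The core idea, a group of items dense in some dimensions but scattered in the others, can be illustrated with synthetic data; this is only an illustration of the notion of a subspace cluster, not a subspace clustering algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical group of 50 items: dense in attributes 0 and 1,
# but scattered in the remaining 3 dimensions.
cluster = np.column_stack([
    rng.normal(5.0, 0.1, size=50),    # tight in attribute 0
    rng.normal(-2.0, 0.1, size=50),   # tight in attribute 1
    rng.uniform(-10, 10, size=50),    # scattered
    rng.uniform(-10, 10, size=50),
    rng.uniform(-10, 10, size=50),
])

# The per-dimension spread reveals the subspace in which the group
# is dense; subspace clustering algorithms search for such groups
# and their attribute subsets automatically.
spread = cluster.std(axis=0)
dense_dims = np.where(spread < 1.0)[0]
```

In full 5-dimensional distance terms these items look scattered, yet they form a tight cluster in the subspace of attributes 0 and 1.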
An example is shown in Fig. 6.8, where the display contains multiple projection
plots representing different subspaces. The coloured bars along the x- and y-axes
of the projection plots indicate the relevance of different dimensions for the cor-
responding principal components used to create the plots. The projection plots are
laid out in the display space according to the similarities between the subsets of the
attributes forming the represented subspaces. This can be achieved by means of a
data embedding method, such as MDS, which is applied in Fig. 6.8.
4 https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Subspace_clustering
6.5 Further Examples
The paper by Stahnke et al. [127] was mentioned multiple times in Section 4.4,
where we discussed data embedding. It provides good examples of how to visualise
and explore distortions of between-item distances in a projection (Figs. 4.6 and 4.7)
and how to explore and compare multi-attribute profiles of selected subsets of data
items (Fig. 4.9). In addition to the illustrations provided in Section 4.4, Figure 6.9
demonstrates interactive selection of a subset of dots in a projection plot and ob-
taining information about the subset of data they represent in the form of smoothed
frequency histograms corresponding to multiple attributes. The histograms show the
distributions of the attribute values in the subset in comparison to the distributions
in the whole set. Hence, it is possible to identify attributes that are distinctive for the
selected subset.
Fig. 6.9: The “Probing Projections” approach enables selection of a subset of data
items and viewing the distributions of the values of multiple attributes over the se-
lected subset in comparison to their distributions in the whole dataset. Source: [127]
Fig. 6.10: The distributions of the values of individual attributes over the projection
space are visualised in the form of heatmaps. Source: [127].
5 https://en.wikipedia.org/wiki/Support-vector_machine
Fig. 6.12: The axes of the parallel coordinates plot correspond to 5 distinct explainer
functions generated for the same target group of data items. The lines connecting
the axes represent different cities. The lines of the American cities are coloured in
blue. Source: [54].
consideration and selection of the most appropriate one. The candidate functions
are generated so that they are diverse in terms of involving distinct subsets of the
original attributes, and functions with integer weights of the attributes are preferred
for their simplicity.
Figure 6.12 demonstrates how an analyst can compare several candidate functions.
Each axis in the parallel coordinates plot represents one function. The attributes in-
volved in the functions and their weights are indicated below the axes. The data
items are represented by polygonal lines connecting the positions on the axes cor-
responding to the scores given by the functions. In this example, the axes represent
different variants of the “American-ness” function, some emphasising healthcare while others emphasise housing or other criteria. The lines in blue correspond to the American cities. The dark blue line corresponds to the city of San Jose, which is given the lowest scores by all functions except the last one. This city appears to be the least American among all American cities. All five functions fail to completely separate the American cities from the non-American ones. The advantage of the first function is that it gives the highest possible scores to American cities, whereas the others give their highest scores to non-American ones. This is the function whose value distribution is seen in Fig. 6.11.
An explainer can also be constructed with the purpose to separate a single item from
all other items, i.e., the function is expected to give the highest possible score to the
selected item and lower scores to all others. Such a function characterises what is
unique about the selected item compared to the others. For example, one can build
a function describing and explaining the uniqueness of Paris.
Fig. 6.13: Artificial data dimensions “Paris-ness” and “New York-ness” constructed
for capturing the distinctive characteristics of Paris and New York are used as axes
of a projection plot, in which all cities are positioned according to their scores in
terms of “Paris-ness” and “New York-ness”. Source: [54].
This is illustrated in Fig. 6.13, where the axes of the projection plot correspond to the explainers “Paris-ness” and “New York-ness”. The red dots represent Paris and New York, which
have the highest values of the respective functions. The positions of the other dots
show how similar or different to Paris and New York the respective cities are. Thus,
London is very New York-like, whereas Mumbai is quite similar to New York but
is opposite to Paris. Of course, the similarities and differences refer to specific at-
tributes that are involved in the definitions of the functions “Paris-ness” and “New
York-ness”.
The following takeaways can be drawn from this example:
• In reducing data dimensionality and creating data embeddings, an analyst may
wish to obtain artificial dimensions that capture distinctive features of particular
items or groups of items. Hence, the analyst needs to have control over the process of dimension construction, which is not possible when automatic methods are used.
• Human-controlled creation of dimensions with desired properties can be sup-
ported by employing machine learning methods that build classification functions
with numeric outputs.
• The purpose of using the classification functions is data description and explana-
tion rather than prediction. Hence, interpretability of the functions is crucial.
• For gaining better, more comprehensive understanding of the analysis subject, it
is beneficial to consider several distinct variants of classification functions that
involve different groups of attributes.
• The functions constructed under the analyst’s supervision can be used in further
analysis, particularly, as dimensions of a data embedding space.
• To select one of multiple possible functions for using in further analysis, the ana-
lyst seeks a suitable trade-off between the classification accuracy and complexity.
These examples do not cover all the ways in which visualisations and computations can complement each other, but they demonstrate a few general ideas that can be re-used in other analytical scenarios:
• Support interpreting computation results by visualising value distributions of
original and derived attributes.
• Perform multiple computations and use visualisations to support comparison of
the results.
• Direct the work of the computational methods through visually supported inter-
active selection of subsets of attributes and/or data items.
Use the publicly available dataset describing the London boroughs that is referred
to in Section 6.1, or take a similar dataset where you can define several groups
of semantically related attributes. What approach(es) would you try for comparing
the boroughs in terms of each group of attributes? What techniques can help you
to identify the unique and mainstream items with regard to each attribute group?
Experiment with applying data embedding and clustering to different subsets of the
attributes and different subsets of the data items.
Chapter 7
Visual Analytics for Understanding
Relationships between Entities
The motivating example of this chapter uses a data set with ratings of movies by users, that is, by people who viewed the movies. The data set is publicly available and known as MovieLens1. Each record of the data includes information about the user (identifier, sex, age, and occupation), about some movie (title, release year, and genre), and the rating given by the user to the movie. In the analysis we are going to discuss, it is not the ratings that are of interest but the users’ choices of which movies to view. The analyst wants to know how characteristics of the users (their sex, age, and occupation) are related to characteristics of
teristics of the users (their sex, age, and occupation) are related to characteristics of
the movies they choose (year and genre). For this purpose, the analyst will perform
two kinds of operations: first, define groups of users and groups of movies based on
the attributes of the former and latter and, second, extract and explore the following
relationships:
• associations between the groups of the users and groups of the movies;
• similarities between the user groups in terms of movie choices;
• links between the movie groups due to shared audiences.
For extracting the associations between groups of users, on the one side, and the
movies or groups of movies, on the other side, the analyst transforms the original
data into a so-called contingency table. It is a matrix whose rows and columns correspond to values of two attributes, and the cells contain the frequencies of the occurrence
of the value pairs in the data. Usually, contingency tables are generated for attributes
with categorical (qualitative) values. Figure 7.1 shows an example of a contingency
table where the rows correspond to the movies and the columns to the user groups
by occupations. Similar tables can be built for movie groups based on the genre or
for user groups based on the sex. To define groups of users or groups of movies
based on numeric attributes, such as ‘age’ and ‘release year’, the analyst needs to
divide the ranges of the numeric values into meaningful intervals. For example, the
analyst may define the following age groups of the users: below 18, from 18 to 25,
from 25 to 35, and so on.
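Binning a numeric attribute into meaningful intervals and building the contingency table can be sketched with pandas; the records below are invented for illustration and do not come from MovieLens:

```python
import pandas as pd

# Hypothetical viewing records: one row per (user, movie) pair.
records = pd.DataFrame({
    "occupation": ["student", "student", "engineer", "artist", "engineer", "student"],
    "age":        [17,        22,        31,         45,       28,         24],
    "genre":      ["comedy",  "drama",   "drama",    "comedy", "drama",    "comedy"],
})

# Numeric attributes are first divided into meaningful intervals:
records["age_group"] = pd.cut(records["age"],
                              bins=[0, 18, 25, 35, 120],
                              labels=["<18", "18-25", "25-35", "35+"],
                              right=False)

# The contingency table counts co-occurrences of the two categorical values.
table = pd.crosstab(records["genre"], records["age_group"])
print(table)
```

The same call with `occupation` in place of `age_group` yields a table like the one in Fig. 7.1, with movies (or genres) as rows and occupations as columns.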
Tools for creation of contingency tables are commonly available in various data
analysis systems, libraries, and frameworks, such as R, KNIME, WEKA, and
Python. Given a contingency table with the frequencies of pairs of items, it is possible to assess the strengths of the associations between the items. If a pair occurs more frequently than expected, the items are said to be positively associated. If a pair occurs less frequently than expected, the items are negatively associated. If the frequency is close to what is expected, the items are not associated. The strength of a positive or negative association can be measured based on
1 https://grouplens.org/datasets/movielens/
7.1 Motivating example
Fig. 7.1: A contingency table containing the frequencies of all pairs (movie, viewer’s occupation). Source: [5]
the difference between the actual and expected frequencies. The expected frequen-
cies are estimated based on the null hypothesis assuming absence of association.
Let $i+$ and $+j$ be two items corresponding, respectively, to the row $i$ and the column $j$ of a contingency table. Let $f_{i+}$ and $f_{+j}$ be the overall frequencies of the items $i+$ and $+j$ in the data, and let $f_{++}$ denote the sum of all frequencies contained in the table; see Fig. 7.1. Then, the expected frequency $\hat{e}_{ij}$ can be estimated as the product of the overall frequencies of the items divided by the sum of all frequencies: $\hat{e}_{ij} = f_{i+} \cdot f_{+j} / f_{++}$. If the actual frequency of the item pair $f_{ij}$ is close to $\hat{e}_{ij}$, the null hypothesis holds, that is, the items are not associated; otherwise, they are associated, either positively, when $f_{ij} > \hat{e}_{ij}$, or negatively, when $f_{ij} < \hat{e}_{ij}$.
The significance of the difference between the actual and expected frequency values can be estimated by means of the standardised, or adjusted, Pearson residual:
$$ r^{\mathrm{Pearson}}_{ij} = \frac{f_{ij} - \hat{e}_{ij}}{\sqrt{\hat{e}_{ij}\,(1 - f_{i+}/f_{++})\,(1 - f_{+j}/f_{++})}} $$
The values of r_pearson_ij approximately follow a normal distribution with mean 0
and standard deviation 1. It is convenient to re-scale these values to the range [−1, 1],
so that −1 corresponds to the strongest negative association (it would mean that the
items never occurred together, i.e., f_ij = 0) and 1 to the strongest positive association
(it would mean that the items occurred only together with each other and never with
any other items, i.e., f_ij = f_i+ = f_+j). To obtain the scaling factor, the maximal
theoretically possible absolute value of r_pearson_ij is calculated based on the contingency
table. For obtaining the maximal possible negative association value, the frequency
of the pair (i+, +j) is set to 0, and the aggregate frequencies f_i+, f_+j, and f_++ are
204 7 Visual Analytics for Understanding Relationships between Entities
updated accordingly. For obtaining the maximal possible positive association value,
the frequencies of all pairs in the ith row and in the jth column, except for the pair
(i+, + j), are set to 0, and the aggregate frequencies fi+ , f+ j , and f++ are decreased
accordingly.
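These computations can be sketched in a few lines of Python. The following is a minimal illustration using numpy; the contingency table values are invented, not taken from the MovieLens data:

```python
import numpy as np

def adjusted_pearson_residuals(table):
    """Adjusted (standardised) Pearson residuals for a contingency table.

    table: 2D array of observed pair frequencies f_ij.
    Returns an array of the same shape with r_pearson_ij values.
    """
    f = np.asarray(table, dtype=float)
    f_i = f.sum(axis=1, keepdims=True)   # row totals f_i+
    f_j = f.sum(axis=0, keepdims=True)   # column totals f_+j
    f_pp = f.sum()                       # grand total f_++
    expected = f_i * f_j / f_pp          # ê_ij under the null hypothesis
    denom = np.sqrt(expected * (1 - f_i / f_pp) * (1 - f_j / f_pp))
    return (f - expected) / denom

# A toy 2x2 table: rows could be user groups, columns movies.
table = [[30, 10],
         [10, 30]]
r = adjusted_pearson_residuals(table)
# Positive residuals indicate positive association, negative ones
# negative association; values near 0 indicate no association.
```

The scaling to [−1, 1] described above would additionally require computing the maximal possible residual from the modified marginal frequencies; it is omitted here for brevity.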
To represent the associations visually, the authors of the paper [5] propose a design
demonstrated in Figs. 7.2 and 7.3. The display has an appearance of a wheel and
is called Contingency Wheel. The wheel is divided into segments corresponding to
columns or rows of a contingency table. Thus, the wheel segments in Fig. 7.2 correspond
to groups of users according to the occupation, and in Fig. 7.3 to groups of
movies according to the genre. The widths of the segments may differ proportionally
to the frequencies of the respective items, as in Fig. 7.2. Inside the segments,
the associations with the items corresponding to the other dimension of the table
are represented in a detailed mode, as in Fig. 7.2, or in an aggregated way, as in
Fig. 7.3. In the example we consider, the analyst is only interested in the positive
associations between the users and the movies; therefore, the negative associations,
if any, are not shown in the display.
In Fig. 7.2, each movie is represented by a dot whose distance from the inner side
of the segment is proportional to the strength of the association between the movie
and the user group represented by the segment. When there are many movies with
the same association strength, the corresponding dots may overlap, and the quantities
will be hard to estimate. In such cases, it is better to represent the associations
using histograms, as in Figs. 7.3 and 7.4, instead of dots. In these figures, the
histograms are drawn in a peculiar way corresponding to the wheel-like design of
the display. Such a fancy design is not essential; it would also be acceptable to use
standard histograms.
The use of histograms, either curved or standard, creates additional opportunities
for the exploration of the associations. The bars of the histograms can be divided
into segments corresponding to groups of the associated items. Thus, in Fig. 7.4,
the segments correspond to groups of movies according to the release year (top) and
according to the genre (middle). The legends at the bottom explain the encoding of
the years and genres by colours of the bar segments and also show the frequency
distributions of the values of these attributes. Such segmented histograms are good
for revealing associations between groups of items. The histograms in Fig. 7.4 re-
veal the movie choice preferences of some occupation-based user groups. We can
see that the group “K-12 Student” (i.e., from kindergarten to the 12th school grade)
has strong associations with recent movies, while the group “Retired” is more strongly
associated with older movies. Concerning the movie genres, the group “K-12 Student”
has very strong associations with movies for children and with comedies,
Fig. 7.2: A Contingency Wheel display showing the links between the viewers’
occupations based on the associations with the same movies. Source: [5]
Fig. 7.3: A Contingency Wheel display showing the links between the genres of
the movies based on the associations with the same viewers. The numbers of the
associated viewers are represented by the lengths of the curved histogram bars in
the wheel sections corresponding to the movie genres. The bars are divided into
coloured segments corresponding to the age groups (left) and to the gender (right)
of the viewers. Source: [5]
Fig. 7.4: Exploration of the associations between the occupations of the movie view-
ers and the movies regarding their release times (top) and genres (middle). The num-
bers of the associated movies are represented by the lengths of the curved histogram
bars. The bars are divided into coloured segments corresponding to the release years
(top) and genres (middle) of the movies. The vertical positions of the bars corre-
spond to the strengths of the associations. Source: [5]
that is, the association strength is higher than a chosen threshold. The strength of the
relationship between two movies can be measured by the number of the users who
viewed both movies. The strength of the relationship between two groups of movies
can be measured by the number of the users having sufficiently strong (i.e., above
a threshold) associations with both groups. In the analysis scenario described in the
paper [5], the analyst used the threshold 0.4.
The Contingency Wheel display is designed to represent simultaneously the asso-
ciations between the columns and rows of a contingency table (by the histograms
in the wheel segments) and the similarity relationships between the columns or be-
tween the rows (by the arcs linking the wheel segments). However, as we noted
earlier, it is not essential to use this particular design. It would be quite suitable to
represent two kinds of relationships separately. Usual histograms can represent the
column-row associations. The similarity links can be visualised by means of stan-
dard graph drawing techniques, which are available in many software systems and
data analysis frameworks, including Python and R.
Let us return to the analysis of the relationships hidden in the movie rating data
set. The analyst first creates a contingency table of the user’s occupations and the
movies as shown in Fig. 7.1. The analyst explores the relationships between the
user groups based on the associations with the same movies. She finds strong rela-
tionships between the groups “K-12 student” and “college/grad student” and be-
tween “programmer” and “technician/engineer”. More surprisingly, there is also
quite a strong relationship between “academic/educator” and “retired”. To inves-
tigate this relationship, the analyst extracts and examines the subset of the movies
associated with both these groups. She finds that the subset mostly consists of old
movies; only a few of the shared movies were released in the 1990s and all others
are older. Based on the noticed similarities, the analyst merges “K-12 student”
with “college/grad student”, “programmer” with “technician/engineer”, and “aca-
demic/educator” with “retired”. The further analysis is done based on the redefined
user groups.
Having learned that the group [academic/educator, retired] has a strong inclination
to viewing old movies, the analyst compares this group to the other user groups with
regard to the distribution of the release years of the associated movies. She finds that
the distribution for the group [academic/educator, retired] is quite different from the
distributions for the other user groups. In particular, the group [college/grad student,
K-12 student] has opposite preferences (Fig. 7.4).
To explore the relationships between the movie genres, the analyst creates a con-
tingency table with the columns corresponding to the genres and the rows to the
users. After visualising the relationships, the analyst finds high overlaps of the au-
diences of “Musical” and “Children”, “Action” and “Adventure”, and “War” and
“Western”. Contrary to her expectations, she does not see much overlap between
the audiences of the genres “War” or “Western” and “Crime”. Like with the user
groups, the analyst redefines the groups of the movies according to the observed
commonalities of the audiences. Then the analyst explores the age and gender dif-
ferences between the audiences associated with the redefined movie groups using
segmented histograms, as in Fig. 7.3. She finds that the user age distributions for the
movie groups look quite similar. Concerning the gender, she sees that, as could be
expected, the most strongly associated users of the movie group “Romance” are fe-
male, but, surprisingly, the audience of “Horror” does not include fewer women than
the other movie groups. The analyst also investigates the audience compositions of
the different movie groups in more detail using appropriate histogram displays. Par-
ticularly, for the movie group “Children”, she finds that there is an almost equal
distribution between male and female viewers, and that most viewers are in the age
group of 18–24 years, while the analyst expected a prevalence of younger users.
The movies of the groups “War” and “Western” are viewed more by men, who are
often executives and in the age group of 35–44 years for “War” and older age groups
for “Western”.
The analysis was done for a movie rental service company and allowed them to
better understand their customers. Based on the insights gained, the company can
simplify and improve their movie recommendation system.
7.2.1 Definition
A graph G consists of a set of vertices V , which are also called nodes, and a set of
edges E, which are also called links: G = (V, E). A vertex represents an individual
entity and an edge indicates the existence of a relationship between two entities;
hence, each edge refers to some pair of vertices. For instance, in a graph represent-
ing a friend network on a social media site, a node would be an individual user, for
instance Mary or Bob, and an edge between Mary and Bob would mean that they are
friends on this site and thus related. In our motivating example, there were relation-
ships of shared audience between groups of movies and relationships of common
movie choices between groups of viewers. The groups of movies or the groups of
users could be represented as vertices in two graphs, and the relationships between
the groups would be represented by graph edges.
Graphs can be directed, which means that there is a direction associated with each
edge to specify which vertex is the origin and which is the destination of the relation.
A typical example could be a graph representing message communications or money
transfers between people or organisations, where each edge would be directional and
indicate a sender and a receiver.
Often in real-world applications vertices and edges in graphs come with multiple
attributes associated with them. These attributes can be any form of data from nu-
meric values, to textual information to multimedia data. In our motivating example,
the movies and the users had specific attributes, and the groups were described by
distributions of values of these attributes.
A special numeric measure, which is treated as the strength or weight of the relation-
ship, can be associated with the edges of a graph. In such a case, the graph is called
weighted. In our motivating example, there were numeric measures of the strengths
of the relationships between the movies or movie groups (based on the number of
common users) and between the users or user groups (based on the number of the
same viewed movies). These measures could be represented as the weights of the
graph edges. In the internal circular area of a Contingency Wheel, the arcs between
the wheel segments represent visually the graph edges, and the widths of the arcs are
proportional to the weights of the edges. In a graph representing money transfers,
the edge weights could represent the amounts of the transferred money.
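As a minimal illustration, such a directed weighted graph can be built with the networkx package; the account names and transfer amounts below are invented:

```python
import networkx as nx

# A small directed weighted graph sketching money transfers:
# nodes are account holders, edge weights are transferred amounts.
G = nx.DiGraph()
G.add_edge("Mary", "Bob", weight=250.0)
G.add_edge("Bob", "Carol", weight=120.0)
G.add_edge("Carol", "Mary", weight=80.0)

# Edge weights can be queried per edge, e.g. G["Mary"]["Bob"]["weight"],
# or aggregated per node, e.g. the total amount sent by each account:
total_sent = {n: sum(d["weight"] for _, _, d in G.out_edges(n, data=True))
              for n in G.nodes}
```

The same structure, with nodes for POI groups and edge weights for flow magnitudes, underlies the state transition graph used in the examples below.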
In various applied disciplines, including machine learning, data mining, data sci-
ence, transportation, biology, and others, the term network is used as a synonym for
graph.
For weighted graphs, it is necessary to check also the properties of attributes as-
sociated with the vertices and edges. Specifically, missing data, outliers, unusual
combinations of values of multiple attributes require special attention, among other
potentially suspicious patterns. Such patterns may indicate errors in data that need
to be fixed before the analysis.
We shall illustrate graph visualisation approaches using data from VAST Challenge
2015². The data set consists of simulated trajectories of 11,374 individuals who
spent 3 days in an imaginary amusement park, repeatedly visiting 73 places of in-
terest (POIs) belonging to 11 distinct categories, including park entry, information,
food, beer gardens, shopping, restrooms, shows, and different categories of rides
(thrill rides, rides for everyone, and kiddie rides). Figure 7.5 shows the spatial lay-
out of the park, the locations of the POIs, and the flows of the visitors. The POIs
and the flows between them make a directed weighted graph where the POIs are the
nodes and the flows are the edges. This is a specific kind of graph: its nodes have
fixed spatial locations. To have a more generic kind of graph for further illustrations,
we shall make some transformation of the data. Specifically, we aggregate the POIs
into groups, represent the groups by graph nodes, aggregate the between-POI flows
into the between-group flows, and obtain in this way a state transition graph provid-
ing an aggregated representation of the collective movement behaviour [19].
A meaningful way for grouping locations is by their types or categories. In our ex-
ample, we make groups and create graph nodes based on the POI categories. How-
ever, taking into account the specifics of the challenge, there are a few places requir-
ing separate consideration; we thus represent each of them by a separate graph node.
After aggregating the flows, we obtain a directed weighted graph, where the weights
of the nodes are the counts of the visits to the POIs of the respective categories, and
the weights of the links correspond to the counts of the moves between the POIs of
the categories represented by the nodes. The graph is not fully connected, as a few
of the potentially possible transitions never occurred in the data.
There are two major approaches to graph visualisation [90]: a graph can be repre-
sented by an adjacency matrix and by a node-link diagram. Figure 7.6 demonstrates
two variants of a matrix representation of the state transition graph constructed from
the amusement park data. The variants differ in the ordering of the rows and columns.
Meaningful ordering may enable identification of structures and connectivity pat-
terns and help in data interpretation. Jacques Bertin was the first to introduce matrix
reordering as an approach to finding patterns [31]. Bertin suggested that the reorder-
ing should aim at “diagonalisation” of the matrix, so that most of the filled cells are
2 http://vacommunity.org/VAST+Challenge+2015
7.4 Graph/network visualisation techniques 213
Fig. 7.5: Flows of visitors in an amusement park. Labelled dots represent the POIs.
The dots are coloured according to the place categories, and the sizes are propor-
tional to the counts of the visits. The magnitudes of the flows between the POIs are
represented by the line widths.
arranged along the matrix diagonal. Since the pioneering work of Bertin, numerous
matrix reordering algorithms have appeared [30].
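A very simple reordering heuristic, sorting rows and columns by total flow, can be sketched as follows; the matrix and labels are invented, not taken from the challenge data, and this is only one of the many algorithms surveyed in [30]:

```python
import numpy as np

# Toy adjacency matrix of between-category visitor flows
# (rows = origins, columns = destinations); values are invented.
labels = ["entry", "food", "thrill rides", "shows"]
M = np.array([[0, 5, 40, 10],
              [2, 0,  8,  4],
              [35, 9, 0, 20],
              [8,  3, 18, 0]])

# Order rows and columns together by decreasing total flow (in + out),
# so that the most strongly connected nodes appear first.
totals = M.sum(axis=0) + M.sum(axis=1)
order = np.argsort(-totals)
M_reordered = M[np.ix_(order, order)]
reordered_labels = [labels[i] for i in order]
```

More sophisticated methods aim at Bertin's "diagonalisation", i.e., concentrating the filled cells along the matrix diagonal.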
Figure 7.7 represents our state transition graph in the form of node-link diagrams
in different automatically produced layouts. Some layouts position the nodes on the
plane irrespective of the edge weights while others (e.g., the so-called force-directed
layout) treat the edge weights as forces in a virtual system of springs: repulsive
forces exist between nodes that are not linked by edges and attractive forces exist
between linked nodes. The algorithm strives to balance the repulsive and attractive
forces. A survey of graph layout techniques can be found in [53].
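In Python, a force-directed layout is readily available, for example via networkx; the toy graph below is invented for illustration:

```python
import networkx as nx

# A small undirected weighted graph; spring_layout computes a
# force-directed (Fruchterman-Reingold) layout, optionally taking
# the edge weights into account as attraction strengths.
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 3.0), ("B", "C", 1.0),
                           ("A", "C", 0.5), ("C", "D", 2.0)])

# seed fixes the random initial positions, making the layout reproducible.
pos = nx.spring_layout(G, weight="weight", seed=42)
# pos is a dict mapping each node to its 2D coordinates, ready to be
# passed to a drawing function such as nx.draw(G, pos).
```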
In a node-link diagram, edges are represented by straight or curved lines connecting
nodes. When the edges are directed, the direction is indicated by an arrow or by a
specific line shape, such as decreasing width. In our examples, the direction is rep-
resented by the increasing line curvature. In a directed graph, it may be suitable for
Fig. 7.6: Adjacency matrices representing the collective movements in the amuse-
ment park. Left: the rows and columns are ordered alphabetically; right: they are
ordered according to the visit counts.
certain analysis tasks to replace each pair of opposite edges by a single edge representing
the difference between the edges, which can be interpreted as the “net flow”.
Examples can be seen in the lower images in Fig. 7.11.
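The replacement of opposite edge pairs by net-flow edges can be sketched as follows; the flow counts are invented:

```python
# flows maps (origin, destination) -> number of moves in that direction.
flows = {("food", "rides"): 120, ("rides", "food"): 80,
         ("entry", "food"): 50}

# For every pair of opposite edges, keep only the dominant direction
# with the difference of the two weights (the "net flow").
net = {}
for (a, b), w in flows.items():
    back = flows.get((b, a), 0)
    if w > back:                  # balanced pairs (w == back) yield no edge
        net[(a, b)] = w - back

# net == {("food", "rides"): 40, ("entry", "food"): 50}
```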
A big problem in drawing node-link diagrams is numerous intersections of the
edges. This problem can be dealt with by using edge bundling [71], in which multi-
ple edges that have close positions and similar directions are united in a single thick
line representing a “bundle” of edges. However, such edge bundles may be mislead-
ing due to representing artificial patterns that do not really exist in the data. This
relates especially to geographically-referenced graphs.
Sometimes it may be useful for analysis to create specialised data-driven layouts,
utilising additional information beyond the weights of nodes and links. For example,
Figure 7.8 shows the same state transition graph as before. For projecting the nodes
to a 2D plane, a data embedding method (namely, Sammon mapping) was applied
to the time series of the hourly counts of the place visitors. Two different variants of
the projections were obtained using different variants of normalisation of the same
time series of the counts.
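Sammon mapping is not part of scikit-learn; as a sketch of the same idea, metric MDS can be used instead to position the graph nodes by the similarity of their time series. The hourly counts below are randomly generated placeholders:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical hourly visit-count time series, one row per graph node
# (place category): 11 nodes, 72 hours (3 days).
rng = np.random.default_rng(0)
hourly_counts = rng.poisson(lam=50, size=(11, 72))

# Normalise each series (here: by its own maximum) before embedding;
# different normalisations yield different layouts, as in Fig. 7.8.
normalised = hourly_counts / hourly_counts.max(axis=1, keepdims=True)

# Project the nodes to 2D so that nodes with similar temporal profiles
# are placed close to each other.
pos = MDS(n_components=2, random_state=0).fit_transform(normalised)
# pos[i] gives the 2D coordinates for node i
```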
There exist approaches that combine node-link diagrams and adjacency matrices in
the same display. Matrices are used to represent strongly connected groups of nodes,
thus decreasing the number of intersecting lines that would be drawn in a pure node-
link diagram. Obviously, neither adjacency matrices nor node-link diagrams scale to
very large graphs. The graph drawing community continuously develops new methods
for layout optimisation, while the information visualisation and visual analytics
communities focus on devising approaches to visually-steered graph aggregation and
abstraction [90].
Fig. 7.7: The graph representing the collective movements in the amusement park is
drawn with different layouts: two variants of force-directed layouts (top), circular
(bottom left) and diagonal (bottom right) layouts.
Fig. 7.8: Two variants of graph layout with node positions defined by a 2D embed-
ding of time series of node attributes.
Fig. 7.9: Graph abstraction by aggregating nodes and edges. The original graph in
this example represents collaborations (graph edges) between individual researchers
(graph nodes). Left: The graph nodes have been aggregated according to the organ-
isations in which the researchers work. Right: by zooming into an aggregated node,
the corresponding sub-graph can be viewed at a lower level of aggregation (here: by
departments within an organisation). Source: [57].
Fig. 7.10: Visual comparison of two graphs: data for Friday (left) and Sunday (right).
Fig. 7.11: Visual comparison of the net flows in two graphs: data for Friday (left)
and Sunday (right).
Fig. 7.12: Difference graphs, Sunday - Friday, based on absolute (left) and nor-
malised (right) weights.
7.5 Common tasks in graph/network analysis 219
indicators for the nodes and edges and/or time series of the weights, where a zero
weight indicates the absence of a node or edge. These time series can be analysed
using established techniques for analysis of time-related data, see Chapter 8. Be-
sides, graph states can be characterised by general features: counts of nodes and
edges, statistics of weights (e.g. distribution histograms), selected graph-theoretic
metrics and their distributions etc. Time series analysis is applicable also to time
series of graph features.
Generally, there exist numerous implementations of graph-specific computations,
so finding ready-to-use code may be quite easy for different development environments
and application settings. However, combining computational methods
with appropriate visualisation techniques in valid visual analytics workflows re-
quires good understanding of data properties and semantics, engineering of relevant
features, and selection of appropriate visualisation techniques for interpreting and
evaluating computation results.
A variety of workflows for analysing graphs were proposed recently. These work-
flows combine decomposition of large graphs into components, feature engineering
and selection, and the use of the so defined features for assessing the similarity
of graphs and graph components by means of suitable distance functions. A fre-
quently used operation is clustering of graphs or sub-graphs according to their sim-
ilarity, which is determined by means of suitable structure-based or feature-based
distance functions (see Section 4.2.7). In the following subsections, we shall con-
sider two representative examples of visual analytics approaches to graph data anal-
ysis.
which are not interesting for the purposes of analysis. Such groups are represented
by strongly connected sub-graphs (communities), which can be easily identified by
applying graph-theoretic measures (Section 7.2.2). In this way, the analysts detected
two large communities, one with 115,000 nodes connected by 135,000 edges, and
another with 20,000 nodes. The edges linking the entities within these two communities
were excluded from the further analysis. The analysts focused their
attention on 40,000 much smaller sub-graphs with up to 110 nodes each. For char-
acterising these smaller graphs, a library of graph-specific features has been created,
including the following features:
• general features of the network, such as its size, degree of completeness mea-
sured as the ratio between the existent and potentially possible links, statistics of
the edge weights, etc.;
• reciprocity features representing the distribution of bi-directional edges and their
weights;
• distance features representing statistics of distances in a network, e.g., the diam-
eter of a graph;
• centrality features characterising the distributions of different graph-theoretic
measures of the nodes and edges, as described in Section 7.2.2;
• motif-based features representing frequencies of user-selected predefined struc-
tures (motifs, see examples in Fig. 7.13) that reflect particular domain semantics,
such as out-star motifs for companies with many subsidiaries.
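A few of these feature types can be computed directly with networkx. The following is a sketch on a tiny invented graph; a real feature library would include many more statistics:

```python
import networkx as nx

def graph_features(G):
    """A few of the listed feature types for a directed graph G."""
    und = G.to_undirected()
    return {
        "n_nodes": G.number_of_nodes(),
        "n_edges": G.number_of_edges(),
        # degree of completeness: existing links / potentially possible links
        "density": nx.density(G),
        # reciprocity: share of edges that have a counterpart in the
        # opposite direction
        "reciprocity": nx.reciprocity(G),
        # a distance feature (defined only if the underlying graph
        # is connected)
        "diameter": nx.diameter(und) if nx.is_connected(und) else None,
        # a centrality feature: maximal in-degree centrality
        "max_in_degree_centrality": max(nx.in_degree_centrality(G).values()),
    }

G = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])
feats = graph_features(G)
```

Motif-based features would additionally require subgraph matching against the user-selected motif patterns.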
Computing the whole set of features for a large number of graphs is time consum-
ing. However, this needs to be performed only once, and can be parallelised in a
distributed computing environment. Since the computed features are numeric at-
tributes, the characterisations of the graphs in terms of these features have the form
of usual multidimensional data, which can be analysed using the whole range of
methods designed for this kind of data; some of them have been discussed in Chap-
ter 6. Feature selection methods (Section 4.3) can help to select and rank the features
that are most important for a task at hand. Clustering and embedding operations
(Sections 4.5 and 4.4) can be used for identifying groups of similar graphs.
Fig. 7.14: A result of SOM embedding of a set of graphs is visualised in the form of
a matrix with rectangular cells corresponding to the SOM nodes (neurons). Each cell
contains an image of a representative graph, while the background colour indicates
the size of the group of graphs represented by the node. The groups can be viewed
in detail, as shown on the figure margins. Source: [144].
In addition to the matrix of the neurons, SOM component planes for the features
used in the SOM derivation can be visualised as shown in Fig. 7.15, which is
similar to the earlier demonstrated Fig. 4.10 but uses a rectangular grid instead of
a hexagonal one.
7.6 Further Examples 223
Fig. 7.15: Visualisation of the SOM component planes for the graph embedding.
Source: [144].
The analyst can progressively refine large groups of graphs assigned to the same
SOM node by selecting one or more cells and creating a new SOM for the selection,
as demonstrated in Fig. 7.16.
The analysis reported in the paper [144] enabled discovery of several interesting
patterns and supported understanding of the overall structure of the shareholding
system. This was achieved through representation of the relationships in the form
of a graph, detection and removal of parts that were not relevant to the goals of the
study, efficient feature engineering, interactive progressive spatialisation and visual
inspection of groups of similar graphs.
Fig. 7.16: Refinement of a SOM embedding. The upper two images show the SOM
embeddings created from groups of graphs assigned to selected cells of the initial
SOM. Source: [144].
Fig. 7.17: States of a dynamic social network are represented by points in an em-
bedding space. The points are connected by lines in the chronological order. On the
left, the points are coloured according to the dates and times of the state occurrence;
on the right, the same display is shown with the points coloured by the times of the
day. The graph structures corresponding to several selected points are shown around
the projection plot. Source: [141]
The state vectors were obtained from the adjacency matrices representing the graph states by concatenating
the matrix rows into sequences. After such a transformation, any distance
function suitable for multiple numeric attributes, such as Euclidean distance, can be
applied for assessing the dissimilarities between the graph states. Since the num-
ber of the resulting attributes is very high, it is reasonable to apply feature selection
and/or dimensionality reduction techniques for avoiding the curse of dimensionality,
as discussed in Chapter 4.
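The transformation of graph states into comparable numeric vectors can be sketched as follows, on two toy adjacency matrices:

```python
import numpy as np

# Two toy graph states over the same 3 nodes, given as adjacency matrices.
state_t1 = np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]])
state_t2 = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])

# Concatenating the matrix rows turns each state into a feature vector ...
v1 = state_t1.ravel()
v2 = state_t2.ravel()

# ... to which any distance function for numeric attributes can be applied,
# e.g. the Euclidean distance between the two states.
euclidean = np.linalg.norm(v1 - v2)
```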
The analysts tried different data embedding methods for creating spatialisations of
the dynamic network, in particular, PCA and t-SNE; the latter is used in Fig. 7.17.
The methods provided somewhat different views of the network evolution, and it
was useful to see both and compare them. Both methods generate embeddings with
a large dense cluster of states corresponding to the night hours of all days, when
there were no or almost no contacts. The t-SNE embedding shows more clearly
that the days differed with regard to the sets of the contacts that occurred in these
days, because the “trails” corresponding to the different days do not overlap or form
joint clusters. The analysts have gained a number of interesting findings, which are
reported in the paper [141].
The analysts also utilised the result of the PCA in a different way: they constructed
a time graph where the x-axis corresponds to the time and the y-axis to the first
principal component generated by PCA (Fig. 7.18). The time graph clearly shows
the periodic repetition of the daily patterns of the social contacts.
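Extracting the first principal component of the state vectors for such a time graph can be sketched with scikit-learn; the state vectors below are randomly generated placeholders for the flattened adjacency matrices:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical flattened graph states, one row per hourly snapshot:
# 48 hours, each state a flattened 10x10 adjacency matrix.
rng = np.random.default_rng(1)
states = rng.random((48, 100))

# The first principal component summarises each state by a single value.
pc1 = PCA(n_components=1).fit_transform(states)[:, 0]

# For a time graph as in Fig. 7.18, plot pc1 against the snapshot times,
# e.g. plt.plot(range(48), pc1) with matplotlib.
```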
3 https://en.wikipedia.org/wiki/Bipartite_graph
7.7 Concluding remarks 227
Fig. 7.19: Left: Player-to-player pressure forces during a game episode are visually
represented on a map of the football pitch by curved linear symbols connecting the
positions of the players. Right: For this game episode, a pressure graph shows the
total amounts of the pressure from the players on the ball (by the darkness of the
node shading) and on the opponents (by the line widths). Source: [11]
The main take-away messages from this chapter are the following:
• In many applications, a graph is a suitable structure for representing and analysing
relationships between entities. The entities are represented by nodes of the graph
and the relationships by edges. The strengths of the relationships can be ex-
pressed as weights of the edges.
• Relations between entities are not necessarily described in data explicitly, but
they can be identified and extracted by means of computations chosen according
to the domain semantics and analysis goals.
• There are several established techniques for visual representation of graphs. The
most frequently used are matrix representations and node-link diagrams; both are
not scalable to large graphs. Graph abstraction by aggregating nodes and links is
a helpful approach to dealing with large graphs.
• Interactive operations may partly compensate for the limitations of the visual-
isations and facilitate analysis. The possible operations include filtering of the
nodes and links in both representations, reordering of the rows and columns in a
matrix, changing the layouts of node-link diagrams, and zooming of a graph rep-
resented in an aggregated form. For comparison of graphs with similar structure,
visualisation of their differences and ratio may be useful.
• There exists an established set of graph-theoretic metrics and graph-specific computations,
such as shortest path calculation, that can be used for characterising
and comparing graphs, identifying their parts with special properties, decompos-
ing graphs into sub-graphs, and other purposes.
• Some applications require considering multiple graphs and, often, time related
graphs. Similarity search, embedding, and clustering are useful techniques in
such contexts. The use of these techniques requires that an appropriate measure
of the dissimilarity (distance) between graphs is chosen or defined according to
the application semantics and analysis goals.
Consider typical activities that you perform during a usual week or a working day.
This may include sleeping, preparing food, eating, and so on. Potentially interesting relationships
are transitions between the activities: how often, when and where they happen,
and how long the time lapses between two consecutive activities are. Think
about a possible representation of your data, their expected properties and complex-
ities, consider how to check data quality, formulate some analysis tasks, and design
appropriate data analysis approaches and procedures. What kinds of data transfor-
mations, computational methods and visualisations are necessary / useful / nice to
have?
You can extend this task by considering the associations between the activities and
the locations in which they take place. In this case, you may need to represent your
data by a bipartite graph, with one subset of nodes representing the locations and
another the activities. Again, define the data collection and processing methods,
analysis tasks, and corresponding analytical procedures.
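A bipartite activity-location graph as suggested here can be sketched with networkx; all activities, locations, and counts below are invented:

```python
import networkx as nx

# One node set for activities, another for locations; edge weights
# count (hypothetically) how often an activity takes place at a
# location during a week.
B = nx.Graph()
B.add_nodes_from(["sleeping", "eating", "working"], bipartite="activity")
B.add_nodes_from(["home", "office", "cafe"], bipartite="location")
B.add_weighted_edges_from([("sleeping", "home", 7),
                           ("eating", "home", 5), ("eating", "cafe", 3),
                           ("working", "office", 5)])

# The node attribute lets us recover each of the two node sets.
activities = {n for n, d in B.nodes(data=True) if d["bipartite"] == "activity"}
```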
Chapter 8
Visual Analytics for Understanding Temporal
Distributions and Variations
Abstract There are two major types of temporal data, events and time series of at-
tribute values, and there are methods for transforming one of them into the other.
For events, a general analysis task is to understand how they are distributed in time.
For time series, as well as for events of diverse kinds, a general task is to understand
how the attribute values or the kinds of occurring events vary over time. In analysing
temporal distributions and variations, it is essential to account for the specific fea-
tures of time and temporal phenomena, particularly, recurring cycles and temporal
dependence. To see patterns of temporal distribution or variation, people commonly
apply visual displays where one dimension represents time and the other is used to
show individual events, attribute values, or statistical summaries. To see the data in
the context of temporal cycles, a common approach is to use a 2D display where
one or two cycles are represented by display dimensions. For large and/or complex
data, visual displays need to be combined with techniques for computational analy-
sis, such as clustering, embedding, sequence mining, and motif discovery. We show
and discuss examples of employing such combinations in application to different
data and analysis tasks.
8.1 Motivating example
We have already dealt with temporal distributions when we investigated the epi-
demic outbreak in Vastopolis in Chapter 1, and we briefly discussed possible pat-
terns in temporal distributions in Section 2.3.3. We have introduced the term event
for discrete entities (things, phenomena, actions, ...) that appear or happen at some
time moments and disappear immediately (instant events) or exist for limited time
(durable events). When events are numerous, like the microblog posts in Chapter 1,
we often want to investigate their temporal frequency, that is, how many events happened at which times and how the event number changed over time. To see this variation, it is convenient to use temporal histograms, as in Figs. 1.5 and 1.6. Figure 1.14 demonstrates that the bars in a
time histogram can be divided into coloured segments to represent counts of events
of different types, like the two distinct diseases in Vastopolis. Observing patterns in
these time histograms was essential for our analytical reasoning concerning the time
of the outbreak and its development trends.
Further examples of time histograms appear in Fig. 2.6. The upper image is similar
to the histograms used in Chapter 1: the horizontal axis represents time, which is
divided into intervals of equal length (bins), and the vertical dimension is used to
represent the counts of the events in these intervals by the heights of the bars. Such
a histogram is called linear: the time is represented as a line. However, the linear
representation of time may not be ideal (or, rather, not sufficient) when the variation
of the number of events (or any other kind of temporal variation) may be related to
temporal cycles.
Thus, in the upper image of Fig. 2.6, we see high peaks in the event number that recur at approximately equal time intervals. This is not surprising because the histogram represents the events of taking photos of cherry blossoms. The underlying dataset consists of metadata records of geolocated photos taken in North America from 2007 to 2016 that were published
online through the photo hosting service flickr1 and have keywords related to cherry
blossoming in the titles or tags attached. It is natural that these photo taking events
are much more frequent in spring than in other seasons of a year, which means that
the number of events varies according to the annual (seasonal) time cycle.
While the linear histogram informs us that the distribution is periodic, it does not
explicitly show the relevant time cycles and is not convenient for checking if the
peaks occur at the same times in each year. This information can be seen much
better when we use a two-dimensional display, like the 2D time histogram in the
lower image of Fig. 2.6, in which one dimension represents positions within a time
cycle (such as days in a year) and the other dimension represents a sequence of
consecutive cycles (years). We obtain a matrix whose cells contain symbols representing counts of events or values of another measure, for example,
the number of distinct photographers that made the photos. In Fig. 2.6, bottom, the
horizontal dimension (matrix columns) represents the days of a year grouped into
7-day bins and the vertical dimension (matrix rows) represents the sequence of 10
years from 2007 at the bottom to 2016 at the top. We see that the periods of mass photo taking related to cherry blossoming did not occur at the same times every year: they came earlier in 2011 and 2012 and later in the following three years.
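The matrix behind such a 2D time histogram can be sketched as follows. This is a minimal illustration with synthetic timestamps (the cherry-blossom data are not reproduced here): each event time is mapped to a (year, 7-day bin of the day-of-year) cell, and the cells are counted.

```python
from datetime import datetime, timedelta
from collections import Counter
import random

# Synthetic events with a "spring" concentration around day 100 of each year.
random.seed(0)
events = [datetime(2007 + random.randrange(10), 1, 1)
          + timedelta(days=random.gauss(100, 10))
          for _ in range(1000)]

# Key each event by its year (row) and 7-day bin of the day-of-year (column).
counts = Counter((t.year, (t.timetuple().tm_yday - 1) // 7) for t in events)

years = range(2007, 2017)
matrix = [[counts.get((y, b), 0) for b in range(53)] for y in years]
# Each row is one year, each column a 7-day bin; cell values are event counts,
# ready to be rendered as symbols or colours in the matrix display.
```

Plotting the matrix (e.g. as a heatmap) then makes shifts of the peak between years directly visible, as in Fig. 2.6, bottom.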
Time histograms are of good service when we deal with large numbers of instant
events and need to investigate the temporal variation of the frequencies of event oc-
currences. However, when we have durable (but not so numerous) events, we may
1 https://flickr.com/
Fig. 8.1: The map shows the regions where the events of mass photo taking hap-
pened. In the timeline display below the map, the horizontal dimension represents
the time, and the horizontal bars represent the periods when the events were hap-
pening. The bars are coloured according to the regions where the events took place.
need to investigate and compare the periods of event existence. Let us use the same
example of taking photos of cherry blossoms, but now we want to consider not the
occurrences of the elementary photo taking events but large spatio-temporal con-
centrations (clusters) of photo taking activities, when people make many photos in
some area over multiple days. We have detected 34 clusters that include at least 50
elementary events (the sizes of the clusters obtained range from 57 to 711 mem-
bers). We treat these clusters as events of mass photo taking. The map in Fig. 8.1,
top, shows that the events mostly happened in four regions: Northeast, including
Washington D.C., New York City, Philadelphia, and Boston (blue), area of Vancou-
ver and Seattle (yellow), San Francisco and its neighbourhood (red), and Toronto
(purple). There was also a single event in the area of Portland (green, south of the Vancouver-Seattle region). Unlike the individual photo takings, these events
have duration. In Fig. 8.1, bottom, a timeline display (Section 3.2.4), represents the
times when the events were happening by the horizontal positions and lengths of
the horizontal bars, which are coloured according to the event regions. The bars are
ordered according to the start times of the events and make compact groups corresponding to different years. This display allows us to compare the times of event happening in different regions in each year; however, it is not easy to make comparisons between the years.
Fig. 8.2: The times of the mass photo taking events have been aligned relative to the
yearly cycle. The horizontal dimension of the display represents the days of a year
(in all years). On the left, the bars representing the events are grouped by the years
when they happened, on the right – by the regions.
A technique that supports comparisons between cycles, like the yearly cycles in our
case, is transformation of the event times from absolute to relative with respect to
the cycle start. Simply speaking, we ignore the years in which the events happened and treat them as if they all happened in one year. We build a
timeline display according to the relative dates of the events (Fig. 8.2). We can now
group the bars by the years (Fig. 8.2, left) and compare the beginning and ending
times of the events across the years focusing our attention on the bars of particular
colours or on the whole bar groups corresponding to the different years. We can also
group the bars according to the regions (Fig. 8.2, right), which allows us to observe
the variation of the event times in the regions and compare the relative times of the
groups of events that happened in the different regions.
Apart from the time moments or intervals of their existence, events may have various
other attributes, for example, as shown in the table in Fig. 8.3. These attributes can
be analysed as multivariate data considered in Chapter 6.
However, data having temporal components are not limited to events only. The
most ubiquitous type of temporal data are time series, which consist of attribute
values specified for a sequence of time steps (moments or intervals), usually equally
spaced. We can easily obtain time series from our set of photo taking events: we
just divide the whole time span of the data into equal time intervals, such as days
or weeks, and compute for each interval how many events happened and, possi-
bly, other statistics, such as the number of distinct users or the average time of the
day when the photos were taken. Such aggregation can be done for all events taken
together or separately for each region. In the latter case, we shall obtain multiple
time series; thus, Figure 8.4 shows time series of weekly counts of the photo taking events by the regions.

Fig. 8.3: A table display shows various attributes of the mass photo taking events.

Here we apply a standard visualisation technique known under multiple names: line chart or line graph, time series plot (or graph), or,
simply, time plot or time graph (Section 3.2.4). The horizontal axis represents time,
and values of a numeric attribute are represented by vertical positions. The positions
corresponding to consecutive time steps are connected by lines. The lines in Fig. 8.4
are coloured according to the regions.
Fig. 8.4: Time series of the weekly counts of the photo taking events by the regions.
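The aggregation of events into per-region weekly count series can be sketched as follows. The event records and region names here are made up for illustration; the photo data themselves are not reproduced.

```python
from datetime import datetime
from collections import defaultdict

# Hypothetical (timestamp, region) event records.
events = [
    (datetime(2012, 4, 1), "Northeast"), (datetime(2012, 4, 2), "Northeast"),
    (datetime(2012, 4, 3), "California"), (datetime(2012, 4, 9), "Northeast"),
]

start = min(t for t, _ in events)

def week_index(t):
    # Number of whole weeks elapsed since the first event.
    return (t - start).days // 7

# One count series per region: week index -> number of events in that week.
series = defaultdict(lambda: defaultdict(int))
for t, region in events:
    series[region][week_index(t)] += 1

# series["Northeast"] -> {0: 2, 1: 1}: two events in week 0, one in week 1.
```

Other per-interval statistics (distinct users, average hour of day) can be accumulated in the same loop.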
Not only can time series be generated from events; events can also be extracted from time series. Thus, in the time graph in Fig. 8.4, we see a number of peaks, which can be treated as events, extracted from the time series, and explored separately.
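Extracting such peak events can be as simple as scanning for local maxima above a threshold. A minimal sketch with synthetic values (real analyses may use smarter criteria, e.g. prominence):

```python
# Synthetic series with three clear peaks.
values = [1, 2, 9, 3, 2, 8, 2, 1, 1, 7, 1]

def find_peaks(xs, threshold):
    # A peak is a value above the threshold that exceeds its left neighbour
    # and is not exceeded by its right neighbour.
    peaks = []
    for i in range(1, len(xs) - 1):
        if xs[i] > threshold and xs[i] > xs[i - 1] and xs[i] >= xs[i + 1]:
            peaks.append(i)
    return peaks

print(find_peaks(values, 5))  # -> [2, 5, 9]
```

Each returned index (together with its time reference) becomes an event that can then be analysed with the event-oriented techniques of this chapter.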
Having these examples in mind, let us proceed with a more general discussion of
the types and properties of temporal data.
8.2 Specifics of temporal phenomena and temporal data
As we have seen by examples of events and time series, the temporal aspect of a
phenomenon may have different meanings. It may be the time interval or moment
when something existed or happened. Things (entities) that happen at some times
or exist during a limited time period are called events. It can also be a succession
of time steps (moments or intervals) in the life of an evolving phenomenon whose
state (properties and/or structure) changes over time. A chronological sequence of
states of such a phenomenon is called a time series.
In analysing events and time series, it is necessary to account for the structure and
properties of time. Generally, time is a continuous sequence of linearly ordered el-
ements (moments), and it is often represented as an axis along one dimension of a
visual display. However, the time in which everything (including us) exists is not
only linear but also consists of recurrent cycles: daily, weekly, seasonal, etc. Many
phenomena and events depend on some of these cycles; for example, vegetation phe-
nomena (such as cherry blossoming) depend on the seasonal cycle, and activities of
people depend on the daily, weekly, and seasonal cycles. There may also be other,
domain-specific, temporal cycles, such as cycles of climatic processes, development
cycles of biological organisms, or manufacturing cycles in industry.
Most temporal phenomena have the property of temporal dependence, also known
as temporal autocorrelation: values or events that are close in time are usually more
similar or more related than those that are distant in time. For example, the current air temperature is usually quite similar to what it was ten minutes ago and may differ more from what it was a few hours ago. The temporal dependence needs to be taken into account, but one should not assume that the degree of similarity or relatedness is simply proportional to the distance in time. It is more reasonable to expect that the
dependence exists within some quite narrow time window the width of which de-
pends on the dynamics of the phenomenon, that is, how fast it changes. The time
range of the temporal dependence is often modified by temporal cycles. For exam-
ple, the air temperature observed at noon may be more similar to the temperature
that was at noon a day ago than to the morning temperature of the same day.
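Both the range of temporal dependence and the presence of cycles can be estimated numerically with lag autocorrelation. A minimal sketch on a synthetic hourly series with a 24-step (daily) cycle:

```python
import math

# Synthetic "hourly" series with a pure period-24 cycle (10 full cycles).
n = 240
xs = [math.sin(2 * math.pi * t / 24) for t in range(n)]

def autocorr(xs, lag):
    # Sample autocorrelation at a given lag (biased estimator).
    m = sum(xs) / len(xs)
    num = sum((xs[t] - m) * (xs[t + lag] - m) for t in range(len(xs) - lag))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

print(round(autocorr(xs, 24), 2))  # same phase of the cycle: close to 1
print(round(autocorr(xs, 12), 2))  # opposite phase: close to -1
```

The high value at lag 24 reflects exactly the effect described above: noon today resembles noon yesterday more than this morning.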
Statistics has developed specific methods for time series analysis that account for
the existence of temporal dependence and cycles. Thus, exponential smoothing2 is
an approach to time series modelling that gives higher weights to more recent values
in predicting the next value. There are methods that model periodic variation (called
“seasonality” in statistics, irrespective of the length and meaning of the period) and
can thus account for temporal cycles. In visualisation, temporal dependence and its
effect range can be revealed if distances in time are represented by distances in the
display space. The existence of periodicity can be detected from displays where
times are represented by positions along an axis, like in Figs. 8.1, bottom, and 8.4.
Comparisons between periods (cycles) can be easier to do after aligning the periods
along a common relative time scale, as in Fig. 8.2.
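The exponential smoothing mentioned above has a very compact recursive form: each smoothed value is a weighted average of the current observation and the previous smoothed value, so the weights of older values decay geometrically. A minimal sketch with made-up numbers:

```python
def exp_smooth(xs, alpha):
    # s[t] = alpha * x[t] + (1 - alpha) * s[t-1]; higher alpha gives more
    # weight to recent values.
    s = [xs[0]]
    for x in xs[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s

noisy = [10, 12, 9, 11, 30, 10, 11, 9, 12, 10]  # one spike at position 4
smoothed = exp_smooth(noisy, 0.3)
# The spike's influence is damped (smoothed[4] is well below 30) and decays
# geometrically in the following values.
```

Production time-series models (e.g. Holt-Winters) extend this recursion with explicit trend and seasonality components to account for temporal cycles.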
Like any data, temporal data may contain errors, such as wrongly recorded values or wrong time references, and other problems, in particular, missing data. Visualisation can often exhibit such problems quite prominently. We recommend reading the paper aptly titled “Know Your Enemy: Identifying Quality Problems of Time Series Data” [61]. It describes a range of automatic and visual methods for problem detection.
In looking for possibly wrong values, you need to account for the property of temporal dependence: a wrong value may be not only one that is dissimilar to all others but also one that appears realistic by itself yet differs much from its nearest neighbours in time. It may be reasonable to remove suspicious values in order to
avoid wrong inferences and the risk of missing important pieces of true information,
which may be less prominent than the erroneous data.
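The neighbour-based check described above can be sketched as a simple rule: flag a value if it deviates strongly from the median of a small window of its temporal neighbours. The window size and threshold here are illustrative choices, not prescriptions.

```python
import statistics

def suspicious(xs, window=2, threshold=5.0):
    # Flag indices whose value deviates from the median of up to `window`
    # neighbours on each side by more than `threshold`.
    flagged = []
    for i, x in enumerate(xs):
        nbrs = xs[max(0, i - window):i] + xs[i + 1:i + 1 + window]
        if abs(x - statistics.median(nbrs)) > threshold:
            flagged.append(i)
    return flagged

# 25.0 is a perfectly realistic temperature on its own, but not between 12s.
temps = [12.1, 12.4, 12.3, 25.0, 12.6, 12.8, 13.0]
print(suspicious(temps))  # -> [3]
```

A global outlier test would not catch such a value; only the temporal context makes it suspicious.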
Since most of the existing analysis methods cannot work when some data are miss-
ing, it is quite usual to “improve” data by inserting (“imputing”) some estimated
values. Many methods exist for this purpose; see, for example, [74]. Making use
of the property of temporal dependence, they estimate the missing values based on
available values in the temporal neighbourhood. However, different methods may
produce quite different estimations. The plausibility of the estimations depends on
how well the character of the variation is understood and accounted for in the cho-
sen method. If you really need to apply value imputation, we highly recommend you
first to explore the data to understand the variation pattern and to be able to choose
the right method. Having applied the method, it is wise to compare the previous and
new versions of the data using visualisations: the visible patterns should be similar,
and no additional patterns should appear in the new version.
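How strongly the imputation method matters can be seen in a tiny sketch: for a gap inside a trending series, linear interpolation follows the trend while last-value carry-forward flattens it (both are simplified stand-ins for the methods surveyed in [74]).

```python
def interpolate(xs):
    # Fill None gaps linearly; assumes each gap is bounded by known values.
    ys = list(xs)
    i = 0
    while i < len(ys):
        if ys[i] is None:
            j = i
            while ys[j] is None:
                j += 1
            step = (ys[j] - ys[i - 1]) / (j - i + 1)
            for k in range(i, j):
                ys[k] = ys[i - 1] + step * (k - i + 1)
            i = j
        i += 1
    return ys

def carry_forward(xs):
    # Fill each None with the last known value.
    ys = list(xs)
    for i in range(1, len(ys)):
        if ys[i] is None:
            ys[i] = ys[i - 1]
    return ys

series = [10, 12, None, None, 18, 20]
print(interpolate(series))    # -> [10, 12, 14.0, 16.0, 18, 20]
print(carry_forward(series))  # -> [10, 12, 12, 12, 18, 20]
```

Comparing visualisations of both results against the original, as recommended above, immediately shows which method preserves the variation pattern.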
2 https://en.wikipedia.org/wiki/Exponential_smoothing
Fig. 8.5: Time series of event counts by the days of the week and the hours of the day
are represented by lines in a time graph (top) and by bar heights in 2D histograms
with the rows corresponding to the days and the columns corresponding to the hours
(bottom).
Fig. 8.6: The time series shown in Fig. 8.5, top, have been smoothed and normalised.
The lines and bars are coloured according to the regions the time series refer to. The histogram with the dark grey
bars summarises the time series for all regions. Not surprisingly, both the time graph
and the histograms show that people took more photos of cherry blossoms in the daytime than at night and, generally, more events happened on weekends than on weekdays. However, there are differences between the regions. For example, in California (red), very few photos were taken on weekdays, and notably more events happened on Sundays than on Saturdays. The photos were mostly taken from hour 10 till hour 15. In the Northeast (blue), the weekend activities also dominate, but many events happened on weekdays as well, especially in the afternoons and evenings of Fridays. The weekend events were spread more widely over the daytime: the counts increase starting from hour 7 and remain high until hour 19 on Saturday and hour 18 on Sunday. The weekend patterns in the Vancouver-Seattle region (yellow) are similar to those in the Northeast, but the activities during the weekdays were also relatively high (compared to the highest counts for this region).
Generally, two-dimensional views as in Figs. 2.6, bottom, and 8.5, bottom, are good
for observing periodic patterns in data. In such displays, either both dimensions rep-
resent two time cycles (as the weekly and daily cycles in Fig. 8.5), or one dimension
represents some time cycle (as the cycle of the days of the year in Fig. 2.6) and the
other represents the linear time progression (as the sequence of the years in Fig. 2.6).
As already mentioned in Section 8.1, it is possible not only to construct time series from events but also to extract events from time series for separate analysis. Potentially interesting events include peaks, drops, and abrupt increases or decreases.
Figure 8.2 demonstrates another transformation that is possible for temporal data:
replacement of the absolute time references, which refer to positions along the time
line, by references to positions in a temporal cycle. In this way, we aligned the events
of mass photo taking within the yearly cycle, which allowed us to compare the start
and end times and the durations of the events in the different years. This can be seen
as transforming the event times from absolute to relative. Relative times are also
involved in time series obtained by aggregating events by time cycles.
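Converting absolute time references to positions in a cycle is a one-line transformation. A minimal sketch for the yearly cycle, with made-up event dates:

```python
from datetime import datetime

# Hypothetical event dates from different years.
events = [datetime(2011, 3, 28), datetime(2012, 3, 20), datetime(2015, 4, 12)]

# Replace each absolute date by its position in the yearly cycle (day of
# year); the years are deliberately ignored, as in Fig. 8.2.
relative = [t.timetuple().tm_yday for t in events]
print(relative)  # -> [87, 80, 102]
```

The same idea applies to other cycles: `t.weekday()` gives the position in the weekly cycle, `t.hour` in the daily one.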
Various transformations can be applied to values in time series. Smoothing is used
to diminish fluctuations and make the overall patterns more clear. Smoothing meth-
ods are well described in statistics textbooks and in educational materials available
on the web. Normalisation may be useful when there are multiple time series with
substantially differing value ranges. Thus, in our examples, the time series of the Northeast has much higher values than the others, and the time series of Toronto and Portland have much smaller values. If we want to compare the temporal variation
patterns in these regions, we can replace the absolute counts by the differences from
the means of the time series (for each time series, its own mean is used). The ranges
of these absolute differences also differ considerably between the time series; therefore, we
divide them by the standard deviations of the respective time series. We have ap-
plied such normalisation (after a little bit of smoothing) to the time series shown in
Fig. 8.5; the result can be seen in Fig. 8.6. Now we can see the similarities and differences between the variation patterns. Thus, we see quite high similarity between the patterns of Vancouver-Seattle (yellow) and Portland (green). The pattern of Toronto (purple) stands out prominently with its high peaks in the evenings of Thursday and Friday (days 4 and 5), as does that of California (red) with its particularly high peak at Sunday noon. There may be other approaches to value
normalisation (some have been discussed in Section 4.2.9), for example, relative to
the minimal and maximal values of the time series. Besides, other transformations
can be applied to the values. Thus, the values can be replaced by the differences
with respect to the previous values, or with respect to the values for a selected time
step, or with respect to the values in a chosen time series, and so on.
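The normalisation used for Fig. 8.6 — subtracting each series' own mean and dividing by its own standard deviation — can be sketched as follows. The two short series are synthetic, chosen only to show that series with very different ranges become directly comparable:

```python
import statistics

def z_normalise(xs):
    # Subtract the series' own mean and divide by its own (population)
    # standard deviation.
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return [(x - m) / sd for x in xs]

big = [100, 300, 200, 400]    # e.g. counts in a high-activity region
small = [1, 3, 2, 4]          # e.g. counts in a low-activity region

# Identically shaped series coincide after normalisation:
print([round(v, 3) for v in z_normalise(big)])    # [-1.342, 0.447, -0.447, 1.342]
print([round(v, 3) for v in z_normalise(small)])  # [-1.342, 0.447, -0.447, 1.342]
```

Min-max scaling or differencing (subtracting the previous value) fit into the same pattern: one small per-series transformation applied before plotting.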
Figure 8.7 schematically summarises the possible transformations that can be ap-
plied to two major types of temporal data, events and time series.
Fig. 8.7: A scheme of the possible transformations of temporal data. In absolute time, events can be grouped and integrated, aggregated by intervals along the time line into time series, or have their time references converted to positions in time cycles; events can also be extracted from time series. In relative time, events can be aggregated by intervals within time cycles. Time series in both absolute and relative time can be smoothed, normalised, and transformed by computing differences.
8.4 Temporal filtering
As discussed in Section 3.3.4, data filtering (or querying) is a very important oper-
ation in data analysis. A specific kind of querying/filtering that is used in analysing
temporal data is, obviously, selection of data based on their temporal references, i.e.,
temporal filtering. There are two common approaches to temporal filtering. First, it
can be done by selecting a continuous time interval within the time range of the
data. This kind of filter adheres to the linear view of time, in which time is treated
as a continuous linearly ordered sequence of time instants. Another approach is to
filter data according to the positions of their time references within one or more
time cycles. These two types of filtering can be called linear and cyclic, respec-
tively. Please note that temporal filtering, in essence, involves two operations: first,
selection of one or more time moments or intervals and, second, selection of data
whose temporal references correspond to the selected time(s). The linear and cyclic
methods of filtering differ in the criteria used for selecting times. A less common
method of temporal filtering is selecting times based on fulfilment of some query
conditions defined in terms of time-related components of data, such as values at-
tained by time-variant attributes, or existence of particular entities, or occurrence of
certain events [11, 9]. Similarly to cyclic temporal filtering, such conditions-based
temporal filtering can select multiple times.
Irrespective of the method of selecting times, the correspondence between time ref-
erences in data and the times selected can be defined in terms of different temporal
relationships, depending on what you wish to select. It can be exact coincidence,
containing or being contained, overlapping, or even being before or after the se-
lected times within a certain temporal distance.
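The distinction between linear and cyclic filtering amounts to which predicate is evaluated over the timestamps. A minimal sketch (the concrete conditions — a weekend late-morning window — are arbitrary examples):

```python
from datetime import datetime

def linear_filter(t, start, end):
    # Linear filtering: one continuous interval on the time line.
    return start <= t <= end

def cyclic_filter(t):
    # Cyclic filtering: positions within the weekly and daily cycles
    # (here: weekend days, hours 10-15).
    return t.weekday() >= 5 and 10 <= t.hour < 16

times = [datetime(2016, 4, 2, 11),   # Saturday, 11:00
         datetime(2016, 4, 4, 11),   # Monday, 11:00
         datetime(2016, 4, 9, 20)]   # Saturday, 20:00
print([cyclic_filter(t) for t in times])  # -> [True, False, False]
```

Note that a cyclic filter typically selects many disjoint intervals of absolute time, which is why the selection results need to be marked or re-aggregated in the displays, as discussed below.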
When there are several temporal datasets, temporal filtering can be applied to all of
them simultaneously or to some of them, according to the decision of an analyst.
Simultaneous temporal filtering of two or more datasets is useful for exploring re-
lationships between the respective phenomena and for having a coherent view of
everything that happened in the selected times. Individual application of temporal
filtering to some of the datasets may be useful for seeing particular parts of these
datasets in the overall context of the remaining data. It is also possible to apply
different filters to different datasets, which may be helpful for finding temporally
lagged relationships between phenomena or asynchronous similarities in their de-
velopments.
When you apply a filtering method that selects multiple time intervals, you need
to see the selection results in visual displays. When you have a temporal display
showing the data for the whole time under study, the selected times can be somehow
marked (highlighted) in this display. When you use a visualisation showing the data in an aggregated way, the aggregation needs to be applied to the selected data, and the
display needs to be updated. Aggregation of the selected data pieces is quite often a
useful operation that helps to obtain an overview of what has been selected.
In the following section, we shall introduce a selection of visual analytics ap-
proaches to analysing temporal data of various kinds.
8.5 Analysing temporal data with visual analytics
8.5.1 Events
In our example, we were dealing with a large number of events of the same kind. We wanted to know how many events happened, when, and where, but did not care about the specifics of the individual events. In such cases, helpful operations are integration or aggregation of events based on their temporal neighbourhood
(possibly, in combination with some other criteria). However, these operations are
not applicable when it is important to know what events appeared when and/or in
what sequence. Here we shall discuss three kinds of analysis tasks referring to event
sequences:
• In a single event sequence, observe patterns of event repetition.
• Having a large number of event sequences, find common patterns of event ar-
rangement.
• Having a set of standard event sequences with the same relative ordering, detect
temporal anomalies.
Repeated sub-sequences in a single sequence. The first task can refer not only to
a sequence of events but, generally, to any sequence whose elements can be treated
as symbols from some “alphabet” in a broad sense, that is, some finite set of items.
Thus, it may be a sequence of nucleotides in a DNA molecule, notes in a melody,
or words in a verse. On the one hand, each occurrence of a particular symbol can
be seen as an event. Such an event is usually not related to the absolute time (in
terms of the calendar and clock), but it has a specific position in the “internal time”
of the sequence. The “internal time” may be a simple chain of discrete indivisible
and dimensionless units, each containing one event, or, as in a melody, the time can
be continuous, and events (including pauses) may differ in duration. On the other
hand, to represent a sequence of happenings which we intuitively regard as events,
it is often possible to create a suitable “alphabet”. For example, it is not difficult
to imagine an alphabet suitable for representing events in a football (soccer) game,
such as passes, shots, tackles, fouls, etc. It may be quite a large alphabet if we
want to encode event details, such as who passed the ball to whom, but even in
such a case we can expect that particular symbols and sub-sequences of symbols
may occur repeatedly. If only the frequencies of the re-occurrences are of interest,
methods from statistics and data mining suffice. A need for visualisation arises when
we want to see where in the whole sequence the repetitions occur.
Fig. 8.8: Using arc diagrams for revealing the structure of a musical piece (left) and
a text (right). Source: [147]
Figure 8.8 demonstrates a technique called Arc Diagram [147], which addresses
this need. A computational analysis algorithm is applied to find matching pairs
of longest non-overlapping sub-sequences of symbols. In the visualisation, one of
the display dimensions (horizontal in Fig. 8.8) represents the internal time of the
sequence. Matching sub-sequences are connected by semi-transparent arcs whose
widths are proportional to the lengths of the sub-sequences and heights to the dis-
tances between them. The amount of shown detail can be controlled through filtering
by the sub-sequence length or by inclusion of a particular symbol.
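The computational side of this idea can be illustrated by a much simplified stand-in for the matching algorithm: finding all positions of repeated sub-sequences of a fixed length k. (Arc Diagram itself looks for the longest non-overlapping matches; this sketch only shows where the arc endpoints would come from.)

```python
from collections import defaultdict

def repeated_kgrams(seq, k):
    # Map every length-k sub-sequence to its start positions and keep only
    # those occurring more than once -- the candidates for connecting arcs.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {g: ps for g, ps in positions.items() if len(ps) > 1}

melody = "abcabdabc"   # a toy "alphabet" sequence
print(repeated_kgrams(melody, 3))  # -> {'abc': [0, 6]}
```

Each pair of positions in a result list corresponds to one arc; arc width would encode k and arc height the distance between the positions.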
It is not always necessary that matching sub-sequences are identical; they may be
similar or related in some other way. Thus, in analysing a football game, all se-
quences of passes starting with some player A passing the ball and ending with
the same player A receiving the ball back may be considered similar irrespective
headache before beginning to take drug A, and about one fourth of it started taking
drug B after the stroke while continuing to take drug A.
This simple example demonstrates the main idea of the approach. In Fig. 8.10, it is
applied to data describing a basketball game, which includes such events as success-
ful and unsuccessful shots, rebounds, steals, etc. The game is divided into episodes
where the ball is possessed first by team A and then by team B. The corresponding
event sequences are aligned by the possession change from team A to team B. The
blocks representing event groups are coloured according to the team possessing the
ball: light red for team A and light blue for team B.
Generally, to apply this idea to a given set of event sequences, it is necessary to
define (a) what events can be considered similar, (b) what relative times can be
considered sufficiently close, (c) what should be the representative time for a group
of similar events whose relative times of occurrence are close but not the same.
Imagine that we have data resulting from a population survey in which people were
asked to describe their daily routines: what actions they usually do at what times.
Each daily routine consists of actions (which are events in this case) and times when
they are usually performed. Obviously, similar actions are first of all actions of the
same kind: wake up, exercise, take shower, shave, etc. However, it is also possible
to treat taking shower and shaving as similar actions as they belong to the hygienic
routine. Likewise, it is possible to consider all kinds of sport activities as similar, or
to distinguish exercising at home, outdoors, and in a fitness centre. The definition
of similarity may change in the course of the analysis: you may start with taking
larger action categories for finding high-level general patterns of daily behaviours
and then progressively refine the categories for revealing variants of these general
patterns. For finding the most general patterns, it may be reasonable to omit some
events, for example, the lunch break during the work.
The closeness of action times is also defined depending on the desired level of gen-
eralisation. For example, for extracting highly general patterns, time differences of
30 minutes or even an hour can be tolerated, but the difference threshold needs
to be lowered for revealing finer distinctions. For a group of similar actions with
close times, the representative time interval may be the union of the time intervals
of all actions, or the intersection of these intervals, or something in between, such
as the interval from the first quartile of the action start times to the third quartile
of the action end times. There are also different possibilities for aligning the action
sequences: by the time of the day, for example, starting from 3 AM, when most peo-
ple sleep, or by the time of a certain action, for example, waking up or beginning of
daily occupation (work, study, household management, etc).
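The quartile-based representative interval mentioned above is straightforward to compute. A minimal sketch with made-up survey times (minutes after midnight):

```python
import statistics

# Hypothetical start and end times of one group of similar actions
# (e.g. "morning exercise") from five respondents, in minutes after midnight.
starts = [420, 430, 435, 450, 480]
ends = [440, 455, 460, 470, 510]

# Representative interval: first quartile of the starts to third quartile
# of the ends (statistics.quantiles returns the cut points [Q1, Q2, Q3]).
q_start = statistics.quantiles(starts, n=4)[0]
q_end = statistics.quantiles(ends, n=4)[2]
print((q_start, q_end))  # -> (425.0, 490.0)
```

Using the union of all intervals instead would give `(min(starts), max(ends))`, and the intersection `(max(starts), min(ends))`; the quartile variant is a compromise robust to a few unusual respondents.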
Additional information concerning the approach can be found in the paper [101]
and on a dedicated web site3 .
3 https://hcil.umd.edu/eventflow/
deviate from this ideal. To detect deviations and understand how to improve the performance, it is necessary to analyse data comprising many operation sequences.
The data need to be represented so that anomalies (that is, deviations from the ideal
procedure) are easy to spot.
To tackle this problem, Xu et al. [153] build on the idea presented in Fig. 8.11. This is a
widely known visualisation of the daily schedule of the trains connecting Paris and
Lyon created by Étienne-Jules Marey4 in the 1880s. The train stops are positioned
along the vertical axis according to their distances from each other. The horizontal
axis represents time. The train journeys are represented by diagonal lines running
from top left to bottom right (Paris – Lyon) and bottom left to top right (Lyon –
Paris) respectively. The slope of the line gives information about the speed of the
train – the steeper the line, the faster the respective train is travelling. Horizontal sections of the trains’ lines indicate whether the train stops at the respective station and how long the stop lasts. Besides, the density of the lines provides information about the frequency of the trains over time.
Fig. 8.11: Étienne-Jules Marey’s graphical train schedule. Source: [96, Fig. 7], discussed by E. Tufte [137].
4 https://en.wikipedia.org/wiki/%C3%89tienne-Jules_Marey
Fig. 8.12: The idea of Marey's graph applied to visualising the performance of a production process. Source: [153]
and right, and they are very easy to spot. To simplify the display and make the
anomalies even more conspicuous, groups of nicely parallel equally spaced lines
can be aggregated and represented by bands, as shown in the lower part of Fig. 8.12.
The most common representation for a time series of numeric values is the line chart
(Fig. 8.13A); however, the heatmap technique, in which data values are represented
by colours (Fig. 8.13B), is now also quite popular; see the descriptions of both
techniques in Section 3.2.4. The heatmap technique may be better suited to long time series and to multiple time series that need to be compared. However, the resource of display pixels is limited, and so is the maximal length of a time series that can be represented without reduction. Besides, it is hard to deny that values and their changes can be represented more accurately in a line chart than in a heatmap.
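The two encodings can be sketched in a few lines of Python (a minimal illustration using matplotlib; the synthetic series and the output file name are our own assumptions, not the book's data):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Synthetic numeric time series: a noisy daily cycle over 14 days, hourly values.
rng = np.random.default_rng(0)
hours = np.arange(14 * 24)
values = np.sin(2 * np.pi * hours / 24) + 0.3 * rng.standard_normal(hours.size)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 4))

# (A) Line chart: vertical position encodes the value.
ax1.plot(hours, values, linewidth=0.8)
ax1.set_ylabel("value")

# (B) Heatmap: the same series reshaped to days x hours; colour encodes the value.
grid = values.reshape(14, 24)            # one row per day
ax2.imshow(grid, aspect="auto", cmap="viridis")
ax2.set_xlabel("hour of day")
ax2.set_ylabel("day")

fig.savefig("timeseries_views.png")
```

Reshaping the series into a days-by-hours grid, as done for the heatmap, is itself a useful trick: periodic daily patterns become vertical stripes that are easy to spot.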
Fig. 8.13: Representation of a numeric time series (A) in a line chart, or time graph, and (B) as a heatmap.
Long time series may result from long-term observations or measurements, but they
may also be long due to a high frequency of data collection. For example, the duration of the time series in Fig. 8.13 is only 47 minutes while the time interval between the measurements is 40 milliseconds; hence, the length of the time series is 70,500
time steps (the data are positions of the ball between the two opposite goals in a
football game). As you can guess, both representations of this time series involve
substantial losses because the same horizontal positions have to be used for rep-
resenting multiple values. To see information without losses and more distinctly,
temporal zooming is needed, which means selection of a small time interval for
viewing in more detail. However, when you zoom in to a chosen interval, you do
not see the other data beyond this interval; hence, you cannot compare this part of
the time series to another one. A viable approach to dealing with this problem is
to create separate displays for selected time intervals and juxtapose these displays
for comparison (Fig. 8.14). It is also possible to represent selected parts of the time series by lines superposed in the same display area, as shown on the bottom right of Fig. 8.14. For this purpose, the selected intervals need to be temporally aligned, that is, the absolute times need to be transformed to relative ones with respect to the starting times of the intervals.
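The interval selection and temporal alignment described above can be sketched as follows (a hedged illustration with synthetic data imitating the 40-ms sampling of the football example; the function name is our own):

```python
import numpy as np

def align_intervals(times, values, intervals):
    """Extract sub-series for the given (start, end) time intervals and shift
    each one to a relative time axis starting at 0, so that the parts can be
    superposed in the same plot area for comparison."""
    aligned = []
    for start, end in intervals:
        mask = (times >= start) & (times < end)
        aligned.append((times[mask] - start, values[mask]))
    return aligned

# Hypothetical ball-position series: 47 minutes sampled every 40 ms,
# i.e. 70,500 time steps, as in the example of Fig. 8.13.
times = np.arange(0, 47 * 60 * 1000, 40)   # timestamps in milliseconds
values = np.sin(times / 30000.0)           # synthetic positions

# Select two one-minute intervals and align them for superposed display.
parts = align_intervals(times, values, [(300_000, 360_000), (1_500_000, 1_560_000)])
```

After alignment, every selected part starts at relative time 0, so plotting all of them over the same axis directly supports the superposition shown on the bottom right of Fig. 8.14.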
It can be noticed in Fig. 8.14 that different parts of a long time series may have
quite similar patterns of temporal variation, which manifest as similar line shapes.
Such repeated patterns are called “motifs”. Purely visual discovery of motifs in long
time series may be a daunting task. It is more sensible to apply one of many exist-
ing computational techniques for motif discovery [136] and then mark the detected
motifs, for example, as shown in Fig. 8.15. Apart from detecting repetitions, motif
discovery techniques may also be helpful for finding anomalies in time series that
are expected to be highly regular, for example, in electrocardiograms. An anomaly
is a segment whose shape differs from the main motif.
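A naive, brute-force version of motif discovery can illustrate the idea (the computational techniques surveyed in [136], e.g. matrix-profile methods, are far more efficient; the planted-bump data and the function are illustrative assumptions):

```python
import numpy as np

def find_motif_pair(series, w):
    """Brute-force motif discovery: return the start indices of the two
    non-overlapping length-w subsequences with the smallest Euclidean
    distance, together with that distance. O(n^2) -- for sketching only."""
    windows = np.lib.stride_tricks.sliding_window_view(series, w)
    n = len(windows)
    best, best_pair = np.inf, None
    for i in range(n):
        for j in range(i + w, n):      # skip trivially overlapping matches
            d = np.linalg.norm(windows[i] - windows[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best_pair, best

# A noisy series with the same bump planted twice -- the motif to discover.
rng = np.random.default_rng(1)
series = 0.5 * rng.standard_normal(300)
bump = 3.0 * np.sin(np.linspace(0, np.pi, 20))
series[50:70] = bump
series[200:220] = bump

(i, j), dist = find_motif_pair(series, 20)   # recovers the planted positions
```

The detected start indices can then be used to mark the motif occurrences in the time series display, as in Fig. 8.15; segments far from every motif are candidate anomalies.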
An interesting property of temporal phenomena is that different variation patterns
may exist at different temporal scales. As an obvious example, let us consider the
weather phenomena. You can focus on the changes at the daily scale and see that
Fig. 8.14: Selecting parts of a long time series and representing the selected parts in
separate displays. On the bottom left, graphs of several selected parts are juxtaposed
for comparison; on the bottom right, other selected parts are represented by lines
superposed in the same plot area. Source: [145].
nights are typically cooler than days. Moving to a higher scale, you will see seasonal changes. To observe the manifestation of the El Niño–Southern Oscillation5, which occurs every few years, you need to go further up. A still higher scale is required for
seeing long-term trends, such as global climate changes. When you apply computa-
tional analysis methods, such as motif discovery, to detailed, low-level time series
5 https://en.wikipedia.org/wiki/El_Ni%C3%B1o%E2%80%93Southern_Oscillation
Fig. 8.16: Right: patterns of temporal variation in groups of similar daily time series.
Left: the distribution of the distinct daily patterns over a year. Source: [151].
(such as hourly temperature values), you will only be able to find patterns of the
lowest possible scale. When time series are represented visually, there is a chance
that your vision (which includes perception by the eyes and processing by the brain)
will perform temporal abstraction allowing you to observe higher-scale patterns;
however, chances are quite low when the time series are long. Besides, the skill of
seeing the forest for the trees, that is, abstracting from details, may require training.
Therefore, it is reasonable to use data transformation methods that reduce details
and produce higher level constructs.
Let us consider an example of an approach applicable when it is known or expected
in advance at what temporal scale interesting patterns may exist. The approach is
demonstrated in Fig. 8.16 [151]. Here, the data describe the variation of the power demand of a facility during a year with an hourly temporal resolution. It can be expected that the variation is related to the temporal cycles: daily, weekly, and seasonal.
The analysts first divided the whole time series into 365 daily time series. Then they
applied a clustering method to these time series and obtained groups (clusters) of
similar time series. For each cluster, they computed the average time series. On the
right of Fig. 8.16, the average time series of the clusters are plotted together for
comparison, each cluster in a distinct colour. We see that all but one of the clusters have similar variation patterns, with low values in the night hours, a steep increase by hour 9, and a more gradual decrease in the afternoon. The patterns differ in the magnitude
of the increase. The remaining cluster has a pattern of constant low value throughout
the day (blue line).
Having assigned distinct colours to the clusters, the analysts created a calendar dis-
play shown in Fig. 8.16, left. The squares representing the days are painted in the
colours of the respective clusters. The display thus shows which patterns of daily variation occurred on which days. We can observe weekly patterns (in particular, all
Saturdays and Sundays are painted in blue) and a seasonal pattern, with the green
colour occurring in colder months of the year and magenta in the summer.
Contemplating this example, we can formulate the following general approach.
First, decide what time unit is appropriate for the time scale at which you wish to
discover patterns. For example, for observing weekly patterns, the appropriate unit
is one day. Then, divide the time series into chunks of the length of the chosen unit.
Apply some method, such as clustering, for grouping the chunks by similarity. In the
original time series, replace each chunk by a reference to the corresponding group.
You can now visualise the time series representing the groups by colours, similarly
to Fig. 8.16, left, but, perhaps, using another arrangement if your data do not refer
to a calendar year. Please note that the transformed time series can be treated as
an event sequence; hence, you can try the Arc Diagram method (Fig. 8.8). You can
also apply suitable methods for computational analysis, such as sequential pattern
mining.
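The chunk-cluster-relabel procedure can be sketched as follows (assuming scikit-learn is available; the synthetic power-demand data merely imitate the situation of Fig. 8.16 and are not the original data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical hourly power-demand series for one year: working days have a
# midday peak, weekend days stay flat at a low level.
rng = np.random.default_rng(2)
weekday = 10 + 40 * np.exp(-0.5 * ((np.arange(24) - 13) / 4.0) ** 2)
weekend = np.full(24, 12.0)
days = [(weekend if d % 7 in (5, 6) else weekday) + rng.normal(0, 1, 24)
        for d in range(365)]
chunks = np.array(days)                  # one chunk (row) per day

# 1. Group the daily chunks by similarity.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(chunks)

# 2. Replace each chunk by a reference to its group: the year becomes a
#    sequence of 365 labels that can be shown in a calendar display or
#    analysed as an event sequence.
labels = km.labels_

# 3. The cluster means serve as representative daily patterns (cf. Fig. 8.16).
patterns = km.cluster_centers_
```

With the year reduced to a label sequence, weekly periodicity shows up as a repeating 7-step pattern, and sequential pattern mining or an Arc Diagram (Fig. 8.8) can be applied directly.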
However, the scales at which interesting patterns may exist are not always known
in advance, or you may want to see patterns at several scales simultaneously. There
is a technique that may help you [125]. As shown in Fig. 8.17, you create a two-
dimensional view where the horizontal axis represents the time of your time series
and the vertical dimension will be used for the time scale. At the bottom of the graph,
you represent your time series at the original level of detail using the heatmap tech-
nique; each value is represented by a coloured pixel. This is the time scale of level
1. Then you go up along the time scale axis. For each time scale of level w, starting
from level 2, you take all value sub-sequences consisting of w consecutive values
from the previous level and derive a summary value from each sub-sequence using
a suitable aggregation operator, such as the mean, variance, or entropy. This sum-
mary value is represented visually by a coloured pixel. Its vertical position is w and
the horizontal position is the centre of the sub-sequence represented by this pixel.
Obviously, the number of pixels decreases as the time scale increases; therefore,
you obtain at the end a coloured triangle as in Fig. 8.17, right. The colour variation
patterns at different heights reflect the value variation patterns existing at different
temporal scales.
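A minimal sketch of the triangle construction follows (a simplified variant of our own that aggregates windows of the original series directly for every scale, rather than level by level as described in [125]; the series is synthetic):

```python
import numpy as np

def multiscale_triangle(series, agg=np.mean):
    """For each time scale w = 1..n, aggregate all windows of w consecutive
    values of the series with the given operator (mean, variance, entropy...).
    Row w has n - w + 1 summary values, so the rows form a triangle whose
    colour-coded rendering corresponds to Fig. 8.17."""
    n = len(series)
    rows = []
    for w in range(1, n + 1):
        windows = np.lib.stride_tricks.sliding_window_view(series, w)
        rows.append(agg(windows, axis=1))
    return rows

# A short series combining a seasonal cycle with a slow positive trend.
t = np.linspace(0, 1, 50)
series = np.sin(6 * np.pi * t) + 2 * t
rows = multiscale_triangle(series)
# Low rows retain the cycle; high rows smooth it out, exposing the trend.
```

Rendering each row as a line of colour-coded pixels, centred horizontally over the windows it summarises, yields the triangular display of Fig. 8.17.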
To give an example of interpreting such a visualisation, let us describe what can be
seen in Fig. 8.17, right. Here, the values for higher temporal scales are obtained by
computing the means from the underlying original values. A positive global trend
can be spotted in the upper part of the triangle, where the colours progress from
white to light red. It is notable that the time scales in regions T1a and T1b do not reveal a global trend while regions T2a and T2b clearly show it. The small time scales in region S depict the seasonal cycle in the data. Periods of rather low and
rather high values alternate. The fluctuations between these two states appear to be
Fig. 8.17: Multiscale aggregation of a time series of sea-level anomalies from Octo-
ber 1992 through July 2009 exposes patterns at different time scales. Source: [125].
Fig. 8.18: The triangular displays represent the changes of each value with respect
to all previous values. Source: [80].
are represented using the heatmap technique, with two distinct colour hues corre-
sponding to the value increase and decrease. In Fig. 8.18, where the shown data are
financial time series, namely, asset prices, the growth is represented in shades of
green and the decrease in shades of red. The figure shows the changes of the prices
of three assets. The smaller triangles show the changes computed for each asset individually, and the larger triangles show the changes normalised against the market
median growth rates. For this specific kind of data, the visualisation represents the
potential gains (in green) or losses (in red) from selling the assets at different times
depending on the time of purchasing them.
In the latter example, several time series were compared by representing each one
in an individual display and juxtaposing the displays. This approach can work when
you have a few time series, but what can you do when they are numerous? The earlier discussed example represented in Fig. 8.16 demonstrates a viable approach: clustering
by similarity. The original data in this example consisted of a single but very long
time series. The analysts divided it into many shorter time series. By applying a
clustering algorithm to these multiple time series, groups of similar time series were
obtained. For each group, a single representative time series was derived by aver-
aging. Instead of a very large number of individual time series, a small number of
representative time series could be visually explored and compared. This approach
can be applied to sets of comparable time series. This means that the time series
must have the same length (number of time steps) and temporal scale (years, days,
minutes, or something else but common for all). Even more importantly, it should
be meaningful to compare the values of the attributes by judging which of them is
larger and which is smaller. Thus, it would be senseless to compare the use of elec-
tric power to the number of people at work (while it may be interesting to look for a
correlation), but it may be quite reasonable to compare the power usage in different
departments of a company.
When you have time series of diverse incomparable attributes, you may wish to
know whether these attributes change in similar ways. In this case, you can trans-
form the original attribute values into relative values expressing the variation, for
example, into the standard score, also known as z-score, which is the difference
from the attribute’s own mean value divided by the standard deviation (we applied
this transformation in Fig. 8.6), or use another form of normalisation, as discussed
in Section 4.2.9. A different case is when you wish not to compare the ways in
which the individual attributes vary but to understand how they all vary together,
that is, how the combinations of the attribute values change over time. This case is
discussed in the following.
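The z-score transformation mentioned above can be sketched as follows (the attribute values are invented purely for illustration):

```python
import numpy as np

def z_score(series):
    """Standard score: difference from the series' own mean, divided by its
    standard deviation. Makes time series of incomparable attributes
    comparable in terms of their relative variation."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

power = np.array([100.0, 120.0, 150.0, 90.0, 140.0])   # power usage, kW
people = np.array([12.0, 15.0, 20.0, 10.0, 18.0])      # people at work

# The raw values cannot be meaningfully compared, but their z-scores can:
# here both series reach their relative peak at the same time step.
zp, zn = z_score(power), z_score(people)
```

After the transformation, each series has mean 0 and standard deviation 1, so the shapes of their variation can be compared directly, as was done in Fig. 8.6.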
Let us first try the clustering approach. We take one of the most popular clustering algorithms, k-means, and apply it to our sequence of 521 states. The main parameter of k-means is k, the number of clusters. We try different values and compare
the results regarding the cluster distinctiveness but also looking at the cluster sizes,
as we would not like to deal with many clusters most of which have one or two mem-
bers. For assessing the cluster distinctiveness, we use a 2D projection of the cluster
centres, which are the cluster means in our case (Fig. 8.19, top left). Close positions of two or more points in the projection mean that the profiles of the respective clusters are similar; hence, it is better when all points representing the clusters are quite
far from each other. Following the approach introduced in Section 4.5.2, we put the
Fig. 8.20: The temporal distribution of the clusters of the states is represented by
a 2D arrangement of squares painted in the cluster colours. Each row consists of
52 squares corresponding to 7-day (one-week) intervals; hence, the rows roughly
correspond to the years. The sizes of the squares show how close the states are to
the centres of their clusters; the closer, the bigger.
in the year than the states from the other clusters characterised by high activities
in the other regions. High activities in California (cyan cluster) happened later than in the Northeast (blue cluster), and high activities in Vancouver-Seattle happened sometimes later and sometimes earlier than the major peaks of the activities in the Northeast.
By this example, we see how clustering of complex states by similarity and visu-
alisation of the temporal distribution of the clusters can be helpful in investigating
the development of a phenomenon or process. Let us now try the other way of using
the similarity measure: data embedding, or projection (this general approach to data
analysis has been described in Section 4.4). We use the same similarity measure as
for the clustering. The projection we have obtained is shown in Fig. 8.21, left. The
states are represented by dots on a 2D projection plot. The majority of the dots are
densely packed in a small area in the lower left quadrant of the plot, which means
high similarity of the corresponding states. In the remaining three quadrants, the
dots are highly dispersed, which means low similarity between the states. For com-
parison with the previously obtained results of the state clustering, we have painted
the dots in the colours of the clusters; we shall refer to them later on.
In the middle of Fig. 8.21, the dots are connected by straight lines in the chronologi-
cal order. Theoretically, this should allow us to trace the state succession; practically,
we cannot do this due to the display clutter. On the right, the dots are connected by curved line segments instead of straight ones. This has improved the display appearance, and some traces have become slightly better separated. Furthermore,
we have coloured the line segments according to the years when the states hap-
pened, from dark green for 2007 through white for 2011 to dark purple for 2016.
This improves the traceability of the trajectory.
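A sketch of constructing such a time curve follows (we substitute PCA for the distance-based projection used in our example purely for brevity; the "states" and all names are illustrative assumptions, and scikit-learn plus matplotlib are assumed available):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from sklearn.decomposition import PCA

# Hypothetical weekly "states": rows of attribute values, ordered in time.
rng = np.random.default_rng(3)
states = np.cumsum(rng.normal(0, 1, size=(120, 8)), axis=0)  # gradual evolution

# Project the states to 2D; any embedding by similarity would do here.
xy = PCA(n_components=2).fit_transform(states)

# Connect consecutive states and colour each segment by time, so that the
# trajectory -- the time curve -- remains traceable despite over-plotting.
segments = np.stack([xy[:-1], xy[1:]], axis=1)
lc = LineCollection(segments, cmap="PRGn")       # green-to-purple over time
lc.set_array(np.arange(len(segments)))
fig, ax = plt.subplots()
ax.add_collection(lc)
ax.scatter(xy[:, 0], xy[:, 1], s=10, color="grey")
ax.autoscale()
fig.savefig("time_curve.png")
```

Colouring the segments by time, as done here with a diverging colour map, reproduces the dark green–white–dark purple encoding described for Fig. 8.21, right.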
Bach et al. [28] call a trajectory representing the succession of states of a phe-
nomenon in a projection plot a time curve. They discuss the geometric character-
istics of time curves (Fig. 8.22, top) and how these characteristics can yield mean-
ingful visual patterns (Fig. 8.22, bottom). Clusters appear if a curve segment has
Fig. 8.21: Left: a 2D projection of the states by similarity. Centre: the projected
points are connected by straight lines in the chronological order. Right: the projected
points are connected by curved lines.
a significantly denser set of points than its neighbourhood, or when it has a sig-
nificantly higher degree of stagnation. Related to clusters are transitions, i.e., curve
segments between clusters with a high degree of progression. The presence of multiple clusters and transitions evokes a dynamic process with different continuing states.
A cycle refers to the situation where a time curve comes back to a previous point af-
ter a long progression. U-turns indicate reversal in the process, and outliers indicate
anomalies. There are nice examples of time curves in the literature [28, 141], some
of which are reproduced in Fig. 2.12 and Fig. 7.17 in our book.
Since a time curve shows how a phenomenon evolves, its appearance, naturally,
depends on the character of this evolution. A curve may have a clear and easily in-
terpretable shape if the phenomenon evolves gradually, so that consecutive states are
similar, or it has several periods of stability and transitions between them. Although
our time curve (Fig. 8.21, right) does not look as nice as the examples from the pa-
pers, we still can see interpretable geometric shapes and patterns. The curve shows
us there were periods when no or very small changes happened; the corresponding
trajectory segments are very short and confined in a small area. Apart from these
periods, there were times of big changes, when consecutive states were quite dissimilar from each other. These changes are represented by long jumps. Most of the jumps out of the dense area are followed by jumps to other positions in the sparse parts of the display rather than by returns back; hence, periods of stability alternated
with periods of big changes.
Even a time curve with a simple and clear shape may not be good enough for exhibit-
ing periodic variation patterns. As we have shown in Fig. 8.20, periodic patterns can
be studied using appropriate temporal displays for representing clustering results.
The same can be done for state embedding. The idea is demonstrated in Fig. 8.23:
we apply continuously varying colouring to the background of the projection dis-
play, as we did previously for clusters (Fig. 8.19) and associate each state with the
background colour of the corresponding projection point. Then we can use the state
colours so obtained in temporal displays, analogously to the colours of the clusters
Fig. 8.22: Top: Geometric characteristics of time curves. Bottom: Examples of vi-
sual patterns in time curves. Source: [28].
Fig. 8.23: Left: continuous colouring has been applied to the background of the
projection plot. Right: the colours from the projection background are used in a
display similar to that in Fig. 8.20.
in Fig. 8.19, bottom, and Fig. 8.20. This approach can be used in addition to the
trajectory construction.
Let us now compare the results of the clustering and projection techniques. We see
in Fig. 8.21 that the clusters tend to occupy distinct areas in the projection. The largest
cluster 2 (orange) is the most compact, the areas of clusters 3 (yellow) and 1 (pale
pinkish) are a bit larger, and the areas of the remaining clusters are quite large,
which means high internal variance of these clusters. We also see that the clusters
are not fully separated in the projection. Let us make the display more informative
by representing characteristics of the states.
In Fig. 8.24, the states are represented by pie charts whose sizes encode the total
event counts and the segments show the proportions by the regions. By comparing
the appearances of the pie charts, we can make our own judgements concerning the
similarities and differences between the states and compare our judgements with the
results of the clustering and the projection. We find that in most cases neighbouring pies look similar, but there are exceptions, for example, a relatively big and almost fully blue pie in the left part of Fig. 8.24, top left. We take a closer look at the areas where close neighbours belong to different clusters, or isolated members of
some clusters are surrounded by members of another cluster (Fig. 8.24, bottom).
We see that neither method works ideally, and their results should always be treated as approximations. In clustering, a member of a cluster is closer (i.e., more
similar) to the centre of this cluster than to the centre of another cluster, but it can
be quite close to some members of another cluster, as mentioned in Section 4.5.3
in discussing the assessment of cluster quality. In projection, similarities in terms of
multiple characteristics usually cannot be represented by distances in a 2D or even
3D space without distortions, as discussed in Section 4.4.2. Therefore, it is useful to
combine both approaches, and it is necessary to investigate the results with regard
to the data that have been used for the clustering or projection.
8.6 Questions
• What are the main specific features of temporal phenomena and temporal data?
• What are the two major types of temporal data and what is the conceptual differ-
ence between them?
• How is it possible to transform one type of data into the other?
• Give examples of analysis tasks relevant to the different types of temporal data.
• What computational methods can help when temporal data are too large for
purely visual exploration?
• Give examples of the kinds of patterns that can exist in a temporal distribution
and in a temporal variation.
• What is a common approach to exploring periodic temporal distributions and
variations?
Fig. 8.24: In the projection plot, the states are represented by pie charts with the
segments showing the event counts by the regions. The curve connecting the states
is coloured according to the time of the state occurrence. Three fragments of the
display are shown in more detail, with coloured dots on top of the pies indicating the
cluster membership of the states. The frames mark cases of inconsistency between
the clustering and the projection, when objects are positioned closer to members of
other clusters than to members of their own clusters.
8.7 Exercises
Study temporal patterns of your email (or SMS or social media) activity. For this
purpose, extract time stamps and explore their distribution over time. It can be ex-
pected that the temporal distribution reflects the natural human activity cycles, with
daily and weekly periodic patterns, seasonal trends, and outlier activities related to public holidays, etc. Identify dates and times of unusual activities (too many or too few messages). Compare patterns for outgoing and incoming messages.
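A possible starting point for this exercise (a pandas sketch in which randomly generated timestamps stand in for the ones you would extract from your own mailbox export):

```python
import numpy as np
import pandas as pd

# Hypothetical message timestamps: replace these with timestamps extracted
# from your own email, SMS, or social media data.
rng = np.random.default_rng(4)
stamps = pd.to_datetime("2023-01-01") + pd.to_timedelta(
    rng.integers(0, 365 * 24 * 3600, size=2000), unit="s")
msgs = pd.Series(1, index=pd.DatetimeIndex(stamps)).sort_index()

by_hour = msgs.groupby(msgs.index.hour).sum()           # daily periodic pattern
by_weekday = msgs.groupby(msgs.index.dayofweek).sum()   # weekly pattern
by_day = msgs.resample("D").sum()                       # outlier days stand out

# Flag unusually busy days, e.g. more than 3 standard deviations above mean.
outliers = by_day[by_day > by_day.mean() + 3 * by_day.std()]
```

Plotting `by_hour` and `by_weekday` as bar charts, and `by_day` as a calendar-style heatmap, exposes the daily, weekly, and seasonal patterns discussed in this chapter.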
Chapter 9
Visual Analytics for Understanding Spatial
Distributions and Spatial Variation
Abstract We begin with a simple motivating example that shows how putting spatial
data on a map and seeing spatial relationships can help an analyst to make important
discoveries. We consider possible contents and forms of spatial data, the ways of
specifying spatial locations, and how to use spatial references for joining different
datasets. We discuss the specifics of the (geographic) space and spatial phenomena,
where spatial relationships within and between the phenomena play a crucial role.
The First Law of Geography states: “Everything is related to everything else, but
near things are more related than distant things”, emphasising the importance of
distance relationships. However, the spatial context, which includes the properties of
the underlying space and the things and phenomena existing around, often modifies
these relationships; hence, everything related to space needs to be considered in
its spatial context. We describe techniques for transforming and analysing spatial
data and give an example of an analytical workflow where some of these techniques
are used but, as usual, the main instruments of the analysis are human background
knowledge, capability to see and interpret patterns, and reasoning.
In mid-19th century London, there was a sinister killer on the loose – sinister because no one could see it and no one understood how it acted. We now know that this killer was cholera, and the primary way in which it spread was through people drinking contaminated water.
John Snow1 became famous for his investigation of the epidemic. He collected
geographically-referenced data of the homes in which people died, mapped the data
on a street map, analysed the distribution of the deaths, and reasoned about the pos-
1 https://en.wikipedia.org/wiki/John_Snow
Fig. 9.1: A fragment of John Snow's famous map, which enabled him to identify a cluster of cholera cases and the potential source (we have marked it with a red ellipse).
sible source of the disease. This approach, which was novel at that time, is now
the foundation of visually-focused quantitative analysis. Snow noticed that the deaths were clustered around one of the public water pumps, located in Broad Street. We have marked this location with a red ellipse on the fragment of Snow's map shown in Fig. 9.1. To test whether this correlation reflected a causal effect, he removed the pump's handle, rendering it unusable. The people – no doubt grumbling as they went – had to use an alternative pump, and the number of cholera deaths dropped. A further investigation showed that the water in this pump was contaminated by sewage from a nearby cesspit. This observation helped pave the way for modern epidemiology.
This simple but hugely influential study illustrates two important characteristics of
geographical data. The first is that non-uniform geographical patterns may indi-
cate some underlying process. In this case, the visually apparent spatial clustering
indicated that there might be something special in the area. The second is that different geographical datasets have a common reference frame, as the death cases and the water pumps in this example, enabling data that might otherwise be unrelated to be related. These are the reasons why studying geographical patterns in data
is worthwhile. This example also demonstrates that geographic data and phenomena
are studied by representing them on maps, which usually include multiple information layers. When several layers are put together (as the streets, houses, and water pumps in Snow's map), it becomes possible to notice relationships between the things and phenomena represented by them.
9.2 How spatial phenomena are represented by data
Any data that can be meaningfully related to spatial locations (sometimes with a
spatial extent) can be considered as spatial data. Data records may correspond to
one or more geographical positions, linear geographical features, or areas. Examples
are:
• The locations of water pumps (Fig. 9.1).
• The location at which a photograph was taken or a tweet was tweeted (Fig. 1.7).
• A building footprint (Fig. 9.1).
• The boundary of a country or an administrative district, such as a London ward
in Fig. 4.18.
• The origin and destination locations of a migrant.
• Stopping positions on a journey.
Spatial position can be quantified within a spatial reference frame using 2D coor-
dinates, which will be discussed in more detail in Section 9.2.4. This may be a lat-
itude/longitude pair or map coordinates. A single 2D coordinate describes a point
that may conceptually represent a single location (e.g., a GPS reading) or a larger
area (e.g., the location of a city on a map of a country). Spatial extents can be explic-
itly represented as sequences of positions that form lines (e.g., representing roads
or rivers) or polygons (e.g., representing countries or districts), or implicitly, using
numeric attributes. The type of spatial representation where discrete spatial objects
are described by records specifying their spatial locations is known as ‘vector data’.
In addition to the spatial locations, the records may also include values of various
attributes, which are usually called thematic attributes.
An alternative representation is ‘raster data’, which usually represents how a contin-
uous phenomenon, such as elevation or temperature, varies over space. It is a regu-
lar, typically fine-grained grid with cells containing attribute values. The grid itself
is georeferenced using coordinates as described above. Raster data are represented
visually on a map by continuous variation of colour, as demonstrated in Fig. 9.2.
The grid in raster data is characterised by its resolution. Thus, the example raster in
Fig. 9.2 has 1126 columns and 626 rows, and the cell size is 1 sq.km.
Spatial data can be object-based or place-based. Object-based data refer to entities
located in space and describe properties of these entities. Place-based data refer to
locations in space and describe the objects and/or phenomena located there. For an
example, let’s consider crime data. If data include the location and characteristics of
each crime, then these would be object-based data. If crime data are supplied in an
aggregate form by place (e.g. administrative unit) with crimes quantified as counts
per place, this would be a place-based representation. It is possible to transform from
object-based to place-based representations but not the other way around.
9.2.2 Georeferencing
Many spatial datasets are explicitly georeferenced, i.e., spatial positions are specified
explicitly using coordinates. Other datasets may be implicitly georeferenced using
geocodes, that is, text values for which the geographical coordinates of the corre-
sponding points, lines or polygons can be retrieved from appropriate geographical
lookup datasets. Common geocodes are country names, country codes, and postal
codes. Spatial administrative and census data are often implicitly georeferenced by
including codes of administrative units whose coordinates are specified elsewhere.
These are usually place-based spatial data.
Ultimately, for spatial analysis, it is necessary to have explicitly georeferenced data.
Obtaining explicit georeferences for data containing geocodes is achieved by join-
ing these data with data specifying the coordinates for these geocodes. For example,
customer data containing the customers’ addresses and/or postcodes can be georef-
erenced by looking up the spatial positions of the addresses or postcodes. City-level
data can be georeferenced using the city names and widely-available lookup tables.
Demographic data from a census can be georeferenced by obtaining spatial posi-
tions of the census districts from administrative boundary lookups, as we did for the
London wards in the example shown in Fig. 4.18.
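Such a lookup join can be sketched in a few lines of Python; the postcodes and coordinates below are hypothetical stand-ins, not real lookup data:

```python
# Hypothetical lookup table: geocode (postcode) -> (latitude, longitude)
lookup = {
    "N1 9GU": (51.5326, -0.1058),
    "EC1V 2NX": (51.5265, -0.0966),
}

# Implicitly georeferenced records: positions given only via postcodes
customers = [
    {"name": "A", "postcode": "N1 9GU"},
    {"name": "B", "postcode": "EC1V 2NX"},
]

# Join: attach explicit coordinates to each record via its geocode
for c in customers:
    c["lat"], c["lon"] = lookup[c["postcode"]]
```

The same pattern applies to country codes, city names, or census-district identifiers; only the lookup dataset changes.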
When two or more spatially referenced datasets are put on the same map, relation-
ships between the data from these datasets can be detected visually, as happened
when John Snow analysed the epidemic data. Such “visual join” may not always
work perfectly, in particular, due to occlusions of data from one dataset by data
from another dataset. However, it is possible to perform a computational join of spatially
coincident datasets using the spatial positions of the data items. Thus, in the
example of the cholera epidemic, the dataset containing the locations of the deaths
could be joined with the dataset containing the positions of the water pumps. One
possible way is to determine for each death record the nearest water pump and at-
tach an identifier of this pump to the record. Another possible way is to count for
each water pump the number of the deaths within a certain maximal distance, for
example, 500 metres.
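Both joining strategies can be sketched as follows; the identifiers and coordinates are hypothetical planar positions in metres, not the actual Soho data:

```python
import math

def nearest_feature(point, features):
    """Return the id of the feature (e.g. a water pump) nearest to `point`.
    `features` maps id -> (x, y); planar coordinates are assumed."""
    px, py = point
    return min(features, key=lambda f: math.hypot(features[f][0] - px,
                                                  features[f][1] - py))

def count_within(features, points, max_dist):
    """Count points (e.g. death records) within max_dist of each feature."""
    return {f: sum(math.hypot(x - fx, y - fy) <= max_dist
                   for x, y in points)
            for f, (fx, fy) in features.items()}

pumps = {"Broad St": (0, 0), "Other": (800, 0)}     # hypothetical positions
deaths = [(50, 20), (60, -30), (700, 10)]
nearest = [nearest_feature(d, pumps) for d in deaths]
counts = count_within(pumps, deaths, max_dist=500)
```

The first function attaches a pump identifier to each death record; the second counts deaths within a chosen maximal distance of each pump.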
Another example of spatial joining is complementing customer data with demo-
graphic data characterising the areas where the customers live. Having the cus-
tomers’ coordinates (which might first be obtained based on their addresses), the
areas containing these coordinates are determined, and then area-specific data are
retrieved from a demographic dataset and attached to the customers’ records. This
procedure is often used in geomarketing for guessing the likely characteristics of
the customers (e.g., their income or number of children) in order to plan targeted
marketing campaigns.
266 9 Visual Analytics for Understanding Spatial Distributions and Spatial Variation
Fig. 9.3: Distortions of the areas in the Mercator projection. All red ellipses
have equal sizes and shapes on the Earth surface but greatly differ when projected
on a plane. Source: https://en.wikipedia.org/wiki/Tissot%27s_indicatrix.
2 https://en.wikipedia.org/wiki/IERS_Reference_Meridian
There are also many coordinate systems that project parts of the ellipsoid onto a
planar surface. This transformation causes unavoidable distortions, and the type of
distortion depends on the projection techniques. Many countries whose spatial ex-
tents are relatively small have their specific Cartesian spatial reference systems in
which both areas and angles are preserved within acceptable tolerances. For larger
areas, it is not possible to project onto a plane whilst preserving both angles and
areas. The Mercator projection is good for global coverage and where directions
need to be preserved: it maintains angles and shapes at the expense of scale, which
continually increases towards the poles, as illustrated in Fig. 9.3. Other projections
may preserve areas but distort angles and, consequently, shapes. There are also pro-
jections that try to minimise the distortions of both areas and angles, preserving
neither.
As mentioned, geographical coordinates specify positions on the Earth’s surface.
Altitude/height and depth (in a sea) are usually measured in metres from this sur-
face.
3 https://en.wikipedia.org/wiki/World_Geodetic_System
spond to spatial density. These problems are reduced when the geographic extent of
the territory that is visualised is small. In viewing and interpreting a map of a large
territory, one should be very cautious in making judgements concerning distances,
areas, shapes, and densities.
Geographical and, more generally, spatial phenomena have some unique charac-
teristics that need to be understood and taken into account in analysing spatial
data.
car parking and public transport), and the position and nature of the competition. In
retail modelling, ‘gravity modelling’ quantifies the competition experienced by a shop
as a sum, over the competing shops, of terms that increase with a competitor’s size
and decrease with the distance to it. This illustrates that retail modelling needs to
consider other geographical data, including the competition, population, catchment
areas, and infrastructure.
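A minimal sketch of the gravity idea, using a Huff-style choice probability; the exponents `alpha` and `beta` are illustrative calibration parameters, not values from the text:

```python
def attraction(size, distance, alpha=1.0, beta=2.0):
    """Gravity-model term: attractiveness grows with shop size and decays
    with distance; alpha and beta are illustrative calibration exponents."""
    return size ** alpha / distance ** beta

def choice_probabilities(sizes, distances):
    """Huff-style probabilities that a customer at a given location
    chooses each of the competing shops."""
    scores = [attraction(s, d) for s, d in zip(sizes, distances)]
    total = sum(scores)
    return [sc / total for sc in scores]

# Two equally sized shops; the nearer one captures most of the demand
probs = choice_probabilities(sizes=[100, 100], distances=[1, 2])
```

In practice the exponents would be calibrated against observed customer flows.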
Large cities contain more crime than smaller cities. This is because crime involves
people, and there are more people in cities. Crime analysis needs to consider other
geographical contextual data to account for these effects, such as where people live
and how space in cities is used.
The properties of spatial dependence and interdependence are captured in “Tobler’s
First Law of Geography”4, which states: “Everything is related to everything
else, but near things are more related than distant things” [134, p. 236].
Implication: put your data in spatial context. The use of maps is crucial in
analysing spatial data. Maps can show not only spatial distributions of entities and
attributes and the distance/neighbourhood relationships between them but, very importantly,
the geographical context, i.e., other things and phenomena that spatially
co-occur with the things or phenomena under our study. In particular, the spatial
context can interfere with the influence of the spatial dependence. For example,
entities or attribute values located at a small distance from each other may not be
similar or strongly related due to the presence of a spatial barrier between them,
such as a river, a mountain ridge, or a country border. Seeing things in context is
very important for correct interpretation and making valid inferences. Thus, as we
noted concerning the John Snow example, it was important to see the streets in the
map for correct interpretation of the alignments formed by the dots representing the
deaths. In the example maps showing the distribution of the cherry blossom photos
in Fig. 9.4, the cartographic background allows us to understand the clustering and
alignment patterns observed at the larger and smaller scales.
4 https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography
Sometimes the precision at which data are supplied (the number of decimal places)
helps indicate the accuracy at which the data were collected. However, it often does
not, particularly for derived data or transformed data, where the precision of stored
values may be much higher than the precision of the original data. This is particu-
larly true for spatial coordinates, which are often recorded at a much higher preci-
sion than the equipment is able to measure.
Besides, coordinates that are present in data may have been obtained from lookup ta-
bles in which whole cities or other geographic objects are represented as points. The
coordinates may look highly precise, but this impression of precision is deceptive.
It is often possible to detect visually that data may refer to some set of “standard”
positions: when you put the data on the map, many data items will have exactly the
same position, and you will see notably fewer dots visible on a map than there are
records in the dataset.
Implication: it is not always clear what spatial scale of analysis the data sup-
port. Without knowing how accurate and precise the spatial data are, it is difficult
to know at what scale the spatial analysis can be done, even if the coordinates are
reported at a high spatial precision. Sometimes column headings and other metadata
can give clues. If not, then counting unique values of coordinates and drawing maps
of these locations will help establish whether there is a finite set of locations. This
can help in choosing appropriate scales of analysis.
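A quick check of this kind can be sketched as follows (the coordinates are hypothetical):

```python
def distinct_location_ratio(coords):
    """Share of records with a distinct coordinate pair. A value far below 1
    suggests the data snap to a finite set of 'standard' positions,
    e.g. city-centre points from a lookup table."""
    return len(set(coords)) / len(coords)

# 100 records, but only two distinct positions
coords = [(51.5074, -0.1278)] * 40 + [(52.4862, -1.8904)] * 60
ratio = distinct_location_ratio(coords)  # 2 / 100 = 0.02
```

A ratio this low indicates that the apparent coordinate precision says nothing about the true spatial resolution of the data.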
Fig. 9.4: Dependence of the spatial distribution patterns on the spatial scale. Top:
the distribution of the cherry blossom photos at the country scale reveals the regions
where the phenomenon mostly occurs. Bottom: at a city scale, the distribution is
related to parks and waterfronts.
data have been derived from detailed data by someone else, and we have no access
to the original data. By considering the example with the photos, it is easy to guess
that, depending on the choice of the areas for aggregating the data, different spatial
patterns may be captured while others may be destroyed. Thus, if the photo data
were aggregated by the states, we would never know that the photos are clustered
in particular big cities and, certainly, could not refer the distribution to parks and
waterfronts. Considering economic activity, using area units of the size of about
100 sq.km would reveal geographical trends at a national scale, whereas using area
units of about 1 sq.km size could disclose particular spots of high economic activity
and local trends within smaller geographical regions.
Implication: spatial analysis results are scale dependent. The implication is that
analysts must choose the right spatial scale for analysis depending on the data char-
acteristics (particularly, aggregation level and/or precision), on the one hand, and the
analysis goals, on the other hand. Often, a visual analytics approach can facilitate
finding a scale that produces patterns helpful to the analysis. It may also happen that
available data do not allow performing analysis at the required scale. Analysts may
have to search for more appropriate data or to modify the analysis goals.
Spatial partitioning divides space into discrete spatial units. Obviously, this can be
done in different ways. If we then summarise data within these spatial units, the results
are likely to be dependent on the way in which the space has been divided. This is
called the Modifiable Areal Unit Problem (MAUP) [107]. The reason for this effect
is that spatial distributions are non-uniform, as mentioned in Section 9.3.1. Spatial
patterns, such as high densities, clusters, and alignments, may be concealed when
covered by big units or split among several units. The fact that a pattern may be
destroyed by splitting into parts indicates that not only the sizes of the units are
important but also the delineation. Thus, even when a regular grid with equal cells
is used for partitioning, distribution patterns of grid-aggregated data may change
when the grid is slightly shifted or rotated. The notorious practice of gerryman-
dering5 defines boundaries of electoral districts so as to favour particular election
outcomes.
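The sensitivity of aggregates to grid placement can be demonstrated with a tiny one-dimensional sketch:

```python
import numpy as np

def cell_counts(xs, cell, offset):
    """Count points per 1-D interval of width `cell`,
    with the grid origin shifted by `offset`."""
    bins = ((np.asarray(xs) - offset) // cell).astype(int)
    return {int(b): int((bins == b).sum()) for b in np.unique(bins)}

points = [0.9, 1.1, 1.2, 2.9, 3.1]
a = cell_counts(points, cell=1.0, offset=0.0)  # {0: 1, 1: 2, 2: 1, 3: 1}
b = cell_counts(points, cell=1.0, offset=0.5)  # {0: 3, 2: 2}
```

The same five points yield a maximal cell count of 2 with one grid origin and 3 with the grid shifted by half a cell: a cluster visible in one partitioning is split in the other.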
In spatial analysis, we regularly partition space. Most of the transformations (Sec-
tion 9.4) involve partitioning space. Having detailed data allows us to find suitable
partitioning for the intended analysis, possibly, by trying different approaches. One
possible approach is data-driven space tessellation attempting to create spatial
units that enclose spatial clusters of data items, as illustrated in Fig. 9.5.
The approach involves two major operations. First, points are organised in groups
based on their spatial proximity so that the spatial extent of a group does not ex-
ceed a chosen maximal radius. The grouping (clustering) algorithm is described in
paper [21]. Second, the territory is partitioned into Voronoi polygons6 using the cen-
troids of the point groups as the generating seeds. When there are big empty areas,
as in our example in Fig. 9.5, additional generating seeds can be created in these
areas, for example, at predefined equal distances from each other. The use of this
method does not guarantee a perfect result. The result depends on the choice of the
maximal radius of a point group, and finding a suitable value may require several
trials. A good feature of this approach is that, by choosing a larger or smaller parameter
value, we can adjust the partitioning to the desired spatial scale of analysis
while preserving the patterns that exist at this scale.
5 https://en.wikipedia.org/wiki/Gerrymandering
6 https://en.wikipedia.org/wiki/Voronoi_diagram
9.3 Specifics of this kind of phenomena 273
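A simplified sketch of the grouping step is shown below; this is a greedy variant for illustration, not the actual algorithm of [21]. The resulting centroids could then be passed to, e.g., `scipy.spatial.Voronoi` to obtain the polygons.

```python
import numpy as np

def greedy_group(points, max_radius):
    """Greedy proximity grouping: each point joins the first group whose
    running centroid lies within max_radius, otherwise it starts a new
    group. Returns group memberships (index lists) and group centroids."""
    points = np.asarray(points, dtype=float)
    groups, centroids = [], []
    for i, p in enumerate(points):
        for g, c in zip(groups, centroids):
            if np.linalg.norm(p - c) <= max_radius:
                g.append(i)
                c[:] = points[g].mean(axis=0)  # update the running centroid
                break
        else:
            groups.append([i])
            centroids.append(np.array(p, dtype=float))
    return groups, np.array(centroids)

# Five well-separated pairs of points -> five groups
pts = [[0, 0], [0.2, 0.1], [10, 10], [10.1, 9.9], [20, 0],
       [20.2, 0.1], [0, 20], [0.1, 20.2], [10, -10], [9.9, -10.1]]
groups, cents = greedy_group(pts, max_radius=1.0)
```

As in the method described in the text, the parameter `max_radius` controls the spatial scale of the resulting tessellation.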
However, we cannot always decide how to partition the territory. We often get
data that have already been aggregated by some units, and we also often want to
aggregate available detailed data by predetermined areas for comparability with other
datasets.
Implication: data may contain data processing artefacts. The implication of this
is potentially quite serious: our analysis may be dependent on how the data have
been processed and prepared. Data-driven approaches that partition space taking the
data distribution into account can help reduce these effects. A smoothing filter can
also be used, so that data items located near boundaries can partially contribute to
several neighbouring units. By using visual analytics techniques for comparing the
results of summarised data partitioned in different ways, we can determine the extent
to which MAUP operates. This will help us determine appropriate spatial partition-
ing and how much confidence we can have in the results. However, we sometimes
have little choice, because the data we use may already be highly aggregated.
There are a number of reasons why we may need or wish to do a coordinate transformation.
If not all spatial datasets are in the same spatial reference frame, we would
transform coordinates so that they are in a common spatial reference frame. If we
need to do spatial analysis by means of a particular software tool, we should ensure
that our data are in a spatial reference frame that the tool accepts.
If we are doing our own calculations on the coordinates, we need to ensure that the
calculations are valid for the spatial reference frame. If we are using a country-specific
Cartesian spatial reference system, we can assume that it preserves
both angles and distances within acceptable tolerances; hence, the calculations can be
based on Euclidean geometry. If the coordinates in our data are latitudes and
longitudes, which are angular spheroid coordinates, we cannot use Euclidean
geometry and instead need to use great-circle calculations designed for such
coordinates.
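For latitude/longitude data, a great-circle distance can be computed with the standard haversine formula, here with the Earth radius approximated as 6371 km:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# Approximate London-to-Paris distance
d = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
```

Applying Euclidean formulas directly to degree values would give a meaningless result here, since one degree of longitude spans different distances at different latitudes.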
If our spatial analysis concerns angles, we need to ensure that our spatial reference
frame preserves angles (Mercator or a country-level Cartesian mapping system). If
our analysis is about area or density, we need to ensure that our spatial reference
frame preserves distances and areas.
Apart from transformations of geographical coordinates from one projection to an-
other, it may be interesting and useful to transform positions in a real, physical space
to positions in an imaginary, artificial space. Figure 9.6 demonstrates the idea of a
transformation that can be applied to positions of football (soccer) players on the
pitch. To understand how the players of a team position themselves in relation to
their teammates, we create an artificial “relative space” of the team [11]. The team
space is a Cartesian coordinate system with the origin in the team centre, which
is determined as the mean position of the team players, excluding the goalkeeper
and, possibly, “spatial outliers” – players that are distant from the others. The vertical
axis corresponds to the direction of the team’s attacks, i.e., towards the goal
of the opponent team. The horizontal axis is directed from left to right with respect
to the attack direction. The players’ coordinates on the pitch are then transformed
into the coordinates in the new coordinate system with the axes left-right and back-front.
9.4 Transformations of spatial data 275
Fig. 9.6: Transformation of “absolute” spatial positions of team players on the pitch
(right) into their relative positions with respect to the team centre (left).
Since the players are constantly moving in the course of the game, the position of
the team centre changes every moment. Hence, the team centre’s position needs to
be determined individually for each time step that the players’ position data refer to.
The players’ relative positions in each time step are calculated with respect to the
position of the team centre in this step. While the team centre’s position on the pitch
is constantly changing, its position in the team space remains the same – it is always
the point (0,0) – unlike the positions of the players. Figure 9.7 demonstrates how
the trajectories (tracks) that the players of a team made on the pitch during a game
look after a transformation into the team space. While on the pitch map (left) we see
an incomprehensible mess made by overlapping tangled lines, the map of the team
space (right) looks much tidier. We see that the lines corresponding to the players
(represented by different colours) form compact rolls within certain parts of the team
space. This indicates that the players have their “areas of responsibility” in the team
and tend to keep within these areas. You may note that the track of the goalkeeper,
which is at the bottom of the team space (in green), is extended vertically much
more than the tracks of the other players. This is not because the goalkeeper moved
more to the front and back than the others, but because the team centre moved to the
front and back with respect to the goalkeeper’s position.
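For a single time step, the transformation into the team space can be sketched as follows (assuming the goalkeeper and spatial outliers have already been excluded from the input):

```python
import numpy as np

def to_team_space(positions, attack_dir):
    """Transform absolute pitch positions (one time step) into team-relative
    coordinates: column 0 = left-right, column 1 = back-front."""
    positions = np.asarray(positions, dtype=float)
    centre = positions.mean(axis=0)            # team centre for this time step
    front = np.asarray(attack_dir, dtype=float)
    front = front / np.linalg.norm(front)      # unit vector towards opponents' goal
    right = np.array([front[1], -front[0]])    # perpendicular, pointing right
    d = positions - centre                     # displacements from the centre
    return np.column_stack([d @ right, d @ front])

# Three players, team attacking along the +x axis of the pitch
rel = to_team_space([[0, 0], [4, 0], [2, 1]], attack_dir=(1, 0))
```

By construction, the team centre always maps to the origin (0,0) of the team space, while the players' relative positions vary from step to step.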
Fig. 9.7: The trajectories (tracks) made by team players on the pitch (left) have been
transformed into the “team space”.
Don’t think that such coordinate transformations to artificial “relative” spaces are
only possible for a group of simultaneously moving objects. They can also be useful
for comparing properties of two or more spatial distributions that are not spatially
coincident. For example, imagine that you have spatial data describing personally
significant places of different individuals: home, work place, school or kindergarten
attended by children, usual places for shopping, sports, entertainment, and meet-
ings with friends. For each individual, there is a set of personal places with their
categories and coordinates. How can you compare the spatial distributions of the
personal places of different individuals when you are not interested in where on the
Earth these places are located, but you want to know how far from home each person
works and performs other activities and whether their places are spatially dispersed
or clustered? You can, for example, take the position of each person’s home to be the
origin of a new coordinate system, take the direction from the home to the work, or
to the most frequently visited place, as the vertical axis, and transform the positions
of all places into this coordinate system. Now you can superpose the distributions
of the places of two or more individuals within this common coordinate space and
compare these distributions visually and, if necessary, computationally.
9.4.2 Aggregation
You may have detailed spatial data referring to locations or objects that can be
treated as points in space (i.e., the spatial extents, if any, are negligible). An ex-
ample is our dataset with the data concerning the cherry blossom photos. To see
the spatial distribution of your data, you may represent each data item on a map
by a dot, as we did in Fig. 9.4. There is a problem here: multiple dots may overlap
and conceal each other. By varying the transparency of the dot symbols, you may
achieve the effect that dense spatial clusters become well visible, as was done
in Fig. 9.4. This is useful, but the dot symbols beyond the densely filled areas are
hardly visible. Another problem is that you cannot guess how many dots are in the
different densely packed regions and thus cannot compare the regions. To see a spa-
tial distribution more clearly, to quantify it, and to compare different parts of it, you
may need to aggregate the data by areal spatial units.
The general idea of spatial aggregation is simple: you partition the space into some
areal units (see Section 9.3.4), or you take some predefined partitioning, and cal-
culate for each unit the count of the original data items contained in it and, when
needed, summary statistics of the values of thematic attributes present in the data.
As a result, you obtain a set of data records referring to the spatial units, i.e.,
place-based data (Section 9.2.1). You can visualise these data, for example, on a
choropleth map, as in Fig. 2.8. While the procedure of counting and summarising
is always the same, the partitioning of the space can be done in different ways, as
will be discussed in the following. To illustrate how space partitioning affects the
result of aggregation, we shall use an example dataset containing the positions of the
London pubs. A fragment of a map portraying the detailed data is shown in Fig. 9.8.
We do not use the photo dataset because it refers to a very large territory, which is
much distorted when shown on a map due to the projection effects, as discussed in
Section 9.2.4.
9.4.2.1 Gridding
There are two variants of grid-based aggregation. One is to use a grid with quite
large cells, as in Fig. 9.9, where the cell dimensions are 1 × 1 km. In these example
maps, the counts of the pubs contained in the grid cells are represented by proportional
sizes (areas) of the circles drawn inside the cells.
Fig. 9.9: The London pubs data aggregated by a regular grid with square cells.
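The counting step of such grid-based aggregation can be sketched as follows (coordinates are assumed to be in projected units such as metres):

```python
import numpy as np

def grid_counts(xy, cell_size):
    """Discrete aggregation: count points per square grid cell.
    xy: (n, 2) point coordinates in projected units (e.g. metres)."""
    xy = np.asarray(xy, dtype=float)
    origin = xy.min(axis=0)                            # grid origin
    cells = ((xy - origin) // cell_size).astype(int)   # (col, row) per point
    counts = {}
    for c in map(tuple, cells):
        counts[c] = counts.get(c, 0) + 1
    return counts

# Three points, 1 km cells: two fall into the same cell
counts = grid_counts([[100, 100], [300, 900], [1500, 200]], cell_size=1000)
```

The resulting cell-indexed counts are place-based data that can be shown, as in Fig. 9.9, by proportional symbols inside the cells.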
Another, very different approach is to use a grid with tiny cells and transform our
data into a raster; this form of spatial data was mentioned in Section 9.2.1. A raster
represents data as continuous and smooth variation of some attribute over the terri-
tory. The example in Fig. 9.10 represents the variation of the density of the London
pubs. Here, the cell dimensions are 100 × 100 m. As can be seen from the example,
a raster is visualised in such a way that the cells are not discernible. The image
looks smooth thanks to the use of interpolation for determining the attribute values
corresponding to neighbouring screen pixels. Moreover, smoothing is involved not
merely in the process of drawing a raster but, more essentially, in the aggregation
procedure as such. The idea is that each data item contributes to the calculation of the
value not only in the cell containing this item but also in the neighbouring cells.
A suitable radius of the cell neighbourhood, i.e., the maximal distance at which
an item may have influence on a cell value, is specified by the analyst. The neigh-
bourhood radius is a parameter determining the degree of smoothing: the larger, the
smoother.
In particular, generation of a raster representing the densities of data items is done
using the statistical technique known as kernel density estimation (KDE)7 . The term
kernel refers to the function that is used for the aggregation, and the smoothing
parameter (i.e., the neighbourhood radius) is called the bandwidth. The two images
in Fig. 9.10 represent variants of the density raster generated by means of KDE with
the bandwidth of 1 km (upper image) and 500 m (lower image). It is easy to notice
that the larger bandwidth can be good for revealing larger-scale distribution patterns,
whereas the smaller bandwidth promotes more local patterns to show up.
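A minimal sketch of this kind of continuous aggregation with a Gaussian kernel follows; it is a simple direct implementation for illustration, not an optimised KDE routine:

```python
import numpy as np

def kde_grid(points, cell, bandwidth, extent):
    """Continuous aggregation: Gaussian kernel density estimate evaluated
    on a fine regular grid. `bandwidth` is the smoothing parameter."""
    xmin, xmax, ymin, ymax = extent
    xs = np.arange(xmin, xmax, cell) + cell / 2   # cell-centre coordinates
    ys = np.arange(ymin, ymax, cell) + cell / 2
    gx, gy = np.meshgrid(xs, ys)
    dens = np.zeros_like(gx)
    for x, y in points:                           # each point contributes to
        d2 = (gx - x) ** 2 + (gy - y) ** 2        # all neighbouring cells
        dens += np.exp(-d2 / (2 * bandwidth ** 2))
    return dens / (2 * np.pi * bandwidth ** 2 * len(points))

# One point at the origin, smoothed over a 10 x 10 window
dens = kde_grid([(0.0, 0.0)], cell=0.5, bandwidth=1.0, extent=(-5, 5, -5, 5))
```

Increasing `bandwidth` spreads each point's contribution over a wider neighbourhood, reproducing the effect seen when comparing the two images in Fig. 9.10.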
Since the fine-grained raster format and its visual representation aim at mimick-
ing continuous variation of a phenomenon, the variant of aggregation that produces
raster data can be called continuous spatial aggregation. The other variants, which
do not blur the spatial compartments but keep them explicit, can be called discrete
spatial aggregation.
If we compare the results of the continuous aggregation in Fig. 9.10 and the discrete
aggregation by a regular grid in Fig. 9.9, we may note that the distribution in the
discrete variant looks unnatural and, to some extent, misleading, because the areas of
especially high density of the pubs are “dissolved” and/or distorted. The continuous
variant is much more faithful in showing the density variation pattern. However, its
disadvantage is that it is hard to judge how many pubs are in different places. We can
consult the map legend to see how the raster values are represented by the colours,
but the problem lies in interpreting the values, not the colours, because the values
were calculated with smoothing applied.
Another disadvantage of the raster form is that you cannot see more than one raster
presented on a map, which complicates comparisons between different attributes
or different phenomena. Thus, if we apply discrete spatial aggregation to the dis-
tributions of the pubs, bars, and restaurants, we can obtain data where each spatial
7 https://en.wikipedia.org/wiki/Kernel_density_estimation
Fig. 9.10: The London pubs data aggregated into a raster (fine-grained grid) by
means of KDE with the bandwidth of 1000 metres (upper image) and 500 metres
(lower image).
Fig. 9.11: The data about London pubs, restaurants, and bars have been aggregated
by the same discrete compartments and can be visualised together. The counts of
the different facilities in the compartments are represented by sectors of pie charts.
attributes for the same compartments and visualising them together does not depend
on the shapes of the compartments that are used for discrete aggregation.
Fig. 9.12: The London pubs data aggregated by an irregular grid with polygonal
cells obtained by means of data-driven tessellation.
9.5 Analysis tasks and visual analytics techniques 283
Spatial data can also be aggregated into existing geographical units, such as units of
administrative division, census areas, electoral districts, etc. The use of such units
is often appropriate, as many kinds of them are specially designed for summarising
population-related data, so that they include approximately equal numbers of people
to whom the data refer (inhabitants, voters, workers, etc.). These units are usually
smaller where population density is high. You may have to use predefined regions
when you want to relate your data to other data that are only available in the form
aggregated by these regions. However, it is sensible to avoid using predefined
regions when you have no serious reasons for doing so. Unlike a data-driven division,
predefined regions may not represent the distribution of your data well enough, and
differences in the sizes of the regions will complicate comparisons and quantitative
judgements, as we discussed previously regarding irregular cells. Hence, unjustified
use of predefined compartments that are irrelevant to your data and analysis goals
combines the disadvantages of the regular and irregular tessellations and does not
offer any advantages.
Let us see in Fig. 9.13, top, how the distribution of the pubs looks after
aggregation by administrative districts, namely, London wards. In the area of the
City of London, where there is a dense concentration of pubs, we now see many
tiny districts with small pub counts, while the largest value is attained in the City of
Westminster, where the pubs are much sparser. These distortions of the distribution
patterns emerge due to the differences in the district sizes. The normalisation of
the values by the areas of the districts (i.e., transforming the counts into densities)
decreases the distortions (Fig. 9.13, bottom), but the overall pattern of clustering of
the pubs is not conveyed.
Hence, there are several conflicting criteria for choosing appropriate space partition-
ing for aggregating your data, including
• preservation of the distribution patterns,
• possibility to calculate and compare multiple attributes,
• possibility to aggregate several datasets for joint analysis,
• possibility of comparisons between places and quantitative judgements,
• possibility to relate your data to other data available in an aggregated form.
Each time when you consider applying spatial aggregation to your data, you need
to weigh the importance of these different criteria for your data and analysis
goals.
Fig. 9.13: The London pubs data aggregated by predefined districts (wards). Upper
image: the circle sizes are proportional to the pub counts. Lower image: the circle
sizes are proportional to the pub density, i.e., count divided by the land area.
showed examples of choropleth maps with different colour scales, and in Chapter 4,
where we used spatial data to describe the process of cluster analysis and demon-
strate the impact of the parameter setting on the result. We have earlier mentioned
(in Section 9.3.1) and would like to emphasise again the importance of representing
spatial data in the relevant spatial context, that is, to put your data on a background
map showing general geographic information (land and waters, countries and popu-
lated sites, roads and parks, etc.) that is appropriate to the spatial extent of your data
and the scale of the intended analysis. While the contextual information should be
visible on a map, it should be depicted so that your data can be easily seen on top
of the background, and you can focus on your data without being distracted by the
background. Hence, the colours in the background should not be too bright or too
dark, and the lines and labels should not be too prominent.
Currently, there are numerous online servers providing georeferenced map tiles, i.e.,
images containing pieces of large maps, which can be stitched together to create a
map of the territory relevant to your analysis8. Many of the map tile servers use
the open data from OpenStreetMap9 (OSM), which is a database of crowdsourced
worldwide geographic information and a set of data-based services. You can choose
between different map styles; not all of them are suitable for background maps.
Sometimes it may also be useful to put your data on satellite images, which are also
accessible in the form of georeferenced tiles.
There is a trick to reduce the prominence of a background map that interferes
with the visibility of your data: put a semi-transparent grey rectangle on top of
the background map and underneath your data. By varying the shade of grey and
the degree of transparency, you can create a suitable map display for your analysis.
We have applied this trick in generating the maps in the figures from 9.8 to
9.13.
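The compositing behind this trick amounts to alpha-blending a constant grey over the tile image; a minimal sketch, with illustrative shade and transparency values:

```python
import numpy as np

def grey_veil(tile, shade=0.8, alpha=0.5):
    """Blend a semi-transparent grey layer over a basemap tile
    (RGB array with values in [0, 1]) to subdue it before
    drawing data symbols on top."""
    return alpha * shade + (1 - alpha) * np.asarray(tile, dtype=float)

tile = np.random.rand(256, 256, 3)   # stand-in for a fetched map tile
subdued = grey_veil(tile, shade=0.8, alpha=0.5)
```

A lighter `shade` with higher `alpha` washes the background out more strongly; the data layer is then drawn over the subdued image.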
The OSM server can be queried for obtaining not only map tiles (i.e., images) but
also the geographical data in the vector form. It is possible to get the coordinates and
attributes of various geographical objects represented by points, lines, or polygons.
Thus, our example dataset with the locations of the London pubs has been created
by retrieving the data from OSM.
As we discussed in the previous section, before putting your data on a map, you
may apply transformations. In particular, aggregation and smoothing are very of-
ten used in geographic data analysis. Visual analytics workflows may involve spe-
cialised computational methods for spatial analysis10 , such as spatial statistics11 .
Geographic Information Systems12 (GIS) include many tools for spatial data trans-
formation, visualisation, and analysis. Professional analysis of spatial information
8 https://wiki.openstreetmap.org/wiki/Tiles
9 https://en.wikipedia.org/wiki/OpenStreetMap
10 https://en.wikipedia.org/wiki/Spatial_analysis
11 https://en.wikipedia.org/wiki/Geostatistics
12 https://en.wikipedia.org/wiki/Geographic_information_system
286 9 Visual Analytics for Understanding Spatial Distributions and Spatial Variation
is mostly done using this kind of software. Implementations of some methods for
spatial analysis can also be found in open packages and libraries.
In analysing spatial data, it may also be necessary to perform specific spatial cal-
culations, including computation of numeric measures (distances, lengths, areas,
angles), generation of geometric objects, such as spatial buffers, convex hulls, and
Voronoi polygons, and checking various geometric conditions (point on line, point
in polygon, crossing, intersection, etc.).
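Two of these calculations can be sketched in plain Python; the function names and test coordinates below are ours, and production work would normally rely on a GIS library instead:

```python
from math import asin, cos, radians, sin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine formula: great-circle distance between two WGS84
    points, in kilometres (mean Earth radius 6371 km)."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def point_in_polygon(x, y, poly):
    """Ray-casting test: is the point (x, y) inside the polygon given
    as a list of (x, y) vertices?"""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):                 # edge crosses the ray level
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

print(great_circle_km(51.5074, -0.1278, 48.8566, 2.3522))  # ~344 km, London to Paris
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(point_in_polygon(1, 1, square), point_in_polygon(3, 1, square))  # True False
```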
A very important analytical operation is data selection and filtering (Section 3.3.4)
based on the spatial positions and footprints of the data items. There are multiple
possibilities for specifying conditions for spatial queries and spatial filtering:
• containment in a given bounding box;
• containment in a specific area or one of several areas;
• being within a given distance from some location or object;
• lying on a given line;
• intersecting or overlapping a given spatial object or one of multiple objects;
• containment of a given location or object, or one of multiple locations or objects.
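The first and third conditions from this list can be sketched as simple predicates; the point records and thresholds below are hypothetical, and the distance uses a crude planar approximation that is adequate only for small extents:

```python
from math import cos, radians, sqrt

# Hypothetical point records: (name, latitude, longitude).
points = [("A", 51.52, -0.10), ("B", 51.75, -1.26), ("C", 51.51, -0.08)]

def in_bbox(lat, lon, south, west, north, east):
    """Containment of a point in a given bounding box."""
    return south <= lat <= north and west <= lon <= east

def within_distance(lat, lon, lat0, lon0, max_km):
    """Being within a given distance from a location, using an
    equirectangular approximation (one degree of latitude ~ 111.32 km)."""
    dlat = (lat - lat0) * 111.32
    dlon = (lon - lon0) * 111.32 * cos(radians(lat0))
    return sqrt(dlat ** 2 + dlon ** 2) <= max_km

central = [n for n, la, lo in points if in_bbox(la, lo, 51.45, -0.25, 51.56, 0.0)]
near = [n for n, la, lo in points if within_distance(la, lo, 51.51, -0.09, 2.0)]
print(central, near)   # ['A', 'C'] ['A', 'C']
```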
In the following, we shall describe an example of an analytical workflow that in-
cludes visualisation, spatial filtering, data transformations, and joint consideration
of two spatial datasets.
We shall analyse the spatial distribution of the London pubs using the dataset that we
used previously for illustrating the transformations of spatial data. From all example
maps showing these data, the best representation of the distribution is provided by
the raster maps in Fig. 9.10. We observe that the highest spatial density of the pubs
(a spot in the darkest shade of red) is in the area of the City of London. We also
see, especially in the map created with the smaller smoothing bandwidth (the lower
image in Fig. 9.10), two smaller dense clusters west of the largest and densest City
cluster. Our knowledge of the geography of London tells us that all these three “dark
spots” on the map are in the areas where not many people live but there are many
visitors, in particular, tourists. It can be conjectured that the pubs are here primarily
for serving visitors rather than locals from the neighbourhood.
We want to explore the other areas of high density of pubs and try to understand the
reasons for their existence. However, a raster map is not the most convenient tool
for such an exploration, because the map background covered by the raster is hard
to see. When we increase the transparency of the raster image, we see the
9.6 An example of a spatial analysis workflow 287
background, but the dark spots of high pub density become harder to see. To see both
the pub-dense areas and the spatial context well enough, we shall generate explicit
outlines of these areas using the following approach. First, we apply density-based
clustering (Section 4.5.1) to the locations of the pubs based on the spatial distances
between them. Since the spatial positions in our data are specified by geographic
coordinates according to the WGS84 reference system (Section 9.2.4), the spatial
distances are calculated using the great-circle distance function (Section 4.2.8). We
try several combinations of the parameters ‘neighbourhood radius’ and ‘number of
neighbours’ and conclude that the result obtained with 300 metres and 5 neighbours
is sufficiently good: there are 23 clusters (not too many) which are quite dense and
spatially compact (Fig. 9.14). The sizes of the clusters (counts of the members) vary
from 6 to 79. In total, the clusters include 327 pubs (14.45% of all), and 1,936 pubs
are treated as “noise”.
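For illustration, the clustering step could be reproduced with a naive DBSCAN-style procedure such as the sketch below. This is our own simplified version, run here on small planar coordinates rather than the pub data; for the real data one would plug in the great-circle distance and the 300-metre / 5-neighbour parameters:

```python
from math import dist as euclid

def dbscan(points, eps, min_pts, dist):
    """Naive DBSCAN: returns one cluster label per point (noise = -1).
    `min_pts` counts a point itself together with its neighbours."""
    n = len(points)
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1                # provisionally noise
            continue
        cluster += 1                      # start a new cluster from a core point
        labels[i] = cluster
        queue = list(neigh[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neigh[j]) >= min_pts:  # expand only from core points
                queue.extend(neigh[j])
    return labels

pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (9, 0)]
print(dbscan(pts, eps=0.3, min_pts=3, dist=euclid))   # [0, 0, 0, 1, 1, 1, -1]
```

The quadratic neighbourhood search keeps the sketch short; a real implementation (e.g. in scikit-learn) would use a spatial index.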
We filter out the “noise” and generate polygons enclosing the clusters. We first build
convex hulls around the clusters and then extend the hulls by adding 150 m wide
spatial buffers around them. The buffering is needed because the convex hulls alone
are too “tight”: without the buffers, many pubs would lie right at the borders of the
cluster areas.
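The convex-hull step can be sketched with Andrew's monotone-chain algorithm (the buffering step is omitted here; in practice one would use a GIS buffer operation on the resulting polygon):

```python
def convex_hull(points):
    """Andrew's monotone-chain algorithm; returns the hull vertices
    in counter-clockwise order, starting from the leftmost point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):  # z-component of (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# A hypothetical small cluster: the two interior points are dropped.
cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
print(convex_hull(cluster))   # [(0, 0), (2, 0), (2, 2), (0, 2)]
```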
Now we switch off the visibility of the map layer containing the individual pubs
as points and focus our exploration on the cluster areas whose outlines we have ob-
tained. We use zooming and panning to see the areas in detail, and we consider back-
ground maps generated from different variants of map tiles to find relevant spatial
context information. We find that the background showing the transportation lines
and hubs is definitely relevant to our investigation. In Fig. 9.15, the “pubby” areas
are shown as yellowish semi-transparent shapes on top of the background map with
the transit information. The sizes of the red dots represent the counts of the pubs in
the areas. We have capped the range of the values represented by the dot sizes at
a maximum of 26 pubs; hence, in the three most “pubby” areas that we identified
earlier, the corresponding counts (79, 37, and 28) are not depicted by proportional
symbols.
The map tells us that many of the pub clustering areas are situated at transit hubs.
It seems quite likely that a large part of the customers of these pubs are commuters.
However, there are also clusters that are not spatially associated with transporta-
tion facilities. A careful investigation of different kinds of spatial context of these
clusters (e.g., the cluster located north of the metro station Angel) did not reveal
anything special about these areas, except that not only pubs but also restaurants
and small shops are densely clustered there. We can conjecture that London seems
to have particular areas where people come to shop, dine, and socialise.
Now we want to investigate whether there is any association between the spatial
distribution of the pubs and the variation of the characteristics of the resident
population. We have downloaded open population statistics from the London
Datastore13.
Fig. 9.14: A result of density-based clustering of the London pubs based on the
spatial distances between them. Top: the coloured dots represent the members of
the clusters and the grey dots represent the “noise”. Bottom: the “noise” has been
filtered out, and the clusters have been enclosed in polygonal outlines.
In particular, there are data describing the deprivation of the population. The
data consist of numeric scores associated with small districts, called Lower Layer
Super Output Areas (LSOA)14 . The deprivation is characterised in terms of several
13 https://data.london.gov.uk/
14 https://en.wikipedia.org/wiki/Lower_Layer_Super_Output_Area
Fig. 9.15: A fragment of a map with the outlines of the pub clusters and the back-
ground showing the transportation lines.
criteria, which are combined into an integrated score called Index of Multiple Depri-
vation (IMD)15 . The component criteria, which have different weights in the IMD,
are deprivation in terms of income, employment, education, health, crime, barriers
to housing and services, and living environment. The choropleth maps in Fig. 9.16
show the spatial distributions of the IMD scores (top left) and some of its compo-
nent scores, namely, education, living environment, and crime. Higher values of the
scores indicate higher deprivation.
We use choropleth maps in which the value ranges of the attributes are divided into
five equal-frequency class intervals (Section 3.3.1), i.e., each of the corresponding
classes (groups of areas) contains approximately one fifth of the total number of the
areas. Such groups are called quintiles. In all maps, we apply a colour scale with
the shades of blue corresponding to lower deprivation scores and shades of red to
higher scores.
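Equal-frequency division can be sketched as follows. This is our own simplified version: it breaks ties by sort order, so equal attribute values may straddle a class boundary, whereas a careful implementation would keep them in the same group:

```python
def equal_frequency_groups(values, k=5):
    """Split `values` into k classes of (approximately) equal size;
    returns a class index (0 = lowest values) for each original value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(values)
    return classes

# Hypothetical deprivation scores for ten areas:
scores = [3.1, 18.0, 7.4, 25.2, 11.9, 1.0, 9.5, 30.3, 5.6, 14.8]
print(equal_frequency_groups(scores))   # [0, 3, 1, 4, 2, 0, 2, 4, 1, 3]
```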
As we know that the largest clusters of pubs in central London apparently serve
mostly visitors rather than the local population, we want to exclude the central
areas from further exploration, as they can skew our observations. We exclude
these 65 areas (1.3% of the 4,835 areas that we have) by means of spatial filtering
(Fig. 9.17). After that, we re-divide the value ranges of the deprivation attributes
into five equal-frequency intervals based on the remaining 4,770 areas, i.e., each in-
terval contains approximately 954 areas (the group sizes may slightly vary, because
15 https://en.wikipedia.org/wiki/Multiple_deprivation_index
Fig. 9.16: Spatial distributions of the compound deprivation score – IMD (top left)
and its components: education, skills, and training (top right), living environment
(bottom left), and crime (bottom right).
multiple areas may have the same attribute value and thus need to be in the same
group, even when the group becomes bigger than others). In the following analysis,
we shall investigate how the pubs are distributed among the area groups defined ac-
cording to the different deprivation scores. To be able to do that, we apply the spatial
calculation operation that counts the number of pubs in each area.
The existence of associations between the distribution of the pubs and the variation
of the area statistics cannot be detected straightforwardly by computing the statis-
tical correlation coefficients or looking at a scatterplot. This is because the statistical
distribution of the per-area pub counts is extremely skewed: 77.4% of the areas do
not contain pubs at all, 15.5% of the areas contain 1 pub each, 4.3% have 2 pubs
per area, and only 2.8% have 3 or more pubs per area. In total, there are 1,611 pubs
in 4,770 areas. In such a situation, a viable approach is to consider the distribution
of the pubs by groups of areas rather than over the individual areas. Our approach
is to define groups of areas based on values of area attributes, such as the depriva-
tion scores, and see how many pubs there are in each group. For comparability, the
groups should be equal in size (number of areas) or population. In the particular case
of the output areas designed for reporting the population statistics, equal-size groups
of areas are expected to have approximately equal population counts. It would still
be appropriate to check whether this expectation holds and adjust the division if
necessary; however, 193 records in the data that we use lack the population counts.
Therefore, we shall divide the areas into equal-size groups each containing approx-
imately 20% of the areas, and we shall look at the population counts in the groups,
keeping in mind that these counts may be incomplete.
Fig. 9.17: The areas in the centre of London have been excluded from the further
analysis by means of spatial filtering.
Our approach to the exploration is based on the following idea. We divide the areas
into equal groups according to values of some attribute characterising the groups,
and we compare the total counts of the pubs in the groups. As there are 1,611 pubs
distributed over 5 groups of areas, the average number of pubs per group should be
around 322. If the counts do not differ much from the average and from each other,
it means that there is no association between the distribution of the pubs over the
set of the areas and the characteristic of the areas that was used for the grouping.
Significant differences among the pub counts indicate the existence of an associa-
tion. The association is positive when the groups of areas with higher values of the
attribute have more pubs. If there are fewer pubs in the areas with higher attribute
values, it means that there is a negative association.
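The reasoning in this paragraph can be expressed as a crude heuristic (our own illustration, not a statistical test; the per-quintile counts below are hypothetical but sum to 1,611 as in the text):

```python
def association_direction(counts, expected, tolerance=0.2):
    """Crude reading of a quintile bar: 'none' when all group counts
    stay within `tolerance` of the expected per-group count; otherwise
    'positive'/'negative' depending on whether the highest quintile
    has more or fewer pubs than the lowest."""
    if all(abs(c - expected) <= tolerance * expected for c in counts):
        return "none"
    return "positive" if counts[-1] > counts[0] else "negative"

expected = 1611 / 5                                   # about 322 pubs per quintile
print(association_direction([310, 330, 325, 318, 328], expected))  # none
print(association_direction([520, 410, 330, 230, 121], expected))  # negative
print(association_direction([150, 240, 330, 400, 491], expected))  # positive
```

In the actual analysis this judgement is made visually from the segmented bars; the function only makes the decision rule explicit.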
We first consider the deprivation attributes, including the composite score IMD and
its components. The distributions of the pubs over the quintiles of the areas defined
according to different deprivation attributes are visually represented in Fig. 9.18,
top. The lower image shows the distribution of the population over the same groups
of the areas. Each distribution is represented by a segmented bar; the whole bar
length corresponds to the total number of the pubs in the upper image and to the total
number of the population in the lower image. Each bar is divided into 5 segments
corresponding to the quintiles of the areas. The colours from dark blue to dark red
correspond to the attribute values from the lowest to the highest. The lengths of the
segments are proportional to the counts of the pubs (upper image) and the counts
of the population (lower image) in the groups. The bars in both images correspond,
from top to bottom, to the following deprivation attributes: IMD score; education,
skills, and training score; health deprivation and disability score; crime score; living
environment score.
Fig. 9.18: Distribution of the London pubs (top) and population (bottom) over quin-
tiles of areas defined according to different deprivation attributes.
Fig. 9.19: Distribution of the London pubs (top) and population (bottom) over quin-
tiles of areas defined according to different attributes describing the population
structure.
Concerning the distributions of the population, which are shown in the lower image,
we can see that the groups defined based on IMD, education, and living environment
have quite similar population counts whereas the groups based on health and crime
are less equal, but the differences are not as large as we see in some of the bars in the
upper image. The upper image tells us that there is no association between the
distribution of the pubs and the composite deprivation index IMD. The same applies to the income
and employment deprivation (not shown in Fig. 9.18), which are the IMD compo-
nents having the highest weights. However, there are notable differences between
the per-group pub counts for the groups defined based on the education, health,
crime, and living environment deprivation. These differences indicate the existence
of associations. There is a strong negative association between the pub distribution
and the education deprivation, i.e., there are notably fewer pubs in the areas with
higher education deprivation. For the remaining three deprivation attributes shown
in Fig. 9.18, the associations with the pub distribution are positive. The association
with the living environment deprivation is especially strong, also taking into account
the approximately uniform distribution of the population over the groups of areas.
Hence, higher numbers of pubs are associated with worse living environment and
with higher levels of crime and health deprivation.
We apply the same approach to investigate how the spatial distribution of the pubs is
associated with the population structure of the output areas. Similarly to Fig. 9.18,
Figure 9.19 represents the distributions of the pubs over the area groups defined
according to the percentages of the white population and the national minorities
(abbreviated as BAME – Black, Asian and Minority Ethnic) and to the percentages
of the age groups 0-15 years old, 16-29, 30-44, 45-64, and 65+. We clearly see that
the distribution of the pubs is negatively associated with the percentages of the na-
tional minorities, children (0-15 years), and people aged 45-64 years, and positively
associated with the percentages of the white population and the age groups of 16-
29 and 30-44 years old. There is no association with the percentages of the elderly
people (65+ years).
This completes our investigation. Now we can report the following findings:
• The spatial distribution of the London pubs is strongly clustered. The largest and
densest clusters are located in the centre, where there are many tourists and other
visitors to the city. Many clusters are related to transportation facilities.
• The spatial distribution of the pubs is associated with some aspects of the de-
privation of the resident population, namely, education (negative association),
health, crime, and living environment (positive association).
• The spatial distribution of the pubs is also associated with the ethnic and age
structure of the resident population: there are more pubs in the areas with higher
percentages of white people and of the age groups 16-29 and 30-44, and fewer pubs in
the areas with higher proportions of ethnic minorities, children, and 45-64 years
old people.
In this investigation, we used multiple analytical techniques and operations: data
transformations, density-based clustering, basic spatial computations, filtering, and,
obviously, visual representations, including different types of maps. As is usual for
visual analytics workflows, our major activity was to observe and interpret distribu-
tion patterns, derive knowledge by means of analytical reasoning, and plan further
steps in the analysis based on what we have observed and learned so far.
9.7 Conclusion
• Which of the following distributions and variations allow or disallow the use of
spatial smoothing? Why?
– Measurements of air quality at sample locations throughout a town.
9.8 Questions and exercises 295
MMSI1), a time moment, the geographic position of the vessel at that time moment,
and the navigation status reporting the activity of the vessel such as ‘at anchor’,
‘under way using engine’, ‘engaged in fishing’, and others. Being arranged in the
chronological order and connected by lines, the positions of each individual moving
entity make a trajectory of this entity.
Our task in this analysis is to study when, where, and for how long the cargo vessels
were anchoring outside of the port and understand whether the events of anchoring
may indicate waiting for an opportunity to enter or exit the bay (through a narrow
strait) or the port of Brest. Naturally, we hope that the attribute ‘navigational sta-
tus’ will tell us which positions in the data correspond to anchoring. However, by
visual inspection of a sample of the data (Fig. 10.1), we notice that the values of
this attribute are quite unreliable. The upper left image shows a selection of points
from multiple trajectories in which the navigational status equals ‘at anchor’. It is
clearly visible that the selection includes not only stationary points but also point se-
quences arranged in long traces, which means that the vessels were moving rather
than anchoring. In contrast, the upper right image demonstrates several tra-
jectory fragments that look like hairballs. Such shapes are typical for stops, when
the position of a moving object does not change but the tracking device reports a
slightly different position each time due to unavoidable measurement errors. The
hairball shapes thus signify that the vessels were at anchor or moored at the shore,
but the recorded navigational status was ‘under way using engine’. The lower im-
age shows trajectory fragments of several vessels. The character of their movements
(back and forth repeated multiple times) indicates that they were fishing, i.e., the
navigational status should be ‘engaged in fishing’, but the value attached to the po-
sitions is ‘under way using engine’. This means that we need to identify anchoring
events without relying on the reported navigational status.
Since we are interested in anchoring of cargo vessels in the vicinity of the port
of Brest, we need to select only the relevant data from the database. In accordance with our
analysis goals, we exclude the data referring to fishing vessels and, for the remaining
vessels, select only those trajectories that passed through the strait at least once.
From these trajectories, we select only the points located inside the bay of Brest, in
the strait, and in the area extending to about 20 km west of the strait. As a vessel
might not be present in the study area all the time, there may be long temporal
gaps between some of the selected positions of this vessel. To exclude such gaps
from consideration, we split the trajectories into parts at all positions in the record
sequences where the time interval to the next point exceeds 30 minutes or the spatial
distance exceeds 2 km. Next, we further divide the trajectories by excluding the
stops (segments with near-zero speed) within the Brest port area. From the resulting
trajectories, we select only those that passed through the strait and had a duration of
at least 15 minutes. As a result of these selections and transformations, we obtain 1718
trajectories for the further analysis. Of these trajectories, 945 came into the bay from
1 https://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity
10.1 Motivating example 299
Fig. 10.1: Fragments of trajectories with wrongly reported navigation status: at
anchor (upper left), under way using engine (upper right and bottom).
the outer area, 914 moved from the bay out, and 141 trajectories include incoming
and outgoing parts.
The analysis goal requires us to identify the anchoring events. As we cannot rely
completely on the navigation status in position records (see Fig. 10.1), we apply the
following heuristics. We know that, in the vicinity of ports or major traffic lanes, ves-
sels do not anchor at arbitrary positions. There are special areas, called anchoring
zones, where vessels are allowed to anchor. Hence, we need to identify these anchor-
ing zones. For this purpose, we find the areas containing spatial clusters of vessel
positions with the corresponding navigational status being ‘anchoring’ (Fig. 10.2)
and ignore occasional solitary occurrences of records reporting anchoring in other
places. Next, we assume that any sufficiently long stop (at least 5 minutes) in an an-
choring zone corresponds to anchoring. We extract the stops made in the anchoring
zones, irrespective of the value of the attribute ‘navigational status’, into a separate
dataset for the further analysis. In this way, we get a set of 212 anchoring events
(which we shall simply call “stops”) that happened in 126 trajectories. Fig. 10.3
shows these trajectories in bright blue and the positions of the stops in red.
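The stop-extraction heuristic can be sketched like this (a simplified version: zone membership is a pluggable test on a one-dimensional stand-in position, and speed is ignored):

```python
from datetime import datetime, timedelta

def anchoring_stops(points, in_zone, min_duration=timedelta(minutes=5)):
    """Extract stops: maximal runs of consecutive records inside an
    anchoring zone whose time span is at least `min_duration`.
    `points` are chronologically ordered (time, position) records;
    `in_zone(position)` tests zone membership."""
    stops, run = [], []
    for t, pos in points + [(None, None)]:      # sentinel flushes the last run
        if pos is not None and in_zone(pos):
            run.append(t)
        else:
            if run and run[-1] - run[0] >= min_duration:
                stops.append((run[0], run[-1]))
            run = []
    return stops

def zone(x):                  # a hypothetical 1-D "anchoring zone"
    return x < 2.0

t0 = datetime(2016, 3, 1, 12, 0)
track = [(t0 + timedelta(minutes=m), x)
         for m, x in [(0, 9.0), (2, 1.0), (4, 1.1), (9, 0.9), (11, 9.0)]]
print(anchoring_stops(track, zone))   # one 7-minute stop, 12:02 to 12:09
```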
Since we want to understand how the stops are related to passing the strait between
the bay and the outer sea, we find the part corresponding to strait passing in each
trajectory. For this purpose, we interactively outline the area of the strait as a whole
and, separately, two areas stretching across the strait at the inner and outer ends of
it, as shown in Fig. 10.4. The segments of the trajectories located inside the whole
strait area (painted in yellow in Fig. 10.4) are treated as strait passing events. For
these events, we determine the times of vessel appearances in the areas at the inner
and outer ends of the strait (painted in red in Fig. 10.4). Based on the chronological
order of the appearances, we determine the direction of the strait passing events:
300 10 Visual Analytics for Understanding Phenomena in Space and Time
Fig. 10.2: Delineation of anchoring zones: The violet dots show all positions re-
ported as anchoring. The yellow-filled polygons outline anchoring zones containing
dense concentrations of the anchoring points beyond the port and major traffic lanes.
Fig. 10.3: The trajectories selected for analysis with the anchoring events (stops)
marked in red.
inward or outward with respect to the bay. Then we categorise the stops based on
the directions of the preceding and following events of strait passing.
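The categorisation of a stop by the directions of the surrounding strait passings can be sketched as follows (the times and labels are hypothetical):

```python
def categorise_stop(stop_time, passings):
    """Label a stop by the directions of the strait-passing events
    around it. `passings` is a chronological list of (time, direction)
    pairs for one trajectory; the label mirrors the 'before;after'
    categories used in the text."""
    before = [d for t, d in passings if t < stop_time]
    after = [d for t, d in passings if t > stop_time]
    return f"{before[-1] if before else 'none'};{after[0] if after else 'none'}"

# A vessel that entered the bay, anchored, and then went to the port
# (no further strait passing) falls into the most numerous category:
print(categorise_stop(10, [(5, "inward")]))                   # inward;none
print(categorise_stop(10, [(5, "outward"), (15, "inward")]))  # outward;inward
```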
The pie charts on the map in Fig. 10.5 represent the counts of the different cate-
gories of the stops that occurred in the anchoring areas. The most numerous cat-
egory ‘inward;none’ (105 stops) includes the stops of the vessels that entered the
bay, anchored inside the bay, and, afterwards, entered the port. The category ‘out-
ward;inward’ (36 stops) contains the stops of the vessels that exited the bay, an-
chored in the outer area, then returned to the bay and came in the port. 34 stops took
place before entering the bay (‘none;inward’), 18 happened after exiting the bay
(‘outward;none’) and 11 before exiting the bay (‘none;outward’). In 7 cases, vessels
entered the bay from the outside, anchored, and then returned back without visiting
the port (‘inward;outward’), and there was one stop that happened after entering the
strait at the inner side and returning back (‘in2in;none’).
Fig. 10.4: Interactively specified areas used for identifying fragments of trajectories
corresponding to travelling through the strait (yellow) and determining the travelling
direction (red).
Fig. 10.5: The pie charts represent the counts of the stops in the anchoring zones
categorised with regard to the directions of the preceding and following strait
passing by the vessels.
We see that the majority of the stop events (yellow pie segments) happened after
entering the bay and, moreover, a large part of the stops that took place in the outer
area happened after exiting the bay and before re-entering it (orange pie segments).
It appears probable that the vessels stopped because they had to wait to be served
in the port. Most of them were waiting inside the bay but some had to or preferred to
wait outside. Hence, the majority of the anchoring events can be related to waiting
for port services rather than to a difficult traffic situation in the strait.
Additional evidence can be gained from the 2D time histogram in Fig. 10.6, where
the rows correspond to the days of the week and the columns to the hours of the day.
It shows us that the number of the anchoring vessels reaches the highest levels on
the weekend (two top rows) and on Monday (the bottom row). It tends to decrease
starting from the morning of Wednesday (the third row from the bottom of the his-
togram) till the morning of Thursday (the fourth row), and then it starts increasing
again. The accumulation of the anchoring vessels by the weekend and gradual de-
crease of their number during the weekdays supports our hypothesis that the stops
may be related to the port operation.
Fig. 10.6: A 2D time histogram represents the counts of the anchoring events by the
hours of the day (horizontal axis) and days of the week (vertical axis) by the heights
of the corresponding bars.
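The counts behind such a 2D time histogram are obtained by binning the event times by two cyclic time references, day of the week and hour of the day (the timestamps below are invented):

```python
from collections import Counter
from datetime import datetime

def weekday_hour_histogram(timestamps):
    """Count events per (day-of-week, hour-of-day) cell; weekday 0 is
    Monday, as in datetime.weekday()."""
    return Counter((t.weekday(), t.hour) for t in timestamps)

events = [datetime(2016, 3, 5, 14, 10),   # a Saturday afternoon
          datetime(2016, 3, 5, 14, 50),   # same day and hour
          datetime(2016, 3, 9, 7, 5)]     # a Wednesday morning
hist = weekday_hour_histogram(events)
print(hist[(5, 14)], hist[(2, 7)])   # 2 1
```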
Now we want to look at the movements of the vessels that made stops on their way.
We apply dynamic aggregation of the vessel trajectories by a set of interactively
defined areas, which include the anchoring zones, the port area, the areas at the outer
and inner ends of the strait, and a few additional regions in the outer sea. The ag-
gregation connects the areas by vectors and computes for each vector the number of
moves that happened between its origin and destination areas. The result is shown
on a flow map, where the vectors are represented by curved lines with the widths
proportional to the move counts (Fig. 10.7). The curvatures of the lines are lower at
the vector origins and higher at the destinations.
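The counting of moves between areas can be sketched as follows (the one-dimensional positions and area names are invented; the aggregation is "dynamic" simply in the sense that such counting is re-run on the currently filtered subset of trajectories):

```python
from collections import Counter

def aggregate_flows(trajectories, area_of):
    """Count moves between consecutive distinct areas visited along
    each trajectory. `area_of(position)` maps a position to an area
    name, or None when it lies outside all areas of interest."""
    flows = Counter()
    for traj in trajectories:
        visited = [a for a in (area_of(p) for p in traj) if a is not None]
        for src, dst in zip(visited, visited[1:]):
            if src != dst:
                flows[(src, dst)] += 1
    return flows

def area_of(x):              # hypothetical 1-D positions mapped to areas
    if x < 1:
        return "outer sea"
    if x < 2:
        return "strait"
    if x < 3:
        return "bay"
    return None

trips = [[0.5, 1.5, 2.5], [2.6, 1.4, 0.2], [0.1, 1.9, 2.2]]
flows = aggregate_flows(trips, area_of)
print(flows[("strait", "bay")], flows[("bay", "strait")])   # 2 1
```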
Fig. 10.7: The trajectories under study are represented in an aggregated form on flow
maps. Top left: trajectories without stops. The remaining images represent subsets
of the trajectories having stops after entering the bay (top right), before (bottom left)
and after exiting the bay (bottom right).
The aggregation we have applied to the trajectories is dynamic in the sense that it re-
acts to changes of the filters that are applied to the trajectories. As soon as the subset
of the trajectories selected by one or more filters changes, the counts of the moves
between the areas are automatically re-calculated, and the flow map representing
them is immediately updated. The four images in Fig. 10.7 represent different states
of the same map display corresponding to different query conditions. The upper left
image represents the 1592 trajectories (92.67% of all initially selected trajectories)
that did not include stops. This flow map can be considered as showing normal,
uninterrupted traffic to and from the port of Brest. The flows between any two ar-
eas look symmetric, i.e., the lines have equal widths, which means approximately
the same numbers of moves in the two opposite directions. The remaining images
in Fig. 10.7 show aggregates of different selections of the trajectories depending
on the relative times (with respect to the strait passing) when they had anchoring
events. These flow maps demonstrate which parts of the routes are common in a
given selection, which parts vary, and the proportions of the varying parts.
Concluding the exploration of the stops, we can summarise that most of them are
likely to have happened because the vessels had to wait to be served in the port
of Brest. The vessels coming from the outer sea were waiting mostly inside the bay
(Fig. 10.7, top right), and the vessels that had been unloaded in the port and had to
wait for the next load or another service were waiting mostly outside of the bay. The
vessels that had to wait for port services tended to accumulate over the weekend,
and their number decreased during the weekdays (Fig. 10.6).
(Figure: scheme of transformations of spatio-temporal data, in which spatial events are grouped and integrated into larger spatial events.)
If you compare this transformation scheme with the transformation scheme of tem-
poral data presented in Fig. 8.7, you will certainly notice commonalities. It is not
surprising, because spatio-temporal data are a special class of temporal data. Thus,
like events in general, spatial events can be grouped and integrated into larger events.
These larger events are also spatial events because they have spatial locations, which
can be defined as unions of the locations of the component events or as spatial out-
lines of the event groups. All transformations that are applicable to time series (Sec-
tion 8.3) can, obviously, be applied to spatial time series. Events extracted from spa-
tial time series are spatial events: the spatial locations of these events are the places
the original time series refer to. Generally, the whole system of possible transfor-
mations of temporal data is applicable, in particular, to spatio-temporal data, with
the specifics that the derived data will usually have spatial references. Since spatio-
temporal data are also a special subclass of spatial data, all transformations appli-
cable to spatial data (Section 9.4) can be applied, in particular, to spatio-temporal
data.
Special notes are required concerning movement data. Such data are usually avail-
able originally as collections of records specifying spatial positions of moving ob-
jects (e.g. vessels) at different times. Such records describe spatial events of the
presence (or appearance) of the moving objects at certain locations and specify the
times when these events occurred. When all records referring to the same mov-
ing object are put in a chronological order, they together describe a trajectory of
this object. Hence, trajectories are obtained by integrating spatial events of object
appearance at specific locations. The trajectories can be again disintegrated to the
component events. Particular events of interest, such as stops or zigzagged move-
ment, can be detected in trajectories and extracted from them. A trajectory describ-
ing movements of an object during a long time period can be divided into shorter
trajectories, for example, representing different trips of the object. Trajectories may
be divided according to different criteria:
• Temporal gap between consecutive positions;
• Large spatial distance between consecutive positions;
• Visiting particular areas, such as ports or fuel stations;
• Occurrence of particular events or changes in the movement, e.g., a sufficiently
long stop, a change of a car driver, or a change of the ball possession in a sport
game;
• Beginning of a new time period, e.g., a new day or a new season.
The relevance of the division criteria depends on the nature of the moving objects,
the character of their movements, and the analysis goals. For example, if you want
to compare daily mobility behaviours of individuals, their trajectories need to be
divided into daily trips. If you wish to find repeated trips, you need to divide the
trajectories by stop events. For studying temporal patterns of stops, the stops need
to be extracted from complete trajectories without any division: this is necessary for
correct estimation of the stop durations.
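To make the gap-based division criteria concrete, here is a minimal Python sketch of dividing a chronologically ordered position sequence into sub-trajectories. The planar coordinates, units, and data are invented for illustration; the threshold values echo the 2 km / 30 min settings used for the vessel data in Fig. 10.11.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Fix:
    t: float  # timestamp in seconds
    x: float  # planar coordinates in metres (illustrative)
    y: float

def split_trajectory(fixes, max_gap_s=1800, max_dist_m=2000):
    """Divide a chronologically ordered sequence of position fixes into
    sub-trajectories wherever the temporal gap or the spatial jump
    between consecutive fixes exceeds the given thresholds."""
    if not fixes:
        return []
    parts = [[fixes[0]]]
    for prev, cur in zip(fixes, fixes[1:]):
        gap = cur.t - prev.t
        dist = hypot(cur.x - prev.x, cur.y - prev.y)
        if gap > max_gap_s or dist > max_dist_m:
            parts.append([cur])  # start a new sub-trajectory at the gap
        else:
            parts[-1].append(cur)
    return parts

# A 1-hour gap between the 2nd and 3rd fix splits the track into two trips.
track = [Fix(0, 0, 0), Fix(60, 100, 0), Fix(3660, 5000, 0), Fix(3720, 5100, 0)]
print(len(split_trajectory(track)))  # → 2
```

Division by visited areas or by particular events would replace the gap test with the corresponding predicate, but the structure of the procedure stays the same.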
Aggregation of spatio-temporal data combines temporal and spatial aggregation.
Section 8.3 discusses possible ways for aggregating events into time series using ei-
ther absolute or relative (cyclic) time references. Section 9.4.2 considers aggregation
of spatial data by areas, which can be defined using different ways of space partition-
ing discussed in Section 9.3.4. Having divided the space into compartments (places,
for short) and time into intervals, it is possible to aggregate either spatial events or
trajectories by places and time intervals. Place-based aggregation involves counting,
for each pair of a place and a time interval, (1) the number of events that occurred in
this place during this interval, or (2) the number of visits of this place by moving
objects and the number of distinct objects that visited this place or stayed in it during the interval.
Additionally, various summary statistics of the events or visits can be calculated,
for example, the average or total duration of the events or visits. The result of this
operation is time series of the aggregated counts (e.g. counts of stops or counts of
distinct visitors) and statistical summaries associated with the places.
Link-based aggregation summarises movements (transitions) between places and,
thus, can be applied to trajectories. For each combination of two places and a time
interval, the number of times when any object moved from the first to the second
place during this interval and the number of the objects that moved are counted.
Additionally, summary statistics of the transitions can be computed, such as the
average speed or the duration of the transitions. The result of this operation is time
series of the counts and statistical summaries associated with the pairs of places. The
time series characterise the links between the places; therefore, they can be called
link-based. The term “link between place A and place B” refers to the existence of
at least one transition from A to B.
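In code, both kinds of aggregation reduce to counting over keys. The sketch below uses invented toy records (the mover identifiers, place names, and hours are hypothetical); real data would first be snapped to the space compartments and time intervals described above.

```python
from collections import Counter, defaultdict

# Toy records: (mover_id, hour, place), already snapped to compartments/intervals.
records = [
    ("v1", 0, "A"), ("v1", 0, "B"), ("v1", 1, "B"),
    ("v2", 0, "A"), ("v2", 1, "A"), ("v2", 1, "B"),
]

# Place-based aggregation: visit counts and distinct movers per (place, hour).
visits = Counter((place, hour) for _, hour, place in records)
movers_at = defaultdict(set)
for mover, hour, place in records:
    movers_at[(place, hour)].add(mover)

# Link-based aggregation: transitions between different consecutive places
# of the same mover, keyed by (from_place, to_place, hour of arrival).
by_mover = defaultdict(list)
for mover, hour, place in records:
    by_mover[mover].append((hour, place))
links = Counter()
for seq in by_mover.values():
    seq.sort()
    for (h1, p1), (h2, p2) in zip(seq, seq[1:]):
        if p1 != p2:
            links[(p1, p2, h2)] += 1

print(visits[("A", 0)], len(movers_at[("A", 0)]), links[("A", "B", 0)])  # → 2 2 1
```

Indexed by the time component of the keys, the resulting counters are exactly the place-based and link-based time series discussed above.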
10.2 Specifics of this kind of phenomena/data 307
Spatial time series have an interesting specific property as compared to time series
of attribute values, which were considered in Chapter 8. Both place-based and link-
based time series can be viewed in two complementary ways: as spatially distributed
local time series (i.e., each time series refers to one place or one link) and as a tem-
poral sequence of spatial situations, where each situation is a particular spatial distri-
bution of the counts and summaries over the set of places or the set of links. These
perspectives require different methods of visualisation and analysis. Thus, the first
perspective focuses on the places or links, and the analyst compares the respective
temporal variations of the attribute values such as counts of distinct vessels in ports
over days. The local time series can be analysed using methods suitable for attribute
time series, such as those considered in Chapter 8. The second perspective focuses
on the time intervals, and the analyst compares the respective spatial distributions
of the values associated with the places or links. The spatial distributions can be
analysed using methods suitable for spatial data analysis, such as those mentioned
in Chapter 9.
Technically, the two complementary views of spatial time series can be supported
by transposing a table containing the time series. The table format in which the
rows correspond to places and the columns to time steps is suitable for considering
the data as local time series referring to places. Each table row describes one time
series. The format in which the rows correspond to time steps and the columns
to places is suitable for analysing data as spatial situations referring to time steps.
Each table row describes one spatial situation. Methods of time series analysis (see
Section 8.5.2) are applicable to local time series, and methods for analysis of time
series of complex states (Section 8.5.3) can be applied to spatial situations.
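The two table formats amount to nothing more than transposing the same matrix. A toy illustration (the place names and counts are invented):

```python
# Place-based time series as a places × time-steps matrix of counts.
places = ["port1", "port2", "port3"]
counts = [
    [5, 7, 6],  # local time series of port1
    [2, 1, 3],  # local time series of port2
    [9, 8, 9],  # local time series of port3
]

# View 1: rows correspond to places; each row is one local time series.
local_series = dict(zip(places, counts))

# View 2: the transpose; each row is one spatial situation (one time step).
situations = [list(step) for step in zip(*counts)]

print(local_series["port2"], situations[0])  # → [2, 1, 3] [5, 2, 9]
```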
The types of transformations applicable to temporal data include transforming time
references from absolute to relative (Section 8.3). Similarly, spatial references in
spatial data can also be transformed from absolute to relative (Section 9.4). Obvi-
ously, both kinds of transformations can be applied to spatio-temporal data. Let us
show a few examples of how this can be done and for what purpose.
In Fig. 10.9, we have trajectories of seasonal migration of several white storks. The
space-time cube (STC) on the left shows the trajectories according to the original time references, and
the STC on the right shows the result of transforming the time references to rela-
tive times with respect to the beginning of each migration season. After this trans-
formation, we can better see the similarities and differences between the seasonal
movement patterns of the same and different birds across different seasons. The
main purpose of cycle-based time transformations is to support exploration of rou-
tine behaviours and repetitive processes. Other potentially useful transformations of
time references in trajectories are transformations to relative times with respect to
trip starts, or trip ends, or particular reference time moments, such as the moments
when aircraft reach their planned cruise altitudes. Such transformations are useful
for comparing movement dynamics, e.g., speed patterns, among trips.
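The transformation to trip-relative times is a simple shift of each trip's time stamps to its own start. A minimal sketch (the sample values are invented; speeds in m/s):

```python
# Each trip is a list of (timestamp_s, speed_m_s) samples with absolute times.
trips = [
    [(1000, 0.0), (1060, 5.2), (1120, 9.8)],
    [(5000, 0.0), (5060, 4.9), (5120, 10.1)],
]

# Subtract each trip's start time, so all trips begin at relative time 0
# and their speed profiles can be overlaid and compared directly.
relative = [[(t - trip[0][0], v) for t, v in trip] for trip in trips]

print(relative[1])  # → [(0, 0.0), (60, 4.9), (120, 10.1)]
```

Alignment to trip ends or to a reference moment (such as reaching cruise altitude) would subtract that moment instead of the start time.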
In Section 9.4, we had an example of a spatial coordinates transformation that cre-
ated a so-called "team space" (or "group space") and represented relative movements of the members within the group.
Fig. 10.10: Mobility data of two persons living in different geographical regions
have been translated from the geographical space into a semantic space of human
activities and aggregated into flows between “semantic places” and time series of
presence in these places by the days of the week and hours of the day. The upper
two maps show the behaviours of the two persons, and the lower two maps represent
the differences between their flows (left) and between their patterns of presence in
the "semantic places" (right).
A similar transformation, illustrated in Fig. 10.10, maps geographic locations into a
"semantic space": the places in this space are the home, work, shopping, health care,
sports, and other concepts corresponding to activities of people. We have arranged these "places"
into “regions”: work, transport, life maintenance, recreation, and social life. We got
data from two persons living in different geographical regions who tracked their
movements over long time periods. By analysing these data, it was possible to de-
tect repeatedly visited places and identify the meanings of these places based on the
typical times, frequencies, and durations of the visits. We substituted the place coor-
dinates in the data by the semantic labels of the places and in this way translated the
data from the geographic space to the semantic space. We applied spatio-temporal
aggregation to the transformed data and represented the mobility behaviours of the
two individuals by aggregate flows between semantic places and by weekly time
series of the presence in these places. The time series are represented on the maps in
Fig. 10.10 by 2D time charts, or “mosaic plots”, as introduced in Fig. 3.12. Although
these two persons live very far from each other, we can now easily compare their
behaviours. In the upper part of Fig. 10.10, two maps representing the behaviours
are juxtaposed for comparison. In the lower left image, the aggregated inter-place
movements of the two persons are superposed on the same map. In the lower right
image, the daily and weekly patterns of the presence in the “semantic places” are
compared by explicit encoding of the differences. The shades of blue correspond to
higher values of the person whose behaviour is shown on the top left, and the shades
of orange correspond to higher values of the other person.
Apart from the transformation of the spatial coordinates into positions in an artificial
“semantic” space, this example also demonstrates that spatio-temporal aggregation
can be applied not only to original data describing movements or spatial events but
also to data with transformed spatial positions and/or temporal references (here, we
transformed the original time references into positions within the daily and weekly
cycles).
Since spatio-temporal phenomena and data are both spatial and temporal, they have
all properties pertinent to spatial and temporal phenomena and data, as discussed in
Chapters 9 and 8, respectively. Specifically, spatial events combine the properties of
general events and discrete spatial objects, and spatial time series combine the prop-
erties of general time series and spatial phenomena described by spatial variation
of attribute values. Spatio-temporal data may also have the same kinds of quality
problems as spatial and temporal data.
Unlike spatial events and spatial time series, movement data have specific proper-
ties that are not pertinent to either purely spatial (static) or purely temporal (space-
irrelevant) data taken alone. Therefore, we consider the properties of movement data
separately.
The first group of properties relates to the data structure. The essential components
of the movement phenomenon are moving entities (movers, for short), the space in
which they move, and the time when they move, particularly, the times when the
movers are present at different spatial locations. These components of the movement
phenomenon are represented by corresponding components of movement data. A
set of movement data can represent movements of a single entity; in this case, there
is no need for a special data component referring to movers. When data describe
movements of several objects, each mover needs to be denoted by some identifier.
Since, in general, there are three essential components in movement data, we put
the properties into subgroups referring to these components.
• Properties of the mover set:
– number of movers: a single mover, a small number of movers, a large number
of movers;
– population coverage: whether there are data about all movers of interest for a
given territory and time period or only for a sample of the movers;
The properties in the subgroups “spatial” and “temporal” can refer, respectively, to
spatial and temporal data in general. The properties in the subgroup “mover set”
are, basically, the properties of a statistical sample in relation to a statistical population².
The properties of movement data are strongly related to the data collection methods,
which may be
• Time-based: positions of movers are recorded at regularly spaced time moments.
• Change-based: a record is made when a mover’s position, speed, or movement
direction differs from the previous one.
• Location-based: a record is made when a mover enters or comes close to a spe-
cific place, e.g., where a sensor is installed.
• Event-based: positions and times are recorded when certain events occur, for ex-
ample, when movers perform certain activities, such as cellphone calls or posting
georeferenced contents to social media.
Only time-based measurements can produce temporally regular data. The temporal
resolution may depend on the capacities and/or settings of the measuring device.
GPS tracking, which may be time-based or change-based, typically yields very high
spatial precision and quite high accuracy while the temporal and spatial resolution
depends on the device settings. The spatial coverage of GPS tracking is very high
(almost complete) in open areas. Location-based and event-based recordings usually
produce temporally irregular data with low temporal and spatial resolution and low
spatial coverage. The spatial precision of location-based recordings may be low
(positions specified as areas) or high (positions specified as points), but even in the
latter case the position exactness is typically low. The spatial precision of event-
based recording may be high while the accuracy may vary (cf. positions of photos
taken by a GPS-enabled camera or phone with positions specified manually by the
photographer).
Before analysing spatio-temporal data, it is necessary to find out what data collec-
tion method was used. Due to the complexity of this kind of data, they are often
provided together with a description and/or metadata, which can be used to check
whether the described data collection procedure and data properties match the actual
properties of the data.
Irrespective of the collection method and device settings, there is also unavoidable
uncertainty in any time-related data caused by their discreteness. Since time is
continuous, the data cannot refer to every possible instant. For any two successive
instants t1 and t2 referred to in the data, there are moments in between for which
there are no data. Therefore, one cannot know definitely what happened between t1
and t2 . Movement data with fine temporal and spatial resolution give a possibility
of interpolation, i.e., estimation of the positions of the moving entities between the
² https://en.wikipedia.org/wiki/Sampling_(statistics)
measured positions. In this way, the continuous path of a mover can be approxi-
mately reconstructed.
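For sufficiently dense data, the position estimation can be as simple as linear interpolation between the two enclosing fixes. A sketch with planar coordinates (real geographic coordinates or road-constrained movement would need more care):

```python
def interpolate_position(p1, p2, t):
    """Estimate a mover's position at time t, with t1 <= t <= t2,
    from two measured fixes p1 = (t1, x1, y1) and p2 = (t2, x2, y2)."""
    t1, x1, y1 = p1
    t2, x2, y2 = p2
    if t2 == t1:
        return (x1, y1)
    a = (t - t1) / (t2 - t1)  # fraction of the interval elapsed at time t
    return (x1 + a * (x2 - x1), y1 + a * (y2 - y1))

# Halfway in time between two fixes → the midpoint of the segment.
print(interpolate_position((0, 0.0, 0.0), (60, 100.0, 40.0), 30))  # → (50.0, 20.0)
```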
Movement data that do not allow valid interpolation between subsequent positions
are called episodic [24]. Episodic data are usually produced by location-based and
event-based collection methods but may also be produced by time-based methods
when the position measurements cannot be done sufficiently frequently, for exam-
ple, due to the limited battery lives of the recording devices. Thus, when tracking
movements of wild animals, ecologists have to reduce the frequency of measure-
ments to be able to track the animals over longer time periods.
Whatever the measurement frequency is, there may be time gaps between recorded
positions that are longer than usual or expected according to the device settings,
which means that some positions may be missing. In data analysis, it is important
to know the meaning of the position absence: whether it corresponds to absence of
movement, or to conditions when measurements were impossible (e.g., GPS mea-
surements in a tunnel), or to device failure, or to private information that has been
intentionally removed.
Like all other kinds of data, spatio-temporal data may have quality problems. The
most common types of data quality problems have been mentioned in Sections 5.1
and 5.2. These may be measurement errors, insufficient precision, missing values
or missing records, gaps in spatial or temporal coverage, etc. Some examples of
data quality problems given in Sections 5.1 and 5.2 were based on spatio-temporal
data: georeferenced social media posts in Fig. 5.1, mobile phone calls
in Fig. 5.5, Fig. 5.6, and Fig. 5.7, trajectories of moving entities in Fig. 5.9 and
Fig. 5.10.
The most frequent problems that we encountered dealing with spatio-temporal data
were insufficient temporal resolution, variable temporal resolution (fine in some
parts of data and coarse in other parts), temporal gaps, and erroneous coordinates.
Thus, the dataset with the vessel trajectories that we considered in the introductory
example (Section 10.1) had the last two of these quality issues. Figure 10.11,
top, demonstrates how vessel trajectories with temporal gaps between consecutive
records appear on a map. The lower image shows how the problem can be dealt with
by splitting the trajectories into parts. The map in Fig. 10.12 demonstrates position
errors, i.e., wrong coordinates. Generally, wrong coordinates exhibit themselves as
outliers. When you have a set of spatial events, solitary events located far from all
others may have wrongly specified positions. In movement data, positioning errors
lead to the appearance of local outliers, i.e., positions that are spatially distant from
other positions in their temporal neighbourhood. Many such outliers existed in the
Fig. 10.11: Top: long straight line segments in trajectories of vessels correspond to
temporal gaps, i.e., long time intervals in which position records for the vessels are
missing. Bottom: the result of dividing the trajectories by the spatio-temporal gaps
in which the spatial distance exceeded 2 km and the time interval length exceeded
30 minutes.
vessel dataset, and Figure 10.12 shows that some of the erroneous positions were
even on land, where vessels could never be in reality. While it may be easy to
notice obviously wrong positions in places where moving entities (or spatial events)
could never be, many other positioning errors may not be so obvious. They can be
detected by calculating the speed of the movement from each recorded position to
the next position as the ratio of the spatial distance to the length of the time interval
between them. Unrealistically high speed values signify position errors, which need
to be cleaned.
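Such a speed-based check can be sketched as follows. This is a simplified example: the speed threshold, units, and coordinates are illustrative and would depend on the kind of movers.

```python
from math import hypot

def flag_position_errors(fixes, max_speed=30.0):
    """Return indices of fixes whose speed implied by the move from the
    previous fix is unrealistic. fixes are chronological (t_s, x_m, y_m)
    tuples; max_speed is in m/s (30 m/s = 108 km/h here)."""
    suspect = []
    for i, ((t1, x1, y1), (t2, x2, y2)) in enumerate(zip(fixes, fixes[1:]), start=1):
        dt = t2 - t1
        if dt <= 0:
            continue  # zero intervals (duplicate records) are handled separately
        if hypot(x2 - x1, y2 - y1) / dt > max_speed:
            suspect.append(i)
    return suspect

# The 3rd fix implies a 50 km jump in one minute (~833 m/s), then a jump back.
fixes = [(0, 0, 0), (60, 500, 0), (120, 50500, 0), (180, 1000, 0)]
print(flag_position_errors(fixes))  # → [2, 3]
```

Note that a single erroneous fix is typically flagged twice, on the jump to it and on the jump back from it, which itself is a useful signature for distinguishing isolated errors from identifier mix-ups.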
Since spatio-temporal data may include thematic attributes (i.e., attributes whose
values are neither spatial locations nor time moments or intervals), errors may also
occur in values of these attributes, or some values may be missing. In our introduc-
tory example in Section 10.1, we had issues with the attribute ‘navigational status’.
Another group of problems, which is pertinent to movement data in particular, re-
lates to identifiers of entities. It may happen that two or more entities are denoted
Fig. 10.12: Vessel trajectories affected by positioning errors. Some trajectories even
include positions located on land far from the sea.
by the same identifier, as in Fig. 5.10, but it may also happen that the same entity
is denoted by multiple different identifiers. The latter kind of circumstance is not
always a result of errors in data, but it can be a specific feature of a dataset resulting
from the data collection method or, sometimes, even intentionally designed. For ex-
ample, when data about people’s mobility are collected by tracking mobile devices
worn by people, the data will contain identifiers of the devices rather than identifiers
of the individuals wearing them. Hence, individuals wearing two or more devices
will be represented in the data multiple times with different identifiers. Another ex-
ample, which we also encountered in our practice, is intentional periodic change
of identifiers denoting people, which is done for protecting people’s privacy by ex-
cluding the possibility to identify individuals based on the locations they visit. Thus,
this technique was applied to the dataset with the mobile phone calls considered in
Section 5.2 before the data were provided for analysis. Obviously, data with such a
property are not suitable for studying long-term individual mobility behaviours. As
usual, you need to be aware of the methods used for collecting and preparing the data
in order to understand whether they are appropriate for your analysis goals.
Paper [12] systematically considers the kinds of errors that may occur in move-
ment data and mentions the methods that can be used for detecting and, whenever
possible, fixing the errors. Similar principles and approaches are applicable to other
types of spatio-temporal data. Generally, it is necessary to consider all major compo-
nents of the data structure, namely identities, spatial locations, times, and thematic
attributes, and their combinations. Any unexpected regularity or irregularity of dis-
tributions requires attention and explanation. Calculation of derived attributes (e.g.
speed, direction) and aggregates over space, time, and categories of objects provides
further distributions to be assessed.
To see the spatial distribution of the available data, we create a map with the trajec-
tories represented by lines drawn with low opacity (Fig. 10.13). The map shows that
the dataset covers the whole area with no obvious omissions, and the trajectories
mostly stick to the road network.
To see the temporal distribution of the data, we generate a time histogram of the
times specified in the position records (Fig. 10.14). The histogram shows us that
the temporal coverage is also good; there are no obvious gaps without recordings,
except for the night times, when the intensity of car movement is normally low.
The periodic pattern of the variation of the number of records nicely adheres to
Fig. 10.14: A time histogram shows the temporal distribution of the time stamps in
the car dataset.
Fig. 10.15: A frequency histogram of the distribution of the sampling rates in the
position recordings of the cars.
the daily and weekly cycles of human activities. Both the spatial and the temporal
distributions correspond well to our common-sense expectations.
To understand the temporal resolution of the position recordings, it is necessary to
compute the lengths of the time intervals between consecutive records in each tra-
jectory and inspect the overall frequency distribution of the interval lengths. As we
know that long time intervals in this dataset correspond to stops, such intervals need
to be excluded from the investigation. Figure 10.15 demonstrates that the most fre-
quent sampling rate (i.e., the time between recordings) is around 1 minute. There is a
much smaller subset of points where the time interval to the next point is 2 minutes,
and only a few points have 3-minute intervals to the next points. All other interval
lengths occur in the data infrequently. Next, it is useful to check if the sampling rates
are common for all cars. For this purpose, we calculate the median sampling rate for
each trajectory. The results demonstrate that more than 98% of the cars in this data
set have the median sampling rate of one minute ± one second. However, we have
identified a few outliers, which include about 100 cars that had only a few positions
recorded and, accordingly, quite irregular sampling rates; 9 cars with many recorded
positions but the median sampling rates of 3 to 5 minutes; and 2 cars with very high
sampling rates (13 seconds). Such outliers need to be separated from the others in
further analysis. We have also identified several thousand duplicate pairs of
a car identifier and a time stamp, which were indicated by zero time intervals be-
tween records with the same car identifiers. We removed the duplicates from the
dataset.
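The interval computation, the per-trajectory medians, and the duplicate check can be sketched on toy records. The car identifiers and time stamps below are invented, and the exclusion of long stop intervals, mentioned above, is omitted for brevity.

```python
from collections import defaultdict
from statistics import median

# Toy position records: (car_id, timestamp_s); real records would also
# carry coordinates and thematic attributes.
records = [
    ("car1", 0), ("car1", 60), ("car1", 121), ("car1", 180), ("car1", 180),
    ("car2", 0), ("car2", 300), ("car2", 600), ("car2", 900),
]

times = defaultdict(list)
for car, t in records:
    times[car].append(t)

median_rate, duplicates = {}, 0
for car, ts in times.items():
    ts.sort()
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    duplicates += sum(1 for g in gaps if g == 0)  # same car id + time stamp
    nonzero = [g for g in gaps if g > 0]          # drop the duplicate pairs
    if nonzero:
        median_rate[car] = median(nonzero)

print(median_rate, duplicates)  # → {'car1': 60, 'car2': 300} 1
```

Cars whose median deviates strongly from the dominant rate, like the outliers described above, can then be separated by filtering this dictionary.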
Fig. 10.16: A frequency histogram of the heading values recorded together with the
positions of the cars.
When position records contain thematic attributes that will be involved in the anal-
ysis, their properties need to be inspected. Our example dataset contains measure-
ments of the vehicle headings, specified as compass degrees; their frequency distri-
bution is represented by a frequency histogram in Fig. 10.16. There are two strange
pits around the values 90 and 270 degrees. It is quite unlikely that these directions
were indeed much less frequent than the others. The pits may be due to the method
that was used by the tracking devices for determining the vehicle heading. The
method might calculate the angle based on the ratio of the longitude and latitude
(x and y) differences between two consecutively measured positions, of which the
second position was not recorded. Such a method would fail in cases when the y-
difference equals zero. Whatever the reason is, the measured heading values cannot
be trusted.
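The suspected failure mode can be illustrated in a few lines: computing the heading with atan2 on the coordinate differences avoids the division by a zero y-difference that a naive atan(dx/dy) would attempt. This is a small illustration of the geometric point, not the devices' actual code.

```python
from math import atan2, degrees

def heading(x1, y1, x2, y2):
    """Compass heading (0 = north, 90 = east) of the move from (x1, y1)
    to (x2, y2). atan2 takes the two differences separately, so a zero
    y-difference (due east/west movement) causes no division by zero."""
    return degrees(atan2(x2 - x1, y2 - y1)) % 360

print(heading(0, 0, 0, 1))          # → 0.0 (due north)
print(round(heading(0, 0, 1, 0)))   # → 90 (due east: dy == 0, still fine)
print(round(heading(0, 0, -1, 0)))  # → 270 (due west)
```

A device computing atan(dx/dy) instead would have to drop or mangle exactly the due-east and due-west cases, which would produce pits around 90 and 270 degrees like those in Fig. 10.16.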
In addition to pre-recorded attributes, calculation of derived attributes and studying
their distributions is a useful instrument for further checks. Unrealistic values or un-
expected properties of distributions may indicate certain problems in the data under
inspection. Thus, as we already mentioned, unrealistic values of calculated speeds
can indicate errors in the recorded positions or the use of the same identifier for
denoting several moving entities. Small sizes of spatial bounding boxes containing
sub-sequences of positions from substantially long time intervals can indicate actual
10.3 Visual Analytics Techniques 319
absence of movement, although the recorded positions may not be exactly the same
due to unavoidable measurement errors.
The major analysis goal specific to spatio-temporal data is to understand how the
phenomenon is distributed in space and time and how its characteristics vary over
space and time.
Fig. 10.17: Finding dense spatio-temporal clusters of photo taking events. The
events are represented by dots on a map (top) and in a space-time cube (bottom).
The coloured dots represent events belonging to clusters and the grey dots represent
noise. The clusters are enclosed in spatio-temporal convex hulls.
Fig. 10.18: By setting larger or smaller parameter values for density-based cluster-
ing, it is possible to detect large-scale, as in Fig. 10.17, or local, as in this figure,
spatio-temporal clusters of spatial events. The view direction in the STC (right) is
from the east.
A nice feature of density-based clustering is that we can control the spatial and tem-
poral scales of the clusters that will be detected by making appropriate parameter
settings. Thus, using the same clustering algorithm and the same distance function
(spatio-temporal distance) as previously but decreasing the spatial distance thresh-
old to 500 m, we discover 78 local clusters with sizes from 10 to 383 events that
occurred in 8 cities. As an example, Figure 10.18 shows the local clusters that oc-
curred in New York City. By decreasing the temporal threshold from several
days to an hour, it may be possible to detect events of multiple photos taken in the
same place close in time. We have detected 73 such local clusters with the sizes
from 5 to 24 elementary events and duration from 17 seconds (during which 16
photos were made, all by a single photographer) to 6 hours (19 photos made by 15
photographers).
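The idea can be sketched compactly as a simplified DBSCAN-style procedure with a spatio-temporal neighbourhood: two events are neighbours only when both their spatial and temporal distances are within the thresholds. The data, thresholds, and distance handling below are illustrative, not the exact algorithm or parameters used in this example.

```python
from math import hypot

def st_dbscan(events, eps_space, eps_time, min_pts):
    """Minimal density-based clustering of spatial events (x, y, t).
    Returns a label per event; -1 marks noise."""
    n = len(events)
    labels = [None] * n  # None = unvisited
    def neighbours(i):
        xi, yi, ti = events[i]
        return [j for j, (x, y, t) in enumerate(events)
                if hypot(x - xi, y - yi) <= eps_space and abs(t - ti) <= eps_time]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later join a cluster as a border point)
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: absorb, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:  # core point: expand the cluster
                seeds.extend(jn)
        cluster += 1
    return labels

# Two spatio-temporal clumps plus one far-away noise event.
events = [(0, 0, 0), (10, 0, 1), (0, 10, 2),
          (5000, 5000, 100), (5010, 5000, 101), (5000, 5010, 102),
          (99999, 99999, 999)]
labels = st_dbscan(events, eps_space=50, eps_time=10, min_pts=3)
print(labels)  # → [0, 0, 0, 1, 1, 1, -1]
```

Raising or lowering eps_space and eps_time shifts the detected clusters between large-scale and local ones, mirroring the parameter changes discussed for Figs. 10.17 and 10.18.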
Computationally detected spatio-temporal clusters of spatial events can be treated
as higher-order events, as it was with the mass photo taking events that we consid-
ered in Chapter 8. The higher-order events can be characterised by the counts of
the component events, overall duration, spatial extent (e.g., the area of the polygon
enclosing the cluster in space), density, and statistical summaries of the attributes of
the component events. Thus, in our examples with photo taking, we were interested
to know how many distinct photographers contributed to each aggregate event. For
the last-mentioned set of local short-term clusters, only 5 clusters consisted of
photos by single photographers, and the remaining 68 clusters were indeed events of
collective photo taking that involved from 3 to 24 distinct people.
By providing the possibility to derive higher-order events from elementary ones,
density-based clustering can be used as an important instrument in analytical workflows.
This example has several interesting aspects to pay attention to. Its primary purpose
was to demonstrate dynamic spatio-temporal patterns of clustering of spatial events.
Besides, it demonstrates that trajectories may represent movements not only of dis-
crete entities treated as points (because their spatial dimensions are not relevant or
not specified in the data) but also of spatially extended phenomena, such as thunderstorms.
This example also shows that trajectories may not be originally described in data but
reconstructed by means of sophisticated data processing and analysis.
In the following section, we shall see what kinds of patterns can be found when
spatial events are aggregated into spatial time series.
Spatial time series may be the form in which spatio-temporal data are available
originally, or they may result from aggregation of spatial event data or trajectories
of moving entities, as described in Section 10.2.1. Spatial time series consist of at-
tribute values associated with spatial locations and time steps. The analysis goal is
Fig. 10.20: Dynamic patterns of evolving clusters of spatial events are represented
on an animated map. The violet-painted shapes show the outlines of the clusters at
the end of the time window currently represented in the map, and the grey-painted
shapes show the cumulative outlines including all component events that happened
during the whole time window. The trajectory lines show the cluster movements,
and the sizes of the red dots represent the cluster sizes. Characteristics of the cluster
states can be accessed by mouse hovering on the trajectory lines (right).
to discover and understand the patterns of the spatio-temporal variation of the at-
tribute values. This goal is quite difficult to achieve, as we need to see the values
both in space and in time. There is no visualisation technique that could provide a
comprehensive view of all values in space and time. The space-time cube, which
quite often turns out to be helpful in exploring spatial events or trajectories, is not
suitable for representing spatial time series. It would be necessary to represent the value
for each place and time moment by some mark, and most of the marks would not be
visible due to occlusions.
Hence, to visualise the variation, it is necessary to decompose it into components
that could be represented by available visualisation techniques, particularly, maps
showing the spatial aspect of the distribution and temporal displays showing the
temporal aspect. As we said in Section 10.2.1, spatial time series can be viewed
in two complementary ways: as a set of local time series associated with places
or links between places and as a chronological sequence of spatial situations, in
other words, spatial variations of attribute values that took place in different time
steps. In order to understand the spatio-temporal variation as a whole, it is necessary
to consider both perspectives. As we noted previously, these perspectives require
different visualisation and analysis techniques.
When the time series are short, the sequence of the spatial situations can be repre-
sented by multiple maps each showing the situation in one time step. With such a
representation, the situations can be easily compared, and the temporal patterns of
their changes can be seen. However, this approach becomes daunting when there
are many time steps. When the spatial time series involve a small number of spatial
locations, it may sometimes be possible to represent the local time series on a map
by temporal charts, as, for example, the 2D time charts in Fig. 10.10. However, this
technique will not work when the locations are numerous, and also when the time
series are very long.
When existing visualisation techniques cannot scale to the amounts of the data that
need to be analysed, the most common approach is to use clustering, as described
in Section 4.5. In the case of spatial time series, clustering can be applied to each
of the two complementary perspectives, that is, to the local time series and to the
spatial situations. The results of this two-way clustering will provide complemen-
tary pieces of information enabling understanding of the spatio-temporal variation.
Let us demonstrate the two-way clustering using the example of the data that were
originally available as positions of cars in and around Greater London (Sec-
tion 10.2.3, Fig. 10.13). We partitioned the territory by means of data-driven tes-
sellation as described in Section 9.4.2.2. Then, we aggregated the car trajectories
by the resulting space compartments and time intervals of the length of one hour
into place-based and link-based spatial time series, as described in Section 10.2.1.
The time series include 312 time steps. The place-based time series refer to 3,535
distinct places (space compartments), and the link-based time series refer to 12,352
links between the places.
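The place-based aggregation described above can be sketched in a few lines of Python. The record format, identifiers, and counting rule (distinct cars per place per hour) are illustrative assumptions, not the exact procedure used for the London dataset:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical input: car position records as (car_id, timestamp, place_id),
# where place_id identifies the space compartment containing the position.
records = [
    ("car1", datetime(2017, 1, 2, 8, 15), "P1"),
    ("car2", datetime(2017, 1, 2, 8, 40), "P1"),
    ("car1", datetime(2017, 1, 2, 9, 5),  "P2"),
]

def aggregate_place_visits(records):
    """Count distinct cars visiting each place in each hourly interval."""
    visitors = defaultdict(set)  # (place_id, hour_start) -> set of car ids
    for car_id, t, place_id in records:
        hour_start = t.replace(minute=0, second=0, microsecond=0)
        visitors[(place_id, hour_start)].add(car_id)
    # Place-based spatial time series: place -> {time step -> visit count}
    series = defaultdict(dict)
    for (place_id, hour_start), cars in visitors.items():
        series[place_id][hour_start] = len(cars)
    return series

series = aggregate_place_visits(records)
```

Link-based time series can be built analogously by keying the counts on (origin place, destination place, hour) triples instead of single places.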
Figure 10.21 demonstrates a result of applying partition-based clustering to the
place-based local time series of the hourly counts of the place visits, and, simi-
larly, Figure 10.22 demonstrates the application of clustering to the link-based time
series of the hourly magnitudes of the flows (i.e., the counts of the moves). The
2D time charts represent statistical summaries (namely, mean hourly values) of the
time series belonging to each cluster. This combination of visual displays allows us
to see the temporal patterns of the variation in the 2D time charts and the spatial
distribution of these patterns in the map. Thus, in the map in Fig. 10.22, we can ob-
serve consistency of the cluster affiliation along chains of links following the major
roads; hence, the traffic has common patterns along the major transportation corri-
dors formed by the most important motorways. We can also notice pairs of opposite
links that were put in distinct clusters, which means that the temporal patterns of the
respective flows differ.
When we take the other perspective, i.e., viewing spatial time series as temporal
sequences of spatial situations, we have time series of complex states, as considered
in Section 8.5.3. Each spatial situation is one complex state described by multiple
features; in this case, by attribute values in different places. The analysis methods
that are applicable to time series of complex states are thus suitable to spatial time
series. These include clustering of the states according to the similarity of their fea-
tures. In Fig. 10.23, partition-based clustering has been applied to the hourly spatial
situations in terms of the place visits (i.e., place-based) and in terms of the flows
between the places (i.e., link-based). The results are clusters of time steps, which
326 10 Visual Analytics for Understanding Phenomena in Space and Time
Fig. 10.21: Partition-based clustering has been applied to local time series associated with space compartments. The map fragment shows the cluster affiliations of
the compartments, and the 2D time charts show the average hourly values computed
from the time series of each cluster.
are represented in 2D time charts. In Fig. 10.24 and Fig. 10.25, the averaged spatial
situations of these time clusters are shown on multiple maps. The temporal patterns
visible in the 2D time charts clearly correspond to the human activity cycles, daily
and weekly, and represent the specifics of the two different kinds of aggregates. For
example, commuting flows are characterised by similar magnitudes in the mornings
and evenings, but their directions are opposite. The daytime patterns of the working
days differ much from the nights according to the place visits, while the movement
patterns of regular commuters are similar, as the commuters don’t move much at
night and in the middle of their work days.
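Clustering spatial situations means clustering the time steps themselves, each described by a feature vector of attribute values over all places. The sketch below uses a minimal, self-contained k-means implementation and toy data (the place counts and cluster count are invented for illustration; in practice one would use a library implementation and the real time series):

```python
import math
import random

# Hypothetical data: each hourly time step is a "complex state" described by
# visit counts in four places; the first three rows mimic quiet night hours,
# the last three mimic busy daytime hours.
situations = {
    "h00": [1, 0, 2, 1], "h01": [0, 1, 1, 0], "h02": [2, 1, 0, 1],
    "h08": [40, 55, 38, 60], "h09": [45, 50, 42, 58], "h10": [50, 48, 40, 55],
}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20, seed=0):
    """A minimal k-means: returns a list of cluster indices, one per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

steps = list(situations)
labels = kmeans([situations[s] for s in steps], k=2)
clusters = dict(zip(steps, labels))  # time step -> cluster index
```

The resulting cluster affiliations of the time steps are exactly what the 2D time charts in Fig. 10.23 display by colour.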
In analysing local time series, one may be interested in detecting specific kinds of
temporal variation patterns, such as peaks. There are computational techniques for
Fig. 10.22: The links have been clustered according to the similarity of the nor-
malised time series of the flow volumes. The map fragment shows the cluster affili-
ations of the links, and the 2D time charts show the average hourly values computed
from the normalised time series of each cluster.
pattern detection (also called pattern mining) in time series, as, for example, illustrated in Fig. 8.15. In the case of spatial time series, analysts are interested in knowing
not only when certain patterns occurred but also where in space this happened. The
analysis of the pattern locations in space can be supported by the technique of event
extraction from time series. Each occurrence of a pattern is an event, which has
a specific time of existence. It can be extracted from the time series to a separate
dataset for further analysis. As explained in Section 10.2.1, events extracted from
spatial time series are spatial events having the same spatial references as the time
Fig. 10.23: 2D time charts represent the temporal distributions of the clusters of the
place-based (left) and link-based (right) spatial situations. The colours correspond
to different clusters, and the sizes of the coloured rectangles represent the closeness
of the cluster members to the cluster centroids (the closer, the bigger).
series from which they have been extracted. Spatial events extracted from spatial
time series can be analysed using the methods suitable for spatial event data, as
considered in Section 10.3.1. An example of extracting spatial events from spatial
time series is presented in Fig. 10.26. The time series were obtained by spatio-
temporal aggregation of elementary photo taking events that occurred on the ter-
ritory of Switzerland in the period from 2005 to 2009. A computational peak de-
tection method was applied for detection and extraction of high peak patterns from
the time series. The peaks signify abrupt increases of the photo taking activities.
The map shows the spatial locations of the peak events. Most of them happened in
natural areas in summer, but some happened in December in big cities. This example
demonstrates an interesting chain of data transformations: spatial events – spatial
time series – spatial events. Unlike the original elementary events, the latter events
are abstractions obtained through data analysis.
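A very simple form of such peak extraction can be sketched as follows: a time step is reported as a peak event when its value exceeds both neighbours and is much higher than the series mean. The place names, counts, and the threshold rule are illustrative assumptions, not the method used in the Switzerland study:

```python
def extract_peak_events(spatial_series, factor=3.0):
    """For each place, report (place, time step index, value) as a peak event
    when the value exceeds both neighbours and factor * the series mean."""
    events = []
    for place, values in spatial_series.items():
        mean = sum(values) / len(values)
        for i in range(1, len(values) - 1):
            v = values[i]
            if v > values[i - 1] and v > values[i + 1] and v >= factor * mean:
                events.append((place, i, v))
    return events

# Illustrative monthly photo counts for two places.
monthly_photo_counts = {
    "Zermatt": [2, 3, 2, 40, 3, 2, 2, 3],  # a strong summer peak
    "Zurich":  [5, 6, 5, 6, 5, 7, 6, 5],   # no pronounced peaks
}
events = extract_peak_events(monthly_photo_counts)
```

Each extracted tuple is a spatial event carrying the spatial reference of its source time series, ready for the event-analysis methods of Section 10.3.1.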
Fig. 10.24: The place-based spatial situations corresponding to the time clusters
shown in Fig. 10.23, left, are represented on multiple choropleth maps, each show-
ing the average situation for one of the clusters. The light blue colouring corresponds
to zero values.
query conditions and considering the spatial and temporal distribution of the re-
sulting events, we can gain multiple complementary pieces of knowledge about the
movements.
An example using the London cars dataset is demonstrated in Fig. 10.27 and
Fig. 10.28. We are looking at the spatial (Fig. 10.27) and temporal (Fig. 10.28)
distributions of all positions from the trajectories of the cars and of different kinds
of events: stops for 1 hour or more, low speed (less than 20 km/h), medium speed
(60 to 90 km/h) and high speed (over 90 km/h). We can see that the spatial dis-
tribution of the low speed events is similar to the overall spatial distribution of all
car positions but the temporal distribution differs, having the highest frequencies
in hours 8 and 17-18 of the working days. The stop events, expectedly, are concentrated in the cities and towns, and they happen especially often in the morning hours
Fig. 10.25: The link-based spatial situations corresponding to the time clusters
shown in Fig. 10.23, right, are represented on multiple flow maps, each showing
the average situation for one of the clusters.
7 to 9 of the working days. In the afternoon and evening hours of the working days,
the stop frequency increases in hour 15, reaches the highest values in hours 17-18,
and then gradually decreases. The evening patterns on Fridays differ from the re-
maining working days, and the weekend days can be easily distinguished due to the
difference of their patterns. The spatial distributions of the medium-speed and fast
movement events notably differ from the others. While the temporal distribution of
the medium-speed events is similar to the overall temporal distribution, the tempo-
ral distribution of the high-speed events differs a lot, with the highest frequencies
attained on Saturdays, Sundays, and on Monday, January 2, which was a public holiday
in the UK because the New Year holiday (January 1) fell on a Sunday.
Apart from exploring the temporal distribution of all events throughout the whole
territory, it is possible to select particular areas by means of spatial filtering and see
the temporal distributions of the events in these areas. And, obviously, it is possi-
ble to apply spatio-temporal aggregation to the extracted events of any kind and to
analyse the resulting spatial time series.
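The speed-based event extraction used in this example can be sketched as a classification of consecutive trajectory segments. The trajectory format and the helper names are hypothetical; the speed thresholds follow those named in the text:

```python
def classify_segments(trajectory):
    """trajectory: list of (time in hours, x in km, y in km) points.
    Returns (segment index, speed in km/h, label) for each consecutive pair,
    using the thresholds from the example: <20, 60-90, >90 km/h."""
    events = []
    for i in range(len(trajectory) - 1):
        t0, x0, y0 = trajectory[i]
        t1, x1, y1 = trajectory[i + 1]
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / (t1 - t0)
        if speed < 20:
            label = "low"
        elif speed > 90:
            label = "high"
        elif speed >= 60:
            label = "medium"
        else:
            label = None  # 20-60 km/h: none of the considered event types
        events.append((i, speed, label))
    return events

# Illustrative trajectory: one slow, one medium, one fast segment.
traj = [(0.0, 0.0, 0.0), (1.0, 10.0, 0.0), (2.0, 80.0, 0.0), (3.0, 180.0, 0.0)]
events = classify_segments(traj)
```

Stop events would be handled analogously by testing for near-zero displacement over a duration of an hour or more.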
Fig. 10.26: Extraction of peak events from spatial time series of monthly counts
of photo taking events in different spatial compartments in Switzerland in the time
period from 2005 to 2009. Top: the time series with the detected peaks. Bottom: the
spatial positions of the peaks.
While event extraction helps us to gain more refined knowledge about movements
than aggregation of the full trajectories alone, we still lose all information concerning
the trips: their origins and destinations, routes, times when they took place, dynam-
ics of the speed along the route, and much more. Hence, not all analysis tasks can
be fulfilled by constructing spatial time series or extracting events, and there may
be a need to deal with data in the form of trajectories.
Trajectories are usually visualised as lines on a map; however, this representation
does not show how the movements were conducted in time. For seeing both the spatial
and temporal aspects of a single trajectory or a few trajectories, the space-time
cube may be a helpful technique (Fig. 10.9). However, it will fail when there are
many trajectories to analyse. A viable approach, as in many other cases, is to cluster
the trajectories by similarity and then analyse and compare the clusters instead of
Fig. 10.27: Density fields show the spatial distributions of all positions from the
trajectories of the cars (top left), the events of the stops for 1 hour or more (top
right), and the events of the slow, medium, and fast movement (bottom).
dealing with the individual trajectories. The question is: what is “similarity of tra-
jectories”? Or, in other words, how can one define a distance function (i.e., a measure of
dissimilarity; see Section 4.2) for trajectories?
Trajectories are complex objects with multiple heterogeneous features: spatial (route
geometry and spatial location), spatio-temporal (spatial positions at different time
moments), overall temporal (time and duration of the trip), and temporal variation
of thematic attributes describing the movement (speed, direction, acceleration, etc.)
and, possibly, the context (road category, relief, land cover, traffic conditions, etc.).
Trajectories can be similar or different in terms of any of these features. It is hard to
imagine a distance function that takes all these features into account, and it is also
hard to imagine the purpose for which such a function could be used. In practical
analyses, analysts typically focus on one or a few features at once. When it is nec-
essary to consider more features, the analytical work is split into several steps. The
limitation for the number of the features that can be considered at once is not due to
the technical difficulty of computing the dissimilarities but due to the difficulty of
Fig. 10.28: 2D time histograms show the temporal distributions of all positions from
the trajectories of the cars (left), the events of the stops for 1 hour or more, and the
events of the slow, medium, and fast movement.
interpreting the results of these computations and the clusters obtained using such a
distance function.
Therefore, the approach to analysing trajectories (or other complex objects) by
means of clustering consists in using multiple distance functions accounting for
different types of features. These functions can be used in different steps of the
analysis, allowing the analyst to obtain complementary pieces of knowledge con-
cerning different aspects of the movements. Furthermore, according to the idea of
progressive clustering, results of earlier steps of clustering can be refined in fol-
lowing steps by applying clustering with a different distance function to selected
clusters obtained earlier. The concept of progressive clustering was introduced in
Section 4.5.3 as an approach to refining “bad” (insufficiently coherent) clusters, and
we also mentioned that progressive clustering may involve different distance func-
tions at different steps. Trajectories are a kind of object that may need to be analysed
using this approach. A simple example is finding, in the first step, clusters of trips
according to the destinations and refining the major clusters in the next step by
means of clustering according to the similarity of the routes taken for reaching the
destinations [116].
There exists an extensive set of distance functions designed for trajectories of mov-
ing objects. Paper [109] provides a review of these functions and examples of their
application for clustering. Many of the functions are space-aware adaptations of
distance functions originally designed for time series (Section 4.2.3).
Obviously, results of clustering need to be visualised to enable interpretation and
knowledge construction. The visualisation needs to show the features that were ac-
counted for in the clustering. Thus, maps are needed to show clusters according
to spatial features, and temporal displays are necessary for representing clusters ac-
cording to temporal and dynamic features. Often, spatial and temporal displays need
to be combined, similarly to the case of spatial time series considered in the previous
section.
The following section presents an example of a workflow in which a set of trajecto-
ries is analysed with the use of clustering.
Fig. 10.30: The holding loops occurring in the flight trajectories are highlighted in
red on the map and in a 3D view.
To verify the result, the analyst builds the interactive display shown in Fig. 10.30,
where the segments of the trajectories are coloured according to the values of the
binary attribute just created; red colour corresponds to the loops. The loops occurred
in 1,484 trajectories (29.4%), including more than 50% of the flights that landed
at Heathrow airport and about 5%-10% of the flights that landed at the other
airports. To reduce the impact of overplotting by numerous trajectories that share
the same airspace and, consequently, screen real estate, the analyst applies a different
visualisation that presents detailed holding loops (drawn as lines) in the context of
the overall traffic represented by a density map (Fig. 10.31).
By filtering, the analyst hides the loops and selects the final parts of the trajectories,
starting from a distance of 75 km to the destination. Then the analyst applies
density-based clustering using the distance function “route similarity” [10], which
takes into account only the selected parts of the trajectories. The “route similarity”
Fig. 10.31: The holding loops are shown as red lines on top of a density map repre-
senting the overall traffic.
function matches points and segments of two trajectories based on spatial proxim-
ity, computes the mean distance between the matched parts, and adds penalties for
unmatched parts.
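The idea of such a function can be sketched as follows. This is a simplified illustration in the spirit of the description above, not the exact "route similarity" function of [10]; the matching radius and penalty scheme are assumptions:

```python
import math

def route_distance(a, b, match_radius=5.0, penalty=10.0):
    """a, b: routes as lists of (x, y) points in km. Match each point to the
    nearest point of the other route, average the matched distances, and add
    a penalty proportional to the share of unmatched points."""
    def directed(src, dst):
        total, unmatched = 0.0, 0
        for p in src:
            d = min(math.dist(p, q) for q in dst)
            if d <= match_radius:
                total += d
            else:
                unmatched += 1
        matched = len(src) - unmatched
        mean = total / matched if matched else 0.0
        return mean + penalty * unmatched / len(src)
    # Symmetrise by averaging the two directed distances.
    return (directed(a, b) + directed(b, a)) / 2

route1 = [(0, 0), (1, 0), (2, 0), (3, 0)]
route2 = [(0, 1), (1, 1), (2, 1), (3, 1)]     # parallel, 1 km apart
route3 = [(0, 0), (1, 20), (2, 40), (3, 60)]  # strongly diverging route
```

With such a function, `route_distance(route1, route2)` is small, while `route_distance(route1, route3)` is dominated by the penalty for the unmatched parts, so density-based clustering would keep the diverging route apart.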
After trying several combinations of parameter values for the neighbourhood radius
and number of neighbours (see Section 4.5.1), the analyst manages to obtain clusters
that separate very well the different approach routes to all airports except Stansted. As
explained in Section 4.5.3, “bad” clusters can be improved through progressive clus-
tering. The analyst selects the subset of trajectories ending in Stansted and applies
two steps of progressive clustering with slightly different parameters separately to
this subset, thus yielding good, clean clusters of different approaches into Stansted.
Figure 10.32 shows the original clustering result for Stansted and the outcome of the
progressive clustering. After merging the clustering results for all airports, there are
in total 34 clusters including 4,628 trajectories (91.7%), and 417 trajectories (8.3%)
are labelled as noise.
Each of the clusters obtained corresponds to a possible approach route. The ana-
lyst explores the temporal distribution of the use of these routes by representing the
flight counts on a segmented time histogram (Fig. 10.33) where the segment colours
correspond to the cluster affiliations of the flights. It is clearly visible that day 1 differs
substantially from the following days 2–4. The analyst guesses that the changes in
the landing schemes are, most likely, due to changes of wind direction. Inspection of
10.4 Analysis example: Understanding approach schemes in aviation 337
Fig. 10.33: The time histogram represents the dynamics of the use of different ap-
proach routes.
meteorological records confirms this guess. Indeed, on the second day, the wind direction
changed. Figure 10.34 demonstrates the spatial footprints of the clusters
of the approaches on different days, namely day 1 (left) and day 3 (right).
After identifying the major routes, the analyst proceeds with the task of quantify-
ing the routes according to their usage and the frequency of the holding loops on
each route. Counting trajectories and the corresponding holding loops is an obvious
calculation. However, the results appear to be quite interesting. While the overall
amounts of landings corresponded well to the expectations, it was surprising to find
that only landings at Heathrow airport are highly affected by holding loops, while
landings at the other airports suffer from waiting quite rarely. Specifically, on the
most affected approach route, 184 out of 258 trajectories (71.3%) included holding
loops, with impressive statistics of the time spent in the loops: 10.2 minutes on
average, a median of 8.78 minutes, and a maximal waiting time of more than 29 minutes.
Fig. 10.34: Clusters of the approach routes taken on two different days.
Let’s reconstruct the analysis workflow in this study. The analyst started with an
investigation of data quality. We did not discuss this step here, as it followed precisely
the recommendations from Section 10.2.3. Next, the analyst identified that holding
loops are present in the trajectories and applied interactive filtering tools for defin-
ing appropriate computations and excluding the loops from the trajectories. Next,
visually-driven density-based clustering was applied for separating repeated routes
from occasional ones and grouping trajectories into clusters. Global parameter settings
did not work well enough; therefore, it was necessary to apply progressive clustering,
defining parameters specifically for selected subsets of the data. Next, the temporal
dynamics of the cluster lifelines was investigated, and the spatial distributions were
studied for the indicative days. Finally, statistics of the holding loops were computed
and studied, yielding quite surprising results.
It is necessary to stress that in this example we have applied a sophisticated workflow
consisting of rather simple steps, combining visualisations and computations
in a transparent way. Each step was easy to understand
and produced interpretable results, which is very important when results of analysis
need to be communicated to domain experts or decision makers.
Abstract Texts are created for humans, who are trained to read and understand them.
Texts are poorly suited for machine processing; still, humans need computer help
when it is necessary to gain an overall understanding of characteristics and contents
of large volumes of text or to find specific information in these volumes. Computer
support in text analysis involves derivation of various kinds of structured data, such
as numeric attributes and lists of significant items with associated numeric measures
or weighted binary relationships between them. Computers themselves cannot give
any meaning to the data they derive; therefore, the data need to be presented to hu-
mans in ways enabling semantic interpretations. While there exist a few text-specific
visualisation techniques, such as Word Cloud and Word Tree, which explicitly rep-
resent words, it is often beneficial to use also general approaches suitable for multi-
dimensional data. Collections of texts having spatial and/or temporal references are
transformed to data that can be visualised and analysed using general methods de-
vised for spatial, temporal, and spatio-temporal data. We show multiple examples of
possible tasks in text data analysis and approaches to accomplishing them.
Fig. 11.1: Visual comparison of statistical distributions of feature values over two
sets of texts. Source: [106].
Fig. 11.2: Visual analysis of the readability of 8 election agendas from the elections
of the German parliament in 2009. Source: [106].
when writing a text, allowing the writer to improve the readability of a sentence
with respect to this feature. In this way, the following list of distinctive and non-
redundant features was selected: (1) word length: the average number of characters
in a word; (2) vocabulary complexity: proportion of terms that are not contained
in the list of most common terms; (3) ratio of the number of nouns to the number
of verbs; (4) number of nominal forms; (5) sentence length (number of words); (6)
complexity of the sentence structure, expressed as the number of branches in the
sentence tree.
The selected features can be used for evaluation of the readability of individual sen-
tences in a text, and it is also possible to create a “portrait” of a text document, as
demonstrated in Fig. 11.2. Each sentence is represented by a pixel (a small square)
coloured according to the average readability score (Fig. 11.2a) or the value of one
of the selected features. Multiple documents can be compared in terms of their read-
ability. When documents are long, the scores of the sentences can be aggregated for
paragraphs or sections.
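Two of the features listed above, average word length and sentence length, can be computed per sentence with a few lines of Python. The tokenisation, sentence splitting, and output structure are illustrative simplifications, not the method of the cited system [106]:

```python
import re

def sentence_features(text):
    """Split a text into sentences and compute, for each, the average word
    length and the number of words (features 1 and 5 from the list above)."""
    features = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        if not words:
            continue
        features.append({
            "sentence": sentence,
            "word_length": sum(len(w) for w in words) / len(words),
            "sentence_length": len(words),
        })
    return features

text = ("Short words are easy. "
        "Comprehensibility deteriorates when terminology proliferates.")
features = sentence_features(text)
```

In a pixel-based "portrait" such as Fig. 11.2, each of the resulting records would be mapped to one coloured square.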
This example demonstrates that analysis of texts, on the one hand, relies on the
calculation of derived numeric data and, on the other hand, requires the involvement
of human knowledge and reasoning, which needs to be supported by appropriate visualisations.
344 11 Visual Analytics for Understanding Texts
In this case, human background knowledge was necessary for defining
an initial comprehensive set of potentially relevant features. Then, visualisation of
computation results helped the analysts to determine which features are really use-
ful, understand the relationships and redundancies among them, and select a small
subset of understandable and non-redundant features that were appropriate to the
purpose.
To analyse texts with the help of computers, you need to transform them into some
kind of structured data, such as tables or graphs, and then visualise, interpret, and
analyse these structured data. However, before the extraction of structured data, texts
usually need to undergo several steps of preprocessing, namely:
• Tokenisation (segmentation): splits longer strings of text into smaller pieces, or
tokens. Long texts are divided into paragraphs or individual sentences, and sen-
tences may be divided into words.
• Normalisation, which includes
– converting all text to the same case (upper or lower), removing punctuation,
converting numbers to their word equivalents, and so on;
– stemming: eliminating affixes (suffixes, prefixes, infixes, circumfixes) from
words in order to obtain word stems;
– lemmatisation: transforming words to their canonical forms; for example, the
word “better” is transformed to “good”;
• Removal of so-called “stop words”, which do not convey significant information.
These include articles, pronouns, prepositions, conjunctions, auxiliary verbs, and
words with similar functions.
Software tools for performing these operations are now easy to find. Obviously,
you need to choose tools that can deal with the language in which the texts are
written.
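A minimal sketch of this pipeline in Python is given below. It covers tokenisation, case normalisation, punctuation removal, and stop-word removal; the tiny stop-word list is illustrative, and a real pipeline would additionally apply stemming or lemmatisation using an NLP library:

```python
import re

# A deliberately small, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it"}

def preprocess(text):
    """Lowercase, tokenise (dropping punctuation), and remove stop words."""
    text = text.lower()                   # normalise case
    tokens = re.findall(r"[a-z]+", text)  # tokenise into words
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The quick brown fox jumps over the lazy dog.")
```

The resulting token lists are the input to the data-derivation techniques described next.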
Computational techniques for deriving structured data from texts can be grouped into
the following classes, in order of increasing sophistication:
• calculation of simple numeric measures, such as word length, sentence length,
etc.;
• extraction of significant keywords and computation of statistical characteristics
of their usage, such as frequency and specificity for a given text;
• probabilistic topic modelling (see Section 4.6);
• NLP (Natural Language Processing) techniques for
– identification of named entities (people, places, organisations, etc.);
– sentiment analysis (identification of emotions and attitudes and measurement
of their intensity).
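The keyword statistics mentioned above, frequency and specificity, are commonly combined in the TF-IDF weighting scheme: a term gets a high weight in a document when it is frequent there but rare in the rest of the collection. A minimal sketch (the toy documents are illustrative):

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns, per document, a dict term -> weight,
    where weight = term frequency * log(inverse document frequency)."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            scores[term] = tf * math.log(n / df[term])
        weights.append(scores)
    return weights

docs = [["printer", "paper", "jam"],
        ["printer", "ink", "cartridge"],
        ["printer", "jam", "jam"]]
weights = tf_idf(docs)
```

Note that a term occurring in every document, such as "printer" here, receives the weight zero, which quantifies the intuition that it is frequent but not specific.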
These techniques create the following kinds of structured data:
• numeric attributes associated with entire documents or their parts (sections, para-
graphs, or sentences);
Text-derived numeric attributes can be visualised using any methods suitable for
this kind of data. For example, in Fig. 11.2, numeric measures of text readability are
represented by colour coding. When multiple numeric attributes have been derived
and need to be considered together, the techniques devised for multidimensional
data are applied in standard ways (see Chapter 6).
Fig. 11.4: A matrix display of the term weights for topics. Source: [42].
Results of topic modelling can also be used for data embedding and spatialisation,
in which texts are represented by points in a 2D embedding space. The points are
arranged according to the similarity of the combinations of the topic weights. An
example is shown in Fig. 11.6. Here, an interactive display not only shows the
documents arranged in an embedding space by the topic weights but also allows the
analyst to obtain a more detailed topic model for a selected subset of documents.
Fig. 11.5: A matrix display of the topic weights for documents. Source: [4].
selected based on their spatial and temporal references. The keywords that got high
prominence in these word clouds allowed us to discover the existence of two distinct
diseases and to understand the reason for the epidemic outbreak.
When you use word clouds for summarising text contents, it may also be reasonable
to remove terms that are not relevant to the analysis or that occur in very many
documents. For example, in customer reviews of printers, the term “printer” may
appear frequently, but it is not highly informative since all reviews refer to printers.
Such words can be temporarily added to the list of stop words, which makes the
tool that generates the word clouds ignore these words and give higher prominence
to other words.
Another application of the idea of the word cloud is the comparison of documents or
collections of documents. An example is shown in Fig. 11.7, where the word clouds
generated from different collections of documents are represented in the form of
parallel vertical lists. The analyst can interactively select a term in any list and see
in which other lists this term appears and compare its weights in these lists.
Fig. 11.6: 2D embedding has been applied to a set of documents represented by com-
binations of topic weights. The dots in the 2D embedding space represent the docu-
ments; the concentrations of dots correspond to groups of documents with common
prevailing topics. These topics are indicated by the terms with the highest weights.
The system providing this display allows the analyst to refine the topic model by se-
lecting a region in the projection space and applying the topic modelling algorithm
to the documents located in this region. Source: [82].
Fig. 11.8: Lists of entities of the categories Location, Person, Organisation, and
Money extracted from the 9/11 Commission Report (see
https://en.wikipedia.org/wiki/9/11_Commission_Report).
Source: [56].
Fig. 11.9: The List View, as in Fig. 11.8, showing locations, persons, and organisa-
tions connected to Usama Bin Ladin. Source: [56].
to the selected entities. An example is shown in Fig. 11.9. Here, the saturation of
the background colour represents the strength of the relationship. The general ap-
proaches to the visualisation and analysis of pairwise relationships between entities
are described in Chapter 7.
Apart from analysing various kinds of structured data that can be derived from texts,
it can also be helpful to look at the contexts in which particular words appear in a
text. A nice way to visualise this information is the Word Tree [148]. To show an
example, we applied a web tool for online generation of word trees1 to the descrip-
tion of the introductory example in Chapter 1. For an interactively selected word,
such as “distribution” in Fig. 11.10, the display shows the phrases that either follow
or precede this word in the text. The selected word appears as the root of the tree.
Phrases having common beginnings or ends are grouped, and their common parts
appear as roots of sub-trees. These common parts are shown using the font sizes
proportional to the numbers of the phrases in which they appear.
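The data behind such a display can be sketched as follows: for a selected root word, collect the phrases that follow it and count the shared continuations, which would determine the font sizes of the first branching level. The text and the context length are illustrative; this is not the algorithm of [148]:

```python
from collections import Counter

def following_contexts(text, root, length=3):
    """Collect up to `length` tokens following each occurrence of `root`."""
    tokens = text.lower().split()
    contexts = []
    for i, t in enumerate(tokens):
        if t == root:
            contexts.append(tuple(tokens[i + 1 : i + 1 + length]))
    return contexts

def continuation_counts(contexts):
    """Count the first following word: the first branching level of the tree."""
    return Counter(c[0] for c in contexts if c)

text = ("the distribution of values the distribution of colours "
        "the distribution over time")
contexts = following_contexts(text, "distribution")
counts = continuation_counts(contexts)
```

Preceding contexts, as in the bottom part of Fig. 11.10, would be collected symmetrically by looking at the tokens before each occurrence of the root word.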
For other techniques that can be used for visualisation of text-derived data, we refer
the readers to published surveys [88, 89] and online repositories, as, for example,
TextVis2 .
1 https://www.jasondavies.com/wordtree/
2 http://textvis.lnu.se/
11.7 Texts in geographic space
Fig. 11.10: A word tree generation tool has been applied to the text describing the
introductory example of visual analysis in Chapter 1. The display shows the contexts
for the word “distribution”, i.e., the words and phrases following (top) and preceding
(bottom) the word “distribution” in the text.
while avoiding overlapping labels. The level of detail in showing the terms depends
on the map scale. The analyst can interactively select subsets of the messages, and
the system will show the spatial distribution of the terms from this subset only, as
demonstrated on the right of Fig. 11.11.
Another approach to representing the spatial distribution of texts on a map involves
aggregation of the text data by areas in space and categories according to text se-
mantics. For example, in analysing a set of georeferenced Twitter messages posted
on the territory of Seattle (USA), a group of analysts identified 22 key topics, in-
cluding home, work, education, transport, food, and others [132]. The topics were
determined based on occurrence of pre-specified indicative keywords in the mes-
sage texts. Then, the analysts divided the territory into areas based on the spatial
distribution of the tweets and aggregated the data into per-area counts of the posted
tweets by the topics. The resulting aggregates were represented on a map display
by pie charts, as shown in Fig. 11.12.
Fig. 11.11: The left image shows an overview of the spatial distribution of the most
prominent terms from the tweets posted during the earthquake on 23 August 2011.
The right image demonstrates the use of an interactive lens tool for a subset of the
messages mentioning ‘damage’ and ‘building’. Source: [131]
Fig. 11.12: Spatial distribution of the Twitter messages of different thematic cate-
gories in Seattle, USA. Source: [132]
On the left, the pies represent all 22 topics. The labels from A to G have been added
to mark the places where particular topics were more popular than elsewhere: A –
“education” (areas of universities), B – “sports” (University of Washington sports
arenas), C – “love” (the artsy and Bohemian district of Fremont), D – “music” and
“public event” (Seattle Center – the location of the Bumbershoot music and arts festival,
the US’s largest arts festival), E – “coffee” (most of the Seattle downtown area), F
– “sports” and “music” (Pioneer Square, known for its lively bar and club scene),
G – “sports” and “game” (the CenturyLink Field multi-purpose stadium and the
Safeco Field baseball park). The map on the right was created using the subset of
the messages referring to the topics “coffee” and “tea”.
In this example, which demonstrates the use of text data for understanding the spa-
tial distribution of people’s activities, the texts were used for derivation of seman-
tically meaningful items (topics) and assigning feature vectors to the data records
indicating by values 1 and 0 whether each of the items is referred to in the text.
These feature vectors were spatially aggregated and visualised as standard spatially
referenced numeric data.
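This derivation-and-aggregation pipeline can be sketched in Python. The topic names and keyword lists below are illustrative stand-ins, not the 22 topics from [132]:

```python
from collections import Counter

# Hypothetical indicative keywords per topic (illustrative only).
TOPIC_KEYWORDS = {
    "coffee": {"coffee", "latte", "espresso"},
    "sports": {"game", "match", "stadium"},
    "food": {"lunch", "dinner", "pizza"},
}

def topic_vector(text):
    """Binary feature vector: 1 if any indicative keyword occurs in the text."""
    words = set(text.lower().split())
    return {t: int(bool(words & kws)) for t, kws in TOPIC_KEYWORDS.items()}

def aggregate_by_area(messages):
    """Sum the per-message topic indicators into per-area topic counts."""
    counts = {}
    for area, text in messages:
        acc = counts.setdefault(area, Counter())
        acc.update(topic_vector(text))
    return counts

msgs = [("downtown", "Great espresso this morning"),
        ("downtown", "Coffee before the game"),
        ("stadium_district", "What a match at the stadium!")]
per_area = aggregate_by_area(msgs)
# per_area["downtown"]["coffee"] == 2
```

The per-area counts are exactly the kind of aggregates that the pie charts in Fig. 11.12 visualise.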
11.8 Texts over time
Any text is created at a certain time. When you analyse a collection of texts, the
times of the text creation or publication may be important to take into account. This
refers, in particular, to dynamic text streams, with new texts continuously added to
them. Examples are online news articles and messages in social media. Besides,
a single text document may also change due to revising and editing. Hence, there
may be multiple versions of a document, and the corresponding task may be to
understand how the document evolved over time.
A sequence of versions of an evolving document can be treated as a time series of
complex states. In Chapter 8 (Section 8.5.3), we considered two general approaches
to dealing with such data. One approach is clustering of the states and visual anal-
ysis of the temporal distribution of the clusters. The other approach is spatialisation
of the states and connection of the dots representing the states in the chronological
order by straight or curved lines. The latter technique is known as Time Curve [28].
For example, the upper image in Fig. 2.12 demonstrates a time curve representing the evo-
lution of the Wikipedia article “Chocolate”. The display includes a part where the
curve alternates between the exact same revisions (blue halos), suggesting a so-
called “edit war”. The two blue halos are rather dark, suggesting a long edit war.
One of the opponents finally won, and the article continued to progress.
Both clustering and spatialisation use some distance function (Section 4.2) for nu-
meric assessment of the dissimilarity between two states. For the spatialisation in
Fig. 2.12, top, the article versions were treated as sequences of symbols (words),
and the dissimilarity was assessed using one of the edit distance functions (Sec-
tion 4.2.6).
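A word-level edit distance of this kind can be sketched as a standard Levenshtein computation over token sequences; the two example "versions" below are invented:

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token sequences: the minimum
    number of word insertions, deletions, and substitutions needed
    to turn one version into the other."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1,          # delete wa
                           cur[j - 1] + 1,       # insert wb
                           prev[j - 1] + cost))  # substitute
        prev = cur
    return prev[-1]

v1 = "chocolate is made from roasted cacao beans"
v2 = "chocolate is produced from roasted cocoa beans"
word_edit_distance(v1, v2)  # 2 substitutions
```

Feeding such pairwise distances into a dimensionality reduction method produces the point placement used by the Time Curve technique.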
A result of spatialisation can tell the analyst how much the document changed from
one version to another but does not tell what specifically changed in the document
content. The analyst needs to look at the texts to understand this. Such a way of
analysis may be daunting when changes are numerous. Particularly, dynamic text
streams require a different approach, which may involve summarisation of text
contents by time steps. For each time step, the contents of the texts that appeared at
this time can be summarised into a list of semantically meaningful items, such as
significant keywords, topics, or entities, and their “weights”, that is, values of some
numeric measure representing their frequency or degree of importance. For each
item, there will be a time series of the corresponding weights. Hence, the result of
the summarisation will be a collection of numeric time series corresponding to dif-
ferent items extracted from the texts. This transformation is applicable not only to
a text stream but also to a long text document, which can be divided into segments,
such as sections or paragraphs. In this case, the relative ordering of the segments in
the document plays the same role as the chronological sequence of time steps in a
text stream.
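The summarisation of a text stream into time series of item weights can be sketched as follows, using raw term frequency as a simple stand-in for a weight measure; the stream and vocabulary are invented:

```python
from collections import Counter, defaultdict

def keyword_time_series(stream, vocabulary):
    """stream: iterable of (time_step, text) pairs.
    Returns {keyword: [weight at step 0, weight at step 1, ...]},
    with raw term frequency as a simple weight measure."""
    steps = sorted({t for t, _ in stream})
    index = {t: i for i, t in enumerate(steps)}
    series = defaultdict(lambda: [0] * len(steps))
    for t, text in stream:
        counts = Counter(text.lower().split())
        for kw in vocabulary:
            series[kw][index[t]] += counts[kw]
    return dict(series)

stream = [(0, "earthquake reported downtown"),
          (0, "strong earthquake felt"),
          (1, "damage to buildings after earthquake")]
ts = keyword_time_series(stream, {"earthquake", "damage"})
# ts["earthquake"] == [2, 1] and ts["damage"] == [0, 1]
```

Replacing raw counts with TF-IDF or topic weights changes only the weight computation, not the overall transformation.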
Time series of item weights can be visualised and analysed using the standard meth-
ods suitable for numeric time series; see Chapter 8. There are also techniques pro-
posed specifically for texts. The most popular is the ThemeRiver, which represents
changes of item weights over time using the river metaphor. An example is shown in
Fig. 11.13. The same information can also be shown by means of a time histogram
with segmented bars corresponding to time steps. Unlike a histogram, ThemeRiver
depicts time in a continuous manner. Each theme (i.e., one of the semantically mean-
ingful items extracted from the texts) is shown as a “current” that “flows” along
the timeline. Hence, each theme maintains its integrity as a single entity throughout the
graph. To create a continuous representation from discrete time steps, the data are
interpolated into soft curves that look like currents in a river.
Fig. 11.14: The 2D histograms with the rows corresponding to the days of the week
and columns to the hours of the day show the distribution of the Twitter messages
with different thematic contents with respect to the daily and weekly time cycles.
Source: [132].
The ThemeRiver, as well as a standard time series graph, shows the variation of the
text amount and contents along the timeline. There may be a task to understand
how the text characteristics vary with respect to time cycles. A possible approach is
demonstrated in Fig. 11.14. These 2D time histograms were generated for the collec-
tion of the Twitter messages posted in Seattle we discussed before (see Fig. 11.12).
The rows and columns of the histogram correspond to the days of the week and
hours of the day, respectively. The bars within the cells represent the counts of the
posted tweets referring to all topics (top left) and to a few selected topics. The figure
demonstrates that the variation of the thematic content of social media may be re-
lated to temporal cycles, according to the typical times of human activities.
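Binning timestamped messages into such a day-of-week by hour-of-day histogram is straightforward; the timestamps below are invented examples:

```python
from datetime import datetime

def weekly_daily_histogram(timestamps):
    """Count events in a 7 x 24 grid: rows = days of the week
    (0 = Monday), columns = hours of the day."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in timestamps:
        grid[ts.weekday()][ts.hour] += 1
    return grid

posts = [datetime(2011, 8, 22, 8, 15),   # Monday morning
         datetime(2011, 8, 22, 8, 40),   # Monday morning
         datetime(2011, 8, 27, 23, 5)]   # Saturday night
h = weekly_daily_histogram(posts)
# h[0][8] == 2 (Monday, 8-9 am); h[5][23] == 1 (Saturday, 11 pm)
```

Computing one such grid per topic, as in Fig. 11.14, only requires filtering the messages by topic before binning.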
This property of the social media contents needs to be taken into account when the
analysis task is to detect unusual events signified by temporal or spatio-temporal
bursts of thematically similar messages. To separate unusual peaks from normal
patterns of social media activities, it was proposed to employ a seasonal-trend de-
composition procedure [39]. For the anomaly detection task, globally and season-
ally trending portions of the data need to be ignored, whereas major non-seasonal
elements can be considered as potentially anomalous and, therefore, relevant. The
workflow involves topic extraction (Section 4.6) and construction of time series of
the daily counts of the topic-related messages. The seasonal-trend decomposition
transforms each time series into a sum of three components: a trend component, a
seasonal component, and a remainder. The values from the remainder are utilised
for detecting anomalous outliers. The deviation of a daily remainder value from the
7-day mean by more than 2 standard deviations is considered an anomaly. For
example, Fig. 11.15 demonstrates how an abnormally high remainder value in a time
series signified the Virginia earthquake on August 23rd, 2011.
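The decomposition-and-thresholding idea can be sketched in pure Python. This is a simplified additive decomposition (a centred moving-average trend and a robust median seasonal profile), not the actual procedure of [39], and the daily counts are synthetic:

```python
import random
from statistics import mean, median, pstdev

def decompose(series, period=7):
    """Simplified additive seasonal-trend decomposition:
    remainder = data - trend - seasonal."""
    n, half = len(series), period // 2
    trend = [mean(series[max(0, i - half):i + half + 1]) for i in range(n)]
    detrended = [x - t for x, t in zip(series, trend)]
    profile = [median(detrended[p::period]) for p in range(period)]
    seasonal = [profile[i % period] for i in range(n)]
    remainder = [x - t - s for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, remainder

def anomalies(remainder, window=7, k=2):
    """Flag positions whose remainder deviates from the mean of the
    preceding window by more than k standard deviations."""
    flagged = []
    for i in range(window, len(remainder)):
        w = remainder[i - window:i]
        sd = pstdev(w)
        if sd > 0 and abs(remainder[i] - mean(w)) > k * sd:
            flagged.append(i)
    return flagged

random.seed(1)
base = [10, 12, 11, 13, 12, 20, 22]           # weekly activity pattern
counts = [base[d % 7] + random.randint(-2, 2) for d in range(28)]
counts[23] += 80                              # burst of topic-related messages
_, _, rem = decompose(counts)
flagged = anomalies(rem)                      # day 23 is among the flagged days
```

The weekly pattern and the trend are absorbed by the first two components, so only the burst stands out in the remainder.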
The examples in this section demonstrate that analysis of the distribution and vari-
ation of text amounts and contents over time involves derivation of numeric data,
which are then processed, visualised, and analysed using the standard methods suit-
able for time-related data, i.e., events and time series.
This chapter contains examples for two classes of text analysis tasks: tasks focusing
on content-unrelated text characteristics, such as the amount of text, document length,
sentence complexity, as well as spatial and temporal distributions, and tasks fo-
cusing on text contents. The first class of tasks does not require any text-specific
approaches to data analysis, except that calculation of some numeric measures may
require text-specific processing, such as sentence parsing. After the necessary text
attributes are obtained, the data are analysed in the standard ways suitable for mul-
tidimensional, spatial, temporal, or spatio-temporal data.
The second class of tasks requires such processing of texts that the resulting derived
data convey some aspects of the text contents. Such aspects can be represented by
extracted keywords, topics, or named entities. The analysis relies on the capability of
human analysts to understand the semantics of these items. Typically, extracted se-
mantically meaningful items are characterised by numeric attributes expressing their
importance, prominence, or specificity. The analysis requires visual representations
showing both the items and their attributes. While a few text-specific techniques ex-
ist, such as word cloud and word tree, the analytical opportunities they can provide
are quite limited. A more general approach is to treat the extracted items with the
respective measures as attributes and analyse the data as usual multidimensional nu-
meric data. Examples can be seen in Figs. 11.4, 11.5, and 11.6. Another possibility
is data aggregation by subsets corresponding to the items, possibly in combination
with spatial and/or temporal aggregation. The resulting aggregates have, on the one
hand, clear semantics and, on the other hand, are standard numeric data that can
be visualised and analysed in standard, text-unspecific ways. Examples have been
demonstrated in Figs. 11.3, 11.12, 11.13, 11.14, and 11.15.
The chapter demonstrates that text analysis may be done for various purposes, that
there exist a variety of techniques allowing derivation of task-relevant structured
data, and that there are many ways to visualise and analyse these derived data. Hu-
man knowledge of text properties and understanding of the analysis tasks are vital
for the selection of the right text processing techniques and analysis methods for the derived
data.
11.10 Questions and exercises
• Consider the task of comparing the election agendas of several political parties.
What are the relevant aspects to compare? What kind of structured data do you
need to derive? What text processing methods can be used for this purpose? What
visualisation of the derived data can help you to perform an efficient comparison?
• Assume that you have a collection of messages of social media users from
some city (e.g., London) for a period of 3 months from November to the next
year’s January. How can you identify specific themes appearing in the commu-
nication before and during Christmas and New Year? How can you analyse the
overall evolution of the message contents? How can you identify messages men-
tioning particular places in the city? Assuming that you have selected these mes-
sages, how can you summarise and represent the contents and the sentiments of
the texts concerning the different places?
• Perform sentiment analysis of your emails (e.g. by applying Python code from
Afinn3 ), study the dynamics of the sentiments (e.g. compare the sentiments on
Friday evening and Monday morning), compare the results for different partners
in communication, compare your private emails with your business emails, etc.
3 https://github.com/fnielsen/afinn
Chapter 12
Visual Analytics for Understanding Images and
Video
Computer processing of unstructured data begins with deriving some kind of struc-
tured data. One possible kind of structured data is summary statistics characterising
each image or video frame as a whole. Such data can be used for arranging the
images or frames by similarity to provide an overview of an image collection or a
video recording and enable search for data items similar to a given sample. Another
possibility is to use the existing image processing techniques for detecting partic-
ular objects represented in the visual contents and computing their characteristics,
such as sizes, shapes, and positions. These characteristics can then be analysed in
various ways suitable for structured data, but these analyses must be complemented
with human perception and understanding of the original visual contents. This chap-
ter includes several examples of approaches in which computers support the unique
capabilities of humans. At the end, we summarise these approaches in a general
scheme showing the possible operations and the types of data that are derived and
analysed.
Imagine you are an architect who has been asked to improve the lobby of your client’s office
building. For this, you first want to find out how people actually use this part of the
building – how many people move through it from the elevator to the main entrance,
and vice versa? Which routes do they take? When do they take these routes, are there
traffic peaks when people start getting in each other’s way? What about users of the
staircase, and the side entrance? If people are loitering in the lobby, where do they
do it?
To help you out, your client provides you with several hours of video surveillance
recordings capturing people’s movements on typical office days. But how would
you go about analysing these videos? Sure, you can watch them one after another,
then spend hours rewinding and fast-forwarding to take stock of, and compare, dif-
ferent traffic situations. You would probably take notes and do coarse sketches of
individual persons’ trajectories on paper to finally get an idea of the overall move-
ment patterns that will form the basis of your refurbishment planning. However, this
manual approach will be time-consuming, tedious, and error-prone.
Luckily, video analysis methods can help with this kind of data exploration by per-
forming a lot of the tedious and error-prone work automatically. Specifically, you
delegate the detection of moving people in the videos and the extraction of their 2D
trajectories from the 3D video images (i.e., lines on a floor plan of the building) to a
software tool like [68]. But this would still require you to manually compare the 2D
images of trajectories to find common patterns. And because there will be a lot of
trajectories from many people, images with superimposed trajectories from all ob-
servations will be very cluttered and hard to read. Even worse, your analysis needs
to account for further properties of the movement beyond just its geometric shape:
when people were moving, and how fast, for example (do loitering or slow walkers
often get in the way of those in a hurry?).
Thus for an efficient analysis, you need means for summarising groups of trajec-
tories, but in a flexible way. Again following Shneiderman’s principle of “overview
first, zoom and filter, details on demand” [124], you will want to start with a rough
overview of where the majority of people walk, and which are outliers (i.e., a few
unusual paths) that you want to ignore for further analysis. At this stage, just looking
at the trajectories’ geometry, or positions, is a good approach.
Next, you want to filter and drill down on the large flow of people coming from the
elevator and going to the main entrance, and compare them to the group of people
using the stairs after entering the building (Fig. 12.1). Staircase users likely do this
as a form of exercise, while people using the elevator may either prefer convenience
or simply be in a hurry. Distinguishing between these groups for summarisation
purposes works better if looking at another facet of movement: walking speeds.
You select the speed facet and indeed find it nicely separates three major movement
flows, two to the elevators (fast and slow movers), and one to the staircase.
However, you wonder if in the two groups of elevator users there might also be some
people who would rather have used the stairs but simply did not find them within
a reasonable time (they are around that large pillar, and the direction sign is a little
small, after all). This group of people would probably exhibit some sort of hesitant
movement: walking into the lobby relatively briskly, then slowing down or mean-
dering while looking for the stairs, and then making a beeline to the elevators after
giving up their search. To see whether such a group exists, you select both subsets
of trajectories belonging to all elevator users from the previous analysis step, and
apply yet other movement facets to find and summarise new subgroups according to
your hypothesis: variations of movement speeds, variations from average positions
(i.e., deviations from straight-line movement), and a combination of the two. To
make sure your new summarisation really captures the assumed behaviour of “tak-
ing a beeline”, you interactively play with the detail settings for the measure applied
to the “positional deviation” facet – specifically, you adjust and visually check how
different settings for dividing trajectories into episodes affect the generated sum-
marisations. You find that indeed there is a group of would-be stair users, and that
using four episodes captures their behaviour best (i.e., on average the first quarter of
their path represents walking in, the middle two quarters are spent searching, and the
last quarter captures the beeline to the elevators).
This example, adapted from [67], illustrates how a visual analysis approach is applied
on top of the results of computational video content analysis: during the analysis
process, the human analyst interactively selects interesting subsets based on a com-
bination of movement data facets, lets the computer re-calculate a summarisation,
and derives conclusions from its visualisation (Fig. 12.1). By further drilling down
on sub-subsets the analyst interactively defines a hierarchical model that captures
all relevant patterns in the examined data set with respect to the analysis task. By
contrast, a non-interactive approach would be much less flexible and might not be
able to account for the specifics of a given task. For example, security personnel
would be looking for a different subgroup and facet structure that is better suited to
distinguish “normal” from “susceptible” behaviour.
12.2 Specifics of this kind of phenomena/data
While images and videos are very well aligned with human perception and can
be a rich source of information to a human viewer, analysing large collections of
images or extracting important information from many hours of surveillance video
records may be too difficult, time consuming, and/or boring for humans. At the same
time, such data are also poorly suited for computational analysis. That is, without
additional metadata, an image or a single frame from a video is just a meaning-
less grid of colour values, or ‘pixel soup’. Any analysis thus typically starts with
derivation of some kind of structured data from the pixels. There are two possible
approaches:
• Characterise each image as a whole by various summary statistics (such as the
mean, quantiles, or frequencies by value intervals) of the lightness, hue, and sat-
uration of the pixels. For a more refined characterisation, images are divided into
several parts by a coarse grid, and the summaries are computed for the parts.
• Detect specific kinds of relevant objects (e.g., human faces in photos, organs
in medical images, moving entities in video records, etc.) and derive attributes
characterising these objects, such as the position, size, and colour components.
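The first approach can be sketched with Python's standard library; the "image" below is a synthetic list of RGB pixels, and the exact choice of statistics is illustrative:

```python
import colorsys
from statistics import mean, quantiles

def image_features(pixels):
    """Summary statistics of hue, lightness, and saturation for an
    image given as a flat list of (r, g, b) tuples in 0..255."""
    hls = [colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
           for r, g, b in pixels]
    feats = {}
    for i, channel in enumerate(("hue", "lightness", "saturation")):
        values = [p[i] for p in hls]
        q1, q2, q3 = quantiles(values, n=4)
        feats[channel + "_mean"] = mean(values)
        feats[channel + "_median"] = q2
        feats[channel + "_iqr"] = q3 - q1
    return feats

# a tiny synthetic "image": half dark blue, half light grey pixels
img = [(20, 20, 120)] * 8 + [(200, 200, 200)] * 8
f = image_features(img)
# f["lightness_mean"] is about 0.53 for this half-dark, half-light image
```

For the refined variant mentioned above, the same function would simply be applied to each cell of a coarse grid over the image.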
There exist many computational tools and algorithms for image and video process-
ing that can be utilised for deriving structured data. However, before doing that, it
is reasonable to assess the quality of the images and take precautions against possi-
ble impacts of the quality issues, if detected. Thus, images may be noisy or contain
artefacts resulting from compression. Other relevant attributes of image quality are
sharpness, contrast, and the presence of a moiré pattern. Automated evaluation of image
quality can be performed using special algorithms for no-reference image quality
assessment (NR-IQA) [76], that is, algorithms measuring the perceptual quality of
an image without access to a high-quality reference image. Image quality can be
improved using techniques for image enhancement [73, 94].
12.4 Visual Analytics techniques
Structured data derived from images or video frames can be used for assessing the
similarity between the images or frames. The analyst chooses or defines a distance
function, i.e., a numeric measure of the difference between images, which is based
on the differences between the values of the derived attributes; see Section 4.2. As
for usual multi-attribute data, the most commonly used distance functions are the
Euclidean and Manhattan distances. Once a distance function is defined, some di-
mensionality reduction method (Section 4.4) is applied to represent the images by
points in an abstract two-dimensional space, so that similarity of images is repre-
sented by proximity of the corresponding points. Figure 12.2 shows examples of
visual displays created with the use of this technique. When the image collection is
relatively small, the available display space may permit showing reduced versions
of the images (thumbnails), as on the left of Fig. 12.2. In this example, the images
are not only placed on a plane but also connected into a tree-like structure, which is
achieved using a point-placement technique called Neighbor-Joining (NJ) similarity
tree [46]. When there is not enough space for thumbnails, the images are represented
by dots, as on the right of Fig. 12.2. In any case, the analyst should be able to view
the images corresponding to the thumbnails or dots in the display. Interactive func-
tions allow the analyst to select images for viewing.
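The distance-computation step can be sketched in a few lines; the feature vectors below are invented stand-ins for image-derived attributes, and `math.dist` gives the Euclidean distance:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def manhattan(a, b):
    """Manhattan (city-block) distance between two attribute vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# hypothetical feature vectors derived from three images
features = {
    "img_a": [0.53, 0.12, 0.40],
    "img_b": [0.55, 0.10, 0.42],
    "img_c": [0.10, 0.80, 0.05],
}

def distance_matrix(features, d):
    """Pairwise distances between all images under distance function d."""
    names = list(features)
    return {(p, q): d(features[p], features[q])
            for p in names for q in names}

dm = distance_matrix(features, dist)
# img_a is much closer to img_b than to img_c
```

A dimensionality reduction method then takes such a distance matrix as input and places each image as a point on the plane.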
Fig. 12.2: Spatialisation of smaller and larger collections of images based on simi-
larity of image features. Source: [50].
Fig. 12.3: The use of spatialisation for representing contents of video recordings
in a summarised form. The Time Curve technique ([28]; Section 8.5.3) has been
applied to a video of an animated map showing the worldwide cloud coverage and
precipitation over one year (top) and an eight-minute movie (bottom). Source: [28].
The analysis task in the paper [65] is to observe and examine changes in the human
spleen, which is the largest organ in the lymphatic system, based on a series of medical
images obtained by means of computed tomography (CT). The authors of the paper
utilise advanced image processing techniques for detection of the spleen in each
image and extraction of its characteristics, including the metric dimensions (length,
width, and thickness), shape, and texture.
The overall analytical workflow is schematically represented in Fig. 12.4. Image
processing techniques are used for segmenting the images, detecting the spleen, and
extracting its characteristics, or features. The derived data are stored in a database
Fig. 12.4: A workflow designed for a specific application problem (source: [65])
demonstrates a general approach to image analysis with visual analytics.
Fig. 12.5: A visual display supporting image analysis combines pictures with plots
showing derived numeric data. Source: [65].
together with data about the patients. At the stage of analysis, a doctor can select
the data of a specific patient and use visual representations of the spleen features
and appearance to observe the temporal progression of the spleen condition of this
patient. Simultaneously, the system searches in the database for other patients with
similar spleen condition and temporal evolution. The spleen characteristics obtained
from different CT scans are compared using a distance function suitable for multi-
attribute data. The authors of the paper apply a special version of the cosine similar-
ity function (Section 4.2) allowing attributes to have different weights. To compare
the time series of the spleen conditions of different patients, they use another dis-
tance function, the Dynamic Time Warping (DTW) [112].
For a selected patient, the doctor can see plots and graphs representing the numeric
characteristics of the spleen as well as 3D representations of the appearance of the
spleen reconstructed from the tomography data, as shown in Fig. 12.5, left. In par-
allel, cases similar to the selected one are presented in the same way. Thus, the right
part of Fig. 12.5 shows the most similar case to the one on the left. Both cases have
the same decrease in the organ volume.
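Both kinds of distance functions can be sketched in a few lines. These are generic textbook formulations of weighted cosine similarity and DTW, not the specific variants of [65], and the series of organ volumes are invented:

```python
from math import sqrt

def weighted_cosine(a, b, w):
    """Cosine similarity with per-attribute weights."""
    num = sum(wi * x * y for wi, x, y in zip(w, a, b))
    na = sqrt(sum(wi * x * x for wi, x in zip(w, a)))
    nb = sqrt(sum(wi * y * y for wi, y in zip(w, b)))
    return num / (na * nb)

def dtw(s, t, d=lambda x, y: abs(x - y)):
    """Dynamic Time Warping distance between two numeric time series."""
    INF = float("inf")
    D = [[INF] * (len(t) + 1) for _ in range(len(s) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = d(s[i - 1], t[j - 1]) + min(D[i - 1][j],
                                                 D[i][j - 1],
                                                 D[i - 1][j - 1])
    return D[-1][-1]

# hypothetical organ volumes over consecutive scans of two patients
p1 = [300, 295, 280, 260, 255]
p2 = [300, 298, 296, 282, 262, 256]   # a similar decrease, different pacing
dtw(p1, p2)  # small compared to the distance to an increasing series
```

DTW allows the two series to be compared even though they have different lengths and pacing, which is exactly why it suits per-patient scan histories.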
Another example [83] also comes from the field of medical image analysis. The im-
ages are MRI (magnetic resonance imaging) scans of the lumbar spine area. Image
processing tools are used to detect the lumbar spine in the images and reconstruct a
3D model (i.e., a mesh consisting of polygons) of the spine shape from each scan.
Besides the 3D mesh, the central line of the lumbar spine canal is extracted. It cap-
tures essential information about the deformation of the spine shape. Because all models
are reconstructed in the same way, the corresponding points of the meshes can be easily matched,
and the same applies to the central lines; therefore, differences between meshes and
between lines can be straightforwardly measured based on the distances between
corresponding points. Besides, it is easy to compute an “average shape” from a
set of meshes or lines. These opportunities can be utilised for supporting analysis
through clustering and comparative visualisation.
Figure 12.6 demonstrates application of clustering to image-derived spine data of
multiple patients. The patients have been clustered according to the similarity of
the central lines of the lumbar spines. The light grey bars show the sizes of the
clusters (the bluish segments correspond to the female patients). The 3D images
above the bars represent the average spine shapes corresponding to the clusters.
The colouring encodes the local differences of the shapes with respect to a selected
reference shape. In this example, the average shape of cluster 4 (in the middle of the
plot) has been selected as the reference. Red denotes the differences on the X-axis,
blue on the Y-axis, and green on the Z-axis.
The examples in this section demonstrate the use of image processing techniques
when it is necessary to analyse particular objects reflected in images rather than the
whole images. The resulting derived data can be visualised and analysed as usual
multi-attribute data. However, these data usually cannot capture all information es-
sential for analysis that is contained in the image, whereas human eyes can easily
do this. Therefore, visualisations of derived numeric data need to be combined with
representations of the appearance of the extracted objects. Both examples in the
Figs. 12.5 and 12.6 include visualisations of 3D models, due to the nature of the
objects and the possibility to reconstruct the models from the original data. In other
cases, appearances of objects can be represented by 2D excerpts from the origi-
nal images, or, depending on the application, it may be even more appropriate to
show the whole images, where the objects can be seen together with the surround-
ing context. In any case, however sophisticated the computational techniques applied
for image processing and analysis may be, the unique capabilities of human vision are
irreplaceable.
Fig. 12.7: After extraction of object movements from video, they can be analysed
using a space-time cube and other visualisation techniques suitable for movement
data. Source: [98].
12.5 General scheme for visual analysis of unstructured data
As we wrote in Section 12.2, unstructured data, such as images and video, are meant
to be perceived and understood by humans, whereas computers can only deal with
some kinds of structured data derived from the unstructured data. Since it is not
possible to obtain such structured data that comprehensively capture the human-
[Schematic figure: a general scheme for visual analysis of unstructured data. From the unstructured data, one can compute statistics of pixel values, detect relevant objects (yielding snippets, shapes, positions, and trajectories), or determine the background; descriptive numeric attributes are defined from these results; a similarity measure (distance function) over the attributes then supports spatialisation, search by similarity, clustering, and reconstruction of shape models.]
For creating models, numerous methods and algorithms are developed in statistics,
machine learning, and various domain-oriented sciences. According to the types of
the input and/or output variables, there are several large classes of models:
• Classification models, also called classifiers, predict values of qualitative, or cat-
egorical, attributes, usually with a small set of possible values, called classes.
When there are only two values, the classification is called binary; otherwise,
it is called multi-class classification. The common modelling methods include
Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and Support
Vector Machine (SVM).
• Numeric value forecast models estimate values of a numeric attribute. This class
includes a range of quantitative regression models. Regression models working
with a single input variable are called univariate, and models working with mul-
tiple input variables are called multivariate. Linear and polynomial regression
are univariate regression models. Multivariate models include Stepwise Regres-
sion, Ridge, Lasso, Elastic Net, and Regression Tree. Besides, there exist spatial
regression modelling methods, which are used when the values of the output vari-
able are distributed in space. These methods account for the property of spatial
dependence and autocorrelation (Section 9.3.1).
• Time series models predict the future behaviour of time-varying output variables
based on their past behaviour. These models account for the specifics of temporal
phenomena and data (Section 8.2), particularly, temporal dependency (autocorre-
lation) and possible influence of time cycles. Two commonly used forms of time
series models are autoregressive models (AR) and moving-average (MA) mod-
els. These two are combined in ARMA and ARIMA (autoregressive integrated
moving average) models, which are used for stationary and non-stationary time
series, respectively. A time series is stationary if its statistical properties, such
as the mean, variance, and autocorrelation, are constant over time, and non-
stationary if these properties change over time.
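As a minimal illustration of the autoregressive idea, the following fits an AR(1) model by least squares around the series mean; for a stationary series with |a| < 1, the forecasts decay towards the mean. The data are invented:

```python
from statistics import mean

def fit_ar1(series):
    """Least-squares fit of an AR(1) model
        x[t] - m = a * (x[t-1] - m) + noise
    around the series mean m."""
    m = mean(series)
    x = [v - m for v in series]
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(v * v for v in x[:-1])
    return m, num / den

def forecast(series, steps, model):
    """Iterate the fitted AR(1) recurrence to predict future values."""
    m, a = model
    last = series[-1] - m
    out = []
    for _ in range(steps):
        last = a * last
        out.append(m + last)
    return out

# a stationary series oscillating around 10
s = [10, 11, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 10.05, 9.95]
model = fit_ar1(s)
pred = forecast(s, 3, model)
# forecasts approach the series mean, as expected for |a| < 1
```

MA, ARMA, and ARIMA models extend this recurrence with moving-average terms and differencing, but the fit-then-iterate pattern is the same.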
From another perspective, models can be categorised into deterministic and proba-
bilistic. A deterministic model returns a single output for each given input. A prob-
abilistic model estimates the probabilities of different possible outputs and returns a
probability distribution among the possible outcomes. For example, in binary classi-
fication between classes A and B, a deterministic model returns either A or B, while
an outcome of a probabilistic model may be “70% A and 30% B”. In fact, most
classification modelling methods create probabilistic models.
Modelling methods can also be distinguished based on the technology involved in
model creation. Currently, a very popular technology is neural networks, which
“learn” the relationship between input and output variables through training. In the
context of this book, the modelling technology is irrelevant, as well as the specifics
of different modelling methods. We may refer to specific methods in the examples,
but our goal is to introduce the general principles of model building and demonstrate
how they can be fulfilled using visual analytics approaches and techniques.
In the following discussion, we shall use a few more concepts and terms specific
to modelling. First of all, what makes a model good? On the one hand, it should
accurately represent the available data. It means that for any combination of values
of the input variables that exists in the data the model returns the value of the output
variable that is the same as or very close to the corresponding value in the data. The
difference between the value specified in the data and the value predicted by the
model is called model error or residual. It is desirable that the model errors are as
low as possible. The process of minimising the model errors is often called model
fitting.
On the other hand, the model needs to generalise the data in order to be applicable
to new data, which have not been used for model construction. This means that the
model needs to represent correctly the general character of the relationship between
the input and output variables but should not represent in fine detail all individual
correspondences between particular values. Typically, data contain random varia-
tions, called noise, due to unavoidable errors in measurement or many different fac-
tors that can affect measurements or observations. For example, the measured size
of a thing may slightly vary depending on the temperature. It is rarely possible to
take all such factors into account. Hence, it should always be assumed that the data
used for modelling contain noise. A good model should leave the noise out. It may
happen that a model with very low errors reflects the noise from the data that were
used for model creation. In this case, the model reproduces these old data well but
fails to give good predictions for new data. This is a phenomenon called overfitting,
and the model is said to be overfitted.
Hence, in building models, people strive to achieve high accuracy (= low errors) but
avoid overfitting (= capturing unessential data variations). To assess the goodness
of a model, people analyse the residuals of the model. The goal is not to make the
residuals as small as possible but to ensure that the residuals are random. For
example, when the model predicts values of a numeric attribute, the mean difference
between the value in the data and the predicted value should be zero. Otherwise,
the model tends to either overestimate or underestimate the value. It may happen,
however, that the overall mean residual is zero, but the mean residuals for different
subsets of the data, such as younger and older people or men and women, deviate
from zero. Therefore, it is not sufficient to just check the mean residual for evalu-
ating the model. It is necessary to analyse the distributions of the residuals that are
relevant to the phenomenon being modelled. These may include the overall value
frequency distribution, the distribution over subsets of entities, the distribution over
space, and the distribution over time. All such distributions should not contain any
patterns except a random pattern.
Other desirable characteristics of a good model are low complexity, understandable
behaviour, fast performance, and low uncertainty for probabilistic models. Creating
a good model certainly requires selection of the right modelling method and appro-
priate setting of its parameters. However, no less important is good selection of the
input variables, or features, that will be used in the model. It would be absolutely
wrong to think that the more features a model involves, the better. On the contrary:
a crucial task in modelling is choosing a minimal subset of independent variables
enabling sufficiently accurate prediction of the value of the dependent variable. This
task is called feature subset selection. Involvement of redundant features entails a
risk of obtaining an over-fitted and biased model. Involvement of irrelevant features
can impair model performance, and it increases model complexity, slows down the
performance, and hinders understanding.
There exist many methods for automatic selection of features. Different techniques
often yield different results depending on the optimisation criteria used. Hence, a
problem of choosing among the resulting feature subsets may arise. Besides, automatic meth-
ods cannot incorporate domain knowledge that would allow better generalisation
but can instead select features that reflect unessential variation in the training data.
These methods also cannot create meaningful and useful new features from what is
available. Hence, it should be understood that “automated variable selection proce-
dures are no substitute for careful thought” [3]. The following example demonstrates
thoughtful selection of features for modelling.
records. The training set will be used for deriving the model and the validation set
for evaluating the predictive capabilities of the model.
The amount of gas consumption, which needs to be predicted, is a quantitative at-
tribute. The most commonly used class of modelling methods for predicting quan-
titative attribute values is quantitative regression. The data scientist does not rely on
automatic selection of relevant features and generation of a model in a black-box
manner. Instead, she prefers a more human-controlled approach by which she can
incorporate her understanding of the modelled phenomenon and build a logically
sound, well-explainable, and thus trustworthy model.
In order to make a good selection of features and understand how they need to be
used in the model, the data scientist needs to investigate the relationships of the
different independent variables to the dependent variable. Figure 13.1 demonstrates
how visual displays can help a human analyst to spot complex relationships, as in
Fig. 13.1(a), or local relationships, as in Fig. 13.1(b), or determine that a feature is
irrelevant to predicting the value of the dependent variable, as in Fig. 13.1(c).
In Fig. 13.1(a), the relationship between the feature X1 and the dependent variable
Y1 is non-monotonic. X1 is certainly relevant to predicting the value of Y1, but the
relationship between X1 and Y1 cannot be adequately represented by a single linear
or polynomial regression model. Instead, a combination of several linear regression
models can be appropriate. For this purpose, the data analyst needs to partition the
range of the values of X1 into intervals so that the relationship between the values
of X1 within each interval and the corresponding values of Y1 is monotonic. Being
monotonic means that value of the dependent variable tends to either increase or
decrease as the value of the independent variable increases.
Fig. 13.2: Percentile plots show relationships between two numeric variables in
cases of large data amounts. Source: [102].
In Fig. 13.1(b), the feature X2 is partly relevant to predicting the value of the output
variable Y1. Specifically, for high values of X2, the corresponding ranges of the
values of Y1 are quite narrow, which means that the value of Y1 can be predicted
quite accurately. However, for low and medium values of X2, Y1 can take almost
any value. It can be reasonable to involve X2 in the model, but use it only in the
cases when the values of X2 are high.
Hence, the data scientist not only selects features but also decides whether it is
reasonable to partition the data into subsets based on the feature values and, if so,
how the ranges of the feature values should be divided. Application of partitioning
means that the data scientist creates a compound model consisting of several simple
models. Each of these simple models is applied under specific conditions.
A scatterplot can clearly show how two variables are related when the data are not
very numerous but does not scale to large amounts of data. Having a large amount
of data, the data scientist uses another kind of visual display, called percentile plot.
The idea is illustrated in Fig. 13.2, left. The value range of the input variable, which
is represented on the X-axis of the plot, is divided into intervals. For each
interval, the statistics of the corresponding values of the output variable are
calculated, namely, the median, the quartiles (i.e., the 25th and 75th percentiles),
and the 5th and 95th percentiles. These statistics are represented along the Y-axis.
The horizontal
black line represents the median, the lower and upper edges of the dark grey vertical
bar correspond to the quartiles, and the lower and upper edges of the light grey bar
correspond to the 5th and 95th percentiles. Such a bar is drawn for each of the value
intervals of the input variable.
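Under the assumption that the data fit in memory, the statistics behind such a plot can be computed with plain Python. The helper names below are ours; only the computed numbers, not the drawing, are shown.

```python
def percentile(sorted_vals, p):
    """Percentile by linear interpolation; `sorted_vals` must be sorted."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def percentile_plot_stats(xs, ys, edges):
    """For each interval [edges[i], edges[i+1]) of the input variable,
    return the five statistics drawn as one bar of a percentile plot:
    the 5th, 25th, 50th (median), 75th, and 95th percentiles of the
    corresponding output values (or None for an empty interval)."""
    bars = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = sorted(y for x, y in zip(xs, ys) if lo <= x < hi)
        bars.append(tuple(percentile(in_bin, p) for p in (5, 25, 50, 75, 95))
                    if in_bin else None)
    return bars

# Two bars for a small synthetic dataset where y simply equals x.
bars = percentile_plot_stats(list(range(10)), list(range(10)), [0, 5, 10])
print(bars[0][2], bars[1][2])  # medians of the two intervals: 2.0 7.0
```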
Figure 13.2 demonstrates two possible ways of dividing the input variable’s value
range into intervals. In the upper part of the figure, the intervals are of equal length.
For example, when the value range from 0 to 100 is divided into 20 equal-length
intervals, the resulting intervals are from 0 up to 5, from 5 up to 10, and so on; the
length of each interval is 5. The problem with such division is that the intervals may
greatly differ in the amounts of data records that include values from these intervals.
Thus, the scatterplot and the histogram on the left of Fig. 13.2 show that most of the
data contain very low values of the input variable. The upper percentile plot does
not show this information. This may lead to a wrong understanding of the relationship.
For example, the plot on the top right of Fig. 13.2 induces an impression of a strong
dependency of the output variable (gas consumption) on the input variable (wind
speed). However, high values of the input variable occur very rarely; therefore, their
correspondences to high values of the output variable may be occasional.
To represent the relationship in a more truthful way, the value range of the input
variable is divided into intervals differing in length but containing approximately
equal amounts of data records. These are called equal-frequency intervals.
The percentile plots in the lower part of Fig. 13.2 have been built based on the equal-
frequency division. The vertical bars differ in their widths, which are proportional to
the lengths of the corresponding intervals of the values of the input variable. These
plots show that the global radiation and the wind speed are irrelevant to predict-
ing the gas consumption, because the values of the output variable (represented by
the statistical summaries) do not substantially change along the ranges of the input
variables.
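Equal-frequency edges can be obtained simply by reading off values at regular positions in the sorted data. The sketch below is our own minimal helper, ignoring refinements such as handling of heavily tied values.

```python
def equal_frequency_edges(values, k):
    """Boundaries of k intervals that each contain approximately the
    same number of data records (an equal-frequency division)."""
    vs = sorted(values)
    n = len(vs)
    inner = [vs[(i * n) // k] for i in range(1, k)]
    return [vs[0]] + inner + [vs[-1]]

# Skewed data: an equal-length division would dump most records into
# the first interval, while the equal-frequency division adapts to the
# density of the values.
skewed = [x * x for x in range(1, 13)]   # 1, 4, 9, ..., 144
print(equal_frequency_edges(skewed, 3))  # [1, 25, 81, 144]
```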
So, the data scientist has an instrument to explore the relationships between the
variables, namely, the percentile plots, which can be used for finding the features
that are the most relevant to predicting the gas consumption. It is reasonable to
begin with constructing a simple univariate regression model based on the most
relevant feature. The percentile plots in Fig. 13.3 suggest that the most relevant
feature is Temperature. The data scientist selects this feature and creates a univariate
regression model M1. Specifically, M1 is a third degree (i.e., cubic) polynomial
function.
Fig. 13.3: Selection of the most relevant feature (Temperature) for the initial model.
Source: [102].
To evaluate the accuracy of the model, the data scientist computes the residuals,
or errors, which are simply the differences between the values in the data and the
values predicted by the model. A frequently used measure of the model error is
RMSE (Root Mean Squared Error), which is calculated as the square root of the
mean of the squared errors. The problem with this measure is that it is hard to judge
whether the value is too high or OK. Thus, the RMSE of M1 is 24853; is it good or
bad? Such a global numeric measure can only be useful when two or more models
are compared; the model with a smaller total error is more accurate. So, the data
scientist needs to build another model and check if the RMSE of the new model is
significantly smaller than the RMSE of M1. However, it is pointless to construct just
any model that somehow differs from M1, but it makes sense to build a model that
can be expected to perform better than M1.
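For reference, RMSE is simply the square root of the mean squared residual; a minimal implementation with made-up example numbers:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root Mean Squared Error: the square root of the mean of the
    squared residuals (observed minus predicted values)."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# On its own the number is hard to judge; it becomes meaningful only
# when two models are compared on the same data.
print(rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # ~0.612
```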
It may seem that a reasonable approach is to create a bivariate regression model
involving the two most relevant features, which are Temperature and Day of Year,
according to Fig. 13.3. However, this is a bad idea for at least two reasons. First,
the combination of the two individually most relevant features will not necessarily
give the most accurate prediction among all possible pairs. It is possible that the
combination of Temperature with another feature may yield better accuracy than
Temperature and Day of Year. However, trying all possible pairs would take too much
effort and time. Second, as we explained in Section 13.1, the absolute values of the
model residuals are not as important as the absence of non-random patterns in their
distributions over different components of the data.
In our example, it is necessary to investigate the distributions of the residuals over
the value ranges of the available variables. This can be done using percentile plots.
Instead of the distributions of the values of the output variable, as in Fig. 13.3, they
can show the distributions of the residuals of a model, as in Fig. 13.4. The plots not
only reveal notable patterns in the distributions of the residuals but also suggest to
the data scientist which feature may be the best to use in combination with Temperature
for refining the model M1. The most prominent pattern is observed for the feature
Hour rather than for Day of Year. For Hour, not only the medians of the residuals
notably differ in the night and day hours but also the percentile ranges are located
on different sides of the zero line. For Day of year, the percentile ranges are wide
and stretch on both sides of the zero line.
Fig. 13.4: Exploration of the distribution of the residuals of the model M1 over the
ranges of feature values. Source: [102].
Fig. 13.5: A schematic representation of the structure of the model M2. It includes
three sub-models, which are functions of the input variables Temperature (t) and
Hour (h).
Hence, the data scientist decides to improve the model using the feature Hour. As it
is seen from the percentile plot, the relationship between Hour and the M1 residuals
is non-monotonous: the values almost do not change during the night, then increase
in the morning, then tend to keep nearly stable over the day, and then decrease in
the evening. It is very difficult to represent this relationship by a single function that
would be sufficiently simple and easy to understand. Therefore, the data scientist
decides to proceed by partitioning the range of Hour and creating a combination of
several partial models rather than a single model. She divides the range of Hour into
three sub-ranges, [0am-6am), [6am-8pm), and [8pm-0am). In these sub-ranges, the
relationships can be represented by polynomials of the second degree, i.e., quadratic.
So, the data scientist creates the next model M2 as a combination of three sub-
models. Each sub-model comprises the terms from the model M1 involving the
variable Temperature and two additional terms, linear and squared, involving the
variable Hour. The structure of the model M2 is schematically represented as a tree
in Fig. 13.5. The RMSE of the new model is 14385, which is a substantial reduction
with respect to M1 (24853).
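The dispatch logic of such a compound model is straightforward to express in code. The sketch below mimics the structure of M2; the sub-range boundaries follow the text, but all coefficients are purely illustrative placeholders, since the paper's fitted values are not given here.

```python
def m2_predict(t, h, sub_models):
    """Compound model in the spirit of M2: the range of Hour is split
    into [0,6), [6,20), and [20,24), and a separate sub-model (a
    function of temperature t and hour h) is applied in each sub-range."""
    if 0 <= h < 6:
        return sub_models["night"](t, h)
    if 6 <= h < 20:
        return sub_models["day"](t, h)
    return sub_models["evening"](t, h)

# Each sub-model: cubic terms in t (inherited from M1) plus linear and
# squared terms in h; all coefficients below are invented.
def make_sub(c0, c_h1, c_h2):
    return lambda t, h: (c0 - 2 * t + 0.1 * t**2 - 0.001 * t**3
                         + c_h1 * h + c_h2 * h**2)

subs = {
    "night":   make_sub(100.0, 0.5, 0.0),
    "day":     make_sub(120.0, 2.0, -0.05),
    "evening": make_sub(110.0, -1.0, 0.0),
}
```

Fitting then means estimating the coefficients of each sub-model separately on the records falling into its sub-range.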
Now the computation and analysis of the residuals is repeated for M2 in the same
way as it was done for M1. The percentile plots (Fig. 13.6) show that the effect of
Hour is captured well as there is no pattern in the distribution of the residuals over
the range of Hour anymore. However, there are patterns in the distributions over Day
of Year and Weekend. At first glance, it seems that there is also a strong pattern
(increasing trend) in the distribution with respect to Wind speed, but the apparent
pattern almost disappears when the division of the value ranges of the variables is
changed from equal-interval to equal-frequency and the percentile plots are re-built;
see the lower row of plots in Fig. 13.6. At the same time, the remaining patterns
for the features Day of Year and Weekend indicate that these features should be
included in the model. In other words, the model needs to account for the seasonal
and weekly variations of the gas consumption.
Again, the percentile plot for Day of Year exhibits a non-monotonous relationship
to the output variable. To capture this relationship, the data scientist again applies
Fig. 13.6: Exploration of the distribution of the residuals of the model M2 over the
ranges of feature values. The plots in the upper row are built based on intervals of
equal length and in the lower row on intervals of equal frequency. Source: [102].
Fig. 13.7: A schematic representation of the structure of the model M3. It includes
six sub-models, which are functions of the input variables Temperature (t), Hour
(h), and Day (d).
partitioning. She divides the year into two periods: from the beginning of April till
the end of September and from the beginning of October till the end of March.
On this basis, she transforms model M2 into model M3 containing 6 sub-models
(Fig. 13.7).
The RMSE of M3 is 10832, which is not as dramatic an improvement with respect
to M2 (14385) as it was with M2 compared to M1 (24853). More importantly, the
refinement of M2 to M3 removes the previously present pattern from the residual
distribution with respect to Day of Year. However, the pattern with respect to Week-
end still remains, indicating that this variable needs to be involved in the model.
Since this is a categorical variable with two values, the data scientist again refines
the model through partitioning, separating the weekends from the weekdays. In
this way, she creates model M4 containing 12 sub-models (Fig. 13.8). Its RMSE
is 10251.7.
While the pattern of the distribution of the model residuals with respect to Wind
Speed can be judged as insignificant, the data scientist, based on her domain
knowledge, suspects that the effect of the wind speed may depend on the temperature. To
check this, the data scientist uses a 2D error plot with one dimension corresponding
to Temperature and the other to Wind Speed. The plot area is divided into cells,
Fig. 13.8: A schematic representation of the structure of the model M4. It includes
12 sub-models, which are functions of the input variables Temperature (t), Hour (h),
and Day (d).
which are painted in shades of two colours (Fig. 13.9, left). Yellow corresponds to
positive residuals, when the values in the data are higher than the predicted values,
which means that the model underestimates the values. Blue corresponds to negative
residuals, which means that the model overestimates the values of the output
variable. The yellow area in the plot in Fig. 13.9, left, shows that model M4 un-
derestimates the gas consumption for combinations of high wind speeds with lower
temperatures. To improve the prediction, the data scientist refines M4 by involving
the variable Wind Speed. The resulting model M5 has the same hierarchical struc-
ture as M4 (Fig. 13.8) but the 12 polynomial functions at the bottom level include
additional terms with the variable Wind Speed. The RMSE of M5 is 10059.8, which
is a modest improvement with respect to M4.
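The aggregation behind such a 2D error plot can be sketched as follows; the grid helpers are our own, and the colouring of cells by the resulting means is left to whatever plotting library is at hand.

```python
from bisect import bisect_right

def cell_index(v, edges):
    """Index of the interval of `edges` that contains v; values outside
    the range are clamped into the first or last cell."""
    return max(0, min(bisect_right(edges, v) - 1, len(edges) - 2))

def error_grid(xs, ys, residuals, x_edges, y_edges):
    """Mean residual per cell of a 2D grid over two input variables.
    A positive cell mean means the model underestimates there (yellow
    in the book's plots); a negative mean means it overestimates (blue)."""
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    sums = [[0.0] * ny for _ in range(nx)]
    counts = [[0] * ny for _ in range(nx)]
    for x, y, r in zip(xs, ys, residuals):
        i, j = cell_index(x, x_edges), cell_index(y, y_edges)
        sums[i][j] += r
        counts[i][j] += 1
    return [[sums[i][j] / counts[i][j] if counts[i][j] else None
             for j in range(ny)] for i in range(nx)]

# Two data points falling into opposite corner cells of a 2x2 grid.
g = error_grid([1.0, 15.0], [1.0, 15.0], [2.0, -4.0],
               [0, 10, 20], [0, 10, 20])
print(g)  # [[2.0, None], [None, -4.0]]
```

Comparing two models (as on the right of Fig. 13.9) amounts to passing `abs(r_old) - abs(r_new)` per record in place of the residuals.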
Since M5 is more complex than M4, the data scientist wants to compare the accuracy
of M4 and M5 in more detail. She again uses 2D error plots, but the colouring of the
cells shows this time the differences between the absolute values of the residuals of
two models. On the right of Fig. 13.9, yellow indicates superiority (i.e., lower errors)
of model M5 and blue means lower errors of M4. For all but one pair of input
variables, the shades of yellow clearly dominate over the entire plot areas whereas
shades of blue occur rarely, and the occurrences are randomly scattered. However, in
the plot of Day of Year against Temperature, which is shown in Fig. 13.9, right, there
is a relatively large area where blue shades prevail; the area is marked in the figure.
It means that M4 gives more accurate predictions than M5 for certain temperatures
during spring and summer. Nevertheless, the data scientist is more satisfied with the
accuracy of M5 and takes it as the final result of the model building process.
Fig. 13.9: Left: Investigating the interplay of the wind speed (vertical dimension)
and the temperature (horizontal dimension). Right: Comparison of the distributions
of the residuals of two models with respect to two input variables, Temperature
(vertical dimension) and Day of Year (horizontal dimension). Source: [102].
The example we have considered involved several activities that are often performed
in the course of model building. We shall call such activities general tasks. These
include:
• Data preparation: Data collection when needed, examination of the properties
of the data, checking the suitability for the modelling, handling problems if de-
tected.
• Feature engineering: Creation of potentially relevant input variables from avail-
able data components.
• Feature selection: Finding a good combination of features to involve in the
model.
• Method selection: Choosing an appropriate modelling method.
• Data division: Dividing the available data into parts one of which will be used
for model creation (this part is usually called training set) and another for model
evaluation (this part is called test set). Some modelling methods additionally
require a validation set, which may be used for setting model parameters or for
choosing one of multiple alternative models.
• Method application: Creation of a model by applying the chosen method to the
data, which typically involves setting and tuning method parameters.
• Model evaluation: Analysis of model prediction quality and checking for possi-
ble biases. Evaluation also includes testing model sensitivity to small changes in
input data and/or parameter settings.
• Model refinement: Actions to improve the prediction quality, eliminate biases,
and increase robustness.
• Model comparison: Comparing the performance and complexity of two or more
alternative models.
This list of tasks should not be treated as a one-way pipeline that needs to be fol-
lowed once. These tasks (not necessarily all) are performed as steps of an iterative
process with multiple returns to previous steps. Thus, in our motivating example
(Section 13.2), the data scientist repeatedly returned to the task of feature selection
and repeatedly performed method application, model evaluation, and model refine-
ment. The example included feature engineering: it was creation of the attributes
‘Day of Year’, ‘Hour’, and ‘Weekend’ from the timestamps available in the raw
data. The data scientist divided the available data into parts used for model building
(the first three years) and for evaluation (the following two years). The modelling
method (regression) was chosen according to the type of the output variable. The
modelling process included comparison between the models M4 and M5 for decid-
ing if the gain in the accuracy and reduced bias justify the increase of the complexity.
Our motivating example (Section 13.2) demonstrates how visual analysis helps in
fulfilling several tasks in the model building process. Now we shall generalise and
extend what was demonstrated.
Feature construction is often done based on domain knowledge and may not need
help from visual analytics. However, useful features may also be constructed based
on patterns observed in data. In this case, visualisation plays a crucial role as a means
enabling the analyst to observe patterns. Let us take the example of the epidemic
outbreak in Vastopolis (Section 1.2). If we had a goal to build a model predicting
if a person gets sick and, if so, the kind of the disease, we would construct features
reflecting the observed patterns in the spatio-temporal distribution of the outbreak-
related microblog posts. The features would express whether a person had been in
the area affected by air pollution in the next few days after the truck accident and
whether a person had been close to the river in the southwestern part of the city on
the third day after the accident.
Visual analytics is definitely useful for feature selection. In the example in Sec-
tion 13.2, relevant features were selected based on analysis and comparison of fre-
quency distributions of the feature values and model residuals. In the text readability
assessment example from Section 11.1, the analysts explored a visual representation
of the correlations between the features in the form of a correlation matrix, as shown in
Fig. 4.5 [106]. Special ordering of the rows and columns of the matrix by means of
a hierarchical clustering algorithm helped the analysts to detect groups of correlated
features and select the most expressive representatives from the groups. Hence, we
can conclude that visualisation supports feature selection by showing distributions
and correlations.
Method selection does not always require visualisation, but there are cases when an
appropriate method needs to be chosen according to patterns observed in the data,
which makes visualisation essential. The most common case is modelling of time
series. The choice of the method depends on the presence of cyclic variation pat-
terns. Moreover, when the variation is related to two or more cycles, such as weekly
and daily, the analyst’s choice may be to decompose the variation into components
and create a structured model where each component is captured by a special sub-
model. An example can be found in paper [18]. The decomposition of the modelling
problem and definition of the structure of the future model are parts of the method
selection task.
For data division into training and test sets and, when needed, a validation set, the
key requirement is that all sets have the same distribution patterns. Therefore, visual
analytics is certainly relevant as a tool to explore and compare distributions and
uncover patterns. Thus, in the example in Section 13.2, the data scientist could use
percentile plots, as in Fig. 13.3, to compare the distributions of the values of the
output variable over the features in the subsets of the data covering the first three
years and the remaining two years, although it is not said explicitly in the paper as
the data division step is not described in detail [102].
In application of modelling methods to data, model builders often use visual displays
for judging how well the prediction matches the training data. The most commonly
used display is a scatterplot showing data as points and the relationship extracted by
the modelling method as a line. Another common case is to use a time series plot
when modelling temporal variation of a numeric variable. Depending on the nature
and structure of the data, other visualisations may be helpful. For example, mosaic
plots can be used for seeing how a classification model separates classes based on
ordinal or categorical variables [99]. The model builder uses such displays to make
an approximate judgement of the model fitness for deciding whether the parameter
settings of the modelling method need to be changed (for example, the order of
the polynomial regression model) or, possibly, another method needs to be chosen.
This judgement is not yet a full-fledged evaluation of the model but a preliminary
estimation of its potential appropriateness. When it is clearly seen that a model does
not fit, it makes little sense to do a thorough evaluation.
In evaluating a model, it is insufficient to use just statistical indicators of model
quality, which do not provide any hint of what is wrong and how the quality can be
improved; see our arguments concerning RMSE in Section 13.2.3. In our motivating
example, the data scientist used percentile plots (Figs. 13.4 and 13.6) and 2D error
plots (Fig. 13.9) for a detailed inspection of the distributions of the model errors
expressed as numeric values. When the model output is not numeric, it is still
important to inspect the distributions of the correct and wrong model results. In the
following section, we shall show examples of doing this for classification models.
Before describing the examples, we shall discuss the usual practices in assessing the
quality of classification models.
13.5 Further examples: Evaluation and refinement of classification models
Fig. 13.10: Confusion matrices for classifiers with two (left) and three (right)
classes.
For a classification model, good quality means correct assignment of items to their
classes. In practice, a model can rarely achieve absolute correctness. For binary
classifiers, which assign items to two possible classes ‘yes’ and ‘no’, or ‘true’ and
‘false’, there are special terms for correct and wrong assignments. “True positive”
means a correct assignment to the class ‘yes’ and “false positive” means a wrong
assignment to this class. Similarly, “true negative” and “false negative” mean correct
and wrong assignments, respectively, to the class “no”. For multi-class classifiers,
these concepts are applied individually to each class. Thus, a false positive for class
X is an item assigned to class X but actually belonging to another class, whereas
a false negative for class X is an item belonging to class X but assigned to another
class by the classifier. The numbers or percentages of correct and wrong assignments
to each class can be represented in a confusion matrix, as demonstrated in Fig. 13.10.
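Computing such a matrix is essentially a count over (actual, predicted) pairs; a minimal sketch with made-up labels (the helper name is ours):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows correspond to actual classes, columns to predicted classes;
    the diagonal holds the correct assignments."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

actual    = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no", "no", "yes", "no"]
print(confusion_matrix(actual, predicted, ["yes", "no"]))
# [[1, 1], [1, 2]] -> 1 true positive, 1 false negative,
#                     1 false positive, 2 true negatives
```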
• false positive rate: the ratio of the number of false positives to the size of the ‘no’
class, that is, the sum of the numbers of true negatives and false positives.
As we said in Section 13.2.3 regarding regression models, such overall quality met-
rics are not very helpful, because they do not tell the model builder what and how
can be improved.
There are many methods that create probabilistic classifiers whose results are not
crisp assignments of items to single classes but probabilities for the items to belong
to each of the possible classes. These probabilities are also called prediction scores.
The sum of all probabilities is 1; therefore, for a binary classifier, it is sufficient to
return a single value between 0 and 1, where 0 means 100% ‘no’ and 1 means 100%
‘yes’. To assign items to either the ‘yes’ or ‘no’ class, some threshold value between
0 and 1 is chosen; it is often called prediction threshold. An item is then assigned
to the class ‘yes’ if the probability is above the prediction threshold. When using
a multi-class classifier, a possible approach is to assign an item to the class having
the highest probability. Another approach is to apply a threshold, as for a binary
classifier. With this approach, it may happen that the probabilities of all classes are
below the threshold, and some items may get no class assignment.
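The two assignment strategies differ only in whether the threshold may leave an item unassigned; a sketch (the function name is ours):

```python
def assign_class(scores, threshold=None):
    """Turn a probabilistic classifier's scores (class -> probability)
    into a crisp assignment. Without a threshold, the most probable
    class wins; with one, an item whose best score does not exceed the
    threshold gets no class (None)."""
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] <= threshold:
        return None
    return best

scores = {"A": 0.35, "B": 0.33, "C": 0.32}
print(assign_class(scores))       # 'A': the highest probability wins
print(assign_class(scores, 0.4))  # None: all probabilities are below 0.4
```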
Multiclass classifiers can also be built by training multiple binary classifiers and
then combining their outputs to make predictions for individual items. The one-vs-
rest method (also known as one-vs-all) means that each binary classifier is trained
to discriminate one of the classes from the remaining classes. The class of each item
is determined by the classifier that produces the highest score. With the one-vs-one
(or all-vs-all) method, binary classifiers are created for every pair of classes, and
majority voting is used to select the winning class prediction for each item.
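The one-vs-rest scheme is easy to express on top of any binary learner. In the sketch below, the `Centroid` class is a deliberately simple stand-in scorer of our own invention, used only to make the example runnable; any real binary method could take its place.

```python
# Sketch of one-vs-rest built from binary scorers. All names and data
# are hypothetical.

class Centroid:
    """Toy binary scorer: score = closeness to the positive-class centroid."""
    def fit(self, X, y):  # y is a list of 0/1 labels
        pos = [x for x, label in zip(X, y) if label == 1]
        self.c = [sum(col) / len(pos) for col in zip(*pos)]
        return self
    def score(self, x):
        # negated squared distance, so that higher means closer
        return -sum((a - b) ** 2 for a, b in zip(x, self.c))

def one_vs_rest_fit(X, y, classes, learner=Centroid):
    # one binary classifier per class: this class vs. all the rest
    return {c: learner().fit(X, [1 if label == c else 0 for label in y])
            for c in classes}

def one_vs_rest_predict(models, x):
    # the class whose classifier produces the highest score wins
    return max(models, key=lambda c: models[c].score(x))
```

One-vs-one differs only in the training setup: a classifier is fitted for each pair of classes on the items of those two classes, and the predicted class is chosen by majority vote over all pairwise decisions.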
To assess the performance of a classifier and choose a suitable value of the prediction
threshold, model builders plot a so-called ROC (Receiver Operating Characteristic)
Curve, as shown in Fig. 13.11. The X-axis corresponds to the false positive rate and
the Y-axis to the true positive rate. The points on the curve correspond to the values
of these rates obtained with different threshold settings. The threshold value is the
lowest at the right end of the curve and the highest at the left end. With a very low
threshold value (close to 0), all items will be assigned to the given class (in a binary
case, to the ‘yes’ class) irrespective of their actual class membership. This means
that the true positive rate will be 1, but the false positive rate will also be 1, because
all “negative” items will be classified as “positive”. As the threshold increases, fewer
and fewer items are assigned to the given class. When fewer “negative” items are
assigned to the class, the false positive rate decreases. When fewer “positive” items
are assigned to the class, the true positive rate decreases. In an ideal case, only
the false positive rate decreases, but the true positive rate does not decrease. This
means that the plot has the shape of a horizontal line approaching the Y-axis at
the level Y = 1 (the green line in Fig. 13.11). In reality, both rates decrease as the
threshold increases. It is good if the true positive rate decreases much more slowly than the false positive rate. The more slowly the true positive rate decreases, the closer the curve approaches the upper horizontal line. The faster the decrease of the true positive
rate is, the closer the curve approaches the diagonal line, where both rates have
equal values. Thus, in Fig. 13.11, the classifier represented by the blue ROC curve
is better than the classifier represented by the red curve, because the blue curve
deviates from the diagonal more and approaches the level Y = 1 closer than the red
curve.
The quality of a classifier can be expressed numerically by the Area Under the Curve
(AUC): the larger this number is, the better. Thus, the area under the blue curve in
Fig. 13.11 is larger than the area under the red curve.
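Both the ROC points and the AUC can be computed directly from the prediction scores. The sketch below assumes labels coded as 1 (‘yes’) and 0 (‘no’); the function names and data are illustrative:

```python
# Sketch: ROC points and AUC from scratch (hypothetical scores and labels).

def roc_points(scores, labels):
    """Sweep the threshold over all distinct scores; return (fpr, tpr)
    pairs ordered from the highest threshold down to the lowest."""
    pos = sum(labels)
    neg = len(labels) - pos
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing assigned
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points  # ends at (1.0, 1.0): everything assigned to 'yes'

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```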
The ROC curve shows the overall quality of a probabilistic classifier and the impact
of the prediction threshold on the results. However, this information does not suggest
how the quality can be improved. The model builder needs to see what is wrong and
understand the reason for it in order to make targeted improvements. This often
requires direct examination of individual prediction errors of a model.
The most common reasons for classification errors include
• mislabelled data,
• inadequate features to distinguish between classes, and
• insufficient data for generalising from existing examples.
In this example [6], a binary classifier needs to determine whether a text document
(more specifically, a web page) concerns a certain subject, such as cycling. The data
scientist applies a modelling method (specifically, logistic regression) that creates a
probabilistic model. Two sets of labelled documents are used for model training and
testing. To understand how the model performs and how this can be improved, the
data scientist needs to see in detail the model predictions for the labelled documents.
For this purpose, she uses the display shown in Fig. 13.12.
This is a variant of the general visualisation technique called stacked dot plot, where
data items are represented by dots positioned along a horizontal axis according to
values of a numeric attribute. When several items have the same or very close values,
the corresponding dots have the same horizontal position and are arranged in a stack
in the vertical dimension of the display. In Fig. 13.12, the dots have square shapes.
Each square represents a document. It is drawn below the horizontal axis if the
document belongs to the train set and above the axis if the document is from the test
set. The green and red colouring of the squares corresponds to the labels ‘yes’ and
‘no’, respectively. The positions along the horizontal axis correspond to the scores,
that is, the probabilities of the ‘yes’ class, returned by the classifier. The display
includes a vertical line that shows the current prediction threshold.
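The layout logic behind such a stacked dot plot is simple to sketch: scores are grouped into narrow horizontal bins, and dots falling into the same bin are stacked upwards. The bin width and scores below are hypothetical:

```python
# Sketch of stacked dot plot layout: dots with close scores share a
# horizontal bin and pile up vertically instead of overplotting.

def stack_positions(scores, bin_width=0.05):
    """Return an (x, y) position for each score: x is the bin centre,
    y the dot's height within its stack."""
    counts = {}
    positions = []
    for s in scores:
        b = round(s / bin_width)            # index of the horizontal bin
        counts[b] = counts.get(b, 0) + 1
        positions.append((b * bin_width, counts[b]))  # stack upwards
    return positions
```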
For an ideal classifier, all red squares would be at the left end and all green squares
at the right end of the display. In a real case, the red and green squares are mixed.
The data scientist wants to increase the separation between the classes, to be able
to find such a position for the threshold line that all or almost all squares on the
left of it are red and all or almost all squares on the right are green. To find out
how to achieve this, the data scientist needs to understand the reasons for the model
errors, that is, low probabilities for the items labelled ‘yes’ and high probabilities
for the items labelled ‘no’. This requires examination of particular documents. The
interactive display facilitates such examination by providing an opportunity to select
a dot and obtain a hyperlink to the corresponding document.
Let us see how the visualisation can help the data scientist. One possible reason for classification errors is mislabelled data items. Assuming that wrong class
labels occur rarely in the train set, correct data will have higher influence on the
classifier’s training. Therefore, the classifier’s predictions for the mislabelled data
items are likely to correspond to their true classes. In our case, documents wrongly
labelled as belonging to the ‘yes’ class will receive low scores and will appear in
the display as green squares positioned close to the left edge of the plot. Conversely, documents wrongly labelled as representatives of the ‘no’ class will receive high scores and will appear as red squares drawn close to the right edge of the plot.
When the data scientist sees such squares, she can open the corresponding docu-
ments and check if they have correct class labels. Having encountered mislabelled
documents, the data scientist can modify the labels and re-train the classifier. After
each re-training, the display is updated.
Another possible reason for poor class separation is inadequacy of the features that
are used for the classification. When this is the case, items belonging to different
classes may have very similar values of the features. In our example, the features are
presence or absence of specific keywords. The set of keywords may be insufficient
for distinguishing relevant documents (i.e., talking about the subject of interest)
from irrelevant ones. Thus, when the subject is “cycling”, the model may give a low score
to a document about unicycling, which is relevant and has the class label ‘yes’, and
a high score to a document about motorcycling, which is irrelevant and has the class
label ‘no’. The relevant document will be represented by a green square drawn on the left of the plot, and the irrelevant document will appear as a red square positioned on the right.
To find out whether the reason is feature inadequacy, the data scientist should be
able to see which other documents are similar to the documents whose scores do
not correspond to the class labels. Using the interactive tool shown in Fig. 13.12, the
data scientist can select any square, and the tool will link this square with squares
representing similar documents by lines. Another possibility would be to use an ad-
ditional projection plot built with the use of some dimensionality reduction method
(Section 4.4), where the dots representing the documents are positioned according
to the similarity of their features. In such a plot, items with similar features will be
represented by groups (clusters) of close dots. When the feature set is insufficiently
distinctive, green dots will appear in or close to groups of red dots, and the other
way around. When such situations are detected, the data scientist can inspect the
contents of the documents and extend the set of features with additional keywords,
such as “motorcycling” and “unicycling”.
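Checking whether misclassified documents indeed share features with documents of the opposite class can be sketched with a simple set-based similarity. The Jaccard measure and the toy keyword sets below are our illustrative choices, not taken from the discussed paper:

```python
# Sketch: finding the document most similar to a given one by keyword
# features, to inspect whether items of different classes have
# near-identical features. Keyword sets are made up.

def jaccard(a, b):
    """Similarity of two keyword sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar(target, documents):
    """Return the id of the document most similar to `target`."""
    return max((d for d in documents if d != target),
               key=lambda d: jaccard(documents[target], documents[d]))

docs = {
    "d1": {"bike", "wheel", "ride"},                # relevant: cycling
    "d2": {"bike", "wheel", "engine", "motor"},     # irrelevant: motorcycling
    "d3": {"unicycle", "wheel", "ride"},            # relevant: unicycling
}
```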
As mentioned at the end of Section 13.5.1, the data set used for model training
may be insufficient for generalising. When this happens, the model may classify
the examples from the train set well enough but perform poorly for the test set.
Another possible indication of such a problem is the presence of outliers, that is, data
items that are very dissimilar to all others according to the features used for the
classification. In the tool shown in Fig. 13.12, outliers are specially marked, but it
would be even more convenient to use an additional projection plot, as we suggested
in the previous paragraph. In such a plot, outliers would appear as isolated dots
positioned far from all others. An obvious remedy for this problem is to extend the
train set with additional examples that are similar to the outliers or to the items from
the test set that were poorly classified.
Hence, appropriate visualisation of model performance allows model builders to
detect and inspect errors, understand their reasons, and choose appropriate actions
for targeted improvement. To enable error detection, the visualisation must show
model results for individual data items. Detailed inspection is enabled by interactive
operations exhibiting selected data items. For understanding reasons for the errors,
model builders need to compare the features of the data items. This can be enabled
by interactive linking of similar items or by an additional projection plot where
similarities between data items are represented by distances between corresponding
visual marks.
In this example [115], the data scientist analyses the performance of a probabilistic
multi-class classifier. The task is to recognise hand-written digits from 0 to 9; hence,
there are 10 possible classes. The data scientist creates and trains a classification model, which yields an accuracy of 0.87. Can this be improved?
To understand this, the data scientist creates a visualisation with 10 histograms
showing the distribution of the item scores for each class (Fig. 13.13). As we ex-
plained in Section 13.5.1, each item receives some probability value, or score, for
each class. The histogram of a class shows the frequency distribution of the proba-
bilities of this class received by all items. The histograms in Fig. 13.13 are drawn
so that the value axes are oriented vertically; the lower ends correspond to the prob-
ability 0 and the upper ends to the probability 1. The bars are oriented horizontally
and show the value frequencies by intervals of the length 0.1. A distinctive colour is
chosen for each class. These colours are used for showing correct and wrong class
assignments. Some bars in the histograms are painted in uniform solid colours. This
means that all items represented by these bars were correctly recognised as mem-
bers of the respective classes. Other bars contain segments with textured painting.
These segments represent groups of items that were classified wrongly. A textured
segment drawn on the right of the vertical axis represents items that were assigned
to the class corresponding to this histogram but belong to another class, which is
indicated by the colour of the segment painting. A textured segment drawn on the
left of the axis represents items that actually belong to the class of this histogram
but were assigned to another class indicated by the colour of the segment. Hence,
the histograms show the confusions between the classes and the scores received by
the correctly and wrongly classified items.
The overall shapes of the histograms are also highly informative. They show that,
except for the classes C0 and C1, the classifier tends to give quite low class proba-
bilities. In many cases, the probabilities of the winning classes were less than 0.3.
This means that the probabilities of the remaining classes did not differ much from
the highest class probability. This, in turn, means that, even when the classes are
identified correctly, the certainty of the classification results is quite low. A very
small difference in the input data may change the class prediction.
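This notion of certainty can be quantified as the margin between the two highest class probabilities: a small margin signals a prediction that may flip under slight input noise. A minimal sketch with hypothetical scores:

```python
# Sketch: prediction certainty as the gap between the top two class
# probabilities.

def certainty_margin(probs):
    """Difference between the highest and second-highest class probability;
    near-zero means the winning class barely beats the runner-up."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second
```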
Additional information about the between-class confusions is represented by the
small plots drawn above the histograms. One of these plots is enlarged in Fig. 13.14.
Each plot contains multiple polylines (polygonal lines) corresponding to all true
members of the respective class and showing the scores the members of this class
received for each of the ten classes. A plot that has a single high peak and a flat
remainder indicates that all or almost all members of this class received high proba-
bilities of this class, which is very good. The presence of two or more peaks means
that some class members received relatively high probabilities of classes they do not
belong to. Thus, the enlarged plot in Fig. 13.14 indicates that many of the members
of the class C5 received high probabilities of the class C3. For some of them, the
probabilities of C3 were higher than the probabilities of C5, and they were wrongly
assigned to C3. The red segments in the histogram of the class C3 correspond to
these items. Reciprocally, many members of class C3 received relatively high prob-
abilities of C5.
From the histograms and plots in Fig. 13.13, the data scientist gains an overall idea
about the performance of the classifier, but she needs to see details for individual
items for understanding the reasons for the poor performance and finding ways to
improve it. In particular, she needs to look at several members of the classes C3 and
C5 in order to understand the reasons for the confusions between these two classes.
Fig. 13.14: Some histogram bars are transformed into arrays of dots representing
individual data items.
Fig. 13.15: The class probabilities for several selected items are shown by lines
connecting corresponding positions on the histogram axes of the different classes.
In Figs. 13.14 and 13.15, selected items are represented by polylines connecting the
positions on the vertical axes corresponding to the probabilities of the classes these
axes correspond to. In Fig. 13.14, the data scientist has selected one member of the
class C5 that received equal probabilities of the classes C5 and C3. In Fig. 13.15, the
data scientist has selected four members of the class C3 that were wrongly classified
as C5. Since the items being classified are images (of handwritten digits), the data
scientist looks not only at the scores of the selected items but also at the images.
Please note that detailed information for selected items does not need to be repre-
sented by polylines on top of the histogram display, as in Figs. 13.14 and 13.15. It is
possible to use a separate display; even a table, as at the bottom of Fig. 13.15, can be appropriate. What is really important is the selection and detailed inspection of representative problematic items for understanding the reasons for the problems. Such
items can be selected using query tools.
In our example, the data scientist finds that some handwritten variants of the digits
3 and 5 may be hard to distinguish due to low resolution of the input images (7x7
pixels). This is a special case of using inadequate features. Increasing the image
resolution to 14x14 pixels greatly improves the recognition of the digits.
As with a binary classifier (Section 13.5.2), there may be mislabelled examples.
Such an example is likely to receive a high probability of its true class rather than the
class specified by the label. In the histogram display, mislabelled examples will be
manifested as segments located high in some of the class histograms and coloured
differently from the colours of the respective classes.
Quite often, people create several classifiers using different modelling methods in
order to choose the one that performs the best. It may not be enough to compare
the statistical indicators of each model’s quality. When there is no obvious win-
ner in terms of these indicators, it is appropriate to compare the behaviours of the
classifiers, particularly, the distributions of the class probabilities. In Fig. 13.16, the
classification model that we have discussed so far is compared with another model
created for the same classification problem using the same input data but a different
modelling method. Both models have the same accuracy of 0.87, but the distributions
of the class probabilities are very different. While the first model (Random Forest)
tends to give low scores that do not differ much between the classes, the second
model (SVM - support vector machine) has high frequencies of very high scores
(close to 1). The long bars at the tops of the histograms indicate that there are many
items for which one class has a very high probability whereas the probabilities of the
remaining classes are close to 0. Hence, the predictions made by the second model
have much higher certainty than the predictions of the first model. With the first
model, slight variations or noise in the input data may easily flip the prediction from
correct to wrong or vice versa. The second model will be more robust and therefore
should be preferred.
Both examples, with the binary and multiclass classifiers, demonstrate the useful-
ness of visualising the distributions of the prediction scores with indication of the
correct and wrong class assignments. Apart from enabling understanding of the
model behaviour, the visualisation suggests what items should be inspected indi-
vidually for understanding the reasons of wrong classifications and finding ways to
improve the model.
In addition to the visualisations proposed by the authors of the discussed papers [6, 115], we suggest using a projection plot where the items from the train and test sets
are arranged on a plane according to the similarity of their features and coloured
according to the true classes. When items from different classes happen to be close
in the projection plot, it means that they have similar features, which, in turn, may
mean that the set of features currently used is inadequate for distinguishing members
of different classes.
Fig. 13.17: Visualisation of the temporal and statistical distributions of the stan-
dardised residuals of a time series model (source: [36]). 4a: the residuals plotted
over time; 4b: the values of the ACF (autocorrelation function) plotted over differ-
ent time lags; 4c: the Q-Q (quantile-quantile) plot of the quantiles of the residuals
against the quantiles of an “ideal” normal distribution.
The plot labelled 4b shows the values of the ACF (autocorrelation function) plotted over a range of temporal lags. The ACF is the correlation between any two values in a time series with a specific time shift, called the lag.
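The sample ACF at a given lag can be computed from scratch as follows (a minimal pure-Python sketch; the series used in the test is made up):

```python
# Sketch: sample autocorrelation of a series (e.g. model residuals) at a
# given lag.

def acf(series, lag):
    """Correlation of the series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var
```

For residuals that behave like random noise, the ACF values at all non-zero lags should stay close to zero.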
4c is a Q-Q plot, or quantile-quantile plot. Such a plot is used for comparison of
the statistical distributions of two sets of numeric values. Quantiles (e.g., 0.01, 0.02,
..., or 1%, 2%, and so on) of one set are plotted against the corresponding quantiles
of the other set. An x% quantile of a set of numeric values is a particular value vx
such that x% of all values are less than vx and the remaining (100 − x)% are above
vx . When two sets have the same statistical distribution, all dots in the Q-Q plot
fall on the 45◦ diagonal. Deviations from the diagonal indicate how different the
distributions are.
The residuals of a well-fitted model are expected to behave like random noise, which is commonly assumed to follow a normal, or Gaussian, statistical distribution. Therefore, the randomness of the distribution of model residuals can be checked by comparing it with the “ideal” (theoretical rather than real) normal distribution. Hence, the plot labelled 4c is the Q-Q plot of the quantiles of the set of the residuals against the quantiles of the theoretical normal distribution.
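The points of such a Q-Q plot can be computed with only the Python standard library, using `statistics.NormalDist` for the theoretical quantiles. The number of quantiles and the nearest-rank empirical quantile are simplifying choices of ours:

```python
# Sketch: Q-Q points of a sample against the theoretical standard normal
# distribution.

from statistics import NormalDist, mean, stdev

def qq_points(sample, n_quantiles=9):
    """Pair quantiles of the standardised sample with matching quantiles
    of the ideal normal distribution; points near the diagonal indicate
    an approximately normal sample."""
    std = NormalDist()  # mean 0, standard deviation 1
    m, s = mean(sample), stdev(sample)
    z = sorted((x - m) / s for x in sample)
    points = []
    for i in range(1, n_quantiles + 1):
        p = i / (n_quantiles + 1)
        # empirical quantile by the nearest-rank rule
        emp = z[min(int(p * len(z)), len(z) - 1)]
        points.append((std.inv_cdf(p), emp))
    return points
```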
As for any kind of model, building of a time series model is an iterative process, in
which the model builder repeatedly evaluates the current version of the model and
performs actions for its improvement until satisfied with the model quality. This process requires comparing each new version with the previous one. Figure 13.18 demonstrates a possible way of comparing residuals of two
models.
Fig. 13.18: Comparison of residuals of two models in two selected steps (4 and 5)
of an iterative model building process. Source: [36].
Being able to understand why a model makes a certain prediction is extremely im-
portant for the users to gain trust and use the model confidently. Lack of under-
standing may preclude model use in many applications, such as health care or law
enforcement. Understanding of model behaviour also provides insight into how a
model may be improved and supports understanding of the phenomenon being mod-
elled. However, in response to the challenges of big data, there is a growing trend of
developing and using methods of machine learning (ML) that create models whose
operation is not comprehensible to people, so-called “black boxes”. To address the problem of model non-transparency, a special research field called eXplainable Artificial Intelligence (XAI) has even emerged [63]. Artificial intelligence (AI) is the
overarching discipline that includes ML as well as other areas focusing on making
machines behave smart, such as robotics and natural language processing.
Although a large part of the AI algorithms and the resulting models cannot be di-
rectly explained (for instance, deep learning models), XAI methods aim to create
human-understandable explanations of some aspects of the behaviour of these mod-
els. Thus, there are methods that support checking whether a model makes use of
features that a human expert deems important. Other methods explain a black box
model by creating an understandable surrogate model (such as a set of rules or a
decision tree) that is supposed to replicate most of the behaviour of the primary
model. In fact, any explanation of a model’s prediction can be viewed as a model
itself. Based on this view, Lundberg and Lee have proposed a unified framework
for interpreting predictions called SHAP, which assigns each input variable an im-
portance value for a particular prediction [92]. There are also methods revealing
the internal structure of a model and/or flows of data. Such methods may be use-
ful for model developers. A comprehensive survey [63] can be recommended for
understanding the main concepts and the variety of approaches to explaining dif-
ferent aspects of black box models. A brief overview of the major approaches and
representative methods can be found in paper [126].
Since explanations are meant for humans, they often involve visualisations. A good example is the visualisation of the contributions of different input variables to model predictions (see footnotes 2 and 3) based on the SHAP framework [92].
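To give a flavour of per-prediction attribution, here is a deliberately crude sketch based on feature ablation: the contribution of a feature is taken as the change in the model's output when that feature is reset to a baseline value. This is only in the spirit of such methods, not an implementation of SHAP; the toy linear model and all names are ours:

```python
# Sketch: per-prediction feature attribution by ablation (hypothetical
# model and data; NOT the SHAP algorithm).

def ablation_attribution(predict, x, baseline):
    """For each feature i, the drop in the model output when x[i] is
    replaced by baseline[i]."""
    full = predict(x)
    contributions = []
    for i in range(len(x)):
        x_abl = list(x)
        x_abl[i] = baseline[i]
        contributions.append(full - predict(x_abl))
    return contributions

# toy linear model: prediction = 2*x0 + 0.5*x1
predict = lambda x: 2 * x[0] + 0.5 * x[1]
```

For this linear model the ablation value equals the feature's exact additive contribution relative to the baseline; SHAP generalises the idea by averaging such differences over many feature subsets.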
An increasing amount of research in the field of visual analytics is focused on sup-
porting the process of building deep learning models [70]. Model developers use
visual representations of the network architecture, filters, or neuron activations in
response to given input data. We shall not describe these representations, since specific technical knowledge would be required for understanding them. What is important to note is that visual analytics can and should be involved in building AI models, as with other kinds of models.

2 https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
3 https://towardsdatascience.com/explain-any-models-with-the-shap-values-use-the-kernelexplainer-79de9464897a
Nevertheless, before rushing into creating a human-incomprehensible AI model, it is worth considering whether an understandable model could instead serve the intended purpose sufficiently well. We highly recommend reading the paper with the expressive title “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead” [119].
The author Cynthia Rudin convincingly argues that XAI methods provide “expla-
nations” that cannot have perfect fidelity with respect to what the original models
actually do. This entails the danger that the representation of the behaviour of the
target model can be inaccurate for some possible inputs, which limits the trust in
the explanations and in the model itself.
Here is what the author says: “An explainable model that has a 90% agreement with
the original model indeed explains the original model most of the time. However,
an explanation model that is correct 90% of the time is wrong 10% of the time. If
a tenth of the explanations are incorrect, one cannot trust the explanations, and thus
one cannot trust the original black box. If we cannot know for certain whether our
explanation is correct, we cannot know whether to trust either the explanation or the
original model” [119, p.207].
The paper contains a number of other serious arguments against the use of black
box models, even when they are supplied with “explanations” (which is a mis-
leading term, because these are not real explanations). The author also refutes the
widespread beliefs that more complex models are more accurate and that AI meth-
ods are always superior to the “traditional” modelling techniques. “In data science
problems, where structured data with meaningful features are constructed as part
of the data science process, there tends to be little difference between algorithms,
assuming that the data scientist follows a standard process for knowledge discov-
ery” [119, p.207].
Another belief that prompts some people to use AI techniques creating black box
models is that these techniques are capable of capturing hidden patterns in the data that the users are not aware of. The author objects that a transparent model may be able to
uncover the same patterns, if they are important enough to be leveraged for obtaining
better predictions.
It is also worth noting that the existing spectrum of XAI methods does not cover all
types of data. Thus, the authors of the survey [63] note that they did not find any
works addressing the interpretability of models for data other than images, texts, and tabular data.
Together with Cynthia Rudin, we call for responsible use of ML and AI techniques
and avoidance of creating black box models if they are intended to support important
decisions. For solving more serious problems than classification of photos of cats
and dogs, efforts should be put into creation of interpretable models. However, even
when there are good reasons for creating a black box model, the model developer
should not understand the term “Artificial Intelligence” literally and believe that a
machine can be intelligent enough to create a good model without human control
and involvement of human knowledge and intellect. Any kind of model requires
thoughtful preparation of data and making a number of selections and settings, and
any kind of model needs to be carefully verified, checked for sensitivity to the set-
tings made and to variation of input data, and compared with possible alternatives.
Visual analytics approaches are particularly relevant to these activities.
The main and most general principle to obey in model building is that all tasks
involved in the model building process need to be done thoughtfully and responsibly.
We have listed these tasks in Section 13.3 and discussed in Section 13.4 how visual
analytics can help a thoughtful modeller to fulfil these tasks. Here we would like to
attract your special attention to several activities that are not always considered by
model builders.
One of the activities that may not come to the model builder’s mind is decomposi-
tion of the overall modelling problem into sub-problems. A combination of several
partial models may perform much better than a single global model; moreover, sev-
eral partial models may be simpler and easier to understand than a single model
intended to cover everything. Very often appropriate partitioning can be suggested
by domain knowledge or even common sense. Thus, it is well known that in the life
and activities of people weekdays differ from weekends and summers from winters.
The example in Section 13.2 demonstrated how a modelling problem could be aptly
decomposed by taking these differences into account. Essential differences that can
motivate creation of several sub-models exist also in the geographic space: cities
differ from the countryside, coastal areas from inland, and mountains from plains.
Among people, there may be differing subgroups that cannot be adequately repre-
sented in a single universal model. Many other examples can be added to this list.
Essential differences between parts of the subject that is modelled can be known in
advance or expected, or they can be discovered in the process of data analysis prior
to model building. In any case, the model builder needs to have a careful look at
the data to decide whether partitioning is appropriate and, if so, to define the sub-
models to be built and the portions of the data to be used for building and testing
them.
While the task of model testing and evaluation is not likely to be omitted, mod-
ellers very often rely too much on the statistical metrics of model quality and may
not investigate the model behaviour in sufficient detail. As we noted several times,
statistical metrics do not tell what is wrong or where, and do not give the model builder any hint of how the model can be improved. We demonstrated by examples how useful it is to visualise the distribution of the model results and/or model errors over the set of available data and over its components, such as the domains of different attributes and time (the same applies to space). The reason is that a model
may perform differently for different parts of the data. Thus, in predicting a numeric
value, a model can underestimate it for one part of the data and overestimate it for another, while the overall statistics may look acceptable; in particular, the statistical distribution of the
residuals may appear close to normal. For classification models, visualisation of the
distribution of the model results may reveal errors in labelling, insufficient distinc-
tiveness of the features, insufficiently representative set of examples for training, or
low confidence in assigning items to classes. Information provided by visual displays of the distributions of model results or errors can be enlightening and suggestive of suitable ways towards model improvement. One possible way is problem decomposition, as discussed above.
Comparison of models is also a task that can hardly be neglected: since a model needs to be iteratively evaluated and improved, it is necessary at least to compare the next, supposedly improved, version of the model with the previous one. It is also often reasonable to create several variants of a model using different methods or different parameter settings. Such variants also need to be compared. What we said earlier about model evaluation also applies to model comparison. Comparing only statistical indicators of model quality is generally not sufficient. Even statistical graphs, such as the ROC curve (Fig. 13.11) or the Q-Q plot (Fig. 13.18, right), which seem to provide more information than mere numbers, do not tell how the models perform on different subsets of inputs or how much uncertainty there is in their results. More useful information can be gained from a visual comparison of two distributions, which can be juxtaposed, as in Fig. 13.16, or superposed in the same display, as in Fig. 13.18, left. Visualising the distribution of the differences between the model results or residuals, as in Fig. 13.9, right, may be very helpful.
These considerations can be briefly summarised in the following principles:
• Do not expect that the machine is more intelligent than you; apply your knowledge and reasoning, using visualisations that provide you with relevant food for thought.
• Do not rely on overall indicators and statistical summaries; see and inspect the details (not only the devil may be there but also an angel suggesting what to do).
13.9 Questions
Abstract This chapter very briefly summarises the main ideas and principles of visual analytics, while its main goal is to show by example how to devise new visual analytics approaches and workflows using the general techniques of visual analytics: abstraction, decomposition, selection, arrangement, and visual comparison. We take an example of an analysis scenario in which the standard approaches presented earlier in this book do not work, and we demonstrate how a suitable new approach can be constructed by starting with abstract operations and then inventively elaborating these abstractions according to the specifics of the data and analysis tasks.
In the context of data science, visual analytics is used for (1) exploring data and assessing their fitness for purpose; (2) gaining understanding of the piece of the real world reflected in the data; (3) conscious derivation of good and trustworthy computational models. Visualisation is the most effective way of conveying information to the human mind and promoting cognition, but it has principles that must be followed to avoid being misled. Visualisation methods need to be carefully chosen depending on (a) the structure and properties of the data and (b) the goals of the analysis. Interaction techniques complement visualisation by enabling seeing the data from different perspectives, digging into details, or focusing on relevant portions. Visual analytics approaches combine visualisation and interaction with the power of computational processing, thereby enabling an effective division of labour and synergistic cooperation between the human and the computer in data analysis and problem solving, in which each partner can employ its unique capabilities.
In the first part of our book, we introduced several general approaches and classes
of methods that are most commonly used in analytical workflows where human
reasoning plays the leading role. Throughout the book, we presented numerous ex-
amples of the use of these approaches and methods. Some examples focus only on
application of particular techniques. Other examples describe analytical workflows
where techniques are used in combination. In such examples, we tried to present and
explain the reasoning of the analysts and their decisions concerning further steps in
the analysis.
How can you use these examples of analysis processes in your data science practice? A trivial way would be to simply reproduce the presented workflows when you have very similar data and analysis tasks. However, such situations are not likely to happen frequently. Therefore, what you need to learn from the examples is how to
14.4 Example: devising an analytical workflow for understanding team tactics in football 411
choose suitable methods and plan your own workflows based on a critical assessment of what you have (the data) and what you need to do (the tasks), matching these to the capabilities and applicability conditions of the approaches and methods known to you. As one of the main differences between humans and computers is inventiveness and the capability to act reasonably in new situations, you need to engage these capacities for creating new workflows from generic building blocks.
At the most general level, the main visual analytics techniques are abstraction (e.g., by grouping and aggregating), decomposition (e.g., by partitioning or taking complementary perspectives), arrangement (e.g., by spatialisation, re-ordering, or transformation of coordinates), selection (querying and filtering), and visual comparison (e.g., by juxtaposing several displays, superposing several information layers in one display, or explicitly visualising differences). Each of these basic techniques has multiple realisations in various kinds of methods, which are chosen based on the specifics of the particular data and tasks. Seen from the opposite side, each generic technique is an abstraction of multiple specific methods used for common purposes. When you devise an approach to handling your data and achieving your analysis goals, you can begin by considering these generic techniques: how they can help you, for what sub-goals you can use them, in what sequence, how their outcomes can be used or accounted for in applying the other techniques, and so on. In this way, you create an abstract plan of your analysis. After elaborating this plan according to the particulars of your data and tasks, you will instantiate it by choosing specific methods suitable for your data and producing the types of results you need.
In the following section, we shall present one more example of analysis, where we
shall emphasise the process of planning the analysis and choosing suitable tech-
niques.
The example, which is based on the paper [9], is taken from the domain of sport
analytics, specifically, analysis of spatio-temporal data describing movements and
events occurring in a football (a.k.a. soccer) game. The analysis workflow has been
designed in a close collaboration with domain experts who defined analysis tasks,
provided data sets, discussed the approach and methods, and validated findings of
the study. As is usual for specialists in other domains and, more generally, customers of data analysis, the football experts prefer simple and easily understandable methods and workflows that deal with data in transparent and reasoned ways.
Visualisations play an important role as facilitators of understanding of the analysis
process and results.
412 14 Conclusion
Football1 is one of the most popular spatio-temporal phenomena, attracting the attention of billions of people. Professional football requires not only well-trained, brilliant players and wise coaches but also an understanding of the opponent's tactics, strengths, and weaknesses, as well as an assessment of the team's own behaviour. Not surprisingly, many professional clubs nowadays hire data scientists for analysing game data. In almost all important games, the movements of the players and the ball are tracked, resulting in trajectories with high temporal and spatial resolution. In addition to the automatically acquired trajectories, data about game events, such as changes of ball possession, passes, tackles, shots, and goals, are collected semi-automatically or manually. An overview of methods for the acquisition of football game data can be found in the paper [128].
Typically, the positions of the football players and the ball are recorded at a frequency of 25 Hz, i.e., 25 frames per second. For 22 players moving over 2 × 45 minutes (135,000 frames), this produces about 3,000,000 time-stamped position records. This is quite a large amount of data for analysis and for finding relevant patterns. Moreover, the kinds of potentially interesting patterns are very complex. The major interest of the analysis is in the collective behaviours of the teams and the contributions of the individual players to these collective behaviours. The behaviours include two aspects: cooperation within the teams and competition between the teams. The behaviours are not just combinations of spontaneous actions but implementations of certain intended tactics, which need to be revealed in the analysis. However, it is hardly possible to describe the teams' behaviours and tactics using the kinds of patterns that are typically looked for in spatial, temporal, and spatio-temporal data and that we had in our examples in Chapters 8, 9, and 10. Thus, such patterns as spatio-temporal clusters of events, groups of trajectories following similar routes, or occurrences of similar spatial situations are not relevant or not sufficient for characterising team behaviours and tactics.
The data consist of mere positions (x- and y-coordinates) of game participants at
many different time steps plus a few hundreds of records specifying the times, po-
sitions, types, and participants of elementary events. Gaining an overall picture of
the behaviour of a team and understanding of the tactics definitely requires high
degree of abstraction above the elementary level of the raw data. The abstraction
can be enabled by aggregation of the data. However, it would not be appropriate to aggregate the data over the whole game or by equal time steps, as is typically done in aggregating (spatial) events or trajectories into (spatial) time series.
1 https://en.wikipedia.org/wiki/Association_football
The team's behaviour depends on the current circumstances: which team possesses the ball, where on the pitch the ball and the team players are, how many opponents are around, how they are distributed, what the current score is, how much time is left until the end of the game, and so on. It is therefore necessary to consider separately the data from time intervals characterised by different circumstances, which means that we need to apply decomposition to the data (i.e., partition the game into intervals with distinct characteristics) and to the analysis task (i.e., investigate the team behaviour in different classes of situations). We use the term 'situation' in the sense of a combination of circumstances in which the players behave, and we shall use the term 'episode' to refer to all movements, actions, and events that occur during a continuous time interval. A class of situations consists of situations with similar characteristics.
We need to take into account that the total time of any football game includes quite
many intervals of different duration when the game is stopped by the referee. Such
situations, which are called “out of play”, need to be disregarded. For this purpose,
selection techniques can be used. Furthermore, since it may be very difficult to
consider simultaneously the behaviours in all possible classes of situations, selection
is also needed, as a complement to partitioning, to enable focusing on particular
parts of the data one after another.
The tactics of a team is partly reflected in the team formation2 , which is the spa-
tial arrangement of the players in the team characterised by the relative positions
of the players with respect to each other. A formation is often represented (partic-
ularly, in mass media) by showing the average positions of the players on the pitch
computed from all positions they had during a game. However, team tactics also in-
volves certain changes of the formation depending on the situation. For example, a
team may extend in width when possessing the ball and condense when defending.
It is thus reasonable to reconstruct the formations separately for different classes
of situations. Furthermore, while the players of a team are constantly moving on
the pitch, they strive to keep their relative positions. Therefore, it makes sense to
consider, in addition to the positions on the pitch, such an arrangement of the data
that would be independent of the movements on the pitch and represent only the
relative positions of the players with respect to their teammates, as in Fig. 9.6 and
Fig. 9.7.
Naturally, since we are going to apply decomposition and consider team behaviours
in different situation classes, it is necessary to apply visual comparison techniques
to compare these behaviours and understand how the team responds to situation
changes.
Hence, at the high level of abstraction, our approach to analysis will involve the
following components:
2 https://en.wikipedia.org/wiki/Formation_(association_football)
• Decomposition: division of the game into situations and episodes with distinct
characteristics.
• Selection: disregarding irrelevant episodes and focusing on groups of episodes
with particular characteristics.
• Abstraction: aggregation of data from groups of episodes and generation of vi-
sual summaries.
• Arrangement: creation of representations showing the relative players’ positions
irrespective of their movements on the pitch.
• Visual comparison: creation of visualisations showing differences between be-
haviours in different groups of episodes.
Our analysis plan consists of iterative selection of different groups of episodes, ex-
amination and interpretation of visual summaries of these groups of episodes in
terms of the players’ positions on the pitch and their arrangements within the teams,
and comparison of the summaries representing different groups of episodes. By
means of these operations, we hope to understand how the teams’ behaviours de-
pend on the circumstances and ultimately gain an insight into the intended tactics of
the teams.
Let us now elaborate the components of our approach.
Selection. We need a query tool allowing the selection of time intervals based on combinations of attribute values and/or occurrences of certain types of events. In terms of Section 8.4, we need a tool for conditions-based temporal filtering. Our specific requirement for this tool is the ability to ignore momentary changes of conditions, which often happen in such a highly dynamic game as football. For example, during an attack of one team, the ball may be seized for a short moment by the other team but quickly regained by the attacking team. Hence, we need a tool where we can set the minimal meaningful duration of an episode, so that the tool skips shorter intervals of fulfilment of the query conditions and unites shorter intervals of non-fulfilment with the preceding and following intervals of query fulfilment.
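The core logic of such a tool can be sketched in a few lines. The following is a minimal sketch, not the actual tool; it assumes a per-frame boolean evaluation of the query condition, and the frame counts used for the thresholds are hypothetical:

```python
# Sketch of conditions-based temporal filtering with a minimal episode
# duration: short gaps in the fulfilment of the query condition are
# bridged, and episodes that remain too short are dropped.
def extract_episodes(condition, min_len=25, max_gap=12):
    """condition: per-frame booleans; returns (start, end) frame
    index pairs, end exclusive."""
    # 1. Collect maximal runs of frames where the condition holds.
    runs, start = [], None
    for i, ok in enumerate(condition):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(condition)))
    # 2. Unite runs separated by gaps shorter than max_gap frames
    #    (e.g., a momentary loss of ball possession).
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # 3. Skip episodes shorter than the minimal meaningful duration.
    return [(s, e) for s, e in merged if e - s >= min_len]
```

At a 25 Hz frame rate, `min_len=25` corresponds to a 1-second minimal episode duration, and `max_gap=12` bridges interruptions of up to about half a second.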
Abstraction. Since we need to see the contributions of the players to the team's behaviour and the implementation of the tactics, we cannot use aggregation methods in which the positions or movements of all players are aggregated together. Thus, density fields (Fig. 10.27) and flow maps (Fig. 10.25) will not be helpful to us. A suitable approach is the computation and representation of the players' average positions. Since the mean of a set of positions may be affected by outliers, it makes sense to look at the median positions. To see also the variation of the positions around the average, we can build polygons (convex hulls) enclosing certain percentages of the positions (e.g., 50% and/or 75%) taken in the order of increasing distance from the average.
Arrangement. The idea of “team space” (Fig. 9.6) is suitable for our purposes.
The aggregated players’ positions can be computed for the pitch space and for the
team space.
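One simple reading of the team-space idea, sketched below, re-expresses each player's position relative to the current centre of the team, so that the team's collective movement over the pitch is factored out and only the relative positions remain. This is an assumption-laden simplification; the actual transformation behind Fig. 9.6 may involve further steps:

```python
# Sketch of a 'team space' arrangement: positions relative to the
# team centre at each time step, independent of where on the pitch
# the team as a whole is located.
def to_team_space(frame):
    """frame: dict player -> (x, y) pitch position at one time step.
    Returns positions relative to the team centre."""
    n = len(frame)
    cx = sum(x for x, _ in frame.values()) / n
    cy = sum(y for _, y in frame.values()) / n
    return {p: (x - cx, y - cy) for p, (x, y) in frame.items()}
```

Aggregates such as median positions can then be computed in this relative space exactly as in the pitch space.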
Fig. 14.1: An incremental process of query construction in Time Mask. The hori-
zontal dimension represents time. In the vertical dimension, the display is divided
into sections. Each section shows the variation of values of an attribute or a sequence
of events. Categorical attributes are represented by segmented bars, the values be-
ing encoded in segment colours. Numeric attributes are represented by line plots.
Events are represented by dots coloured according to event categories. The yellow vertical stripes mark the time intervals satisfying the query conditions. The blue vertical stripes mark the time intervals selected after applying a duration threshold and/or extension (or reduction, or shifting) of the intervals.
ball possession by one of the teams. Next, we set a duration threshold to disregard
episodes shorter than 1 second. Then, we add a new query condition that the ball
must be in the one third of the pitch adjacent to the goal of the defending team. Finally, we extend the intervals selected by the query by adding 1 second before each
of them. In the lower image, zooming in the temporal dimension has been applied to
show in more detail how the selected intervals are marked in the display. The semi-
transparent painting in yellow colour marks all intervals that satisfy query conditions
in terms of features. The semi-transparent painting in light blue marks the selected
time intervals after applying the duration threshold and time extension.
Fig. 14.2: Average positions of the players and the ball (in blue) under the ball
possession of the red team (top left) and yellow team (top right). The lower image
shows the changes of the average positions due to the changes of the ball possession.
do not include the polygons in the following illustrations. The visual comparison of two or more combinations of players' position aggregates is supported by superposing the combinations in the same visual display and connecting the average positions of each player by lines. Figure 14.2 demonstrates this idea by the example of the differences due to changes of ball possession (after excluding the "out-of-play" intervals). The upper two images in Fig. 14.2 show the average positions of the players and the ball under the possession of the red team (left) and the yellow team (right). The lower image explicitly shows the changes of the average positions from one subset of episodes to the other. Note that the total number of players in each team is higher than the obviously expected value of 11 due to substitutions.
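The change arrows of the kind shown in the lower image of Fig. 14.2 are simple displacement vectors between two sets of per-player aggregates; a minimal sketch, with hypothetical player keys and coordinates:

```python
# Sketch: displacement vectors between the average positions of
# players in two groups of episodes (e.g., possession by one team
# vs. possession by the other).
def displacement_vectors(avg_a, avg_b):
    """avg_a, avg_b: dicts player -> average (x, y). Returns, for each
    player present in both groups, the vector from the position in
    group A to the position in group B."""
    return {p: (avg_b[p][0] - avg_a[p][0], avg_b[p][1] - avg_a[p][1])
            for p in avg_a.keys() & avg_b.keys()}
```

Players appearing in only one group (e.g., due to substitutions) are simply omitted from the comparison.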
The idea of superposing and linking combinations of average positions can be extended to more than two combinations. This will allow us to trace the development of a group of selected episodes by creating a sequence of aggregates corresponding to consecutive time steps along the duration of the episodes, for example, the first second, the second second, the third second, and so on. We shall thus be able to see not only the average positions but also the average movement patterns in the selected episodes.
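The per-step aggregation just described can be sketched as follows; the episode data here are hypothetical lists of per-step positions of one player, aligned at the episode start:

```python
# Sketch: average position of one player at each step along the
# duration of a group of aligned episodes, yielding an average
# movement trace rather than a single average position.
def average_trace(episodes, n_steps):
    """episodes: list of position sequences, one per episode.
    Returns the average (x, y) at each of the first n_steps steps."""
    trace = []
    for t in range(n_steps):
        pts = [ep[t] for ep in episodes if len(ep) > t]
        if not pts:  # no episode lasts this long
            break
        n = len(pts)
        trace.append((sum(x for x, _ in pts) / n,
                      sum(y for _, y in pts) / n))
    return trace
```

Connecting the consecutive trace points for every player produces the kind of superposed, linked aggregate display discussed above.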
Fig. 14.4: Abstracted vectors of the displacements of the players and the ball from the start to the end moments of the long forward passes of the red team. As in the previous figures, there are more than 11 lines per team due to substitutions. The destinations of the passes are marked by cyan dots.
To see the relative displacements of the players of the two teams, we also look at similar summaries of the long-ball episodes presented in the relative spaces of the teams. Figure 14.5 shows the displacement vectors of the players and the ball in the space of the defending yellow team. The vertical axis corresponds to the direction from the goal of the yellow team (bottom) to the goal of the red team (top). This representation reveals an interesting pattern: all long balls of the red team were directed into a small area in the space of the yellow team. A possible reason was that the red team was aware of a certain weakness of the left wing of the yellow team's defence. It seems to have been a part of the red team's tactics to repeatedly send long balls in the direction where they expected to have better chances for a successful reception of the pass and further development of the attack. Perhaps, to implement this tactics, the red team's coach assigned one of the fastest runners of the team to play on the right wing. The team space display also shows that the remaining players of the red team were effectively increasing the area covered by their team and spreading themselves among the opponents.
Fig. 14.5: The abstracted displacement vectors, as in Fig. 14.4, are shown in the
team space of the defending yellow team.
Here we have described the investigation into the team behaviours and tactics in one class of episodes. Examples of analyses focusing on other classes of episodes can be found in the paper [9], from which this example has been taken.
14.4.5 Conclusion
The analytical workflow in this example did not include any advanced computational methods but only easily understandable and implementable techniques. Like many other examples in this book, it demonstrates that human reasoning is the main instrument of analysis. This statement remains valid also when more sophisticated methods of computational processing are used. However, the use of simpler methods, when possible (i.e., when it does not substantially increase the human's workload), gives several advantages. You can better understand what is done and what is obtained, and it will be easier for you to explain this to others. You can more easily validate the results and convince others of their validity.
With this example, we primarily aimed to demonstrate not the analysis process as
such but the process of finding an approach to the analysis. We intentionally took
an analysis scenario in which we could not re-apply any of the analytical workflows
described earlier in the book. Moreover, the common methods that are usually applied to this kind of data (i.e., trajectories; see Section 10.3.3) were not suitable for
this scenario. In these settings, we wanted to show that new approaches to analy-
sis can be devised by inventive application of general ideas and principles of visual
analytics. Inventiveness is one of the key aspects of the superiority of humans over
machines. The philosophy of visual analytics assumes that humans use their advantages and do the interesting part of the work, which requires creative thinking, while the routine, tedious processing is left to computers.
Certainly, this book does not present the science of visual analytics in all its breadth and depth. We introduced the fundamental concepts and ideas, and we made a selection of methods that we deemed sufficiently general and widely applicable, easy to understand, and easy to reproduce. We hope that this book convinces you how exciting it can be to see data represented visually and to gain insights by means of the great power of the human brain. We hope that you will come to love visual analytics, try to learn more about it, and keep track of new developments in this area.
Glossary
Cluster A group of objects that have similar properties or close positions in space
and/or time.
Computer model: In the context of this book, this term refers to any kind of
model (e.g., statistical model, simulation model, machine learning model, etc.) that
is meant to be executed by computers, typically for the purpose of prediction.
Distance function A method that, for two given items, expresses the degree of
their relatedness or similarity by a numeric value such that a smaller value means
higher similarity or relatedness. Distance functions are used for clustering, projec-
tion, search by similarity, detection of frequent patterns, etc. Other names are simi-
larity function and dissimilarity function. See section 4.2.
as bases (i.e., containers) of entities and values of attributes, and a set of entities can
be seen as a base (i.e., a carrier) of attribute values. In this view, the elements of the
base play the role of positions that can be filled by elements of the overlay. Then, the
distribution of a component O over (or in) another component B viewed as a base
is the relationship between the elements of O and the positions in B, that is, which
positions in B are occupied by which elements of O. See section 2.3.1.
Heatmap: A display, such as a matrix or a map, where the display space is divided
into uniform compartments, in which values of a numeric attribute are represented
by colours or degrees of lightness.
Interrelation between two or more components distributed over the same base
means a tendency of particular elements of different components to co-occur at the
same or close positions in the base. Correlation (in the statistical sense) is a special
case of interrelation. See section 2.3.4.
Medoid of a set or cluster of data items of any nature is the data item having the
smallest sum of the distances (in terms of a certain distance function) to all other
data items.
Outlier is an item that cannot be put in a group together with any other items
because of being very different from everything else. See section 2.3.2.
Spatialisation refers to arranging visual objects within the display space in such a
way that the distances between them reflect the degree of similarity or relatedness
between the data items they represent. Spatialisation exploits the innate capability
of humans to perceive multiple things located closely as being united in a larger
shape or structure, that is, in some pattern. It also exploits the intuitive perception of
spatially close things to be more related than distant things. See section 2.3.5.
Visual variables are aspects of a display that can be used for visual representa-
tion of information. These include x-position, y-position, width, height, size, hue,
lightness, saturation, orientation, etc. See section 3.2.2.
References
19. Andrienko, N., Andrienko, G.: State transition graphs for semantic analysis of movement be-
haviours. Information Visualization 17(1), 41–65 (2018). DOI 10.1177/1473871617692841
20. Andrienko, N., Andrienko, G., Barrett, L., Dostie, M., Henzi, P.: Space transformation for
understanding group movement. IEEE Transactions on Visualization and Computer Graphics
19(12), 2169–2178 (2013). DOI 10.1109/TVCG.2013.193
21. Andrienko, N., Andrienko, G., Fuchs, G., Jankowski, P.: Scalable and privacy-respectful in-
teractive discovery of place semantics from human mobility traces. Information Visualization
15(2), 117–153 (2016). DOI 10.1177/1473871615581216
22. Andrienko, N., Andrienko, G., Fuchs, G., Rinzivillo, S., Betz, H.: Detection, tracking, and
visualization of spatial event clusters for real time monitoring. In: 2015 IEEE International
Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10 (2015). DOI 10.
1109/DSAA.2015.7344880
23. Andrienko, N., Andrienko, G., Garcia, J.M.C., Scarlatti, D.: Analysis of flight variability:
a systematic approach. IEEE Transactions on Visualization and Computer Graphics 25(1),
54–64 (2019). DOI 10.1109/TVCG.2018.2864811
24. Andrienko, N., Andrienko, G., Stange, H., Liebig, T., Hecker, D.: Visual analytics for under-
standing spatial situations from episodic movement data. KI - Künstliche Intelligenz 26(3),
241–251 (2012). DOI 10.1007/s13218-012-0177-4
25. Angelini, M., Santucci, G., Schumann, H., Schulz, H.J.: A review and characterization of
progressive visual analytics. Informatics 5(3) (2018). DOI 10.3390/informatics5030031
26. Arnheim, R.: Visual Thinking. London: Faber (1969)
27. Arun, R., Suresh, V., Veni Madhavan, C.E., Narasimha Murthy, M.N.: On finding the natural
number of topics with latent dirichlet allocation: Some observations. In: M.J. Zaki, J.X.
Yu, B. Ravindran, V. Pudi (eds.) Advances in Knowledge Discovery and Data Mining, pp.
391–402. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)
28. Bach, B., Shi, C., Heulot, N., Madhyastha, T., Grabowski, T., Dragicevic, P.: Time curves:
Folding time to visualize patterns of temporal evolution in data. IEEE Transactions on Visual-
ization and Computer Graphics 22(1), 559–568 (2016). DOI 10.1109/TVCG.2015.2467851
29. Bazán, E., Dokládal, P., Dokládalová, E.: Quantitative Analysis of Similarity Measures
of Distributions. In: British Machine Vision Conference (BMVC). Cardiff, United King-
dom (2019). URL https://hal-upec-upem.archives-ouvertes.fr/hal-02299826
30. Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.D.: Matrix reordering meth-
ods for table and network visualization. Computer Graphics Forum 35(3), 693–716 (2016).
DOI 10.1111/cgf.12935
31. Bertin, J.: Semiology of Graphics: Diagrams, Networks, Maps (1983)
32. Blei, D.M.: Probabilistic topic models. Communications of the ACM 55(4), 77–84 (2012).
DOI 10.1145/2133806.2133826
33. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
34. Blokker, E., Furnass, W.R., Machell, J., Mounce, S.R., Schaap, P.G., Boxall, J.B.: Relating
water quality and age in drinking water distribution systems using self-organising maps.
Environments 3(2), 10 (2016)
35. Blondel, V.D., Esch, M., Chan, C., Clérot, F., Deville, P., Huens, E., Morlot, F., Smoreda,
Z., Ziemlicki, C.: Data for development: the d4d challenge on mobile phone data. ArXiv
abs/1210.0137 (2012)
36. Bögl, M., Aigner, W., Filzmoser, P., Lammarsch, T., Miksch, S., Rind, A.: Visual analytics for model selection in time series analysis. IEEE Transactions on Visualization and Computer Graphics 19, 2237–2246 (2013). DOI 10.1109/TVCG.2013.222
37. Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: A comparative evaluation. In: Society for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130, pp. 243–254 (2008)
38. Brandes, U., Erlebach, T.: Network analysis: methodological foundations, vol. 3418.
Springer Science & Business Media (2005)
39. Chae, J., Thom, D., Bosch, H., Jang, Y., Maciejewski, R., Ebert, D.S., Ertl, T.: Spatiotemporal
social media analytics for abnormal event detection and examination using seasonal-trend
decomposition. In: 2012 IEEE Conference on Visual Analytics Science and Technology
(VAST), pp. 143–152 (2012). DOI 10.1109/VAST.2012.6400557
40. Chen, S., Andrienko, N., Andrienko, G., Adilova, L., Barlet, J., Kindermann, J., Nguyen, P.H., Thonnard, O., Turkay, C.: LDA ensembles for interactive exploration and categorization of behaviors. IEEE Transactions on Visualization and Computer Graphics pp. 1–1 (2019). DOI 10.1109/TVCG.2019.2904069
41. Chu, D., Sheets, D.A., Zhao, Y., Wu, Y., Yang, J., Zheng, M., Chen, G.: Visualizing hidden
themes of taxi movement with semantic transformation. In: 2014 IEEE Pacific Visualization
Symposium, pp. 137–144 (2014). DOI 10.1109/PacificVis.2014.50
42. Chuang, J., Manning, C.D., Heer, J.: Termite: Visualization techniques for assessing textual
topic models. In: Proceedings of the International Working Conference on Advanced Visual
Interfaces, pp. 74–77. ACM (2012)
43. Chuang, J., Roberts, M.E., Stewart, B.M., Weiss, R., Tingley, D., Grimmer, J., Heer, J.: Top-
iccheck: Interactive alignment for assessing topic model stability. In: Proceedings of the
2015 Conference of the North American Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, pp. 175–184 (2015)
44. Collins, C., Andrienko, N., Schreck, T., Yang, J., Choo, J., Engelke, U., Jena, A., Dwyer, T.:
Guidance in the human–machine analytics process. Visual Informatics 2(3), 166–180 (2018)
45. Collins, C., Viegas, F., Wattenberg, M.: Parallel tag clouds to explore and analyze faceted text corpora. In: 2009 IEEE Symposium on Visual Analytics Science and Technology (VAST), pp. 91–98 (2009). DOI 10.1109/VAST.2009.5333443
46. Cuadros, A.M., Paulovich, F.V., Minghim, R., Telles, G.P.: Point placement by phylogenetic
trees and its application to visual analysis of document collections. In: Proceedings of the
2007 IEEE Symposium on Visual Analytics Science and Technology, VAST ’07, pp. 99–106.
IEEE Computer Society, USA (2007). DOI 10.1109/VAST.2007.4389002
47. Dang, T.N., Wilkinson, L.: ScagExplorer: Exploring scatterplots by their scagnostics. In:
Visualization Symposium (PacificVis), 2014 IEEE Pacific, pp. 73–80. IEEE (2014)
48. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by la-
tent semantic analysis. Journal of the American society for information science 41(6), 391–
407 (1990)
49. Dykes, J.A.: Exploring spatial data representation with dynamic graphics. Computers &
Geosciences 23(4), 345–370 (1997). DOI 10.1016/S0098-3004(97)00009-5. Exploratory
Cartographic Visualisation
50. Eler, D., Nakazaki, M., Paulovich, F., Santos, D., Andery, G., Oliveira, M.C., Neto, J.,
Minghim, R.: Visual analysis of image collections. The Visual Computer 25, 923–937
(2009). DOI 10.1007/s00371-009-0368-7
51. Endert, A., Ribarsky, W., Turkay, C., Wong, B.W., Nabney, I., Blanco, I.D., Rossi, F.: The
state of the art in integrating machine learning into visual analytics. In: Computer Graphics
Forum, vol. 36, pp. 458–486 (2017)
52. Fisher, D.: Animation for visualization: Opportunities and drawbacks. In: Beautiful
Visualization. O’Reilly Media (2010). Complete book available at
http://oreilly.com/catalog/0636920000617/
53. Gibson, H., Faith, J., Vickers, P.: A survey of two-dimensional graph layout techniques for
information visualisation. Information Visualization 12(3-4), 324–357 (2013). DOI 10.1177/
1473871612455749
54. Gleicher, M.: Explainers: Expert explorations with crafted projections. IEEE Transactions
on Visualization and Computer Graphics 19(12), 2042–2051 (2013)
55. Gleicher, M., Albers, D., Walker, R., Jusufi, I., Hansen, C.D., Roberts, J.C.: Visual compari-
son for information visualization. Information Visualization 10(4), 289–309 (2011)
56. Görg, C., Kang, Y., Liu, Z., Stasko, J.: Visual analytics support for intelligence analysis.
Computer 46(7), 30–38 (2013). DOI 10.1109/MC.2013.76
57. Gou, L., Zhang, X., Luo, A., Anderson, P.F.: SocialNetSense: Supporting sensemaking of
social and structural features in networks with interactive visualization. In: 2012 IEEE Con-
ference on Visual Analytics Science and Technology (VAST), pp. 133–142 (2012)
58. Gould, P.: Letting the data speak for themselves. Annals of the Association of American
Geographers 71(2), 166–176 (1981)
59. Gower, J.C.: A general coefficient of similarity and some of its properties. Biometrics 27(4),
857–871 (1971)
60. Green, M.: Toward a perceptual science of multidimensional data visualization: Bertin and
beyond. ERGO/GERO Human Factors Science 8, 1–30 (1998)
61. Gschwandtner, T., Erhart, O.: Know your enemy: Identifying quality problems of time series
data. In: 2018 IEEE Pacific Visualization Symposium (PacificVis), pp. 205–214 (2018).
DOI 10.1109/PacificVis.2018.00034
62. Gschwandtner, T., Gärtner, J., Aigner, W., Miksch, S.: A taxonomy of dirty time-oriented
data. In: G. Quirchmayr, J. Basl, I. You, L. Xu, E. Weippl (eds.) Multidisciplinary Research
and Practice for Information Systems, pp. 58–72. Springer Berlin Heidelberg, Berlin, Hei-
delberg (2012)
63. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of
methods for explaining black box models. ACM Comput. Surv. 51(5), 93:1–93:42 (2018).
DOI 10.1145/3236009
64. Guo, D., Gahegan, M., MacEachren, A.M., Zhou, B.: Multivariate analysis and geovisu-
alization with an integrated geographic knowledge discovery approach. Cartography and
geographic information science 32(2), 113–132 (2005)
65. Gutenko, I., Dmitriev, K., Kaufman, A.E., Barish, M.A.: AnaFe: Visual analytics of image-
derived temporal features—focusing on the spleen. IEEE Transactions on Visualization and
Computer Graphics 23(1), 171–180 (2017). DOI 10.1109/TVCG.2016.2598463
66. Havre, S., Hetzler, B., Nowell, L.: ThemeRiver: Visualizing theme changes over time. In:
IEEE Symposium on Information Visualization 2000. INFOVIS 2000. Proceedings, pp. 115–
123 (2000). DOI 10.1109/INFVIS.2000.885098
67. Höferlin, M., Höferlin, B., Heidemann, G., Weiskopf, D.: Interactive schematic summaries
for faceted exploration of surveillance video. IEEE Transactions on Multimedia 15(4), 908–
920 (2013). DOI 10.1109/TMM.2013.2238521
68. Höferlin, M., Höferlin, B., Weiskopf, D., Heidemann, G.: Uncertainty-aware video visual
analytics of tracked moving objects. Journal of Spatial Information Science 2011(2), 87–117
(2011). DOI 10.5311/JOSIS.2010.2.1
69. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth Con-
ference on Uncertainty in Artificial Intelligence, UAI’99, pp. 289–296. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA (1999)
70. Hohman, F., Kahng, M., Pienta, R., Chau, D.H.: Visual analytics in deep learning: An in-
terrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer
Graphics 25(8), 2674–2693 (2019). DOI 10.1109/TVCG.2018.2843369
71. Holten, D., Van Wijk, J.J.: Force-directed edge bundling for graph visualization. Computer
Graphics Forum 28(3), 983–990 (2009). DOI 10.1111/j.1467-8659.2009.01450.x
72. Huff, D.: How to lie with statistics. WW Norton & Company (1993)
73. Iwasokun, G.: Image enhancement methods: A review. British Journal of Mathematics &
Computer Science 4, 2251–2277 (2014). DOI 10.9734/BJMCS/2014/10332
74. Iwueze, I., Nwogu, E., Nlebedim, V., Nwosu, U., Chinyem, U.: Comparison of methods of
estimating missing values in time series. Open Journal of Statistics 8(2), 390–399 (2018).
DOI 10.4236/ojs.2018.82025
75. Jankowski, P., Andrienko, N., Andrienko, G.: Map-centred exploratory approach to multiple
criteria spatial decision making. International Journal of Geographical Information Science
15(2), 101–127 (2001)
76. Kamble, V., Bhurchandi, K.: No-reference image quality assessment algorithms: A survey.
Optik 126(11), 1090–1097 (2015). DOI 10.1016/j.ijleo.2015.02.093
77. Kandel, S., Paepcke, A., Hellerstein, J.M., Heer, J.: Enterprise data analysis and visualization:
An interview study. IEEE Transactions on Visualization and Computer Graphics 18(12),
2917–2926 (2012)
78. Keim, D., Andrienko, G., Fekete, J.D., Görg, C., Kohlhammer, J., Melançon, G.: Visual an-
alytics: Definition, process, and challenges. In: A. Kerren, J.T. Stasko, J.D. Fekete, C. North
(eds.) Information Visualization: Human-Centered Issues and Perspectives, pp. 154–175.
Springer, Berlin (2008)
79. Keim, D., Kohlhammer, J., Ellis, G., Mansmann, F. (eds.): Mastering the Information Age:
Solving Problems with Visual Analytics. Goslar: Eurographics Association (2010). DOI
10.2312/14803
80. Keim, D.A., Nietzschmann, T., Schelwies, N., Schneidewind, J., Schreck, T., Ziegler, H.: A
spectral visualization system for analyzing financial time series data. In: EuroVis 2006 –
Eurographics /IEEE VGTC Symposium on Visualization, pp. 195–202. Eurographics Asso-
ciation (2006). DOI 10.2312/VisSym/EuroVis06/195-202
81. Keogh, E., Kasetty, S.: On the need for time series data mining benchmarks: A survey and
empirical demonstration. Data Min. Knowl. Discov. 7(4), 349–371 (2003). DOI 10.1023/A:
1024988512476
82. Kim, M., Kang, K., Park, D., Choo, J., Elmqvist, N.: TopicLens: Efficient multi-level visual
topic exploration of large-scale document collections. IEEE Transactions on Visualization
and Computer Graphics 23(1), 151–160 (2017). DOI 10.1109/TVCG.2016.2598445
83. Klemm, P., Oeltze-Jafra, S., Lawonn, K., Hegenscheid, K., Völzke, H., Preim, B.: Interactive
visual analysis of image-centric cohort study data. IEEE Transactions on Visualization and
Computer Graphics 20 (2014). DOI 10.1109/TVCG.2014.2346591
84. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer-Verlag, Berlin, Heidelberg (2001)
85. Krause, J., Dasgupta, A., Fekete, J.D., Bertini, E.: SeekAView: An intelligent dimensionality
reduction strategy for navigating high-dimensional data spaces. In: Large Data Analysis and
Visualization (LDAV), 2016 IEEE 6th Symposium on, pp. 11–19. IEEE (2016)
86. Kriegel, H.P., Kröger, P., Zimek, A.: Subspace clustering. WIREs Data Mining and Knowl-
edge Discovery 2(4), 351–364 (2012). DOI 10.1002/widm.1057
87. Kruskal, J.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothe-
sis. Psychometrika 29(1), 1–27 (1964)
88. Kucher, K., Kerren, A.: Text visualization techniques: Taxonomy, visual survey, and com-
munity insights. In: Visualization Symposium (PacificVis), 2015 IEEE Pacific, pp. 117–121.
IEEE (2015)
89. Kucher, K., Paradis, C., Kerren, A.: The state of the art in sentiment visualization. Computer
Graphics Forum 37(1), 71–96 (2018). DOI 10.1111/cgf.13217
90. von Landesberger, T., Kuijper, A., Schreck, T., Kohlhammer, J., van Wijk, J., Fekete, J.D.,
Fellner, D.: Visual analysis of large graphs: State-of-the-art and future research challenges.
Computer Graphics Forum 30(6), 1719–1749 (2011). DOI 10.1111/j.1467-8659.2011.
01898.x
91. Liu, S., Andrienko, G., Wu, Y., Cao, N., Jiang, L., Shi, C., Wang, Y.S., Hong, S.: Steering
data quality with visual analytics: The complexity challenge. Visual Informatics 2(4),
191–197 (2018). DOI 10.1016/j.visinf.2018.12.001
92. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Pro-
ceedings of the 31st International Conference on Neural Information Processing Systems,
NIPS’17, pp. 4768–4777. Curran Associates Inc., Red Hook, NY, USA (2017)
93. MacEachren, A.M.: How maps work: representation, visualization, and design. Guilford
Press (1995)
94. Maini, R., Aggarwal, H.: A comprehensive review of image enhancement techniques. Journal
of Computing 2(3) (2010)
95. Malczewski, J.: GIS and multicriteria decision analysis. John Wiley & Sons (1999)
96. Marey, É.J.: La méthode graphique dans les sciences expérimentales et principalement en
physiologie et en médecine. G. Masson (1885)
97. Matejka, J., Fitzmaurice, G.: Same stats, different graphs: generating datasets with varied
appearance and identical statistics through simulated annealing. In: Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems, pp. 1290–1294. ACM (2017)
98. Meghdadi, A.H., Irani, P.: Interactive exploration of surveillance video through action shot
summarization and trajectory visualization. IEEE Transactions on Visualization and Com-
puter Graphics 19(12), 2119–2128 (2013). DOI 10.1109/TVCG.2013.168
99. Migut, M., Worring, M.: Visual exploration of classification models for risk assessment.
In: 2010 IEEE Symposium on Visual Analytics Science and Technology (VAST), pp. 11–18
(2010). DOI 10.1109/VAST.2010.5652398
100. Monmonier, M.: How to lie with maps. University of Chicago Press (1996)
101. Monroe, M., Lan, R., Lee, H., Plaisant, C., Shneiderman, B.: Temporal event sequence sim-
plification. IEEE Transactions on Visualization and Computer Graphics 19(12), 2227–2236
(2013). DOI 10.1109/TVCG.2013.200
102. Mühlbacher, T., Piringer, H.: A partition-based framework for building and validating regres-
sion models. IEEE Transactions on Visualization and Computer Graphics 19, 1962–1971
(2013)
103. Munzner, T.: Visualization analysis and design. CRC press (2014)
104. Nam, J.E., Mueller, K.: TripAdvisor^{ND}: A tourism-inspired high-dimensional space
exploration framework with overview and detail. IEEE Transactions on Visualization and
Computer Graphics 19(2), 291–305 (2013)
105. Oelke, D., Hao, M., Rohrdantz, C., Keim, D.A., Dayal, U., Haug, L., Janetzko, H.: Visual
opinion analysis of customer feedback data. In: 2009 IEEE Symposium on Visual Analytics
Science and Technology, pp. 187–194 (2009). DOI 10.1109/VAST.2009.5333919
106. Oelke, D., Spretke, D., Stoffel, A., Keim, D.A.: Visual readability analysis: How to make
your writings easier to read. IEEE Transactions on Visualization and Computer Graphics
18(5), 662–674 (2012). DOI 10.1109/TVCG.2011.266
107. Openshaw, S.: Ecological fallacies and the analysis of areal census data. Environment and
planning A 16(1), 17–31 (1984)
108. Paatero, P., Tapper, U.: Positive matrix factorization: A non-negative factor model with op-
timal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994).
DOI 10.1002/env.3170050203
109. Pelekis, N., Andrienko, G., Andrienko, N., Kopanakis, I., Marketos, G., Theodoridis, Y.:
Visually exploring movement data via similarity-based analysis. Journal of Intelligent Infor-
mation Systems 38(2), 343–391 (2012). DOI 10.1007/s10844-011-0159-2
110. Pelleg, D., Moore, A.: X-means: Extending k-means with efficient estimation of the number
of clusters. In: Proceedings of the 17th International Conference on Machine Learning, pp.
727–734. Morgan Kaufmann (2000)
111. Radoš, S., Splechtna, R., Matković, K., Ðuras, M., Gröller, E., Hauser, H.: Towards quan-
titative visual analytics with structured brushing and linked statistics. Computer Graphics
Forum 35(3), 251–260 (2016). DOI 10.1111/cgf.12901
112. Ratanamahatana, C.A., Lin, J., Gunopulos, D., Keogh, E., Vlachos, M., Das, G.: Mining
Time Series Data, pp. 1069–1103. Springer US, Boston, MA (2005). DOI 10.1007/0-387-
25465-X_51
113. Ratanamahatana, C., Keogh, E.: Everything you know about dynamic time warping is wrong.
In: Third Workshop on Mining Temporal and Sequential Data (2004)
114. Ray, C., Dreo, R., Camossi, E., Jousselme, A.L.: Heterogeneous Integrated Dataset for Mar-
itime Intelligence, Surveillance, and Reconnaissance (2018). DOI 10.5281/zenodo.1167595
115. Ren, D., Amershi, S., Lee, B., Suh, J., Williams, J.D.: Squares: Supporting interactive perfor-
mance analysis for multiclass classifiers. IEEE Transactions on Visualization and Computer
Graphics 23(1), 61–70 (2017). DOI 10.1109/TVCG.2016.2598828
116. Rinzivillo, S., Pedreschi, D., Nanni, M., Giannotti, F., Andrienko, N., Andrienko, G.: Visu-
ally driven analysis of movement data by progressive clustering. Information Visualization
7(3-4), 225–239 (2008). DOI 10.1057/PALGRAVE.IVS.9500183
117. Rosenberg, D., Grafton, A.: Cartographies of time: A history of the timeline. Princeton
Architectural Press (2013)
118. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987). DOI
10.1016/0377-0427(87)90125-7
119. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and
use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019). DOI
10.1038/s42256-019-0048-x
120. Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on Com-
puters 18(5), 401–409 (1969). DOI 10.1109/T-C.1969.222678
121. Sankararaman, S., Agarwal, P.K., Mølhave, T., Pan, J., Boedihardjo, A.P.: Model-driven
matching and segmentation of trajectories. In: Proceedings of the 21st ACM SIGSPA-
TIAL International Conference on Advances in Geographic Information Systems, SIGSPA-
TIAL’13, pp. 234–243. Association for Computing Machinery, New York, NY, USA (2013).
DOI 10.1145/2525314.2525360
122. Seo, J., Shneiderman, B.: A rank-by-feature framework for unsupervised multidimensional
data exploration using low dimensional projections. In: Information Visualization, 2004.
INFOVIS 2004. IEEE Symposium on, pp. 65–72. IEEE (2004)
123. Shearer, C.: The CRISP-DM Model: The new blueprint for data mining. Journal of Data
Warehousing 5(4), 13–22 (2000)
124. Shneiderman, B.: The Eyes Have It: A Task by Data Type Taxonomy for Information Visual-
ization. In: Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343 (1996).
DOI 10.1109/VL.1996.545307
125. Sips, M., Köthur, P., Unger, A., Hege, H., Dransch, D.: A visual analytics approach to mul-
tiscale exploration of environmental time series. IEEE Transactions on Visualization and
Computer Graphics 18(12), 2899–2907 (2012). DOI 10.1109/TVCG.2012.191
126. Spinner, T., Schlegel, U., Schäfer, H., El-Assady, M.: explAIner: A visual analytics frame-
work for interactive and explainable machine learning. IEEE Transactions on Visualization
and Computer Graphics pp. 1–1 (2019). DOI 10.1109/tvcg.2019.2934629
127. Stahnke, J., Dörk, M., Müller, B., Thom, A.: Probing projections: Interaction techniques for
interpreting arrangements and errors of dimensionality reductions. IEEE Transactions on
Visualization and Computer Graphics 22(1), 629–638 (2016)
128. Stein, M., Janetzko, H., Lamprecht, A., Breitkreutz, T., Zimmermann, P., Goldlücke, B.,
Schreck, T., Andrienko, G., Grossniklaus, M., Keim, D.A.: Bring it to the pitch: Combining
video and movement data to enhance team sport analysis. IEEE Transactions on Visualiza-
tion and Computer Graphics 24(1), 13–22 (2018)
129. Stolper, C.D., Perer, A., Gotz, D.: Progressive visual analytics: User-driven visual explo-
ration of in-progress analytics. IEEE Transactions on Visualization and Computer Graphics
20(12), 1653–1662 (2014). DOI 10.1109/TVCG.2014.2346574
130. Tague, N.R.: The Quality Toolbox, Second Edition. ASQ Quality Press (2005)
131. Thom, D., Bosch, H., Koch, S., Wörner, M., Ertl, T.: Spatiotemporal anomaly detection
through visual analysis of geolocated twitter messages. In: 2012 IEEE Pacific Visualiza-
tion Symposium, pp. 41–48 (2012). DOI 10.1109/PacificVis.2012.6183572
132. Thom, D., Jankowski, P., Fuchs, G., Ertl, T., Bosch, H., Andrienko, N., Andrienko, G.: The-
matic patterns in georeferenced tweets through space-time visual analytics. Computing in
Science & Engineering 15(03), 72–82 (2013). DOI 10.1109/MCSE.2013.70
133. Thomas, J., Cook, K.: Illuminating the Path: The Research and Development Agenda for
Visual Analytics. IEEE (2005)
134. Tobler, W.R.: A Computer Movie Simulating Urban Growth in the Detroit Region. Economic
Geography 46, 234–240 (1970)
135. Tominski, C., Schumann, H.: Interactive Visual Data Analysis. AK Peters Visualization
Series. CRC Press (2020). DOI 10.1201/9781315152707
136. Torkamani, S., Lohweg, V.: Survey on time series motif discovery. Wiley Interdiscip. Rev.
Data Min. Knowl. Discov. 7 (2017)
137. Tufte, E.R.: The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT,
USA (1986)
A
abstraction 20, 32, 60, 80, 93, 209, 214–216, 249, 334, 339, 411, 414, 417, 418
adjacency matrix 212, 214
aggregation 74, 80, 162, 176–179, 214–216, 233, 236–238, 242, 243, 277–284, 302, 306, 308, 310, 319, 323, 325, 326, 330, 339, 353, 411, 412, 415, 417, 418
alignment (pattern type) 40–42
analysis goal 27
analysis process 4, 20, 21, 28–30, 153, 154
angular distance 97, 98, 102
animation 58, 60
Anscombe’s Quartet 52, 53
anytime algorithm 74
arc diagram 241, 250
artificial space 43–47, 424
autocorrelation 234, 268, 376, 401, 402
average silhouette width 130–133, 135

B
bar chart 64, 66, 127, 131, 254, 279
base of distribution 33–35, 159–162, 171, 172, 423–425
Bertin, Jacques 55–60, 62, 212
Bezier curve 74
binning 173
box plot 52–55, 65, 73, 154, 156, 197
brushing 43–45, 83–85

C
categorical data 102
central tendency 154
centroid 135–137, 139, 140, 158, 423
change 37, 40
change blindness 61, 74
chart map 12, 66, 164–166, 279, 281
choropleth map 41, 57, 58, 66, 71, 79–81, 85–87, 277, 284, 287, 289–291, 329
classes of numeric attribute values 77, 162, 173
cluster 117, 122, 256
cluster quality measure 130
cluster: spatial 11, 17, 40–42, 105, 112, 117, 118, 137, 272, 273
cluster: spatio-temporal 14, 40, 106, 137, 232, 236, 319–324, 353
clustering 89, 95, 108, 122, 144, 145, 192, 199, 221, 249, 250, 252–255, 258, 259, 272, 281, 322, 325, 331, 333, 334, 339, 355, 367, 370, 373, 423
clustering, density-based 122–125, 130, 137, 177, 286–289, 293, 319–324, 335–337, 339, 353
clustering, hierarchical 122, 124, 125, 129, 130, 343, 389
clustering, partition-based 122–127, 129–131, 137, 140, 143, 325–327, 339
clustering, progressive 138, 333, 335–337
cold spot (pattern type) 41
colour propagation 43, 44, 85, 86
colour scale 70, 71, 76, 77, 79–81, 129, 140
colour scale, 2D 43, 44, 119, 121, 122, 129, 254, 257
colour scale, diverging 71, 77, 79–81
concentration (pattern type) 40
conditioning 85, 87
confusion matrix 391
connectivity matrix 67
constancy (pattern type) 37
Contingency Wheel 201, 204, 205, 209
coordinated multiple views 43–45, 85–87
correlation 98, 108–110, 188, 190, 191, 290, 343, 389
correlation network 190, 191
cosine similarity 97, 98
cumulative frequency curve 78
curse of dimensionality 98, 100, 111, 116, 145, 224

D
dasymetric map 176
data embedding 43, 45, 47, 89, 93, 95, 108, 111–113, 116, 117, 129, 130, 138–141, 143–145, 159, 182–184, 191–193, 195, 199, 214, 216, 223, 225, 253, 255–259, 348, 350
data precision 152, 179, 311, 313
data properties 28, 153, 310–313, 334, 340
data quality 154, 160, 235, 298, 313–318, 337
data science process 21, 28–30, 153, 154
data science, principles 48
DBSCAN clustering 123
decrease (pattern type) 37, 38
dendrogram 124, 125
density (pattern type) 39, 40
density map 278–281, 286, 332, 334, 336, 415
density plot 118, 119
density trend (pattern type) 40
deviation plot 187
diagram map 12, 66, 164–166, 281
dimensionality reduction 43, 45, 47, 74, 93–95, 99, 100, 129, 130, 144, 145, 192, 224, 347, 366, 395
discretisation 77, 79, 80, 173
dispersion 154
display space 61–63, 68, 72
display space embedding 61
display space partitioning 61, 72
display space sharing 61, 72
distance function 46, 89, 94–101, 103, 107, 108, 116, 117, 122, 138, 145, 159, 160, 219, 220, 253, 331–334, 339, 355, 367, 423
distance: spatial 105, 106, 123, 124, 339
distance: spatio-temporal 106, 320, 321, 339
distance: temporal 106, 339
distribution 33–36, 48, 62, 72, 99, 100, 154, 156–162, 164, 165, 171, 172, 190, 292, 293, 318, 319, 342, 377, 386, 389, 390, 397, 400, 401, 406, 423–425
distribution types 33, 72
distribution, base of 33–35, 99, 159–162, 171, 172, 423–425
distribution, normal 158
distribution, overlay of 33–35, 159, 160, 172, 424, 425
distribution, skewed 72, 107, 108, 157, 290, 390
distribution: co-distribution 41, 42, 66
distribution: frequency 36–38, 62, 77, 78, 156–158, 174, 187
distribution: spatial 11–13, 39–43, 99, 127, 132, 133, 159, 189, 268–272, 275, 277, 281, 283, 284, 286, 287, 290, 293, 294, 306, 319, 329, 332, 344, 353–355, 358
distribution: spatio-temporal 13–16, 40, 164–166, 319, 344, 388
distribution: temporal 7, 9, 10, 18, 37, 39, 76, 77, 99, 162, 164, 166–168, 229, 230, 319, 329, 330, 344, 355, 358
dot map 11, 15, 66
dot plot 53–55, 64, 66, 156
dot plot, stacked 394
dynamic time warping 100, 101, 105, 369

E
edge bundling 74, 214
embedding space 43–45, 47, 111–121, 129, 130, 144, 145, 159, 182–184, 199, 223, 225, 348, 350
embedding space: continuous 111, 116, 117
Index 437
G
Gantt chart 65
geo-coding 176, 264, 265
geographic coordinates 105, 263–267, 274, 285, 297, 307, 318
geomarketing 265
goal of analysis 27
Gower’s distance 103
GPS, global positioning system 171, 263, 312, 313
graph 67, 104, 201, 207, 209–212, 214–217, 219–225, 308
graph centrality measures 104, 210, 221
graph, weighted 209–212, 217–219
great-circle distance 105, 274, 286, 288
grouping 89, 93, 94, 411

K
k-means clustering 122, 130, 254
KDE 279, 280, 286
knowledge 22, 24, 28, 134, 172
Kohonen map 112

L
LDA 143
Levenshtein distance 103
line chart 64, 70, 233, 247
line graph 64, 233
LSA 143

M
Mahalanobis distance 85, 97