
Data Visualization Techniques

Dr. D. Koteswara Rao


What Is Visualization?
● We define visualization as the communication of information
using graphical representations.
● Pictures have been used as a mechanism for communication
since before the formalization of written language.
● A single picture can contain a wealth of information, and can
be processed much more quickly than a comparable page of
words.
● Pictures can also be independent of local language, as a
graph or a map may be understood by a group of people with
no common tongue.
Visualization in Everyday Life
● It is an interesting exercise to consider the number and types of data and information visualizations that we encounter in our normal activities. Some of these might include:
● a table in a newspaper, representing data being discussed in an article;
● a train and subway map with times used for determining train arrivals and
departures;
● a map of the region, to help select a route to a new destination;
● a weather chart showing the movement of a storm front that might influence your
weekend activities;
● a graph of stock market activities that might indicate an upswing (or downturn) in
the economy;
● a plot comparing the effectiveness of your painkiller to that of the leading brand;
● a 3D reconstruction of your injured knee, as generated from a CT scan.
Why Is Visualization Important?
● There are many reasons why visualization is
important.
● Perhaps the most obvious reason is that we are
visual beings who use sight as one of our key
senses for information understanding.
The Visualization Process
Cont..
● The process of starting with data and generating an
image, a visualization, or a model via the computer
is traditionally described as a pipeline—a sequence
of stages that can be studied independently in terms
of algorithms, data structures, and coordinate
systems.
● These processes or pipelines are different for
graphics, visualization, and knowledge discovery,
but overlap a great deal. All start with data and end
with the user.
Cont..
● To visualize data, one needs to define a mapping from the data to
the display.
● There are many ways to achieve this mapping.
● The user interface consists of components, some of which deal with
data needing to be entered, presented, monitored, analyzed, and
computed.
● These user interface components are often input via dialog boxes,
but they could be visual representations of the data to facilitate the
selections required by the user.
● Visualizations can provide mechanisms for translating data and task
into more visual and intuitive formats for users to perform their tasks.
Cont..
● This means that the data values themselves, or perhaps
the attributes of the data, are used to define graphical
objects, such as points, lines, and shapes; and their
attributes, such as size, position, orientation, and color.
● Thus, for example, a list of numbers can be plotted by
mapping each number to the y-coordinate of a point and
the number’s index in the list to the x-coordinate.
● Alternatively, we could map the number to the height of
a bar or the color of a square to get a different way to
view the data.
● Another significant, yet often overlooked, component
of the visualization process is the provision of
interactive controls for the viewing and mapping of
variables (attributes or parameters).
● While early visualizations were static objects, printed on paper or other fixed media, modern visualization is a very dynamic process, with the user controlling virtually all stages of the procedure, from data selection and mapping control to color manipulation and view refinement.
Relationship between Visualization
and Other Fields
● Originally, visualization was considered a subfield of computer
graphics, primarily because visualization uses graphics to display
information via images.
● Visualization applies graphical techniques to generate visual displays of data. Here, graphics is used as the communication medium.
● Visualization is the application of graphics to display data by mapping data to graphical primitives and rendering the display.
● Visualization is more than simply computer graphics. The field of visualization encompasses aspects from numerous other disciplines, including human-computer interaction, perceptual psychology, databases, statistics, and data mining, to name a few.
The Scatterplot
● One of the most basic visualization techniques
is the scatterplot.
● This will give us some experience with
transforming data into a visual representation
that is understood by most readers.
● The scatterplot is one of the earliest and most
widely used visualizations developed. It is
based on the Cartesian coordinate system.
Cont..
● The following pseudocode renders a scatterplot of circles.
● Records are represented in the scatterplot as circles of
varying location, color, and size.
● The x- and y-axes represent data from dimension numbers xDim and yDim, respectively.
● The color of the circles is derived from dimension number cDim.
● The radius of the circles is derived from dimension number rDim, as well as from the upper and lower bounds for the radius, rMin and rMax.
Pseudocode
Scatterplot(xDim, yDim, cDim, rDim, rMin, rMax)
for each record i
do x ← Normalize(i, xDim)
   y ← Normalize(i, yDim)
   r ← Normalize(i, rDim, rMin, rMax)
   MapColor(i, cDim)
   Circle(x, y, r)
Pseudocode Conventions
● data—The working data table. This data table is assumed to contain only numeric values. The working data table is a subset of the original data table.
● m—The number of dimensions (columns) in the working data table.
Dimensions are typically iterated over using j.
● n—The number of records (rows) in the working data table. Records are
typically iterated over using i as the index.
● Normalize(record, dimension), Normalize(record, dimension, min, max)—A function that maps the value for the given record and dimension in the working data table to a value between min and max.
● Circle(x, y, radius)—A function that fills a circle centered at the given (x, y) location, with the given radius, with the current color of the graphics environment.
Cont..
● Polyline(xs, ys)—A function that draws a polyline
(many connected line segments) from the given
arrays of x- and y-coordinates.
● Color(color)—A function that sets the color state of
the graphics environment to the specified color.
● MapColor(record, dimension)—A function that sets the color state of the graphics environment to the color derived from applying the global color map to the normalized value of the given record and dimension.
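● Using these conventions, the Scatterplot pseudocode above can be written as the following minimal runnable Python sketch. It is an illustration under assumptions, not the textbook's implementation: matplotlib stands in for the graphics environment, viridis for the global color map, and the small data table is invented.

import matplotlib.pyplot as plt

def normalize(data, record, dim, lo=0.0, hi=1.0):
    # Map the value at (record, dim) into [lo, hi], using the min and
    # max of that dimension over the whole working data table.
    column = [row[dim] for row in data]
    dmin, dmax = min(column), max(column)
    t = (data[record][dim] - dmin) / (dmax - dmin) if dmax > dmin else 0.5
    return lo + t * (hi - lo)

def scatterplot(data, x_dim, y_dim, c_dim, r_dim, r_min, r_max):
    # One circle per record: position, color, and radius are each
    # driven by one dimension of the data table.
    for i in range(len(data)):
        x = normalize(data, i, x_dim)
        y = normalize(data, i, y_dim)
        r = normalize(data, i, r_dim, r_min, r_max)
        color = plt.cm.viridis(normalize(data, i, c_dim))  # global color map
        plt.scatter([x], [y], s=r ** 2, color=color)  # s is area in points^2
    plt.show()

# A tiny invented working data table: rows are records, columns are dimensions.
data = [(1.0, 2.0, 0.5, 3.0), (2.0, 1.0, 1.5, 1.0), (3.0, 3.0, 2.5, 2.0)]
scatterplot(data, x_dim=0, y_dim=1, c_dim=2, r_dim=3, r_min=8.0, r_max=24.0)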

Visualization types
● It is useful to categorize visualizations based on the purpose they serve. These
include the following:
● Exploration. The user possesses a data set and wants to examine it to ascertain its
contents and/or whether a particular feature or set of features is present or absent.
● Confirmation. The user has determined (e.g., via computational analysis) or
hypothesized that a given feature is present in the data and wants to use the
visualization to verify this fact or hypothesis.
● Presentation. The user is trying to convey some concept or set of facts to an audience. Such visualizations typically add labeling and stronger colors to emphasize and support the author’s conclusion.
● Interactive Presentation. The user is providing a presentation as above, but one that is interactive, typically for an individual to explore.
● Interactive presentations, such as those available nowadays on the web, take much more time to prepare.
Data Foundations
● Every visualization starts with the data that is to be displayed.
● A first step in addressing the design of visualizations is to examine the
characteristics of the data.
● Data comes from many sources; it can be gathered from sensors or
surveys, or it can be generated by simulations and computations.
● Data can be raw (untreated), or it can be derived from raw data via some
process, such as smoothing, noise removal, scaling, or interpolation.
● A typical data set used in visualization consists of a list of n records, (r1, r2, . . . , rn). Each record ri consists of m (one or more) observations or variables, (v1, v2, . . . , vm).
● An observation may be a single number/symbol/string or a more complex structure.
Cont..
● An independent variable vi is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set.
● A dependent variable vj is one whose value is affected by a variation in
one or more associated independent variables. Temperature for a region
would be considered a dependent variable, as its value is affected by
variables such as date, time, or location.
● Thus we can formally represent a record as
ri = (iv1, iv2, . . . , iv_mi, dv1, dv2, . . . , dv_md),
where mi is the number of independent variables and md is the number of dependent variables. With this notation we have m = mi + md.
● In many cases, we may not know which variables are dependent or
independent.
Types of Data
● In its simplest form, each observation or
variable of a data record represents a single
piece of information.
● We can categorize this information as being
ordinal (numeric) or nominal (nonnumeric).
Subcategories of each can be readily defined.
Cont..
● Ordinal. The data take on numeric values:
binary—assuming only values of 0 and 1;
discrete—taking on only integer values or from a specific subset (e.g., (2, 4,
6));
continuous—representing real values (e.g., in the interval [0, 5]).
● Nominal. The data take on nonnumeric values:
categorical—a value selected from a finite (often short) list of possibilities
(e.g., red, blue, green);
ranked—a categorical variable that has an implied ordering (e.g., small, medium, large);
arbitrary—a variable with a potentially infinite range of values with no
implied ordering (e.g., addresses).
Structure within and between
Records
● Data sets have structure, both in terms of the means of representation (syntax ),
● and the types of interrelationships within a given record and between records
(semantics).
● Scalars, Vectors, and Tensors
● An individual number in a data record is often referred to as a scalar. Scalar
values, such as the cost of an item or the age of an individual, are often the
focus for analysis and visualization. Multiple variables within a single record can
represent a composite data item.
● For example, a point in a two-dimensional flow field might be represented by a
pair of values, such as a displacement in x and y. This pair, and any such
composition, is referred to as a vector.
● Other examples of vectors found in typical data sets include position (2 or 3
spatial values), color (a triplet of red, green, and blue components), and phone
number (country code, area code, and local number).
Cont..
● Scalars and vectors are simple variants on a more
general structure known as a tensor.
● A tensor is defined by its rank and by the dimensionality
of the space within which it is defined.
● It is generally represented as an array or matrix. A scalar
is a tensor of rank 0, while a vector is a tensor of rank 1.
● One could use a 3 × 3 matrix to represent a tensor of rank 2 in 3D space, and in general, a tensor of rank M in D-dimensional space requires D^M data values.
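● The rank/size relationship is easy to see with array shapes; in this small numpy sketch (the arrays are invented for illustration), ndim is the rank M and size is D^M for D = 3:

import numpy as np

scalar = np.array(42.0)             # rank 0: one value (3**0 = 1)
vector = np.array([1.0, 0.5, 2.0])  # rank 1 in 3D: 3**1 = 3 values
tensor = np.eye(3)                  # rank 2 in 3D: 3 x 3 = 3**2 = 9 values

for t in (scalar, vector, tensor):
    print(t.ndim, t.size)  # ndim is the rank M; size is D**M with D = 3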
Geometry and Grids
● Geometric structure can commonly be found in data sets,
especially those from scientific and engineering domains.
● The simplest method of incorporating geometric structure in
a data set is to have explicit coordinates for each data
record.
● Thus, a data set of temperature readings from across the
country might include the longitude and latitude associated
with the sensors, as well as the sensor values.
● In modeling of 3D objects, the geometry constitutes the
majority of the data, with coordinates given for each vertex.
Cont..

● Sometimes the geometric structure is implied. When this is the case, it is assumed that some form of grid exists, and the data set is structured such that successive data records are located at successive locations on the grid.
● For example, if one had a data set giving elevation at uniform spacing across a surface, it would be unnecessary to include the coordinates for each record; it would be sufficient to indicate a starting location, orientation, and the step size horizontally and vertically.
● There are many different coordinate systems that are used for grid-structured data, including Cartesian, spherical, and hyperbolic coordinates.
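● To make the implicit-grid idea concrete, here is a small sketch (the function name and values are hypothetical) that recovers each record’s coordinates in a uniform row-major grid from only a start location and step sizes:

def grid_coordinate(index, n_cols, x0, y0, dx, dy):
    # Recover the (x, y) location of a record in a row-major uniform
    # grid, so no per-record coordinates need to be stored.
    row, col = divmod(index, n_cols)
    return (x0 + col * dx, y0 + row * dy)

# Elevations sampled on a 3-column grid starting at (100.0, 200.0),
# spaced 0.5 apart horizontally and 1.0 vertically (invented numbers).
elevations = [12.1, 12.4, 12.9, 13.0, 13.2, 13.5]
for i, z in enumerate(elevations):
    x, y = grid_coordinate(i, n_cols=3, x0=100.0, y0=200.0, dx=0.5, dy=1.0)
    print(f"record {i}: ({x}, {y}) -> elevation {z}")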
Other Forms of Structure
● A timestamp is an important attribute that can be associated with a
data record.
● Time perhaps has the widest range of possible values of all
aspects of a data set, since we can refer to time with units from
picoseconds to millennia.
● Another important form of structure found within many data sets is
that of topology, or how the data records are connected.
Connectivity indicates that data records have some relationship to
each other.
● Thus, vertices on a surface (geometry) are connected to their
neighbors via edges (topology), and relationships between nodes
in a hierarchy or graph can be specified by links.
Data Preprocessing
● Metadata and Statistics
● Missing Values and Data Cleansing
● Normalization
● Segmentation
● Sampling and Subsetting
● Dimension Reduction
● Mapping Nominal Dimensions to Numbers
● Aggregation and Summarization
● Smoothing and Filtering
● Raster-to-Vector Conversion
Cont..
● In most circumstances, it is preferable to view the original raw data.
● In many domains, such as medical imaging, the data analyst is
often opposed to any sort of data modifications, such as filtering or
smoothing, for fear that important information will be lost or
deceptive artifacts will be added.
● Viewing raw data also often identifies problems in the data set,
such as missing data, or outliers that may be the result of errors in
computation or input.
● Depending on the type of data and the visualization techniques to
be applied, however, some forms of preprocessing might be
necessary.
Metadata and Statistics
● Information regarding a data set of interest (its metadata)
and statistical analysis can provide invaluable guidance in
preprocessing the data.
● Metadata may provide information that can help in its
interpretation, such as the format of individual fields within
the data records.
● It may also contain the base reference point from which
some of the data fields are measured, the units used in the
measurements, the symbol or number used to indicate a
missing value (see below), and the resolution at which
measurements were acquired.
Statistical analysis
● Various methods of statistical analysis can provide us with useful
insights.
● Outlier detection can indicate records with erroneous data fields.
Cluster analysis can help segment the data into groups exhibiting
strong similarities.
● Correlation analysis can help users eliminate redundant fields or
highlight associations between dimensions that might not have been
apparent otherwise.
● The most common statistical plot is the distribution of data, in the
form of a histogram.
● The most common statistics about data include the mean and the standard deviation.
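● As a minimal illustration (the sample values are invented), the mean, the standard deviation, and a coarse fixed-width histogram can be computed as follows:

import statistics

values = [2.1, 2.4, 2.2, 9.8, 2.3, 2.0, 2.5]  # a hypothetical data column

print(statistics.mean(values), statistics.stdev(values))

# A simple fixed-width histogram: count the values falling in each bin.
lo, hi, bins = min(values), max(values), 4
width = (hi - lo) / bins
counts = [0] * bins
for v in values:
    counts[min(int((v - lo) / width), bins - 1)] += 1
print(counts)  # -> [6, 0, 0, 1]; the outlier 9.8 sits alone in the last bin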
Missing Values and Data Cleansing
● One of the realities of analyzing and visualizing
“real” data sets is that they often are missing
some data entries or have erroneous entries.
● Missing data may arise for several reasons, including, for example, a malfunctioning sensor, a blank entry on a survey, or an omission on the part of the person entering the data.
● Erroneous data is most often caused by human
error and can be difficult to detect.
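● One simple cleansing strategy among many (the sentinel value and data below are assumptions for illustration) is to detect the missing-value marker noted in the metadata and impute the mean of the observed values:

MISSING = -999.0  # sentinel marking a missing value, as noted in the metadata

def impute_mean(column):
    # Replace missing entries with the mean of the observed entries.
    observed = [v for v in column if v != MISSING]
    fill = sum(observed) / len(observed)
    return [fill if v == MISSING else v for v in column]

print(impute_mean([3.0, MISSING, 5.0, 4.0]))  # -> [3.0, 4.0, 5.0, 4.0]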
Normalization
● Normalization is the process of transforming a data set
so that the results satisfy a particular statistical property.
● A simple example of this is to transform the range of
values a particular variable assumes so that all numbers
fall within the range of 0.0 to 1.0. Other forms of
normalization convert the data such that each dimension
has a common mean and standard deviation.
● Normalization is a useful operation since it allows us to
compare seemingly unrelated variables.
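● Both forms of normalization mentioned above are short to state directly; a minimal sketch with invented values:

def min_max(column, lo=0.0, hi=1.0):
    # Rescale values so that they all fall within [lo, hi].
    cmin, cmax = min(column), max(column)
    return [lo + (v - cmin) / (cmax - cmin) * (hi - lo) for v in column]

def z_score(column):
    # Shift and scale values to mean 0 and standard deviation 1.
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

print(min_max([10.0, 20.0, 40.0]))  # -> [0.0, 0.333..., 1.0]
print(z_score([10.0, 20.0, 40.0]))  # -> mean 0, unit standard deviation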
Segmentation
● In many situations, the data can be separated into contiguous
regions, where each region corresponds to a particular
classification of the data.
● For example, an MRI data set might originally have 256 possible
values for each data point, and then be segmented into specific
categories, such as bone, muscle, fat, and skin.
● Simple segmentation can be performed by just mapping disjoint
ranges of the data values to specific categories.
● A typical problem with segmentation is that the results may not coincide with regions that are semantically homogeneous (undersegmented), or may consist of large numbers of tiny regions (oversegmented).
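● The simple range-based segmentation described above might look like this sketch; the intensity ranges and tissue labels are invented placeholders, not clinically meaningful thresholds:

SEGMENTS = [  # disjoint value ranges mapped to categories (hypothetical)
    (0, 50, "background"),
    (50, 120, "fat"),
    (120, 200, "muscle"),
    (200, 256, "bone"),
]

def segment(value):
    # Classify a raw data value by the range it falls in.
    for lo, hi, label in SEGMENTS:
        if lo <= value < hi:
            return label
    return "unknown"

print([segment(v) for v in (10, 90, 150, 230)])
# -> ['background', 'fat', 'muscle', 'bone']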
Sampling and Subsetting
● Often it is necessary to transform a data set with one spatial
resolution into another data set with a different spatial resolution.
● For example, we might have an image we would like to shrink or
expand, or we might have only a small sampling of data points
and wish to fill in values for locations between our samples.
● In each case, we assume that the data we possess is a discrete sampling of a continuous phenomenon, and therefore we can predict the values at another location by examining the actual data nearest to it.
● The process of interpolation is a commonly used resampling
method in many fields, including visualization.
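● Linear interpolation is the simplest such resampling method; a minimal sketch over invented (position, value) samples:

def lerp(samples, x):
    # Estimate the value at x by linearly interpolating between the two
    # nearest known samples, given as sorted (position, value) pairs.
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x is outside the sampled range")

samples = [(0.0, 10.0), (1.0, 14.0), (2.0, 12.0)]
print(lerp(samples, 0.5))   # -> 12.0, halfway between 10.0 and 14.0
print(lerp(samples, 1.25))  # -> 13.5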
Dimension Reduction
● In situations where the dimensionality of the data
exceeds the capabilities of the visualization technique, it
is necessary to investigate ways to reduce the data
dimensionality, while at the same time preserving, as
much as possible, the information contained within.
● This can be done manually by allowing the user to select
the dimensions deemed most important, or via
computational techniques, such as principal component analysis (PCA), multidimensional scaling (MDS), Kohonen self-organizing maps (SOMs), and local linear embedding (LLE).
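● As one concrete example, PCA can be sketched in a few lines of numpy (the data matrix is invented); it projects the records onto the directions of greatest variance:

import numpy as np

def pca(X, k):
    # Project the n-by-m data matrix X onto its top-k principal
    # components, the directions preserving the most variance.
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T  # rows of Vt are the principal directions

X = np.array([[2.0, 0.1, 1.9], [4.0, 0.2, 4.1], [6.0, 0.0, 6.2]])
print(pca(X, k=2))  # three records reduced from 3 dimensions to 2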
Mapping Nominal Dimensions to
Numbers
● In many domains, one or more of the data dimensions consist of
nominal values.
● We may have several alternative strategies for handling these dimensions within our visualizations, depending on how many nominal dimensions there are, how many distinct values each variable can take on, and whether an ordering or distance relation is available or can be derived.
● The key is to find a mapping of the data to a graphical entity or attribute
that doesn’t introduce artificial relationships that don’t exist in the data.
● For example, when looking at a data set consisting of information about
cars, the manufacturer and model name would both be nominal fields.
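● A minimal sketch of the two usual mappings (the car makers listed are placeholders): integer codes are compact but imply an ordering, while one-hot vectors avoid introducing artificial relationships:

makers = ["Honda", "Ford", "Honda", "Toyota"]  # hypothetical nominal column
levels = sorted(set(makers))                   # ['Ford', 'Honda', 'Toyota']

# Option 1: integer codes. Compact, but implies Ford < Honda < Toyota.
codes = [levels.index(m) for m in makers]

# Option 2: one-hot vectors. No artificial ordering or distances.
one_hot = [[1 if m == level else 0 for level in levels] for m in makers]

print(codes)    # -> [1, 0, 1, 2]
print(one_hot)  # -> [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]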
Aggregation and Summarization
● In the event that too much data is present, it is often useful to
group data points based on their similarity in value and/or
position and represent the group by some smaller amount of
data.
● This can be as simple as averaging the values, or there might be
more descriptive information, such as the number of members in
the group and the extents of their positions or values.
● Thus, there are two components to aggregation: the method of
grouping the points and the method of displaying the resulting
groups. Grouping can be done in a number of ways; the literature
on data clustering is quite rich.
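● A minimal sketch of value-based grouping and summarization (the keys and records are invented): each group is reduced to its member count, mean, and extent:

from collections import defaultdict

# Hypothetical records: (group key, value) pairs.
records = [("east", 10.0), ("west", 7.0), ("east", 14.0), ("west", 9.0)]

groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Summarize each group by member count, mean, and extent of values.
for key, vals in groups.items():
    print(key, len(vals), sum(vals) / len(vals), (min(vals), max(vals)))
# east 2 12.0 (10.0, 14.0)
# west 2 8.0 (7.0, 9.0)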
Cont..
● The key to visually depicting aggregated data is to provide sufficient information for the user to decide whether he or she wishes to perform a drill-down on the data, i.e., to explore the contents of one or more clusters.
● Simply displaying a single representative data point per
cluster may not help in the understanding of the variability
within the cluster, or in detecting outliers in the data set.
● Thus, other cluster measures, such as those listed above,
are useful in exploring this sort of preprocessed data.
Smoothing and Filtering
● A common process in signal processing is to smooth the data values,
to reduce noise and to blur sharp discontinuities.
● A typical way to perform this task is through a process known as
convolution, which for our purposes can be viewed as a weighted
averaging of neighbors surrounding a data point.
● The result of applying this operation is that values that are significantly different from their neighbors (e.g., noise) will be modified to be more similar to the neighbors, while values corresponding to dramatic changes will be “softened” to smooth out the transition.
● Many types of operations can be accomplished via this filtering
operation, by simply varying the weights or changing the size or shape
of the neighborhood considered.
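● A minimal 1D convolution sketch (the kernel weights and signal are invented): each output value is a weighted average of a point and its immediate neighbors, with indices clamped at the edges:

def smooth(signal, kernel=(0.25, 0.5, 0.25)):
    # Convolve a 1D signal with a small weighted-average kernel,
    # clamping indices at the boundaries.
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), len(signal) - 1)  # clamp edges
            acc += w * signal[j]
        out.append(acc)
    return out

noisy = [1.0, 1.1, 5.0, 0.9, 1.0]  # a spike of noise at index 2
print(smooth(noisy))  # the spike is pulled toward its neighbors (5.0 -> 3.0)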
Raster-to-Vector Conversion
● In computer graphics, objects are typically represented
by sets of connected, planar polygons (vertices, edges,
and triangular or quadrilateral patches), and the task is to
create a raster (pixel-level) image representing these
objects, their surface properties, and their interactions
with light sources and other objects.
● In spatial data visualization, our objects can be points or
regions, or they can be linear structures, such as a road
on a map. It is sometimes useful to take a raster-based
data set, such as an image, and extract linear structures
from it.
Cont..
● Reasons for doing this might include:
● Compressing the contents for transmission. A vertex and edge list is almost always more compact than a raster image.
● Comparing the contents of two or more images. It is
generally easier and more reliable to compare higher-
level features of images, rather than their pixels.
● Transforming the data. Affine transformations such as
rotation and scaling are easier to apply to vector
representations than to raster.
