What Is Visualization?
● We define visualization as the communication of information using graphical representations.
● Pictures have been used as a mechanism for communication since before the formalization of written language.
● A single picture can contain a wealth of information, and can be processed much more quickly than a comparable page of words.
● Pictures can also be independent of local language, as a graph or a map may be understood by a group of people with no common tongue.

Visualization in Everyday Life
● It is an interesting exercise to consider the number and types of data and information visualizations that we encounter in our normal activities. Some of these might include:
● a table in a newspaper, representing data being discussed in an article;
● a train and subway map with times, used for determining train arrivals and departures;
● a map of the region, to help select a route to a new destination;
● a weather chart showing the movement of a storm front that might influence your weekend activities;
● a graph of stock market activity that might indicate an upswing (or downturn) in the economy;
● a plot comparing the effectiveness of your pain killer to that of the leading brand;
● a 3D reconstruction of your injured knee, as generated from a CT scan.

Why Is Visualization Important?
● There are many reasons why visualization is important.
● Perhaps the most obvious reason is that we are visual beings who use sight as one of our key senses for information understanding.

The Visualization Process
● The process of starting with data and generating an image, a visualization, or a model via the computer is traditionally described as a pipeline: a sequence of stages that can be studied independently in terms of algorithms, data structures, and coordinate systems.
● These processes or pipelines are different for graphics, visualization, and knowledge discovery, but overlap a great deal. All start with data and end with the user.
● To visualize data, one needs to define a mapping from the data to the display.
● There are many ways to achieve this mapping.
● The user interface consists of components, some of which deal with data needing to be entered, presented, monitored, analyzed, and computed.
● These user interface components are often input via dialog boxes, but they could be visual representations of the data to facilitate the selections required by the user.
● Visualizations can provide mechanisms for translating data and tasks into more visual and intuitive formats in which users can perform their tasks.
● This means that the data values themselves, or perhaps the attributes of the data, are used to define graphical objects, such as points, lines, and shapes, and their attributes, such as size, position, orientation, and color.
● Thus, for example, a list of numbers can be plotted by mapping each number to the y-coordinate of a point and the number's index in the list to the x-coordinate.
● Alternatively, we could map the number to the height of a bar or the color of a square to get a different way to view the data.
● Another significant, yet often overlooked, component of the visualization process is the provision of interactive controls for the viewing and mapping of variables (attributes or parameters).
● While early visualizations were static objects, printed on paper or other fixed media, modern visualization is a very dynamic process, with the user controlling virtually all stages of the procedure, from data selection and mapping control to color manipulation and view refinement.

Relationship between Visualization and Other Fields
● Originally, visualization was considered a subfield of computer graphics, primarily because visualization uses graphics to display information via images.
● Visualization applies graphical techniques to generate visual displays of data. Here, graphics is used as the communication medium.
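The index-to-x, value-to-y mapping described above can be sketched in a few lines of Python. This is an illustrative sketch; the function names and the fixed bar height are assumptions, not from the text:

```python
def to_points(values):
    # Map each number to the y-coordinate of a point;
    # its index in the list becomes the x-coordinate.
    return [(i, v) for i, v in enumerate(values)]

def to_bar_heights(values, max_height=100.0):
    # Alternative mapping: each number becomes the height of a bar,
    # scaled so the largest value fills max_height.
    peak = max(values)
    return [v / peak * max_height for v in values]

points = to_points([3, 1, 4, 1, 5])
# points == [(0, 3), (1, 1), (2, 4), (3, 1), (4, 5)]
```

The same list of numbers yields two different views depending on which mapping is chosen, which is exactly the point made above: the mapping, not the data, determines the visualization.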
● Visualization is the application of graphics to display data by mapping data to graphical primitives and rendering the display.
● Visualization is more than simply computer graphics. The field of visualization encompasses aspects from numerous other disciplines, including human-computer interaction, perceptual psychology, databases, statistics, and data mining, to name a few.

The Scatterplot
● One of the most basic visualization techniques is the scatterplot.
● This will give us some experience with transforming data into a visual representation that is understood by most readers.
● The scatterplot is one of the earliest and most widely used visualizations developed. It is based on the Cartesian coordinate system.
● The following pseudocode renders a scatterplot of circles.
● Records are represented in the scatterplot as circles of varying location, color, and size.
● The x- and y-axes represent data from dimension numbers xDim and yDim, respectively.
● The color of the circles is derived from dimension number cDim.
● The radius of the circles is derived from dimension number rDim, as well as from the upper and lower bounds for the radius, rMin and rMax.

Pseudocode
    Scatterplot(xDim, yDim, cDim, rDim, rMin, rMax)
        for each record i do
            x ← Normalize(i, xDim)
            y ← Normalize(i, yDim)
            r ← Normalize(i, rDim, rMin, rMax)
            MapColor(i, cDim)
            Circle(x, y, r)

Pseudocode Conventions
● data: The working data table. This data table is assumed to contain only numeric values. The working data table is a subset of the original data table.
● m: The number of dimensions (columns) in the working data table. Dimensions are typically iterated over using j as the index.
● n: The number of records (rows) in the working data table. Records are typically iterated over using i as the index.
● Normalize(record, dimension), Normalize(record, dimension, min, max): A function that maps the value for the given record and dimension in the working data table to a value between min and max.
● Circle(x, y, radius): A function that fills a circle centered at the given (x, y)-location, with the given radius, with the color of the graphics environment.
● Polyline(xs, ys): A function that draws a polyline (many connected line segments) from the given arrays of x- and y-coordinates.
● Color(color): A function that sets the color state of the graphics environment to the specified color.
● MapColor(record, dimension): A function that sets the color state of the graphics environment to the color derived from applying the global color map to the normalized value of the given record and dimension.

Visualization Types
● It is useful to categorize visualizations based on the purpose they serve. These include the following:
● Exploration. The user possesses a data set and wants to examine it to ascertain its contents and/or whether a particular feature or set of features is present or absent.
● Confirmation. The user has determined (e.g., via computational analysis) or hypothesized that a given feature is present in the data and wants to use the visualization to verify this fact or hypothesis.
● Presentation. The user is trying to convey some concept or set of facts to an audience. Note the added labeling and stronger colors to emphasize and support the author's conclusion.
● Interactive Presentation. The user is providing a presentation as above, but one that is interactive, typically for an individual to explore.
● Interactive presentations, such as those available nowadays on the web, take much more time to prepare.

Data Foundations
● Every visualization starts with the data that is to be displayed.
● A first step in addressing the design of visualizations is to examine the characteristics of the data.
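The Scatterplot pseudocode, together with the Normalize and MapColor conventions above, can be sketched in Python. This is a minimal, non-drawing sketch: the list-of-lists data table is an assumption, and instead of issuing Circle and MapColor calls it returns the circle parameters each record maps to, with a normalized value standing in for the mapped color:

```python
def normalize(data, i, dim, lo=0.0, hi=1.0):
    # Map record i's value in column dim to the range [lo, hi],
    # using the column's min and max over the whole table.
    col = [row[dim] for row in data]
    cmin, cmax = min(col), max(col)
    t = (data[i][dim] - cmin) / (cmax - cmin) if cmax > cmin else 0.0
    return lo + t * (hi - lo)

def scatterplot(data, x_dim, y_dim, c_dim, r_dim, r_min, r_max):
    # For each record, compute (x, y, radius, color value),
    # mirroring the Normalize / MapColor / Circle steps of the pseudocode.
    circles = []
    for i in range(len(data)):
        x = normalize(data, i, x_dim)
        y = normalize(data, i, y_dim)
        r = normalize(data, i, r_dim, r_min, r_max)
        c = normalize(data, i, c_dim)  # stand-in for the global color map
        circles.append((x, y, r, c))
    return circles

data = [[0, 0, 0, 1], [10, 5, 2, 3]]
circles = scatterplot(data, 0, 1, 2, 3, 1.0, 5.0)
# circles == [(0.0, 0.0, 1.0, 0.0), (1.0, 1.0, 5.0, 1.0)]
```

In a real program the returned tuples would be passed to a drawing routine; separating the mapping from the rendering keeps the data-to-display mapping testable on its own.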
● Data comes from many sources; it can be gathered from sensors or surveys, or it can be generated by simulations and computations.
● Data can be raw (untreated), or it can be derived from raw data via some process, such as smoothing, noise removal, scaling, or interpolation.
● A typical data set used in visualization consists of a list of n records, (r1, r2, . . . , rn). Each record ri consists of m (one or more) observations or variables, (v1, v2, . . . , vm).
● An observation may be a single number/symbol/string or a more complex structure.
● An independent variable vi is one whose value is not controlled or affected by another variable, such as the time variable in a time-series data set.
● A dependent variable vj is one whose value is affected by a variation in one or more associated independent variables. Temperature for a region would be considered a dependent variable, as its value is affected by variables such as date, time, or location.
● Thus we can formally represent a record as ri = (iv1, iv2, . . . , ivmi, dv1, dv2, . . . , dvmd), where mi is the number of independent variables and md is the number of dependent variables. With this notation we have m = mi + md.
● In many cases, we may not know which variables are dependent or independent.

Types of Data
● In its simplest form, each observation or variable of a data record represents a single piece of information.
● We can categorize this information as being ordinal (numeric) or nominal (nonnumeric). Subcategories of each can be readily defined.
● Ordinal. The data take on numeric values:
    binary: assuming only values of 0 and 1;
    discrete: taking on only integer values or values from a specific subset (e.g., (2, 4, 6));
    continuous: representing real values (e.g., in the interval [0, 5]).
● Nominal.
The data take on nonnumeric values:
    categorical: a value selected from a finite (often short) list of possibilities (e.g., red, blue, green);
    ranked: a categorical variable that has an implied ordering (e.g., small, medium, large);
    arbitrary: a variable with a potentially infinite range of values with no implied ordering (e.g., addresses).

Structure within and between Records
● Data sets have structure, both in terms of the means of representation (syntax), and the types of interrelationships within a given record and between records (semantics).

Scalars, Vectors, and Tensors
● An individual number in a data record is often referred to as a scalar. Scalar values, such as the cost of an item or the age of an individual, are often the focus for analysis and visualization. Multiple variables within a single record can represent a composite data item.
● For example, a point in a two-dimensional flow field might be represented by a pair of values, such as a displacement in x and y. This pair, and any such composition, is referred to as a vector.
● Other examples of vectors found in typical data sets include position (2 or 3 spatial values), color (a triplet of red, green, and blue components), and phone number (country code, area code, and local number).
● Scalars and vectors are simple variants on a more general structure known as a tensor.
● A tensor is defined by its rank and by the dimensionality of the space within which it is defined.
● It is generally represented as an array or matrix. A scalar is a tensor of rank 0, while a vector is a tensor of rank 1.
● One could use a 3 × 3 matrix to represent a tensor of rank 2 in 3D space; in general, a tensor of rank M in D-dimensional space requires D^M data values.

Geometry and Grids
● Geometric structure can commonly be found in data sets, especially those from scientific and engineering domains.
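Before moving on to grids: the D^M component count for tensors can be checked with a two-line sketch (pure Python; the function name is illustrative):

```python
def tensor_size(rank, dim):
    # A tensor of rank M in D-dimensional space has D**M components:
    # a scalar (rank 0) has 1, a vector (rank 1) in 3D has 3,
    # and a rank-2 tensor in 3D is a 3x3 matrix with 9 entries.
    return dim ** rank

sizes = [tensor_size(rank, 3) for rank in (0, 1, 2)]
# sizes == [1, 3, 9]
```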
● The simplest method of incorporating geometric structure in a data set is to have explicit coordinates for each data record.
● Thus, a data set of temperature readings from across the country might include the longitude and latitude associated with the sensors, as well as the sensor values.
● In the modeling of 3D objects, the geometry constitutes the majority of the data, with coordinates given for each vertex.
● Sometimes the geometric structure is implied. When this is the case, it is assumed that some form of grid exists, and the data set is structured such that successive data records are located at successive locations on the grid.
● For example, if one had a data set giving elevation at uniform spacing across a surface, it would be unnecessary to include the coordinates for each record; it would be sufficient to indicate a starting location, an orientation, and the step size horizontally and vertically.
● There are many different coordinate systems used for grid-structured data, including Cartesian, spherical, and hyperbolic coordinates.

Other Forms of Structure
● A timestamp is an important attribute that can be associated with a data record.
● Time perhaps has the widest range of possible values of all aspects of a data set, since we can refer to time with units from picoseconds to millennia.
● Another important form of structure found within many data sets is that of topology, or how the data records are connected. Connectivity indicates that data records have some relationship to each other.
● Thus, vertices on a surface (geometry) are connected to their neighbors via edges (topology), and relationships between nodes in a hierarchy or graph can be specified by links.
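The implicit-grid scheme described above can be sketched as follows. Assuming an axis-aligned uniform grid stored row by row (the function name and parameters are illustrative, not from the text), record k's coordinates follow from the start location and step sizes alone, with no per-record coordinates stored:

```python
def grid_location(k, n_cols, x0, y0, dx, dy):
    # Record k of a row-by-row grid sits at column k % n_cols and
    # row k // n_cols; its coordinates are recovered from the
    # starting location (x0, y0) and the step sizes (dx, dy).
    row, col = divmod(k, n_cols)
    return (x0 + col * dx, y0 + row * dy)

# A 3-column elevation grid starting at (100.0, 200.0), 0.5 units apart:
# record 4 is row 1, column 1.
grid_location(4, 3, 100.0, 200.0, 0.5, 0.5)
# -> (100.5, 200.5)
```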
Data Preprocessing
● Metadata and Statistics
● Missing Values and Data Cleansing
● Normalization
● Segmentation
● Sampling and Subsetting
● Dimension Reduction
● Mapping Nominal Dimensions to Numbers
● Aggregation and Summarization
● Smoothing and Filtering
● Raster-to-Vector Conversion

● In most circumstances, it is preferable to view the original raw data.
● In many domains, such as medical imaging, the data analyst is often opposed to any sort of data modification, such as filtering or smoothing, for fear that important information will be lost or deceptive artifacts will be added.
● Viewing raw data also often identifies problems in the data set, such as missing data, or outliers that may be the result of errors in computation or input.
● Depending on the type of data and the visualization techniques to be applied, however, some forms of preprocessing might be necessary.

Metadata and Statistics
● Information regarding a data set of interest (its metadata) and statistical analysis can provide invaluable guidance in preprocessing the data.
● Metadata may provide information that can help in its interpretation, such as the format of individual fields within the data records.
● It may also contain the base reference point from which some of the data fields are measured, the units used in the measurements, the symbol or number used to indicate a missing value (see below), and the resolution at which measurements were acquired.

Statistical Analysis
● Various methods of statistical analysis can provide us with useful insights.
● Outlier detection can indicate records with erroneous data fields. Cluster analysis can help segment the data into groups exhibiting strong similarities.
● Correlation analysis can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise.
● The most common statistical plot is the distribution of data, in the form of a histogram.
● The most common statistics about data include the mean and the standard deviation.

Missing Values and Data Cleansing
● One of the realities of analyzing and visualizing "real" data sets is that they often are missing some data entries or have erroneous entries.
● Missing data may be caused by several factors, including, for example, a malfunctioning sensor, a blank entry on a survey, or an omission on the part of the person entering the data.
● Erroneous data is most often caused by human error and can be difficult to detect.

Normalization
● Normalization is the process of transforming a data set so that the results satisfy a particular statistical property.
● A simple example of this is to transform the range of values a particular variable assumes so that all numbers fall within the range of 0.0 to 1.0. Other forms of normalization convert the data such that each dimension has a common mean and standard deviation.
● Normalization is a useful operation since it allows us to compare seemingly unrelated variables.

Segmentation
● In many situations, the data can be separated into contiguous regions, where each region corresponds to a particular classification of the data.
● For example, an MRI data set might originally have 256 possible values for each data point, and then be segmented into specific categories, such as bone, muscle, fat, and skin.
● Simple segmentation can be performed by just mapping disjoint ranges of the data values to specific categories.
● A typical problem with segmentation is that the results may not coincide with regions that are semantically homogeneous (undersegmented), or may consist of large numbers of tiny regions (oversegmented).

Sampling and Subsetting
● Often it is necessary to transform a data set with one spatial resolution into another data set with a different spatial resolution.
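Both forms of normalization described above, range scaling and conversion to a common mean and standard deviation, can be sketched in a few lines (pure Python for self-containment; a real pipeline would likely use NumPy, and the function names here are illustrative):

```python
def min_max(values, lo=0.0, hi=1.0):
    # Rescale so all values fall in [lo, hi].
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) / (vmax - vmin) * (hi - lo) for v in values]

def z_score(values):
    # Rescale to mean 0 and (population) standard deviation 1.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

min_max([10, 20, 30])  # -> [0.0, 0.5, 1.0]
```

After either transformation, variables measured in entirely different units (say, temperature and price) occupy a comparable scale, which is what makes the comparison of seemingly unrelated variables possible.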
● For example, we might have an image we would like to shrink or expand, or we might have only a small sampling of data points and wish to fill in values for locations between our samples.
● In each case, we assume that the data we possess is a discrete sampling of a continuous phenomenon, and therefore we can predict the values at another location by examining the actual data nearest to it.
● The process of interpolation is a commonly used resampling method in many fields, including visualization.

Dimension Reduction
● In situations where the dimensionality of the data exceeds the capabilities of the visualization technique, it is necessary to investigate ways to reduce the data dimensionality, while at the same time preserving, as much as possible, the information contained within.
● This can be done manually by allowing the user to select the dimensions deemed most important, or via computational techniques, such as principal component analysis (PCA), multidimensional scaling (MDS), Kohonen self-organizing maps (SOMs), and local linear embedding (LLE).

Mapping Nominal Dimensions to Numbers
● In many domains, one or more of the data dimensions consist of nominal values.
● We may have several alternative strategies for handling these dimensions within our visualizations, depending on how many nominal dimensions there are, how many distinct values each variable can take on, and whether an ordering or distance relation is available or can be derived.
● The key is to find a mapping of the data to a graphical entity or attribute that doesn't introduce artificial relationships that don't exist in the data.
● For example, when looking at a data set consisting of information about cars, the manufacturer and model name would both be nominal fields.

Aggregation and Summarization
● In the event that too much data is present, it is often useful to group data points based on their similarity in value and/or position and represent the group by some smaller amount of data.
● This can be as simple as averaging the values, or there might be more descriptive information, such as the number of members in the group and the extents of their positions or values.
● Thus, there are two components to aggregation: the method of grouping the points and the method of displaying the resulting groups. Grouping can be done in a number of ways; the literature on data clustering is quite rich.
● The key to visually depicting aggregated data is to provide sufficient information for the user to decide whether he or she wishes to perform a drill-down on the data, i.e., to explore the contents of one or more clusters.
● Simply displaying a single representative data point per cluster may not help in the understanding of the variability within the cluster, or in detecting outliers in the data set.
● Thus, other cluster measures, such as those listed above, are useful in exploring this sort of preprocessed data.

Smoothing and Filtering
● A common process in signal processing is to smooth the data values, to reduce noise and to blur sharp discontinuities.
● A typical way to perform this task is through a process known as convolution, which for our purposes can be viewed as a weighted averaging of neighbors surrounding a data point.
● The result of applying this operation is that values that are significantly different from their neighbors (e.g., noise) will be modified to be more similar to the neighbors, while values corresponding to dramatic changes will be "softened" to smooth out the transition.
● Many types of operations can be accomplished via this filtering operation, by simply varying the weights or changing the size or shape of the neighborhood considered.
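The weighted-averaging view of convolution described above can be sketched as a one-dimensional filter. This is a minimal sketch, assuming a symmetric kernel whose weights sum to 1 and clamping at the ends of the signal (one of several common boundary treatments):

```python
def smooth(values, weights=(0.25, 0.5, 0.25)):
    # Replace each value with a weighted average of itself and its
    # neighbors; indices past either end are clamped to the edge.
    half = len(weights) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = min(max(i + k - half, 0), len(values) - 1)
            acc += w * values[j]
        out.append(acc)
    return out

smooth([0, 0, 10, 0, 0])
# the isolated spike at index 2 is reduced and spread to its neighbors:
# -> [0.0, 2.5, 5.0, 2.5, 0.0]
```

Changing the weights changes the operation: a wider, flatter kernel blurs more aggressively, while other weight choices (e.g., with negative entries) implement sharpening or edge detection with the same loop.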
Raster-to-Vector Conversion
● In computer graphics, objects are typically represented by sets of connected, planar polygons (vertices, edges, and triangular or quadrilateral patches), and the task is to create a raster (pixel-level) image representing these objects, their surface properties, and their interactions with light sources and other objects.
● In spatial data visualization, our objects can be points or regions, or they can be linear structures, such as a road on a map. It is sometimes useful to take a raster-based data set, such as an image, and extract linear structures from it.
● Reasons for doing this might include:
● Compressing the contents for transmission. A vertex and edge list is almost always more compact than a raster image.
● Comparing the contents of two or more images. It is generally easier and more reliable to compare higher-level features of images, rather than their pixels.
● Transforming the data. Affine transformations such as rotation and scaling are easier to apply to vector representations than to raster representations.