IDS UNIT-5
Data Reduction
Data mining is applied to selected data drawn from a large database. When data analysis and
mining are carried out on a huge amount of data, processing takes so long that it becomes
impractical or infeasible.
Data reduction is a process that reduces the volume of the original data and represents it in a
much smaller form while maintaining the integrity of the original data. Mining the reduced data
set is more efficient and should produce the same, or almost the same, analytical results as
mining the original data.
When the data size is smaller, it is simpler to apply sophisticated and computationally
expensive algorithms. The data may be reduced in terms of the number of rows (records) or in
terms of the number of columns (dimensions).
1. Dimensionality Reduction
Whenever we encounter weakly important data, we keep only the attributes required for our analysis.
Dimensionality reduction eliminates the attributes from the data set under consideration, thereby
reducing the volume of original data. It reduces data size as it eliminates outdated or redundant
features. Here are three methods of dimensionality reduction.
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller
form. This technique has two types: parametric and non-parametric numerosity reduction.
3. Cluster sample: The tuples in data set D are grouped into M mutually
disjoint subsets (clusters). Data reduction can be applied by running
simple random sampling without replacement (SRSWOR) on these clusters:
a simple random sample of s clusters is drawn, where s < M.
4. Stratified sample: The large data set D is partitioned into mutually
disjoint sets called 'strata'. A simple random sample is taken from each
stratum to get stratified data. This method is effective for skewed data.
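The two sampling schemes above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names, the `fraction` parameter, and the customer data are assumptions made for the example.

```python
import random
from collections import defaultdict

def cluster_sample(clusters, s, seed=None):
    """Draw s of the M clusters by SRSWOR and return all tuples in them."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(clusters)), s)  # sample() never repeats an index
    return [row for i in sorted(chosen) for row in clusters[i]]

def stratified_sample(rows, stratum_of, fraction, seed=None):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
stratified = stratified_sample(customers, lambda row: row[0], fraction=0.1, seed=0)

# Hypothetical clusters: M = 5 disjoint groups of 4 tuples each, s = 2.
clusters = [[(c, i) for i in range(4)] for c in range(5)]
clustered = cluster_sample(clusters, s=2, seed=0)

print(len(stratified), len(clustered))
```

Because every stratum contributes to the sample, the small "senior" group is still represented, which is why stratified sampling suits skewed data.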
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
For example, suppose you have the All Electronics sales per quarter for the years 2018 to
2022. To get the annual sales, you just aggregate the quarterly sales for each year. Aggregation
thus provides the required data in a much smaller size, so we achieve data reduction without
losing the information needed for the analysis.
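The quarterly-to-annual roll-up can be sketched as below. The sales figures are hypothetical and only two years are shown to keep the example short.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
quarterly_sales = [
    (2018, "Q1", 224), (2018, "Q2", 408), (2018, "Q3", 350), (2018, "Q4", 586),
    (2019, "Q1", 300), (2019, "Q2", 416), (2019, "Q3", 380), (2019, "Q4", 610),
]

# Aggregate along the time dimension: quarter -> year.
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

for year in sorted(annual_sales):
    print(year, annual_sales[year])
```

Eight quarterly records are reduced to two annual records, which is all that an analysis at the yearly level requires.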
4. Data Compression
Data compression modifies, encodes, or converts the structure of data so that it consumes less
space. It builds a compact representation of information by removing redundancy and
representing data in binary form. Compression from which the original data can be restored
exactly is called lossless compression. In contrast, compression from which the original form
cannot be restored is called lossy compression. Dimensionality and numerosity reduction
methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as
Huffman encoding and run-length encoding. It can be divided into two types based on the
compression technique.
i. Lossless Compression: Encoding techniques (such as run-length encoding) allow a simple and
modest reduction in data size. Lossless data compression uses algorithms that restore the
precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from
the original data but is still useful enough to retrieve information from. For example,
the JPEG image format uses lossy compression, yet the decompressed image conveys the same
meaning as the original. Methods such as the discrete wavelet transform and PCA
(principal component analysis) are examples of this kind of compression.
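Run-length encoding, mentioned above as a lossless technique, can be illustrated with a short sketch. The function names and the sample string are assumptions for the example; the round trip at the end shows that the original data is restored exactly.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of a repeated character with a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
restored = rle_decode(encoded)

print(encoded)
print(restored == original)
```

The saving depends on how long the runs are; data with few repeated neighbours gains little from this encoding.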
5. Discretization Operation
The data discretization technique divides attributes of a continuous nature into intervals.
We replace the many continuous values of an attribute with a small number of interval
labels. As a result, mining results are presented in a concise and easily understandable way.
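As a minimal sketch of discretization, the snippet below replaces ages with interval labels. The cut points and label names are assumptions chosen for illustration, not values taken from these notes.

```python
from bisect import bisect_right

# Assumed interval boundaries: [0, 13), [13, 20), [20, 60), [60, ...).
cut_points = [13, 20, 60]
labels = ["child", "teen", "adult", "senior"]

def discretize(value):
    """Return the label of the interval that contains value."""
    return labels[bisect_right(cut_points, value)]

ages = [5, 13, 19, 35, 60, 72]
age_groups = [discretize(age) for age in ages]
print(age_groups)
```

Six distinct ages collapse into four labels, and a mining result stated in terms of "teen" or "senior" is easier to read than one stated in raw numbers.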
The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk
space, the less capacity you will need to purchase. Data reduction greatly increases the
efficiency of a storage system and directly impacts your total spending on capacity.
DATA ANALYTICS
III YEAR I SEM
5. DATA VISUALIZATION
Introduction
Data visualization gives us a clear idea of what the information means by giving it visual
context through maps or graphs. This makes the data more natural for the human mind to
comprehend and therefore makes it easier to identify trends, patterns, and outliers within
large data sets.
By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.
Because of the way the human brain processes information, using charts or graphs to visualize large
amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a
quick, easy way to convey concepts in a universal manner, and you can experiment with different
scenarios by making slight adjustments.
Data Visualization: Data visualization is the graphical representation of information and data in a
pictorial or graphical format (for example, charts, graphs, and maps). Data visualization tools provide an
accessible way to see and understand trends, patterns, and outliers in data. Data visualization tools and
technologies are essential for analysing massive amounts of information and making data-driven decisions.
The concept of using pictures to understand data has been in use for centuries. General types of
data visualizations are charts, tables, graphs, maps, and dashboards.
Data Analytics: Data analytics is the process of analysing data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized software and systems.
Data analytics technologies are used in commercial industries to allow organizations to make
informed business decisions. Data can help businesses better understand their customers, improve
their advertising campaigns, personalize their content, and improve their bottom lines. The
techniques and processes of data analytics have been automated into mechanical processes and
algorithms that work over raw data for human consumption. Data analytics helps a business
optimize its performance.
Data Visualization Techniques
Box plots
Histograms
Heat maps
Charts
Tree maps
Word Cloud/Network diagram
Box Plots
A box plot is a standardized way of displaying the distribution of data based on a five-number
summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"). It can tell you
about your outliers and what their values are. It can also tell you whether your data is symmetrical,
how tightly your data is grouped, and whether and how your data is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are spread out.
Although box plots may seem primitive in comparison to a histogram or density plot, they have the
advantage of taking up less space, which is useful when comparing distributions between many groups
or datasets.
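The five-number summary behind a box plot can be computed with Python's standard `statistics` module, as sketched below. The sample values are hypothetical, and the 1.5 x IQR fence used to flag outliers is a common convention assumed here, not something stated in these notes.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def find_outliers(values):
    """Flag values outside the assumed 1.5 * IQR fences around the box."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample
summary = five_number_summary(data)
outliers = find_outliers(data)
print(summary, outliers)
```

The box spans Q1 to Q3, the line inside it marks the median, and values beyond the fences are drawn as individual points.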
Histograms
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar
groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays
the shape and spread of continuous sample data.
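The grouping of numbers into ranges can be sketched without any plotting library. The function name, the bin width, and the sample values below are assumptions for illustration; each printed row of `#` characters plays the role of one bar.

```python
def histogram(values, bin_width, start=0):
    """Count how many values fall in each range [lo, lo + bin_width)."""
    counts = {}
    for v in values:
        lo = start + ((v - start) // bin_width) * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

values = [1, 3, 4, 7, 8, 8, 9, 12, 15, 18]  # hypothetical sample
bins = histogram(values, bin_width=5)

for lo, count in bins.items():
    # A longer bar means more values fall in that range.
    print(f"{lo:>2}-{lo + 4:<2} {'#' * count}")
```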
Charts
Line Chart
The simplest technique, a line plot, is used to plot the relationship or dependence of one variable on
another. To plot the relationship between the two variables, we can simply call the plot function of
the plotting library in use.
Bar Charts
Bar charts are used for comparing the quantities of different categories or groups. Values of a category
are represented with the help of bars and they can be configured with vertical or horizontal bars, with
the length or height of each bar representing the value.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. Here the
arc length of each slice is proportional to the quantity it represents. As a rule, pie charts are used to
compare the parts of a whole and are most effective when there are limited components and when text and
percentages are included to describe the content.
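The proportionality between a quantity and its slice can be made concrete with a small calculation. The category names and values are hypothetical; the sketch only computes each slice's percentage and central angle and does not draw anything.

```python
def pie_slices(parts):
    """Map each category to (percentage of the whole, central angle in degrees)."""
    total = sum(parts.values())
    return {
        name: (value * 100 / total, value * 360 / total)
        for name, value in parts.items()
    }

market_share = {"Product A": 50, "Product B": 30, "Product C": 20}  # hypothetical
slices = pie_slices(market_share)

for name, (percent, angle) in slices.items():
    print(f"{name}: {percent:.0f}% of the circle, {angle:.0f} degrees")
```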
Scatter Charts
Another common visualization technique is the scatter plot, a two-dimensional plot representing the
joint variation of two data items. Each marker (symbols such as dots, squares and plus signs) represents
an observation. The marker position indicates the value for each observation. When you assign more
than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible
pairing of the measures that are assigned to the visualization. Scatter plots are used for examining the
relationship, or correlation, between X and Y variables.
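The relationship a scatter plot shows visually can also be summarized numerically. The sketch below computes the Pearson correlation coefficient by hand and lists the pairings that a scatter plot matrix would contain; the measure names and values are assumptions for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical measures for four observations.
measures = {
    "income": [1, 2, 3, 4],
    "credit_limit": [2, 4, 6, 8],
    "age": [40, 25, 60, 33],
}

# Every possible pairing of the measures, as in a scatter plot matrix.
pairs = list(combinations(measures, 2))
for a, b in pairs:
    print(a, b, round(pearson(measures[a], measures[b]), 2))
```

A coefficient near 1 or -1 corresponds to markers lying close to a straight line, while a value near 0 corresponds to a cloud with no linear trend.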
Timeline Charts
Timeline charts illustrate events in chronological order (for example, the progress of a project, an
advertising campaign, or an acquisition process) in whatever unit of time the data was recorded (for
example, week, month, quarter, or year). They show the chronological sequence of past or future events
on a timescale.
The variety of big data brings challenges because semi-structured and unstructured data require new
visualization techniques. A word cloud visual represents the frequency of a word within a body of text
with its relative size in the cloud. This technique is used on unstructured data as a way to display high-
or low-frequency words.
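The core of a word cloud is a word-frequency count that is scaled to a font size. The sketch below shows that step only; the size range, the function name, and the sample sentence are assumptions, and no drawing is performed.

```python
import re
from collections import Counter

def word_sizes(text, min_size=10, max_size=40):
    """Scale each word's frequency to a font size; the top word gets max_size."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }

text = "data mining turns raw data into knowledge and data into decisions"
sizes = word_sizes(text)

for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.0f}")
```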
Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and
ties (relationships between the individuals). They are used in many applications, for example for
analysis of social networks or mapping product sales across geographic areas.
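Before a network diagram is drawn, the nodes and ties are usually held in an adjacency structure. The sketch below builds one from a list of ties and counts each node's connections; the names in the social network are hypothetical.

```python
from collections import defaultdict

# Hypothetical ties (relationships) between individual actors (nodes).
ties = [("Ana", "Ben"), ("Ana", "Chen"), ("Ben", "Chen"), ("Chen", "Dev")]

# Undirected adjacency list: each node maps to the set of its neighbours.
adjacency = defaultdict(set)
for a, b in ties:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree = number of ties per node; a layout might draw high-degree nodes larger.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
print(degree)
```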
Visualization techniques are of increasing importance in exploring and analysing large amounts of
multidimensional information. One important class of visualization techniques which is particularly
interesting for visualizing very large multidimensional data sets is the class of pixel-oriented techniques.
The basic idea of pixel-oriented visualization techniques is to represent as many data objects as possible
on the screen at the same time by mapping each data value to a pixel of the screen and arranging the
pixels adequately.
A number of different pixel-oriented visualization techniques have been proposed in recent years,
and they have been shown to be useful for visual data exploration in a number of different
application contexts.
Data Visualization techniques:
A simple way to visualize the value of a dimension is to use a pixel where the colour of the pixel
reflects the dimension's value.
For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one
for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding positions in the
windows.
The colours of the pixels reflect the corresponding values.
Inside a window, the data values are arranged in some global order shared by all windows.
Example: All Electronics maintains a customer information table, which consists of 4 dimensions:
income, credit limit, transaction volume and age. We analyse the correlation between income
and the other attributes by visualization.
We sort all customers by income in ascending order and use this order to lay out the customer
data in the 4 visualization windows, as shown in the figure.
The pixel colours are chosen so that the smaller the value, the lighter the shading.
Using pixel-based visualization, we can easily observe that credit limit increases as income
increases, that customers whose income is in the middle range are more likely to purchase more from
All Electronics, and that there is no clear correlation between income and age.
Fig: Pixel-oriented visualization of 4 attributes, with all customers sorted by income in ascending
order.
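The mapping from records to pixel windows can be sketched as follows. The customer values and the function name are hypothetical, and a shade between 0.0 (lightest) and 1.0 (darkest) stands in for the pixel colour; actual rendering is left out.

```python
def pixel_windows(records, dimensions, sort_by):
    """Build one window per dimension, all sharing the same global order."""
    ordered = sorted(records, key=lambda record: record[sort_by])
    windows = {}
    for dim in dimensions:
        values = [record[dim] for record in ordered]
        low, high = min(values), max(values)
        span = (high - low) or 1  # avoid dividing by zero for constant columns
        # Smaller value -> lighter shading (closer to 0.0).
        windows[dim] = [(v - low) / span for v in values]
    return windows

# Hypothetical customer records.
customers = [
    {"income": 50, "credit_limit": 5, "age": 40},
    {"income": 20, "credit_limit": 2, "age": 60},
    {"income": 80, "credit_limit": 9, "age": 25},
]
windows = pixel_windows(customers, ["income", "credit_limit", "age"], "income")

for dim, shades in windows.items():
    print(dim, [round(s, 2) for s in shades])
```

In this made-up data the credit-limit window darkens steadily along the income order, which is how a positive correlation would appear on screen.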
Fig: visualization of 2D data set using scatter plot
For data sets with more than four dimensions, scatter plots are usually ineffective. The scatter-plot matrix
technique is a useful extension to the scatter plot. For an n-dimensional data set, a scatter-plot matrix is an
n × n grid of 2-D scatter plots that provides a visualization of each dimension with every other dimension.
The scatter-plot matrix becomes less effective as the dimensionality increases. Another popular technique,
called parallel coordinates, can handle higher dimensionality. To visualize n-dimensional data points, the
parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel to one of the
display axes.
A data record is represented by a polygonal line that intersects each axis at the point corresponding to the
associated dimension value (Figure 2.16). A major limitation of the parallel coordinates technique is that
it cannot effectively show a data set of many records. Even for a data set of several thousand records,
visual clutter and overlap often reduce the readability of the visualization and make the patterns hard to
find.
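The polygonal line of a record can be computed by scaling each dimension to its own axis. The sketch below returns, for every record, the points where its line crosses the n axes; the records are hypothetical and min-max scaling per axis is an assumption about how the axes are calibrated.

```python
def polylines(records):
    """For each record, list (axis index, position in 0..1 along that axis)."""
    n = len(records[0])
    lows = [min(record[i] for record in records) for i in range(n)]
    highs = [max(record[i] for record in records) for i in range(n)]
    lines = []
    for record in records:
        line = []
        for i in range(n):
            span = (highs[i] - lows[i]) or 1  # constant dimension: avoid 0 division
            line.append((i, (record[i] - lows[i]) / span))
        lines.append(line)
    return lines

# Hypothetical 3-dimensional records: (income, credit limit, age).
records = [(20, 2, 60), (50, 5, 40), (80, 9, 25)]
lines = polylines(records)
print(lines[0])
```

Each record yields n points, so several thousand records produce several thousand overlapping lines, which is the clutter problem described above.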
Icon-based visualization techniques use small icons to represent multidimensional data values.
Two popular icon-based techniques are: 1) Chernoff faces 2) Stick figures
3.1 Chernoff faces: They display multidimensional data of up to 18 variables as a cartoon human
face. Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears, mouth,
and nose, represent values of the dimensions by their shape, size, placement, and orientation. For example,
dimensions can be mapped to the following facial characteristics: eye size, eye spacing, nose length, nose
width, mouth curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye eccentricity, and head
eccentricity. Chernoff faces make use of the ability of the human mind to recognize small differences in
facial characteristics and to assimilate many facial characteristics at once.
Chernoff faces make the data easier for users to digest. In this way, they facilitate visualization of regularities
and irregularities present in the data, although their power in relating multiple relationships is limited.
Another limitation is that specific data values are not shown.
Asymmetrical Chernoff faces were proposed as an extension to the original technique. Since a face has
vertical symmetry (along the y-axis), the left and right side of a face are identical, which wastes space.
Asymmetrical Chernoff faces double the number of facial characteristics, thus allowing up to 36 dimensions
to be displayed.
3.2 Stick figures: This technique maps multidimensional data to five-piece stick figures, where each
figure has four limbs and a body.
Two dimensions are mapped to the display axes and the remaining dimensions are mapped to the angle
and/or length of the limbs.
Figure 2.18 shows census data, where age and income are mapped to the display axes, and the
remaining dimensions (gender, education, and so on) are mapped to stick figures. If the data items are
relatively dense with respect to the two display dimensions, the resulting visualization shows texture
patterns, reflecting data trends.
4. Hierarchical visualization techniques
Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces). The subspaces
are visualized in a hierarchical manner.
4.2) Tree-maps: As another example of hierarchical visualization methods, tree-maps display hierarchical data as a
set of nested rectangles. For example, Figure 2.20 shows a tree-map visualizing Google news stories. All news
stories are organized into seven categories, each shown in a large rectangle of a unique color. Within each category
(i.e., each rectangle at the top level), the news stories are further partitioned into smaller subcategories.
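Nested rectangles of this kind can be produced with a simple "slice and dice" layout, sketched below. This is one basic layout among several and is not claimed to be the one used for the figure; the category names and story counts are hypothetical.

```python
def total(node):
    """Size of a leaf (a number) or the summed size of a subtree (a dict)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node

def treemap(node, x, y, w, h, vertical=True, path=""):
    """Return (label, (x, y, width, height)) for every node, parents first."""
    rects = []
    whole = total(node)
    offset = 0.0
    for name, child in node.items():
        size = total(child)
        if vertical:   # split the width among the children
            rect = (x + offset, y, w * size / whole, h)
            offset += w * size / whole
        else:          # split the height among the children
            rect = (x, y + offset, w, h * size / whole)
            offset += h * size / whole
        label = f"{path}/{name}" if path else name
        rects.append((label, rect))
        if isinstance(child, dict):
            # Alternate the split direction at each level of the hierarchy.
            rects.extend(treemap(child, *rect, not vertical, label))
    return rects

# Hypothetical hierarchy: category -> subcategory -> number of stories.
news = {"World": {"Asia": 2, "Europe": 2}, "Sports": 4, "Tech": 2}
layout = dict(treemap(news, 0, 0, 100, 50))

for label, rect in layout.items():
    print(label, rect)
```

Each rectangle's area is proportional to the size of its subtree, and subcategories always lie inside the rectangle of their parent category.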
Visualizing Complex Data and Relations