IDS UNIT-5

The document discusses data reduction techniques in data mining, emphasizing the importance of reducing data volume while maintaining its integrity to improve processing efficiency. Key methods include dimensionality reduction, numerosity reduction, data cube aggregation, and data compression, each with specific techniques such as wavelet transform and clustering. Additionally, it highlights the benefits of data reduction, including cost savings and enhanced storage efficiency, and transitions into data visualization, explaining its role in representing data graphically for better understanding and decision-making.

UNIT-5

Data Reduction
Data mining is applied to selected data drawn from very large databases. When analysis and
mining are performed on a huge volume of data, processing takes so long that it becomes
impractical or infeasible.

Data reduction is a process that represents the original data in a much smaller volume while
maintaining its integrity. Because the reduced representation yields the same (or almost the
same) results as mining the full data set, reduction improves the efficiency of the data mining
process without affecting its outcome.

Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The reduction may be
in terms of the number of rows (records) or the number of columns (dimensions).

Techniques of Data Reduction


The following are the main techniques of data reduction in data mining:

1. Dimensionality Reduction

When a data set contains attributes that are only weakly relevant to the analysis, dimensionality
reduction eliminates those attributes, thereby reducing the volume of the original data. It
reduces data size by removing outdated or redundant features. Three common methods of
dimensionality reduction follow.

i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a
numerically different data vector A' such that both vectors have the same length. The
transform is useful for reduction because the transformed data can be truncated: a
compressed approximation is obtained by retaining only a small fraction of the strongest
wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or
skewed data.
ii. Principal Component Analysis: Suppose the data set to be analyzed consists of tuples
with n attributes. Principal component analysis searches for k orthogonal vectors (the
principal components, with k <= n) that can best represent the data.
In this way, the original data can be projected onto a much smaller space, and
dimensionality reduction is achieved. Principal component analysis can be applied to
sparse and skewed data.
iii. Attribute Subset Selection: A large data set has many attributes, some of which are
irrelevant to the mining task and some of which are redundant. Attribute subset selection
reduces the data volume and dimensionality by eliminating these redundant and
irrelevant attributes.
Attribute subset selection aims to find a good subset of the original attributes: one for
which the resulting probability distribution of the data is as close as possible to the
original distribution obtained using all the attributes.
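The projection at the heart of principal component analysis (method ii above) can be sketched with plain NumPy: the right-singular vectors of the centered data serve as the principal components, and keeping only the k strongest reduces the dimensionality. The data values here are made up for illustration.

```python
import numpy as np

# Toy data set: 6 tuples with n = 3 attributes (hypothetical values).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

# Center the data, then take the SVD; the rows of Vt are the
# principal components, ordered by the variance they explain.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                      # keep the k strongest components
X_reduced = Xc @ Vt[:k].T  # project onto a k-dimensional subspace

print(X_reduced.shape)     # same tuples, fewer dimensions
```

The singular values in S are sorted in decreasing order, so truncating to the first k rows of Vt keeps the directions of greatest variance, which is exactly the "much smaller space" the text describes.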

2. Numerosity Reduction

Numerosity reduction represents the original data volume in a much smaller form. It comes in
two types: parametric and non-parametric.

i. Parametric: Parametric numerosity reduction stores only the model parameters
instead of the original data. Regression and log-linear models are the main parametric
methods.
o Regression and Log-Linear: Linear regression models the relationship between
two attributes by fitting a linear equation to the data set. Suppose we model a
linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining
terms, x and y are numeric database attributes, and w and b are the regression
coefficients.
Multiple linear regression models the response variable y as a linear function of
two or more predictor variables.
A log-linear model discovers the relationship between two or more discrete
attributes in the database. Given a set of tuples in n-dimensional space, the
log-linear model estimates the probability of each tuple in the multidimensional
space.
Regression and log-linear methods can be used for sparse data and skewed data.
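As a sketch of the parametric idea, the following fits y = wx + b by least squares on made-up (x, y) values; only the two coefficients need to be stored, not the original tuples, which is where the reduction comes from.

```python
import numpy as np

# Hypothetical predictor/response pairs that are roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of y = w*x + b; storing just w and b
# (two numbers) replaces storing all the original pairs.
w, b = np.polyfit(x, y, 1)

y_hat = w * x + b  # values reconstructed from the two parameters
print(round(w, 2), round(b, 2))
```

For these values the fit works out to w = 1.96 and b = 0.14, and y_hat approximates the stored y closely; with real data the residual y - y_hat measures how much information the parametric model loses.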
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume
any model. Non-parametric techniques produce a more uniform reduction regardless of
data size, but they may not achieve as high a degree of reduction as the parametric ones.
Common non-parametric techniques include histograms, clustering, sampling, data cube
aggregation, and data compression.
o Histogram: A histogram is a graph of a frequency distribution: it describes how
often each value appears in the data. A histogram uses the binning method to
represent an attribute's data distribution, partitioning the values into disjoint
subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of
only one attribute, a histogram can also be built for multiple attributes; it
can effectively represent up to about five attributes.
o Clustering: Clustering techniques group similar objects from the data so that
objects within a cluster are similar to one another and dissimilar to objects in
other clusters.
How similar the objects within a cluster are can be measured with a distance
function: the more similar two objects are, the closer together they appear in
the cluster.
The quality of a cluster can be judged by its diameter, i.e., the maximum
distance between any two objects in the cluster.
The cluster representations then replace the original data. This technique is
most effective when the data can be organized into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, since it can
reduce a large data set to a much smaller data sample. Below we discuss the
different ways to sample a large data set D containing N tuples:
1. Simple random sample without replacement (SRSWOR) of size s: Here,
s tuples (s < N) are drawn from the N tuples of data set D such that no
tuple can be drawn twice. The probability of drawing any given tuple on
the first draw is 1/N, so all tuples have an equal chance of being sampled.
2. Simple random sample with replacement (SRSWR) of size s: Similar to
SRSWOR, except that each tuple drawn from data set D is recorded and
then replaced into D, so it can be drawn again.

3. Cluster sample: The tuples in data set D are grouped into M mutually
disjoint clusters. Data reduction can then be applied by taking a simple
random sample of s clusters (s < M), for example by applying SRSWOR to
the set of clusters.
4. Stratified sample: The large data set D is partitioned into mutually
disjoint sets called 'strata'. A simple random sample is taken from each
stratum to get stratified data. This method is effective for skewed data.
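The sampling schemes above can be sketched with Python's standard random module; the data set, sample size, and strata below are toy values chosen for illustration.

```python
import random

random.seed(42)  # reproducible for the example

D = list(range(1, 101))  # data set D of N = 100 tuples (toy ids)
s = 10

# SRSWOR: each tuple can appear at most once in the sample.
srswor = random.sample(D, s)

# SRSWR: a drawn tuple is replaced, so it may be drawn again.
srswr = [random.choice(D) for _ in range(s)]

# Stratified sample: partition D into strata, then take a simple
# random sample from each stratum (hypothetical two-way split).
strata = {"low": D[:50], "high": D[50:]}
stratified = [t for part in strata.values()
              for t in random.sample(part, s // 2)]

print(len(srswor), len(srswr), len(stratified))
```

Note that only SRSWOR guarantees distinct tuples; SRSWR may repeat one, and the stratified sample guarantees every stratum is represented, which is why it works well for skewed data.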

3. Data Cube Aggregation

This technique aggregates data into a simpler form. Data cube aggregation is a
multidimensional aggregation that summarizes the original data set at various levels of a data
cube, thus achieving data reduction.
For example, suppose you have the quarterly sales of All Electronics for the years 2018 through
2022. To obtain annual sales, you simply aggregate the sales of the four quarters of each year.
The aggregation yields the required data in a much smaller size, achieving data reduction
without losing any information.
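A minimal sketch of this roll-up, with hypothetical quarterly figures, is just a grouped sum over the quarter dimension:

```python
# Quarterly sales (hypothetical figures) rolled up to annual totals,
# the kind of aggregation a data cube precomputes.
quarterly = {
    (2018, "Q1"): 210, (2018, "Q2"): 180, (2018, "Q3"): 250, (2018, "Q4"): 300,
    (2019, "Q1"): 240, (2019, "Q2"): 200, (2019, "Q3"): 260, (2019, "Q4"): 320,
}

annual = {}
for (year, _quarter), amount in quarterly.items():
    annual[year] = annual.get(year, 0) + amount

print(annual)  # {2018: 940, 2019: 1020}
```

Eight cells collapse into two, and no information needed for the annual-level analysis is lost; a real data cube stores such aggregates at every level (quarter, year, all years) so queries can read them directly.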

Data cube aggregation eases multidimensional analysis: the data cube holds precomputed,
summarized data that data mining can access quickly.

4. Data Compression

Data compression modifies, encodes, or converts the structure of the data so that it consumes
less space, building a compact representation of the information by removing redundancy. If
the original data can be restored exactly from its compressed form, the compression is lossless;
if it cannot, the compression is lossy. Dimensionality reduction and numerosity reduction can
also be viewed as forms of data compression.
This technique reduces file sizes using encoding mechanisms such as Huffman encoding and
run-length encoding. Based on the compression technique, we can divide it into two types.

i. Lossless Compression: Encoding techniques such as run-length encoding allow a simple
and moderate reduction in data size. Lossless compression algorithms restore the exact
original data from the compressed data.
ii. Lossy Compression: In lossy compression, the decompressed data may differ from the
original but remain useful enough to retrieve information from. For example, the JPEG
image format uses lossy compression, yet the decompressed image conveys essentially
the same meaning as the original. Methods such as the discrete wavelet transform and
PCA (principal component analysis) are examples of this kind of compression.
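As a concrete lossless example, run-length encoding can be sketched in a few lines; decoding restores the original string exactly, which is what makes it lossless.

```python
def rle_encode(s):
    """Run-length encode a string into (char, count) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([ch, 1])      # start a new run
    return [(c, n) for c, n in runs]

def rle_decode(pairs):
    """Restore the exact original string from the runs."""
    return "".join(c * n for c, n in pairs)

data = "AAAABBBCCD"
packed = rle_encode(data)
print(packed)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(packed) == data  # round-trips exactly: lossless
```

The scheme only pays off when the data has long runs of repeated values; on data with no repetition the encoded form is actually larger, which is why practical systems pair it with other encodings such as Huffman coding.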

5. Discretization Operation

Data discretization divides the range of a continuous attribute into intervals, replacing the
many distinct values of the attribute with a small number of interval labels. As a result, mining
results can be presented in a concise, easily understandable form.

i. Top-down discretization: If you first choose one or a few points (called breakpoints or
split points) to divide the whole range of the attribute, and then repeat the process on the
resulting intervals, the method is known as top-down discretization, also called splitting.
ii. Bottom-up discretization: If you instead start by treating all the distinct values as
potential split points and then discard some of them by merging neighboring values into
intervals, the process is called bottom-up discretization, also called merging.
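A minimal sketch of top-down discretization using equal-width splitting: the range is divided into k intervals and each continuous value is replaced by its interval label. The ages and bin count are made up for illustration.

```python
# Equal-width (top-down) discretization: split the value range into
# k intervals and replace each continuous value with an interval label.
def discretize(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max into last bin
        labels.append(f"[{lo + i * width:.1f}, {lo + (i + 1) * width:.1f})")
    return labels

ages = [13, 15, 16, 19, 22, 25, 35, 40, 52, 70]
print(discretize(ages, 3))
```

Ten distinct ages collapse into three interval labels, which is exactly the concise form the text describes; other split criteria (e.g., equal frequency or entropy-based splitting) follow the same top-down pattern with a different choice of breakpoints.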

Benefits of Data Reduction

The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk
space, the less capacity you need to purchase. Other benefits of data reduction include:

 Data reduction can save energy.
 Data reduction can reduce your physical storage costs.
 Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly impacts your
total spending on capacity.
DATA ANALYTICS
III YEAR I SEM
5. DATA VISUALIZATION

Introduction

Data visualization is the graphical representation of information and data in a pictorial or
graphical format.

Data visualization gives us a clear idea of what the information means by giving it visual
context through maps or graphs. This makes the data more natural for the human mind to
comprehend and therefore makes it easier to identify trends, patterns, and outliers within
large data sets.

By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.

Why is data visualization important?

Because of the way the human brain processes information, using charts or graphs to visualize large
amounts of complex data is easier than poring over spreadsheets or reports. Data visualization is a
quick, easy way to convey concepts in a universal manner, and you can experiment with different
scenarios by making slight adjustments.

Data visualization can also:

 Identify areas that need attention or improvement.
 Clarify which factors influence customer behaviour.
 Help you understand which products to place where.
 Predict sales volumes.

Difference between Data Visualization and Data Analytics

Data Visualization: Data visualization is the graphical representation of information and data in a
pictorial or graphical format (for example: charts, graphs, and maps). Data visualization tools provide
an accessible way to see and understand trends, patterns, and outliers in data. Data visualization tools
and technologies are essential for analysing massive amounts of information and making data-driven
decisions. The concept of using pictures to understand data has been used for centuries. General types
of data visualizations are charts, tables, graphs, maps, and dashboards.

Data Analytics: Data analytics is the process of analysing data sets in order to draw conclusions about
the information they contain, increasingly with specialized software and systems. Data analytics
technologies are used in commercial industries to allow organizations to make business decisions.
Data can help businesses better understand their customers, improve their advertising campaigns,
personalize their content, and improve their bottom lines. The techniques and processes of data
analytics have been automated into mechanical processes and algorithms that work over raw data.
Data analytics helps a business optimize its performance.

Definition
 Data Visualization: the graphical representation of information and data in a pictorial or
graphical format.
 Data Analytics: the process of analysing data sets in order to make decisions about the
information they hold, increasingly with specialized software and systems.

Benefits
 Data Visualization: identifies areas that need attention or improvement; clarifies which factors
influence customer behaviour; helps understand which products to place where; predicts sales
volumes.
 Data Analytics: identifies the underlying models and patterns; acts as an input source for data
visualization; helps improve the business by predicting its needs.

Used for
 Data Visualization: communicating information clearly and efficiently to users by presenting it
visually.
 Data Analytics: helping a business make more-informed decisions by analysing the data it
collects.

Relation
 Data Visualization: helps to get a better perception of the data.
 Data Analytics: together, visualization and analytics draw conclusions about the data sets; in a
few scenarios, analytics acts as a source for visualization.

Industries
 Data Visualization: widely used in finance, banking, healthcare, retailing, etc.
 Data Analytics: widely used in commerce, finance, healthcare, crime detection, travel
agencies, etc.

Tools
 Data Visualization: Plotly, DataHero, Tableau, Dygraphs, QlikView, ZingChart, etc.
 Data Analytics: Trifacta, Excel/spreadsheets, Hive, Polybase, Presto, Clear Analytics, SAP
Business Intelligence, etc.

Platforms
 Data Visualization: big data processing, service management dashboards, analysis and design.
 Data Analytics: big data processing, data mining, analysis and design.

Techniques
 Data Visualization: static or interactive visualization.
 Data Analytics: prescriptive analytics, predictive analytics.

Performed by
 Data Visualization: data engineers.
 Data Analytics: data analysts.

Data Visualization Techniques

 Box plots
 Histograms
 Heat maps
 Charts
 Tree maps
 Word Cloud/Network diagram

Box Plots

A boxplot is a standardized way of displaying the distribution of data based on a five-number
summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It can tell you
whether there are outliers and what their values are. It can also tell you whether your data is
symmetrical, how tightly your data is grouped, and if and how your data is skewed.

A box plot is a graph that gives you a good indication of how the values in the data are spread out.
Although box plots may seem primitive in comparison to a histogram or density plot, they have the
advantage of taking up less space, which is useful when comparing distributions between many groups
or datasets.
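The five-number summary behind a box plot can be computed directly with NumPy (toy data; np.percentile interpolates linearly between ranks):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # toy sample

summary = {
    "min":    np.min(data),
    "Q1":     np.percentile(data, 25),
    "median": np.median(data),
    "Q3":     np.percentile(data, 75),
    "max":    np.max(data),
}
iqr = summary["Q3"] - summary["Q1"]  # interquartile range: the box height
print(summary, iqr)
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach toward the minimum and maximum; points far outside the IQR (commonly beyond 1.5 × IQR from the box) are drawn as outlier dots.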

Histograms

A histogram is a graphical display of data using bars of different heights. In a histogram, each bar
groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays
the shape and spread of continuous sample data.
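The binning step can be sketched with NumPy on made-up values; each bar's height is the count of values falling in its range:

```python
import numpy as np

values = [1, 3, 4, 4, 5, 6, 7, 8, 8, 8, 9, 12]  # toy continuous sample

# Bin the values into equal-width ranges; taller bars mean more
# data falls in that range.
counts, edges = np.histogram(values, bins=4, range=(0, 12))

for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f})  {'#' * c}")
```

Every value lands in exactly one bin (the last bin is closed on the right), so the counts sum to the sample size, and the resulting bar heights show the shape and spread of the data.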

Charts

Line Chart

The simplest technique, a line plot, is used to plot the relationship or dependence of one variable on
another. To plot the relationship between two variables, we can simply call a plotting library's plot
function.

Bar Charts

Bar charts are used for comparing the quantities of different categories or groups. Values of a category
are represented with the help of bars and they can be configured with vertical or horizontal bars, with
the length or height of each bar representing the value.

Pie Chart

A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportion:
the arc length of each slice is proportional to the quantity it represents. As a rule, pie charts are used to
compare the parts of a whole and are most effective when there are few components and when text and
percentages are included to describe the content.
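The proportionality rule can be sketched directly: each slice's angle is 360 degrees times its share of the total. The sales figures below are hypothetical.

```python
# Each slice's arc is proportional to its share of the whole:
# angle = 360 * value / total.
sales = {"North": 120, "South": 90, "East": 60, "West": 30}
total = sum(sales.values())

angles = {region: 360 * value / total for region, value in sales.items()}
print(angles)  # North 144.0, South 108.0, East 72.0, West 36.0
```

Because the angles always sum to 360, a pie chart can only show parts of a single whole, which is why it breaks down when categories overlap or when there are too many small slices to label.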

Scatter Charts

Another common visualization technique is the scatter plot: a two-dimensional plot representing the
joint variation of two data items. Each marker (a symbol such as a dot, square, or plus sign) represents
an observation, and the marker position indicates the value for that observation. When you assign more
than two measures, a scatter-plot matrix is produced: a series of scatter plots displaying every possible
pairing of the measures assigned to the visualization. Scatter plots are used for examining the
relationship, or correlation, between X and Y variables.

Timeline Charts

Timeline charts illustrate events in chronological order (for example, the progress of a project, an
advertising campaign, or an acquisition process) in whatever unit of time the data was recorded (for
example, week, month, quarter, or year). A timeline chart shows the chronological sequence of past or
future events on a timescale.

Word Clouds and Network Diagrams for Unstructured Data

The variety of big data brings challenges because semi-structured and unstructured data require new
visualization techniques. A word cloud visual represents the frequency of a word within a body of text
with its relative size in the cloud. This technique is used on unstructured data as a way to display high-
or low-frequency words.
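The frequency counting that drives a word cloud can be sketched with collections.Counter; the font-sizing rule here is a made-up linear mapping used only for illustration.

```python
from collections import Counter

text = ("data visualization helps users see data because data "
        "patterns emerge when visualization is clear")

# Word frequency drives the font size in a word cloud:
# the more frequent the word, the larger it is drawn.
freq = Counter(text.split())

base, scale = 10, 8  # hypothetical font-sizing rule
sizes = {word: base + scale * (count - 1) for word, count in freq.items()}

print(freq.most_common(2))  # [('data', 3), ('visualization', 2)]
```

Real word-cloud tools add a stop-word filter and a layout step that packs the scaled words into the cloud shape, but the size of each word is still derived from a frequency table like this one.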

Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and
ties (relationships between the individuals). They are used in many applications, for example for
analysis of social networks or mapping product sales across geographic areas.

Visualization techniques are of increasing importance in exploring and analysing large amounts of
multidimensional information.

Pixel Oriented Technique:

One important class of visualization techniques which is particularly interesting for visualizing very
large multidimensional data sets is the class of pixel-oriented techniques. The basic idea of pixel-
oriented visualization techniques is to represent as many data objects as possible on the screen at the
same time by mapping each data value to a pixel of the screen and arranging the pixels adequately. A
number of different pixel-oriented visualization techniques have been proposed in recent years and it
has been shown that the techniques are useful for visual data exploration in a number of different
application contexts.

 Represent each attribute value as a single colored pixel.
 Map the range of possible attribute values to a fixed color map.
 Maximize the amount of information represented at one time without any overlap.

Data Visualization techniques:

1. Pixel oriented visualization techniques:

 A simple way to visualize the value of a dimension is to use a pixel whose colour reflects the
dimension's value.
 For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one
for each dimension.
 The m dimension values of a record are mapped to m pixels at the corresponding positions in
the windows.
 The colour of each pixel reflects the corresponding value.
 Inside a window, the data values are arranged in some global order shared by all windows.
 Example: All Electronics maintains a customer information table consisting of 4 dimensions:
income, credit limit, transaction volume, and age. We analyse the correlation between income
and the other attributes by visualization.
 We sort all customers by income in ascending order and use this order to lay out the customer
data in the 4 visualization windows, as shown in the figure.
 The pixel colours are chosen so that the smaller the value, the lighter the shading.
 Using pixel-based visualization, we can easily observe that the credit limit increases as income
increases, that customers whose income is in the middle range are more likely to purchase
more from All Electronics, and that there is no clear correlation between income and age.

Fig: Pixel-oriented visualization of 4 attributes, with all customers sorted by income in ascending
order.
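The core mapping of a pixel-oriented display, value to shade in a shared global order, can be sketched with NumPy. The incomes are hypothetical, and the 0-255 darkness scale (0 = lightest, 255 = darkest) is our own convention for the sketch.

```python
import numpy as np

# Hypothetical customer incomes; each value becomes one pixel whose
# shade encodes the value (smaller value -> lighter shading).
income = np.array([18, 55, 32, 74, 41, 90, 27, 63])

order = np.argsort(income)   # global order shared by all windows
sorted_income = income[order]

# Map the value range onto a 0-255 darkness level.
lo, hi = sorted_income.min(), sorted_income.max()
shades = ((sorted_income - lo) / (hi - lo) * 255).astype(int)

print(shades)  # monotonically darker as income rises
```

In the full technique the same `order` is reused to lay out the credit-limit, transaction-volume, and age windows, which is what makes correlations with income visible as matching shading patterns across windows.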

2. Geometric Projection visualization techniques

 A drawback of pixel-oriented visualization techniques is that they cannot help us much in
understanding the distribution of data in a multidimensional space.
 Geometric projection techniques help users find interesting projections of multidimensional
data sets.
 A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be
added using different colours or shapes to represent different data points.
 For example, where x and y are two spatial attributes, the third dimension can be represented
by different shapes.
 Through this visualization, we can see that points of types “+” and “×” tend to be collocated.

Fig: visualization of 2D data set using scatter plot
For data sets with more than four dimensions, scatter plots are usually ineffective. The scatter-plot matrix
technique is a useful extension of the scatter plot. For an n-dimensional data set, a scatter-plot matrix is an
n × n grid of 2-D scatter plots that provides a visualization of each dimension against every other dimension.
The scatter-plot matrix becomes less effective as the dimensionality increases. Another popular technique,
called parallel coordinates, can handle higher dimensionality. To visualize n-dimensional data points, the
parallel coordinates technique draws n equally spaced axes, one for each dimension, parallel to one of the
display axes.
A data record is represented by a polygonal line that intersects each axis at the point corresponding to the
associated dimension value (Figure 2.16). A major limitation of the parallel coordinates technique is that
it cannot effectively show a data set of many records. Even for a data set of several thousand records,
visual clutter and overlap often reduce the readability of the visualization and make the patterns hard to
find.

Figure 2.16 Here is a visualization that uses parallel coordinates.
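The geometry of a parallel-coordinates plot reduces to rescaling each dimension to [0, 1]; each record's row then gives the heights at which its polygonal line crosses the n axes. The records below are made-up 4-D tuples.

```python
import numpy as np

# Three hypothetical 4-D records (e.g., age, income, items, rating).
records = np.array([[30, 55000, 2, 7.5],
                    [45, 82000, 4, 3.0],
                    [38, 61000, 3, 5.0]])

# Rescale each dimension (column) independently to [0, 1]; row i
# holds the axis-crossing heights for record i's polygonal line.
lo = records.min(axis=0)
hi = records.max(axis=0)
heights = (records - lo) / (hi - lo)

print(heights)
```

Drawing the plot is then just connecting each row's heights across the n equally spaced axes; the per-axis rescaling is what lets attributes with wildly different ranges (age vs. income here) share one display.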

3. Icon-based visualization techniques

 These techniques use small icons to represent multidimensional data values.
 Two popular icon-based techniques are: 1) Chernoff faces, 2) Stick figures.
3.1 Chernoff faces: They display multidimensional data of up to 18 variables as a cartoon human
face. Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears, mouth,
and nose, represent values of the dimensions by their shape, size, placement, and orientation. For example,
dimensions can be mapped to the following facial characteristics: eye size, eye spacing, nose length, nose
width, mouth curvature, mouth width, mouth openness, pupil size, eyebrow slant, eye eccentricity, and head
eccentricity. Chernoff faces make use of the ability of the human mind to recognize small differences in
facial characteristics and to assimilate many facial characteristics at once.
Chernoff faces make the data easier for users to digest. In this way, they facilitate visualization of regularities
and irregularities present in the data, although their power in relating multiple relationships is limited.
Another limitation is that specific data values are not shown.

Asymmetrical Chernoff faces were proposed as an extension to the original technique. Since a face has
vertical symmetry (along the y-axis), the left and right side of a face are identical, which wastes space.
Asymmetrical Chernoff faces double the number of facial characteristics, thus allowing up to 36 dimensions
to be displayed.
3.2 Stick figures: This technique maps multidimensional data to five-piece stick figures, where each
figure has four limbs and a body.
Two dimensions are mapped to the display axes, and the remaining dimensions are mapped to the
angle and/or length of the limbs.

Figure 2.18 shows census data, where age and income are mapped to the display axes, and the
remaining dimensions (gender, education, and so on) are mapped to stick figures. If the data items are
relatively dense with respect to the two display dimensions, the resulting visualization shows texture
patterns, reflecting data trends.

4. Hierarchical visualization techniques (i.e. subspaces)

Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces). The subspaces
are visualized in a hierarchical manner.

4.1)“Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical visualization method.


Suppose we want to visualize a 6-D data set, where the dimensions are F,X1,...,X5. We want to observe how
dimension F changes with respect to the other dimensions. We can first fix the values of dimensions X3,X4,X5 to
some selected values, say, c3,c4,c5. We can then visualize F,X1,X2 using a 3-D plot, called a world, as shown in
Figure 2.19. The position of the origin of the inner world is located at the point(c3,c4,c5) in the outer world, which is
another 3-D plot using dimensions X3,X4,X5. A user can interactively change, in the outer world, the location of the
origin of the inner world. The user then views the resulting changes of the inner world. Moreover, a user can vary
the dimensions used in the inner world and the outer world.

4.2) Tree-maps: As another example of hierarchical visualization methods, tree-maps display hierarchical data as a
set of nested rectangles. For example, Figure 2.20 shows a tree-map visualizing Google news stories. All news
stories are organized into seven categories, each shown in a large rectangle of a unique color. Within each category
(i.e., each rectangle at the top level), the news stories are further partitioned into smaller subcategories.

Visualizing Complex Data and Relations

 Visualizing non-numerical data: text and social networks.
 Tag cloud: visualizing user-generated tags, where the importance of a tag is represented by its
font size and colour.
 Besides text data, there are also methods to visualize relationships, such as visualizing social
networks.

5 factors that influence data visualization choices:


1. Audience. It's important to adjust the data representation to the specific target audience. For
example, fitness mobile app users who browse through their progress can easily work with
uncomplicated visualizations. On the other hand, if data insights are intended for researchers and
experienced decision-makers who regularly work with data, you can and often have to go beyond
simple charts.
2. Content. The type of data you are dealing with will determine the tactics. For example, for time-
series metrics you will use line charts to show the dynamics in many cases. To show the relationship
between two elements, scatter plots are often used. In turn, bar charts work well for comparative
analysis.
3. Context. You can use different data visualization approaches and read data depending on the
context. To emphasize a certain figure, for example, significant profit growth, you can use the shades
of one color on the chart and highlight the highest value with the brightest one. On the contrary, to
differentiate elements, you can use contrast colors.
4. Dynamics. There are various types of data, and each type has a different rate of change. For
example, financial results can be measured monthly or yearly, while time-series and tracking data
change constantly. Depending on the rate of change, you may consider dynamic representation
(streaming) or static visualization techniques.
5. Purpose. The goal of data visualization affects the way it is implemented. In order to make a
complex analysis, visualizations are compiled into dynamic and controllable dashboards that work as
visual data analysis techniques and tools. However, dashboards are not necessary to show a single or
occasional data insight.
