Module 1_BCS602_chapter 02.pptx
Learning
S. Sridhar and M. Vijayalakshmi
Module 1_Chapter 2
Understanding of Data
What is Data?
• Data are facts
• Facts are in the form of numbers, audio, video, and images
• Data must be analyzed in order to take decisions.
• Today business organizations are accumulating vast amounts of
data, of the order of gigabytes, terabytes, and exabytes.
Characteristics of Big Data
Types of Data
• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
Structured Data
• RECORD DATA
• GRAPHICS DATA
• DATA MATRIX
• ORDERED DATA – SEQUENCE DATA, SPATIAL DATA, TEMPORAL DATA
Sequence data
DNA sequences, speech recognition, natural language
processing (NLP).
Temporal data
Stock prices, weather forecasting, sensor readings,
traffic data.
Spatial data
Satellite images, geographical mapping, urban planning,
land usage.
Semi-Structured Data
SEMI-STRUCTURED DATA CAN BE ANY ONE OF THE
FOLLOWING –
• XML/JSON OBJECTS
• RSS FEEDS
• HIERARCHICAL RECORDS
Data Storage and Representation
Data Storage
• DATABASE SYSTEMS
• TYPES ARE
1. TRANSACTIONAL DATABASE
2. TIME SERIES DATABASE
3. TEMPORAL DATABASE
Data Storage
• OTHER TYPES
Multimodal Data
Text, video, audio, and mixed type.
Data Preprocessing
•The process of detecting and removing errors in data is called
“Data Cleaning”
•In the real world, available data is “dirty”. It means:
• INCOMPLETE DATA
• OUTLIER DATA
• INCONSISTENT DATA
• INACCURATE DATA
• MISSING VALUES
• DUPLICATE DATA
DOB not given → Incomplete data
-1500 → Noisy data
“ ” → Missing data
DoB (5, 1980) → Inconsistent data
136 → Outlier
Missing Data Analysis: Primary data cleaning process
Removal of Noisy or Outlier value
•Noise is a random error or variance in a measured value.
•Noise can be reduced (smoothed) by using binning.
•Binning is a method where the data values are sorted and
distributed into equal-frequency bins.
•Bins are also called buckets.
•The binning method then uses the neighbouring values to smooth the
noisy value.
•Smoothing by bin medians.
•Smoothing by bin boundaries.
Consider the following set: S =
{12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply the various binning
techniques and show the result.
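The exercise on S can be sketched in Python as follows. This is a minimal illustration, not the textbook's worked solution: the bin size of 3 and the rule that ties go to the lower boundary are assumptions, and the function names are hypothetical.

```python
from statistics import median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_medians(bins):
    """Replace every value in a bin by that bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum or maximum.
    Assumption: a value equally close to both goes to the lower boundary."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bins = equal_frequency_bins(S, 3)            # [[12, 14, 19], [22, 24, 26], [28, 31, 32]]
by_medians = smooth_by_medians(bins)         # [[14, 14, 14], [24, 24, 24], [31, 31, 31]]
by_boundaries = smooth_by_boundaries(bins)   # [[12, 12, 19], [22, 22, 26], [28, 32, 32]]
```

A different bin size or tie rule gives different smoothed values, so the result should be read together with those choices.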
Ratio Data
CATEGORICAL or Qualitative Data
Numerical or Quantitative Data
Interval data
Numerical data where the difference between values is
meaningful, but there is no true zero (i.e., zero does not
indicate an absence of the quantity).
Meaningful Ratios (Multiplication/Division): Interval data ❌ No, Ratio data ✅ Yes
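The interval versus ratio distinction can be illustrated with temperature. This example is an assumption added for illustration and is not taken from the slides: Celsius is treated as interval data (its zero is arbitrary) and Kelvin as ratio data (its zero is a true zero).

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, which has a true zero."""
    return celsius + 273.15

c_low, c_high = 10.0, 20.0

# Differences are meaningful for interval data: 20 °C is 10 degrees warmer than 10 °C.
interval_difference = c_high - c_low

# The Celsius ratio is 2.0, but "twice as hot" is not a valid conclusion.
naive_ratio = c_high / c_low

# On the Kelvin scale the ratio is meaningful, and it is only about 1.035.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)
```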
Central Tendency
MEAN OF DATA
MEDIAN OF DATA
MODE OF DATA
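The three measures of central tendency can be computed with Python's standard `statistics` module. The dataset `marks` below is hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample data (not from the slides).
marks = [2, 3, 3, 5, 7]

mean_value = mean(marks)      # arithmetic average: 20 / 5 = 4
median_value = median(marks)  # middle value of the sorted data: 3
mode_value = mode(marks)      # most frequent value: 3
```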
Dispersion
What is Dispersion?
🔹Dispersion measures how spread out data is
around the central tendency (mean, median, or mode).
🔹 If the data points are close together, dispersion is
low; if they are far apart, dispersion is high.
🔹 It helps us understand variability in a dataset.
Example:
•Dataset 1: [18, 19, 20, 21, 22] → Low dispersion
(values are close together).
•Dataset 2: [5, 10, 20, 35, 50] → High dispersion
(values are spread out).
DISPERSION
RANGE AND STANDARD DEVIATION
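As a rough sketch, range and standard deviation can be computed for the two example datasets given under "What is Dispersion?". The population form of the standard deviation (`pstdev`) is used here as an assumption; `statistics.stdev` gives the sample form instead.

```python
from statistics import pstdev

dataset_1 = [18, 19, 20, 21, 22]   # low dispersion
dataset_2 = [5, 10, 20, 35, 50]    # high dispersion

def data_range(data):
    """Range = largest value minus smallest value."""
    return max(data) - min(data)

range_1 = data_range(dataset_1)    # 22 - 18 = 4
range_2 = data_range(dataset_2)    # 50 - 5 = 45

# Population standard deviation: square root of the mean squared deviation.
std_1 = pstdev(dataset_1)          # about 1.41
std_2 = pstdev(dataset_2)          # about 16.55
```

Both measures are larger for Dataset 2, which agrees with the low and high dispersion labels in the example.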
DISPERSION
QUARTILES AND INTERQUARTILE RANGE (IQR)
IQR = 13
1.5 x IQR = 1.5 x 13 = 19.5
lower_bound = Q1 - 1.5 * IQR = 16.5 - 19.5 = -3
upper_bound = Q3 + 1.5 * IQR = 29.5 + 19.5 = 49
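The arithmetic above can be checked with a short sketch. The slide does not say which dataset produced Q1 = 16.5 and Q3 = 29.5; the assumption here is that it is the set S from the binning exercise, which yields exactly these quartiles under the default "exclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

# Assumption: the quartiles on the slide come from the set S used in the binning exercise.
S = [12, 14, 19, 22, 24, 26, 28, 31, 32]

q1, q2, q3 = quantiles(S, n=4)       # default method="exclusive": 16.5, 24.0, 29.5
iqr = q3 - q1                        # 29.5 - 16.5 = 13.0

lower_bound = q1 - 1.5 * iqr         # 16.5 - 19.5 = -3.0
upper_bound = q3 + 1.5 * iqr         # 29.5 + 19.5 = 49.0

# Values outside [lower_bound, upper_bound] are flagged as outliers.
outliers = [x for x in S if x < lower_bound or x > upper_bound]   # [] for S
```

Other quartile conventions (for example `method="inclusive"`) can give slightly different Q1 and Q3, and therefore different bounds.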
Five-point summary and Box Plots
5-POINT SUMMARY
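A five-point summary consists of the minimum, Q1, the median, Q3, and the maximum, which are the values a box plot draws. The sketch below assumes the set S from the binning exercise as input and the "exclusive" quartile convention; the function name is hypothetical.

```python
from statistics import quantiles

def five_point_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4)   # default method="exclusive"
    return (min(data), q1, q2, q3, max(data))

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
summary = five_point_summary(S)         # (12, 16.5, 24.0, 29.5, 32)
```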
Shape of Data
SKEWNESS
Mean < median: Negative (left-skewed)
Mean > median: Positive (right-skewed)
Peak on the right, tail extending to the left ➝ Left-skewed
Peak on the left, tail extending to the right ➝ Right-skewed
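The mean versus median rule can be sketched as a small helper. It is a rule of thumb rather than a formal skewness statistic, and the three sample lists are hypothetical.

```python
from statistics import mean, median

def skew_direction(data):
    """Classify skew by comparing the mean with the median (rule of thumb)."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive (right-skewed)"
    if m < med:
        return "negative (left-skewed)"
    return "symmetric"

right_tail = skew_direction([1, 2, 2, 3, 10])   # mean 3.6 > median 2
left_tail = skew_direction([1, 8, 9, 9, 10])    # mean 7.4 < median 9
balanced = skew_direction([1, 2, 3, 4, 5])      # mean 3 == median 3
```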
Shape of Data
KURTOSIS
Shape of Data
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF VARIATION
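A minimal sketch of both measures, using the two example datasets from the Dispersion slides. Two assumptions are made: the mean absolute deviation is taken about the mean (some texts use the median), and the coefficient of variation uses the population standard deviation, expressed as a percentage.

```python
from statistics import mean, pstdev

def mean_absolute_deviation(data):
    """Average absolute distance of each value from the mean."""
    centre = mean(data)
    return sum(abs(x - centre) for x in data) / len(data)

def coefficient_of_variation(data):
    """Standard deviation relative to the mean, as a percentage."""
    return pstdev(data) / mean(data) * 100

dataset_1 = [18, 19, 20, 21, 22]
dataset_2 = [5, 10, 20, 35, 50]

mad_1 = mean_absolute_deviation(dataset_1)    # (2 + 1 + 0 + 1 + 2) / 5 = 1.2
mad_2 = mean_absolute_deviation(dataset_2)    # (19 + 14 + 4 + 11 + 26) / 5 = 14.8
cv_1 = coefficient_of_variation(dataset_1)    # about 7.07 %
cv_2 = coefficient_of_variation(dataset_2)    # about 68.97 %
```

Because the coefficient of variation is unit-free, it allows the spread of datasets with different means or units to be compared.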
Special Univariate Plots: Stem-and-Leaf Plot
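A stem-and-leaf plot can be built in a few lines. The sketch assumes non-negative integers, with the tens part as the stem and the units digit as the leaf, and reuses the set S from the binning exercise as sample input; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by stem (tens part); leaves are the units digits."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
plot = stem_and_leaf(S)   # {1: [2, 4, 9], 2: [2, 4, 6, 8], 3: [1, 2]}

# Render one text line per stem, e.g. "1 | 2 4 9".
lines = []
for stem, leaves in sorted(plot.items()):
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    lines.append(f"{stem} | {leaf_text}")
```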
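A Q-Q plot is a 2D scatter plot of the quantiles of univariate data
against the quantiles of a theoretical distribution.
THE Q-Q PLOT IS A VISUAL NORMALITY TEST. IF THE POINTS LIE CLOSE TO A STRAIGHT LINE, THEN THE
DISTRIBUTION IS APPROXIMATELY NORMAL.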
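The points of a normal Q-Q plot can be computed with the standard library alone. This is a sketch under stated assumptions: the plotting position (i - 0.5) / n is one common convention among several, the data are standardised with the population standard deviation, and the set S from the binning exercise is used as sample input.

```python
from statistics import NormalDist, mean, pstdev

def normal_qq_points(data):
    """Return (theoretical quantile, standardised sample value) pairs."""
    ordered = sorted(data)
    n = len(ordered)
    mu, sigma = mean(ordered), pstdev(ordered)
    standard_normal = NormalDist()           # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n                    # assumed plotting position
        theoretical = standard_normal.inv_cdf(p)
        sample = (x - mu) / sigma
        points.append((theoretical, sample))
    return points

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
points = normal_qq_points(S)
```

Plotting these pairs with any charting tool gives the Q-Q plot; points that fall close to the line y = x suggest the data are approximately normal.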
Summary
Univariate and Bivariate Data
1. Univariate Data: