Module 1_BCS602_chapter 02.pptx

The document provides an overview of data types, including structured, semi-structured, and unstructured data, as well as the characteristics of big data. It discusses various analytics methods such as descriptive, diagnostic, predictive, and prescriptive analytics, along with data storage and processing techniques. Additionally, it covers data preprocessing, normalization, and the importance of central tendency and dispersion in data analysis.


Machine Learning
S. Sridhar and M. Vijayalakshmi
Module 1_Chapter 2

Understanding of Data
What is Data?
• Data are facts.
• Facts come in the form of numbers, audio, video, and images.
• Data must be analyzed to support decision-making.
• Today, business organizations are accumulating vast amounts of data, on the order of gigabytes, terabytes, and exabytes.
Characteristics of Big Data
Types of Data
• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
Structured Data

STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• RECORD DATA
• GRAPHICS DATA
• DATA MATRIX
• ORDERED DATA – SEQUENCE DATA, SPATIAL DATA, TEMPORAL
DATA
Sequence data: DNA sequences, speech recognition, natural language processing (NLP).
Temporal data: Stock prices, weather forecasting, sensor readings, traffic data.
Spatial data: Satellite images, geographical mapping, urban planning, land usage.
Unstructured Data
UNSTRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• VIDEO, IMAGE, PROGRAMS
• BLOG DATA
• IT ACCOUNTS FOR ABOUT 80% OF ORGANIZATIONAL DATA
Semi-Structured Data
SEMI-STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• XML/JSON OBJECTS
• RSS FEEDS
• HIERARCHICAL RECORDS
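
To make the idea of semi-structured data concrete, here is a minimal sketch that parses two JSON objects. The records and their field names are hypothetical and not taken from the slides; the point is only that both records are valid even though they do not share a fixed schema.

```python
import json

# Hypothetical records: each one is self-describing, but the two do not
# share the same set of fields (no fixed schema, unlike a relational table).
raw = """
[
  {"id": 1, "name": "Asha", "phones": ["111", "222"]},
  {"id": 2, "name": "Ravi", "address": {"city": "Mysuru"}}
]
"""

records = json.loads(raw)


def fields_per_record(items):
    """Return the sorted field names found in each record."""
    return [sorted(item.keys()) for item in items]


print(fields_per_record(records))
```

The nested list and nested object show the hierarchical nature of such records.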
Data Storage and Representation
Data Storage
• DATABASE SYSTEMS
• TYPES ARE
1. TRANSACTIONAL DATABASE
2. TIME SERIES DATABASE
3. TEMPORAL DATABASE
Data Storage
• OTHER
TYPES

© Oxford University Press 2021. All rights reserved

BIG DATA ANALYTICS
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Types of Big Data Analytics
1. Descriptive Analytics (What Happened?)
Example: The company analyzes past customer data and finds that
5,000 customers left in the last 3 months.
Insight: Reports and dashboards show that most churned customers
(customers who left the company) were prepaid users with high call
drop rates.
ML Techniques: Data visualization, reports, trend analysis.

2. Diagnostic Analytics (Why Did It Happen?)

Example: The company investigates the data and finds that most
customers who left had frequent network issues and billing
disputes.
Insight: Customer complaints and surveys show that poor customer
service and high call drop rates were major reasons for churn.
ML Techniques: Correlation analysis, clustering, decision trees.
3. Predictive Analytics (What Might Happen in the Future?)
Example: Using historical data, the company builds a churn
prediction model that identifies customers likely to leave in the
next month.
Insight: The model predicts that 30% of prepaid customers with
frequent billing issues and low engagement will churn in the
next quarter.
ML Techniques: Regression models, neural networks, classification
algorithms.

4. Prescriptive Analytics (What Should Be Done?)

Example: Based on the predictive model, the company sends special
discount offers and better service plans to at-risk customers to
retain them.
Insight: By offering better network coverage and customer
service, the company reduces churn by 15% in the next quarter.
ML Techniques: Recommendation systems, reinforcement learning,
optimization algorithms.
Big Data Analysis Framework
A big data framework is a layered architecture. A four-layer architecture includes:
Data Connection Layer: Takes raw data and imports it into appropriate data structures using ETL (Extract, Transform, Load) operations.
Data Management Layer: Performs preprocessing of data, parallel execution of queries, read and write operations, and data management tasks.
Data Analytics Layer: Provides functionalities such as statistical tests, machine learning algorithms, and the construction of machine learning models.
Types of Processing
• CLOUD COMPUTING
• GRID COMPUTING
• HIGH-PERFORMANCE COMPUTING (HPC)
Data collection
• GOOD DATA SHOULD HAVE THESE CHARACTERISTICS
Data source can be classified as
Open-Source Data
Social-Media Data
Multimodal Data
Social-Media Data
1. TWITTER DATA
2. FACEBOOK DATA
3. YOUTUBE VIDEOS
4. INSTAGRAM DATA

Multimodal Data
Text, video, audio, and mixed types.
Data Preprocessing
• The process of detecting and removing errors in data is called “Data Cleaning”.
• Real-world data is often “dirty”, which means it may contain:

• INCOMPLETE DATA
• OUTLIER DATA
• INCONSISTENT DATA
• INACCURATE DATA
• MISSING VALUES
• DUPLICATE DATA
DOB not given -> incomplete data
-1500 -> noisy data
“ ” -> missing data
DoB (5, 1980) -> inconsistent data
136 -> outlier
Missing data analysis is the primary data cleaning process.
Removal of Noisy or Outlier value
• Noise is a random error or variance in a measured value.
• It can be removed by using binning.
• Binning is a method where the data values are sorted and distributed into equal-frequency bins.
• Bins are also called buckets.
• The binning method then uses the neighbouring values to smooth the noisy value.
• Smoothing by bin medians.
• Smoothing by bin boundaries.
Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}.
Apply various binning techniques and show the result.

Smoothing by equal frequency

Smoothing by binning means
Smoothing by Bin Boundaries.
BINNING TECHNIQUE
1) Equal frequency binning method
Consider the following set
S = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34}.
Apply various binning techniques and show the result.
Assume bin size = 4.
Partition using equal frequency approach: -
Bin 1 : 4, 8, 9, 15
Bin 2 : 21, 21, 24, 25
Bin 3 : 26, 28, 29, 34
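Smoothing by bin means (exact means 9, 22.75, and 29.25, rounded to the nearest integer): -
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum): -
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
Smoothing by bin median (exact medians 8.5, 22.5, and 28.5, rounded up): -
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29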
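
The worked example above can be reproduced with a short sketch. The function names are illustrative, and the tie-breaking rule in boundary smoothing (ties go to the lower boundary) is an assumption, since the slides do not state one. The code keeps the exact means and medians rather than rounding them as the slides do.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_statistic(bins, statistic):
    """Replace every value in a bin by one statistic of that bin (mean or median)."""
    return [[statistic(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


S = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(S, 4)
by_means = smooth_by_statistic(bins, mean)
by_medians = smooth_by_statistic(bins, median)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_medians)
print(by_boundaries)
```

The same functions can be applied to the first set, S = {12, 14, 19, 22, 24, 26, 28, 31, 32}, with a bin size of 3.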
Data Integration and data Transformation
Data integration involves merging data from multiple sources into a single data source.
This may produce redundant data, which must be removed.
Data transformation routines perform operations such as normalization to improve the performance of data mining algorithms.
In normalization, attribute values are scaled to fit into a range, for example 0 to 1.
Some of the normalization procedures used are:
1) Min-Max
2) z-Score
Data Normalization
MIN-MAX PROCEDURE
TRANSFORMS DATA TO THE RANGE 0-1
Neural networks use this procedure.
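
A minimal sketch of the min-max procedure, assuming the standard form v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The function name and the handling of a constant column are illustrative choices, not taken from the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


print(min_max_normalize([10, 20, 30]))
```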
Data Normalization
Z-SCORE
Z-score normalization, also known as standardization, transforms the data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1.
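
A minimal sketch of z-score normalization, z = (v - μ) / σ. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly different values.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Shift and scale values so the result has mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


z = z_score_normalize([10, 20, 30])
print(z)
```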
DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that deals with data set summarization. It summarizes and describes the data.
Data set: a collection of data objects.
Types of Data
CATEGORICAL or Qualitative Data
NUMERICAL or Quantitative Data
Interval data
Numerical data where the difference between values is
meaningful, but there is no true zero (i.e., zero does not
indicate an absence of the quantity).

Example: Temperature in Celsius or Fahrenheit:

The difference between 20°C and 30°C is the same as the difference
between 40°C and 50°C (10°C).
However, 0°C does not mean "no temperature"—it is just a reference point.
Also, saying 40°C is "twice as hot" as 20°C is incorrect because temperature
ratios are meaningless in the interval scale.
Ratio Data
Definition: Ratio data is numerical data where both differences
and ratios are meaningful because there is a true zero (zero
means the complete absence of the quantity).

Example: Height and Weight:

A person who is 180 cm tall is 20 cm taller than someone who is 160 cm
(meaningful difference).
A weight of 0 kg means no weight (true zero exists).
A 60 kg person is twice as heavy as a 30 kg person (ratios make sense).
Features | Interval Data | Ratio Data
Meaningful Differences | Yes | Yes
True Zero Exists | No | Yes
Meaningful Ratios (Multiplication/Division) | No | Yes
Example | Temperature (°C, °F) | Height, Weight, Age, Distance, Speed
Types of Data Based on the Number of Variables (3rd Way)

In univariate data, the data set has only one variable.
In bivariate data, the data set has two variables.
In multivariate data, the data set has three or more variables.
Univariate Data analysis and Visualization
Univariate analysis is the simplest form of data analysis, where we
analyze only one variable at a time.
The main purpose is to describe and summarize the data without
considering relationships between variables.
Types of Univariate Analysis:
Descriptive Statistics:
Measures used to describe the data:
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation, IQR (Interquartile
Range)
Graphical Representation:
Bar Chart: Represents categorical data frequencies.
Pie Chart: Displays proportion in categorical data.
Histogram: Shows the distribution of numerical data.
Box Plot: Identifies outliers and the spread of data.
Data Visualization - BAR CHART
Data Visualization - PIE CHART
Data Visualization - HISTOGRAMS
Data Visualization - DOT PLOTS
Central Tendency
Why Do We Need Central Tendency?
✔ We can't remember all data points, so we
summarize data using central tendency.
✔ Helps find a single representative value for a
dataset.
✔ Useful for comparison and decision-making in data
analysis.
The three main measures of central tendency are:
1.Mean (Average)
2.Median (Middle Value)
3.Mode (Most Frequent Value)
Central Tendency: MEAN OF DATA
Central Tendency: MEDIAN OF DATA
Central Tendency: MODE OF DATA
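
The three measures can be computed with Python's standard `statistics` module. As a small sketch, the data set below reuses the set S from the binning example; choosing it here is an illustration, not something the slides do.

```python
from statistics import mean, median, multimode

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

summary = {
    "mean": mean(data),        # arithmetic average: sum of values / number of values
    "median": median(data),    # middle value; average of the two middle values when n is even
    "modes": multimode(data),  # most frequent value(s); a list because ties are possible
}

print(summary)
```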
Dispersion
What is Dispersion?
🔹Dispersion measures how spread out data is
around the central tendency (mean, median, or mode).
🔹 If the data points are close together, dispersion is
low; if they are far apart, dispersion is high.
🔹 It helps us understand variability in a dataset.
Example:
•Dataset 1: [18, 19, 20, 21, 22] → Low dispersion
(values are close together).
•Dataset 2: [5, 10, 20, 35, 50] → High dispersion
(values are spread out).
DISPERSION
RANGE AND STANDARD DEVIATION
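
A short sketch computing the range and standard deviation for the two example data sets above. It assumes the population standard deviation (`statistics.pstdev`); the helper name `value_range` is illustrative.

```python
from statistics import pstdev


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


dataset_1 = [18, 19, 20, 21, 22]  # low dispersion
dataset_2 = [5, 10, 20, 35, 50]   # high dispersion

for name, data in (("Dataset 1", dataset_1), ("Dataset 2", dataset_2)):
    print(name, "range =", value_range(data), "std dev =", round(pstdev(data), 3))
```

Both measures come out much larger for Dataset 2, which matches the low and high dispersion labels in the example.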
DISPERSION
QUARTILES AND INTERQUARTILE RANGE (IQR)
IQR = Q3 - Q1 = 29.5 - 16.5 = 13
1.5 x IQR = 1.5 x 13 = 19.5
lower_bound = Q1 - 1.5 x IQR = 16.5 - 19.5 = -3
upper_bound = Q3 + 1.5 x IQR = 29.5 + 19.5 = 49
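
The bounds above can be checked with a small sketch. Q1 = 16.5 and Q3 = 29.5 are taken from the slide; the function names and the sample values passed to `find_outliers` are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_bound, upper_bound) = (Q1 - k * IQR, Q3 + k * IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall strictly outside the two bounds."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


lower_bound, upper_bound = iqr_fences(16.5, 29.5)
print(lower_bound, upper_bound)

# Hypothetical values, used only to show which ones get flagged.
print(find_outliers([-10, 5, 20, 49, 60], 16.5, 29.5))
```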
Five-point summary and Box Plots
5-POINT SUMMARY
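
A five-point summary consists of the minimum, Q1, the median, Q3, and the maximum, and these are the values a box plot draws. The sketch below uses `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions; a textbook computation may use a different one and give slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # Assumption: quartiles are computed with the "inclusive" interpolation method.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


print(five_number_summary([18, 19, 20, 21, 22]))
```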
Shape of Data

SKEWNESS
Mean < median: negative skew. Peak on the right, tail extending to the left ➝ left-skewed.
Mean > median: positive skew. Peak on the left, tail extending to the right ➝ right-skewed.
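
As a sketch, the mean-versus-median rule above can be coded directly, alongside a moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). The moment-based formula is a common definition and is an addition here, not taken from the slides. The sample values are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew with the mean-versus-median rule."""
    diff = mean(values) - median(values)
    if diff > 0:
        return "positive (right-skewed)"
    if diff < 0:
        return "negative (left-skewed)"
    return "symmetric"


def moment_skewness(values):
    """Population skewness: m3 / m2 ** 1.5, using central moments."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 3, 4, 20]      # hypothetical: one large value stretches the right tail
left_tailed = [-20, -4, -3, -2, -1]  # hypothetical mirror image

print(skew_direction(right_tailed), moment_skewness(right_tailed))
print(skew_direction(left_tailed), moment_skewness(left_tailed))
```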
Shape of Data
KURTOSIS
Shape of Data
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF VARIATION
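
A minimal sketch of both measures, assuming the usual definitions: mean absolute deviation = average of |x - mean|, and coefficient of variation = (standard deviation / mean) x 100. The population standard deviation is an assumption; the data set is Dataset 1 from the dispersion example.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mu = mean(values)
    return sum(abs(v - mu) for v in values) / len(values)


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (mean must be non-zero)."""
    return pstdev(values) / mean(values) * 100  # assumption: population std dev


data = [18, 19, 20, 21, 22]
print(mean_absolute_deviation(data))
print(coefficient_of_variation(data))
```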
Special Univariate Plots-Stem-Leaf Plot
A Q-Q plot is a 2D scatter plot of univariate data against theoretical quantiles.
A Q-Q PLOT IS USED AS A NORMALITY TEST. IF THE DATA POINTS LIE CLOSE TO A STRAIGHT LINE, THEN THE
DISTRIBUTION IS APPROXIMATELY NORMAL.
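
The points of a normal Q-Q plot can be generated without any plotting library. This sketch pairs each sorted data value with the quantile of a normal distribution fitted to the data. The plotting positions (i + 0.5) / n are one common convention and are an assumption here, as is the function name.

```python
from statistics import NormalDist, mean, pstdev


def normal_qq_points(values):
    """Return (theoretical_quantile, observed_value) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), pstdev(ordered))
    # Assumption: plotting positions (i + 0.5) / n for i = 0 .. n - 1.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]


points = normal_qq_points([18, 19, 20, 21, 22])
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

Plotting these pairs as a scatter plot gives the Q-Q plot; points falling near the line y = x suggest approximate normality.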
Summary
Univariate and Bivariate Data
1. Univariate Data:
Definition: Univariate data consists of observations on a single variable. It focuses on analyzing only one characteristic at a time.
Example:
Heights of students in a class: {150 cm, 160 cm, 165 cm, 170 cm, 175 cm}
Monthly sales of a store: {₹50,000, ₹60,000, ₹55,000, ₹70,000}
Analysis: Mean, median, mode, range, and standard deviation can be used to summarize and analyze univariate data.
2. Bivariate Data:
Definition: Bivariate data consists of observations on two variables and focuses on finding relationships between them.
Example: Height vs. weight of students.
Analysis: Correlation, regression, and scatter plots are used to determine relationships between the two variables.
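
As a sketch of bivariate analysis, the Pearson correlation coefficient can be computed by hand. The heights reuse the univariate example above; the weights are hypothetical values invented for illustration, since the slides do not list them.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


heights_cm = [150, 160, 165, 170, 175]  # from the univariate example
weights_kg = [50, 56, 61, 66, 72]       # hypothetical weights, for illustration only

r = pearson_r(heights_cm, weights_kg)
print(round(r, 3))
```

A value of r close to +1 indicates a strong positive linear relationship between the two variables.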
