Module 1_BCS602_chapter 02.pptx

The document provides an overview of data types, including structured, semi-structured, and unstructured data, as well as the characteristics of big data. It discusses various analytics methods such as descriptive, diagnostic, predictive, and prescriptive analytics, along with data storage and processing techniques. Additionally, it covers data preprocessing, normalization, and the importance of central tendency and dispersion in data analysis.

Uploaded by

notfairksd
Copyright
© © All Rights Reserved

Machine Learning
S. Sridhar and M. Vijayalakshmi
Module 1_Chapter 2

Understanding of Data
What is Data?
• Data are facts
• Facts are in the form of numbers, audio, video, and images
• Data must be analyzed in order to make decisions.
• Today, business organizations are accumulating vast amounts of data, on the order of gigabytes, terabytes, and exabytes.
Characteristics of Big Data
Characteristics of Big Data
Types of Data
• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
Structured Data

STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• RECORD DATA
• GRAPH DATA
• DATA MATRIX
• ORDERED DATA – SEQUENCE DATA, SPATIAL DATA, TEMPORAL DATA
Temporal data
Stock prices, weather forecasting, sensor readings, traffic data.
Sequence data
DNA sequences, speech recognition, natural language processing (NLP).
Spatial data
Satellite images, geographical mapping, urban planning, land usage.
Unstructured Data
UNSTRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• VIDEO, IMAGE, PROGRAMS
• BLOG DATA

ABOUT 80% OF ORGANIZATIONAL DATA IS UNSTRUCTURED.
Semi-Structured Data
SEMI-STRUCTURED DATA CAN BE ANY ONE OF THE FOLLOWING –

• XML/JSON OBJECTS
• RSS FEEDS
• HIERARCHICAL RECORDS
Data Storage and Representation
Data Storage
• DATABASE SYSTEMS
• TYPES ARE
1. TRANSACTIONAL DATABASE
2. TIME SERIES DATABASE
3. TEMPORAL DATABASE
Data Storage
• OTHER TYPES

© Oxford University Press 2021. All rights reserved


BIG DATA ANALYTICS
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Types of Big Data Analytics
1. Descriptive Analytics (What Happened?)
Example: The company analyzes past customer data and finds that
5,000 customers left in the last 3 months.
Insight: Reports and dashboards show that most churned customers
(customers who left the company) were prepaid users with high
call drop rates.
ML Techniques: Data visualization, reports, trend analysis.

2. Diagnostic Analytics (Why Did It Happen?)


Example: The company investigates the data and finds that most
customers who left had frequent network issues and billing
disputes.
Insight: Customer complaints and surveys show that poor customer
service and high call drop rates were major reasons for churn.
ML Techniques: Correlation analysis, clustering, decision trees.

3. Predictive Analytics (What Might Happen in the Future?)

Example: Using historical data, the company builds a churn
prediction model that identifies customers likely to leave in the
next month.
Insight: The model predicts that 30% of prepaid customers with
frequent billing issues and low engagement will churn in the
next quarter.
ML Techniques: Regression models, neural networks, classification
algorithms.

4. Prescriptive Analytics (What Should Be Done?)


Example: Based on the predictive model, the company sends special
discount offers and better service plans to at-risk customers to
retain them.
Insight: By offering better network coverage and customer
service, the company reduces churn by 15% in the next quarter.
ML Techniques: Recommendation systems, reinforcement learning,
optimization algorithms.
Big Data Analysis Framework
A big data framework is a layered architecture. A four-layer
architecture includes the following layers:
Data Connection Layer: takes raw data and imports it into
appropriate data structures, using ETL (Extract, Transform, Load) operations.
Data Management Layer: performs preprocessing of data, parallel
execution of queries, read and write operations, and data management tasks.
Data Analytics Layer: provides functionalities such as statistical
tests, machine learning algorithms, and the construction of machine learning
models.
Types of Processing
• CLOUD COMPUTING
• GRID COMPUTING
• HIGH-PERFORMANCE COMPUTING (HPC)
Data collection
• GOOD DATA SHOULD HAVE THESE CHARACTERISTICS
Data sources can be classified as
Open-Source Data
Social-Media Data
Multimodal Data
Social-Media Data
1. TWITTER DATA
2. FACEBOOK DATA
3. YOUTUBE VIDEOS
4. INSTAGRAM DATA

Multimodal Data
Text, video, audio, and mixed types.
Data Preprocessing
• The process of detecting and removing errors in data is called
“data cleaning”.
• In the real world, available data is “dirty”, meaning it may contain:

• INCOMPLETE DATA
• OUTLIER DATA
• INCONSISTENT DATA
• INACCURATE DATA
• MISSING VALUES
• DUPLICATE DATA
DOB not given → Incomplete data
-1500 → Noisy data
“ ” → Missing data
DoB (5, 1980) → Inconsistent data
136 → Outlier
Missing data analysis is the primary data cleaning process.
Removal of Noisy or Outlier Values
• Noise is a random error or variance in a measured value.
• It can be removed by using binning.
• Binning is a method where the data values are sorted and
distributed into equal-frequency bins.
• Bins are also called buckets.
• The binning method then uses the neighbouring values to smooth the
noisy value.
• Smoothing by bin means.
• Smoothing by bin medians.
• Smoothing by bin boundaries.
Consider the following set: S =
{12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply various binning
techniques and show the result.

Partitioning by equal frequency
Smoothing by bin means
Smoothing by bin boundaries
BINNING TECHNIQUE
Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply
various binning techniques and show the result.
1) Equal frequency binning method
Consider the following set
S = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34}.
Apply various binning techniques and show the result.
Assume bin size = 4.
Partition using equal frequency approach: -
Bin 1 : 4, 8, 9, 15
Bin 2 : 21, 21, 24, 25
Bin 3 : 26, 28, 29, 34
Smoothing by bin means (exact means 9, 22.75, and 29.25, rounded): -
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries: -
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
Smoothing by bin median (exact medians 8.5, 22.5, and 28.5, rounded up): -
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
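The worked example above can be reproduced with a short Python sketch. The function names are illustrative and do not come from the slides, and the tie rule in boundary smoothing (a value equidistant from both boundaries goes to the lower one) is an assumption. The sketch keeps exact means and medians, so Bin 2 yields 22.75 and 22.5 where the slide shows the rounded value 23.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum.

    Assumption: ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

S = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(S, bin_size=4)

print("Bins:      ", bins)
print("Means:     ", smooth_by_means(bins))
print("Medians:   ", smooth_by_medians(bins))
print("Boundaries:", smooth_by_boundaries(bins))
```

The same functions could be applied to the first set, S = {12, 14, 19, 22, 24, 26, 28, 31, 32}, once a bin size is chosen; the slides do not state one for that set.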
Data Integration and data Transformation
Data integration involves merging data from multiple
sources into a single data source.
This may produce redundant data, which must be
removed.
Data transformation routines perform normalization of
data to improve the performance of the algorithms applied to it.
In normalization, attribute values are scaled to fit into a data
range, for example (0-1).
Some of the normalization procedures used are
1)Min-Max
2)z-Score
Data Normalization
MIN-MAX PROCEDURE
TRANSFORMS DATA TO THE RANGE 0-1
Neural networks use this procedure.
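The formula itself does not survive in the slide text. The usual min-max rule maps a value v to (v - min) / (max - min) × (new_max - new_min) + new_min, which reduces to (v - min) / (max - min) for the range 0-1. A minimal sketch, assuming that standard formula; the function name and the sample marks are hypothetical:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max normalization is undefined")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

# Hypothetical sample data, not taken from the slides.
marks = [88, 90, 92, 94]
normalized = min_max_normalize(marks)
print(normalized)  # smallest value maps to 0, largest to 1
```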
Data Normalization
Z-SCORE

Z-score normalization, also known as standardization, transforms the
data so that it has a mean (μ) of 0 and a standard deviation (σ) of 1.
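A minimal sketch of z-score normalization, computing z = (v - μ) / σ for each value. The use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slides do not say whether the population or sample form is intended; the function name and sample data are hypothetical.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical sample data, not taken from the slides.
marks = [10, 20, 30]
z = z_score_normalize(marks)
print(z)
print(mean(z), pstdev(z))  # close to 0 and 1 after standardization
```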
DESCRIPTIVE STATISTICS
Descriptive statistics is a branch of statistics that
summarizes and describes a data set.
Data set: a collection of data objects.
Types of Data

Categorical or Qualitative Data
Numerical or Quantitative Data: Interval Data, Ratio Data
Interval Data
Numerical data where the difference between values is
meaningful, but there is no true zero (i.e., zero does not
indicate an absence of the quantity).

Example: Temperature in Celsius or Fahrenheit:
The difference between 20°C and 30°C is the same as the difference
between 40°C and 50°C (10°C).
However, 0°C does not mean "no temperature"—it is just a reference point.
Also, saying 40°C is "twice as hot" as 20°C is incorrect because temperature
ratios are meaningless in the interval scale.
Ratio Data
Definition: Ratio data is numerical data where both differences
and ratios are meaningful because there is a true zero (zero
means the complete absence of the quantity).

Example: Height and Weight:
A person who is 180 cm tall is 20 cm taller than someone who is 160 cm
(meaningful difference).
A weight of 0 kg means no weight (true zero exists).
A 60 kg person is twice as heavy as a 30 kg person (ratios make sense).
Features                                      Interval Data          Ratio Data
Meaningful Differences                        ✅ Yes                 ✅ Yes
True Zero Exists                              ❌ No                  ✅ Yes
Meaningful Ratios (Multiplication/Division)   ❌ No                  ✅ Yes
Example                                       Temperature (°C, °F)   Height, Weight, Age, Distance, Speed
Types of Data based on number of variables (3rd way)
In univariate data, the dataset has only one variable.
In bivariate data, the dataset has two variables.
In multivariate data, the dataset has three or more variables.
Univariate Data analysis and Visualization
Univariate analysis is the simplest form of data analysis, where we
analyze only one variable at a time.
The main purpose is to describe and summarize the data without
considering relationships between variables.
Types of Univariate Analysis:
Descriptive Statistics:
Measures used to describe the data:
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation, IQR (Interquartile
Range)
Graphical Representation:
Bar Chart: Represents categorical data frequencies.
Pie Chart: Displays proportion in categorical data.
Histogram: Shows the distribution of numerical data.
Box Plot: Identifies outliers and the spread of data.
Data Visualization - BAR CHART
Data Visualization - PIE CHART
Data Visualization - HISTOGRAMS
Data Visualization - DOT PLOTS
Central Tendency
Why Do We Need Central Tendency?
✔ We can't remember all data points, so we
summarize data using central tendency.
✔ Helps find a single representative value for a
dataset.
✔ Useful for comparison and decision-making in data
analysis.
The three main measures of central tendency are:
1. Mean (Average)
2. Median (Middle Value)
3. Mode (Most Frequent Value)
Central Tendency

MEAN OF DATA
Central Tendency
MEDIAN OF DATA
Central Tendency
MODE OF DATA
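The three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample marks are hypothetical and are not taken from the slides.

```python
from statistics import mean, median, multimode

# Hypothetical sample data, not taken from the slides.
marks = [20, 21, 21, 22, 25, 29]

average = mean(marks)     # sum of the values divided by their count
middle = median(marks)    # middle value; average of the two middle values for an even count
modes = multimode(marks)  # list of the most frequent value(s)

print("Mean:  ", average)
print("Median:", middle)
print("Mode:  ", modes)
```

`multimode` returns a list because a dataset can have more than one most frequent value.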
Dispersion
What is Dispersion?
🔹Dispersion measures how spread out data is
around the central tendency (mean, median, or mode).
🔹 If the data points are close together, dispersion is
low; if they are far apart, dispersion is high.
🔹 It helps us understand variability in a dataset.
Example:
•Dataset 1: [18, 19, 20, 21, 22] → Low dispersion
(values are close together).
•Dataset 2: [5, 10, 20, 35, 50] → High dispersion
(values are spread out).
DISPERSION
RANGE AND STANDARD DEVIATION
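A minimal sketch that computes the range and standard deviation for the two example datasets above. The population standard deviation (`statistics.pstdev`) is used here as an assumption; the slides do not state whether the population or sample form is intended. The helper name `value_range` is illustrative.

```python
from statistics import pstdev

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

dataset_1 = [18, 19, 20, 21, 22]  # low dispersion
dataset_2 = [5, 10, 20, 35, 50]   # high dispersion

for name, data in (("Dataset 1", dataset_1), ("Dataset 2", dataset_2)):
    # Assumption: population standard deviation (divide by n, not n - 1).
    print(name, "range =", value_range(data), "std dev =", round(pstdev(data), 2))
```

Dataset 2 has both the larger range and the larger standard deviation, matching the slide's description of it as more spread out.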
DISPERSION
QUARTILES AND INTERQUARTILE RANGE (IQR)
IQR = 13
1.5 × IQR = 1.5 × 13 = 19.5
lower_bound = Q1 - 1.5 × IQR = 16.5 - 19.5 = -3
upper_bound = Q3 + 1.5 × IQR = 29.5 + 19.5 = 49
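The bounds above follow from Q1 = 16.5 and Q3 = 29.5. A minimal sketch of the same arithmetic, with an outlier check added for illustration; the function names and the test values passed to `is_outlier` are hypothetical.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, lower, upper):
    """A value is flagged as an outlier if it falls outside the bounds."""
    return value < lower or value > upper

# Q1 and Q3 as given on the slide.
lower, upper = iqr_bounds(16.5, 29.5)
print(lower, upper)

# Hypothetical values, used only to illustrate the check.
print(is_outlier(20, lower, upper))
print(is_outlier(60, lower, upper))
```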
Five-point summary and Box Plots
5-POINT SUMMARY
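A five-point summary consists of the minimum, Q1, the median, Q3, and the maximum. The sketch below uses `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and libraries, so this choice is an assumption and may not match the convention used on the slides. The data is Dataset 2 from the dispersion example.

```python
from statistics import quantiles

def five_point_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different Q1/Q3.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

dataset_2 = [5, 10, 20, 35, 50]
summary = five_point_summary(dataset_2)
print(summary)
```

These five numbers are what a box plot draws: the box spans Q1 to Q3, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.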
Shape of Data

SKEWNESS
Mean < median: negative skew
Mean > median: positive skew
Peak on the right, tail extending to the left ➝ Left-skewed
Peak on the left, tail extending to the right ➝ Right-skewed
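The skewness formula is not visible in the slide text. The sketch below assumes one common definition, the third central moment divided by the cube of the population standard deviation, and compares its sign with the mean-versus-median rule. The datasets are the two from the dispersion example.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Moment coefficient of skewness: average cubed deviation divided by sigma cubed."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; skewness is undefined")
    third_moment = sum((v - mu) ** 3 for v in values) / len(values)
    return third_moment / sigma ** 3

dataset_1 = [18, 19, 20, 21, 22]  # symmetric around 20
dataset_2 = [5, 10, 20, 35, 50]   # mean 24 > median 20

print(skewness(dataset_1))  # zero for a symmetric dataset
print(skewness(dataset_2))  # positive: tail extends to the right
```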
Shape of Data
KURTOSIS
Shape of Data
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF
VARIATION
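The formulas for these two measures are not visible in the slide text. The sketch below assumes the usual definitions: mean absolute deviation as the average absolute distance from the mean, and coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The population standard deviation is an assumption, and the data is Dataset 1 from the dispersion example.

```python
from statistics import mean, pstdev

def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mu = mean(values)
    return sum(abs(v - mu) for v in values) / len(values)

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (assumes a non-zero mean)."""
    return pstdev(values) / mean(values) * 100

dataset_1 = [18, 19, 20, 21, 22]
mad = mean_absolute_deviation(dataset_1)
cv = coefficient_of_variation(dataset_1)
print("MAD:", mad)
print("CV (%):", round(cv, 2))
```

Because the coefficient of variation is unit-free, it can be used to compare the spread of datasets measured on different scales.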
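Special Univariate Plots - Stem-and-Leaf Plot
A Q-Q plot is a 2D scatter plot of univariate data that compares the
quantiles of the data with the quantiles of a reference distribution.
THE Q-Q PLOT IS USED AS A NORMALITY TEST. IF THE POINTS LIE CLOSE TO A STRAIGHT LINE, THEN THE
DISTRIBUTION IS APPROXIMATELY NORMAL.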
Summary
Univariate and Bivariate Data
1. Univariate Data:
Definition: Univariate data consists of observations on
a single variable. It focuses on analyzing only one
characteristic at a time.
Example:
Heights of students in a class: {150 cm, 160 cm, 165 cm, 170 cm,
175 cm}
Monthly sales of a store: {₹50,000, ₹60,000, ₹55,000, ₹70,000}
Analysis: Mean, median, mode, range, and standard
deviation can be used to summarize and analyze
univariate data.
2. Bivariate Data:
Definition: Bivariate data consists of observations on two variables and
focuses on finding relationships between them.
Example: Height vs. weight of students.
Analysis: Correlation, regression, and scatter plots are used to determine
relationships between the two variables.
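As an illustration of bivariate analysis, the sketch below computes the Pearson correlation coefficient between height and weight. The heights are the ones listed in the univariate example; the weights are hypothetical values invented for this sketch, since the slide's height-weight table does not survive in the text.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

heights_cm = [150, 160, 165, 170, 175]  # from the univariate example
weights_kg = [50, 56, 61, 66, 70]       # hypothetical values for illustration

r = pearson_correlation(heights_cm, weights_kg)
print(round(r, 3))  # close to +1: taller students tend to be heavier in this sample
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.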
