Session1-DataCharacteristics

The document provides an overview of machine learning, its processes, and its applications in various fields such as healthcare and finance. It discusses the importance of data analytics, including descriptive, diagnostic, predictive, and prescriptive analytics, as well as exploratory data analysis (EDA) techniques. Additionally, it highlights the differences between structured, unstructured, and semi-structured data, and emphasizes the role of data processing in deriving insights from raw data.


Demystifying Machine Learning

Dr. J.V. Benifa


Associate Dean,
Faculty of AI and Data Science & Lead- BigML Labs,
Indian Institute of Information Technology Kottayam
(Under Ministry of Education, Govt. of India)
bigml@iiitkottayam
Artificial Intelligence
• The need for intelligence in real-world applications stems from
the complex and challenging nature of problems that humans
encounter in various domains.
• Machine learning is used as a foundational technology in the
development of AI systems.
• AI systems must recognize and interpret complex patterns, understand natural
language, detect objects or faces in images, make recommendations, and more.

➢Handling Complexity
➢Decision-Making and Optimization
➢Adaptability to Changing Environments
➢Automation and Efficiency
What is Machine Learning?
• Machine learning is a set of methods that can automatically
detect patterns in data.
• These uncovered patterns are then used to predict future
data, or to perform other kinds of decision-making under
uncertainty.
• The key premise is learning from data!!
• Addresses the problem of analyzing huge bodies of data so
that they can be understood.
• Providing techniques to automate the analysis and
exploration of large, complex data sets.
• Tools, methodologies, and theories for revealing patterns in
data – critical step in knowledge discovery.
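The premise of learning patterns from data can be sketched with a toy example: fit a line to noisy observations, then use the learned parameters to predict unseen inputs. NumPy's `polyfit` stands in here for the learning algorithm; the data is invented.

```python
import numpy as np

# Toy data: outputs follow y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

# "Learning" is fitting parameters that explain the data:
# np.polyfit finds the slope and intercept minimising squared error.
slope, intercept = np.polyfit(x, y, 1)

# The uncovered pattern can now predict future data.
prediction = slope * 12 + intercept
print(round(slope, 1), round(intercept, 1))  # close to 2.0 and 1.0
```

The key point matches the bullets above: the program was never told the rule y = 2x + 1; it detected the pattern automatically from data.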
Machine Learning Process
Examples
• Machine learning plays a key role in many areas of science, finance and industry:
• Predict whether a patient, hospitalized due to a heart attack, will have a second
heart attack.
• The prediction is to be based on demographic, diet and clinical measurements
for that patient.
• Predict the price of a stock in 6 months from now, on the basis of company
performance measures and economic data.
• Identify the numbers in a handwritten ZIP code, from a digitized image.
• Estimate the amount of glucose in the blood of a diabetic person, from the
infrared absorption spectrum of that person’s blood.
• Identify the risk factors for prostate cancer, based on clinical and demographic
variables.
The Modelling Process

1. Define Business Problem
2. Define Hypotheses
3. Collect Data
4. Analyze Data
5. Develop Predictive Model
6. Optimize Model
7. Determine Best Fit
8. Utilize Model/Score New Data
9. Monitor Model
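As a rough illustration, a few of these steps (collect, analyze, model, score) can be sketched in plain Python; every value, threshold, and the "running mean" model are invented purely for illustration, not a prescribed method.

```python
# Hypothetical walk through part of the modelling process.
history = [4.2, 5.1, 3.8, 4.9, 5.0]              # collect data
cleaned = [v for v in history if v is not None]  # analyze/clean data

# Develop a (trivially simple) predictive model: the running mean.
model_mean = sum(cleaned) / len(cleaned)

# Score new data: flag readings far from the learned mean
# (the 2.0 threshold is an arbitrary illustrative choice).
new_reading = 9.7
is_anomaly = abs(new_reading - model_mean) > 2.0
print(round(model_mean, 2), is_anomaly)  # 4.6 True
```

Real pipelines add the remaining steps: optimizing the model, comparing candidates for best fit, and monitoring the deployed model over time.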
Background of Learning Process

Data Science & AI Research Group, IIIT Kottayam


Machine Learning Vs Deep Learning
Data and Exploratory Analysis

Dr. J.V. Bibal Benifa


Associate Dean,
Faculty of AI and Data Science & Lead- BigML Labs,
Indian Institute of Information Technology Kottayam
(Under Ministry of Education, Govt. of India)

bigml@iiitkottayam
Contents

1. Introduction to Data

2. Data Analytics

3. Exploratory Data Analysis


DATA
• Data refers to distinct pieces of information, typically formatted in a
specific way, that can be measured, collected, reported, and
analyzed.
• It encompasses various forms such as numbers, text, sound,
images, or any other format.
• Data can be generated by humans, machines, or a combination of
both, and it can be stored in structured or unstructured formats.
Data vs Information
Types of Data – Based on nature of data
Types of Data – Based on how data is organized
Structured Data
Structured data is organized and designed in a specific way to make it easily
readable and understandable by both humans and machines.
Advantages of Structured Data
• Easy to understand and use
• Consistency
• Efficient storage and retrieval
Disadvantages of Structured Data
• Inflexibility
• Limited complexity
• Limited context
Examples:
• Customer names and contact information
• Transaction records in finance
• Patient records and diagnostic reports in healthcare
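For instance, structured data maps naturally onto a table with named, consistently typed columns. A hypothetical customer table in pandas (all names and values invented):

```python
import pandas as pd

# Structured data fits a fixed schema: named columns, consistent types.
customers = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "email": ["asha@example.com", "ravi@example.com"],
    "last_purchase": [1200.50, 89.99],
})

# The schema makes storage and retrieval straightforward:
big_spenders = customers[customers["last_purchase"] > 100]
print(big_spenders["name"].tolist())  # ['Asha']
```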
Unstructured Data
• Unstructured data refers to information that does not have a predefined data
model or structure, making it challenging to collect, process, and analyze
using traditional data management tools.
Advantages of Unstructured Data:
• The data is not constrained by a fixed schema
• Very flexible due to the absence of a schema.
• Data is portable
• It is very scalable.
Disadvantages of Unstructured Data :
• It is difficult to store and manage unstructured data due to lack of schema and
structure.
• Indexing the data is difficult and error-prone due to unclear structure.
• Ensuring the security of data is a difficult task.
Semi-structured Data
• Semi-structured data is a type of data that is not purely structured, but also
not completely unstructured. It contains some level of organization or
structure, but does not conform to a rigid schema or data model, and may
contain elements that are not easily categorized or classified.
Advantages of Semi-structured Data:
• The data is not constrained by a fixed schema
• Flexible i.e. Schema can be easily changed.
• Data is portable
• It is possible to view structured data as semi-structured data.
Disadvantages of Semi-structured data:
• Lack of a fixed, rigid schema makes the data difficult to store.
• Interpreting relationships between data elements is difficult.
• Queries are less efficient as compared to structured data.
• Complexity
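A common example of semi-structured data is JSON: records share key-value organization, but fields may vary per record and need not follow one rigid schema (the records below are invented):

```python
import json

# Two records with some shared organization but differing fields.
records = [
    '{"name": "Asha", "email": "asha@example.com"}',
    '{"name": "Ravi", "phone": "+91-000", "tags": ["vip"]}',
]
parsed = [json.loads(r) for r in records]

# Queries must tolerate missing fields -- the flexibility and the
# cost of semi-structured data in one line:
emails = [r.get("email", "unknown") for r in parsed]
print(emails)  # ['asha@example.com', 'unknown']
```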
Lifecycle of Data:
Data analytics

• Data analytics is the collection, transformation, and organization of
data in order to draw conclusions, make predictions, and drive
informed decision-making.
• Data analytics is a multidisciplinary field that employs a wide range of
analysis techniques, including math, statistics, and computer science,
to draw insights from data sets.
• Data analytics is a broad term that includes everything from simply
analyzing data to theorizing ways of collecting data and creating the
frameworks needed to store it.
Data Analytics using AI
1. Descriptive Analytics
Descriptive analytics focuses on
summarizing historical data to
identify trends and patterns.

2. Diagnostic Analytics
Diagnostic analytics is used to
identify the root causes or reasons
behind a trend or anomaly
observed in the data.
Data Analytics using AI
3. Predictive Analytics
Predictive analytics helps forecast future events based on historical
data and trends

4. Prescriptive Analytics
Prescriptive analytics suggests actions to take based on the data
analysis and predictions, providing recommendations on the best
course of action.
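The four types can be contrasted on a toy sales series. The "prediction" here is a naive linear extrapolation and the "prescription" a hard-coded rule, both purely for illustration:

```python
# Toy monthly sales figures.
sales = [100, 110, 120, 130]

# Descriptive: summarize what happened historically.
average = sum(sales) / len(sales)            # 115.0

# Predictive: forecast the next value from the historical trend
# (naive linear extrapolation; real forecasting is far richer).
step = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + step                  # 140.0

# Prescriptive: recommend an action based on the prediction.
action = "increase stock" if forecast > average else "hold"
print(average, forecast, action)
```

Diagnostic analytics would sit between the first two steps, asking *why* sales rose each month.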
Exploratory Data Analysis:
• Exploratory Data Analysis (EDA) is the method of studying and
exploring data sets to understand their key characteristics:
analyzing and visualizing data to uncover patterns, locate
outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before
undertaking more formal statistical analyses or modeling.
Key aspects of EDA:
• Distribution of Data: Examining the distribution of data points to understand their
range, central tendencies (mean, median), and dispersion (variance, standard
deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter
plots, and bar charts to visualize relationships within the data and distributions of
variables.
• Outlier Detection: Identifying unusual values that deviate from other data points.
Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.
• Correlation Analysis: Checking the relationships between variables to understand
how they might affect each other. This includes computing correlation coefficients
and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data
points, whether by imputation or removal, depending on their impact and the
amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends.
• Testing Assumptions: Many statistical tests and models assume the data meet
certain conditions. EDA helps verify these assumptions.
Python libraries for EDA

• Pandas: Provides extensive functions for data manipulation and
analysis, including data structure handling and time series
functionality.
• Matplotlib: A plotting library for creating static, interactive, and
animated visualizations in Python.
• Seaborn: Built on top of Matplotlib, it provides a high-level interface
for drawing attractive and informative statistical graphics.
• Plotly: An interactive graphing library for making interactive plots; it
offers more sophisticated visualization capabilities.
Introduction to Data Processing Steps in EDA:
• Data processing is essential to derive insights and value from raw
data.
• Key stages:
• Data Collection
• Data Cleaning
• Data Transformation
• Data Integration
• Data Aggregation
• Data Visualization
Data Collection
• Understand the Problem and the Data
• Key Points:
• Identify the business goal or research question.
• Understand variables and their representation.
• Determine data types (numerical, categorical,
text).
• Address known data quality issues or domain-specific concerns.
Data Cleaning
Import and Inspect the Data
• Key Points:
• Load data into your analysis
environment (e.g., Python, R).
• Check data size (rows, columns)
and inspect for missing values.
• Identify data types and spot
inconsistencies or errors.
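A sketch of this inspection step with pandas, loading a small CSV from a string instead of a file (the columns are invented):

```python
import io
import pandas as pd

# A tiny CSV with one missing temperature value.
raw = io.StringIO("id,city,temp\n1,Kochi,31\n2,Delhi,\n3,Pune,28\n")
df = pd.read_csv(raw)

print(df.shape)                  # (3, 3): rows and columns
print(df.isna().sum()["temp"])   # 1 missing value to handle
print(df.dtypes["id"])           # spot unexpected types early
```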
Data Transformation
• Definition: Changing the format,
structure, or values of data for analysis.
• Objective: Convert data into usable
forms, like scaling or aggregating.
• Common Techniques: Normalization,
aggregation, creating new variables.
• Use Case: Creating a column for crime
description length.
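The description-length use case above might look like this in pandas (column names are illustrative):

```python
import pandas as pd

# Derive a new column: the length of each crime description.
crimes = pd.DataFrame({"description": ["theft", "vandalism reported"]})
crimes["desc_length"] = crimes["description"].str.len()
print(crimes["desc_length"].tolist())  # [5, 18]
```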
Data Integration
• Definition: Combining data from
different sources into one cohesive
dataset.
• Objective: Create unified datasets for
comprehensive analysis.
• Common Techniques: Merging datasets,
concatenating datasets.
• Use Case: Merging crime data with
weather data to analyze trends.
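A toy version of the crime-plus-weather merge, assuming both tables share a date key:

```python
import pandas as pd

crime = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"],
                      "incidents": [5, 3]})
weather = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"],
                        "rain_mm": [0.0, 12.5]})

# Inner join on the shared key yields one cohesive dataset.
merged = crime.merge(weather, on="date", how="inner")
print(merged.columns.tolist())  # ['date', 'incidents', 'rain_mm']
```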
Data Aggregation
• Definition: Summarizing or grouping
data into categories.
• Objective: Reduce datasets by
summarizing key statistics.
• Common Techniques: Grouping by
categories, calculating summary
statistics.
• Use Case: Aggregating crime data by
year.
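The by-year aggregation use case with a pandas `groupby`, on invented counts:

```python
import pandas as pd

df = pd.DataFrame({"year": [2023, 2023, 2024],
                   "incidents": [4, 6, 3]})

# Group by category and compute a summary statistic per group.
by_year = df.groupby("year")["incidents"].sum()
print(by_year.loc[2023], by_year.loc[2024])  # 10 3
```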
Data Filtering
• Definition: Selecting data subsets based on criteria.
• Objective: Focus analysis on relevant data points.
• Common Techniques: Filtering rows/columns based on conditions.
• Use Case: Filtering crime data for a specific year.
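Row filtering by condition in pandas, matching the use case above (toy data):

```python
import pandas as pd

df = pd.DataFrame({"year": [2023, 2024, 2024],
                   "type": ["theft", "fraud", "theft"]})

# Keep only the rows relevant to the analysis.
only_2024 = df[df["year"] == 2024]
print(len(only_2024))  # 2
```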
Data Reshaping
• Definition: Changing the layout of a dataset (pivoting, melting).
• Objective: Rearrange data for easier analysis or visualization.
• Common Techniques: Pivot tables, melting, transposing data.
• Use Case: Creating pivot tables for crimes by year.
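The crimes-by-year pivot table sketched with pandas `pivot_table`, reshaping long-format records into a wide table (toy data):

```python
import pandas as pd

long = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "type": ["theft", "fraud", "theft", "fraud"],
    "count": [4, 2, 3, 5],
})

# Pivot: crime types become rows, years become columns.
wide = long.pivot_table(index="type", columns="year", values="count")
print(wide.loc["theft", 2024])  # 3.0
```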
Data Visualization
• Definition: Graphical
representation of data to identify
trends, patterns, and outliers.
• Objective: Visually understand the
distribution of data, correlations,
and variable relationships.
• Common Techniques: Bar plots,
histograms, scatter plots, and
heatmaps.
• Use Case: Visualizing crime types
using bar charts or trends over
time.
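A bar chart of crime types with Matplotlib, rendered off-screen via the Agg backend so no display is needed (counts are invented):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt

# Bar chart of incidents per crime type.
types = ["theft", "fraud", "assault"]
counts = [40, 25, 10]

fig, ax = plt.subplots()
ax.bar(types, counts)
ax.set_xlabel("Crime type")
ax.set_ylabel("Incidents")
fig.savefig("crime_types.png")
```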
Other tools:
Use Case 1:
Netflix - Personalized Recommendations
• Challenge: Netflix needed to improve user experience and
engagement by providing personalized content recommendations.
• Solution: Netflix used data analytics to analyze users' viewing history,
preferences, and behavior patterns.
• Outcome: Enhanced recommendations using machine learning
algorithms, leading to higher user retention and engagement.
Increased time spent on the platform and more content
consumption by users.
• Impact: The recommendation system is a key factor in Netflix's
success, with over 80% of watched content driven by
personalized recommendations.
Use Case 2:
Walmart - Inventory Optimization
• Challenge: Walmart struggled with inventory management across
thousands of stores.
• Solution: By applying data analytics, Walmart analyzed sales data,
inventory levels, and supply chain data to predict demand.
• Outcome: Optimized inventory management and reduced stockouts.
Improved supply chain efficiency and reduced operational costs.
• Impact: Data analytics helped Walmart increase operational efficiency
and improve customer satisfaction by ensuring product availability.
Summary
• Data: Raw facts and figures that are collected and stored for analysis.
• Data Analysis: The process of inspecting, cleaning, transforming, and
modeling data to extract useful insights and support decision-making.
• Exploratory Data Analysis (EDA): The initial step of data analysis focused
on summarizing main characteristics of data, often using visual methods,
before applying formal modeling.
• EDA Tools: Common techniques in EDA include visualizations (histograms,
scatter plots) and statistical summaries (mean, median, standard
deviation).
• Purpose of Data Analysis: To uncover patterns, identify trends, test
hypotheses, and derive actionable insights for informed decision-making.