Unit-1-DEV

The document outlines a course on Data Exploration and Visualization, detailing objectives, activities, and examples of exploratory data analysis (EDA) across various fields such as sports, healthcare, and marketing. It emphasizes the importance of EDA in uncovering patterns, trends, and insights from data using both numerical and graphical methods. Additionally, it discusses different data types, measurement scales, and the contrast between exploratory and confirmatory data analysis.

Uploaded by

sdivya
Copyright
© All Rights Reserved

Data Exploration & Visualization

Unit-I

Dr. SELVA KUMAR S
ASSISTANT PROFESSOR
DEPT. OF CSE
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Prescribed Text Book

Sl. No.  Book Title                                      Authors                            Edition      Publisher  Year
1        Hands-On Exploratory Data Analysis with Python  Suresh Kumar Mukhiya, Usman Ahmed  1st Edition  Packt      2020
2        Fundamentals of Data Visualization              Claus O. Wilke                     1st Edition  O'Reilly   2019
At the end of the course the student will be able to:

CO1  Apply the computational approaches to perform Data Exploration and Visualization.

CO2  Analyse the different techniques to perform Data Exploration and Visualization for a given application.

CO3  Demonstrate exploratory data analysis on real data sets and provide interpretations through relevant visualization tools.
Instructions

Google Classroom code: mnjiy3o

Sl. No  Week         Activity
1       1st          Formation of groups. Note: Student groups of size 2 to 4
2       2nd and 3rd  Project topic selection by each group
3       4th          Presentation-1: Student and Project topic introduction by each group
4       5th          Data Acquisition and Data Preparation
5       6th and 7th  Presentation-2: Exploratory tools demonstration
6       8th and 9th  Presentation-3: Techniques applied on EDA
7       10th         Presentation-4: Visualization tools demonstration
8       11th         Complete Project Work Demonstration by each group
9       12th         Project Report Submission
Guess the topic?
Agenda
Exploratory Data Analysis
Introduction

Steps in EDA

Data Types
❖ Numerical Data
❖ Categorical Data

Measurement Scales
❖ Nominal
❖ Ordinal
❖ Interval
❖ Ratio

Comparing EDA with classical & Bayesian Analysis

Software tools for EDA


WHAT IS EDA?

• The analysis of datasets based on various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical).

The primary aim is to perform a quantitative and qualitative evaluation of the data to draw meaningful insights from it.
AIM OF THE EDA

• Maximize insight into a dataset


• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
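
Two of the aims above, maximizing insight and detecting outliers, can be sketched numerically. The block below is only an illustration: the `points` list is made-up data, the helper name `iqr_outliers` is hypothetical, and the 1.5 * IQR fence is one common convention that the slides do not prescribe.

```python
import statistics

# Hypothetical sample: points scored by one player over ten games.
points = [12, 15, 14, 13, 16, 15, 14, 48, 13, 15]

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Numerical summary: the mean is pulled up by the extreme game,
# while the median is not.
summary = {
    "mean": statistics.mean(points),
    "median": statistics.median(points),
    "stdev": statistics.stdev(points),
}
outliers = iqr_outliers(points)
print(summary)
print("outliers:", outliers)
```

Comparing the mean with the median is a quick first check: a large gap between them hints at skew or outliers worth plotting.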

Exploratory vs Confirmatory Data Analysis

EDA                                CDA
• No hypothesis at first           • Start with hypothesis
• Generate hypothesis              • Test the null hypothesis
• Uses graphical methods (mostly)  • Uses statistical models
STEPS OF EDA

Classification of EDA

EXAMPLE 1 – Professional Sports

Guess: how can EDA be used in professional sports?

EXAMPLE 1 – Professional Sports

1.Player Performance Analysis:


1. EDA can be used to analyze player statistics, such as scoring, rebounds,
assists, steals, and turnovers, to identify strengths and weaknesses.
2. Visualizations like heat maps and shot charts can help understand a player's
shooting patterns and effectiveness from different areas of the court.
2.Team Performance Analysis:
1. EDA can provide insights into team performance by examining key metrics
like points scored, turnovers, field goal percentages, and more.
2. Time-series analysis can help identify trends in team performance throughout
a season or across multiple seasons.
3.Opponent Analysis:
1. EDA can be used to study opponent teams' historical data, helping teams
prepare game strategies.
2. It can reveal opponent strengths and weaknesses, player tendencies, and
optimal defensive strategies.


4.Player Health and Injury Prevention:


1. By analyzing player health data (e.g., player load, heart rate, and injury history),
teams can identify patterns that may lead to injuries.
2. EDA can help in developing training and recovery strategies to reduce injury risks.
5.Player Recruitment and Drafting:
1. EDA can assist in scouting potential new players for recruitment or drafting by
comparing their performance statistics to existing team needs.
2. It can help identify undervalued or underappreciated talent.
6.Game Strategy Optimization:
1. By analyzing historical game data, teams can identify effective offensive and
defensive strategies against different opponents.
2. EDA can reveal optimal lineup combinations, substitutions, and in-game decision-
making.
7.Fan Engagement and Marketing:
1. EDA can help sports organizations understand fan demographics, preferences, and
behaviors.
2. It can be used to tailor marketing strategies, ticket pricing, and fan engagement
activities.


8.In-Game Analytics:
1. Real-time EDA during games can provide coaches and analysts with immediate
insights.
2. Data on player performance, shot selection, and opponent behavior can be analyzed
to make in-game adjustments.
9.Performance Tracking Wearables:
1. Many athletes wear devices that track their movements and physiological data. EDA
can be used to interpret this data and make real-time performance assessments.
10.Revenue Optimization:
1. EDA can be used to analyze revenue sources, such as ticket sales, merchandise, and
broadcasting deals, to optimize revenue streams.
11.Player Development:
1. Coaches and trainers can use EDA to track player development over time and make
adjustments to training and practice routines.
12.Fantasy Sports and Betting:
1. EDA is also used by analysts, enthusiasts, and sports gamblers to gain insights for
fantasy sports and betting purposes.

EXAMPLE 2 - Healthcare

Guess: how can EDA be used in the healthcare domain?

EXAMPLE 2 - Healthcare

1.Patient Data Analysis:


1. EDA can be used to analyze patient medical records and electronic health records
(EHRs) to identify trends and patterns in patient health.
2. It can help identify risk factors, comorbidities, and treatment effectiveness.
2.Disease Surveillance:
1. EDA is valuable for monitoring the spread of diseases and outbreaks by analyzing
data such as the number of cases, geographic distribution, and demographic
information.
2. It aids in early detection and timely response to public health emergencies.
3.Clinical Trials:
1. EDA can be used to examine clinical trial data to assess the safety and efficacy of
new treatments.
2. It helps identify patient subgroups that may benefit more from specific treatments.
4.Drug Safety Analysis:
1. EDA can be applied to pharmacovigilance data to detect adverse drug reactions
and ensure medication safety.
2. It helps in making informed decisions about drug approvals and withdrawals.


5.Quality of Care Assessment:


1. Healthcare facilities can use EDA to assess the quality of care by analyzing patient
outcomes, readmission rates, and adherence to clinical guidelines.
2. It aids in identifying areas for improvement.
6.Resource Allocation:
1. EDA can help healthcare organizations optimize resource allocation, such as the
allocation of healthcare staff, medical equipment, and hospital beds.
2. It ensures efficient use of resources and cost savings.
7.Patient Flow and Wait Times:
1. EDA can be used to analyze patient flow within hospitals and clinics, helping to
reduce wait times and improve patient satisfaction.
8.Predictive Modeling:
1. EDA is often a precursor to building predictive models, which can forecast disease
trends, patient readmissions, and resource demands.
9.Chronic Disease Management:
1. EDA helps identify at-risk patient populations for chronic diseases and develop
personalized care plans.
2. It enables the early detection of complications and proactive intervention.

10.Telemedicine and Remote Monitoring:


1. EDA can be applied to data collected from remote patient
monitoring devices, supporting telemedicine and remote care
initiatives.
11.Patient Engagement:
1. By analyzing patient feedback and satisfaction surveys,
healthcare organizations can improve patient engagement
and the overall healthcare experience.
12.Public Health Policy:
1. EDA assists policymakers in making data-informed decisions
on issues like vaccination campaigns, public health initiatives,
and health regulations.
13.Health Insurance:
1. Health insurance providers use EDA to assess risks, set
premiums, and design healthcare plans.
EXAMPLE 3 - Marketing

Any Guess??

EXAMPLE 3 - Marketing

1.Customer Segmentation:
1. EDA can identify customer segments based on demographics, behavior, and purchase
history.
2. It helps in tailoring marketing strategies to specific customer groups.
2.Product Analysis:
1. EDA can help analyze product performance, identifying top-selling products,
underperforming items, and opportunities for product development.
3.Pricing Strategies:
1. Analyzing price elasticity and consumer demand through EDA can help optimize pricing
strategies.
4.Market Basket Analysis:
1. EDA can uncover patterns of products that are frequently purchased together, aiding in
cross-selling and recommendation systems.
5.Customer Churn Analysis:
1. EDA can identify factors contributing to customer churn and assist in designing retention
strategies.
6.Campaign Effectiveness:
1. Analyzing marketing campaign data helps determine the effectiveness of various channels,
messages, and timing.
2. EDA can reveal which campaigns generate the highest ROI.
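
Market basket analysis (item 4 above) can be explored with nothing more than pair counting. The sketch below assumes a tiny made-up list of `baskets`; real analyses usually go on to association rule mining, which is not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = share of baskets that contain both items of the pair.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, support[top_pair])
```

Pairs with high support are candidates for cross-selling or for a "frequently bought together" recommendation.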
EXAMPLE 4*

New cancer cases in the U.S. based on a cancer registry

• The rows in the registry are called observations; they correspond to individuals.
• The columns are variables or data fields; they correspond to attributes of the individuals.

* Source: https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
Making Sense of Data
Quantitative/Numerical Data

• Quantitative data is data that can be counted or measured in numerical values.
• The two main types of quantitative data are discrete data and continuous data.
Discrete Data

• Discrete data is information that can only take certain fixed values.

• Data that is countable and its values can be listed out.

• Example: The number of players in a team, No. of employees, etc.


Continuous Data

• A variable that can take an infinite number of numerical values within a specific range.

• Example: Website traffic, Water Temperature, Wind Speed, etc.


Continuous Data

• Continuous data can be further broken down into two categories: interval data and ratio data.

• Interval data is numerical data with equal spacing between values but no true zero; zero is only an arbitrary reference point (e.g., 0 °C).

• Ratio data uses absolute zero as a reference point for measurement, so ratios between values are meaningful.
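
The practical difference between interval and ratio data shows up when dividing values. The sketch below uses made-up temperatures; the helper name `celsius_to_kelvin` is introduced here for illustration only.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

morning_c, noon_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = noon_c - morning_c

# The naive Celsius ratio suggests "twice as hot", which is misleading
# because 0 °C is an arbitrary reference, not the absence of temperature.
naive_ratio = noon_c / morning_c

# On the Kelvin scale zero is absolute, so the ratio is meaningful.
true_ratio = celsius_to_kelvin(noon_c) / celsius_to_kelvin(morning_c)

print(difference, naive_ratio, round(true_ratio, 3))
```

The Kelvin ratio is only slightly above 1, which is why "20 °C is twice as warm as 10 °C" is not a valid statement.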


Qualitative or Categorical Data

• Categorical data is divided into groups or categories.

• The categories are based on qualitative characteristics.

• Nominal categories have no inherent order; ordinal categories have a meaningful order.

• Categorical data can take numerical values, but those numbers don’t have any mathematical meaning.

• Categorical data is displayed graphically by bar charts and pie charts.
Categorical data Example

Two-way Table
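
A two-way table counts how often each combination of two categorical variables occurs. The sketch below builds one from made-up survey `records`; the variable names and values are assumptions for illustration.

```python
from collections import Counter

# Hypothetical survey records: (gender, owns_pet).
records = [
    ("female", "yes"), ("male", "no"), ("female", "no"),
    ("male", "yes"), ("female", "yes"), ("male", "no"),
]

# Count each (row category, column category) combination.
counts = Counter(records)
rows = sorted({g for g, _ in records})
cols = sorted({p for _, p in records})

# Print the table with row labels and one column per pet answer.
print(f"{'':8}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:8}" + "".join(f"{counts[(r, c)]:>6}" for c in cols))
```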
Nominal Data & Ordinal Data

• Unordered categorical data (nominal)


2 possible values (binary )
Examples: gender, alive/dead, yes/no.
Greater than 2 possible values - No order to categories
Examples: marital status, religion, country of birth, race.

• Ordered categorical data (ordinal)


Ratings or preferences
Cancer stage
Quality of life scales,
Severity of a software bug (critical, high, medium, low)
Experience working with us (very great, great, bad)
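
Ordinal data needs its order stated explicitly, because the labels themselves do not carry it. The sketch below uses the bug-severity levels from the list above with made-up bug reports; the `rank` mapping is an assumed encoding, not a standard one.

```python
# Severity levels from the slide, listed from lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]
rank = {level: i for i, level in enumerate(SEVERITY_ORDER)}

# Hypothetical bug reports.
bugs = [("login crash", "critical"), ("typo", "low"),
        ("slow search", "medium"), ("data loss", "high")]

# Alphabetical order (critical, high, low, medium) does not match the
# severity order, so sort by the explicit rank instead.
by_severity = sorted(bugs, key=lambda b: rank[b[1]], reverse=True)
most_severe = by_severity[0][0]
print([name for name, _ in by_severity])
```

The ranks only encode order; the gap between "low" and "medium" is not claimed to equal the gap between "high" and "critical".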

Problem-1

Given the following questions from the survey, state what type of variable each question deals with.

Question
1. How old are you?
2. Where do you live? Give the name of your city
3. How many siblings do you have?
4. What is your height?
5. What is your birth date?
6. Do you have a pet?
7. What grade level are you in?

Problem-1 : Solution

Question                                           Variable Type
1. How old are you?                                Quantitative
2. Where do you live? Give the name of your city   Categorical
3. How many siblings do you have?                  Quantitative
   It can also be categorical if put into groups (group people who have 1 sibling, then those that have 2 and so on, or those who have fewer than 3, those who have more than 3, etc.).
4. What is your height?                            Quantitative
5. What is your birth date?                        Categorical
6. Do you have a pet?                              Categorical
7. What grade level are you in?                    Categorical
Problem-2

Consider a dataset containing information about a group of students. Determine whether each of the following attributes is numerical or categorical:
1.Age
2.Gender
3.Student ID
4.Test Scores
5.Favorite Color
6.ZIP Code
7.Number of Siblings
8.Country of Birth
9.Student's Email Address
10.Height (in inches)
Problem-2 - Solution

1.Age: Numerical. Age can be measured on a continuous scale.
2.Gender: Categorical. Gender typically has two or more categories.
3.Student ID: Categorical. Student ID is typically an identifier or label, and it doesn't have inherent mathematical meaning.
4.Test Scores: Numerical. Test scores are quantitative data that can be measured and analyzed mathematically.
5.Favorite Color: Categorical.
6.ZIP Code: Categorical. ZIP codes are typically used to represent geographic regions.
7.Number of Siblings: Numerical. The number of siblings is a numerical variable that represents a count of discrete entities.
8.Country of Birth: Categorical. Country of birth is a categorical attribute with different countries as categories.
9.Student's Email Address: Categorical. Email addresses are text-based identifiers and do not have numerical values.
10.Height (in inches): Numerical. Height is a quantitative attribute, and when expressed in inches, it can be measured as a numerical value.
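
The Student ID and ZIP Code answers matter in practice: values made of digits are easy to load as numbers by mistake. The sketch below uses made-up `students` records to show one way of keeping such attributes categorical; the field names are assumptions.

```python
import statistics

# Hypothetical student records.
students = [
    {"student_id": "00123", "zip": "02139", "age": 19, "siblings": 2},
    {"student_id": "00456", "zip": "10001", "age": 21, "siblings": 0},
    {"student_id": "00789", "zip": "02139", "age": 20, "siblings": 1},
]

# Numerical attributes: arithmetic is meaningful.
mean_age = statistics.mean(s["age"] for s in students)

# Categorical attributes stored as strings: leading zeros survive and
# the meaningful summary is a count or the most frequent value.
zip_mode = statistics.mode(s["zip"] for s in students)

# Converting a ZIP code to int silently loses the leading zero.
lossy = int("02139")

print(mean_age, zip_mode, lossy)
```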

Exercise-1

Exercise-2
Measurement Scales

1.Nominal Scale:
1. Represents categorical data without any inherent order or ranking.
2. Examples: Gender, colors, categories.
3. Differences: Different categories are distinct but not ordered. No
quantitative relationship exists between categories.
2.Ordinal Scale:
1. Represents data with ordered categories or ranks but without
precise differences between them.
2. Examples: Rankings (1st, 2nd, 3rd), Likert scales (Strongly Disagree,
Disagree, Neutral, Agree, Strongly Agree).
3. Differences: Ordered categories, but the exact difference between
ranks may not be defined.
Measurement Scales

3.Interval Scale:
1. Represents data with ordered categories and precise, equal
intervals between them.
2. Examples: Temperature in Celsius or Fahrenheit, calendar dates.
3. Differences: Equal intervals between points on the scale, but there's
no true zero point (zero doesn't indicate the absence of the
attribute being measured).
4.Ratio Scale:
1. Represents data with ordered categories, equal intervals, and a true
zero point.
2. Examples: Height, weight, time, income.
3. Differences: Possesses all the properties of interval scale but also
has a true zero, enabling meaningful ratios and arithmetic
operations.
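
The four scales differ in which summaries are meaningful. The sketch below pairs each scale with one commonly recommended statistic, using made-up responses; the pairing is a rule of thumb, not an exhaustive list.

```python
import statistics

# Hypothetical responses, one list per measurement scale.
favorite_color = ["red", "blue", "red", "green"]   # nominal
satisfaction = [1, 3, 4, 4, 5]                     # ordinal (1 to 5 rating)
temperature_c = [18.0, 20.0, 25.0]                 # interval
study_hours = [0.0, 2.0, 5.0]                      # ratio

# Nominal: only counting is meaningful, so report the mode.
color_mode = statistics.mode(favorite_color)

# Ordinal: order is meaningful, so the median is a safe summary.
satisfaction_median = statistics.median(satisfaction)

# Interval: differences are meaningful, so the mean is defined.
temperature_mean = statistics.mean(temperature_c)

# Ratio: a true zero makes statements like "2.5 times as long" valid.
hours_ratio = study_hours[2] / study_hours[1]

print(color_mode, satisfaction_median, temperature_mean, hours_ratio)
```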
Example: Identify the measurement scale?

1.What is your favorite color?

2.Which country are you from?

3.Rate your satisfaction with our service on a scale from 1 to 5.

4.Rank the following in order of preference: Action, Comedy, Drama.

5.What is the temperature today in Celsius?

6.On a scale of 1 to 10, how happy are you today?

7.How many hours did you study for the exam?

8.What is your annual income?


Example

1.Question: What is your favorite color?


Solution: The answers (e.g., "Red," "Blue," "Green") represent categories without an inherent order, indicating a
nominal scale.
2.Question: Which country are you from?
Solution: Responses like "USA," "France," "Japan" represent categories without a specific order, indicating a nominal
scale.
3.Question: Rate your satisfaction with our service on a scale from 1 to 5.
Solution: Responses with ordered categories (e.g., "1 - Very Dissatisfied," "5 - Very Satisfied") without precise intervals
represent an ordinal scale.
4.Question: Rank the following in order of preference: Action, Comedy, Drama.
Solution: Responses like "1st - Comedy," "2nd - Action," "3rd - Drama" represent ordered categories without
quantifiable differences, indicating an ordinal scale.
5.Question: What is the temperature today in Celsius?
Solution: Answers like "20°C," "25°C," represent ordered categories with equal intervals but lack a true zero, indicating
an interval scale.
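6.Question: On a scale of 1 to 10, how happy are you today?
Solution: Responses on a numerical scale from 1 to 10 are ordered, but the gap between adjacent ratings is not guaranteed to be equal, so strictly this is an ordinal scale, like the 1 to 5 rating in question 3. In practice such ratings are often treated as interval data.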
7.Question: How many hours did you study for the exam?
Solution: Answers like "0 hours," "2 hours," "5 hours" involve ordered categories with equal intervals and a true zero,
indicating a ratio scale.
8.Question: What is your annual income?
Solution: Responses are numerical values (e.g., $30,000, $50,000) with equal intervals and a true zero point, indicating a ratio scale.
Comparing EDA with classical and Bayesian analysis
Additional Resources
Classification of Digital Data
Digital data is classified into the following categories:

Structured data - This is the data which is in an organized form (e.g., rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.

Semi-structured data - This is the data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program. Examples: emails, XML, and markup languages like HTML.

Unstructured data - This is the data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80%-90% of an organization's data is in this format. Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, etc.
Approximate Distribution of Digital Data

Approximate percentage distribution of digital data


Structured Data

This is the data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.

In structured data, every row in a table has the same set of columns.

Data stored in databases is an example of structured data.
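
The "same set of columns in every row" property can be checked directly. The sketch below parses a small made-up CSV table from a string; in practice the rows would come from a file or a database query.

```python
import csv
import io

# Hypothetical table; the column names and values are illustrative.
raw = """student_id,name,age
S001,Asha,19
S002,Ravi,21
S003,Meena,20
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# Every row exposes the same set of columns, which is what makes the
# data structured in the sense used above.
columns = reader.fieldnames
same_schema = all(set(row) == set(columns) for row in rows)

# A fixed schema makes queries such as "who is the oldest?" simple.
oldest = max(rows, key=lambda r: int(r["age"]))
print(columns, same_schema, oldest["name"])
```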


Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP Systems
Ease with Structured Data

• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction Processing (ACID properties)
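
Several of these conveniences (insert/update, indexing, and atomic transactions) can be tried with Python's built-in `sqlite3` module. The sketch below is illustrative only: the `accounts` table, its rows, and the `transfer` helper are assumptions, and SQLite stands in for the larger database systems named earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
)
# Indexing / searching: an index speeds up lookups by owner.
conn.execute("CREATE INDEX idx_owner ON accounts (owner)")
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("asha", 100), ("ravi", 50)],
)
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE owner = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
            (amount, dst),
        )

transfer(conn, "asha", "ravi", 30)
try:
    transfer(conn, "asha", "ravi", 500)  # fails and is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)
```

The failed transfer leaves both balances untouched, which is the atomicity part of ACID.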
Semi-structured Data

This is the data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program.

Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
Sources of Semi-structured Data

• XML (Extensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data

• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
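
These characteristics are easy to see in JSON. The sketch below parses a made-up list of records whose attributes differ from object to object; the field names are assumptions for illustration.

```python
import json

# Hypothetical records: each object describes itself with label/value
# pairs, and the objects do not share one fixed set of attributes.
raw = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["alumni"]},
  {"name": "Meena"}
]"""

records = json.loads(raw)

# Discover the attributes actually present, since no schema announces them.
all_keys = sorted({key for record in records for key in record})

# Read optional attributes defensively with a default value.
emails = [record.get("email", "unknown") for record in records]
print(all_keys, emails)
```

Profiling which keys occur, and how often, is usually the first exploratory step before flattening such data into a table.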
Unstructured Data

This is the data which does not conform to a data model or is not in a form which can be used easily by a computer program.

About 80–90% of an organization's data is in this format.

Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, body of an email, etc.
Sources of Unstructured Data

• Web Pages
• Images
• Free-Form Text
• Audios
• Videos
• Body of Email
• Text Messages
• Chats
• Social Media data
• Word Document
Issues with terminology – Unstructured Data

• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
Dealing with Unstructured Data

• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics

Dealing with Unstructured Data

▪Data Mining
•Association Rule Mining
•Regression Analysis
•Collaborative Filtering

▪Text analysis and Text Mining

▪Natural Language Processing(NLP)

▪Noisy text Analysis

▪Manual tagging with metadata

▪Part-of-speech tagging

▪Unstructured Information Management Architecture(UIMA)
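
A very small instance of text analysis from the list above is counting keywords in free-form text. The sketch below is deliberately simple: the `text` is made up, the tokenizer is a basic regular expression, and `STOP_WORDS` is a tiny illustrative list, not a standard one.

```python
import re
from collections import Counter

# Hypothetical free-form text, e.g. the body of a customer email.
text = """The delivery was late. The package was damaged,
and the support team was slow to respond. Late again!"""

# Tokenize: lowercase the text and keep runs of letters only.
tokens = re.findall(r"[a-z]+", text.lower())

# Drop a tiny, illustrative stop-word list.
STOP_WORDS = {"the", "was", "and", "to"}
keywords = [t for t in tokens if t not in STOP_WORDS]

# Word frequencies impose a simple structure on unstructured text.
frequencies = Counter(keywords)
print(frequencies.most_common(3))
```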


Answer a few quick questions …
Answer Me

Which category (structured, semi-structured, or unstructured) will you place a Web Page in?

Which category (structured, semi-structured, or unstructured) will you place a Word Document in?

State a few examples of human generated and machine-generated data.

