Data Science Lecture No 03

Uploaded by

abdul baqi

Lecture No. 03

SEN – 5th

Course: Data Science


Instructor: Dr. Maryum Nisar

12/15/2024 1
Data Science

Lecture Contents
 Steps in EDA

Steps in EDA
 Data analysis involves the following steps:
 Problem definition
 Data preparation
 Data analysis
 Development and representation of the results

Problem definition
 It is essential to define the business problem to be solved. The problem definition is the driving force behind the execution of a data analysis plan.
The main tasks involved are:
 Defining the main objective of the analysis
 Defining the main deliverables
 Outlining the main roles and responsibilities
 Obtaining the current status of the data
 Defining the timetable
 Performing a cost/benefit analysis
Data preparation
This step involves methods for preparing the dataset before the actual analysis.
In this step, we:
 Define the sources of data
 Define data schemas and tables
 Understand the main characteristics of the data
 Clean the dataset
 Delete non-relevant datasets
 Transform the data
 Divide the data into the required chunks for analysis
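The cleaning, transforming, and chunking tasks listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names, and the derived `bmi` column are hypothetical and do not come from the lecture.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"id": 1, "weight_kg": 70.5, "height_cm": 175.0},
    {"id": 2, "weight_kg": None, "height_cm": 160.0},
    {"id": 3, "weight_kg": 82.0, "height_cm": 180.0},
    {"id": 4, "weight_kg": 55.0, "height_cm": 158.0},
    {"id": 5, "weight_kg": 64.0, "height_cm": 169.0},
]


def clean(records):
    """Clean the dataset: drop records that contain any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transform the data: add a derived column (body mass index, kg / m^2)."""
    out = []
    for r in records:
        height_m = r["height_cm"] / 100
        out.append({**r, "bmi": round(r["weight_kg"] / height_m ** 2, 1)})
    return out


def chunk(records, size):
    """Divide the records into consecutive chunks of at most `size` rows."""
    return [records[i:i + size] for i in range(0, len(records), size)]


prepared = transform(clean(raw_records))
chunks = chunk(prepared, 2)
```

In practice the same steps are usually done with a dataframe library; plain lists of dictionaries are used here only to keep the sketch dependency-free.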
Data analysis
This is one of the most crucial steps. It deals with descriptive statistics and analysis of the data.
The main tasks involve:
 Summarizing the data
 Finding hidden correlations and relationships in the data
 Developing predictive models
 Evaluating the models
 Calculating the accuracies
Some of the techniques used for data summarization are:
 Summary tables
 Graphs
 Descriptive statistics
 Inferential statistics
 Correlation statistics
 Searching
 Grouping
 Mathematical models
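Two of the summarization techniques listed above, descriptive statistics and grouping, can be sketched with Python's standard `statistics` module. The observations and group labels below are hypothetical and serve only as an illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical observations: (group label, measured value).
observations = [("A", 10.0), ("A", 12.0), ("B", 20.0), ("B", 22.0), ("B", 24.0)]
values = [value for _, value in observations]

# Descriptive statistics for the whole dataset.
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
}

# Grouping: collect the values per label, then summarize each group.
groups = defaultdict(list)
for label, value in observations:
    groups[label].append(value)
group_means = {label: statistics.mean(vals) for label, vals in groups.items()}
```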
Development and representation of the results
This step involves:
 Presenting the dataset to the target audience in the form of:
o Graphs
o Summary tables
o Maps and diagrams
 This is also an essential step, because the results of the analysis should be interpretable by the business stakeholders, which is one of the major goals of EDA.
 Common graphical analysis techniques include:
o Scatter plots
o Character plots
o Histograms
o Box plots
o Residual plots and mean plots
Making sense of data
Different disciplines store different kinds of data for different purposes.
 For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate industries store house and building datasets.
 A dataset contains many observations about a particular object.
 For instance, a dataset about patients in a hospital can contain many observations.
o A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable.

 These datasets are stored in hospitals and are presented for analysis. Most of this data
is stored in some sort of database management system in tables/schema. An example
of a table for storing patient information is shown here:
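As a rough sketch of such a table, each row can be modelled as one observation and each column as one variable. The field names follow the patient description given earlier; the two rows are hypothetical placeholder values, not data from the lecture.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Patient:
    """One observation (row); each field is one variable (column)."""
    patient_id: int
    name: str
    address: str
    weight_kg: float
    date_of_birth: date
    email: str
    gender: str


# Hypothetical rows, for illustration only.
patients = [
    Patient(1, "Patient One", "Address 1", 70.5, date(1990, 1, 15), "one@example.com", "Female"),
    Patient(2, "Patient Two", "Address 2", 82.0, date(1985, 6, 30), "two@example.com", "Male"),
]

# The variables of the dataset are the column names of the table.
variables = [f.name for f in fields(Patient)]
```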

Data Types
Numerical data
 This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and number of family members.
 This data is often referred to as quantitative data in statistics.
 A numerical dataset can be either discrete or continuous.
 Discrete data:
o This is data that is countable and whose values can be listed out.
o For example, the number of heads in 200 coin flips can take any integer value from 0 to 200, a finite set of 201 cases.
o A variable that represents a discrete dataset is referred to as a discrete variable. A discrete variable takes a countable number of distinct values.
 Continuous data:
o Data that can take an infinite number of numerical values within a specific range is classified as continuous data.
o A variable describing continuous data is a continuous variable.
o For example, what is the temperature of your city today? Can the set of possible values be finite? No: between any two temperatures there are infinitely many others.
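The contrast between the two kinds of numerical data can be sketched as follows. The seed, the fair-coin probability of 0.5, and the temperature range are assumptions made only for this illustration; note also that a floating-point number merely approximates a continuous value.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Discrete: the number of heads in 200 flips is an integer in {0, 1, ..., 200}.
heads = sum(rng.random() < 0.5 for _ in range(200))
possible_head_counts = list(range(201))  # every possible value can be listed

# Continuous: a temperature reading can fall anywhere within a range,
# so its possible values cannot be enumerated.
temperature_c = rng.uniform(-10.0, 45.0)
```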
Data Types
Categorical data
 This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or movie category. This data is often referred to as qualitative data in statistics. Some of the most common types of categorical data are:
o Gender (Male, Female, Other, or Unknown)
o Marital status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
o Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, etc.)
o Blood type (A, B, AB, or O)
o Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)

 A variable describing categorical data is referred to as a categorical variable. These variables can take one of a limited number of values. Computer science students may find it easier to think of categorical values as enumerated types (enumerations). There are different types of categorical variables:
o A binary categorical variable can take exactly two values and is also referred to as a dichotomous variable. For example, when you run an experiment, the result is either success or failure. Hence, the result can be understood as a binary categorical variable.
o Polytomous variables are categorical variables that can take more than two possible values. For example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a polytomous variable.
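The enumerated-type view of categorical variables can be sketched with Python's `enum` module. The class names and the helper `kind` are hypothetical; the categories themselves are taken from the examples above.

```python
from enum import Enum


class Outcome(Enum):
    """Binary (dichotomous) variable: exactly two possible values."""
    SUCCESS = "success"
    FAILURE = "failure"


class BloodType(Enum):
    """Polytomous variable: more than two possible values."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


def kind(categories):
    """Classify a categorical variable by its number of possible values."""
    n = len(categories)
    if n == 2:
        return "binary (dichotomous)"
    if n > 2:
        return "polytomous"
    raise ValueError("a categorical variable needs at least two values")
```

For example, `kind(Outcome)` reports a binary variable, while `kind(BloodType)` reports a polytomous one.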
Measurement scales
There are four different types of measurement scales described in statistics:
 Nominal
 Ordinal
 Interval
 Ratio
Nominal
 Nominal scales are used for labeling variables without any quantitative value. The scale values are generally referred to as labels. They are mutually exclusive and do not carry any numerical importance.
 Let's see an example:
o What is your gender?
o Male
o Female
o Third gender/Non-binary
o I prefer not to answer
o Other
In the case of a nominal dataset, you can compute the following:
 Frequency is the rate at which a label occurs over a period of time within the dataset.
 Proportion is calculated by dividing the frequency by the total number of events.
 Percentage is calculated by multiplying each proportion by 100.
 To visualize a nominal dataset, you can use either a pie chart or a bar chart.
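The three quantities above can be computed directly with `collections.Counter`. This is a minimal sketch that treats frequency as a simple count of occurrences; the list of survey responses is hypothetical.

```python
from collections import Counter

# Hypothetical answers to a nominal survey question.
responses = [
    "Male", "Female", "Female", "Other",
    "Female", "Male", "I prefer not to answer", "Female",
]

frequency = Counter(responses)   # how often each label occurs
total = sum(frequency.values())  # total number of events
proportion = {label: count / total for label, count in frequency.items()}
percentage = {label: 100 * p for label, p in proportion.items()}
```

The `percentage` dictionary holds exactly the numbers that a pie chart or bar chart of this variable would display.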

Ordinal
The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a significant factor. As with nominal data, frequency is the rate at which a label occurs over a period of time within the dataset.
 Have you heard about the Likert scale? It uses a variation of an ordinal scale.
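An ordinal variable can be sketched as an enumeration whose members are ordered. The five-point labels below are a common form of the Likert scale, chosen here as an assumption, and the responses are hypothetical.

```python
import statistics
from enum import IntEnum


class Likert(IntEnum):
    """Five-point scale; the integers encode order, not measured quantities."""
    STRONGLY_DISAGREE = 1
    DISAGREE = 2
    NEUTRAL = 3
    AGREE = 4
    STRONGLY_AGREE = 5


# Hypothetical survey responses.
responses = [
    Likert.AGREE, Likert.NEUTRAL, Likert.STRONGLY_AGREE,
    Likert.DISAGREE, Likert.AGREE,
]

# Because the values are ordered, sorting and the median are meaningful.
ranked = sorted(responses)
median_response = Likert(statistics.median_low(ranked))
```

The median is used instead of the mean because the distance between two adjacent labels is not defined on an ordinal scale.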

Interval
 In interval scales, both the order and the exact differences between the values are significant. Interval scales are widely used in statistics, for example, in measures of central tendency (mean, median, and mode) and of dispersion (standard deviation).
 Examples include location in Cartesian coordinates and direction measured in degrees from magnetic north. The mean, median, and mode are allowed on interval data.
Ratio
 Ratio scales have order, exact values, and an absolute zero, which makes them usable in both descriptive and inferential statistics.
 These scales provide numerous possibilities for statistical analysis. Mathematical operations, measures of central tendency, measures of dispersion, and the coefficient of variation can all be computed on such scales.
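As an illustration of what the absolute zero allows, the sketch below computes a ratio between two values and the coefficient of variation (standard deviation divided by mean) for a ratio-scale variable. The weights are hypothetical.

```python
import statistics

# Hypothetical weights in kilograms; weight has an absolute zero (0 kg).
weights_kg = [50.0, 60.0, 75.0, 100.0]

mean_w = statistics.mean(weights_kg)    # measure of central tendency
stdev_w = statistics.stdev(weights_kg)  # measure of dispersion (sample)
cv = stdev_w / mean_w                   # coefficient of variation

# Ratios are meaningful: 100 kg is twice as heavy as 50 kg.
ratio = weights_kg[3] / weights_kg[0]
```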

Comparing EDA with classical and Bayesian analysis
Classical data analysis:
 In the classical data analysis approach, the problem definition and data collection steps are followed by model development, which is followed by analysis and result communication.
Exploratory data analysis approach:
 The EDA approach follows the same steps as classical data analysis, except that the model imposition and data analysis steps are swapped. The main focus is on the data, its structure, outliers, models, and visualizations.
 Generally, in EDA, we do not impose any deterministic or probabilistic models on the data.
Bayesian data analysis approach:
 The Bayesian approach incorporates prior probability distribution knowledge into the analysis steps.
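One simple way to see how prior knowledge enters the analysis is the Beta-Binomial update for a success probability. This particular model is not named in the lecture; it is chosen here only as an illustration, and the prior parameters and observed counts are hypothetical.

```python
def update_beta(prior_alpha, prior_beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, the posterior
    is Beta(alpha + successes, beta + failures).
    """
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief: success and failure are equally likely.
prior = (2.0, 2.0)

# Hypothetical data: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

prior_estimate = beta_mean(*prior)
posterior_estimate = beta_mean(*posterior)
```

The posterior estimate lies between the prior mean (0.5) and the observed success rate (0.7), which shows the prior being combined with the data.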
Thank You !