Descriptive Analytics - Univariate and Bivariate
Introductory
Big Data
College of Engineering
Chapter -2-
Descriptive Analytics –
Univariate and Bivariate
Dr Heba Ismail
Learning Objectives
• Scale types
• Introduction to descriptive analytics
• Univariate and bivariate descriptive analytics
• Visualization
Let’s review a few facts from Chapter 1
• What is Data?
• Data, in the information age, are a large set of digital bits encoding numbers, texts,
images, sounds, videos, and so on.
• Can you make any decision based on raw data? (Class Discussion)
• Can you study the purchasing behavior of ALL teenagers around the
globe by surveying ALL teenagers?
• Is it feasible?
Descriptive Statistics
• Descriptive statistics are methods and techniques that describe or
summarize samples in order to help humans understand them
Scale Types
• Qualitative scales
• Nominal: categorizes data without any order
• Operations: = and ≠
• E.g. a friend's name and gender (Eve is female; Eve is not male)
• Ordinal: categorizes data in an ordered way
• Operations: =, ≠, <, >, ≤, and ≥
• E.g. company
• Let's compare Andrew and Marcus on the attribute company
Scale Types
• Quantitative scales
• Relative (Interval): does not have an
absolute zero
• Operations: =, ≠, <, >, ≤, ≥, - and +
• E.g. temperature
• Absolute (Ratio): has an absolute zero
• Operations: =, ≠, <, >, ≤, ≥, -, +, / and ×
• E.g. weight and height
Class Activity
• Weight is expressed on an absolute scale
• Can you change it into an ordinal scale?
• Can you change it into a nominal scale?
• What happens to the amount of information carried by the weight
attribute once it is expressed on a nominal scale?
• Can you still compare the weights of Andrew and Marcus after they
have been converted to a nominal scale?
Textbook Answers – pg. 33.
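One possible way to carry out the conversions from the class activity is sketched below in Python. The cut-offs (60 kg and 85 kg) and the category labels are hypothetical choices made only for illustration; they are not the textbook's answer. The weights of Andrew (77 kg) and Marcus (83 kg) are taken from the table of friends used in this chapter.

```python
# Weights (kg) of two friends from the chapter's data set.
weights = {"Andrew": 77, "Marcus": 83}

# Hypothetical cut-offs, chosen only for illustration.
LIGHT_MAX = 60
MEDIUM_MAX = 85

# Ordinal scale: the categories have an order (light < medium < heavy).
ORDER = ["light", "medium", "heavy"]


def to_ordinal(kg):
    """Map a weight in kg to an ordered category."""
    if kg <= LIGHT_MAX:
        return "light"
    if kg <= MEDIUM_MAX:
        return "medium"
    return "heavy"


def to_nominal(kg):
    """Map a weight in kg to one of two unordered labels (only = and != apply)."""
    return "typical" if LIGHT_MAX < kg <= MEDIUM_MAX else "atypical"


for name, kg in weights.items():
    print(name, kg, to_ordinal(kg), to_nominal(kg))
```

With these cut-offs both friends fall into the same category, so after the conversion we can no longer tell that Marcus is heavier than Andrew: each step down the scale hierarchy discards information.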
Scales vs Data Types
• In software packages we must choose the data type for each attribute
• Common types are text, character, factor, integer, real, float,
timestamp, date or several others
• A scale and a data type are different, though related, concepts
• For instance, a quantitative scale implies the use of numeric data types
• However, an attribute can be expressed as a number while its scale type
is qualitative
• Think about IDs (e.g., student IDs, national IDs, shopper IDs, etc.)
• What kind of quantitative information does an ID carry?
• Can an ID with letters contain the same information?
Descriptive Univariate Analysis: Frequencies
• A frequency is basically a counter
• Absolute frequency counts how many times a value appears.
• Relative frequency is the proportion of observations in which that value
appears, usually given as a percentage.
Example 1 – Company
7/14=50%
Example 2 – Height
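As a rough illustration, absolute and relative frequencies of the height attribute can be tabulated with a few lines of Python. The heights are the 14 values from the table of friends at the end of this chapter; the variable names are assumptions made for this sketch.

```python
from collections import Counter

# Heights (cm) of the 14 friends in the chapter's data set.
heights_cm = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]

# Absolute frequency: how many times each value appears.
absolute = Counter(heights_cm)

# Relative frequency: the share of observations with that value.
n = len(heights_cm)
relative = {value: count / n for value, count in absolute.items()}

for value in sorted(absolute):
    print(f"{value} cm: absolute = {absolute[value]}, relative = {relative[value]:.1%}")
```

The relative frequencies always add up to 1 (100%), which is a quick sanity check on any frequency table.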
Descriptive Univariate Analysis: data visualization
• Pie chart: typically used for nominal scales
• It is not advisable for scales where the notion of order exists (that is,
ordinal and quantitative scales), although it is possible.
Descriptive Univariate Analysis: data visualization
• Bar chart: typically used for qualitative scales.
• Sometimes it can be used with quantitative scales that have a limited
number of values.
• It is argued to be easier to read than pie charts.
Descriptive Univariate Analysis: data visualization
• In a bar chart, we can also split the distribution by the values of
another attribute
Descriptive Univariate Analysis: data visualization
• Empirical distributions are based on samples
• Probability distributions are about populations
Descriptive Univariate Analysis: statistics
• A statistic is a descriptor
• It describes numerically a characteristic of the sample or the population
• There are two main groups of univariate statistics:
• Location statistics
• Dispersion statistics
• Location statistics:
• Minimum: the lowest value
• Maximum: the largest value
• Mean: the average value
• Mode: the most frequent value
• The value that is larger than:
• 25% of all values is the 1st quartile
• 50% of all values is the median or 2nd quartile
• 75% of all values is the 3rd quartile
Example
• Let us use the attribute weight from our data set as an example
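The location statistics of the weight attribute can be computed with Python's standard `statistics` module, as in the minimal sketch below. The weights are the 14 values from the table of friends at the end of this chapter. Quartile conventions differ between tools; this sketch assumes the default "exclusive" method of `statistics.quantiles`, which may not match the convention of every software package.

```python
import statistics

# Weights (kg) of the 14 friends in the chapter's data set.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

minimum = min(weights_kg)             # lowest value
maximum = max(weights_kg)             # largest value
mean = statistics.mean(weights_kg)    # average value
mode = statistics.mode(weights_kg)    # most frequent value

# n=4 returns the three cut points: 1st quartile, median, 3rd quartile.
q1, median, q3 = statistics.quantiles(weights_kg, n=4)

print("min:", minimum, "max:", maximum)
print("mean:", mean, "mode:", mode)
print("Q1:", q1, "median:", median, "Q3:", q3)
```

For this sample the mean is 79 kg while the median is 75 kg; the mean is pulled upwards by the two heaviest friends.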
Descriptive Univariate Analysis: statistics
• Box-plots can also be used to describe the symmetry/skewness of an attribute
• Plots can also be combined
• An example with the attribute Height
• There is only one value for the mean of a population
• There is only one value for the mean of a sample, but several samples
can exist for a single population
• The population mean and the sample mean are calculated in the same way
but are represented differently:
• $\mu_x$ is the population mean of $x$
• $\bar{x}$ is a sample mean of $x$
Descriptive Univariate Analysis: statistics
• Dispersion statistics measure how far apart the different values are
• Dispersion statistics:
• Amplitude (Range): the difference between the maximum and the minimum values
• Interquartile range: the difference between the values of the 3rd and
1st quartiles
• Dispersion statistics (cont.):
• Mean absolute deviation: a measure of the mean absolute distance between
the observations and the mean
• Its math formula for the population is: $\mathrm{MAD}_x = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu_x|$
• Its math formula for a sample is: $\overline{\mathrm{MAD}}_x = \frac{1}{n-1}\sum_{i=1}^{n} |x_i - \bar{x}|$
Descriptive Univariate Analysis: statistics
• Dispersion statistics (cont.):
• Standard deviation: another measure of the typical distance between the
observations and their mean
• Its math formula for the population is: $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)^2}$
• Its math formula for a sample is: $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
• The square of the standard deviation is named variance
• Using the weight attribute again as an example, the dispersion
statistics are as shown in the table

Dispersion statistic               Weight (kg)
Amplitude                          60.00
Interquartile range                21.75
Mean absolute deviation (sample)   14.31
Standard deviation s (sample)      17.38
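The dispersion statistics of the weight attribute can be recomputed with the sketch below. It assumes the sample versions of the formulas (denominator n - 1) and the "exclusive" quartile method of Python's `statistics.quantiles`; other quartile conventions give a different interquartile range. The weights are the 14 values from the table of friends at the end of this chapter.

```python
import statistics

# Weights (kg) of the 14 friends in the chapter's data set.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

n = len(weights_kg)
mean = statistics.mean(weights_kg)

# Amplitude (range): maximum minus minimum.
amplitude = max(weights_kg) - min(weights_kg)

# Interquartile range: 3rd quartile minus 1st quartile.
q1, _, q3 = statistics.quantiles(weights_kg, n=4)
iqr = q3 - q1

# Sample mean absolute deviation, dividing by n - 1.
mad_sample = sum(abs(x - mean) for x in weights_kg) / (n - 1)

# Sample standard deviation and its square, the sample variance.
s = statistics.stdev(weights_kg)
variance = statistics.variance(weights_kg)

print(f"amplitude = {amplitude:.2f}")
print(f"interquartile range = {iqr:.2f}")
print(f"mean absolute deviation = {mad_sample:.2f}")
print(f"standard deviation = {s:.2f}")
```

Rounded to two decimals, these are the values reported for weight: 60.00, 21.75, 14.31 and 17.38.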
Descriptive Univariate Analysis: common univariate probability distributions
• Many events in everyday life follow well-studied distributions
• E.g. the height of adult men, the value of a random number, or the
number of cars passing through a given highway toll
• We present two of these distributions:
• The Uniform distribution
• The Normal distribution, also known as the Gaussian
• Both are continuous distributions and have known probability density
functions
Descriptive Univariate Analysis: common univariate probability distributions
• The Normal distribution
• Physical quantities that are expected to be the sum of many independent
factors (e.g., men's height) typically have approximately Normal
distributions
• The Normal distribution is a symmetric and continuous distribution with
two parameters:
• The mean locates the highest point of the bell-shaped distribution
• The standard deviation defines how narrow or wide the bell shape of the
distribution is
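The role of the two parameters can be explored with `statistics.NormalDist` from Python's standard library. In the sketch below the mean (175) and the two standard deviations (5 and 10) are hypothetical values chosen only for illustration; they are not estimates from the course data.

```python
from statistics import NormalDist

# Two Normal distributions with the same mean and different spreads.
# Parameter values are hypothetical, for illustration only.
narrow = NormalDist(mu=175, sigma=5)
wide = NormalDist(mu=175, sigma=10)

# Evaluate the probability density function around the mean.
for x in (165, 175, 185):
    print(x, round(narrow.pdf(x), 4), round(wide.pdf(x), 4))
```

Both densities peak at the mean and are symmetric around it. The distribution with the smaller standard deviation has the taller, narrower bell.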
Descriptive bivariate analysis
• When the two attributes of the pair are quantitative
• Several visualization techniques can show the distribution of points
described by two quantitative attributes
• One of these techniques is the scatter plot
Descriptive bivariate analysis
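• Pearson correlation (population): $\rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$
• Sample Pearson correlation: $r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the sample means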
Descriptive bivariate analysis
• The Spearman's rank correlation, as the name suggests, is based on
rankings
• It compares how similar the ranking positions of the values of the two
attributes are
Example
• Pearson correlation
• Spearman's rank correlation

Friend     Weight (kg)  Height (cm)
Andrew      77          175
Bernhard   110          195
Carolina    70          172
Dennis      85          180
Eve         65          168
Fred        75          173
Gwyneth     75          180
Hayden      63          165
Irene       55          158
James       66          163
Kevin       95          190
Lea         72          172
Marcus      83          185
Nigel      115          192

Rank pairs, listed in increasing order of ranked height (they do not line
up row by row with the friends above; tied values share the average rank):

Ranked weight  Ranked height
 1.0            1.0
 4.0            2.0
 2.0            3.0
 3.0            4.0
 5.0            5.5
 6.0            5.5
 7.5            7.0
 9.0            8.0
 7.5            9.5
11.0            9.5
10.0           11.0
12.0           12.0
14.0           13.0
13.0           14.0
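Both correlations can be computed for the weight and height columns of this example with plain Python, as in the minimal sketch below. The helper names `pearson`, `average_ranks` and `spearman` are assumptions made for this illustration. Ties receive the average of the rank positions they occupy, and Spearman's coefficient is obtained by applying the Pearson formula to the ranks.

```python
from math import sqrt

# Weight (kg) and height (cm) of the 14 friends, in table order.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]
heights_cm = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first 1-based position of v
        count = ordered.count(v)       # number of tied observations
        ranks.append(first + (count - 1) / 2)
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(f"Pearson:  {pearson(weights_kg, heights_cm):.2f}")
print(f"Spearman: {spearman(weights_kg, heights_cm):.2f}")
```

For this data both coefficients are close to 1 (about 0.94 for Pearson and 0.96 for Spearman), indicating that heavier friends also tend to be taller.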
Reading
• Textbook: Chapter 2
• Moreira, João, André Carlos Ponce de Leon Ferreira de Carvalho, and Tomáš
Horváth. A General Introduction to Data Analytics. Wiley, 2019. ISBN: 9781119296263.