Descriptive Analytics - Univariate and Bivariate

COE 102

Introductory
Big Data
College of Engineering

Chapter -2-
Descriptive Analytics –
Univariate and Bivariate

Dr Heba Ismail
Learning Objectives

• Scale types
• Introduction to descriptive analytics
• Univariate and bivariate descriptive analytics
• Visualization

Let’s review a few facts from Chapter 1

• What is Data?
• Data, in the information age, are a large set of digital bits encoding numbers, texts,
images, sounds, videos, and so on.

• Can you give some examples of raw data? (Class Discussion)

• Can you make any decision based on raw data? (Class Discussion)

• What do we need to do with data to have meaningful insights?

• We need to produce some analytics!


Let’s Consider these Scenarios
• Can you study employees' behavior in ALL government
organizations by surveying ALL the employees?

• Can you study the purchasing behavior of ALL teenagers around the
globe by surveying ALL teenagers?

• Is it feasible?

• What would be a better alternative?


Statistical Concepts
Population
• A set of similar instances/objects or events which is of interest for some
question or experiment
• E.g. all students of my school, all nails produced by a machine
Sample
• A set of data collected and/or selected from a population by a defined
procedure
• E.g. a subset of the students of my school who answered a survey, a subset of
randomly selected nails produced by a machine

• Can you give more examples?
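The difference between a population and a sample can be illustrated with a minimal sketch. The population below is hypothetical (500 made-up student ID numbers), and simple random sampling is only one of several possible "defined procedures" for drawing a sample.

```python
import random

# Hypothetical population: ID numbers of all 500 students of a school.
population = list(range(1, 501))

# Fix the seed so the illustration is repeatable (an arbitrary choice).
rng = random.Random(42)

# Simple random sampling without replacement: every student has the
# same chance of being selected, and nobody is selected twice.
sample = rng.sample(population, k=50)

print(len(population), "students in the population")
print(len(sample), "students in the sample")
```

Surveying the 50 sampled students is feasible even when surveying all 500 is not.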


Statistical Concepts
Deduction
• Reasoning about the sample extracted from that population
• Deduction aims to study the probability of randomly extracting a
representative sample.
Induction
• Generalizing the knowledge obtained from a sample to all of a
population is called statistical inference (or induction)
• E.g. a subset of the students of my school who answered a survey, a
subset of randomly selected nails produced by a machine

Descriptive Statistics
• Descriptive statistics are methods / techniques to describe or
summarize samples in order to help humans understand them

Scale Types
• Qualitative scales
• Nominal: categorizes data in a non-ordinal way
• Operations: = and ≠
• E.g. a friend’s name and gender (e.g. Eve is Female; Eve is not Male)
• Ordinal: categorizes data in an ordinal way
• Operations: =, ≠, <, >, ≤, and ≥
• E.g. company
• Let’s compare Andrew and Marcus on the attribute Company
Scale Types
• Quantitative scales
• Relative (Interval): does not have an
absolute zero
• Operations: =, ≠, <, >, ≤, ≥, - and +
• E.g. temperature
• Absolute (Ratio): has an absolute zero
• Operations: =, ≠, <, >, ≤, ≥, -, +, / and ×
• E.g. weight and height

When the attribute “height” is zero it means there is no height.
This is also true for the weight. But for the temperature, when
we have 0 °C it does not mean there is no temperature. When
we talk about weight, we can say that Bernhard weighs twice as
much as Irene, but we cannot say that the maximum
temperature last week in Dennis’ home town was twice that in
Eve’s.
Changing Data Scale

This all means that when we have data expressed on an absolute
scale we can convert it to any of the other scales. When we have
data expressed on a relative scale we can convert it to either of
the two qualitative scale types. When we have data expressed on
an ordinal scale we can express it on a nominal scale.
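A minimal sketch of such a conversion follows, using the weights of Andrew (77 kg) and Marcus (83 kg) from the chapter's example data. The function name, the category labels and the cut-off values are hypothetical and chosen only for illustration.

```python
def weight_to_ordinal(weight_kg):
    """Map a weight on the absolute (ratio) scale to an ordered category."""
    # The cut-off values below are hypothetical, chosen only for illustration.
    if weight_kg < 65:
        return "light"
    if weight_kg < 85:
        return "medium"
    return "heavy"


# Order of the categories, needed for ordinal comparisons.
ORDER = ["light", "medium", "heavy"]

andrew_kg, marcus_kg = 77, 83

andrew_cat = weight_to_ordinal(andrew_kg)
marcus_cat = weight_to_ordinal(marcus_kg)

# Absolute scale: differences and ratios are meaningful.
print("difference in kg:", marcus_kg - andrew_kg)

# Ordinal scale: only order comparisons remain.
print("Andrew <= Marcus:", ORDER.index(andrew_cat) <= ORDER.index(marcus_cat))

# Nominal scale: only equality and inequality remain.
print("same category:", andrew_cat == marcus_cat)
```

With these cut-offs both friends fall into the same category, so the 6 kg difference visible on the absolute scale is no longer recoverable: each conversion toward a qualitative scale discards information.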

Class Activity
• Weight is expressed as an absolute scale
• Can you change it into ordinal?
• Can you change it into nominal?
• What do you notice about the amount of information obtained
from the weight attribute after changing it to a nominal
scale?
• Can you compare the weights of Andrew and Marcus after they
have been converted to nominal?

Textbook Answers – pg. 33.

Scales vs Data Types
• In software packages we must choose the data type for each attribute
• Common types are text, character, factor, integer, real, float,
timestamp, date or several others
• A scale and a data type are different, although related, concepts
• For instance, a quantitative scale implies the use of numeric data types
• However, an attribute can be expressed as a number while its scale type
is qualitative
• Think about IDs (e.g., student IDs, national IDs, shopper IDs, etc.)
• What kind of quantitative information does an ID carry?
• Can an ID with letters contain the same information?

Descriptive Univariate Analysis: Frequencies
• A frequency is basically a counter
• Absolute frequency counts how many times a value appears.
• Relative frequency is the percentage of times that value appears.
• The absolute cumulative frequency is the number of occurrences less
than or equal to a given value
• The relative cumulative frequency is the percentage of occurrences
less than or equal to a given value
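The four kinds of frequency can be computed in a few lines. This is a minimal sketch using the heights (in cm) of the 14 friends listed in the example table at the end of this chapter; the variable names are illustrative.

```python
from collections import Counter

# Heights (cm) of the 14 friends from the chapter's example table.
heights_cm = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]

n = len(heights_cm)
counts = Counter(heights_cm)  # absolute frequency of each distinct value

rows = []
cumulative = 0
for value in sorted(counts):
    absolute = counts[value]
    cumulative += absolute  # occurrences less than or equal to `value`
    rows.append((value, absolute, absolute / n, cumulative, cumulative / n))

print("value  abs     rel  cum   rel cum")
for value, absolute, relative, abs_cum, rel_cum in rows:
    print(f"{value:>5} {absolute:>4} {relative:>7.2%} {abs_cum:>4} {rel_cum:>9.2%}")
```

The last row always has an absolute cumulative frequency equal to the number of observations and a relative cumulative frequency of 100%.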

Example 1 – Company

7/14=50%

Example 2 – Height

Descriptive Univariate Analysis: data visualization
• Pie chart: it is typically used for nominal scales.
• It is not advisable to use pie charts with scales where the notion of
order exists (in other words, ordinal and quantitative scales),
although this is possible.

Descriptive Univariate Analysis: data visualization
• Bar chart: it is typically used for qualitative scales.
• Sometimes it can be used with quantitative scales with a limited
number of values.
• It is argued to be easier to read than pie charts.

Descriptive Univariate Analysis: data visualization
• In a bar chart, we can also separate the distributions by the values
of some other attribute
• This is illustrated in the figure, where the frequencies for the
target attribute “company” are split by gender
Descriptive Univariate Analysis: data visualization
• Line chart: it is especially used to deal with the notion of time.
• Like area charts, line charts are used when the horizontal axis uses a
quantitative scale with equal lag between observations.
• They represent time series, graphs of values obtained over regular
time sequences.

Max Temp  Day
21        1
25        2
30        3
20        4
21        5
Descriptive Univariate Analysis: data visualization
• Area charts: they are especially used to compare time series and
distribution functions
• Understanding data distributions gives us strong insights about an
attribute. We are able to see, for instance, that data are more
concentrated in some values or that other values are rare.

Andrew
Max Temp  Day
21        1
25        2
30        3
20        4
21        5

Eve
Max Temp  Day
17        1
18        2
19        3
20        4
0         5
Descriptive Univariate Analysis: data visualization
• Histograms: they are used to represent empirical distributions for
attributes with a quantitative scale
• Histograms are characterized by grouping values in cells, reducing in
this way the sparsity that is common in quantitative scales.
• For quantitative attributes, a histogram is more informative than a
bar chart.

Descriptive Univariate Analysis: data visualization
• An important decision when drawing a histogram is the number of cells
• The most advisable value is problem dependent
• As a rule of thumb you can use a number around the square root of the
number of values
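The square-root rule of thumb can be sketched as follows, using the 14 heights (in cm) from the example table at the end of this chapter. Equal-width cells and the handling of the maximum value (placed in the last cell) are assumptions of this sketch, not something the rule itself prescribes.

```python
import math

# Heights (cm) of the 14 friends from the chapter's example table.
heights_cm = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]

# Rule of thumb: number of cells close to the square root of the number of values.
n_cells = round(math.sqrt(len(heights_cm)))

low, high = min(heights_cm), max(heights_cm)
width = (high - low) / n_cells  # equal-width cells (an assumption)

counts = [0] * n_cells
for h in heights_cm:
    # The maximum value would fall just outside the last cell, so clamp it.
    index = min(int((h - low) / width), n_cells - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    start = low + i * width
    print(f"[{start:6.2f}, {start + width:6.2f})  {'#' * count}")
```

Changing `n_cells` by hand shows why the best value is problem dependent: too few cells hide the shape of the distribution, too many bring back the sparsity.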

Descriptive Univariate Analysis: data visualization
• Empirical distributions are based on samples
• Probability distributions are about populations

Descriptive Univariate Analysis: statistics
• A statistic is a descriptor
• It describes numerically a characteristic of the sample or the
population
• There are two main groups of univariate statistics:
• Location statistics
• Dispersion statistics

• Location statistics:
• Minimum: is the lowest value
• Maximum: is the largest value
• Mean: is the average value
• Mode: is the most frequent value
• The value that is larger than 25% of all values is the 1st quartile
• The value that is larger than 50% of all values is the median or 2nd
quartile
• The value that is larger than 75% of all values is the 3rd quartile

Example
• Let us use as an example the attribute weight from our data set

[Figure: graphical representation of the statistics]

Location statistic       Weight (kg)
Min                      55.00
Max                      115.00
Mean or average          79.00
Mode                     75.00
1st quartile             65.75
2nd quartile or median   75.00
3rd quartile             87.50
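The location statistics in the table above can be reproduced with Python's standard `statistics` module. This is a minimal sketch that takes the 14 weights from the example table at the end of this chapter. The choice of `statistics.quantiles` with its default "exclusive" method is an assumption: other quartile conventions give slightly different values for the 1st and 3rd quartiles.

```python
import statistics

# Weights (kg) of the 14 friends from the chapter's example table.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

minimum = min(weights_kg)
maximum = max(weights_kg)
mean = statistics.mean(weights_kg)
mode = statistics.mode(weights_kg)

# Quartiles with the default "exclusive" method, which interpolates
# at positions based on n + 1.
q1, q2, q3 = statistics.quantiles(weights_kg, n=4)

print(f"Min           {minimum:7.2f}")
print(f"Max           {maximum:7.2f}")
print(f"Mean          {mean:7.2f}")
print(f"Mode          {mode:7.2f}")
print(f"1st quartile  {q1:7.2f}")
print(f"Median        {q2:7.2f}")
print(f"3rd quartile  {q3:7.2f}")
```

The 2nd quartile returned by `statistics.quantiles` is the same value as `statistics.median(weights_kg)`.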
Descriptive Univariate Analysis: statistics
• Box-plots present the minimum, the 1st quartile, the median, the
3rd quartile and the maximum statistics, in this order, bottom-up or
from left to right
• Figure: box-plot of the attribute height
• Mean (or average), median and mode are known as measures of central
tendency, because they return a central value from a set of values

Location statistic   Nominal   Ordinal    Quantitative
Mean                 No        Possibly   Yes
Median               No        Yes        Yes
Mode                 Yes       Yes        Yes

Descriptive Univariate Analysis: statistics
• Box-plots can also be used to describe the symmetry/skewness of an
attribute
• The median and the mode are more robust central tendency statistics
than the mean in the presence of extreme values or strongly skewed
distributions
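The robustness of the median can be checked with a small experiment. This sketch uses the 14 weights from the chapter's example table and appends one extreme value; the value 500 is hypothetical and stands for something like a data-entry error.

```python
import statistics

# Weights (kg) of the 14 friends from the chapter's example table.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

# Hypothetical extreme value, e.g. a data-entry error.
with_outlier = weights_kg + [500]

mean_before = statistics.mean(weights_kg)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(weights_kg)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

A single extreme value shifts the mean by tens of kilograms, while the median stays where it was.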
Descriptive Univariate Analysis: statistics
• Can the mean be used with ordinal scales?
• This is debatable, but there are examples of its use with numeric
ordinal scales such as the Likert scale
• The Likert scale uses an ordered scale, e.g., integers from 1 (highest
disagreement) to 5 (highest agreement)
Descriptive Univariate Analysis: statistics
• Plots can also be combined
• An example with the attribute Height

• There is only one value for the mean of a population
• There is only one value for the mean of a sample, but several samples
can exist from a single population
• The population mean and the sample mean are calculated in the same
way but are represented differently:
• $\mu_x$ is the population mean of $x$
• $\bar{x}$ is a sample mean of $x$
Descriptive Univariate Analysis: statistics
• Dispersion statistics measure how distant the different values are
• Dispersion statistics:
• Amplitude (Range): is the difference between the maximum and the
minimum values
• Interquartile range: is the difference between the values of the
3rd and 1st quartiles

• Dispersion statistics (cont.):
• Mean absolute deviation: is a measure of the mean absolute distance
between the observations and the mean
• Its math formula for the population, with mean $\mu_x$, is:
$MAD_x = \frac{\sum_{i=1}^{n} |x_i - \mu_x|}{n}$
• Its math formula for a sample, with mean $\bar{x}$, is:
$\overline{MAD}_x = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n-1}$

Descriptive Univariate Analysis: statistics
• Dispersion statistics (cont.):
• Standard deviation: is another measure of the typical distance
between the observations and their mean
• Its math formula for the population, with mean $\mu_x$, is:
$\sigma_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu_x)^2}{n}}$
• Its math formula for a sample, with mean $\bar{x}$, is:
$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
• The square of the standard deviation is named variance

• Using again as an example the weight attribute, dispersion statistics
are as shown in the table

Dispersion statistic               Weight (kg)
Amplitude                          60.00
Interquartile range                21.75
Mean absolute deviation (sample)   14.31
Standard deviation (sample), s     17.38
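The dispersion statistics in the table above can be reproduced as follows. This is a minimal sketch using the 14 weights from the example table at the end of this chapter. Two choices are assumptions of the sketch: quartiles use the default "exclusive" method of `statistics.quantiles`, and the sample mean absolute deviation divides by n - 1, which is the convention that matches the table's values.

```python
import statistics

# Weights (kg) of the 14 friends from the chapter's example table.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]
n = len(weights_kg)

# Amplitude (range): maximum minus minimum.
amplitude = max(weights_kg) - min(weights_kg)

# Interquartile range: 3rd quartile minus 1st quartile.
q1, _, q3 = statistics.quantiles(weights_kg, n=4)
iqr = q3 - q1

# Sample mean absolute deviation, dividing by n - 1.
mean = statistics.mean(weights_kg)
mad_sample = sum(abs(x - mean) for x in weights_kg) / (n - 1)

# Sample standard deviation (divides by n - 1) and its square, the variance.
s = statistics.stdev(weights_kg)
variance = statistics.variance(weights_kg)

# Population standard deviation (divides by n), shown for comparison.
sigma = statistics.pstdev(weights_kg)

print(f"Amplitude                {amplitude:7.2f}")
print(f"Interquartile range      {iqr:7.2f}")
print(f"Mean absolute deviation  {mad_sample:7.2f}")
print(f"Standard deviation s     {s:7.2f}")
```

Because the population formula divides by n rather than n - 1, `sigma` is always slightly smaller than `s` for the same data.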

Descriptive Univariate Analysis: common univariate probability distributions
• Many events in our lives follow distributions that have already been
studied
• E.g. the height of adult men, the value of a random number, or the
number of cars passing through a given highway toll

• We present two of these distributions:
• The Uniform distribution
• The Normal distribution, also known as the Gaussian
• Both are continuous distributions and have known probability
density functions

© João Moreira - FEUP/UP


Descriptive Univariate Analysis: common univariate probability distributions
• An attribute that follows the uniform distribution with parameters
$a$ and $b$ (the lower and upper limits) has equal frequency of
occurrence of values in any interval of a given size within $[a, b]$
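The "equal frequency in any interval of a given size" property can be checked numerically. In this minimal sketch the limits 0 and 10, the two intervals and the helper name `uniform_interval_probability` are all hypothetical choices made for illustration.

```python
import random


def uniform_interval_probability(a, b, low, high):
    """Probability that a Uniform(a, b) value falls in [low, high]."""
    low, high = max(low, a), min(high, b)  # clip the interval to [a, b]
    return max(high - low, 0.0) / (b - a)


a, b = 0.0, 10.0  # hypothetical lower and upper limits

# Two intervals of the same size (2.0) at different positions.
p1 = uniform_interval_probability(a, b, 1.0, 3.0)
p2 = uniform_interval_probability(a, b, 6.5, 8.5)

# Empirical check with pseudo-random draws (fixed seed for repeatability).
rng = random.Random(0)
draws = [rng.uniform(a, b) for _ in range(100_000)]
f1 = sum(1.0 <= x < 3.0 for x in draws) / len(draws)
f2 = sum(6.5 <= x < 8.5 for x in draws) / len(draws)

print(f"theoretical: {p1:.3f} and {p2:.3f}")
print(f"empirical:   {f1:.3f} and {f2:.3f}")
```

Both intervals get the same theoretical probability, and the relative frequencies observed in the sample are close to it.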

Descriptive Univariate Analysis: common univariate probability distributions
• The Normal distribution
• Physical quantities that are expected to be the sum of many
independent factors (e.g., men's height) typically have
approximately Normal distributions

• The Normal distribution is a symmetric and continuous distribution
with two parameters:
• The mean locates the highest point of the bell-shaped distribution
• The standard deviation defines how narrow or wide the bell shape of
the distribution is
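The roles of the two parameters can be seen by evaluating the probability density function at a few points. This is a minimal sketch with `statistics.NormalDist` from the Python standard library; the mean of 176 and the standard deviations of 5 and 15 are hypothetical values, not estimates from the chapter's data.

```python
from statistics import NormalDist

# Two Normal distributions with the same mean and different spreads
# (hypothetical parameter values).
narrow = NormalDist(mu=176, sigma=5)
wide = NormalDist(mu=176, sigma=15)

# The density is highest at the mean...
peak_narrow = narrow.pdf(176)
off_peak_narrow = narrow.pdf(170)

# ...and symmetric around it.
left = narrow.pdf(171)
right = narrow.pdf(181)

# A larger standard deviation gives a lower, wider bell.
peak_wide = wide.pdf(176)

print(f"narrow bell: pdf(176)={peak_narrow:.4f}, pdf(170)={off_peak_narrow:.4f}")
print(f"symmetry:    pdf(171)={left:.4f}, pdf(181)={right:.4f}")
print(f"wide bell:   pdf(176)={peak_wide:.4f}")
```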

Descriptive bivariate analysis
• When the two attributes of the pair are quantitative
• There are several visualization techniques able to show the
distribution of points with two quantitative attributes
• One of these techniques is the scatter plot

Descriptive bivariate analysis
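• Pearson correlation
• Sample Pearson correlation, where $\bar{x}$ and $\bar{y}$ are the
sample means of the two attributes:
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
• Is scale independent: its values always lie in [-1, 1]
• If the points form:
• an increasing line, the Pearson correlation coefficient will be 1
• a decreasing line, its value will be -1
• a cloud without increasing or decreasing tendency, its value will be
close to 0 (for a perfectly horizontal line the coefficient is
undefined, because one attribute has zero variance)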
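These properties can be verified on a few synthetic points. This is a minimal sketch in which the helper `pearson` and the toy data are illustrative, written directly from the usual sample definition of the coefficient rather than taken from any library.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


xs = [1, 2, 3, 4, 5]
increasing = [2 * x + 1 for x in xs]     # points on an increasing line
decreasing = [-3 * x + 10 for x in xs]   # points on a decreasing line
rescaled = [1000 * y for y in increasing]  # same line, different unit

r_increasing = pearson(xs, increasing)
r_decreasing = pearson(xs, decreasing)
r_rescaled = pearson(xs, rescaled)

print(r_increasing, r_decreasing, r_rescaled)
```

Changing the unit of one attribute (`rescaled`) leaves the coefficient unchanged, which is what scale independence means here. Calling `pearson` with a constant list would divide by zero, which reflects the undefined case of a perfectly horizontal line.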

Descriptive bivariate analysis
• Spearman's rank correlation, as the name suggests, is based on
rankings
• It compares how similar the ranking positions of the values of the
two attributes are

Example
• Pearson correlation
• Spearman's rank correlation

Friend     Weight (kg)   Height (cm)   Ranked weight   Ranked height
Andrew     77            175           9.0             8.0
Bernhard   110           195           13.0            14.0
Carolina   70            172           5.0             5.5
Dennis     85            180           11.0            9.5
Eve        65            168           3.0             4.0
Fred       75            173           7.5             7.0
Gwyneth    75            180           7.5             9.5
Hayden     63            165           2.0             3.0
Irene      55            158           1.0             1.0
James      66            163           4.0             2.0
Kevin      95            190           12.0            12.0
Lea        72            172           6.0             5.5
Marcus     83            185           10.0            11.0
Nigel      115           192           14.0            13.0
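Both coefficients can be computed for the 14 friends with a short script. This is a minimal sketch using only the standard library: the helpers `pearson`, `average_ranks` and `spearman` are illustrative names, tied values receive the average of their rank positions, and Spearman's coefficient is obtained as the Pearson correlation of the ranks.

```python
import math

# Weight (kg) and height (cm) of the 14 friends, Andrew to Nigel.
weights_kg = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]
heights_cm = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1   # first rank position held by v
        count = ordered.count(v)       # number of tied occurrences of v
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


r = pearson(weights_kg, heights_cm)
rho = spearman(weights_kg, heights_cm)

print("ranked weight:", average_ranks(weights_kg))
print("ranked height:", average_ranks(heights_cm))
print(f"Pearson:  {r:.2f}")
print(f"Spearman: {rho:.2f}")
```

Both values are close to 1, which indicates that heavier friends in this data set also tend to be taller.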
Reading
• Textbook: Chapter 2
• Moreira, João, André Carlos Ponce de Leon Ferreira, and Tomáš Horváth. A
general introduction to data analytics. Wiley, 2019. ISBN: 9781119296263.
