Topic 4 Descriptive statistics

Vincent Hoang (2022), Lecture 4


Camm et al. (2016), Chapter 2
Three main goals

• Measures of central tendency
• Measures of dispersion and shape
• Measures of association
Review
Dichotomous: two categories/levels, e.g. ‘yes’ and ‘no’.

Source: slightly edited from https://studyonline.unsw.edu.au/sites/default/files/UNSW2.png


Descriptive statistics
• Used to describe and summarise a variable or
variables for a sample of data.
◦ For categorical or grouped data: the proportion.
◦ Measures of central tendency: mean, median, and mode
◦ Measures of dispersion: range, interquartile range, standard
deviation, coefficient of variation, percentiles, z-scores
◦ Measures of shape: skewness, kurtosis
◦ Measures of association: covariance, correlation
You can transform
data into new variables
The Proportion
• The proportion, p, is the share of observations that have a certain characteristic, usually reported as a percentage
• Very useful for categorical or grouped data
• Take the number of observations with a characteristic (X) and divide it by the total number of observations (N)

Example 4.31 (textbook, p 178)
What is the proportion of orders that are placed with Spacetime Technologies?

Supplier                  Orders
Alum Sheeting             8
Durrable Products         13
Fast-Tie Aerospace        15
Hulkey Fasteners          15
Manley Valve              11
Pylon Accessories         5
Spacetime Technologies    12
Steelpin Inc.             15
Total                     94

$p = \frac{X}{N} = \frac{12}{94} \approx 0.128 = 12.8\%$

12.8% of orders are placed with Spacetime Technologies.
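The same calculation can be sketched outside Excel. This is a minimal illustration only: the order counts are copied from Example 4.31, while the function name `proportion` is a hypothetical helper, not something from the textbook.

```python
# Order counts per supplier, copied from Example 4.31.
orders = {
    "Alum Sheeting": 8,
    "Durrable Products": 13,
    "Fast-Tie Aerospace": 15,
    "Hulkey Fasteners": 15,
    "Manley Valve": 11,
    "Pylon Accessories": 5,
    "Spacetime Technologies": 12,
    "Steelpin Inc.": 15,
}


def proportion(counts, category):
    """Return X / N: the share of observations that fall in `category`."""
    total = sum(counts.values())  # N, the total number of observations
    return counts[category] / total


p = proportion(orders, "Spacetime Technologies")
print(f"{p:.1%}")  # 12 out of 94 orders, shown as a percentage
```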


Individual assessment
Measures of central tendency
• Three different measures of the “typical” or “representative” value in a dataset

• Arithmetic Mean
◦ Total value of all observations / number of observations
◦ =AVERAGE(datarange)
• Median
◦ The middle observation, after ordering the data from smallest to largest
◦ =MEDIAN(datarange)
• Mode
◦ The most frequently occurring observation
◦ =MODE.MULT(datarange)
Mean vs Median vs Mode
• Mean is often used for quantitative data unless outliers
exist or data is skewed.
• Median is often used in conjunction with the mean
since it is not affected by outliers. Comparing mean
with median gives us an idea of skewness.
• Mode is mainly used for qualitative data, rarely used for
numerical data. There may be no mode, multiple
modes, or the mode may not be close to the centre of
the data.
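A small sketch of why the median is preferred when outliers exist. The salary figures are made up for illustration, and Python's `statistics` functions are used as rough counterparts of the Excel formulas, not exact equivalents in every edge case.

```python
import statistics

# Hypothetical salaries (in $000); the last value is a deliberate outlier.
salaries = [40, 42, 45, 45, 50, 52, 400]

mean = statistics.mean(salaries)        # comparable to =AVERAGE(datarange)
median = statistics.median(salaries)    # comparable to =MEDIAN(datarange)
modes = statistics.multimode(salaries)  # comparable to =MODE.MULT(datarange)

# The single outlier pulls the mean far above the median,
# while the median and mode stay at a "typical" salary.
print(round(mean, 1), median, modes)
```

A mean well above the median, as here, is the kind of gap that hints at right skew.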
Individual assessment
• Consider investment in stock markets. For each stock,
◦ Daily return = closing price – opening price
◦ Return over a year (a period) = price on the ending date – price on the beginning date
◦ Return on a trade = price when re-selling – price on purchase

• What measures of central tendency would you use?


Excel’s Aggregate function
• Reference form: AGGREGATE(function_num, options, ref1, [ref2], …)
• Array form: AGGREGATE(function_num, options, array, [k])
Individual assessment
Three main goals

• Measures of central tendency
• Measures of dispersion and shape
• Measures of association
Skewness
• Measures symmetry relative to a bell-shaped (normal) distribution.
• Normal distribution: bell shape; median = mode = mean; no skewness
• If the mean is different to the median, this implies skewness. As a general rule, a value for skewness:
◦ < -1 or > 1 is highly skewed
◦ Between -1 and -0.5 or between 0.5 and 1 is moderately skewed
◦ Between -0.5 and 0 or between 0 and 0.5 is approximately symmetric
• Note: This rule may not apply to discrete or bimodal data.
• =SKEW(datarange)
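The rule of thumb can be sketched in code. The formula below is the bias-adjusted sample skewness that Excel documents for SKEW; the function names and the handling of the exact boundary values (0.5 and 1) are assumptions made for this illustration.

```python
import statistics


def sample_skewness(data):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def describe_skew(skew):
    """Apply the rule of thumb from the slide to a skewness value."""
    size = abs(skew)
    if size > 1:
        return "highly skewed"
    if size > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


symmetric = [1, 2, 3, 4, 5]      # hypothetical data, mean = median
right_tail = [1, 1, 1, 2, 10]    # hypothetical data with a long right tail
print(describe_skew(sample_skewness(symmetric)))
print(describe_skew(sample_skewness(right_tail)))
```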
Income distribution

https://ourworldindata.org/global-economic-inequality
“income inequality in Australia/Vietnam
has been increasing recently”
• What would you show in your analysis?
◦ Think of a specific context:
◦ (global vs) national vs state level
◦ by socio-economic demographic factors: gender, ethnicity, skills, education & qualification,
efforts etc.
◦ Think of a specific data set
◦ entire population vs income groups
◦ Think of specific measures (metrics/ indicators/ variables)
◦ mean – median – mode – skewness etc.
Measures of variation
Dispersion = Variation = Spread: refers to the degree of variation in the data
Five key measures:
1. Range
2. Interquartile Range
3. Percentiles
4. Standard deviation
5. Coefficient of variation
What can we say about the variation in income?
Range and Interquartile Range
• Range: the difference between the minimum and maximum value in the data – sensitive to
outliers
• Interquartile Range: the range of the middle 50% of the data – the difference between the
third quartile and first quartile in the data (Q3 minus Q1) – not sensitive to outliers
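A quick sketch of why the range reacts to outliers and the interquartile range does not. The data are hypothetical, and `statistics.quantiles` with `method="exclusive"` is assumed here to play the role of Excel's QUARTILE.EXC.

```python
import statistics


def range_and_iqr(data):
    """Return (range, IQR) where IQR = Q3 - Q1."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return max(data) - min(data), q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7]          # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 70]  # same data, largest value replaced

print(range_and_iqr(clean))         # range and IQR of the clean data
print(range_and_iqr(with_outlier))  # range jumps, IQR is unchanged
```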
Percentiles
• The position in the dataset where p% of observations are below it and (100-p)% are above it, when ordered from smallest to largest
◦ Useful for analysing specific points along the distribution
◦ Most common percentiles are quartiles (i.e. 25th, 50th, 75th percentiles) or deciles (i.e. 10th, 20th,…, 90th percentiles)
◦ More extreme percentiles are affected by outliers

• =PERCENTILE.EXC(datarange, percentile)
◦ Make sure you put the percentile in as a fraction (e.g. 20th percentile is 0.2)

Notes:
1. Interpretation of a percentile: the Xth percentile is the value that X% of the sample falls below and (100-X)% of the sample falls above.
2. Example: in this case the 10th percentile is 12990, so 10% of Australian taxpayers have an income of 12990 or less, and 90% of Australian taxpayers have an income above 12990.
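To make the "fraction, not whole number" point concrete, here is a rough sketch of an exclusive percentile. It follows the rank rule k × (n + 1) with linear interpolation that Excel documents for PERCENTILE.EXC; the function name `percentile_exc` and the sample data are illustrative assumptions.

```python
import math


def percentile_exc(data, k):
    """Exclusive percentile: k is a fraction strictly between 0 and 1."""
    xs = sorted(data)
    n = len(xs)
    rank = k * (n + 1)  # 1-based position in the ordered data
    if rank < 1 or rank > n:
        raise ValueError("k is too extreme for a dataset of this size")
    lower = math.floor(rank)
    frac = rank - lower
    if lower == n:  # rank lands exactly on the largest value
        return xs[-1]
    # Interpolate between the two neighbouring ordered observations.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


data = list(range(1, 10))          # hypothetical data: 1, 2, ..., 9
print(percentile_exc(data, 0.2))   # 20th percentile, entered as 0.2
print(percentile_exc(data, 0.25))  # first quartile
```

Passing 20 instead of 0.2 raises an error here, which mirrors the slide's warning about entering the percentile as a fraction.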
Example: Gender pay gap
• If asked to use data to show current trends in the gender pay
gap, what would you show?
• Consider GPG =

• https://data.wgea.gov.au/home
Standard deviation
• Difficult to interpret on its own, but assuming
the data is approximately bell-shaped (normally
distributed):
◦ About 68% of observations are situated within ± 1 standard deviation of the mean
◦ About 95% of observations are situated within ± 2 standard deviations of the mean
◦ About 99.7% of observations are situated within ± 3 standard deviations of the mean
= STDEV.S(datarange)
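A minimal sketch of the sample standard deviation and the "within k standard deviations" check. The NVL annual returns are taken from the outlier analysis table in these slides; the helper name `share_within` is hypothetical. With only ten observations the shares will not match the bell-curve percentages exactly.

```python
import statistics

# NVL annual returns (%), from the outlier analysis table in these slides.
nvl = [0, 2, 5, 5, 5, 6, 6, 7, 8, 9]

mean = statistics.mean(nvl)
sd = statistics.stdev(nvl)  # sample SD (n - 1 divisor), like =STDEV.S


def share_within(data, k):
    """Fraction of observations within k sample SDs of the mean."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(abs(x - m) <= k * s for x in data) / len(data)


print(round(mean, 2), round(sd, 2))
print(share_within(nvl, 1), share_within(nvl, 2), share_within(nvl, 3))
```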
Note: use the coefficient of variation to measure the volatility of a stock.

Real world business uses of SD


• Banking and finance:
◦ Standard deviation is often used as a measure of the relative riskiness of an asset.
◦ A volatile stock has a high standard deviation, while the standard deviation of a stable stock is usually rather low.
• Actuaries calculate the standard deviation of healthcare usage to know how much variation in usage to expect in a given period (month, quarter, or year).
• Real estate agents calculate the standard deviation of house prices in a particular area to inform their clients of the type of variation in house prices they can expect.
• Human Resource managers often calculate the standard deviation of salaries in a certain field to know what type of variation in salaries to offer to new employees.
Coefficient of Variation
• The coefficient of variation (CV) expresses the standard deviation of data relative to (divided by) its mean
• Useful for comparisons of variation across different sets of data (e.g. between returns on different investments)

NVL:
• Average = 5.3%
• SD = 2.67%
• CV = 2.67/5.3 = 0.50

VCB:
• Average = 5.3%
• SD = 0.95%
• CV = 0.95/5.3 = 0.18

Therefore, NVL has more variation in its returns (higher risk) given the same “average” return.
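The NVL and VCB figures can be reproduced with a short sketch. The return series are the ones listed in the outlier analysis table in these slides; `coefficient_of_variation` is a hypothetical helper name, and it assumes a non-zero mean.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    return statistics.stdev(data) / statistics.mean(data)


# Annual returns (%), from the outlier analysis table in these slides.
nvl = [0, 2, 5, 5, 5, 6, 6, 7, 8, 9]
vcb = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7]

cv_nvl = coefficient_of_variation(nvl)
cv_vcb = coefficient_of_variation(vcb)
print(round(cv_nvl, 2), round(cv_vcb, 2))  # NVL is the more variable stock
```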
Combining Mean and Standard Deviation
Individual assessment

• This shows top five stocks.


• You can calculate Coefficient of Variation to compare volatility across stocks.
Standardized Values (Z-scores)
• Sometimes we are interested in seeing where individual observations sit
relative to the mean.
• The Z-score tells us how many standard deviations away from the mean
an observation sits
• Use the =STANDARDIZE(x,mean,stdev) function in Excel
◦ a z-score of 1.0 (a positive value) means that the observation is one standard
deviation above the mean;
◦ a z-score of -1.5 means that the observation is 1.5 standard deviations below the
mean.
• Useful for checking if individual observations are outliers.
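The z-score is simply (x − mean) / standard deviation. A minimal sketch, with a hypothetical function named after the Excel formula and inputs taken from the NVL and VCB figures used elsewhere in these slides:

```python
def standardize(x, mean, stdev):
    """Number of standard deviations that x sits above (+) or below (-) the mean."""
    if stdev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / stdev


z_low = standardize(0, 5.3, 2.67)   # NVL return of 0%
z_high = standardize(7, 5.3, 0.95)  # VCB return of 7%
print(round(z_low, 2), round(z_high, 2))
```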
Outliers
• Skewness may indicate the presence of outliers.
• No standard definition of what constitutes an outlier.
• Several good rules of thumb are:
◦ Z-scores greater than +3 or less than −3
◦ Extreme outliers: more than 3*IQR to the left of Q1 or right of Q3
◦ Mild outliers: between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3
◦ Visual: how an individual data point sits relative to the rest of the data
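The IQR rules of thumb can be sketched as follows. The data are hypothetical, the quartiles are computed with Python's exclusive method (assumed to correspond to QUARTILE.EXC), and the function name `classify_outliers` is made up for this example.

```python
import statistics


def classify_outliers(data):
    """Label each value 'none', 'mild' or 'extreme' using the IQR rules."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    labels = []
    for x in data:
        # Distance outside the Q1..Q3 box (negative if x is inside it).
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("mild")
        else:
            labels.append("none")
    return labels


# Hypothetical data: Q1 = 12, Q3 = 18, so IQR = 6.
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 30, 60]
for value, label in zip(values, classify_outliers(values)):
    if label != "none":
        print(value, label)
```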
Outliers: Remove or not?
• Whether to remove outliers is contentious and depends on the context
◦ Consider income or wealth inequality issues: here we definitely do not remove (mild) outliers.
◦ But if we assess whether education affects income, then it is reasonable to remove outliers, and extreme outliers should definitely be removed.
Excel’s add-in: Toolpak vs RealStatistics
Outlier analysis

Visual approach (plot not reproduced): “This value stands out a little.”

Z-score approach: =STANDARDIZE(x, mean, standard deviation)

Annual Return (%)      Z-scores
NVL     VCB            NVL      VCB
0       4              -1.99    -1.37
2       4              -1.24    -1.37
5       5              -0.11    -0.32
5       5              -0.11    -0.32
5       5              -0.11    -0.32
6       5               0.26    -0.32
6       6               0.26     0.74
7       6               0.64     0.74
8       6               1.01     0.74
9       7               1.39     1.79

• NVL, first row: =STANDARDIZE(0, 5.3, 2.67)
• VCB, last row: =STANDARDIZE(7, 5.3, 0.95)

None of the observations are more than 3 standard deviations from the mean
Measures of dispersion
• Dispersion= Variation= Spread: refers to the degree of variation in the data; that is,
the numerical spread (or compactness) of the data.
Variance
◦ The average of all the squared deviations from the mean
◦ Very difficult and often meaningless to interpret on its own
◦ Affected by outliers
◦ Excel formula: =VAR.S(datarange)
◦ In ToolPak? Yes

Standard Deviation
◦ The square root of the variance
◦ Difficult to interpret on its own, expressed in the same unit of measurement as the variable of interest (e.g. dollars, metres)
◦ Affected by outliers
◦ Excel formula: =STDEV.S(datarange)
◦ In ToolPak? Yes

Coefficient of Variation
◦ The standard deviation relative to (divided by) the mean
◦ Useful for comparing variation across variables when means are different (e.g. between returns on different stocks)
◦ In ToolPak? No
Measures of dispersion
Range
◦ The difference between the maximum and minimum values in the data
◦ Affected by outliers
◦ Excel formula: =MAX(datarange) - MIN(datarange)
◦ In ToolPak? Yes

Interquartile Range (IQR)
◦ The range of the middle 50% of the data
◦ Calculated as Quartile 3 minus Quartile 1
◦ Not affected by outliers
◦ Excel formula: =QUARTILE.EXC(datarange,3) - QUARTILE.EXC(datarange,1)
◦ In ToolPak? No

Percentile
◦ The position in the dataset where p% of observations are below and (100-p)% are above
◦ More extreme percentiles are affected by outliers
◦ Most common percentiles are quartiles (i.e. 25th, 50th, 75th percentiles) or deciles (i.e. 10th, 20th,…, 90th percentiles)
◦ Excel formula: =PERCENTILE.EXC(datarange, percentile). Make sure you put the percentile in as a fraction (e.g. 20th percentile is 0.2)
◦ In ToolPak? No
Three main goals

• Measures of central tendency
• Measures of dispersion and shape
• Measures of association
Real-world questions
• Is it true that…
◦ bottled water sales increase as temperature increases?
◦ older houses are worth less?
◦ those who earn more consume more?
• We can gain insights by looking at measures of association:
covariance and correlation
Using Bottledwater Data
Measures of association
• Covariance measures the direction of a relationship between two quantitative variables.
• Correlation measures both the direction and strength of the relationship between two quantitative
variables.
• A plot can be used to gauge correlation by looking at how close the data points sit to the line of best fit.
Linear or Non-Linear Relationship
Measures of Association
• Two variables have a strong statistical relationship with one another
if they appear to move together.
• When two variables appear to be related, you might suspect a
cause-and-effect relationship.
• Sometimes, however, statistical relationships exist even though a
change in one variable is not caused by a change in the other.
Measures of Association: Covariance
• Covariance is a measure of the linear association between two variables, X and Y. Like
the variance, different formulas are used for populations and samples.

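• Population covariance:
$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
◦ Excel function: =COVARIANCE.P(array1,array2)

• Sample covariance:
$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
◦ Excel function: =COVARIANCE.S(array1,array2)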

• The covariance between X and Y is the average of the product of the deviations of each
pair of observations from their respective means.
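A minimal sketch of the "average product of deviations" idea, covering both the sample (n − 1) and population (N) divisors. The temperature and sales numbers are invented for illustration, and `covariance` is a hypothetical helper rather than a library call.

```python
def covariance(xs, ys, sample=True):
    """Average product of paired deviations from the two means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data: daily temperature and bottled water sales.
temperature = [20, 25, 30, 35]
sales = [100, 120, 150, 170]

print(covariance(temperature, sales, sample=True))   # like =COVARIANCE.S
print(covariance(temperature, sales, sample=False))  # like =COVARIANCE.P
```

A positive value only says the two series tend to move in the same direction; its size depends on the units, which is why correlation is used for strength.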
Measures of Association: Correlation
• Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend
on the units of measurement.
• Correlation is measured by the correlation coefficient, also known as the Pearson product moment
correlation coefficient.
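• Correlation coefficient for a population:
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$, where $\sigma_{xy}$ is the population covariance and $\sigma_x$, $\sigma_y$ are the population standard deviations

• Correlation coefficient for a sample:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations

• The correlation coefficient is scaled between -1 and 1.

• Excel function: =CORREL(array1,array2)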
Examples of Correlation
Notes on the CORREL Function
• When using the CORREL function, it does not matter if the data represent
samples or populations. In other words,

CORREL(array1,array2) =
COVARIANCE.P(array1,array2) / (STDEV.P(array1)*STDEV.P(array2))

and

CORREL(array1,array2) =
COVARIANCE.S(array1,array2) / (STDEV.S(array1)*STDEV.S(array2))
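The reason the sample/population distinction drops out is that the divisor (n − 1 or N) appears once in the covariance and once in the product of the two standard deviations, so it cancels. A sketch under that reading, with invented data and a hypothetical `pearson_r` helper:

```python
import math


def pearson_r(xs, ys, sample=True):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    divisor = n - 1 if sample else n
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / divisor
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / divisor)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / divisor)
    return cov / (sd_x * sd_y)


# Hypothetical data: daily temperature and bottled water sales.
temperature = [20, 25, 30, 35]
sales = [100, 120, 150, 170]

r_sample = pearson_r(temperature, sales, sample=True)
r_population = pearson_r(temperature, sales, sample=False)
print(round(r_sample, 3), round(r_population, 3))  # the two values agree
```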
Excel Correlation Tool

Data > Data Analysis > Correlation

• Excel computes the correlation coefficient between all pairs of variables in the Input Range. Input Range data must be in contiguous columns.
Excel’s ToolPak add-in for multiple variables
• Data > Data Analysis > Correlation
• Can also use =CORREL(datarange1, datarange2)
• The function for covariance is =COVARIANCE.S(datarange1, datarange2)
• The Real-Statistics add-in allows analysis of only two variables.
Interpreting Correlation Coefficient
• Direction of the relationship: positive r Interpretation
or negative
0 No relationship
• Strength of the relationship: no,
weak, moderate, strong, very strong, < 0.3 Weak
perfect.
0.3 - 0.7 Moderate
• For example:
◦ Correlation of 0.4 indicates a moderate and Strong
positive linear relationship
> 0.7
◦ Correlation of -0.72 indicates a strong and Perfect relationship
negative linear relationship 1
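The interpretation table can be turned into a small lookup. This is only a sketch: the function name is hypothetical, and treating exactly 0.3 and 0.7 as "moderate" is an assumption, since the slide does not say which band the boundary values belong to.

```python
def interpret_correlation(r):
    """Describe strength and direction of a correlation coefficient."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    size = abs(r)
    if size == 0:
        return "no relationship"
    if size < 0.3:
        strength = "weak"
    elif size <= 0.7:
        strength = "moderate"
    elif size < 1:
        strength = "strong"
    else:
        strength = "perfect"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(interpret_correlation(0.4))    # the slide's first example
print(interpret_correlation(-0.72))  # the slide's second example
```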
A word of caution…
• When two variables appear to be related,
you might suspect a cause-and-effect
relationship.
• Sometimes, however, statistical
relationships exist even though a change
in one variable is not caused by a change
in the other.
• Correlation does NOT imply CAUSATION
◦ More on this in week 6
Summaries
• Key descriptive statistics, dispersion, and association
◦ What are they?
◦ Their meanings, pros and cons.
◦ How to calculate these in Excel.
◦ How to apply these metrics in analysis.
