Topic 4 Descriptive Statistics
• The proportion, p, is the percentage of observations that have a certain characteristic
Alum Sheeting 8
Durrable Products 13
Fast-Tie Aerospace 15
Hulkey Fasteners 15
Manley Valve 11
Pylon Accessories 5
• Mean: total value of all observations / number of observations
◦ =AVERAGE(datarange)
• Median: the middle observation, after ordering the data from smallest to largest
◦ =MEDIAN(datarange)
• Mode: the most frequently occurring observation
◦ =MODE.MULT(datarange)
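A minimal Python sketch of the same three measures, using the six values listed beside the company names earlier in this topic (what they count is not stated here). The `statistics` functions are assumed to correspond only roughly to the Excel formulas; in particular, `multimode` returns every most-frequent value as a list.

```python
import statistics

# The six values listed beside the company names earlier in this topic.
counts = [8, 13, 15, 15, 11, 5]

mean = statistics.mean(counts)        # comparable to =AVERAGE(datarange)
median = statistics.median(counts)    # comparable to =MEDIAN(datarange)
modes = statistics.multimode(counts)  # comparable to =MODE.MULT(datarange)

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"modes  = {modes}")
```

With an even number of observations, the median is the average of the two middle values after sorting.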
Mean vs Median vs Mode
• Mean is often used for quantitative data unless outliers
exist or data is skewed.
• Median is often used in conjunction with the mean
since it is not affected by outliers. Comparing mean
with median gives us an idea of skewness.
• Mode is mainly used for qualitative data, rarely used for
numerical data. There may be no mode, multiple
modes, or the mode may not be close to the centre of
the data.
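The effect of an outlier on the mean and the median can be sketched as follows. The income figures are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical incomes (in $1000s); 500 is a deliberate outlier.
incomes = [40, 45, 50, 55, 60]
with_outlier = incomes + [500]

for label, data in [("without outlier", incomes), ("with outlier", with_outlier)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    # A mean well above the median hints at right (positive) skew;
    # a mean well below the median hints at left (negative) skew.
    print(f"{label}: mean = {mean:.1f}, median = {median:.1f}")
```

The outlier pulls the mean far more than the median, which is why comparing the two gives an idea of skewness.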
Individual assessment
• Consider investment in stock markets. For each stock,
◦ Daily return = closing price – opening price
◦ Over a year (a period) = price on the ending date – price on the beginning date
◦ Returns = price when re-selling – price on purchase
https://ourworldindata.org/global-economic-inequality
“income inequality in Australia/Vietnam
has been increasing recently”
• What would you show in your analysis?
◦ Think of a specific context:
◦ (global vs) national vs state level
◦ by socio-economic demographic factors: gender, ethnicity, skills, education & qualification,
efforts etc.
◦ Think of a specific data set
◦ entire population vs income groups
◦ Think of specific measures (metrics/ indicators/ variables)
◦ mean – median – mode – skewness etc.
Measures of variation
Dispersion = Variation = Spread: refers to the degree of variation in the data
Five key measures:
1. Range
2. Interquartile Range
3. Percentiles
4. Standard deviation
5. Coefficient of variation
What can we say about the variation in income?
Range and Interquartile Range
• Range: the difference between the minimum and maximum value in the data – sensitive to
outliers
• Interquartile Range: the range of the middle 50% of the data – the difference between the
third quartile and first quartile in the data (Q3 minus Q1) – not sensitive to outliers
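A short sketch comparing how the two measures react to an outlier. The data are hypothetical, and the helper name `spread` is introduced here for illustration. The `method="exclusive"` option of `statistics.quantiles` is assumed to follow the same convention as Excel's `.EXC` functions; other quartile conventions give slightly different values.

```python
import statistics

def spread(data):
    """Return (range, interquartile range) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return max(data) - min(data), q3 - q1

base = [5, 8, 11, 13, 15, 15]   # hypothetical sample
with_outlier = base + [100]     # same sample plus one extreme value

print("base:        ", spread(base))
print("with outlier:", spread(with_outlier))
```

Adding the extreme value multiplies the range several times over, while the interquartile range barely moves.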
Percentiles
1. Interpretation of a percentile: the Xth percentile means that X% of the sample lies below it and (100-X)% of the sample lies above it
2. Example: here the 10th percentile is 12990, so 10% of Australian taxpayers have an income less than or equal to 12990, and 90% of Australian taxpayers have an income higher than 12990
• =PERCENTILE.EXC(datarange, percentile)
◦ Make sure you put the percentile in as a fraction
(e.g. 20th percentile is 0.2)
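One way to see what an "exclusive" percentile does is to compute it by hand. The sketch below assumes the usual exclusive convention (rank = p × (n + 1), with linear interpolation between neighbouring observations); the function name `percentile_exc` and the data are hypothetical, and this is not a guaranteed reproduction of Excel's output in every edge case.

```python
import math

def percentile_exc(data, p):
    """Exclusive percentile: p is a fraction (0.2 for the 20th percentile)."""
    xs = sorted(data)
    n = len(xs)
    rank = p * (n + 1)               # 1-based position in the sorted data
    if rank < 1 or rank > n:
        raise ValueError("percentile is out of range for this sample size")
    lower = math.floor(rank)
    frac = rank - lower
    if lower == n:                   # rank falls exactly on the largest value
        return xs[-1]
    # Interpolate between the observations at positions lower and lower + 1.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [8, 13, 15, 15, 11, 5]        # hypothetical sample
p20 = percentile_exc(data, 0.2)
p50 = percentile_exc(data, 0.5)
print(f"20th percentile = {p20:.2f}, 50th percentile = {p50:.2f}")
```

Passing 20 instead of 0.2 raises an error here, which mirrors the reminder to enter the percentile as a fraction.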
Example: Gender pay gap
• If asked to use data to show current trends in the gender pay gap, what would you show?
• Consider GPG =
• https://data.wgea.gov.au/home
Standard deviation
• Difficult to interpret on its own, but assuming the data is approximately bell-shaped (normally distributed):
◦ About 68% of observations lie within ± 1 standard deviation of the mean
◦ About 95% of observations lie within ± 2 standard deviations of the mean
◦ About 99.7% of observations lie within ± 3 standard deviations of the mean
=STDEV.S(datarange)
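A minimal sketch of the sample standard deviation and the ± k standard deviation intervals. The data are hypothetical, and a sample this small is not bell-shaped, so the observed shares need not match the 68 / 95 / 99.7 figures.

```python
import statistics

data = [8, 13, 15, 15, 11, 5]   # hypothetical sample

mean = statistics.mean(data)
s = statistics.stdev(data)      # sample standard deviation, comparable to =STDEV.S(datarange)

for k in (1, 2, 3):
    low, high = mean - k * s, mean + k * s
    # Share of observations that fall inside mean ± k standard deviations.
    share = sum(low <= x <= high for x in data) / len(data)
    print(f"within ±{k} sd: [{low:.2f}, {high:.2f}] holds {share:.0%} of observations")
```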
Use the coefficient of variation to measure the volatility of a stock
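The coefficient of variation is conventionally the standard deviation divided by the mean, which makes spread comparable across stocks trading at different price levels. The sketch below uses hypothetical closing prices for two hypothetical stocks; the function name is also an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical closing prices over five days.
stock_a = [100, 102, 98, 101, 99]
stock_b = [20, 23, 17, 22, 18]

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)
print(f"CV of stock A = {cv_a:.2%}, CV of stock B = {cv_b:.2%}")
```

The two standard deviations here are of similar size in price units, yet stock B varies far more relative to its own price level, which is what the coefficient of variation captures.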
VCB
2 4 -1.24 -1.37
5 5 -0.11 -0.32
5 5 -0.11 -0.32
5 5 -0.11 -0.32
6 5 0.26 -0.32
6 6 0.26 0.74
8 6 6
0.64 1.01 0.74
• Population covariance:
• Sample covariance:
• The covariance between X and Y is the average of the product of the deviations of each
pair of observations from their respective means.
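In standard textbook notation (the symbols are an assumption, not taken from the slide), the verbal definition above can be written for a population of N pairs and a sample of n pairs as:

```latex
\sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}
\qquad
s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```

The sample version divides by n - 1 rather than n, which matches the usual distinction between COVARIANCE.S and COVARIANCE.P in Excel.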
Measures of Association: Correlation
• Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend
on the units of measurement.
• Correlation is measured by the correlation coefficient, also known as the Pearson product moment
correlation coefficient.
• Correlation coefficient for a population:
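In standard notation (symbols assumed), the correlation coefficient is the covariance divided by the product of the two standard deviations, for a population and for a sample respectively:

```latex
\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y}
\qquad
r_{XY} = \frac{s_{XY}}{s_X \, s_Y}
```

Because numerator and denominator carry the same units, the result is unit-free and lies between -1 and 1.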
CORREL(array1,array2) = COVARIANCE.P(array1,array2) / (STDEV.P(array1)*STDEV.P(array2))
and
CORREL(array1,array2) = COVARIANCE.S(array1,array2) / (STDEV.S(array1)*STDEV.S(array2))
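The two identities can be checked numerically. This is a sketch with hypothetical data and hand-written helper functions (the names `covariance` and `correlation` are assumptions), showing that the population and sample versions give the same correlation.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def covariance(xs, ys, sample=True):
    """Sum of products of deviations, divided by n-1 (sample) or n (population)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (n - 1 if sample else n)

def correlation(xs, ys, sample=True):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs, sample))  # covariance of xs with itself is its variance
    sd_y = math.sqrt(covariance(ys, ys, sample))
    return covariance(xs, ys, sample) / (sd_x * sd_y)

r_pop = correlation(x, y, sample=False)
r_samp = correlation(x, y, sample=True)
print(f"population version: {r_pop:.4f}, sample version: {r_samp:.4f}")
```

The divisor (n or n - 1) cancels between numerator and denominator, so both versions agree as long as the same convention is used throughout.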
Excel Correlation Tool
Data > Data Analysis > Correlation