Week 2 Cheat Sheet

Statistics and Data Analysis with Excel, Part 1

Charlie Nuttelman

Here, I provide the mathematical equations and some of the important Excel functions required to
perform various calculations in Week 2 of the course. The headings represent the screencasts in which
you will find those calculations and concepts. Not all screencasts are referenced below, just the ones
that have complex mathematical formulas or Excel formulas that are tricky to use.

Difference Between Population and Sample


The population size is denoted by 𝑁 and the sample size is denoted by 𝑛. Population mean (𝜇) can be
estimated using the sample mean (𝑥̅). Population variance (𝜎²) can be estimated using the sample
variance (𝑠²). Standard deviation is the square root of variance.

Formulas for these parameters are:


$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Population and sample mean (average) can be calculated using the AVERAGE function in Excel.
Population and sample variances can be calculated using the VAR.P and VAR.S functions, respectively,
and the population and sample standard deviations can be calculated using the STDEV.P and STDEV.S
functions, respectively. The COUNT function is useful in counting the number of observations.
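
For readers who want to check these quantities outside Excel, here is a minimal sketch using Python's standard `statistics` module. The data set is illustrative only, and the pairing of each Python call with an Excel function is an assumption based on the divisor each one uses (N for the population versions, n - 1 for the sample versions).

```python
import statistics

x = [1, 2, 3, 4, 5]  # illustrative data, not from the course

n = len(x)                        # number of observations (Excel: COUNT)
mean = statistics.mean(x)         # same formula for population and sample (Excel: AVERAGE)
var_p = statistics.pvariance(x)   # divides by N     (Excel: VAR.P)
var_s = statistics.variance(x)    # divides by n - 1 (Excel: VAR.S)
sd_p = statistics.pstdev(x)       # square root of var_p (Excel: STDEV.P)
sd_s = statistics.stdev(x)        # square root of var_s (Excel: STDEV.S)

print(n, mean, var_p, var_s, sd_p, sd_s)
```

The sample variance is always larger than the population variance for the same data, because it divides the same sum of squared deviations by n - 1 rather than N.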

The Summation Symbol


The summation symbol, Σ, or Greek letter sigma, is used as an indicator to sum over the expression that
follows the symbol. The integer below the symbol (typically written as some index variable equal to 1, or
another number) is the start value at which iteration and summation begin. Above the summation
symbol is the stop value, the number at which iteration and summation end.
For example, if x = {1, 2, 3, 4, 5}, then:

$$\sum_{i=1}^{5} x_i = x_1 + x_2 + \cdots + x_5 = 1 + 2 + 3 + 4 + 5 = 15$$

The summation symbol is used in the definition and calculation of average and variance (see the formulas above).
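
The summation above can be sketched as an explicit loop, which makes the roles of the start and stop values visible. The variable names `start`, `stop`, and `total` are chosen here for illustration.

```python
x = [1, 2, 3, 4, 5]

start, stop = 1, 5      # values written below and above the summation symbol
total = 0
for i in range(start, stop + 1):   # i runs from start to stop, inclusive
    total += x[i - 1]              # x_i is the i-th value; Python lists are 0-based

print(total)  # 15
```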

Descriptive Statistics
Another common measure of spread in a set of data is the range of the data, which is just the maximum
value in the data set minus the minimum value. We can calculate the maximum value of a set of data in
Excel using the MAX function and the minimum value using the MIN function; the range is simply the
difference between those two values.
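
As a quick illustration of the range calculation, the following sketch uses Python's built-in `max` and `min` in place of Excel's MAX and MIN. The data values are illustrative.

```python
data = [5, 9, 12, 14, 17, 18, 21, 22, 25]  # illustrative data

data_range = max(data) - min(data)  # 25 - 5
print(data_range)  # 20
```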

Skewness and kurtosis are sometimes used to describe the shape of a set of data when compared
to the normal distribution: skewness measures asymmetry, and kurtosis measures how heavy the tails are.
The SKEW and KURT functions in Excel can determine these parameters. For
more information on how to interpret these values, please visit support.microsoft.com.
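
The sketch below shows one way these two parameters can be computed by hand. It assumes the sample-adjusted formulas that Microsoft documents for SKEW and KURT (standardizing with the sample standard deviation); the function names are hypothetical, and the cheat sheet itself does not give these formulas.

```python
import statistics

def sample_skewness(x):
    """Sample-adjusted skewness (assumed to match Excel's SKEW); needs n >= 3."""
    n = len(x)
    mean = statistics.mean(x)
    s = statistics.stdev(x)  # sample standard deviation (n - 1 divisor)
    cubed = sum(((v - mean) / s) ** 3 for v in x)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(x):
    """Sample-adjusted excess kurtosis (assumed to match Excel's KURT); needs n >= 4."""
    n = len(x)
    mean = statistics.mean(x)
    s = statistics.stdev(x)
    fourth = sum(((v - mean) / s) ** 4 for v in x)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

x = [1, 2, 3, 4, 5]  # symmetric illustrative data
skew = sample_skewness(x)
kurt = sample_excess_kurtosis(x)
print(skew, kurt)
```

For this symmetric data set the skewness is zero up to floating-point rounding, and the negative kurtosis indicates lighter tails than a normal distribution.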

Quartiles and Percentiles


For either quartiles or percentiles, we first determine a rank, 𝑘, by using one of the formulas below. We
can either include the median or exclude the median (it is more common to exclude the median). The
parameter 𝑝 is the desired percentile; for quartiles, the first quartile is the same as the 25th percentile
(𝑝 = 0.25) and the third quartile is the 75th percentile (𝑝 = 0.75).

Including the median: 𝑘 = 𝑝 ∙ (𝑛 − 1) + 1

Excluding the median: 𝑘 = (𝑛 + 1) ∙ 𝑝

Once we have the rank, we can linearly interpolate between ordered values in our data. For example, if
our (ordered) data is: 5, 9, 12, 14, 17, 18, 21, 22, 25 (𝑛 = 9) and we wish to find the first quartile
including the median, we would calculate the rank as 𝑘 = 0.25 ∙ (9 − 1) + 1 = 3. Therefore, the first
quartile in this case is 12. Similarly, the third quartile would be calculated to be 21 (𝑘 = 7).

For the same data set, if we wished to find the first quartile excluding the median, we would calculate
the rank as 𝑘 = (9 + 1) ∙ 0.25 = 2.5. Therefore, we linearly interpolate 50% of the way between the 2nd
and 3rd values of the ordered data, and the first quartile is 9 + 0.5 × (12 − 9) = 10.5. Similarly, the third
quartile would be calculated to be 21.5 (𝑘 = 7.5).

Percentiles are calculated in exactly the same way, but 𝑝 can be any continuous value between 0 and 1. For
example, for the above data set, if we wanted to calculate the median-excluded 13th percentile, we
calculate the rank: 𝑘 = (9 + 1) ∙ 0.13 = 1.3. The 13th percentile is then 30% of the way between the 1st
and 2nd of the ordered values = 5 + 0.3 × (9 − 5) = 6.2.
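
The rank-and-interpolate procedure described above can be sketched as a small Python function. The function name `percentile_by_rank` and its `include_median` flag are hypothetical names introduced for this illustration, and the function assumes the data are already sorted in ascending order.

```python
import math

def percentile_by_rank(ordered, p, include_median=True):
    """Percentile of already-sorted data via rank k and linear interpolation."""
    n = len(ordered)
    # Rank formulas from the text: including vs. excluding the median.
    k = p * (n - 1) + 1 if include_median else (n + 1) * p
    if not 1 <= k <= n:
        raise ValueError("rank k falls outside the data for this p")
    lower = math.floor(k)      # 1-based position at or just below the rank
    fraction = k - lower       # how far to move toward the next ordered value
    if fraction == 0:
        return float(ordered[lower - 1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

data = [5, 9, 12, 14, 17, 18, 21, 22, 25]

q1_inc = percentile_by_rank(data, 0.25, include_median=True)
q3_inc = percentile_by_rank(data, 0.75, include_median=True)
q1_exc = percentile_by_rank(data, 0.25, include_median=False)
q3_exc = percentile_by_rank(data, 0.75, include_median=False)
p13_exc = percentile_by_rank(data, 0.13, include_median=False)

print(q1_inc, q3_inc, q1_exc, q3_exc, p13_exc)
```

Up to floating-point rounding, the results reproduce the worked values in the text: 12 and 21 when including the median, 10.5 and 21.5 when excluding it, and 6.2 for the median-excluded 13th percentile. Note that the excluding formula can produce a rank below 1 or above 𝑛 for extreme values of 𝑝, which this sketch reports as an error.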

In Excel, we can use the QUARTILE(data,q), QUARTILE.INC(data,q), and QUARTILE.EXC(data,q) functions to
calculate quartiles, where q = 1 for the 1st quartile and q = 3 for the 3rd quartile. We can use the
PERCENTILE(data,p), PERCENTILE.INC(data,p), and PERCENTILE.EXC(data,p) functions in Excel to
calculate the 100pth percentile (for example, for the 67th percentile p would be 0.67). The .INC versions
include the median and the .EXC versions exclude it; QUARTILE and PERCENTILE without a suffix behave
like the .INC versions.

Histograms
The best way to visualize the distribution of univariate data is with a histogram. In a histogram,
the data are sorted into “bins” of constant width and the frequencies of each bin are plotted as a column
chart. We typically estimate a lower bound and an upper bound for the number of bins:

$$n_{\text{bins,lower}} = \mathrm{INT}\left(\mathrm{LOG}_2(n)\right) - 1$$

$$n_{\text{bins,upper}} = \sqrt{n}$$ (typically rounded to the nearest integer)

Here, 𝑛 is the number of observations or experimental measurements. I like to choose the actual
number of bins to be somewhere between the lower and upper estimates for number of bins.
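
The two bin-count estimates can be sketched in Python as follows. The function name `bin_count_bounds` is hypothetical, and the sketch assumes that INT means rounding down, as Excel's INT function does.

```python
import math

def bin_count_bounds(n):
    """Return (lower, upper) estimates for the number of histogram bins."""
    lower = math.floor(math.log2(n)) - 1   # INT(LOG2(n)) - 1, with INT rounding down
    upper = round(math.sqrt(n))            # square root of n, rounded to nearest integer
    return lower, upper

lower, upper = bin_count_bounds(100)
print(lower, upper)  # 5 10
```

For 100 observations, any bin count from about 5 to 10 would fall inside the suggested window.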

Excel’s histogram tool (Data → Data Analysis → Histogram) is great for parsing the data into the bins,
but the user must provide the bin boundaries.
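
To illustrate what sorting data into bins involves, here is a rough sketch of counting values against user-supplied bin boundaries. It assumes each boundary is an upper limit (a value is counted in the first bin whose boundary is greater than or equal to it) and that anything above the last boundary goes into a final overflow bin; the function name `bin_counts` and the boundaries are made up for this example.

```python
from bisect import bisect_left

def bin_counts(values, upper_bounds):
    """Count values per bin; upper_bounds must be sorted in ascending order.

    The returned list has one extra entry at the end for values
    larger than the last boundary.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # Index of the first boundary that is >= v.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

data = [5, 9, 12, 14, 17, 18, 21, 22, 25]
boundaries = [10, 15, 20, 25]   # illustrative bin boundaries of constant width

counts = bin_counts(data, boundaries)
print(counts)  # [2, 2, 2, 3, 0]
```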
