TUT1

This document provides an overview of descriptive statistics concepts including: 1) Types of data and variables. 2) Measures of central tendency like mean, median, and mode. 3) Additional measures like percentiles and how they are calculated. 4) Concepts related to the shape of a distribution like symmetry, skewness, and how they are determined. 5) Measures of spread such as range, interquartile range, variance, and standard deviation.

Uploaded by

makabigail
Copyright
© All Rights Reserved

STAT1012 Statistics for Life Sciences

Tutorial 1 --- Descriptive Statistics

1) Types of Data:

Categorical:
• Variables which only identify which category an observation belongs to.
• No math. operations (e.g. + - * / < >) can be applied on them.

Quantitative:
• Variables which take numeric values, have magnitude and inherit some of the math. operations.
• Can be summarized by measures of location and spread.

   Discrete:
   • Only take certain values.
   • Possible values are separated.

   Continuous:
   • Take any value.
   • Possible values form interval(s).

2) Measures of Location: (Central Tendency)


Numbers which summarize the “center” of a sample.

Measure: Sample Mean
   Definition: The sum of all observations divided by the total number of observations: $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$
   Advantages: 1. Takes into account every single observation. 2. Easy to compute.
   Disadvantages: 1. It is sensitive to outliers. 2. Does not represent the location of the majority of sample points.
   Translation: If every observation increases by c, the sample mean increases by c.
   Rescaling: If every observation changes from $X_i$ to $cX_i$, the sample mean changes from $\frac{1}{n}\sum_{i=1}^{n} X_i$ to $\frac{c}{n}\sum_{i=1}^{n} X_i$.

Measure: Mode
   Definition: The value which has the greatest number of occurrences.
   Advantages: 1. Easy to interpret. 2. Not sensitive to outliers.
   Disadvantages: Not useful to describe the majority of points.
   Translation: Same as Sample Mean.
   Rescaling: Same as Sample Mean.

Measure: Median
   Definition: Suppose n is the sample size. Sort the data. When n is odd, the median is the ((n+1)/2)th smallest observation. When n is even, the median is the average of the (n/2)th and (n/2+1)th smallest observations.
   Advantages: Insensitive to outliers.
   Disadvantages: Dominated by observations "in the center".
   Translation: Same as Sample Mean.
   Rescaling: Same as Sample Mean.

Notes: 1) Mode may not be unique.

2) Sample Mean has a nice property that it can be updated sequentially: $\bar{X}_n = [(n-1)\bar{X}_{n-1} + X_n]/n$. This means we do not need to know all of $X_1, \ldots, X_n$ to find $\bar{X}_n$; we only need $\bar{X}_{n-1}$ and $X_n$.
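
The sequential update in Note 2 can be sketched in Python as below. The function name `update_mean` and the data values are illustrative assumptions, not part of the tutorial; the loop only ever stores the previous mean and the newest observation.

```python
def update_mean(prev_mean, new_value, n):
    """Mean of n observations, given the mean of the first n - 1
    observations (prev_mean) and the n-th observation (new_value)."""
    return ((n - 1) * prev_mean + new_value) / n


data = [2, 4, 4, 5, 10]  # hypothetical observations

running_mean = 0.0  # placeholder; it is multiplied by (n - 1) = 0 at n = 1
for n, x in enumerate(data, start=1):
    running_mean = update_mean(running_mean, x, n)

direct_mean = sum(data) / len(data)  # mean computed from all observations
print(running_mean, direct_mean)     # the two agree up to rounding
```

The same update is what lets a running average be maintained over a data stream without keeping the earlier observations.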

3) Measures of Location: (Percentile)

The pth percentile is a value such that it is (approximately) greater than p% of the observations.

(Denoted as: $V_{p/100}$)

Calculation Formula:

1. Sort the data from the smallest to the largest.


2. If np/100 is an integer, then the pth percentile is the average of the (np/100)th and (np/100+1)th smallest data points.
If np/100 is not an integer, then let k = the smallest integer > np/100; the pth percentile is the kth smallest data point.
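
The two-step rule above can be written as a short Python function. This is a minimal sketch: the name `percentile` is chosen for illustration, p is assumed to be an integer with 0 < p < 100, and the data values are made up.

```python
def percentile(data, p):
    """pth percentile by the tutorial's rule (p an integer, 0 < p < 100)."""
    xs = sorted(data)              # step 1: sort from smallest to largest
    n = len(xs)
    if (n * p) % 100 == 0:         # np/100 is an integer
        k = n * p // 100
        # average of the kth and (k+1)th smallest (lists are 0-indexed)
        return (xs[k - 1] + xs[k]) / 2
    k = n * p // 100 + 1           # smallest integer greater than np/100
    return xs[k - 1]               # the kth smallest data point


data = [3, 1, 4, 1, 5, 9, 2, 6]    # hypothetical sample, n = 8
print(percentile(data, 25))        # 1.5 (average of 2nd and 3rd smallest)
print(percentile(data, 50))        # 3.5 (average of 4th and 5th smallest)
print(percentile(data, 90))        # 9   (8th smallest, since 7.2 rounds up to 8)
```

Integer arithmetic is used for the "is np/100 an integer" check so that floating-point rounding cannot change which branch is taken.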

Notes:

1. Median = $V_{50/100}$
2. Upper Quartile/Third quartile/Q3 = $V_{75/100}$, Lower Quartile/First quartile/Q1 = $V_{25/100}$
3. Interquartile Range(IQR) = Q3 - Q1
4. Rescaling/Translation of the quartile.

4) Symmetry and Skewness

Definitions:

1. Symmetric: the left- and right-hand sides of the distribution are mirror images of each other.
2. Unimodal: Mode is unique/One clear peak.
3. Left-skewed: the mean of the distribution is to the left of the mode.
   • “Mass” concentrated to the right
   • Mean < Median
   • Has a few extremely low values
   Right-skewed: the mean of the distribution is to the right of the mode.
   • “Mass” concentrated to the left
   • Mean > Median
   • Has a few extremely large values

Notes: 1) Symmetric implies Mean = Median. (Converse not true in general)

2) Symmetric + Unimodal implies Mean = Mode = Median. (Converse not true in general)
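
To see why the converse in Note 1 fails, a small hypothetical sample is enough: its mean and median coincide, yet it is not symmetric. The helper `is_symmetric` is an illustrative check written for this sketch (it tests whether the sorted values mirror each other around a common center), not a standard library function.

```python
from statistics import mean, median


def is_symmetric(data):
    """True if the sorted sample is a mirror image of itself: the ith
    smallest and ith largest values always add up to the same total."""
    xs = sorted(data)
    total = xs[0] + xs[-1]
    return all(lo + hi == total for lo, hi in zip(xs, reversed(xs)))


data = [0, 3, 4, 4, 9]             # hypothetical sample
print(mean(data), median(data))    # 4 4
print(is_symmetric(data))          # False: 0 + 9 = 9 but 3 + 4 = 7
print(is_symmetric([1, 2, 3, 4, 5]))  # True
```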

5) Measures of Spread

Numbers which describe how large the fluctuations or variability between observations are.

Measure: Range
   Definition: Maximum - Minimum
   Advantages: Easy to compute and interpret.
   Disadvantages: 1. Extremely sensitive to outliers. 2. Range depends on the data size n and tends to increase with it.
   Translation: Unchanged if every observation increases by c.
   Rescaling: If every observation changes from $X_i$ to $cX_i$, the range is multiplied by c (by |c| if c is negative).

Measure: Interquartile Range
   Definition: $V_{75/100} - V_{25/100}$, which means 75th percentile - 25th percentile.
   Advantages: Not sensitive to outliers.
   Disadvantages: Only determined by certain percentiles.
   Translation: Unchanged.
   Rescaling: Same as Range.

Measure: Sample Variance
   Definition: $\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ or, equivalently, $\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$
   Advantages: Takes into account all observations' deviations from the mean.
   Disadvantages: Quite sensitive to outliers.
   Translation: Unchanged.
   Rescaling: If every observation changes from $X_i$ to $cX_i$, the sample variance is multiplied by $c^2$.

Measure: Sample Standard Deviation
   Definition: $\sqrt{\text{Sample Variance}}$
   Advantages: Same as Sample Variance.
   Disadvantages: Same as Sample Variance.
   Translation: Unchanged.
   Rescaling: If every observation changes from $X_i$ to $cX_i$, the sample standard deviation is multiplied by c (by |c| if c is negative).
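
The translation and rescaling behaviour of these measures can be checked numerically. The sketch below uses Python's standard `statistics` module (`variance` and `stdev` use the n - 1 divisor); the data, the constant `c` and the helper names are assumptions made for illustration.

```python
from statistics import mean, stdev, variance


def sample_range(data):
    return max(data) - min(data)


def variance_shortcut(data):
    """Second form of the sample variance: (sum of X_i^2 - n * mean^2) / (n - 1)."""
    n = len(data)
    return (sum(x * x for x in data) - n * mean(data) ** 2) / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 10.0]   # hypothetical sample
c = 3.0
shifted = [x + c for x in data]     # translation: every observation + c
scaled = [c * x for x in data]      # rescaling: every observation * c

print(variance(data), variance_shortcut(data))    # both forms give 9.0
print(sample_range(shifted), variance(shifted))   # unchanged: 8.0 and 9.0
print(sample_range(scaled))                       # 24.0 = c * 8.0
print(variance(scaled), stdev(scaled))            # 81.0 = c^2 * 9.0, 9.0 = c * 3.0
```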

6) Graphical Methods:
⚫ Bar Graph:
It is applied to categorical data, showing the number of observations in each category.
⚫ Histogram:
It is applied to quantitative data (otherwise nearly the same as a Bar graph), showing the number of observations in each interval (bin) of values.
⚫ Stem and leaf plot:
Easy to locate the Median and the percentiles of the data (It can also be viewed as a histogram)
⚫ Boxplot:
- Shows five numbers (minimum, 25th percentile, median, 75th percentile, maximum) and any outliers
- Outliers are values > 75th percentile + 1.5 IQR or < 25th percentile – 1.5 IQR
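
The numbers behind a boxplot can be computed as in the sketch below. It assumes the percentile rule from section 3 (reimplemented here so the block is self-contained, with p an integer between 0 and 100 exclusive); the function names and the data are hypothetical.

```python
def percentile(data, p):
    """pth percentile by the tutorial's rule (p an integer, 0 < p < 100)."""
    xs = sorted(data)
    n = len(xs)
    if (n * p) % 100 == 0:                  # np/100 is an integer
        k = n * p // 100
        return (xs[k - 1] + xs[k]) / 2
    return xs[n * p // 100]                 # kth smallest, k = floor(np/100) + 1


def boxplot_summary(data):
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr            # values below this are outliers
    upper_fence = q3 + 1.5 * iqr            # values above this are outliers
    return {
        "min": min(data),
        "q1": q1,
        "median": percentile(data, 50),
        "q3": q3,
        "max": max(data),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


data = [1, 2, 2, 3, 3, 4, 4, 5, 20]         # hypothetical sample, n = 9
print(boxplot_summary(data))
# q1 = 2, median = 3, q3 = 4, IQR = 2, fences at -1 and 7, so 20 is an outlier
```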

7) Useful Excel worksheet functions:
Suppose 7 numbers are entered into cells A1, A2, ..., A7 (A1:A7). The Excel worksheet functions are:

Mean =AVERAGE(A1:A7)
Median =MEDIAN(A1:A7)
Mode =MODE(A1:A7)
Lower Quartile =PERCENTILE(A1:A7,0.25)
Upper Quartile =PERCENTILE(A1:A7,0.75)
Maximum =MAX(A1:A7)
Minimum =MIN(A1:A7)
Standard Deviation =STDEV(A1:A7)
Variance =VAR(A1:A7)
Coefficient of Variation =STDEV(A1:A7)/AVERAGE(A1:A7)
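
For readers working outside Excel, roughly the same summaries are available in Python's standard `statistics` module. This is a sketch with made-up data standing in for cells A1:A7; like Excel's STDEV and VAR, `stdev` and `variance` use the n - 1 divisor. Quartiles are left out because software packages use differing percentile conventions.

```python
from statistics import mean, median, mode, stdev, variance

values = [4, 8, 6, 5, 3, 8, 9]          # hypothetical contents of A1:A7

summary = {
    "Mean": mean(values),               # =AVERAGE(A1:A7)
    "Median": median(values),           # =MEDIAN(A1:A7)
    "Mode": mode(values),               # =MODE(A1:A7)
    "Maximum": max(values),             # =MAX(A1:A7)
    "Minimum": min(values),             # =MIN(A1:A7)
    "Standard Deviation": stdev(values),     # =STDEV(A1:A7)
    "Variance": variance(values),            # =VAR(A1:A7)
    "Coefficient of Variation": stdev(values) / mean(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```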

8) Install Data Analysis Add-in in Excel:


File → Options → Add-Ins → Go…→ Tick “Analysis ToolPak” and “Analysis ToolPak – VBA” → OK
After that, click “Data” and you would be able to find an extra option named “Data Analysis”
9) Useful links for Excel:
Creating a Histogram In Excel 2010
http://www.youtube.com/watch?v=RyxPp22x9PU
Making a column graph using excel 2010
http://www.youtube.com/watch?v=Y2U0bmo91ys&feature=related
Install Data Analysis Add-in
http://www.youtube.com/watch?v=nfv1z2ko6jk
10) Exercises:
1. Find mean, mode, median, upper quartile and lower quartile of the following data: 5 6 4 8 6 4 5 4 6 7 5 1 5 4 1

2. Determine whether the variables should be treated as i) Categorical/Quantitative ii) Discrete or Continuous

a) Time required for an athlete to complete 100m race


b) Time required for an athlete to complete 100m race measured by a stop watch.
c) Colors of a spectrum
d) $X_1, X_2, \ldots, X_n$, where $X_i = 1$ if the i-th subject is male, and $X_i = 0$ otherwise
3. For each of the data sets below (10 data points each), determine if the sample mean, the median and the mode are good as measures of location.
Data Set A: Age (in years) of males at first marriage: 23, 23, 25, 28, 29, 30, 31, 32, 35, 70
Data Set B: Number of children in a family: 0, 1, 1, 2, 2, 2, 2, 3, 3, 5
Data Set C: 1 = Student taking STAT1012, 0 = Student not taking STAT 1012
0, 0, 1, 1, 1, 1, 1, 1, 1, 1

4. Plot the boxplot of following data set.


0, 0, 0, 0, 0, 0, 0, 1, 1, 10

5. $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} x_i^2$, $\bar{x}_n$, $\sum_{i=1}^{n}(2x_i + 10)^2$, $s^2$, $\{\bar{x}_{n-1}, x_n\}$


Example:
a. If the mean of five values is 8.2 and four of the values are 6, 10, 7, and 12, find the fifth value.
b. If the sample mean of 4 data points is 12, and a new data point is 14, find the new sample mean.
c. If $s^2 = 10$, $\sum_{i=1}^{n}(x_i - 1)^2 = 46$, $n = 4$, find $\sum_{i=1}^{n} x_i^2$ and $\bar{x}_n$
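
One possible way to work through the three examples, using only the definition of the sample mean, the sequential update formula from section 2 and the second form of the sample variance from section 5. Note that example c, as stated, appears to admit two solutions.

```latex
% a. The five values sum to 5 * 8.2 = 41, and the four known values sum to 35.
x_5 = 5(8.2) - (6 + 10 + 7 + 12) = 41 - 35 = 6

% b. Sequential update with n = 5, \bar{x}_4 = 12, x_5 = 14.
\bar{x}_5 = \frac{4(12) + 14}{5} = \frac{62}{5} = 12.4

% c. From the variance: (n-1)s^2 = \sum x_i^2 - n\bar{x}_n^2.
\sum_{i=1}^{4} x_i^2 - 4\bar{x}_4^2 = 3(10) = 30
% Expanding the given sum: \sum (x_i - 1)^2 = \sum x_i^2 - 2n\bar{x}_n + n.
\sum_{i=1}^{4} x_i^2 - 8\bar{x}_4 + 4 = 46
% Subtracting the first equation from the second:
4\bar{x}_4^2 - 8\bar{x}_4 - 12 = 0
\;\Longrightarrow\; (\bar{x}_4 - 3)(\bar{x}_4 + 1) = 0
% Hence either
\bar{x}_4 = 3,\ \sum_{i=1}^{4} x_i^2 = 30 + 4(3)^2 = 66
\quad\text{or}\quad
\bar{x}_4 = -1,\ \sum_{i=1}^{4} x_i^2 = 30 + 4(-1)^2 = 34
```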

6. How to draw a stem and leaf plot and find the quartiles

n = 20
Q1: p = 25, $n \times p/100 = 5$ (an integer), so Q1 is the average of the 5th and 6th smallest data points:
$Q_1 = \frac{23 + 25}{2} = 24$
Q2 (Median): n = 20 is even, so the median is the average of the 10th and 11th smallest data points.
Q3: ...
