TUT1
Categorical variables:
• Variables which only identify which category an observation belongs to.
• No mathematical operations (e.g. + - * / < >) can be applied to them.
Quantitative variables:
• Variables which take numeric values, have magnitude and inherit some of the mathematical operations.
• Can be summarized by measures of location and spread.
Notes: 1) Mode may not be unique.
2) The sample mean has a nice property: it can be updated sequentially, i.e. $\bar{X}_n = [(n-1)\bar{X}_{n-1} + X_n]/n$. This means we do not need to know all of $X_1, \ldots, X_n$ to find $\bar{X}_n$; we only need $\bar{X}_{n-1}$ and $X_n$ (together with the count n).
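The sequential update in Note 2 can be sketched in a few lines of Python. This is only an illustration: the function name `update_mean` and the data values are hypothetical and do not come from the tutorial.

```python
def update_mean(prev_mean, x_new, n):
    """Return the mean of n observations, given the mean of the first
    n - 1 observations (prev_mean) and the n-th observation (x_new)."""
    return ((n - 1) * prev_mean + x_new) / n


# Hypothetical data, used only for illustration.
data = [4.0, 8.0, 6.0, 2.0]

# Feed the observations in one at a time, keeping only the running mean.
running_mean = 0.0
for n, x in enumerate(data, start=1):
    running_mean = update_mean(running_mean, x, n)

# Mean computed from all the data at once, for comparison.
direct_mean = sum(data) / len(data)
```

Both `running_mean` and `direct_mean` describe the same quantity, so they should agree up to floating-point rounding.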
The pth percentile is a value such that it is (approximately) greater than p% of the observations.
Calculation Formula:
Notes:
1. Median = $V_{50/100}$
2. Upper Quartile/Third quartile/Q3 = $V_{75/100}$, Lower Quartile/First quartile/Q1 = $V_{25/100}$
3. Interquartile Range (IQR) = Q3 - Q1
4. Rescaling/Translation of the quartile.
Definitions:
1. Symmetric: the shapes of the left and right hand sides of the distribution are mirror images of each other.
2. Unimodal: the mode is unique (one clear peak).
3. Left-skewed: the mean of the distribution is to the left of the mode. The “mass” is concentrated to the right, with a few extremely low values.
   Right-skewed: the mean of the distribution is to the right of the mode. The “mass” is concentrated to the left, with a few extremely large values.
2) Symmetric + Unimodal implies Mean = Mode = Median. (Converse not true in general)
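To see why the converse fails, it may help to look at one small data set in which mean, median and mode coincide although the data are not symmetric. The values below are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [0, 3, 3, 3, 4, 5]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# The data are symmetric about their centre only if reflecting every
# value through the centre gives back the same collection of values.
reflected = sorted(2 * mean - x for x in data)
is_symmetric = reflected == sorted(data)
```

Here mean, median and mode are all equal to 3, yet `is_symmetric` is `False`: the single low value 0 is balanced by the two values 4 and 5 rather than by a mirror image.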
5) Measures of Spread
Numbers which describe how large the fluctuations or variability between observations are.
⚫ Range:
- Definition: Maximum - Minimum
- Advantages: Easy to compute and interpret.
- Disadvantages: 1. Extremely sensitive to outliers. 2. Depends on the data size n and tends to increase with it.
- Translation: Unchanged if every observation increases by c.
- Rescaling: If every observation changes from $X_i$ to $cX_i$, the range is multiplied by |c|.
⚫ Interquartile Range:
- Definition: $V_{75/100} - V_{25/100}$, i.e. 75th percentile - 25th percentile
- Advantages: Not sensitive to outliers.
- Disadvantages: Only determined by certain percentiles.
- Translation: Unchanged.
- Rescaling: Same as Range.
⚫ Sample Variance:
- Definition: $\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ or, equivalently, $\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$
- Advantages: Takes into account all observations' deviations from the mean.
- Disadvantages: Quite sensitive to outliers.
- Translation: Unchanged.
- Rescaling: If every observation changes from $X_i$ to $cX_i$, the sample variance is multiplied by $c^2$.
⚫ Sample Standard Deviation:
- Definition: $\sqrt{\text{Sample Variance}}$
- Advantages: Same as Sample Variance.
- Disadvantages: Same as Sample Variance.
- Translation: Unchanged.
- Rescaling: If every observation changes from $X_i$ to $cX_i$, the sample standard deviation is multiplied by |c|.
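The two forms of the sample variance and the translation and rescaling behaviour can be checked numerically. The following is a minimal sketch: the function names, the data and the constant `c` are hypothetical, and a negative `c` is used on purpose to show that range and standard deviation scale by |c|.

```python
import math


def sample_variance(xs):
    """Sample variance from deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Sample variance from the sum of squares (second form)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


def sample_range(xs):
    return max(xs) - min(xs)


# Hypothetical data and constant, used only for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]
c = -3.0

shifted = [x + c for x in data]   # translation: X_i -> X_i + c
scaled = [c * x for x in data]    # rescaling:   X_i -> c * X_i

var = sample_variance(data)
sd = math.sqrt(var)
```

With these definitions, `sample_variance(shifted)` should equal `var`, `sample_variance(scaled)` should equal `c ** 2 * var`, and `sample_range(scaled)` should equal `abs(c) * sample_range(data)`, up to floating-point rounding.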
6) Graphical Methods:
⚫ Bar Graph:
It is applied to categorical data, showing the number of observations in each category.
⚫ Histogram:
It is applied to quantitative data and is otherwise similar to a bar graph, showing the number of observations in each class interval (bin).
⚫ Stem and leaf plot:
Easy to locate the Median and the percentiles of the data (It can also be viewed as a histogram)
⚫ Boxplot:
- Shows five numbers (minimum, 25th percentile, median, 75th percentile, maximum) and any outliers
- Outliers are values > 75th percentile + 1.5 × IQR or < 25th percentile - 1.5 × IQR
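The outlier rule for the boxplot can be sketched as follows. The quartiles are passed in as inputs, because their values depend on the percentile rule in use. The function name `boxplot_outliers`, the data and the quartile values are hypothetical.

```python
def boxplot_outliers(data, q1, q3):
    """Return the values lying beyond the 1.5 x IQR fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


# Hypothetical data, used only for illustration.
data = [1, 10, 12, 13, 15, 16, 18, 40]

# Quartiles assumed to be already known for this data set
# (here taken as Q1 = 11 and Q3 = 17, so IQR = 6).
outliers = boxplot_outliers(data, q1=11, q3=17)
```

With Q1 = 11 and Q3 = 17 the fences are 2 and 26, so the values 1 and 40 are flagged.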
7) Useful Excel worksheet functions:
Suppose 7 numbers are entered into cells A1, A2, …, A7 (A1:A7). The Excel worksheet functions are:
Mean =AVERAGE(A1:A7)
Median =MEDIAN(A1:A7)
Mode =MODE(A1:A7)
Lower Quartile =PERCENTILE(A1:A7,0.25)
Upper Quartile =PERCENTILE(A1:A7,0.75)
Maximum =MAX(A1:A7)
Minimum =MIN(A1:A7)
Standard Deviation =STDEV(A1:A7)
Variance =VAR(A1:A7)
Coefficient of Variation =STDEV(A1:A7)/AVERAGE(A1:A7)
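For readers working outside Excel, the same summaries can be sketched with Python's standard `statistics` module. The data values are hypothetical stand-ins for cells A1:A7. The mapping of `PERCENTILE` to `method="inclusive"` is an assumption based on both interpolating linearly between data points, and either may give quartiles that differ from a hand rule such as averaging two neighbouring observations.

```python
import statistics

# Hypothetical values standing in for cells A1:A7.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)          # =AVERAGE(A1:A7)
median = statistics.median(data)      # =MEDIAN(A1:A7)
mode = statistics.mode(data)          # =MODE(A1:A7)

# Assumed counterpart of =PERCENTILE(A1:A7,0.25) and =PERCENTILE(A1:A7,0.75).
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

maximum = max(data)                   # =MAX(A1:A7)
minimum = min(data)                   # =MIN(A1:A7)
std_dev = statistics.stdev(data)      # =STDEV(A1:A7), divisor n - 1
variance = statistics.variance(data)  # =VAR(A1:A7), divisor n - 1
coeff_var = std_dev / mean            # =STDEV(A1:A7)/AVERAGE(A1:A7)
```

`statistics.quantiles` requires Python 3.8 or later.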
2. Determine whether the variables should be treated as i) Categorical/Quantitative ii) Discrete or Continuous
6. How to draw a stem and leaf plot and find the quartiles
n = 20
Q1: p = 25, $n \times p/100 = 5$ (integer), so Q1 is the average of the 5th and 6th smallest data points:
$Q_1 = \frac{23 + 25}{2} = 24$
Q2 (Median): n = 20 is even, so the median is the average of the 10th and 11th smallest data points.
Q3: ...
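The quartile calculation above can be sketched in Python. The integer case follows the worked example (average the k-th and (k+1)-th smallest values when n × p/100 = k is an integer). The non-integer case is not shown in this section, so the code assumes a common convention of rounding the position up; treat that branch as an assumption. The function name `percentile` and the data are hypothetical and are not the tutorial's data set.

```python
def percentile(data, p):
    """p-th percentile (0 < p < 100) using position n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)
    n = len(xs)
    k, remainder = divmod(n * p, 100)
    k = int(k)
    if remainder == 0:
        # Integer position k: average the k-th and (k+1)-th smallest values.
        return (xs[k - 1] + xs[k]) / 2
    # ASSUMPTION (not stated in the source): round the position up,
    # i.e. take the (k+1)-th smallest value.
    return xs[k]


# Hypothetical data with n = 8, used only for illustration.
data = [10, 12, 15, 17, 20, 22, 25, 30]

q1 = percentile(data, 25)      # position 2: average of 2nd and 3rd values
median = percentile(data, 50)  # position 4: average of 4th and 5th values
q3 = percentile(data, 75)      # position 6: average of 6th and 7th values
```

For n = 20 and p = 25 the same function averages the 5th and 6th smallest values, which matches the Q1 step in the worked example.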