Lecture03 Descriptive Statistics
LECTURE 03
JENNIFER JOYCE M. MONTEMAYOR - MAULANA
Basic statistical descriptions can be used to identify properties of the data and highlight which data
values should be treated as noise or outliers.
Descriptive statistics are used to describe and/or summarize the data we are working with.
■ Mean
■ Median
■ Mode
Mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
     = 696 / 12 = 58
The average salary is P58,000.
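The arithmetic above can be reproduced with a short sketch. The variable name `salaries` and the helper `mean` are illustrative choices, and the values are assumed to be in thousands, as the P58,000 result suggests.

```python
# Salary values from the worked example (assumed to be in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


average = mean(salaries)
print(sum(salaries), len(salaries), average)  # 696 12 58.0
```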
■ Median
○ Preferred when we suspect outliers are present in our data
○ Calculated by taking the middle value from an ordered list of values
■ If even number of data points, take the average of the two middle values
○ Value that separates the higher half of the data set from the lower half
○ Expensive to compute when you have a large number of observations
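The median rule described in the bullets can be sketched as follows. The helper name `median` is illustrative, and the extreme value 1100 in the second data set is a hypothetical number added only to show how little the median reacts to an outlier.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)  # sorting is the costly step for large data sets
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(median(salaries))  # 54.0

# Hypothetical outlier: replace the largest salary with 1100.
with_outlier = salaries[:-1] + [1100]
print(median(with_outlier))                    # 54.0, unchanged
print(sum(with_outlier) / len(with_outlier))   # 140.5, the mean shifts a lot
```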
■ Mode
○ Most common value in the data set
○ Value that occurs most frequently compared to neighboring values in the data set
○ Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal
○ In general, a data set with two or more modes is multimodal
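A minimal sketch of finding every mode and naming the modality, assuming the same salary data used in the mean example. The helper names `modes` and `modality` are illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def modality(mode_count):
    """Name the data set by its number of modes."""
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(mode_count, "multimodal")


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
salary_modes = modes(salaries)
print(salary_modes)                 # [52, 70]
print(modality(len(salary_modes)))  # bimodal
```

Both 52 and 70 occur twice, so this data set has two modes.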
■ Mode
Data that are not symmetric may be:
○ positively skewed - the mode occurs at a value smaller than the median
○ negatively skewed - the mode occurs at a value greater than the median
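The mode-versus-median comparison can be sketched as a rough rule of thumb. This is only an illustration: it assumes a single mode, the function name `skew_direction` is made up for this example, and both small data sets are hypothetical.

```python
from collections import Counter


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def skew_direction(values):
    """Compare the single mode with the median, following the rule stated above."""
    counts = Counter(values)
    highest = max(counts.values())
    found = [v for v, c in counts.items() if c == highest]
    if len(found) != 1:
        raise ValueError("this rule of thumb assumes exactly one mode")
    mode, mid = found[0], median(values)
    if mode < mid:
        return "positively skewed"
    if mode > mid:
        return "negatively skewed"
    return "no skew indicated"


# Hypothetical data: a long tail to the right, then a long tail to the left.
print(skew_direction([1, 1, 1, 2, 3, 4, 10]))     # positively skewed
print(skew_direction([1, 7, 8, 9, 10, 10, 10]))   # negatively skewed
```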
How do the values fall around the center? How far apart are they?
Data can be tightly clustered (low dispersion) or widely spread (high dispersion). This provides insight into how
consistent or variable the data are.
The range of a data set is the difference between its largest value and its smallest value.
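As a quick illustration, the range of the salary data from the mean example can be computed directly. The variable names are illustrative.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range = largest value minus smallest value.
data_range = max(salaries) - min(salaries)
print(data_range)  # 80
```

Because the range uses only the two extreme values, a single unusually large value such as 110 has a strong effect on it.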
Suppose the data for variable X are sorted in ascending numerical order. We can pick specific data
points that split the data distribution into equal or approximately equal parts. These data points are called
quantiles.
IQR = Q3 - Q1
■ Represents the range within which the
middle 50% of the data falls
■ Gives the spread of the data around the
median
■ Quantifies how much dispersion is
present in the middle 50% of the
distribution
■ A robust measure of spread because it
is not influenced by extreme outliers or
skewed data
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Median: (52 + 56) / 2 = 54
Q1 = (47 + 50) / 2 = 48.5
Q3 = (63 + 70) / 2 = 66.5
IQR = Q3 - Q1 = 66.5 - 48.5 = 18
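The quartile and IQR calculation can be sketched in code. This version assumes the "median of each half" convention that the worked numbers follow: split the ordered data at the median, then take the median of the lower and upper halves. Statistical libraries often interpolate instead and may return slightly different quartiles. The helper names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 48.5 54.0 66.5 18.0
```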
1. Minimum (Min): This is the smallest value in the dataset. It represents the lowest data
point in the distribution.
2. First Quartile (Q1): This is the 25th percentile of the data, also known as the lower
quartile. It separates the lowest 25% of the data from the rest, marking the boundary of the
first quarter of the data when it is arranged in ascending order.
3. Median (Q2): The median is the middle value of the dataset when the data is
ordered. It is the 50th percentile and represents the center of the data distribution.
4. Third Quartile (Q3): This is the 75th percentile of the data, also known as the upper
quartile. It separates the lowest 75% of the data from the highest 25%, marking the start of
the last quarter of the data when it is arranged in ascending order.
5. Maximum (Max): This is the largest value in the dataset. It represents the highest data
point in the distribution.
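The five values listed above can be collected in one small function. This is a sketch under the same median-of-halves quartile convention used in the IQR example. The function name `five_number_summary` and the dictionary keys are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return Min, Q1, Median, Q3 and Max as a dictionary."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
print(summary)
```

For the salary data this gives a minimum of 30, Q1 of 48.5, median of 54, Q3 of 66.5, and maximum of 110.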
The box plot or box-and-whisker plot is the visual
representation of the 5-number summary.
Covariance of two variables X and Y, where
x_i and y_i are individual data points for X and Y,
x̄ and ȳ are the means of X and Y, respectively, and
n is the number of data points.
■ A positive covariance indicates that when X is above its
mean, Y tends to be above its mean as well, and vice
versa. This suggests a positive linear relationship.
■ A negative covariance suggests that when X is above its
mean, Y tends to be below its mean, and vice versa. This
suggests a negative linear relationship.
■ A covariance of zero indicates that there is no linear
relationship between X and Y. However, it does not
necessarily imply that there is no relationship of any kind.
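The three cases can be checked numerically with a small sketch. It assumes the population form of covariance, the sum of (x_i - x̄)(y_i - ȳ) divided by n. Some texts divide by n - 1 instead (the sample form), which the `sample=True` option covers. The function name and all data below are hypothetical and chosen only for illustration.

```python
def covariance(xs, ys, sample=False):
    """Average product of paired deviations from the means of X and Y."""
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("X and Y must be non-empty and the same length")
    if sample and n < 2:
        raise ValueError("sample covariance needs at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
print(covariance(x, [2, 4, 6, 8, 10]))   # 4.0, Y rises with X
print(covariance(x, [10, 8, 6, 4, 2]))   # -4.0, Y falls as X rises

# Zero covariance does not mean "no relationship": here Y = X squared.
u = [-2, -1, 0, 1, 2]
v = [value ** 2 for value in u]
print(covariance(u, v))                  # 0.0
```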
A covariance matrix is used to describe the covariances
between multiple variables in a dataset. The diagonal
elements of the covariance matrix represent the variances of
the individual variables, and the off-diagonal elements
represent the covariances between pairs of variables.
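A covariance matrix can be built by computing the covariance for every pair of variables. The sketch below assumes the population form (dividing by n). The three variables x, y, and z are hypothetical data, and the function names are illustrative.

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n


def covariance_matrix(columns):
    """Entry [i][j] is the covariance between variable i and variable j."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Hypothetical data: three variables observed five times each.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [5, 3, 4, 1, 2]

matrix = covariance_matrix([x, y, z])
for row in matrix:
    print(row)
```

The diagonal entries are cov(x, x), cov(y, y), and cov(z, z), which are the variances of the individual variables. The matrix is symmetric because cov(x, y) equals cov(y, x).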