Lecture03 Descriptive Statistics

Descriptive Statistics

LECTURE 03
JENNIFER JOYCE M. MONTEMAYOR - MAULANA

Department of Computer Science


College of Computer Studies
MSU - Iligan Institute of Technology
Statistics of data
For data preprocessing to be successful, it is essential to have an overall picture of your data.

Basic statistical descriptions can be used to identify properties of the data and highlight which data
values should be treated as noise or outliers.

Descriptive statistics are used to describe and/or summarize the data we are working with.

■ Measure of central tendency - describes where the data is centered
■ Measure of the dispersion of data - indicates how far apart the values are

Jennifer Joyce M. Montemayor / CSC172 / Lecture 03


Measure of central tendency
Measure the location of the middle or center of the data distribution.

Given an attribute, where do most of its values fall?

Three common statistics include,

■ Mean
■ Median
■ Mode

■ Mean
○ also known as the “average” or “arithmetic mean”
○ most common and effective numeric measure of the center of a data set
○ population mean is denoted by the Greek letter mu (μ)
■ Represents the average value of a variable within an entire population
○ sample mean is denoted by x bar (x̄)
■ Represents the average value of a variable within a subset or sample of the
population
○ Very sensitive to outliers because each data point contributes equally to the calculation, so
extreme values can have a substantial impact on the result
■ Outliers are values or observations that deviate significantly from the other values in a dataset

Suppose we have the following values for salary (in thousands of pesos):
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Using the equation for the sample mean,

x̄ = (x1 + x2 + ... + xn) / n
= (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
= 696 / 12 = 58

The average salary is P58,000.
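
The salary calculation can be reproduced with a short Python sketch. The variable names are illustrative, and the extra value of 1000 is a hypothetical outlier (not part of the lecture data) added only to show how strongly a single extreme value pulls the mean.

```python
from statistics import mean

# Salary values from the example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Sample mean: sum of the values divided by the number of values
average = sum(salaries) / len(salaries)  # 696 / 12 = 58.0

# The standard library function gives the same result
library_average = mean(salaries)

# Hypothetical outlier, an assumption for illustration only
average_with_outlier = mean(salaries + [1000])  # rises well above 58
```

With one extreme value added, the mean more than doubles even though the other twelve salaries are unchanged.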

■ Median
○ Preferred when we suspect that outliers are present in the data, since it is less affected by extreme values than the mean
○ Calculated by taking the middle value from an ordered list of values
■ If there is an even number of data points, take the average of the two middle values
○ Value that separates the higher half of the data set from the lower half
○ Expensive to compute when there is a large number of observations, since the values generally need to be ordered first
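
As a minimal sketch of the procedure in the bullets, the function below sorts the values and picks the middle one, averaging the two middle values when the count is even. It uses the salary data from the mean example; the function name and the added value of 1000 are assumptions for illustration.

```python
from statistics import median

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def middle_value(values):
    """Median: middle of the sorted values, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salary_median = middle_value(salaries)  # (52 + 56) / 2 = 54.0

# Hypothetical outlier: the median moves only to the next value, 56
median_with_outlier = middle_value(salaries + [1000])
```

The standard library's `statistics.median` follows the same rule, so `median(salaries)` should agree with `middle_value(salaries)`.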

■ Mode
○ most common value in the data set
○ value that occurs most frequently compared to neighboring values in the data set
○ Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal
○ In general, a data set with two or more modes is multimodal
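
A small sketch of finding every mode, assuming the salary data from the mean example. The helper name `modes` is hypothetical. It counts each value and returns all values that share the highest count.

```python
from collections import Counter

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

salary_modes = modes(salaries)  # 52 and 70 each occur twice
```

Because two values tie for the highest frequency, this data set is bimodal.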

In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode
are all at the same center value. Data in most real applications are not symmetric.

They may instead be:
■ positively skewed - the mode occurs at a value smaller than the median
■ negatively skewed - the mode occurs at a value greater than the median

Measure of spread
Measures of spread tell us how the data points in a data set are dispersed or scattered.

How do the values fall around the center? How far apart are they?

Data can be dispersed narrowly (low dispersion) or widely (high dispersion). This provides insight into how
consistent or variable the data is.

Let x1, x2, ..., xn be a set of observations for some numeric attribute X.

The range of the set is the difference between the largest value and the smallest value:

range = max(X) - min(X)
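
For instance, applying the range formula to the salary data from the mean example might look like the sketch below. The second calculation drops the largest value purely to illustrate how much a single extreme observation affects the range.

```python
# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: largest value minus smallest value
data_range = max(salaries) - min(salaries)  # 110 - 30 = 80

# Illustration only: the same data without its largest value
trimmed = sorted(salaries)[:-1]
trimmed_range = max(trimmed) - min(trimmed)  # 70 - 30 = 40
```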

Suppose the data for variable X are sorted in ascending numerical order and we can use specific data
points to split the data distribution into equal or approximately equal parts -- these data points are called
quantiles.

Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size
consecutive sets.

■ The 2-quantile is the data point dividing the lower and upper halves of the data (the median)
■ The 4-quantiles are the three data points that split the data into four equal parts (quartiles)
■ The 100-quantiles are more commonly referred to as percentiles

■ First quartile, denoted by Q1, is the 25th percentile
■ Second quartile, denoted by Q2, is the 50th percentile (the median)
■ Third quartile, denoted by Q3, is the 75th percentile; it cuts off the lowest 75% or highest 25% of the data

■ Interquartile range (IQR) is the distance between the third and first quartiles

IQR = Q3 - Q1

■ Represents the range within which the middle 50% of the data falls
■ Gives the spread of the data around the median
■ Quantifies how much dispersion is present in the middle 50% of the distribution
■ A robust measure of spread because it is not influenced by extreme outliers or skewed data

Suppose we have the following values for salary (in thousands of pesos):

30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110

Median = (52 + 56) / 2 = 54
Q1 = (47 + 50) / 2 = 48.5 (median of the lower half: 30, 36, 47, 50, 52, 52)
Q3 = (63 + 70) / 2 = 66.5 (median of the upper half: 56, 60, 63, 70, 70, 110)
IQR = Q3 - Q1 = 66.5 - 48.5 = 18
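
The quartile calculation can be sketched in Python using the same median-of-halves approach as the worked example. The function name is hypothetical. Excluding the middle value from both halves when the count is odd is one common convention and an assumption here, and library quantile functions often use other interpolation rules that can give slightly different quartiles.

```python
from statistics import median

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(salaries)  # 48.5, 54.0, 66.5
iqr = q3 - q1                     # 66.5 - 48.5 = 18.0
```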

Five-number summary
No single measure of spread is very useful when describing skewed distributions.

The five-number summary is a set of five values that describes the distribution of a dataset.

1. Minimum (Min): This is the smallest value in the dataset. It represents the lowest data
point in the distribution.
2. First Quartile (Q1): This is the 25th percentile of the data, also known as the lower
quartile. It separates the lowest 25% of the data from the rest, marking the boundary of the first
quarter of the data when it is arranged in ascending order.
3. Median (Q2): The median is the middle value of the dataset when the data is
ordered. It is the 50th percentile and represents the center of the data distribution.
4. Third Quartile (Q3): This is the 75th percentile of the data, also known as the upper
quartile. It separates the lowest 75% of the data from the highest 25%, marking the boundary of the last
quarter of the data when it is arranged in ascending order.
5. Maximum (Max): This is the largest value in the dataset. It represents the highest data
point in the distribution.
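
A minimal sketch of computing the five values, assuming the salary data from the earlier examples and the median-of-halves quartile method used in the IQR example. The function name is hypothetical, and other quartile conventions would give slightly different Q1 and Q3.

```python
from statistics import median

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with quartiles taken as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + n % 2:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

summary = five_number_summary(salaries)  # (30, 48.5, 54.0, 66.5, 110)
```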

The box plot or box-and-whisker plot is the visual representation of the five-number summary.

■ Median is the thick line inside the box
■ Top of the box is Q3
■ Bottom of the box is Q1
■ Lines (or whiskers) extend from both sides of the box to the minimum and maximum values that are not outliers
■ Individual points are outliers, values that lie beyond the whiskers
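
The lecture does not state how a box plot decides which points are outliers. A widely used convention, assumed here, flags values more than 1.5 × IQR below Q1 or above Q3. The sketch applies that rule to the salary data, using the quartiles from the IQR example.

```python
# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles taken from the lecture's worked example
q1, q3 = 48.5, 66.5
iqr = q3 - q1  # 18.0

# Assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr  # 21.5
upper_fence = q3 + 1.5 * iqr  # 93.5

# Points outside the fences are drawn individually as outliers
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

# Whiskers reach the most extreme values that are still inside the fences
inside = [x for x in salaries if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
```

Under this rule the salary of 110 is the only outlier, and the whiskers run from 30 to 70.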

Variance
■ Range tells us how dispersed the entire dataset is, but it does not tell us how the data is dispersed
around the mean
■ Variance describes how far observations are spread out from their average value (mean)
■ Measure of the average squared deviation of each data point from the mean
■ Calculated by taking the average of the squared differences between each data point and the mean
of the dataset
■ Population variance is denoted by sigma squared (σ²)
■ Sample variance is denoted by s²
■ Expressed in squared units
■ Most statistical tools will give us the sample variance by default, since it is very rare that we would have
data for the entire population
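
As a sketch of the calculation, the code below uses the salary data from the earlier examples. It assumes the usual definitions: the population variance divides the sum of squared deviations by n, and the sample variance divides it by n - 1. The helper name is hypothetical.

```python
from statistics import pvariance, variance

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def sum_squared_deviations(values):
    """Sum of squared differences between each value and the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

n = len(salaries)
ss = sum_squared_deviations(salaries)  # 4550.0, with mean 58

population_variance = ss / n        # divide by n
sample_variance = ss / (n - 1)      # divide by n - 1
```

In the standard library, `pvariance` corresponds to the population form and `variance` to the sample form.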

Standard Deviation
■ Square root of the variance
■ More interpretable measure of spread than the variance because it is in the same units as the data
■ Measures how far, on average, the data points are from the mean
■ Population standard deviation is denoted by sigma (σ)
■ Sample standard deviation is denoted by s
■ Often used to assess the variability in a dataset
■ A small standard deviation means the values are close to the mean
■ A large standard deviation means the values are dispersed widely
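
Continuing with the salary data, a minimal sketch of both standard deviations, assuming the population form divides by n and the sample form by n - 1 before taking the square root:

```python
from math import sqrt
from statistics import pstdev, stdev

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(salaries)
mean_salary = sum(salaries) / n
squared_deviations = sum((x - mean_salary) ** 2 for x in salaries)

population_sd = sqrt(squared_deviations / n)        # about 19.47
sample_sd = sqrt(squared_deviations / (n - 1))      # about 20.34
```

Both results are in thousands of pesos, the same unit as the data, which is what makes them easier to interpret than the variance.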

Coefficient of variation (CV)
■ Useful when comparing the relative variability of datasets with different units
○ Compares the level of dispersion of one dataset to another
■ Allows standardized comparison because it is unitless
■ High CV - high relative variability; the standard deviation is a significant portion of the mean
■ Low CV - low relative variability; the standard deviation is small relative to the mean
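
The formula itself does not appear in this text. The sketch below assumes the usual definition, CV = standard deviation / mean (often reported as a percentage), and uses the sample standard deviation. The height values are hypothetical and serve only to show a comparison across different units.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)

# Salary values from the lecture example (in thousands of pesos)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Hypothetical heights in centimeters, for illustration only
heights_cm = [150, 155, 160, 165, 170]

cv_salaries = coefficient_of_variation(salaries)   # about 0.35
cv_heights = coefficient_of_variation(heights_cm)  # about 0.05
```

Although the two datasets are measured in different units, their CVs can be compared directly: the salaries vary far more relative to their mean than the heights do.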

Covariance
■ Measure for describing how two attributes change together
■ Indicates:
○ whether there is a linear relationship between two variables
○ direction of the relationship (positive or negative)
■ Quantifies how changes in one variable are associated with changes in another

In the covariance formula,
xi and yi are individual data points for X and Y,
x̄ and ȳ are the means of X and Y, respectively, and
n is the number of data points
■ A positive covariance indicates that when X is above its
mean, Y tends to be above its mean as well, and vice
versa. This suggests a positive linear relationship.
■ A negative covariance suggests that when X is above its
mean, Y tends to be below its mean, and vice versa. This
suggests a negative linear relationship.
■ A covariance of zero indicates that there is no linear
relationship between X and Y. However, it does not
necessarily imply that there is no relationship of any kind.
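
The covariance formula did not survive in this text, so the sketch below is an assumption based on the symbols defined above: sum the products (xi - x̄)(yi - ȳ) over all data points, then divide by n - 1 for a sample or by n for a population. The paired data (hours studied and quiz scores) is hypothetical.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length lists; divides by n - 1 (sample) or n (population)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical paired observations, for illustration only
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

cov_sample = covariance(hours, scores)                    # 6 / 4 = 1.5
cov_population = covariance(hours, scores, sample=False)  # 6 / 5 = 1.2

# Reversing one variable flips the sign of the covariance
cov_negative = covariance(hours, [5, 4, 3, 2, 1])         # -10 / 4 = -2.5
```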

A covariance matrix is used to describe the covariances between multiple variables in a dataset. The diagonal
elements of the covariance matrix represent the variances of the individual variables, and the off-diagonal
elements represent the covariances between pairs of variables.
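
A covariance matrix can be sketched in plain Python as a nested list with one row and one column per variable. The function names and the data are hypothetical, and the sample (n - 1) divisor is an assumption.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(columns):
    """Build the matrix of pairwise covariances from a dict of name -> values."""
    names = list(columns)
    return [[covariance(columns[a], columns[b]) for b in names] for a in names]

# Hypothetical dataset with two variables, for illustration only
data = {
    "hours": [1, 2, 3, 4, 5],
    "scores": [2, 4, 5, 4, 5],
}

matrix = covariance_matrix(data)
# Row/column order follows the dict: hours, scores
# [[2.5, 1.5],
#  [1.5, 1.5]]
```

The diagonal entries (2.5 and 1.5) are the variances of hours and scores, and the matching off-diagonal entries (1.5) are their covariance, so the matrix is symmetric.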

Correlation
■ Describes the extent to which two variables are related or associated
■ Assesses the strength (magnitude) and direction (same or opposite) of the relationship between two
variables
■ Correlation does not imply causation
○ Two variables can be correlated but it does not mean that one causes the other (other factors
may be at play)
■ Quantified by the Correlation Coefficient
○ Pearson Correlation Coefficient (r)
■ measures the linear relationship between two variables
■ assumes linear relationship, sensitive to outliers
■ coefficient ranges from -1 to 1
● positive correlation (r > 0)
○ when one variable increases, the other tends to increase
○ The closer r is to 1, the stronger the positive correlation
● negative correlation (r < 0)
○ when one variable increases, the other tends to decrease
○ the closer r is to -1, the stronger the negative correlation
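
A minimal sketch of the Pearson correlation coefficient, assuming the standard definition: the sum of the products of deviations divided by the square root of the product of the two sums of squared deviations. This is equivalent to dividing the covariance by the product of the two standard deviations. The data is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations, for illustration only
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)                      # 6 / sqrt(60), about 0.77
r_perfect = pearson_r(hours, [2, 4, 6, 8, 10])    # exact linear increase
r_inverse = pearson_r(hours, [10, 8, 6, 4, 2])    # exact linear decrease
```

A value near 0.77 indicates a fairly strong positive correlation, while the two exactly linear cases reach the ends of the range at 1 and -1.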
