Lecture03 Descriptive Statistics

Uploaded by

rachelanne.balagbis

Descriptive Statistics

LECTURE 03
JENNIFER JOYCE M. MONTEMAYOR - MAULANA

Department of Computer Science

College of Computer Studies
MSU - Iligan Institute of Technology
Statistics of data
For data preprocessing to be successful, it is essential to have an overall picture of your data.

Basic statistical descriptions can be used to identify properties of the data and highlight which data
values should be treated as noise or outliers.

Descriptive statistics are used to describe and/or summarize the data we are working with.

■ Measure of central tendency - describes where the data is centered

■ Measure of the dispersion of data - indicates how far apart the values are

Jennifer Joyce M. Montemayor / CSC172 / Lecture 03

Measure of central tendency
Measure the location of the middle or center of the data distribution.

Given an attribute, where do most of its values fall?

Three common statistics include,

■ Mean
■ Median
■ Mode

■ Mean
○ also known as the "average" or "arithmetic mean"
○ most common and effective numeric measure of the center
○ population mean is denoted by the Greek letter mu (μ)
■ Represents the average value of a variable within an entire population
○ sample mean is denoted by x bar (x̄)
■ Represents the average value of a variable within a subset or sample of the
population
○ Very sensitive to outliers because each data point contributes equally to the calculation, so
extreme values can have a substantial impact on the result
■ Outliers are values or observations that deviate significantly from the other values in a dataset

Suppose we have the following values for salary (in thousands of pesos),
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
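Using the following equation,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$\bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58$$

The average salary is P58,000.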
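
The calculation above can be checked with a short Python sketch. It uses the standard-library `statistics` module; the variable names are illustrative only. The second part drops the largest salary (110) to show how strongly a single extreme value pulls the mean.

```python
from statistics import mean

# Salary values (in thousands of pesos) from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean = sum of the values divided by the number of values.
average = mean(salaries)
print("mean:", average)

# Sensitivity to outliers: remove the largest value and recompute.
average_without_max = mean(salaries[:-1])
print("mean without 110:", average_without_max)
```

With the full data the mean is 58; without the single largest value it falls to roughly 53.3.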

■ Median
○ Preferred when we suspect outliers are present in the data
○ Calculated by taking the middle value from an ordered list of values
■ If there is an even number of data points, take the average of the two middle values
○ Value that separates the higher half of the data set from the lower half
○ More expensive to compute than the mean when there is a large number of observations, because the values must first be ordered
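
As a rough illustration of why the median is preferred when outliers are suspected, the sketch below reuses the salary data from the mean example. The value 1100 is a made-up outlier introduced only for this demonstration.

```python
from statistics import mean, median

# Salary values (in thousands of pesos) from the mean example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of values: the median is the average of the two middle ones.
print("median:", median(salaries))

# Hypothetical outlier: replace the top salary with 1100.
with_outlier = salaries[:-1] + [1100]
print("median with outlier:", median(with_outlier))
print("mean with outlier:", mean(with_outlier))
```

The median stays at 54 while the mean jumps from 58 to 140.5.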

■ Mode
○ most common value in the data set
○ value that occurs most frequently compared to neighboring values in the data set
○ Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal
○ In general, a data set with two or more modes is multimodal
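
A minimal sketch of finding the mode(s), assuming Python 3.8 or later for `statistics.multimode`. It uses the salary data from the mean example, which happens to have two modes.

```python
from statistics import multimode

# Salary values (in thousands of pesos) from the mean example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every value that shares the highest frequency.
modes = multimode(salaries)
print("modes:", modes)

# Name the shape by the number of modes found.
labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
shape = labels.get(len(modes), "multimodal")
print("shape:", shape)
```

Here 52 and 70 each occur twice, so the data set is bimodal.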

In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode
are all the same center value. Data in most real applications are not symmetric.

They may be,
■ positively skewed - the mode occurs at a value smaller than the median
■ negatively skewed - the mode occurs at a value greater than the median


Measure of spread
Measures of spread tell us how data points in a data set are dispersed or scattered.

How do the values fall around the center? How far apart are they?

Data can be dispersed tightly (low dispersion) or widely (high dispersion). This provides insight into how
consistent or variable the data is.

Let x1, x2, ..., xn be a set of observations for some numeric attribute X.

The range of the set is the difference between the largest value and the smallest value

range = max(X) - min(X)

Suppose the data for variable X are sorted in ascending numerical order. We can use specific data
points to split the data distribution into equal or approximately equal parts -- these data points are called
quantiles.
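
For the range, a one-line check is enough. The sketch below applies the formula to the salary data from the mean example.

```python
# Salary values (in thousands of pesos) from the mean example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# range = max(X) - min(X)
value_range = max(salaries) - min(salaries)
print("range:", value_range)
```

The range is 110 - 30 = 80, and it depends entirely on the two most extreme values.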

Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size
consecutive sets.

■ The 2-quantile is the data point dividing the lower and upper halves -- the median
■ The 4-quantiles are the three data points that split the data into four equal parts -- the quartiles
■ The 100-quantiles are more commonly referred to as percentiles

■ First quartile, denoted by Q1, is the 25th percentile
■ Second quartile, denoted by Q2, is the 50th percentile
■ Third quartile, denoted by Q3, is the 75th percentile -- cuts off the lowest 75% or highest 25% of the data

■ Interquartile range (IQR) is the distance between the third and first quartiles

IQR = Q3 - Q1

■ Represents the range within which the middle 50% of the data falls
■ Gives the spread of the data around the median
■ Quantifies how much dispersion is present in the middle 50% of the distribution
■ A robust measure of spread because it is not influenced by extreme outliers or skewed data


Example, using the salary data:

30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110

Median: (52 + 56) / 2 = 54
Q1 = (47 + 50) / 2 = 48.5
Q3 = (63 + 70) / 2 = 66.5
IQR = Q3 - Q1 = 66.5 - 48.5 = 18
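
The quartile calculation above can be sketched in Python. The helper `quartiles` is a hypothetical name, and it assumes the "median of each half" convention used in the worked example. Library routines such as `statistics.quantiles` or NumPy's percentile functions interpolate differently and may return slightly different quartiles for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle
    return median(lower), median(ordered), median(upper)

# Salary values (in thousands of pesos) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```

For an odd number of values, this sketch leaves the middle value out of both halves, which is one common choice among several.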


Five-number summary
No single measure of spread is very useful when describing skewed distributions.

The five-number summary is a set of five values that describes the distribution of a dataset.

1. Minimum (Min): This is the smallest value in the dataset. It represents the lowest data
point in the distribution.
2. First Quartile (Q1): This is the 25th percentile of the data, also known as the lower
quartile. It separates the lowest 25% of the data from the rest. Q1 indicates the boundary of the first
quarter of the data when it is arranged in ascending order.
3. Median (Q2): The median is the middle value of the dataset when the data is
ordered. It is the 50th percentile and represents the center of the data distribution.
4. Third Quartile (Q3): This is the 75th percentile of the data, also known as the upper
quartile. It separates the lowest 75% of the data from the highest 25%. Q3 indicates the boundary of the last
quarter of the data when it is arranged in ascending order.
5. Maximum (Max): This is the largest value in the dataset. It represents the highest data
point in the distribution.
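
A minimal sketch of the five-number summary for the salary data used earlier in the lecture. The function name `five_number_summary` is hypothetical, and the quartiles again assume the median-of-halves convention; other quartile conventions would give slightly different Q1 and Q3.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Salary values (in thousands of pesos).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = five_number_summary(salaries)
print("min, Q1, median, Q3, max:", summary)
```

For this data the summary is 30, 48.5, 54, 66.5, 110. The gap between Q3 and the maximum is much larger than the gap between the minimum and Q1, which hints at the long upper tail.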

A boxplot (box-and-whisker plot) displays the five-number summary graphically:

■ Median is the thick line inside the box
■ Top of the box is Q3
■ Bottom of the box is Q1
■ Lines (or whiskers) extend from both sides of the box to represent the minimum and maximum
■ Individual points are outliers, that is, values that fall beyond the whiskers


Variance
■ The range tells us how dispersed the entire dataset is -- it does not tell us how the data is dispersed
around the mean
■ Variance describes how far observations are spread out from their average value (mean)
■ Measure of the average squared deviation of each data point from the mean
■ Calculated by taking the average of the squared differences between each data point and the mean
of the dataset
■ Population variance is denoted by sigma squared (σ²)
■ Sample variance is denoted by s²
■ Expressed in squared units of the data
■ Most statistical tools will give us sample variance by default, since it is very rare that we would have
data for the entire population
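
The two variances differ only in the divisor: the population variance divides the sum of squared deviations by n, while the sample variance divides by n - 1. The sketch below computes both for the salary data from the mean example, once by hand and once with the standard-library `statistics` module. Treating the twelve salaries as a sample or as a population is an assumption made only for illustration.

```python
from statistics import mean, pvariance, variance

# Salary values (in thousands of pesos) from the mean example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(salaries)
x_bar = mean(salaries)

# Sum of squared deviations from the mean.
squared_deviations = sum((x - x_bar) ** 2 for x in salaries)

population_variance = squared_deviations / n        # divide by n
sample_variance = squared_deviations / (n - 1)      # divide by n - 1

print("population variance:", population_variance, pvariance(salaries))
print("sample variance:", sample_variance, variance(salaries))
```

The sum of squared deviations is 4550, giving about 379.17 for the population variance and about 413.64 for the sample variance, both in squared thousands of pesos.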


Standard Deviation
■ Square root of the variance
■ More interpretable measure of spread than variance because it is in the same units as the data
■ Measures how far, on average, the data points are from the mean
■ Population standard deviation is denoted by sigma (σ)
■ Sample standard deviation is denoted by s
■ Often used to assess the variability in a dataset
■ Small standard deviation means values are close to the mean
■ Large standard deviation means values are dispersed widely
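
A short sketch of the relationship between variance and standard deviation, using the salary data from the mean example and the standard-library `statistics` module. As before, treating the data as a sample or a population is an assumption for illustration.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Salary values (in thousands of pesos) from the mean example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Standard deviation is the square root of the variance.
sample_sd = math.sqrt(variance(salaries))
population_sd = math.sqrt(pvariance(salaries))

print("sample s:", sample_sd, stdev(salaries))
print("population sigma:", population_sd, pstdev(salaries))
```

Both results are in thousands of pesos, the same unit as the data: roughly 20.3 for the sample and 19.5 for the population.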


Coefficient of variation (CV)
■ Useful when comparing the relative variability of datasets with different units
○ Compares the level of dispersion of one dataset to another
■ Allows standardized comparison because it is unitless
■ High CV - high relative variability; the standard deviation is a significant portion of the mean
■ Low CV - low relative variability; the standard deviation is small relative to the mean
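
The coefficient of variation is the standard deviation divided by the mean (CV = s / x̄, often reported as a percentage). The sketch below uses a hypothetical helper named `coefficient_of_variation` and the sample standard deviation. To show that the CV is unitless, it expresses the same salaries in thousands of pesos and in pesos.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)

# Salary values from the mean example, in thousands of pesos.
salaries_thousands = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# The same salaries expressed in pesos.
salaries_pesos = [s * 1000 for s in salaries_thousands]

cv_thousands = coefficient_of_variation(salaries_thousands)
cv_pesos = coefficient_of_variation(salaries_pesos)

print("CV (thousands of pesos):", cv_thousands)
print("CV (pesos):", cv_pesos)
```

Changing the unit multiplies both the standard deviation and the mean by 1000, so the ratio is unchanged at about 0.35. The CV is only meaningful when the mean is not zero or close to it.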


Covariance
■ Measure for describing how two attributes change together
■ Indicates:
○ whether there is a linear relationship between two variables
○ direction of the relationship (positive or negative)
■ Quantiﬁes how changes in one variable are associated with changes in another


The sample covariance of X and Y is

$$\operatorname{cov}(X, Y) = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where,
$x_i$ and $y_i$ are individual data points for X and Y
$\bar{x}$ and $\bar{y}$ are the means of X and Y, respectively
$n$ is the number of data points

(For an entire population, divide by n instead of n - 1.)

■ A positive covariance indicates that when X is above its mean, Y tends to be above its mean as well, and vice
versa. This suggests a positive linear relationship.
■ A negative covariance suggests that when X is above its mean, Y tends to be below its mean, and vice versa. This
suggests a negative linear relationship.
■ A covariance of zero indicates that there is no linear relationship between X and Y. However, it does not
necessarily imply that there is no relationship of any kind.

A covariance matrix is used to describe the covariances between multiple variables in a dataset. The diagonal
elements of the covariance matrix represent the variances of the individual variables, and the off-diagonal elements
represent the covariances between pairs of variables.
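
A minimal sketch of the covariance calculation, assuming the sample form (divisor n - 1). The function name `sample_covariance` and the small data sets are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
from statistics import mean

def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return products / (len(xs) - 1)

# Hypothetical data: y rises with x, y_down falls as x rises.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print("positive relationship:", cov_up)
print("negative relationship:", cov_down)
```

The sign shows the direction of the relationship. The magnitude depends on the units of X and Y, so it is hard to interpret on its own.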


Correlation
■ Describes the extent to which two variables are related or associated
■ Assesses the strength (magnitude) and direction (same or opposite) of the relationship between two
variables
■ Correlation does not imply causation
○ Two variables can be correlated but it does not mean that one causes the other (other factors
may be at play)
■ Quantified by the Correlation Coefficient
○ Pearson Correlation Coefficient (r)
■ measures the strength of the linear relationship between two variables
■ assumes a linear relationship and is sensitive to outliers
■ coefficient ranges from -1 to 1
● positive correlation (r > 0)
○ when one variable increases, the other tends to increase
○ The closer r is to 1, the stronger the positive correlation
● negative correlation (r < 0)
○ when one variable increases, the other tends to decrease
○ the closer r is to -1, the stronger the negative correlation
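
Pearson's r can be computed as the sum of the products of deviations divided by the square root of the product of the two sums of squared deviations, which is equivalent to the covariance divided by the product of the two standard deviations. The sketch below is one way to write that. The function name `pearson_r` and the data are hypothetical.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with perfect linear relationships.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # y increases with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # y decreases as x increases
print("r (increasing):", r_up)
print("r (decreasing):", r_down)
```

Unlike covariance, r is unitless and always lies between -1 and 1, so values from different pairs of variables can be compared directly.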
