
STAT101_2025 Week 3 Notes(1)-2

The document outlines the objectives for a statistics lecture focused on constructing frequency tables, visualizing numerical data, and understanding various distribution shapes. It covers methods for organizing numeric data, including Sturges' Rule for determining class intervals, and provides examples of standardized intervals for blood pressure, cholesterol, and BMI. Additionally, it discusses visualization techniques such as histograms, scatter plots, and box plots, along with calculations for class boundaries and widths.

Uploaded by

Molf

STAT 101 – Week 3

Lecture Objectives: Upon completion of this lecture,
students should be able to:
Construct Frequency Tables for Numeric Data
➢Apply Sturges’ Rule to determine the optimal number of class
intervals.
➢Compute class width and determine class limits.
➢Construct frequency tables for grouped numeric data.
➢Compute cumulative frequency, relative frequency, and
relative cumulative frequency.

Visualize Numerical Data
➢Create and interpret histograms for grouped data.
➢Construct and analyze frequency polygons.
➢Plot and interpret cumulative frequency curves (ogives),
etc.
➢Interpret and analyze stem-and-leaf plots.
Describe the Shapes of Distributions
•Identify common distribution shapes:
•Symmetric (Bell-shaped)
•Right-skewed (Positively skewed)
•Left-skewed (Negatively skewed)
•Uniform
•Bimodal
•Multimodal
•Link distribution shapes to common distributions in statistics.

Understand Box Plots:
• Learn how to construct and interpret box plots.
• Identify key components of a box plot: median, quartiles,
interquartile range (IQR), and outliers.
• Analyze the spread, skewness, and overall distribution of a
dataset using box plots.
• Compare multiple box plots to assess variations across
different groups.
Understand Scatter Plots:
➢Construct and interpret scatter plots.
➢Identify different types of relationships between two
quantitative variables (positive correlation, negative correlation,
or no correlation).
➢Recognize patterns, clusters, and outliers in scatter plots.
➢Understand how scatter plots help reveal trends and
associations in data.
➢Use scatter plots to support decision-making in real-world
applications.
Organizing/Grouping Numeric Data
➢To create frequency tables for numerical data, there is one
preliminary step that was not necessary for categorical data.

➢Because the variable is numeric, there are no obvious
categories to present in a frequency table.

➢We can group the data into suitable categories that are ordered,
equal in width, and non-overlapping.

➢However, standardized intervals for some variables have been
developed in many fields to ensure consistency and accuracy.

➢These standardized intervals allow for easy categorization of
data and facilitate comparisons across studies and industries.
Examples of Standardized Intervals

Example 1: Blood Pressure
The NHLBI (National Heart, Lung, and Blood Institute) and
the American Heart Association use the following
classification of blood pressure (mmHg):

▪ Normal: systolic <120 and diastolic <80

▪ Pre-hypertension: systolic 120-139 or diastolic 80-89

▪ Stage I hypertension: systolic 140-159 or diastolic 90-99

▪ Stage II hypertension: systolic ≥ 160 or diastolic ≥ 100
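
The classification above can be read as a decision rule. The following is a minimal Python sketch, assuming (an interpretation, not something the notes state) that a reading is assigned to the most severe category reached by either its systolic or its diastolic value. The function name `bp_category` and the sample readings are hypothetical.

```python
def bp_category(systolic: float, diastolic: float) -> str:
    """Classify a blood pressure reading (mmHg) using the cut-points listed above.

    Assumption: when systolic and diastolic fall in different categories,
    the more severe one is reported, so checks run from most to least severe.
    """
    if systolic >= 160 or diastolic >= 100:
        return "Stage II hypertension"
    if systolic >= 140 or diastolic >= 90:
        return "Stage I hypertension"
    if systolic >= 120 or diastolic >= 80:
        return "Pre-hypertension"
    return "Normal"


# Hypothetical readings, for illustration only
for reading in [(118, 76), (130, 78), (135, 92), (165, 85)]:
    print(reading, "->", bp_category(*reading))
```

Checking the most severe category first mirrors the "or" in the hypertension rows and the "and" in the Normal row.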

Example 2: Cholesterol Level
The American Heart Association uses the following
classification for total cholesterol levels (mg/dL):

➢ Desirable: total cholesterol < 200

➢ Borderline high risk: total cholesterol of 200–239

➢ High risk: total cholesterol ≥ 240

Example 3: Body Mass Index
Body mass index (BMI) is computed as the ratio of weight
in kilograms to height in meters squared and the following
categories are often used:

❖ Underweight: BMI < 18.5

❖ Normal weight: BMI ∈ [18.5, 24.9]

❖ Overweight: BMI ∈ [25, 29.9]

❖ Obese: BMI ≥ 30

➢ These standardized intervals make it easy to classify
data into meaningful categories without having to create
custom groupings.
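
As an illustration of how a standardized interval scheme is applied, here is a short Python sketch of the BMI example. It assumes cut-points at 18.5, 25, and 30 so that every value falls in exactly one category (the printed upper limits 24.9 and 29.9 leave small gaps for values such as 24.95). The function names and the sample person are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the categories listed above.

    Assumption: cut-points 18.5, 25 and 30 are used, so a value between the
    printed limits (for example 24.95) still lands in exactly one class.
    """
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal weight"
    if value < 30:
        return "Overweight"
    return "Obese"


# Hypothetical person: 70 kg, 1.75 m tall
example = bmi(70, 1.75)
print(round(example, 1), bmi_category(example))  # prints: 22.9 Normal weight
```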
Creating Intervals When No Standards
Exist
➢In cases where no standard intervals are available,
you need to create your own intervals to organize
the data.

➢One common method for determining the number of
intervals is Sturges' Rule.

➢Sturges' rule provides a formula for calculating the
optimal number of intervals (k) for a frequency table
based on the number of data points (n).

➢Here's a step-by-step guide on how to create a
frequency table for numeric data:

Step 1: Decide on the Number of Intervals (Classes)

➢Decide how many intervals or classes you want to group
the data into:

$$k = 1 + \log_2 n = 1 + 3.322 \log_{10} n,$$

where k is the number of intervals and n is the sample
size.
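
Sturges' rule is easy to check numerically. The sketch below assumes the result is rounded to the nearest whole number, as Example 4 in these notes does (5.9 ≈ 6); some textbooks round up instead. The function name `sturges_k` is hypothetical.

```python
import math


def sturges_k(n: int) -> int:
    """Number of classes from Sturges' rule, k = 1 + log2(n).

    Assumption: the result is rounded to the nearest whole number.
    """
    if n < 1:
        raise ValueError("n must be a positive sample size")
    return round(1 + math.log2(n))


n = 30
print(1 + math.log2(n))            # about 5.907
print(1 + 3.322 * math.log10(n))   # same value via base-10 logs, about 5.907
print(sturges_k(n))                # 6
```

The two printed values agree to about three decimal places because 3.322 is a rounded value of 1 / log10(2).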
Step 2: Determine the Range of Data
➢Calculate the range of the data, which is the difference
between the maximum and minimum values:

$$\text{Range} = \text{maximum value} - \text{minimum value}$$

➢This helps in determining the range of values to cover in the
frequency table.
Step 3: Calculate the Class Width

➢Determine the width of each interval by dividing the range of
data (Step 2) by the number of intervals obtained in Step 1:

$$\text{Class width} = \frac{\text{Range}}{k}$$

➢Round to the nearest whole number.
Step 4: Determine the Class Limits
➢Calculate the class limits for each interval.

➢Class limits are the lowest and highest values included in
each interval.

➢The lower limit of the first class is determined by the minimum
value in the dataset.

➢To determine the lower limit of the second class, add the
class width to the lower limit of the first class.

➢To find the lower limit of the third class, add the class width to
the lower limit of the second class.

➢Continue this process until you reach the lower limit of the
k-th class.
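
Step 4 can be sketched in a few lines of Python. The notes only describe how to obtain the lower limits; the sketch assumes each upper limit is one measurement unit below the next lower limit (`unit`, a hypothetical parameter: 1 for whole-number limits, 0.1 for limits with one decimal), which matches the intervals used in Example 4 of these notes.

```python
def class_limits(minimum, width, k, unit=1, ndigits=1):
    """Return k (lower limit, upper limit) pairs.

    Lower limits: minimum, minimum + width, minimum + 2*width, ...
    Assumption: upper limit = lower limit + width - unit, so that
    consecutive classes do not overlap.
    """
    limits = []
    for j in range(k):
        lower = round(minimum + j * width, ndigits)
        upper = round(lower + width - unit, ndigits)
        limits.append((lower, upper))
    return limits


# Minimum 20, class width 3, six classes, whole-number limits
print(class_limits(20, 3, 6))
# Minimum 20, class width 3.1, six classes, limits with one decimal
print(class_limits(20, 3.1, 6, unit=0.1))
```

The first call ends at 35-37, so a maximum of 38 would be left out; the second ends at 35.5-38.5.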
Step 5: Create the Frequency Table
➢Count the number of observations falling within each interval
(class) and record the frequency.

➢This frequency represents the number of data points falling
within that interval.

➢Always check that the final interval includes the maximum
data value.

➢If it doesn't, adjust the class width slightly upwards until it
does:

$$\text{Adjusted class width} = \frac{\text{Range}}{k} + \delta,$$

where δ is chosen to ensure the last interval captures the
maximum value.

➢Try values of δ from 0.1 to 1, in increasing order, until the
maximum value is covered.
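
The adjustment can be automated as a small search over δ. This sketch assumes the k classes start at the minimum and that the maximum is "covered" when it lies below the point where a (k + 1)-th class would begin, that is, when minimum + k × width > maximum. The function name and the `step` and `max_delta` parameters are hypothetical.

```python
def adjusted_class_width(minimum, maximum, k, step=0.1, max_delta=1.0):
    """Smallest width of the form round(Range / k) + delta that covers the maximum.

    delta runs over 0, step, 2*step, ..., max_delta.
    Assumption: the maximum is covered when minimum + k * width > maximum.
    """
    base = round((maximum - minimum) / k)      # Step 3: Range / k, rounded
    n_steps = round(max_delta / step)
    for i in range(n_steps + 1):
        width = round(base + i * step, 10)     # round away floating-point noise
        if minimum + k * width > maximum:
            return width
    raise ValueError("no delta up to max_delta covers the maximum value")


# Minimum 20, maximum 38, six classes: width 3 fails, width 3.1 works
print(adjusted_class_width(20, 38, 6))  # 3.1
```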


Example 4: Grouping Numeric Data

Consider the following numeric data:

i. Range of data: $\text{Range} = 38 - 20 = 18$

ii. Number of intervals: $k = 1 + \log_2 n = 5.9 \approx 6$ (here n = 30)

iii. Class width: $\frac{\text{Range}}{k} = \frac{18}{6} = 3$

Using the class width of 3, the maximum value (38) is not
included in our class intervals.
Class Intervals/Limits   Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
20-22                    5
23-25                    1
26-28                    4
29-31                    5
32-34                    7
35-37                    6
Total                    28

What about 38? All observations must be accounted for!

Adjusting the class width
➢Adding 0.1 to the class width achieves what we intended:
38 is now accounted for.
➢Our new class width is 3.1.
Class Intervals/Limits   Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
20.0-23.0                5
23.1-26.1                4
26.2-29.2                3
29.3-32.3                5
32.4-35.4                7
35.5-38.5                6

Complete the table by computing the relative frequency,
cumulative frequency, and relative cumulative frequency.
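
The three remaining columns follow directly from the frequencies. A minimal Python sketch, using the class frequencies from the table with class width 3.1 (the variable names are hypothetical):

```python
from itertools import accumulate

classes = ["20.0-23.0", "23.1-26.1", "26.2-29.2", "29.3-32.3", "32.4-35.4", "35.5-38.5"]
frequencies = [5, 4, 3, 5, 7, 6]

n = sum(frequencies)                                  # total number of observations
relative = [f / n for f in frequencies]               # frequency / n
cumulative = list(accumulate(frequencies))            # running total of frequencies
relative_cumulative = [c / n for c in cumulative]     # cumulative frequency / n

print(f"{'Class':<11}{'f':>4}{'Rel. f':>9}{'Cum. f':>8}{'Rel. cum. f':>13}")
for label, f, rf, cf, rcf in zip(classes, frequencies, relative,
                                 cumulative, relative_cumulative):
    print(f"{label:<11}{f:>4}{rf:>9.3f}{cf:>8}{rcf:>13.3f}")
```

Two quick checks on any completed table: the relative frequencies sum to 1, and the last cumulative frequency equals n.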
Class Exercise 1:
➢Consider the following numeric datasets and group
them into classes to create a frequency table.
➢For each dataset, determine the appropriate number of
intervals using Sturges' rule, calculate the class width,
and construct the frequency table showing intervals,
frequencies, cumulative frequencies, and relative
frequencies.

Data 1:
4,6,8,5,7,3,9,5,8,7,2,6,7,4,5,8,3,7,6,9

Data 2:
18,21,25,27,30,33,35,38,40,42,44,48,50,52,55,58,60,63,
65,68
Visualizing Quantitative data
Here are some visualization techniques commonly used for
numeric data:

Visualization        Best For                                         Example

Histogram            Data distribution                                Exam score distribution
Box Plot             Summary of spread and outliers                   Salary distribution
Line Graph           Trends over time                                 Monthly sales trends
Scatter Plot         Relationship between two variables               Height vs. weight
Frequency Polygon    Comparing data distributions, shape of data      Monthly temperature distribution
Ogive                Cumulative frequency and percentile analysis     Cumulative sales data
Stem-and-Leaf Plot   Showing individual data values and distribution  Test scores in a class
Information required for graphing a numerical variable
Class Limits   Class Boundaries   Class Marks   Class Width   Frequency
20.0-23.0      19.95-23.05        21.5          3.1           5
23.1-26.1      23.05-26.15        24.6          3.1           4
26.2-29.2      26.15-29.25        27.7          3.1           3
29.3-32.3      29.25-32.35        30.8          3.1           5
32.4-35.4      32.35-35.45        33.9          3.1           7
35.5-38.5      35.45-38.55        37.0          3.1           6

➢Class boundaries are essential for plotting the
histogram.

➢Class marks are essential for plotting the frequency
polygon and calculating averages for grouped data.
Calculation of the Class Boundaries
➢Class boundaries are calculated to remove gaps between intervals and
ensure continuous data representation. The boundary between class j and
class j + 1 is determined using the following formula:

$$CB = \frac{UCL_j + LCL_{j+1}}{2},$$
where
• $j = 1, 2, \dots, k - 1$ represents the class number,
• $UCL_j$ is the upper-class limit of class $j$,
• $LCL_{j+1}$ is the lower-class limit of class $j + 1$.

Lower-Class Boundary for the First Class:
➢Subtract the class width from the upper-class boundary of
the first class:
$$LCB_1 = UCB_1 - \text{Class Width}$$

Upper-Class Boundary for Each Class:
➢Add the class width to the previous upper-class boundary:
$$UCB_j = UCB_{j-1} + \text{Class Width}, \quad j = 2, \dots, k$$
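
The boundary rules above can be combined into one small routine. This is a sketch under the assumption that all classes share the same width, taken as the difference between the first two lower limits; the function name `class_boundaries` is hypothetical and the limits are those of the table with class width 3.1.

```python
def class_boundaries(limits, ndigits=2):
    """Return (lower boundary, upper boundary) pairs from (LCL, UCL) class limits.

    Assumption: every class has the same width, CW = LCL_2 - LCL_1.
    """
    k = len(limits)
    width = round(limits[1][0] - limits[0][0], ndigits)
    # Upper boundary of class j: midpoint between UCL_j and LCL_{j+1}
    upper = [round((limits[j][1] + limits[j + 1][0]) / 2, ndigits)
             for j in range(k - 1)]
    # The last class has no successor: add the class width to the previous boundary
    upper.append(round(upper[-1] + width, ndigits))
    # First lower boundary is UCB_1 - CW; after that, LCB_j equals UCB_{j-1}
    lower = [round(upper[0] - width, ndigits)] + upper[:-1]
    return list(zip(lower, upper))


limits = [(20.0, 23.0), (23.1, 26.1), (26.2, 29.2),
          (29.3, 32.3), (32.4, 35.4), (35.5, 38.5)]
for lcb, ucb in class_boundaries(limits):
    print(f"{lcb:.2f}-{ucb:.2f}")
```

Rounding to two decimals keeps floating-point noise out of the printed boundaries.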
Calculation of the Class Marks
➢The class mark (or midpoint) represents the center of
each class.
➢Class marks are calculated as follows:

$$CM = \frac{UCL_j + LCL_j}{2} \quad \text{or} \quad CM = \frac{UCB_j + LCB_j}{2},$$

where $j = 1, 2, \dots, k$ is the class number,
• $UCL_j$ is the upper-class limit of class $j$.
• $LCL_j$ is the lower-class limit of class $j$.
• $UCB_j$ is the upper-class boundary of class $j$.
• $LCB_j$ is the lower-class boundary of class $j$.
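
A short check that both class-mark formulas give the same result, using the class limits and class boundaries tabulated earlier in these notes (the variable names are hypothetical):

```python
limits = [(20.0, 23.0), (23.1, 26.1), (26.2, 29.2),
          (29.3, 32.3), (32.4, 35.4), (35.5, 38.5)]
boundaries = [(19.95, 23.05), (23.05, 26.15), (26.15, 29.25),
              (29.25, 32.35), (32.35, 35.45), (35.45, 38.55)]

# CM = (UCL_j + LCL_j) / 2
marks_from_limits = [round((lcl + ucl) / 2, 2) for lcl, ucl in limits]
# CM = (UCB_j + LCB_j) / 2
marks_from_boundaries = [round((lcb + ucb) / 2, 2) for lcb, ucb in boundaries]

print(marks_from_limits)                            # [21.5, 24.6, 27.7, 30.8, 33.9, 37.0]
print(marks_from_limits == marks_from_boundaries)   # True
```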
Calculation of the Class Width
➢Given class limits/intervals and/or class boundaries,
class width is calculated as follows:

$$CW = UCL_{j+1} - UCL_j \quad \text{or} \quad CW = LCL_{j+1} - LCL_j,$$

or

$$CW = UCB_{j+1} - UCB_j \quad \text{or} \quad CW = LCB_{j+1} - LCB_j,$$

where $j = 1, 2, \dots, k - 1$ is the class number,
• $UCL_j$ is the upper-class limit of class $j$.
• $UCL_{j+1}$ is the upper-class limit of class $j + 1$.
• $LCB_j$ is the lower-class boundary of class $j$.
• $LCB_{j+1}$ is the lower-class boundary of class $j + 1$.
Histogram
➢ A histogram represents the frequency distribution of a
numeric variable using class boundaries.

➢ Each bar represents the number of data points falling
within a specific class interval.

➢ The height of the bar indicates the frequency of data
points in that range.

➢ Histograms are useful because we can quickly see the
shape of the data.
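
To make the link between class boundaries, frequencies, and bar heights concrete, here is a plain-text histogram sketch in Python. It uses the class boundaries and frequencies tabulated earlier in these notes and draws horizontal bars of `#` characters instead of a graphical plot; the function name `text_histogram` is hypothetical.

```python
boundaries = [(19.95, 23.05), (23.05, 26.15), (26.15, 29.25),
              (29.25, 32.35), (32.35, 35.45), (35.45, 38.55)]
frequencies = [5, 4, 3, 5, 7, 6]


def text_histogram(boundaries, frequencies, symbol="#"):
    """One line per class: the class boundaries, then a bar whose length is the frequency."""
    lines = []
    for (lower, upper), freq in zip(boundaries, frequencies):
        lines.append(f"{lower:5.2f}-{upper:5.2f} | {symbol * freq} ({freq})")
    return lines


print("\n".join(text_histogram(boundaries, frequencies)))
```

Because consecutive classes share a boundary, the bars of a drawn histogram touch, with no gaps between them.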
Key observations
➢ The data ranges from 10 MPG to 35 MPG, indicating the
fuel efficiency of the vehicles.

➢ The highest frequency is observed in the 15–20 MPG range,
with around 12 vehicles, meaning this range represents the
typical fuel efficiency for most vehicles in this sample.

➢ The histogram appears right-skewed, with a long tail
towards higher MPG values.

➢ This suggests that most vehicles have moderate fuel
efficiency, while a few vehicles achieve higher MPG (above 30).

➢ Vehicles with 30–35 MPG are relatively rare, indicating
that highly fuel-efficient cars were uncommon in this period.

➢ There is a noticeable dip between 25–30 MPG, which may
indicate fewer vehicles in this range or a gap in the data.
How to Describe the Shape of Histograms
➢ A histogram has a bell-shaped/symmetric distribution if it
resembles a “bell” curve and has one single peak in the
middle of the distribution.
➢ The most common real-life example of this type of
distribution is the normal distribution.

➢ A histogram has a “uniform” distribution if every value in
a dataset occurs roughly the same number of times.
➢ This type of histogram often looks like a rectangle with
no clear peaks.
➢ The most common real-life example of this type of
distribution is the uniform distribution.

➢ A histogram has a “bimodal” distribution if it has two
distinct peaks.
➢ We often say that this type of distribution has multiple
modes: that is, multiple values occur most frequently in
the dataset.

➢ A histogram has a “multimodal” distribution if it has more
than two distinct peaks.

➢ A histogram has a left-skewed (negatively skewed)
distribution if it has a “tail” on the left side of the
distribution.

➢ A histogram has a right-skewed (positively skewed)
distribution if it has a “tail” on the right side of the
distribution.
Frequency Polygon
➢ A frequency polygon represents the frequency distribution
of a numeric variable using class midpoints/class marks.
➢ Each point on the graph corresponds to the frequency of
data in a specific class interval.
➢ The points are connected by straight lines, forming a
polygonal shape.
➢ Frequency polygons are useful for comparing multiple
distributions and visualizing the shape of the data.
Key observations
➢ The frequency polygon (black line) connects the midpoints
of each bar in the histogram.

➢ It provides a clear visualization of the overall shape of
the data.

➢ The data is approximately bell-shaped (normal distribution),
indicating that most scores are concentrated in the middle
range (around 15–20), with fewer scores at the extremes.

➢ The scores range from 5 to 35, with the highest frequency
occurring in the 15–20 range.

➢ The frequency polygon is useful for comparing the shape of
multiple distributions if more datasets were added.
OGIVE
➢An ogive is a graph representing a dataset's cumulative
frequency.

➢There are two types: the less-than (or less-than-or-equal-to)
ogive and the greater-than (or greater-than-or-equal-to) ogive.

➢The information below is essential for plotting the
cumulative frequency curve.
Class Intervals (CI)   Frequency   CI for less-than cumulative   <CF   CI for more-than cumulative   >CF
20.0-23.0              5           <23.05                        5     >19.95                        30
23.1-26.1              4           <26.15                        9     >23.05                        25
26.2-29.2              3           <29.25                        12    >26.15                        21
29.3-32.3              5           <32.35                        17    >29.25                        18
32.4-35.4              7           <35.45                        24    >32.35                        13
35.5-38.5              6           <38.55                        30    >35.45                        6
Less than Ogive (Cumulative Frequency)

➢A less than ogive (or cumulative frequency curve) is a
graph that shows the number of observations less
than and/or equal to a given value.

➢The curve always increases because cumulative
frequency never decreases.

➢It starts from the lowest upper-class boundary and
continues to the maximum value.

How it’s constructed:
➢Use upper-class boundaries on the x-axis.

➢Plot cumulative frequency on the y-axis.

➢The cumulative frequency is obtained by adding the first-class frequency to the
second-class frequency, and so on.

Use Case:
➢Helps determine percentiles, quartiles, and medians.
➢Answers questions like "How many students scored less than 70?"
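
The points of a less-than ogive can be generated with a running total. This sketch uses the upper-class boundaries and frequencies from the cumulative frequency table in these notes; the helper `count_below` is hypothetical and only accepts values that are themselves upper-class boundaries.

```python
from itertools import accumulate

upper_boundaries = [23.05, 26.15, 29.25, 32.35, 35.45, 38.55]
frequencies = [5, 4, 3, 5, 7, 6]

less_than_cf = list(accumulate(frequencies))            # 5, 9, 12, 17, 24, 30
ogive_points = list(zip(upper_boundaries, less_than_cf))


def count_below(boundary):
    """Number of observations below a given upper-class boundary."""
    return dict(ogive_points)[boundary]


for x, cf in ogive_points:
    print(f"< {x:5.2f}: {cf}")

# The median class is the first one whose cumulative frequency reaches n / 2
n = less_than_cf[-1]
median_class_upper = next(x for x, cf in ogive_points if cf >= n / 2)
print(count_below(29.25), median_class_upper)
```

With n = 30, the cumulative frequency first reaches 15 at the boundary 32.35, so the median lies in the class 29.3-32.3.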
More Than Ogive (cumulative frequency(CF))

➢A more than ogive is a graph that represents the number
of observations greater than and/or equal to a given value.

➢The curve always decreases as cumulative frequency
reduces.

➢It starts at the lowest lower-class boundary, where the
cumulative frequency equals the total, and moves downward.

How it’s constructed:
➢Use lower-class boundaries on the x-axis.

➢Plot cumulative frequency in reverse order (starting from the total
and subtracting down).

Use Case:
➢ Helps determine how many values are greater than a given
threshold.

➢ Answers questions like "How many students scored more than
70?"
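
The more-than cumulative frequencies are obtained by starting from the total and subtracting each class frequency in turn. A minimal Python sketch with the lower-class boundaries and frequencies from the cumulative frequency table in these notes (the variable names are hypothetical):

```python
lower_boundaries = [19.95, 23.05, 26.15, 29.25, 32.35, 35.45]
frequencies = [5, 4, 3, 5, 7, 6]

more_than_cf = []
remaining = sum(frequencies)        # every observation exceeds the first lower boundary
for freq in frequencies:
    more_than_cf.append(remaining)
    remaining -= freq               # drop the class just passed

for x, cf in zip(lower_boundaries, more_than_cf):
    print(f"> {x:5.2f}: {cf}")
```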
Cumulative Frequency Curve (Ogive)
Stem-and-Leaf Plots
➢Stem-and-leaf plots are used to display numerical
data while preserving the individual values.

➢There are several variations depending on how the
data is presented.

➢It helps in understanding the distribution, shape, and
spread of the data.

➢A longer tail of stems on one side indicates skewness.

➢Right-skewed (Positive Skew): More leaves on the
lower stems (at the beginning), with a tail toward the
higher stems.

➢Left-skewed (Negative Skew): More leaves on the
higher stems (at the end), with a tail toward the
lower stems.
Stem-and-Leaf Plots
When analyzing a stem-and-leaf plot, focus on these key aspects:
➢ The Range
✓This gives us the measure of spread in the dataset
➢ Clustering and Variability
✓Tightly packed data: Spread is low (less variation).
✓Widely spread data: Spread is high (greater variation).
✓Clusters: Groups of frequent values.
➢Skewness and Shape
✓Right-skewed (Positively skewed)
✓Left-skewed (Negatively skewed)
✓Symmetric.
✓Unimodal/Bimodal
➢Outliers:
✓Look for gaps and isolated values.
✓Gaps are missing values between stems indicating
uneven spread.
Simple Stem-and-Leaf Plot
Data: 18, 21, 25, 27, 30, 33, 35, 38, 40, 42, 44, 48, 50, 52, 55, 58, 60, 63,
65, 68
➢ This is the basic form, where data is divided into stems
(leading digits) and leaves (trailing digits).
Example:

1 | 8
2 | 1 5 7
3 | 0 3 5 8
4 | 0 2 4 8
5 | 0 2 5 8
6 | 0 3 5 8
➢ The stem represents the tens place.
➢ The leaves represent the ones place.
➢ The distribution is fairly symmetric.
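
A simple stem-and-leaf display for two-digit whole numbers can be built by splitting each value into its tens and ones digits. The sketch below uses the data listed on this slide; the function name `stem_and_leaf` is hypothetical, and it assumes non-negative integers with the tens digit as the stem.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one text line per stem (tens digit), with the sorted leaves (ones digits)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    # Include every stem between the smallest and largest so gaps stay visible
    stems = range(min(values) // 10, max(values) // 10 + 1)
    return [f"{s} | {' '.join(map(str, groups[s]))}".rstrip() for s in stems]


data = [18, 21, 25, 27, 30, 33, 35, 38, 40, 42,
        44, 48, 50, 52, 55, 58, 60, 63, 65, 68]
print("\n".join(stem_and_leaf(data)))
```

Every value in the data appears exactly once in the display, so the number of leaves equals the number of observations (20 here).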
Double Stem-and-Leaf Plots
➢ Used when a single stem has too many leaves.
➢ Each stem is split into two parts: one stem holds the
leaves 0-4 and the other (marked ∗) holds the leaves 5-9.
Example:

3∗ | 5 7
4  | 0 0 2 3
4∗ | 5 6 6 6 8 9
5  | 2 3 4
5∗ | 5 6 7 8 9
6  | 1 2 2 4
6∗ | 9
7  | 2 3
7∗ | 8
8  | 1

➢ List the data that correspond to the above stem-and-leaf display.

➢ Analyze the stem-and-leaf display above.
Analyzing the stem-and-leaf
3∗ | 5 7
4  | 0 0 2 3
4∗ | 5 6 6 6 8 9
5  | 2 3 4
5∗ | 5 6 7 8 9
6  | 1 2 2 4
6∗ | 9
7  | 2 3
7∗ | 8
8  | 1

➢ The spread of the data is measured by the range, which is
46 (81 − 35). However, the spread is not uniform, meaning
some areas are densely packed while others are sparser.
➢ Most data points fall between 40 and 64, showing moderate
variability.
➢ There are no major gaps or isolated values, suggesting no
significant outliers. The data appears slightly right-skewed,
as a few values extend further toward the higher end.
➢ Two distinct peaks are visible, indicating a bimodal
distribution.
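
The figures quoted in this analysis can be verified by expanding the display back into raw values. In this sketch each row is written as a (stem, leaves) pair; the split stems are simply entered as two rows with the same stem digit. The variable names are hypothetical.

```python
# Rows of the double stem-and-leaf display, top to bottom
rows = [
    (3, [5, 7]),
    (4, [0, 0, 2, 3]),
    (4, [5, 6, 6, 6, 8, 9]),
    (5, [2, 3, 4]),
    (5, [5, 6, 7, 8, 9]),
    (6, [1, 2, 2, 4]),
    (6, [9]),
    (7, [2, 3]),
    (7, [8]),
    (8, [1]),
]

values = [10 * stem + leaf for stem, leaves in rows for leaf in leaves]
data_range = max(values) - min(values)
in_middle = sum(40 <= v <= 64 for v in values)

print(values)
print(len(values), min(values), max(values), data_range)  # 29 35 81 46
print(in_middle)                                          # 22 of the 29 values
```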
Back-to-Back Stem-and-Leaf Plot
➢ Used to compare two datasets on the same stem.
➢ Useful for comparing distributions, such as before-and-after
studies.
Example:
Dataset A | Stem | Dataset B
      532 |  3   | 467999
     8761 |  4   | 359
   998720 |  5   | 14
➢ Left side: Leaves for Dataset A.
➢ Right side: Leaves for Dataset B.
➢ Analyze the stem-and-leaf display above.
➢ What can you conclude about the distribution of Dataset A and
Dataset B?
Other Stem-and-Leaf Plots
➢ There are various ways in which stem-and-leaf displays
can be modified.

➢ For instance, the stem labels or the leaves could be
two-digit numbers, so that
24 | 0 2 5 8 9
would represent the numbers 240, 242, 245, 248, and 249;

and

2 | 31 45 70 88
would represent the numbers 231, 245, 270, and 288.
Scatter Plot for two continuous variables
The purpose of the scatter plot is:
➢ To identify relationships (correlation) between two
continuous variables.

➢ To detect patterns, clusters, and outliers.

➢ To predict trends based on existing data.

➢ To explore the variability and spread of data points.
Identifying Relationships
What type of relationship (correlation) exists between the
two variables?
➢ Is it positive, negative, or no correlation?
Positive Correlation
➢ As one variable increases, the other also increases.
➢ Engine size vs. car weight (larger engines tend to be in
heavier cars).
Negative Correlation
➢ As one variable increases, the other decreases.
No Correlation
➢ No visible pattern between the variables.
➢ Car weight vs. driver’s age (the weight of a car doesn’t
depend on who drives it).
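
The direction of a relationship can also be checked numerically. The sketch below computes Pearson's correlation coefficient r, a standard summary that these notes do not introduce, so treat it as a supplementary illustration. All numbers are made-up values for illustration only, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def direction(r):
    """Describe only the sign of r; no strength thresholds are assumed."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


# Hypothetical cars (illustrative values, not real data)
engine_size = [1.0, 1.6, 2.0, 2.5, 3.0, 4.0]
car_weight = [900, 1100, 1300, 1450, 1600, 2000]
mpg = [40, 35, 30, 27, 24, 18]

print(direction(pearson_r(engine_size, car_weight)))  # positive
print(direction(pearson_r(engine_size, mpg)))         # negative
```

A value of r near +1 or −1 corresponds to points lying close to a straight line, while a value near 0 corresponds to no visible linear pattern.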
Data Variability & Spread, and Linear Trend
Does the data show a linear trend, or is it more scattered?
➢ Can a straight line fit the data well, or is the pattern
irregular?
Does the relationship appear to be strong, moderate, or
weak? Why?

Are the points closely clustered around an imaginary trend
line or spread apart?

➢ Tightly clustered points → Strong relationship between
variables.
➢ Scattered points → Weak or no relationship between
variables.
➢ Unequal spread on one side → The relationship may not be
linear (i.e., the trend changes at different levels).
Identifying Patterns, Clusters, & Detecting Outliers
Clusters: Sets of points that form distinct groups.

➢ Are there any visible clusters of data points? If yes,
around what engine size and weight?

What does clustering suggest about the data?

➢ This suggests subcategories in the data (e.g., SUVs,
sedans, and trucks forming separate clusters).

Are there any points that do not follow the general trend
(outliers)?

➢ Identify at least one data point that appears
significantly different.
Predicting Trends
If a trend exists, we can estimate future values based on
the pattern.
➢ Do you expect a car with a very large engine size (e.g.,
200) to weigh more or less than those with smaller
engines? Why?
If car weight increases with engine size, then for an engine
size of 200, we can predict an approximate car weight by
looking at similar points on the scatter plot.

➢ What additional variable could help explain variations in
car weight besides engine size? E.g., car type, fuel
efficiency, or number of cylinders.

➢ Would you expect this relationship to hold across all
types of vehicles (e.g., sports cars, trucks)? Why or
why not?
Test your knowledge
I. What type of relationship (correlation) exists between the two
variables?
II. Does the data show a linear trend, or is it more scattered?
III. Does the relationship appear to be strong, moderate, or
weak? Why?

IV. Are there any visible clusters of data points?
V. What does clustering suggest about the data?
VI. Are there any points that do not follow the general trend
(outliers)?
VII. Based on the trend, if a new car has an MPG of 32, what
would you predict its weight to be?
