MTH-262: Statistics and Probability Theory
Lecture-2 & 3

Dr. Shahid Hussain


Frequency Distribution and
Representation of Data – Part 1

• Overview of Data Representation


• Frequency Distribution (Concept and Example)
• Histograms (Definition and Example)
• Bar Charts (Definition and Example)
• Pie Charts (Definition and Example)
• Key Points for Data Visualization
• Summary and Practical Examples
Overview of Data Representation
Data representation refers to the methods
used to organize, store, and visualize data
in a structured way, making it easier to
understand and analyze. It is a crucial
aspect of statistics and data analysis,
especially when dealing with large or
complex datasets. Here's an overview of
key forms of data representation:
1. Tabular Representation
• Frequency Distribution Table: A table that
shows the frequency (or count) of data points
within specified intervals (classes). It is useful
for summarizing large datasets.
• Contingency Tables: These tables display the joint frequency
distribution of two or more categorical variables and help in
understanding the relationship between them.
2. Graphical Representation
• Bar Chart: Uses bars to show frequencies or proportions for
different categories. It is effective for categorical data.
• Pie Chart: A circular chart divided into sectors, representing
proportions. Each sector's size corresponds to the proportion of each
category in the dataset.
• Histogram: Similar to a bar chart but used for continuous data. It
shows the frequency distribution of a variable by grouping data into
intervals (bins).
• Line Chart: Displays data points connected by straight lines, used
for time series or continuous data to show trends.
• Scatter Plot: Shows the relationship between two continuous
variables by plotting data points on a two-dimensional graph.
3. Summary Statistics
• Measures of Central Tendency: Describes the center of a dataset. The
most common measures are:
– Mean: The average of all data points.
– Median: The middle value when data points are ordered.
– Mode: The most frequent value(s).
• Measures of Dispersion: Shows how data is spread out. Common
measures include:
– Range: The difference between the largest and smallest values.
– Variance: The average of the squared differences from the mean.
– Standard Deviation: The square root of the variance, showing
how much data deviates from the mean on average.
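The measures listed above can be computed directly in R. The following is a minimal sketch on a hypothetical sample (the vector `x` is an assumption, not lecture data). Note that R's `var()` and `sd()` divide by n - 1 (sample versions), whereas the "average of the squared differences" wording above corresponds to dividing by n, so both are shown.

```r
# Hypothetical sample (not from the lecture): task times in seconds
x <- c(12, 15, 15, 18, 20, 22, 25)

mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # middle value of the ordered data

# Base R has no statistical mode function; take the most frequent value(s)
counts <- table(x)
mode_x <- as.numeric(names(counts)[counts == max(counts)])

range_x <- max(x) - min(x)   # largest value minus smallest value

# Population variance: average squared deviation from the mean (divide by n)
pop_var <- mean((x - mean_x)^2)

# Sample variance and standard deviation: var() and sd() divide by n - 1
samp_var <- var(x)
samp_sd  <- sd(x)

print(c(mean = mean_x, median = median_x, range = range_x))
print(mode_x)
print(c(pop_var = pop_var, samp_var = samp_var, samp_sd = samp_sd))
```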
4. Box Plot (Box-and-Whisker Plot)
A graphical representation that shows the
distribution of data based on five summary
statistics: minimum, first quartile, median, third
quartile, and maximum. It helps in identifying
outliers and the spread of data.
5. Stem-and-Leaf Plot
A plot that retains the original data values while
showing the distribution. It is a cross between a
histogram and a table and is useful for small
datasets.
7. Time Series Plots
Specialized plots that show how a dataset evolves
over time, often used for data with a temporal
component like stock prices or sensor readings.
Frequency Distribution: Concept
A frequency distribution is a statistical tool
used to organize and summarize data. It shows
how frequently each value or range of values
(called classes or intervals) appears in a
dataset. The primary purpose of a frequency
distribution is to provide a concise overview of
the data, making it easier to observe patterns,
such as the most common or least common
values.
Components of a Frequency Distribution
• Class Intervals (or Bins): These are the ranges into which data is
grouped.
• Frequency: The count of data points that fall into each class interval.
• Cumulative Frequency: The running total of frequencies up to a
certain class interval.
• Relative Frequency: The proportion of the total dataset that falls
within each class interval, calculated by dividing the frequency by the
total number of data points.
• Cumulative Relative Frequency: The cumulative frequency divided
by the total number of data points, giving the running total of relative
frequencies.
RAW DATA
• Definition
– Data recorded in the sequence in which they are collected and before they are processed or ranked are called raw data.

Table 2.1 Ages of 50 students

21 19 24 25 29 34 26 27 37 33
18 20 19 22 19 19 25 22 25 23
25 19 31 19 23 18 23 19 23 26
22 28 21 20 22 22 21 20 19 21
25 23 18 37 27 23 21 25 21 24
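As a small illustration of moving from raw data to an organized form, the ages in Table 2.1 can be entered in R, ranked, and counted. This is only a sketch; the variable names are chosen here for illustration.

```r
# Ages of 50 students, entered row by row from Table 2.1 (raw data)
ages <- c(21, 19, 24, 25, 29, 34, 26, 27, 37, 33,
          18, 20, 19, 22, 19, 19, 25, 22, 25, 23,
          25, 19, 31, 19, 23, 18, 23, 19, 23, 26,
          22, 28, 21, 20, 22, 22, 21, 20, 19, 21,
          25, 23, 18, 37, 27, 23, 21, 25, 21, 24)

# Ranked data: the same values in ascending order
sorted_ages <- sort(ages)
print(sorted_ages)

# Ungrouped frequency distribution: how often each age occurs
age_freq <- table(ages)
print(age_freq)
```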

Table 2.2 Status of 50 Students

J F SO SE J J SE J J J
F F J F F F SE SO SE J
J F SE SO SO F J F SE SE
SO SE J SO SO J J SO F SO
SE SE F SE J SO F J SO SO

ORGANIZING AND GRAPHING
QUALITATIVE DATA
• Frequency Distributions
• Relative Frequency and Percentage
Distributions
• Graphical Presentation of Qualitative Data
– Bar Graphs
– Pie Charts

TABLE 2.3 Type of Employment Students Intend to Engage In

Type of Employment              Number of Students
Private companies/businesses    44
Federal government              16
State/local government          23
Own business                    17
                                Sum = 100

Type of employment is the variable and each row is a category. The "Number of Students" column is the frequency column, and each entry in it is a frequency.

Frequency Distributions
• Definition
– A frequency distribution for qualitative data lists all categories and the number of elements that belong to each of the categories.

Example 2-1
• A sample of 30 employees from large companies was selected, and these employees were asked how stressful their jobs were. The responses of these employees are recorded next, where very represents very stressful, somewhat means somewhat stressful, and none stands for not stressful at all.

Somewhat  None      Somewhat  Very      Very      None
Very      Somewhat  Somewhat  Very      Somewhat  Somewhat
Very      Somewhat  None      Very      None      Somewhat
Somewhat  Very      Somewhat  Somewhat  Very      None
Somewhat  Very      Very      Somewhat  None      Somewhat

Construct a frequency distribution table for these data.

Solution 2-1
Table 2.4 Frequency Distribution of Stress on Job

Stress on Job Tally Frequency (f)


Very |||| |||| 10
Somewhat |||| |||| |||| 14
None |||| | 6
Sum = 30

Relative Frequency and Percentage Distributions

• Calculating Relative Frequency of a Category

Relative frequency of a category = (Frequency of that category) / (Sum of all frequencies)

• Calculating Percentage

Percentage = (Relative frequency) · 100

Example 2-2
• Determine the relative frequency and percentage for the data in Table 2.4.

Solution 2-2
Table 2.5 Relative Frequency and Percentage
Distributions of Stress on Job

Stress on Job Relative Frequency Percentage


Very 10/30 = .333 .333(100) = 33.3
Somewhat 14/30 = .467 .467(100) = 46.7
None 6/30 = .200 .200(100) = 20.0
Sum = 1.00 Sum = 100
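
The calculation in Table 2.5 can be reproduced in R. The sketch below rebuilds the 30 responses from the frequencies in Table 2.4 (using `rep()` rather than retyping the raw responses, which is a shortcut assumed here) and then derives the relative frequencies and percentages.

```r
# Rebuild the 30 responses from the frequencies in Table 2.4
stress <- factor(rep(c("Very", "Somewhat", "None"), times = c(10, 14, 6)),
                 levels = c("Very", "Somewhat", "None"))

freq     <- table(stress)       # frequency of each category
rel_freq <- prop.table(freq)    # frequency / sum of all frequencies
pct      <- 100 * rel_freq      # percentage = relative frequency * 100

# Collect the three columns in one table, rounded as in Table 2.5
stress_table <- data.frame(
  category           = names(freq),
  f                  = as.vector(freq),
  relative_frequency = round(as.vector(rel_freq), 3),
  percentage         = round(as.vector(pct), 1)
)
print(stress_table)
```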

Example R Code for Categorical Data Frequency Distribution
Let's assume you have survey data where respondents are asked about their favorite programming language:

# Sample dataset: Favorite programming languages of respondents
languages <- c("Python", "R", "Python", "Java", "R", "Python", "C++", "R",
               "Java", "C++", "Python", "Java", "R", "C++", "Python")

# Create a frequency distribution table
freq_table <- table(languages)

# Print the frequency table
print(freq_table)

# Optionally, display the relative frequencies
rel_freq <- prop.table(freq_table)
print(rel_freq)

table() constructs a frequency distribution for categorical data.
prop.table() computes the relative frequencies from that table. Together they summarize the distribution of qualitative data across categories such as product preferences, survey results, or any other categorical classification.

Graphical Presentation of Qualitative Data

• Definition
– A graph made of bars whose heights represent the frequencies of respective categories is called a bar graph.

Figure 2.1 Bar graph for the frequency distribution of Table 2.4 (x-axis: Stress on Job, with categories Very, Somewhat, None; y-axis: Frequency, 0 to 16).
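
A bar graph like the one described in Figure 2.1 can be drawn with base R's `barplot()`. This is a minimal sketch using the frequencies from Table 2.4; the title and colour are choices made here, not taken from the slide.

```r
# Frequencies from Table 2.4
freq <- c(Very = 10, Somewhat = 14, None = 6)

# Bar heights are the category frequencies; barplot() returns the bar midpoints
bar_mids <- barplot(freq,
                    main = "Stress on Job",
                    xlab = "Stress on Job",
                    ylab = "Frequency",
                    ylim = c(0, 16),
                    col  = "lightblue")
```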

Graphical Presentation of Qualitative Data cont.

• Definition
– A circle divided into portions that represent the relative frequencies or percentages of a population or a sample belonging to different categories is called a pie chart.

Table 2.6 Calculating Angle Sizes for the Pie Chart

Stress on Job   Relative Frequency   Angle Size
Very            .333                 360(.333) = 119.88
Somewhat        .467                 360(.467) = 168.12
None            .200                 360(.200) = 72.00
                Sum = 1.00           Sum = 360

Figure 2.2 Pie chart for the percentage distribution of Table 2.5 (sectors: Very, 33.3%; Somewhat, 46.7%; None, 20%).
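
The angle sizes and the pie chart can be sketched in R as follows. Because the code uses unrounded relative frequencies, the angles come out as 120, 168 and 72 degrees rather than the 119.88 and 168.12 obtained from the rounded values in Table 2.6.

```r
# Relative frequencies from the counts in Table 2.4
rel_freq <- c(Very = 10, Somewhat = 14, None = 6) / 30

# Angle of each sector: 360 degrees times the relative frequency
angles <- 360 * rel_freq
print(angles)

# Label each sector with its category and percentage, then draw the chart
sector_labels <- paste0(names(rel_freq), ", ", round(100 * rel_freq, 1), "%")
pie(rel_freq, labels = sector_labels, main = "Stress on Job")
```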

Graphical Presentation of Data
TYPES OF DATA

Qualitative
– Univariate: Frequency Table, Percentages, Pie Chart, Bar Chart
– Bivariate: Frequency Table, Component Bar Chart, Multiple Bar Chart
Quantitative
– Discrete: Frequency Distribution, Line Chart
– Continuous: Frequency Distribution, Histogram, Frequency Polygon, Frequency Curve
Presentation of Qualitative Data
Qualitative
– Univariate: Frequency Table, Percentages, Pie Chart, Bar Chart
– Bivariate: Frequency Table, Component Bar Chart, Multiple Bar Chart


For example, consider the medium of instruction of each student at school level.

We will have an array of observations as follows:
U, U, E, U, E, E, E, U, ……
(U: URDU MEDIUM)
(E: ENGLISH MEDIUM)
This will result in the following table:

Medium of Institution   No. of Students (f)
Urdu                    719
English                 481
Total                   1200

[Bar chart of the number of students by medium of institution: Urdu 719, English 481.]

Dividing the cell frequencies by the total frequency and multiplying by 100, we obtain the following:

Medium of Institution   f      %
Urdu                    719    59.9 ≈ 60%
English                 481    40.1 ≈ 40%
Total                   1200

PIE CHART

Medium of Institution   f      Angle
Urdu                    719    215.7°
English                 481    144.3°
Total                   1200

[Pie chart with two sectors: Urdu 215.7°, English 144.3°.]

Bivariate Data:
Suppose that along with the enquiry about the Medium of Institution, you are also recording the gender of the student.

Student No.   Medium   Gender
1             U        F
2             U        M
3             E        M
4             U        F
5             E        M
6             E        F
7             U        M
8             E        M
:             :        :
:             :        :

Medium \ Sex   Male   Female   Total
Urdu           202    517      719
English        350    131      481
Total          552    648      1200

COMPONENT BAR CHART:

[Component bar chart: one bar each for Male and Female, each divided into Urdu and English components; vertical axis 0 to 800.]
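
A component (stacked) bar chart of this two-way table can be sketched in base R. `barplot()` stacks the rows of a matrix within each column by default, so the matrix below is laid out with Medium in rows and Sex in columns; the object names are chosen here for illustration.

```r
# Two-way table: rows = medium, columns = sex (matrix() fills column by column)
counts <- matrix(c(202, 350, 517, 131), nrow = 2,
                 dimnames = list(Medium = c("Urdu", "English"),
                                 Sex    = c("Male", "Female")))

# Row and column totals, as in the table above
print(addmargins(counts))

# One bar per sex, each split into Urdu and English components
bar_mids <- barplot(counts,
                    legend.text = TRUE,
                    main = "Medium of Institution by Sex",
                    ylab = "No. of Students",
                    ylim = c(0, 800))
```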
SIMPLE BAR CHART:

Suppose we have available to us information regarding the turnover of a company for 5 years as given in the table below:

Years               1965     1966     1967     1968     1969
Turnover (Rupees)   35,000   42,000   43,500   48,000   48,500

[Simple bar chart of turnover by year, 1965 to 1969; vertical axis 0 to 50,000 rupees.]
MULTIPLE BAR CHART
Suppose we have information regarding the imports and exports of Pakistan for the years 1970-71 to 1974-75 as shown in the table below:

Years      Imports (Crores of Rs.)   Exports (Crores of Rs.)
1970-71    370                       200
1971-72    350                       337
1972-73    840                       855
1973-74    1438                      1016
1974-75    2092                      1029

[Figure: Multiple Bar Chart Showing Imports & Exports of Pakistan 1970-71 to 1974-75. Paired bars for Imports and Exports in each year; vertical axis 0 to 2500 crores of Rs.]
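
A multiple bar chart of the imports and exports table can be sketched with `barplot(..., beside = TRUE)`, which places the bars of each year side by side instead of stacking them. The object name `trade` and the legend position are choices made here.

```r
# Imports and exports (crores of Rs.) from the table above
trade <- rbind(Imports = c(370, 350, 840, 1438, 2092),
               Exports = c(200, 337, 855, 1016, 1029))
colnames(trade) <- c("1970-71", "1971-72", "1972-73", "1973-74", "1974-75")

# beside = TRUE draws one pair of bars (Imports, Exports) per year
bar_mids <- barplot(trade,
                    beside = TRUE,
                    legend.text = TRUE,
                    args.legend = list(x = "topleft"),
                    main = "Imports & Exports of Pakistan 1970-71 to 1974-75",
                    ylab = "Crores of Rs.",
                    ylim = c(0, 2500))
```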
ORGANIZING AND GRAPHING QUANTITATIVE DATA

• Frequency Distributions
• Constructing Frequency Distribution
Tables
• Relative and Percentage Distributions
• Graphing Grouped Data
– Histograms
– Polygons
Frequency Distributions
Table 2.7 Weekly Earnings of 100 Employees of a Company

Weekly Earnings (dollars)   Number of Employees (f)
401 to 600                  9
601 to 800                  22
801 to 1000                 39
1001 to 1200                15
1201 to 1400                9
1401 to 1600                6

Weekly earnings is the variable and the number of employees is the frequency column. The third class is 801 to 1000 and its frequency is 39. For the sixth class, 1401 is the lower limit and 1600 is the upper limit.

Frequency Distributions cont.
• Definition
– A frequency distribution for quantitative data lists all the classes and the number of values that belong to each class. Data presented in the form of a frequency distribution are called grouped data.
Example
Let’s say we have data representing the response times (in milliseconds) of a server to user requests. The dataset contains the following times:
102, 150, 178, 199, 102, 220, 145, 178, 204, 190, 175, 160, 120, 210,
198, 145, 102, 150

Step 1: Create Class Intervals
First, we define intervals (or bins) to group the data. Let’s use a bin width of 20 milliseconds:
• 100–119
• 120–139
• 140–159
• 160–179
• 180–199
• 200–219
• 220–239
Step 2: Count the Frequencies
Now, we count how many values fall within each interval.

Class Interval   Frequency
100–119          3
120–139          1
140–159          4
160–179          4
180–199          3
200–219          2
220–239          1
Total            18

Step 3: Calculate Relative Frequency (Optional)
Relative frequency gives us the proportion of data points in each class. Here’s how you calculate it:

Relative frequency of a class = (Frequency of that class) / (Total number of data points)

For example, the relative frequency for the class 100–119 is 3/18 ≈ 0.167.
Example R Code for Frequency Distribution

# Sample dataset
data <- c(102, 150, 178, 199, 102, 220, 145, 178, 204, 190, 175, 160,
          120, 210, 198, 145, 102, 150)

# Define the class boundaries (breaks) for intervals of width 20
breaks <- seq(100, 240, by = 20)

# Assign each value to a class interval using the 'cut' function
freq_dist <- cut(data, breaks, right = FALSE)

# Create a table for the frequency distribution
freq_table <- table(freq_dist)

# Print the frequency distribution table
print(freq_table)

# Optionally, display the relative frequencies
rel_freq <- prop.table(freq_table)
print(rel_freq)

Explanation of right=FALSE:
right=FALSE: The intervals are closed on the left and open on the right. For example, an interval [100, 120) includes values from 100 up to, but not including, 120.
right=TRUE (default behavior): The intervals are closed on the right and open on the left. For example, an interval (100, 120] includes values greater than 100 and up to 120.

The prop.table(freq_table) function in R converts the frequency table into a proportional (relative) frequency table. It shows the proportion of the total dataset that falls into each category (or bin), rather than the raw count (frequency).
Example 2-3
• Table 2.9 gives the total home runs hit by all players of each of the 30 Major League Baseball teams during the 2002 season. Construct a frequency distribution table.

Table 2.9 Home Runs Hit by Major League Baseball
Teams During the 2002 Season

Team Home Runs Team Home Runs

Anaheim 152 Milwaukee 139


Arizona 165 Minnesota 167
Atlanta 164 Montreal 162
Baltimore 165 New York Mets 160
Boston 177 New York Yankees 223
Chicago Cubs 200 Oakland 205
Chicago White Sox 217 Philadelphia 165
Cincinnati 169 Pittsburgh 142
Cleveland 192 St. Louis 175
Colorado 152 San Diego 136
Detroit 124 San Francisco 198
Florida 146 Seattle 152
Houston 167 Tampa Bay 133
Kansas City 140 Texas 230
Los Angeles 155 Toronto 187

Solution 2-3

The largest value in Table 2.9 is 230 and the smallest is 124. Using 5 classes:

Approximate width of each class = (230 − 124) / 5 = 21.2

Now we round this approximate width to a convenient number, say 22.

Solution 2-3
The lower limit of the first class can be taken as 124
or any number less than 124. Suppose we take 124
as the lower limit of the first class. Then our classes
will be
124 – 145, 146 – 167, 168 – 189, 190 – 211,
and 212 - 233
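
The class construction above can be sketched in R. The sketch assumes 5 classes of width 22 starting at 124, as chosen in the solution, and uses `cut()` with `right = FALSE` so that each class includes its lower limit; the variable names are illustrative.

```r
# Home runs from Table 2.9 (left column first, then right column)
home_runs <- c(152, 165, 164, 165, 177, 200, 217, 169, 192, 152,
               124, 146, 167, 140, 155, 139, 167, 162, 160, 223,
               205, 165, 142, 175, 136, 198, 152, 133, 230, 187)

n_classes    <- 5
approx_width <- (max(home_runs) - min(home_runs)) / n_classes  # (230 - 124) / 5
width        <- 22                                             # rounded up for convenience

# Lower and upper class limits: 124-145, 146-167, ..., 212-233
lower_limits <- seq(124, by = width, length.out = n_classes)
upper_limits <- lower_limits + width - 1
class_labels <- paste(lower_limits, upper_limits, sep = " - ")

# Break points 124, 146, ..., 234; each class is [lower, next lower)
breaks  <- c(lower_limits, max(upper_limits) + 1)
classes <- cut(home_runs, breaks = breaks, right = FALSE, labels = class_labels)

freq_table <- table(classes)
print(freq_table)
```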

Table 2.10 Frequency Distribution for the Data of Table 2.9

Total Home Runs   Tally            f
124 – 145         |||| |           6
146 – 167         |||| |||| |||    13
168 – 189         ||||             4
190 – 211         ||||             4
212 – 233         |||              3
                                   ∑f = 30
Relative Frequency and Percentage Distributions

Relative frequency of a class = (Frequency of that class) / (Sum of all frequencies) = f / ∑f

Percentage = (Relative frequency) · 100

Example 2-4
• Calculate the relative frequencies and percentages for Table 2.10.

Solution 2-4
Table 2.11 Relative Frequency and Percentage Distributions for Table 2.10

Total Home Runs   Class Boundaries            Relative Frequency   Percentage
124 – 145         123.5 to less than 145.5    .200                 20.0
146 – 167         145.5 to less than 167.5    .433                 43.3
168 – 189         167.5 to less than 189.5    .133                 13.3
190 – 211         189.5 to less than 211.5    .133                 13.3
212 – 233         211.5 to less than 233.5    .100                 10.0

Graphing Grouped Data

• Definition
– A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis. The frequencies, relative frequencies, or percentages are represented by the heights of the bars. In a histogram, the bars are drawn adjacent to each other.

Figure 2.3 Frequency histogram for Table 2.10 (x-axis: Total home runs, classes 124–145 through 212–233; y-axis: Frequency, 0 to 15).
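
A histogram matching the classes of Table 2.10 can be sketched with `hist()` by passing the class boundaries as explicit breaks. Relative frequencies are then obtained from the counts; drawing them with `barplot(space = 0)` is one simple option assumed here, since `hist(freq = FALSE)` plots densities rather than relative frequencies.

```r
# Home runs from Table 2.9
home_runs <- c(152, 165, 164, 165, 177, 200, 217, 169, 192, 152,
               124, 146, 167, 140, 155, 139, 167, 162, 160, 223,
               205, 165, 142, 175, 136, 198, 152, 133, 230, 187)

# Class boundaries 123.5, 145.5, ..., 233.5 (width 22)
boundaries <- seq(123.5, 233.5, by = 22)

# Frequency histogram with adjacent bars, one per class
h <- hist(home_runs,
          breaks = boundaries,
          main = "Frequency histogram for Table 2.10",
          xlab = "Total home runs",
          ylab = "Frequency",
          col = "lightblue",
          border = "black")

# Relative frequency histogram: same shape, heights divided by the total
rel_freq <- h$counts / sum(h$counts)
barplot(rel_freq, space = 0,
        names.arg = c("124-145", "146-167", "168-189", "190-211", "212-233"),
        xlab = "Total home runs", ylab = "Relative Frequency",
        ylim = c(0, 0.5))
```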
Figure 2.4 Relative frequency histogram for Table 2.10 (x-axis: Total home runs, classes 124–145 through 212–233; y-axis: Relative Frequency, 0 to .50).
Graphing Grouped Data cont.

• Definition
– A graph formed by joining the midpoints of the tops of successive bars in a histogram with straight lines is called a polygon.

Example R Code for Histogram

# Example dataset: Task completion times in seconds
data <- c(5, 10, 12, 15, 17, 18, 19, 20, 25, 30, 30, 32, 35, 40, 45, 45, 50)

# Create the histogram
hist(data,
     breaks = 5,        # Suggested number of bins; R treats a single number as a hint
                        # (use 'seq()' to specify custom bin ranges)
     main = "Histogram of Task Completion Times",  # Title of the histogram
     xlab = "Time (Seconds)",                      # X-axis label
     ylab = "Frequency",                           # Y-axis label
     col = "lightblue",                            # Color of the bars
     border = "black")                             # Border color of the bars
Figure 2.5 Frequency polygon for Table 2.10 (x-axis: Total home runs, classes 124–145 through 212–233; y-axis: Frequency, 0 to 15).
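
A frequency polygon for Table 2.10 can be sketched by plotting each class frequency above its class midpoint and joining the points with straight lines. Closing the polygon with a zero-frequency class at each end is a common convention and is an assumption here, not something stated on the slide.

```r
# Frequencies and class boundaries from Tables 2.10 and 2.11
freq       <- c(6, 13, 4, 4, 3)
boundaries <- seq(123.5, 233.5, by = 22)

# Class midpoints: average of the lower and upper boundary of each class
midpoints <- (head(boundaries, -1) + tail(boundaries, -1)) / 2

# Assumed convention: add an empty class on each side so the polygon
# starts and ends on the horizontal axis
x <- c(midpoints[1] - 22, midpoints, midpoints[length(midpoints)] + 22)
y <- c(0, freq, 0)

plot(x, y, type = "o",
     main = "Frequency polygon for Table 2.10",
     xlab = "Total home runs",
     ylab = "Frequency",
     ylim = c(0, 15))
```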
SHAPES OF HISTOGRAMS
1. Symmetric
2. Skewed
3. Uniform or rectangular

Figure 2.8 Symmetric histograms.
Figure 2.9 (a) A histogram skewed to the right. (b) A histogram skewed to the left.
Figure 2.10 A histogram with uniform distribution.
Figure 2.11 (a) and (b) Symmetric frequency curves. (c) Frequency curve skewed to the right. (d) Frequency curve skewed to the left.

CUMULATIVE FREQUENCY DISTRIBUTIONS

• Definition
– A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.

Example 2-7
• Using the frequency distribution of Table 2.10, reproduced below, prepare a cumulative frequency distribution for the home runs hit by Major League Baseball teams during the 2002 season.

Total Home Runs   f
124 – 145         6
146 – 167         13
168 – 189         4
190 – 211         4
212 – 233         3

Solution 2-7
Table 2.14 Cumulative Frequency Distribution of Home Runs by Baseball Teams

Class Limits   Class Boundaries            Cumulative Frequency
124 – 145      123.5 to less than 145.5    6
124 – 167      123.5 to less than 167.5    6 + 13 = 19
124 – 189      123.5 to less than 189.5    6 + 13 + 4 = 23
124 – 211      123.5 to less than 211.5    6 + 13 + 4 + 4 = 27
124 – 233      123.5 to less than 233.5    6 + 13 + 4 + 4 + 3 = 30

CUMULATIVE FREQUENCY DISTRIBUTIONS cont.

• Calculating Cumulative Relative Frequency and Cumulative Percentage

Cumulative relative frequency = (Cumulative frequency of a class) / (Total observations in the data set)

Cumulative percentage = (Cumulative relative frequency) · 100

Table 2.15 Cumulative Relative Frequency and Cumulative Percentage Distributions for Home Runs Hit by Baseball Teams

Class Limits   Cumulative Relative Frequency   Cumulative Percentage
124 – 145      6/30 = .200                     20.0
124 – 167      19/30 = .633                    63.3
124 – 189      23/30 = .767                    76.7
124 – 211      27/30 = .900                    90.0
124 – 233      30/30 = 1.00                    100.0
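
The cumulative columns of Tables 2.14 and 2.15 can be reproduced in R with `cumsum()`. This is a minimal sketch starting from the class frequencies of Table 2.10; the data frame layout is a choice made here.

```r
# Class frequencies from Table 2.10 and the upper class boundaries
freq             <- c(6, 13, 4, 4, 3)
upper_boundaries <- c(145.5, 167.5, 189.5, 211.5, 233.5)

cum_freq <- cumsum(freq)          # running total of frequencies
cum_rel  <- cum_freq / sum(freq)  # cumulative frequency / total observations
cum_pct  <- 100 * cum_rel         # cumulative percentage

cumulative_table <- data.frame(
  below                  = upper_boundaries,
  cumulative_frequency   = cum_freq,
  cumulative_rel_freq    = round(cum_rel, 3),
  cumulative_percentage  = round(cum_pct, 1)
)
print(cumulative_table)
```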

CUMULATIVE FREQUENCY DISTRIBUTIONS cont.

• Definition
– An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of respective classes.
Figure 2.12 Ogive for the cumulative frequency distribution in Table 2.14 (x-axis: Total home runs, class boundaries 123.5, 145.5, 167.5, 189.5, 211.5, 233.5; y-axis: Cumulative frequency).
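
An ogive for Table 2.14 can be sketched by plotting the cumulative frequencies above the upper class boundaries and joining them with straight lines. Starting the curve at zero above the first lower boundary (123.5) is assumed here as the usual convention.

```r
# Class boundaries and cumulative frequencies from Table 2.14
boundaries <- c(123.5, 145.5, 167.5, 189.5, 211.5, 233.5)
cum_freq   <- c(0, cumsum(c(6, 13, 4, 4, 3)))  # 0 at the first lower boundary (assumed)

# Dots above the boundaries, joined with straight lines
plot(boundaries, cum_freq, type = "o",
     main = "Ogive for Table 2.14",
     xlab = "Total home runs",
     ylab = "Cumulative frequency",
     xaxt = "n")
axis(1, at = boundaries)  # label the axis at the class boundaries
```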
STEM-AND-LEAF DISPLAYS
• Definition
– In a stem-and-leaf display of quantitative data, each value is divided into two portions: a stem and a leaf. The leaves for each stem are shown separately in a display.

Example 2-8
• The following are the scores of 30 college students on a statistics test:
75 52 80 96 65 79 71 87 93 95
69 72 81 61 76 86 79 68 50 92
83 84 77 64 71 87 72 92 57 98

• Construct a stem-and-leaf display.

Solution 2-8
• To construct a stem-and-leaf display for these scores, we split each score into two parts. The first part contains the first digit, which is called the stem. The second part contains the second digit, which is called the leaf.
• We observe from the data that the stems for all scores are 5, 6, 7, 8, and 9 because all the scores lie in the range 50 to 98.

Figure 2.13 Stem-and-leaf display.

Stems
5 | 2    (leaf for 52)
6 |
7 | 5    (leaf for 75)
8 |
9 |

Solution 2-8
• After we have listed the stems, we read the leaves for all scores and record them next to the corresponding stems on the right side of the vertical line.

Figure 2.14 Stem-and-leaf display of test scores.

5 | 2 0 7
6 | 5 9 1 8 4
7 | 5 9 1 2 6 9 7 1 2
8 | 0 7 1 6 3 4 7
9 | 6 3 5 2 2 8

Figure 2.15 Ranked stem-and-leaf display of test scores.

5 | 0 2 7
6 | 1 4 5 8 9
7 | 1 1 2 2 5 6 7 9 9
8 | 0 1 3 4 6 7 7
9 | 2 2 3 5 6 8
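
The same displays can be sketched in R. Base R's `stem()` prints a stem-and-leaf display directly, although it chooses its own stem units and may not match the hand-drawn layout. The manual construction underneath (splitting each score into tens and units) is an illustrative sketch of the procedure in Solution 2-8.

```r
# Test scores from Example 2-8
scores <- c(75, 52, 80, 96, 65, 79, 71, 87, 93, 95,
            69, 72, 81, 61, 76, 86, 79, 68, 50, 92,
            83, 84, 77, 64, 71, 87, 72, 92, 57, 98)

# Built-in display (layout chosen by R)
stem(scores)

# Manual construction: stem = tens digit, leaf = units digit
leaves <- split(scores %% 10, scores %/% 10)  # leaves in the order they are read
ranked <- lapply(leaves, sort)                # ranked display: leaves sorted

# Print the ranked display, one stem per line
for (s in names(ranked)) {
  cat(s, "|", ranked[[s]], "\n")
}
```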

Example 2-9
• The following data are monthly rents paid by a sample of 30 households selected from a small city.

880   1081  721   1075  1023  775   1235  750   965   960
1210  985   1231  932   850   825   1000  915   1191  1035
1151  630   1175  952   1100  1140  750   1140  1370  1280

• Construct a stem-and-leaf display for these data.

Solution 2-9
Figure 2.16 Stem-and-leaf display of rents.

 6 | 30
 7 | 75 50 21 50
 8 | 80 25 50
 9 | 32 52 15 60 85 65
10 | 23 81 35 75 00
11 | 91 51 40 75 40 00
12 | 10 31 35 80
13 | 70

Example 2-10
• The following stem-and-leaf display is prepared for the number of hours that 25 students spent working on computers during the last month.

0 | 6
1 | 1 7 9
2 | 2 6
3 | 2 4 7 8
4 | 1 5 6 9 9
5 | 3 6 8
6 | 2 4 4 5 7
7 |
8 | 5 6

• Prepare a new stem-and-leaf display by grouping the stems.

Solution 2-10

Figure 2.17 Grouped stem-and-leaf display.

0–2 | 6 * 1 7 9 * 2 6
3–5 | 2 4 7 8 * 1 5 6 9 9 * 3 6 8
6–8 | 2 4 4 5 7 * * 5 6

Assignment # 1
• Download any large dataset from the internet that includes both quantitative and qualitative variables. Write down the link of the website from which the dataset was downloaded (for verification). Each student must have a different dataset.
• Describe the dataset with respect to each variable.
• Make frequency distributions separately for the quantitative and qualitative variables in R, and write down the script and results on a simple Word page (or add a screenshot).
• Draw all possible graphs for the quantitative and qualitative variables in R, and write down the script and results on a simple Word page (or add a screenshot).
