Module 2 data collection
DATA MANAGEMENT
PREPARATION
What to do: Riddle me this!
Instruction: Read each riddle, rearrange the jumbled word, and figure out the answer.
1. I tell how often something appears, in a count or a rate. What am I? ... - Frequency
2. I'm a part of a whole, expressed out of 100. What am I? ... - Percentage
3. I am raw and unorganized, facts and figures. What am I? ... - Data
4. The art of handling information. What am I? ... - Data Management
5. I'm the gap between the highest and lowest, a span to define. What am I? ... - Range
6. I'm a fixed distance between points, a measured space in time or size. What am I? ... - Interval
7. I'm a running total, adding as I go. What am I? ... - Cumulative
PRESENTATION
A. Organization of Data
Statistical research requires organizing collected data for meaningful analysis. Frequency
distributions, grouping data into categories showing observation counts, are commonly used for
this purpose. Data presentation, often through graphs and charts, is crucial for study
interpretation. Understanding key terms is essential before constructing frequency distributions
and visualizing data using graphs and charts.
● Range is the difference between the highest value and the lowest value in a distribution.
● Class Limits (or Apparent Limits) are the highest and lowest values describing a class.
● Class Boundaries (or Real Limits) are the upper and lower values of a class in a grouped
frequency distribution. They have one more decimal place than the class limits and end with
the digit 5.
● Interval (or width) is the distance between the class lower boundary and the class upper
boundary and it is denoted by the symbol i.
● Midpoint is the point halfway between the class limits of each class and is
representative of the data within that class.
A grouped frequency distribution is used when the range of the data set is large; the data
must be grouped into classes whether it is categorical data or interval data.
Example 1: Construct a frequency distribution using the data of twenty applicants who were
given a performance evaluation appraisal.
SOLUTION:
Class      Tally        Frequency
High       IIIII-II     7
Average    IIIII-III    8
Low        IIIII        5
Total                   20
Step 4: Determine the percentage. The percentage is computed using the formula:
$\text{Percentage} = \frac{f}{n} \times 100\%$
where f = frequency of the class and n = total number of values.
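The percentage formula can be checked with a short Python sketch. It assumes the Example 1 frequencies (High 7, Average 8, Low 5); the function name `percentages` is only illustrative.

```python
# Frequencies from Example 1 (20 applicants in total).
frequencies = {"High": 7, "Average": 8, "Low": 5}


def percentages(freqs):
    """Return {class: (f / n) * 100}, where n is the total frequency."""
    n = sum(freqs.values())
    return {label: 100 * f / n for label, f in freqs.items()}


result = percentages(frequencies)
for label, pct in result.items():
    print(f"{label}: {pct:.0f}%")
```

The percentages should add up to 100%, which is a quick way to check the tally.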
Generally, the number of classes for a frequency distribution table varies from 5 to 20,
depending primarily on the number of observations in the data set. It is preferred to have more
classes as the size of a data set increases. The decision about the number of classes depends
on the method used by the researcher.
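1. Rule 1. One way to determine the number of classes is to use the smallest positive integer k
such that $2^k \geq n$, where n is the total number of observations. The class interval is then
$i = \frac{\text{Range}}{k} = \frac{HV - LV}{k}$
where HV is the highest value and LV is the lowest value.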
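2. Rule 2. Another way to determine the class interval is by applying Sturges' formula:
$i = \frac{\text{Range}}{1 + 3.322 \log N}$
where N is the total number of observations and log is the base-10 logarithm.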
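The two rules can be sketched in Python as follows. This is a minimal sketch; the function names are hypothetical, and the sample inputs (highest value 77, lowest value 18, N = 50, and n = 80) are taken from the examples later in this module.

```python
import math


def classes_by_2k_rule(n):
    """Rule 1: smallest positive integer k such that 2**k >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k


def interval_rule1(highest, lowest, n):
    """Rule 1 interval: i = Range / k."""
    return (highest - lowest) / classes_by_2k_rule(n)


def interval_sturges(highest, lowest, n):
    """Rule 2 interval: i = Range / (1 + 3.322 * log10(N))."""
    return (highest - lowest) / (1 + 3.322 * math.log10(n))


print(classes_by_2k_rule(80))                   # number of classes for n = 80
print(round(interval_sturges(77, 18, 50), 2))   # Sturges interval before rounding up
```

Both rules return an unrounded interval; the value is then rounded up to a convenient width.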
Example 2: Suppose a researcher wished to do a study on the monthly salary of young
professionals of selected companies in Makati City. The researcher first would have to collect
the data by asking each young professional about his monthly salary. The data collected in
original form is called raw data. In this case, the data are
24,300 21,900 26,300 27,500 30,400 17,800 34,500 25,300 20,750 23,400
23,700 21,600 25,900 26,900 29,300 15,700 30,700 24,750 18,700 22,800
23,700 21,300 30,500 24,500 15,500 22,000 30,700 24,700 18,400 22,750
24,100 21,900 26,200 27,400 30,200 23,400 32,400 25,150 20,500 23,200
17,950 21,800 26,100 27,300 30,100 17,300 32,100 25,000 20,000 23,000
18,350 20,800 25,400 26,500 27,600 14,050 25,700 26,800 27,900 23,850
23,500 21,000 25,600 26,500 27,800 14,300 30,650 24,600 17,400 22,600
23,700 21,750 26,000 27,000 29,500 17,000 30,750 25,000 18,800 22,900
Determine the following:
a. Range
b. Interval
c. Class limits
d. Relative frequencies
e. Percentages
f. Cumulative frequencies
g. Midpoints
Solution:
Step 1: Arrange the raw data in ascending or descending order. In this particular example, we
will arrange raw data in ascending order. This will make it easier for us to tally the data.
14,050 17,950 20,800 22,000 23,400 24,500 25,400 26,500 27,600 30,500
14,300 18,350 21,000 22,600 23,500 24,600 25,600 26,500 27,800 30,650
15,500 18,400 21,300 22,750 23,700 24,700 25,700 26,800 27,900 30,700
15,700 18,700 21,600 22,800 23,700 24,750 25,900 26,900 29,300 30,700
17,000 18,800 21,750 22,900 23,700 25,000 26,000 27,000 29,500 30,750
17,300 20,000 21,800 23,000 23,850 25,000 26,100 27,300 30,100 32,100
17,400 20,500 21,900 23,200 24,100 25,150 26,200 27,400 30,200 32,400
17,800 20,750 21,900 23,400 24,300 25,300 26,300 27,500 30,400 34,500
The objective is to use just enough classes. We can determine the number of classes (k) using
the "2 to the k rule": select the smallest number k such that 2^k (2 raised to the power of k)
is greater than or equal to the number of observations (n). In our example, there are 80 young
professionals (n = 80). If we try k = 6, then 2^k = 2^6 = 64, which is less than 80. Thus, 6 are
not enough classes. If we try k = 7, then 2^k = 2^7 = 128, which is greater than 80. Therefore,
the recommended number of classes is 7.
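Generally, the class interval (or width) should be equal for all classes. The classes must
cover all the values in the raw data (that is, from lowest to highest). The class interval is
computed using the formula:
$i = \frac{\text{Range}}{k} = \frac{HV - LV}{k} = \frac{34{,}500 - 14{,}050}{7} = \frac{20{,}450}{7} \approx 2{,}921.43$
In this example, the interval is rounded up to a convenient width of 3,000.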
Note: Round the value of the interval up to the nearest whole number if there is a remainder.
The starting point can be the smallest data value or any convenient number less than the
smallest data value. In our case 14,000 is used.
We need to add the interval (or width) to the lowest score taken as the starting point to obtain
the lower limit for the next class. Keep adding until we reach the 7 classes, whose lower limits
are 14,000; 17,000; 20,000; 23,000; 26,000; 29,000; and 32,000.
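The interval and the lower class limits for this example can be reproduced with the sketch below. The highest value (34,500), lowest value (14,050), number of classes (7), and starting point (14,000) come from the example. Rounding the interval up to the next multiple of 1,000 is an assumption made here only to match the 3,000 width used in the example.

```python
import math

highest, lowest, k = 34_500, 14_050, 7      # values from Example 2
raw_width = (highest - lowest) / k           # about 2,921.43

# Assumption: round up to the next multiple of 1,000 for a convenient width.
width = math.ceil(raw_width / 1000) * 1000

start = 14_000                               # convenient value below the lowest salary
lower_limits = [start + width * j for j in range(k)]

print(width)
print(lower_limits)
```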
To obtain the upper-class limits, we first add the interval to the lower limit of the first
class: 14,000 + 3,000 = 17,000. Since 17,000 is the lower limit of the next class, the upper
limit of the first class is 16,999. Then add the interval (or width) to each upper limit to
obtain all the upper limits.
CLASS LIMITS
14,000-16,999
17,000-19,999
20,000-22,999
23,000-25,999
26,000-28,999
29,000-31,999
32,000-34,999
Step 3: Tally the raw data.
14,000-16,999 IIII
17,000-19,999 IIIII-IIII
20,000-22,999 IIIII-IIIII-IIIII-I
23,000-25,999 IIIII-IIIII-IIIII-IIIII-III
26,000-28,999 IIIII-IIIII-IIIII-II
29,000-31,999 IIIII-III
32,000-34,999 III
14,000-16,999 IIII 4
17,000-19,999 IIIII-IIII 9
20,000-22,999 IIIII-IIIII-IIIII-I 16
23,000-25,999 IIIII-IIIII-IIIII-IIIII-III 23
26,000-28,999 IIIII-IIIII-IIIII-II 17
29,000-31,999 IIIII-III 8
32,000-34,999 III 3
Step 5: Determine the relative frequency. It can be found by dividing each frequency by the total
frequency.
14,000-16,999 4 0.05 4 ÷ 80
17,000-19,999 9 0.11 9 ÷ 80
20,000-22,999 16 0.20 16 ÷ 80
23,000-25,999 23 0.29 23 ÷ 80
26,000-28,999 17 0.21 17 ÷ 80
29,000-31,999 8 0.10 8 ÷ 80
32,000-34,999 3 0.04 3 ÷ 80
Step 6: Determine the percentage. It can be found by multiplying each relative frequency by
100%.
Total: 80 100
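Steps 5 and 6 can be sketched in Python using the class frequencies listed in the table above (4, 9, 16, 23, 17, 8, 3) as given input. The variable names are illustrative only.

```python
classes = [
    "14,000-16,999", "17,000-19,999", "20,000-22,999", "23,000-25,999",
    "26,000-28,999", "29,000-31,999", "32,000-34,999",
]
frequencies = [4, 9, 16, 23, 17, 8, 3]   # frequencies as listed in the table

n = sum(frequencies)                              # total frequency
relative = [f / n for f in frequencies]           # Step 5: f / n
percent = [100 * f / n for f in frequencies]      # Step 6: (f / n) x 100%

for c, f, r, p in zip(classes, frequencies, relative, percent):
    print(f"{c}  {f:>2}  {r:.2f}  {p:.2f}%")
print("Total:", n, f"{sum(percent):.0f}%")
```

The relative frequencies should total 1 and the percentages should total 100%.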
Step 7: Determine the cumulative frequencies. The cumulative frequency can be found by
adding the frequency in each class to the total frequencies of the classes preceding that class.
14,000-16,999 4 4 4
17,000-19,999 9 13 4+9
20,000-22,999 16 29 4 + 9 + 16
23,000-25,999 23 52 4 + 9 + 16 + 23
26,000-28,999 17 69 4 + 9 + 16 + 23 + 17
29,000-31,999 8 77 4 + 9 + 16 + 23 + 17 + 8
32,000-34,999 3 80 4 + 9 + 16 + 23 + 17 + 8 + 3
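The running totals in Step 7 can be sketched with `itertools.accumulate` from the Python standard library, again taking the listed class frequencies as input.

```python
from itertools import accumulate

frequencies = [4, 9, 16, 23, 17, 8, 3]   # frequencies as listed in the table

# Each entry is the class frequency plus all frequencies before it.
cumulative = list(accumulate(frequencies))

print(cumulative)
```

The last cumulative frequency always equals the total number of observations, here 80.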
Step 8: Determine the midpoints. The midpoint can be found by getting the average of the upper
limit and lower limit in each class.
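As a minimal sketch of Step 8, the midpoints can be computed from the class limits of this example. The list name `class_limits` is illustrative.

```python
# (lower limit, upper limit) of each class in Example 2.
class_limits = [
    (14_000, 16_999), (17_000, 19_999), (20_000, 22_999), (23_000, 25_999),
    (26_000, 28_999), (29_000, 31_999), (32_000, 34_999),
]

# Midpoint = average of the lower and upper limit of the class.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

for (lower, upper), m in zip(class_limits, midpoints):
    print(f"{lower:,}-{upper:,}  midpoint = {m:,.1f}")
```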
Example 3: SJS Travel Agency, a nationwide local travel agency, offers special rates during the
summer period. The owner wants additional information on the ages of the people taking
travel tours. A random sample of 50 customers taking travel tours last summer revealed these
ages.
18 29 42 57 61 67 37 49 53 47
24 34 45 58 63 70 39 51 54 48
28 36 46 60 66 77 40 52 56 49
19 31 44 58 62 68 38 50 54 48
27 36 46 59 64 74 39 51 55 48
Solution:
Step 1: Arrange the raw data in ascending order.
18 29 37 42 47 49 53 57 61 67
19 31 38 44 48 50 54 58 62 68
24 34 39 45 48 51 54 58 63 70
27 36 39 46 48 51 55 59 64 74
28 36 40 46 49 52 56 60 66 77
$i = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{77 - 18}{1 + 3.322 \log 50} = \frac{59}{6.643978354} = 8.88 \approx 9$
Select a starting point for the lowest class limit. The lowest value in the data set is 18,
this will also serve as our starting point.
Set the individual class limit. We will add 9 to each lower-class limit until reaching the
number of classes (18, 27, 36, 45, 54, 63, and 72). To obtain the upper-class limits, we
need to add 9 to the lower limit of the class to obtain the upper limit of the first class.
Then add the interval (or width) to each upper limit to obtain all the upper limits (27, 36,
45, 54, 63, 72, and 81).
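The lower and upper limits listed above can be generated with the sketch below. It follows the text as written, where each upper limit equals the next lower limit; the helper name `class_limits` is hypothetical.

```python
def class_limits(start, width, k):
    """Return (lower limits, upper limits) for k classes of equal width."""
    lower = [start + width * j for j in range(k)]
    upper = [lo + width for lo in lower]
    return lower, upper


# Example 3: starting point 18, interval 9, and 7 classes.
lower, upper = class_limits(18, 9, 7)

print(lower)
print(upper)
```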
When the data set contains a large number of values, drawing conclusions from an
ordered array or stem-and-leaf plot is often difficult. We will need graphs or charts in such
situations. There are a number of graphs or charts that visually show numerical data. These
include the histogram, frequency polygon, and cumulative frequency polygon (ogive).
In this section, we will discuss several graphical methods that are used for interval data.
The most important of these graphical methods is the histogram. A histogram is a powerful
graphical technique used to summarize interval data, but it also helps explain an important
aspect of probability.
A histogram is a graph in which the classes are marked on the horizontal axis (x-axis)
and the class frequencies on the vertical axis (y-axis). The height of the bars represents the
class frequencies and the bars are drawn adjacent to each other. Nevertheless, the histogram
focuses on the frequency of each class and sacrifices whatever information is contained in the
actual observation.
A frequency polygon is a graph that displays the data using points that are connected by
lines. The frequencies are represented by the heights of the points at the midpoints of the
classes. The vertical axis represents the frequency of the distribution while the horizontal axis
represents the midpoints of the frequency distribution.
A cumulative frequency polygon or ogive (read as oh'-jive) is a graph that displays the
cumulative frequencies for the classes in a frequency distribution. The vertical axis represents
the cumulative frequency of the distribution while the horizontal axis represents the upper-class
boundaries (real upper limits) of the frequency distribution.
Solution:
a. Constructing a Histogram
Step 1: Find the midpoints of each class.
Step 2: Draw and label the x-axis and y-axis.
Step 3: Represent the frequency on the y-axis and the midpoints on the x-axis.
Step 4: Use the frequency to represent the height and draw the vertical bars.
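The four steps can be illustrated with a rough text-only histogram. This sketch assumes the salary classes and frequencies from Example 2 as input and draws the bars sideways with `#` characters, one per observation; a real histogram would be drawn with vertical, adjacent bars. The function name `text_histogram` is hypothetical.

```python
class_limits = [
    (14_000, 16_999), (17_000, 19_999), (20_000, 22_999), (23_000, 25_999),
    (26_000, 28_999), (29_000, 31_999), (32_000, 34_999),
]
frequencies = [4, 9, 16, 23, 17, 8, 3]   # frequencies as listed in Example 2


def text_histogram(limits, freqs):
    """Return one line per class: midpoint label, then a bar of length f."""
    lines = []
    for (lower, upper), f in zip(limits, freqs):
        midpoint = (lower + upper) / 2            # Step 1: midpoint of the class
        lines.append(f"{midpoint:>9,.1f} | {'#' * f} {f}")
    return lines


lines = text_histogram(class_limits, frequencies)
print("\n".join(lines))
```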
As discussed in the previous section, the only allowable calculation on nominal data is to
count the frequency of each value of the variable. We can graphically display the counts in three
ways: Pareto charts, bar charts, and pie charts. This section also includes how to graphically
display time series graphs, pictographs, and scatter plots.
A Pareto chart is a graph used to represent a frequency distribution for categorical data
(or nominal-level) and frequencies are displayed by the heights of vertical bars, which are
arranged in order from highest to lowest.
A bar chart (bar graph) is similar to a histogram. The bases of the rectangles are
arbitrary intervals whose centers are the codes. The height of each rectangle represents the
frequency of that category. It is also applicable for categorical data (or nominal-level).
A pie chart (circle graph) is a circle divided into portions that represent the relative
frequencies (or percentages) of the data belonging to different categories. The data in a pie
chart should be categorical or nominal-level.
A time series graph represents data that occur over a specific period of time under
observation. In addition, it shows a trend or pattern of increases or decreases over a period
of time.
A pictograph (pictogram) immediately suggests the nature of the data being shown. It is
a combination of the attention-getting quality and the accuracy of the bar chart. Appropriate
pictures arranged in a row (sometimes in a column) present the quantities for comparison.
A scatter plot is used to examine possible relationships between two numerical
variables. The two variables are plotted on the x-axis and y-axis.
Now we will illustrate how to construct the Pareto chart, bar chart, pie chart, time series
graph, pictograph, and scatter plot using the succeeding examples.
Example 2: Using the information in the table about the favorite snacks of 870 youth, construct
a Pareto chart, bar chart, and pie chart.
Products      Sales
Candy         135
Chocolate     210
Ice Cream     185
Junk Foods    250
Others        90
Solution:
a. Constructing a Pareto Chart
It can easily be seen in the Pareto chart that junk food is the most preferred snack, followed
by chocolate, while other kinds of snacks are the least preferred by the youth from the given
population.
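The ordering of the bars in a Pareto chart can be sketched by sorting the categories from highest to lowest frequency. The figures are those in the Example 2 table; the variable names are illustrative.

```python
# Sales from the Example 2 table.
sales = {
    "Candy": 135,
    "Chocolate": 210,
    "Ice Cream": 185,
    "Junk Foods": 250,
    "Others": 90,
}

# Pareto order: bars arranged from the highest frequency to the lowest.
pareto_order = sorted(sales.items(), key=lambda item: item[1], reverse=True)

for product, count in pareto_order:
    print(f"{product:<11} {count}")
```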
b. Constructing a Bar Chart
Step 1: Draw and label the x-axis (Products) and y-axis (Sales).
Step 2: Make bars of the same width and draw each height corresponding to its frequency.
[Figure: bar chart of the favorite snacks of the youth]
The same observation can also be seen in the bar chart: junk food is the most preferred
snack, followed by chocolate, while other kinds of snacks are the least preferred by the youth
from the given population.
c. Constructing a Pie Chart
Step 1: Since there are 360° in a circle, the frequency of each class must be converted into a
proportional part of the circle. This conversion is done by applying the formula.
Degree = (f/n)×360°
where f= frequency of each class, and n = sum of frequencies.
Hence, the following conversions are obtained. The degrees should total 360°.
Candy = (135/870) × 360° = 56°
Chocolate = (210/870) × 360° = 87°
Ice Cream = (185/870) × 360° = 77°
Junk Foods = (250/870) × 360° = 103°
Others = (90/870) × 360° = 37°
Step 2: Each frequency must also be converted to a percentage, and these percentages must
total 100%. This conversion is done by applying the formula:
Percentage = (f/n) × 100%
Step 3: Using a protractor, graph each section and write its name and appropriate percentage,
as shown in figure 4.6.
Since junk food has the biggest slice in the pie chart, it is the most preferred snack,
followed by chocolate, while other kinds of snacks are the least preferred by the youth from
the given population.
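The degree and percentage conversions for the pie chart can be sketched as follows, using the sales figures from the Example 2 table. Rounding each angle to the nearest whole degree is assumed here because it matches the values shown above.

```python
# Sales from the Example 2 table.
sales = {
    "Candy": 135,
    "Chocolate": 210,
    "Ice Cream": 185,
    "Junk Foods": 250,
    "Others": 90,
}
n = sum(sales.values())   # sum of frequencies

# Degree = (f / n) x 360, rounded to the nearest whole degree.
degrees = {product: round(f / n * 360) for product, f in sales.items()}

# Percentage = (f / n) x 100, left unrounded here.
percentages = {product: 100 * f / n for product, f in sales.items()}

for product in sales:
    print(f"{product:<11} {degrees[product]:>3} degrees  {percentages[product]:.1f}%")
```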
Example 3: Using the information in the table below about the dollar to the peso exchange rate
from January to December of 2019, construct a time series graph.
Month            January   February   March       April     May        June
Exchange Rate    52.46     52.19      52.41       52.11     52.26      51.18
Month            July      August     September   October   November   December
Exchange Rate    51.14     52.05      51.10       51.50     50.72      50.76
Solution:
Step 1: Draw and label the x-axis and y-axis.
Step 2: Label the x-axis for months and the y-axis for Philippine Peso per US Dollar.
Step 3: Plot each point according to the table.
Step 4: Draw line segments connecting adjacent points.
It can be seen in the graph that the US dollar to Philippine peso exchange rate is highest in
January and lowest in the months of November and December.
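The highest and lowest points of the series can be checked with a short sketch. The monthly rates are those in the Example 3 table; the dictionary name `rates` is illustrative.

```python
# Monthly exchange rates (Philippine peso per US dollar) from the Example 3 table.
rates = {
    "January": 52.46, "February": 52.19, "March": 52.41, "April": 52.11,
    "May": 52.26, "June": 51.18, "July": 51.14, "August": 52.05,
    "September": 51.10, "October": 51.50, "November": 50.72, "December": 50.76,
}

highest_month = max(rates, key=rates.get)   # month with the largest rate
lowest_month = min(rates, key=rates.get)    # month with the smallest rate

print(highest_month, rates[highest_month])
print(lowest_month, rates[lowest_month])
```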
Example 4: Suppose the National Housing Authority (NHA) builds houses for homeless
families in Quezon City. The information in the table shows the number of houses constructed
from 2017 to 2021. Construct a pictograph.
Solution:
Step 1: Draw and label the x-axis and y-axis.
Step 2: Label the x-axis for years and the y-axis for the Number of Houses.
Step 3: Draw a house to represent the number of houses.
It can be noted in the pictograph that the NHA built more houses in the years 2019 and
2021, while in the year 2018 it constructed less than 50% of the houses in Quezon City.
Example 5: An economist would like to study the relationship between the US Dollar to
Philippine Peso exchange rate and the inflation rate for 2019 in the Philippines. A sample of 12
monthly average rates is selected with the results given as follows:
US Dollar to Philippine Peso Exchange Rate: 52.45 52.19 52.41 52.11 52.26 51.8 51.14 52.05 52.10 51.50 50.72 50.76
Inflation Rate: 4.4 3.8 3.3 3.0 3.2 2.7 2.4 1.7 0.9 0.8 1.3 2.5
Solution:
Step 1: Draw and label the x-axis and y-axis.
Step 2: Label the x-axis for US Dollar to Philippine Peso Exchange Rate and the y-axis for
Inflation Rate.
Step 3: Plot the points of each ordered pair in the Cartesian coordinate system.
We can deduce from the graph that there is a substantial positive relationship between the US
Dollar to Philippine Peso exchange rate and the inflation rate. That is, as the exchange rate
increases, the inflation rate also tends to increase.
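The text reads the relationship off the scatter plot. One assumed way to check its direction numerically is the Pearson correlation coefficient, which this module does not otherwise use. The sketch below computes it by hand from the Example 5 values; the function name `pearson_r` is hypothetical.

```python
import math

# Values from the Example 5 table.
exchange_rate = [52.45, 52.19, 52.41, 52.11, 52.26, 51.8,
                 51.14, 52.05, 52.10, 51.50, 50.72, 50.76]
inflation_rate = [4.4, 3.8, 3.3, 3.0, 3.2, 2.7,
                  2.4, 1.7, 0.9, 0.8, 1.3, 2.5]


def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the
    product of the summed squared deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(exchange_rate, inflation_rate)
print(round(r, 2))   # a positive value indicates the two rates rise together
```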
Good graphical displays tell what the data are conveying. Sadly, many graphs or charts
shown in newspapers and magazines are misleading, incorrect, or complicated. In order to
correctly develop good graphs/charts, there are some guidelines that one needs to bear in mind
such as.
Any data set can be characterized by measuring its central tendency. A measure of
central tendency, commonly referred to as an average, is a single value that represents a data
set. Its purpose is to locate the center of a data set. This chapter discusses three different
measures of central tendency: the mean, median, and mode. We will illustrate how to calculate
each of these measures for ungrouped and grouped data. Measures of central tendency for
grouped sample and grouped population data are also included in the discussion.
The arithmetic mean, often called the mean, is the most frequently used measure of
central tendency. The mean is the only common measure in which all values play an equal role,
meaning to determine its values you would need to consider all the values of any given data set.
The mean is appropriate to determine the central tendency of interval or ratio data.
The symbol x̄, called "x bar," is used to represent the mean of a sample, and the
symbol μ, called "mu," is used to denote the mean of a population.
Properties of Mean
1. A set of data has only one mean.
2. Mean can be applied for interval and ratio data.
3. All values in the data set are included in computing the mean.
4. The mean is very useful in comparing two or more data sets.
5. Mean is affected by extremely small or large values in a data set.
6. Mean is most appropriate in symmetrical data.
$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
Sample Mean: $\bar{x} = \frac{\sum x}{n}$
Population Mean: $\mu = \frac{\sum x}{N}$
Where: x̄ = sample mean (it is read “x bar”).
μ = population mean (it is read “mu”).
x = the value of any particular observation or measurement.
Σx = sum of all x’s.
n = total number of values in the sample.
N = total number of values in the population.
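The mean formula can be sketched in a few lines of Python. The data are the daily salaries and the ages used in Examples 1 and 2 of this section; the function name `mean` is illustrative, and the same computation serves for a sample (n) or a population (N).

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


salaries = [550, 420, 560, 500, 700, 670, 860, 480]   # sample of 8 daily salaries
ages = [53, 48, 59, 48, 54, 46, 51, 58, 55]           # population of 9 ages

sample_mean = mean(salaries)
population_mean = mean(ages)

print(sample_mean)
print(round(population_mean, 2))
```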
Properties of Median
1. The median is unique; there is only one median for a set of data.
2. The median is found by arranging the set of data from lowest to highest (or highest to
lowest) and getting the value of the middle observation.
3. Median is not affected by extremely small or large values.
4. Median can be applied for ordinal, interval, and ratio data.
5. Median is most appropriate in skewed data.
To determine the value of the median for ungrouped data, we need to consider two rules:
1. If n is odd, the median is the middle-ranked value.
2. If n is even, then the median is the average of the two middle-ranked values.
$\text{Median (ranked value)} = \frac{n + 1}{2}$
Note that n is the population/sample size.
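The two rules can be sketched as a small function that locates the ranked value (n + 1) / 2 in the ordered data. The function name `median_by_rank` is hypothetical; the test data are the salaries and ages from Examples 1 and 2.

```python
def median_by_rank(values):
    """Median of ungrouped data using the ranked value (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2                  # 1-based position of the median
    if n % 2 == 1:
        # Odd n: the rank is a whole number, so take that value.
        return ordered[int(rank) - 1]
    # Even n: the rank ends in .5, so average the two neighboring values.
    below = ordered[int(rank) - 1]
    above = ordered[int(rank)]
    return (below + above) / 2


salaries = [550, 420, 560, 500, 700, 670, 860, 480]   # n = 8, rank 4.5
ages = [53, 48, 59, 48, 54, 46, 51, 58, 55]           # n = 9, rank 5

print(median_by_rank(salaries))
print(median_by_rank(ages))
```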
The mode is the value in a data set that appears most frequently. Like the median and
unlike the mean, extreme values in a data set do not affect the mode. A data set may not
contain any mode if none of the values are "most typical". A data set that has only one value
that occurs with the greatest frequency is said to be unimodal. If the data has two values with
the same greatest frequency, both values are considered the mode, and the data set is bimodal.
If a data set has more than two modes, then the data set is said to be multimodal. There are
some cases when all values in a data set have the same frequency. When this occurs, the data
set is said to have no mode.
Properties of Mode
1. The mode is found by locating the most frequently occurring value.
2. The mode is the easiest average to compute.
3. There can be more than one mode or even no mode in any given data set.
4. Mode is not affected by extremely small or large values.
5. Mode can be applied for nominal, ordinal, interval, and ratio data.
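The cases described above (unimodal, bimodal, multimodal, no mode) can be sketched with `collections.Counter`. The function names `modes` and `describe` are hypothetical. Treating a data set with a single distinct value as unimodal is an assumption of this sketch, and an empty data set is not handled.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when every value ties."""
    counts = Counter(values)
    top = max(counts.values())
    # All distinct values share the same frequency: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


def describe(values):
    """Classify the data set by its number of modes."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes(values)), "multimodal")


ages = [53, 48, 59, 48, 54, 46, 51, 58, 55]
salaries = [550, 420, 560, 500, 700, 670, 860, 480]

print(modes(ages), describe(ages))
print(modes(salaries), describe(salaries))
```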
The weighted mean is particularly useful when various classes or groups contribute
differently to the total. The weighted mean is found by multiplying each value by its
corresponding weight, adding the products, and dividing by the sum of the weights.
$\bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + \dots + x_n w_n}{w_1 + w_2 + w_3 + \dots + w_n}$
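The weighted mean formula can be sketched as follows. The function name `weighted_mean` is illustrative. The first call uses the five grades and weights listed in Example 3; the second treats the number of bedrooms in Example 4 as the values and the number of homes as the weights.

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the sum of the weights."""
    total = sum(x * w for x, w in zip(values, weights))
    return total / sum(weights)


# Example 3: grades, each with a weight of 3.
grades = [90, 87, 88, 95, 96]
grade_weights = [3, 3, 3, 3, 3]

# Example 4: number of bedrooms (values) and number of homes (weights).
bedrooms = [2, 3, 4, 5, 6]
homes = [13, 21, 10, 4, 2]

print(round(weighted_mean(grades, grade_weights), 2))
print(round(weighted_mean(bedrooms, homes), 2))
```

When all weights are equal, as in the grades, the weighted mean reduces to the ordinary mean.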
Example 1: The daily salaries of a sample of eight employees at GMS Inc. are ₱550, ₱420,
₱560, ₱500, ₱700, ₱670, ₱860, and ₱480. Find the mean, median, and mode of the
employees' daily salaries.
Solution:
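Mean Computation:
$\bar{x} = \frac{\sum x}{n} = \frac{550 + 420 + 560 + 500 + 700 + 670 + 860 + 480}{8} = \frac{4{,}740}{8} = 592.50$
Hence, the mean daily salary is ₱592.50.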
Median Computation:
Step 1: Arrange the data in order.
₱420, ₱480, ₱500, ₱550, ₱560, ₱670, ₱700, ₱860
$\text{Median (ranked value)} = \frac{n + 1}{2} = \frac{8 + 1}{2} = 4.5\text{th}$
Since the middle point falls between P550 and P560, we can determine the median of
the data set by getting the average of the two values.
$\text{Median} = \frac{550 + 560}{2} = 555$
Mode Computation:
The ordered array for these data is ₱420, ₱480, ₱500, ₱550, ₱560, ₱670, ₱700, ₱860.
There is no mode since every value in the data set occurs with the same frequency.
Example 2: Find the population mean, median, and mode of the ages of nine (9)
middle-management employees of a certain company. The ages are 53, 48, 59, 48, 54, 46, 51, 58,
and 55.
Solution:
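Mean Computation:
$\mu = \frac{\sum x}{N} = \frac{53 + 48 + 59 + 48 + 54 + 46 + 51 + 58 + 55}{9} = \frac{472}{9} \approx 52.44$
Hence, the mean age is about 52.44 years.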
Median Computation:
Step 1: Arrange the data in order.
46, 48, 48, 51, 53, 54, 55, 58, 59
$\text{Median (ranked value)} = \frac{n + 1}{2} = \frac{9 + 1}{2} = 5\text{th}$
Hence, the median age is 53 years.
Mode Computation:
The ordered array for these data is 46, 48, 48, 51, 53, 54, 55, 58, 59. Because 48 appears
twice, more times than any other value, the mode is 48 years.
Example 3: Riana's first quarter grade is shown in the table below. Use the weighted mean
formula to find Riana's General Percentage Average (GPA) for the first quarter.
Solution:
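Let $w_1 = 3$, $w_2 = 3$, $w_3 = 3$, $w_4 = 3$, $w_5 = 3$
$x_1 = 90$, $x_2 = 87$, $x_3 = 88$, $x_4 = 95$, $x_5 = 96$
$\bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4 + x_5 w_5}{w_1 + w_2 + w_3 + w_4 + w_5} = \frac{90(3) + 87(3) + 88(3) + 95(3) + 96(3)}{3 + 3 + 3 + 3 + 3} = \frac{1{,}368}{15} = 91.2$
Hence, Riana's GPA for the first quarter is 91.2.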
Example 4: A certain subdivision in Rizal Province consists of 50 homes. The table shows the
frequency distribution of homes with respect to the number of bedrooms it has. Find the mean
number of bedrooms for the 50 homes.
No. of Bedrooms 2 3 4 5 6
No. of Homes 13 21 10 4 2
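Solution:
Let the values x be the number of bedrooms and the weights w be the number of homes.
$x_1 = 2$, $x_2 = 3$, $x_3 = 4$, $x_4 = 5$, $x_5 = 6$
$w_1 = 13$, $w_2 = 21$, $w_3 = 10$, $w_4 = 4$, $w_5 = 2$
$\bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4 + x_5 w_5}{w_1 + w_2 + w_3 + w_4 + w_5} = \frac{2(13) + 3(21) + 4(10) + 5(4) + 6(2)}{13 + 21 + 10 + 4 + 2} = \frac{161}{50} = 3.22$
Hence, the mean number of bedrooms for the 50 homes is 3.22.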
PRACTICE
A marketing research consultant conducted a survey of 40 people who visited an online
shopping site one morning. The age of each person was recorded to the nearest year as follows:
32 21 44 41 19 40 30 47 47 27 50 33 47 48 29 27 32 31 42 28
16 29 44 36 40 24 28 47 34 46 35 26 50 33 38 19 22 53 44 55
Prepare a frequency distribution table using Rule 1 and Rule 2.
ASSIGNMENT
Instructions: Answer the following problems.
1. A sales manager would like to see each of his sales representatives' unit sales per month.
A new recruit is told to keep a weekly record of the sales. The following are the data
from the previous month: 10, 14, 17, 18, 30, 28, 27, 38, 16, 5, 8, 19, 28, 17, 21, 23, 43,
21, 26, 28, 16, 6, 15, and 10. Find the mean, median, and mode.
2. A professor grades his students on 4 quizzes, a project, and a final
examination. Each quiz counts as 12% of the course grade. The project
counts as 22% of the course grade. The final examination counts as 30%
of the course grade. Achaiah has quiz scores of 75, 80, 85, and 90. His
project score is 95 and his final examination score is 92. Use the weighted
mean formula to find Achaiah's average for the course.