Chapter 2 - Data Handling
DATA HANDLING: DESCRIPTIVE STATISTICS AND MEASURES OF CENTRAL TENDENCY
2.1 Introduction
The word data refers to information that is gathered for statistical purposes. Statistics is a subject
that deals with the collection, organisation, display and interpretation of information.
Data is collected for several reasons:
• So that better planning can be done. It is important, for example, for a government to know how
available funds should be used in the fields of education, social services, health services, etc.
• To test statements such as "female students spend more hours studying than male students".
• To gain information about a situation in order to plan better. For example, what kind of chocolate
bar sells best in the cafeteria (market research), how many students buy their own textbooks, etc.
• To keep track of trends. For example, how many people have immigrated/emigrated during the last
four years, how many people per year are contracting AIDS, etc.
The processing of data consists of several steps that must be followed in order: you cannot start at
step three; you have to start at step one and then follow the steps one by one. Correct data
processing involves the following steps:
1. Deciding what the purpose of the investigation is (what do you want to know?): knowing
what questions to ask.
2. Identifying the population that must be investigated, and choosing a random sample from
which the data can be extracted.
3. Collecting data (the gathering / collecting of information / facts) using a suitable method.
4. Organising the data and representing it in a suitable display.
5. Analysing the data (information / facts are analysed) using statistical measures.
6. Announcing the results (information is announced according to a specific format).
7. Analysing the results (answers to step 1 are found). The findings should be tested critically and
communicated.
Suitable methods of collecting data include:
- Interviews: speaking directly to people, telling them why you are collecting the data and
what will be done with it.
- Questionnaires: drawing up data forms and asking people to complete a form where you
request information or ask a series of questions.
- Surveys: observing a situation to gain information. For example, counting the number of
chameleons in your garden, or counting the number of cars that pass a school in a day.
2.2 Population and Samples
When data is collected, you could use the whole field about which you want information. This entire
field is called a population. An example is when all the students in the university complete a
questionnaire.
When you use a sample, you choose a representative group from the population. For example, you
might choose a third of the students from each class to complete a questionnaire.
Samples can be composed in different ways. There are random samples and biased samples.
- In random samples all the members of the population have an equal chance of being picked.
- In a biased sample some members of the population have a greater chance of being picked than
others. For example, if you only choose people who wear glasses.
Once data has been collected, we organise it by using tally or frequency tables or spreadsheets.
When two or more tables are summarised into a single table, it is called a spreadsheet.
Information that you gather will usually be numbers, e.g. the number of students who use the cafeteria.
These numbers are organised by using frequency tables. There are various ways of using frequency
tables.
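As an illustration of how a frequency table is built, here is a minimal Python sketch. The data set `visits` and the function name `frequency_table` are hypothetical and are used only for this example; they do not come from the exercises in this chapter.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], counts[v] / n) for v in sorted(counts)]

# Hypothetical data: number of cafeteria visits in a week for 10 students.
visits = [2, 3, 3, 5, 2, 3, 0, 5, 2, 3]

print("Value  Tally   Freq  Rel. freq")
for value, freq, rel in frequency_table(visits):
    tally = "|" * freq  # one stroke per observation
    print(f"{value:>5}  {tally:<6}  {freq:>4}  {rel:.2f}")
```

Each row pairs a value with its tally, its frequency and its relative frequency (frequency divided by the number of observations).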
Data can be
- discrete, meaning it can take only certain values that can be counted, e.g. the result when a die
is thrown.
- continuous, meaning it can be measured and can take any realistic value, e.g. the weight of a
person.
- category data, meaning it can be divided into groups.
Example 1:
The data presented below show the Price/Earnings ratios of 25 listed companies:
Class i    Class    Class Boundaries    Tally    Frequency fi    Relative Frequency
EXERCISE 1:
1. During a survey to find out how knowledgeable the general public is about the cabinet, 40 people
were asked to name as many ministers as possible in five minutes. The number of ministers that
each could name was as follows:
1 5 3 5 1 8 10 1 2 2 1 3 4 4
1 2 10 11 8 13 15 6 1 2 5 3 4 9
11 2 3 10 2 7 1 6 4 5 3 16
2. The total number of unoccupied seats across all screenings in a complex of 8 cinema theatres
was recorded each day for 32 days. The totals were as follows:
32 21 15 23 11 39 10 9 25 29 13 42 10
9 13 5 4 8 10 11 9 5 7 3 2 1
5 6 2 3 4 9
Summarise the information on a frequency table using class intervals 0–4, 5–9, 10–14, etc.
2.4 Statistical measures
Once data has been collected, the analysis of the data begins. It is usually important to find some
form of “average” and some indication of the spread of the data. Statistically, the “averages” are
termed measures of central tendency, and the measures of spread are termed measures of dispersion.
In this course we will only look at the measures of central tendency.
There are three frequently used measures of central tendency, namely the mean, the median and the
mode. Let us look at an example of some data which has been collected:
Mode:
The mode of a set of data is the element that occurs most often.
Median:
The median is the middle element of the set when all the elements are arranged in ascending order
(i.e. from the smallest to the largest). The easiest way to find the median is first to find the
median position by using $\frac{1}{2}(n + 1)$, where $n$ stands for the number of elements
in the data set.
Example 2:
It is best to arrange the elements in ascending order. Every element must be written down, even if it
is a number that appears more than once.
Ascending order: 4; 4; 5; 6; 6; 7; 7; 8; 8; 8; 8; 9; 9
Mean
$\text{Mean} = \frac{\sum f}{n}$
Number of elements, $n = 13$
$\sum f = 4 + 4 + 5 + 6 + 6 + 7 + 7 + 8 + 8 + 8 + 8 + 9 + 9 = 89$
$\text{Mean} = \frac{89}{13} = 6.846$
Median
First calculate the median position,
$MP = \frac{1}{2}(n + 1)$
Number of elements, $n = 13$
$MP = \frac{1}{2}(13 + 1) = 7$
The 7th element in the ascending list is 7, so the median is 7.
NOTE: If there is an even number of elements in a data set, the median is found by adding the 2
values in the middle and dividing the answer by 2.
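The three measures can also be checked with a short Python sketch. This is a minimal illustration rather than a prescribed method: the function names `mean`, `median` and `modes` and the list `even_set` are assumptions introduced here, while `example_2` holds the data from Example 2.

```python
from collections import Counter

def mean(values):
    """Sum of all elements divided by the number of elements."""
    return sum(values) / len(values)

def median(values):
    """Middle element of the sorted data (average of the two middle ones if n is even)."""
    ordered = sorted(values)      # the data must be in ascending order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All elements that occur most often (there can be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

example_2 = [4, 4, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9]
print(round(mean(example_2), 3))  # 6.846
print(median(example_2))          # 7
print(modes(example_2))           # [8]

even_set = [3, 5, 8, 10]          # hypothetical data with an even number of elements
print(median(even_set))           # 6.5
```

Returning a list from `modes` is a design choice made so that data sets with more than one mode are handled without special cases.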
Example 3:
If the set of data is: 8; 10; 11; 4; 6; 7; 9; 4; 5; 9; 10; 9; 11; 12; 7; 10; 9; 6.
Find the mode, mean and median of the set of data.
Mode is ________ .
Mean
$\text{Mean} = \frac{\sum f}{n}$
Mean is __________ .
Median
Median position:
$MP = \frac{1}{2}(n + 1)$
Median is ________ .
NOTE: Before you can find the median, it is vital that the numbers are arranged in
ascending order (from smallest to biggest).
EXERCISE 2:
1. A lecturer carried out a survey to find the number of lectures that a class of 35 students were
absent for during a course lasting 16 weeks. These were the results:
3 1 4 2 1 5 12 5 3 1 0 0 5 7
3 4 2 1 4 6 3 2 7 9 5 0 2 1
0 0 0 1 0 2 1
2. Five managerial staff members have travel and car allowances of R60 000, R19 000, R21 000,
R18 000 and R20 000 each. An additional six office staff each have a car allowance of
R10 000.
3. A doctor has eleven diabetic patients. The ages at which they were first diagnosed as diabetic
were: 12, 5, 43, 14, 60, 18, 16, 17, 57, 16, 61.
2.6 Finding Means etc. using Frequency Tables
Very often we are presented with data already arranged in a frequency table and are asked to find
the mean, mode and median. Surprisingly, once the data has been arranged in a frequency table or
presented to us as a graph or chart, the task is made much easier. Let us look at an example.
Example 4:
A bank manager is interested in the amount of time it takes his tellers to service customers. He gets his
tellers to time their interaction with the customers and the results are given in the frequency table 2
below.
Time in minutes (x)   Frequency (f)   Cumulative Frequency   f x
         2                  1                  1               2
         3                  2                  3               6
         4                  2                  5               8
         5                  4                  9              20
         7                  2                 11              14
         8                  4                 15              32
         9                  1                 16               9
        10                  3                 19              30
        15                  1                 20              15
                      $\sum f = 20$                   $\sum (f x) = 136$
You should be able to see that there are 20 numbers in the sample and they have already been arranged
in numerical order (2, 3, 3, 4, 4, 5, 5, 5, 5, 7, 7, …etc.).
i) The mode is the most frequent value. We see here that 5 minutes and 8 minutes both occur 4
times. Therefore, they are both modes.
ii) The median is the centre number. We have 20 numbers altogether, therefore the number we want is
the $\frac{n+1}{2}$th $= \frac{20+1}{2} = 10\frac{1}{2}$th number, which is the average of the 10th
and 11th numbers. We must now look at the table and find where the 10th and 11th numbers are. We see
from the Cumulative Frequency column that the first 9 numbers are the 2, the 3s, the 4s and the 5s.
The 10th and 11th numbers are both 7. Therefore, the median is 7 minutes.
iii) To find the mean we are required to add up all the numbers. We have one 2, two 3s, two 4s,
etc. We get the total by using the formula $\sum (f x)$, and the mean is
$\frac{\sum (f x)}{\sum f} = \frac{136}{20} = 6.8$ minutes
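The same three results can be reproduced from the frequency table with a short Python sketch. The (x, f) pairs are taken from the table in Example 4; the helper name `value_at` and the variable names are assumptions introduced for this illustration.

```python
# (x, f) pairs from Example 4: service time in minutes and its frequency.
table = [(2, 1), (3, 2), (4, 2), (5, 4), (7, 2), (8, 4), (9, 1), (10, 3), (15, 1)]

def value_at(table, position):
    """Return the value at a 1-based position, using cumulative frequencies."""
    cumulative = 0
    for x, f in table:
        cumulative += f
        if position <= cumulative:
            return x
    raise ValueError("position exceeds the number of observations")

n = sum(f for _, f in table)              # sum of f
total = sum(f * x for x, f in table)      # sum of (f x)
mean = total / n

top = max(f for _, f in table)
modes = [x for x, f in table if f == top]  # every value with the highest frequency

if n % 2 == 0:
    # even n: average the two middle values
    median = (value_at(table, n // 2) + value_at(table, n // 2 + 1)) / 2
else:
    median = value_at(table, (n + 1) // 2)

print(n, total, mean, modes, median)
```

The running total inside `value_at` plays the role of the Cumulative Frequency column: the 10th and 11th values are found without writing out all 20 numbers.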
2.7 Classes and Class Boundaries
Very often, when one is collecting statistical data, the range of the numbers that one is measuring is
very wide and detailed. For example, if one is trying to establish the weights of people, one might
find weights of 74.3kg, 74.6kg, 75.2kg, etc. in the sample. If we had 200 people in the sample we would
have 200 weights and we might find that no two were the same. In order to work with this type of data
it is normal to arrange the data (weights) into groups. The groups might be 50-55kg, 55-60kg, 60-65kg,
etc. Thus, if the range of weights we had was from 50kg to 100kg, we would group them into 10
groups, each 5kg in size.
When one is grouping data, the groups must not overlap. If we had groups 50-55kg and 55-60kg,
where would one put a weight of exactly 55kg? It could fit into either of the 2 groups. It is
therefore customary to reduce the upper limit of each group slightly (depending on how many decimal
places or significant figures one is using). In the above case it would be normal to have the 2
groups as 50-54.9kg and 55-59.9kg. The weight of 55kg would then go into the second group.
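The grouping rule can be sketched in Python as follows. This is an illustration under stated assumptions: classes start at 50kg and are 5kg wide, as in the text, and the function names `class_lower_limit` and `group` and most of the sample weights are hypothetical.

```python
def class_lower_limit(weight, start=50, width=5):
    """Return the lower limit of the class that contains `weight`."""
    index = int((weight - start) // width)  # 0 for 50-54.9, 1 for 55-59.9, ...
    return start + index * width

def group(weights, start=50, width=5):
    """Count how many weights fall into each class, keyed by lower limit."""
    counts = {}
    for w in weights:
        lower = class_lower_limit(w, start, width)
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical weights in kg; the first three are the values mentioned in the text.
weights = [74.3, 74.6, 75.2, 55.0, 54.9, 61.7]

for lower, freq in group(weights).items():
    print(f"{lower}-{lower + 4.9:.1f}kg: {freq}")
```

Because each class is identified by its lower limit, a weight of exactly 55kg can only land in the 55-59.9kg class, so the groups do not overlap.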
When working with groups we use the midpoint of the group to represent that group in any calculation.
In other words, the 50-54.9kg group (which covers weights from 50kg up to just under 55kg) would be
represented by 52.5kg and the 55-59.9kg group would be represented by 57.5kg. Let us now look at an
example.
Example 5:
The following table shows the weights of the members of the QSC class of 2005. Find the mean, median
and modal weights in the class.
Solution
i) Mean: $\text{mean} = \frac{\sum (f x)}{\sum f} = \frac{15665}{200} = 78.3$kg
ii) Mode: The modal class is 75-79.9kg because there are 35 members of that group. Therefore, we
can say that the mode = 77.5kg (midpoint).
iii) Median: There are 200 in the sample, therefore the median is the average of the 100th and 101st
weights. Both these weights are in the 75-79.9kg group because that group contains the
79th to the 113th number (we get this using the Cumulative Frequency column).
Therefore, we can say that the median is also 77.5kg (midpoint).
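The midpoint method used in this solution can be sketched in Python. The grouped data below is hypothetical (it is not the QSC class table), and the function names are assumptions made for this illustration; each class is written as (lower boundary, upper boundary, frequency).

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(50, 55, 3), (55, 60, 7), (60, 65, 12), (65, 70, 6), (70, 75, 2)]

def midpoint(lower, upper):
    """The value used to represent every member of a class."""
    return (lower + upper) / 2

def grouped_mean(classes):
    """Sum of (f x) divided by sum of f, where x is the class midpoint."""
    n = sum(f for _, _, f in classes)
    total = sum(f * midpoint(lo, hi) for lo, hi, f in classes)
    return total / n

def modal_class(classes):
    """The class with the highest frequency."""
    return max(classes, key=lambda c: c[2])

def median_class(classes):
    """The class containing position (n + 1) / 2, found by cumulative frequency.

    If the two middle positions fall in different classes, the later class
    is returned.
    """
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2
    cumulative = 0
    for c in classes:
        cumulative += c[2]
        if position <= cumulative:
            return c

print(grouped_mean(classes))                # 62.0
print(midpoint(*modal_class(classes)[:2]))  # 62.5
print(midpoint(*median_class(classes)[:2])) # 62.5
```

As in the solution above, the mode and the median are both reported as the midpoint of the class in which they fall.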
EXERCISE 3:
i) Find the mean, mode and median for the raw data (Table 1)
ii) Find the mean, mode and median from the frequency table.
iii) If the answers to (i) and (ii) are different, why would that be? Which are more accurate?
2) The table below lists the number of unemployed people by age in the Western Cape as per census
2001 figures. Calculate the mean, mode and median age of the unemployed.
15 – under 20 64 954
45 – under 55 48 188
55 – under 65 13 916
$\sum f = 527\,025$
2.8 An interpolation formula for calculating the median for grouped data
Let $m_0$ and $m_1$ be respectively the lower and upper endpoints of the median class, i.e., the class
in which the median sits. Let $f$ be the frequency of the median class, and let $F$ be the cumulative
frequency at the point $m_0$. The total number of observations under consideration is $n$. Then the
median can be estimated with the formula:
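With the symbols defined above, the usual linear-interpolation estimate is written as $\text{Median} \approx m_0 + \frac{\frac{n}{2} - F}{f}\,(m_1 - m_0)$. This standard form is assumed here. The Python sketch below implements that assumed form; the function name `interpolated_median` and the numbers in the usage line are hypothetical.

```python
def interpolated_median(m0, m1, f, F, n):
    """Estimate the median of grouped data by linear interpolation.

    m0, m1 : lower and upper endpoints of the median class
    f      : frequency of the median class
    F      : cumulative frequency at the point m0
    n      : total number of observations

    Assumes the standard form m0 + ((n / 2 - F) / f) * (m1 - m0).
    """
    return m0 + (n / 2 - F) / f * (m1 - m0)

# Hypothetical median class 20-30 holding 10 of 40 observations, with 15 below it.
print(interpolated_median(m0=20, m1=30, f=10, F=15, n=40))  # 25.0
```

The fraction $(\frac{n}{2} - F)/f$ measures how far into the median class the middle observation lies, so the estimate always falls between $m_0$ and $m_1$.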
Example 6:
The following table represents the percentage marks obtained by first year university students
who wrote a certain standardized numeracy test. Calculate the median.
Solution
Therefore, the median class is the class [60 - 69]. For better accuracy, the numbers $m_0$ and $m_1$
are taken as the “splittings” between neighbouring classes and not necessarily the given class
limits. So, in this case we take:
Answer. $n = 426$, and $\frac{n}{2} = 213$.
EXERCISE 4:
The following table represents the percentage marks obtained by Grade 10 learners in a Life Sciences
test. Find the modal class, the mean and the median.
Interval   Midpoint (x)   Frequency (f)   Cumulative frequency   x.f
0-29       14.5           17
30-39      34.5           26
40-49      44.5           34
50-59      54.5           56
60-69      64.5           59
70-79      74.5           21
80-89      84.5           12
90-99      94.5           3