4-Unit9 Statistics
Descriptive statistics
Descriptive statistics is the term given
to the analysis of data that helps
describe, show or summarize data in a
meaningful way, which allows simpler
interpretation of the data.
Inferential statistics
We have seen that descriptive statistics provides information about our
immediate group of data. For example, we could calculate the mean and standard
deviation of the height of the 4th ESO students in a Secondary School, and this
could provide valuable information about this group of students. Any group of
data like this, which includes all the data you are interested in, is called a
population.
Often, however, you do not have access to the whole population you are
interested in investigating, but only to a limited amount of data. For
example, you might be interested in the height of 4th ESO students in Spain. It
is not feasible to get the heights of all of them, so you have to use a smaller
sample of students, which is used to represent the larger population of all 4th
ESO students.
To sum up, statistics is divided into two branches: descriptive statistics and
inferential statistics.
Grouped data
Example: A teacher marked a set of 32 test papers. The scores earned by the
students were as follows:
90, 85, 74, 86, 65, 62, 100, 95, 77, 82, 50, 83, 77, 93, 73, 72,
98, 66, 45, 100, 50, 89, 78, 70, 75, 95, 80, 78, 83, 81, 72, 75
Because of the large number of different scores, we organize the data into
intervals, which must be equal in size.
When unorganized data are grouped into intervals, we must follow certain rules
in setting up the intervals:
• The intervals must cover the complete range of values. The range is the
difference between the highest and the lowest values.
• The number of intervals should be between 6 and 15. The use of too many
or too few intervals is not effective.
• Every data value to be tallied must fall into one and only one interval.
Thus, the intervals should not overlap.
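The rules above can be sketched in a few lines of Python. This is only an illustration: the choice of six intervals of width 10 starting at 41 is an assumption made here, not something fixed by the example, and other choices that follow the rules are equally valid.

```python
# Scores of the 32 test papers from the example.
scores = [90, 85, 74, 86, 65, 62, 100, 95, 77, 82, 50, 83, 77, 93, 73, 72,
          98, 66, 45, 100, 50, 89, 78, 70, 75, 95, 80, 78, 83, 81, 72, 75]

# Range = highest value - lowest value.
data_range = max(scores) - min(scores)

# Assumed intervals: 41-50, 51-60, ..., 91-100 (equal size, no overlap).
width = 10
start = 41
intervals = [(lo, lo + width - 1) for lo in range(start, 101, width)]

# Tally each score into the one interval that contains it.
tally = {interval: 0 for interval in intervals}
for score in scores:
    for lo, hi in intervals:
        if lo <= score <= hi:
            tally[(lo, hi)] += 1
            break

for (lo, hi), count in tally.items():
    print(f"{lo}-{hi}: {count}")
```

The frequencies add up to 32, which is a quick check that every score fell into exactly one interval.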
Exercise 1
87 85 61 51 64 75 80 70 69 82
80 79 82 74 92 76 72 73 63 65
67 71 88 76 68 73 70 76 71 86
b) Use a histogram to display the data and draw the frequency polygon.
• Measures of variation: they tell us how far the numbers are scattered
about the center value of the set. They are also called measures of
dispersion. The most common parameters of variation are: the range, the
standard deviation and the variance.
MEAN:
The mean of a set of data is the total of all the values divided by the number of
values, that is, the average value of all the data in the set. The mean is denoted
by $\bar{x}$.
$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$$
where $\sum f_i = N$ is the total number of data and $\sum x_i f_i$ is the sum
of all the data.
VARIANCE:
The variance is the average squared deviation of each number from the mean of
a data set. The variance is denoted by $\sigma^2$.
$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$$
$$\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2$$
STANDARD DEVIATION
The standard deviation, $\sigma$, is the square root of the variance. It is the
most used measure of spread.
The units of the standard deviation are the same as the units of the data.
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}} = \sqrt{\frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2}$$
To compare the spread of two sets of data, you use the coefficient of
variation, CV. It is defined as the ratio of the standard deviation to the mean.
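As a minimal sketch, the mean, variance, standard deviation and coefficient of variation of a frequency table can be computed as follows. The table used here is hypothetical (it is not taken from any of the exercises), and the variable names are chosen only for this illustration.

```python
from math import sqrt

# Hypothetical frequency table: values x_i and their frequencies f_i.
values = [1, 2, 3, 4]
freqs = [2, 3, 3, 2]

n = sum(freqs)                                         # N = sum of f_i
mean = sum(x * f for x, f in zip(values, freqs)) / n   # sum of x_i f_i / N

# Variance from the definition (average squared deviation from the mean).
variance = sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n

# Variance from the shortcut formula: sum of x_i^2 f_i / N - mean^2.
variance_alt = sum(x * x * f for x, f in zip(values, freqs)) / n - mean ** 2

std_dev = sqrt(variance)
cv = std_dev / mean   # coefficient of variation: standard deviation / mean

print(mean, variance, variance_alt, std_dev, cv)
```

Both variance formulas give the same result (up to rounding), so either can be used; the shortcut version is usually quicker by hand.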
Exercise 2
Find the mean, the standard deviation and the coefficient of variation of the
following two sets of data.

xi   fi
0    12
1    9
2    7
3    6
4    3
5    3

Interval       fi
50.5 - 57.5    1
57.5 - 64.5    3
64.5 - 71.5    8
71.5 - 78.5    8
78.5 - 85.5    6
85.5 - 92.5    4
Exercise 3
The mean weight of the boys in a class is 58.2 kg and their standard deviation is
3.1 kg. The mean weight of the girls in that class is 52.4 kg and their standard
deviation is 5.2 kg. Find the coefficient of variation and compare the spread of
both groups.
Exercise 4
The table below shows the weights and the heights of 6 people.
Find the coefficient of variation and decide which set of data has more spread.
Median
The median, Me, of a data set is the value in the middle when the data items are
arranged in ascending order.
• For an even number of data values: the median is the average of the two
middle values.
Whenever a data set has extreme values, the median is the preferred measure
of central location.
Example: The heights, in inches, of 20 students are shown in the following list.
The median is the average of the 10th and 11th data values.
53, 60, 61, 63, 64, 65, 65, 65, 65, 66, 66, 67, 67, 68, 69, 70, 70, 71, 71, 73
$$\text{Median} = \frac{66 + 66}{2} = 66$$
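The calculation can be sketched in Python as below. The function name `median` is chosen for this illustration, and the branch for an odd number of values follows the usual convention of taking the single middle value, which is an assumption here.

```python
def median(data):
    """Return the middle value of the data once sorted in ascending order."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value (assumed convention).
        return ordered[mid]
    # Even number of values: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

heights = [53, 60, 61, 63, 64, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 69, 70, 70, 71, 71, 73]

print(median(heights))  # average of the 10th and 11th values: 66.0
```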
Quartiles
When the values in a set of data are listed in numerical order, the median
separates the values into two equal parts. The numbers that separate the set
into four equal parts are called quartiles.
To find the quartile values, we first divide the set of data into two equal parts
and then divide each of these parts into two equal parts.
53, 60, 61, 63, 64, 65, 65, 65, 65, 66, 66, 67, 67, 68, 69, 70, 70, 71, 71, 73
The difference between the upper quartile and the lower quartile is called the
interquartile range. It is a useful way to quantify scatter.
Notice that:
• Q1 is the number such that 25% of data are less than it and 75% are
larger.
• Me = Q2 is the number such that 50% of data are less than it and 50%
are larger.
• Q3 is the number such that 75% of data are less than it and 25% are
larger.
53, 60, 61, 63, 64, 65, 65, 65, 65, 66, 66, 67, 67, 68, 69, 70, 70, 71, 71, 73
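A short sketch of the "split into halves" procedure for the 20 heights, using the standard library's `statistics.median`. For an odd number of values this sketch leaves the middle value out of both halves; that is one common convention and an assumption here, since other conventions exist.

```python
from statistics import median

heights = [53, 60, 61, 63, 64, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 69, 70, 70, 71, 71, 73]

ordered = sorted(heights)
half = len(ordered) // 2

lower_half = ordered[:half]    # first half of the ordered data
upper_half = ordered[-half:]   # second half of the ordered data

q1 = median(lower_half)   # lower quartile
q2 = median(ordered)      # median, Me
q3 = median(upper_half)   # upper quartile
iqr = q3 - q1             # interquartile range

print(q1, q2, q3, iqr)  # 64.5 66.0 69.5 5.0
```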
Percentiles
Quartiles are useful and they help to describe the distribution of values as we
have seen before.
However, we often want to know how one particular data value compares to the
rest of the data. For example, when you take a standardized test, you want
to know not only your own score, but also how it ranks in relation to all
the other scores. Percentiles are perfect for this situation.
The k-th percentile is the number such that k% of all data values are less than it
and (100 - k)% are larger.
Exercise 5
The lower quartile for a set of data was 40. These data consisted of the
heights, in inches, of 680 children. At most, how many of these children
measured more than 40 inches?
Exercise 6
On a standardized test, Sally scored at the 80th percentile. This means that
Exercise 7
For a set of data consisting of test scores, the 50th percentile is 87. Which of
the following could be false?
a) 50% of the scores are 87
b) 50% of the scores are 87 or less
c) Half of the scores are at least 87.
d) The median is 87.
Cumulative frequency
How can we find the median, the quartiles and the percentiles in a set of values?
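The cumulative frequency $F_i$ of a value $x_i$ is the sum of the frequencies of
all the values up to and including $x_i$:
$$F_i = f_1 + f_2 + \dots + f_{i-1} + f_i$$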
The number of children of 120 couples is shown in the table below.

xi   fi
0    10
1    20
2    41
3    29
4    14
5    5
6    1

We can add more columns with the cumulative frequencies and the percentages that
correspond to these frequencies.

xi   fi   Fi    % = (Fi / N) · 100
0    10   10    8.3
1    20   30    25
2    41   71    59.2
3    29   100   83.3
4    14   114   95
5    5    119   99.2
6    1    120   100
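The two extra columns can be reproduced with a short sketch like the one below. The helper `value_at_percent` is a hypothetical name, and it uses one simple convention: it returns the first value whose cumulative percentage reaches k%. When a cumulative percentage equals k% exactly, some textbooks instead average that value with the next one, so treat the helper as an assumption rather than the only rule.

```python
from itertools import accumulate

children = [0, 1, 2, 3, 4, 5, 6]      # x_i
couples = [10, 20, 41, 29, 14, 5, 1]  # f_i

n = sum(couples)                          # N = 120
cumulative = list(accumulate(couples))    # F_i
percent = [round(100 * F / n, 1) for F in cumulative]

def value_at_percent(k):
    """First value whose cumulative percentage is at least k (assumed convention)."""
    for x, F in zip(children, cumulative):
        if 100 * F / n >= k:
            return x

print(cumulative)            # [10, 30, 71, 100, 114, 119, 120]
print(percent)               # [8.3, 25.0, 59.2, 83.3, 95.0, 99.2, 100.0]
print(value_at_percent(50))  # median: 2
```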
In the example:
Exercise 9
Find Me, Q1, Q3, p80, p90 and p99 for the following set of marks.
Marks               1   2   3   4   5    6   7   8   9   10
Number of students  7   15  41  52  104  69  26  13  19  14
Step 1: Draw a scale with numbers from the minimum to the maximum values of
a set of data.
[Figure: box with Q1, Me and Q3 marked on the scale]
Step 3: Add the whiskers by drawing two line segments that include all the data,
subject to the following condition:
The length of each segment must be less than or equal to 1.5 times the length of
the box.
If one or more data values lie beyond that length, the corresponding whisker
is drawn up to this limit, and those values are marked individually in their
corresponding places.
The maximum length of each of the two whiskers will be less than or equal to
13.5.
The length of the whisker on the right will be 13.5 and we will add an asterisk
to mark the value 189.
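The whisker rule can be sketched as follows. The data set and the quartiles Q1 = 161 and Q3 = 170 are hypothetical, chosen only so that the box has length 9 and the limit is 13.5 as in the text; the function name `whiskers` is also an assumption of this sketch.

```python
def whiskers(data, q1, q3):
    """Return the ends of both whiskers and the values marked separately."""
    box_length = q3 - q1
    max_length = 1.5 * box_length     # longest allowed whisker
    low_limit = q1 - max_length
    high_limit = q3 + max_length

    # A whisker reaches the extreme value unless that value is beyond the
    # limit, in which case the whisker stops at the limit.
    low_end = max(min(data), low_limit)
    high_end = min(max(data), high_limit)

    # Values beyond the limits are drawn individually (e.g. with an asterisk).
    marked = [x for x in data if x < low_limit or x > high_limit]
    return low_end, high_end, marked

# Hypothetical data, with assumed quartiles Q1 = 161 and Q3 = 170.
heights = [152, 158, 161, 163, 165, 168, 170, 172, 189]
low_end, high_end, marked = whiskers(heights, q1=161, q3=170)

print(low_end, high_end, marked)  # 152 183.5 [189]
```

Here the right whisker has length 183.5 - 170 = 13.5, and 189 is drawn as a separate point.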
[Figure: box-and-whisker plot with Q1, Me and Q3 marked and the value 189 shown
as an asterisk]
The statistical summary for the marks of 87 people is: Q1 = 4.1, Me = 5.1,
Q3 = 6.8. All the marks lie between 1 and 9. Construct a box-and-whisker plot.
Exercise 11
Exercise 12
0 1 2 3 1 0 1 1 1
4 3 2 2 1 1 0 1 2
3 1 0 1 1 1 4
9.6.-INFERENTIAL STATISTICS
Sampling
For any statistical project, you need to find out information about a group of
people or things. This group is called the population.
When you collect data about every member of a population, it’s called a census.
If you have a really big population or it’s not very well defined, it can be
really hard or even impossible to do a census: it might take too long, cost too
much or be impractical.
When it’s not sensible to collect information using a census, you have to use
sampling.
Example: A student is trying to find out the average Key Stage 2 SATs score
for Maths in England in 2014.
The population would be all students in England who took the Key Stage 2 Maths
SAT in 2014.
The sample frame would be a list of all students who took the Key Stage 2
Maths SAT in 2014.
You can use the data you collect to make estimates and draw conclusions about
the whole population. The techniques that allow us to use samples to make
generalizations about the populations are called inferential statistics.
Exercise 13
Suppose you want to carry out studies about the following statistical variables:
This means any conclusions you draw from the data in the sample can be applied
to the whole population.
A biased study is one that doesn’t fairly represent the whole population. To
avoid bias you need to:
A student wants to find out if students at their school think the tuck shop
provides good value for money. She chooses to sample the first 20 people in the
queue for the tuck shop at break time.
The student is sampling the wrong population: any students who don’t shop at
the tuck shop are excluded. People who strongly think the tuck shop is bad value
for money probably won’t shop there.
The sample is also non-random because she hasn’t randomly selected the
students in the queue, or the time or day of sampling.
Exercise 14
Exercise 15
Example: If you had a bag with 3 different coloured balls (all of the same size)
in it and picked one out without looking, it would be random because all the balls
have an equal chance of being picked.
In a simple random sample, you randomly select your sample from the sample
frame.
It’s easiest to do this type of sampling with a small, well defined population.
Example: Describe how you would use simple random sampling to select a sample
of 50 students from a population of 900 students at a school.
• To get numbers between 1 and 900 you could choose to use the last three digits
of each number and read across the table, e.g. the first number would be 712.
• If you come to a number that’s outside your range or a repeated number you just
ignore it, e.g. you’d ignore 936.
• So, using the first row and only numbers between 1 and 900 gives the list
712, 839, 210, 335, 748, 509, …

Random number table:
0712 7839 6210 0335 7748
5509 1784 7362 2731 4283
9936 8012 3502 7523 3718
6021 1344 9275 3281 5002
1712 7787 3243 3262 4452
5166 4797 1044 4510 4971
1244 1397 4425 8677 1986
1691 1725 3450 2914 3718
2272 0560 6682 9172 4680
3) Then match the 50 random numbers from the table with the list to get
your sample.
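The table-reading procedure can be sketched in Python. The function name `picks_from_table` is hypothetical, and only the first three rows of the table are included to keep the sketch short. The last two lines show an alternative that is an assumption of this sketch rather than part of the method described above: letting the standard library's `random.sample` choose 50 distinct numbers directly.

```python
import random

# First three rows of the random number table, read across.
table = """
0712 7839 6210 0335 7748
5509 1784 7362 2731 4283
9936 8012 3502 7523 3718
"""

def picks_from_table(table_text, population_size):
    """Use the last three digits of each entry; skip out-of-range and repeats."""
    picks = []
    for entry in table_text.split():
        number = int(entry[-3:])
        if 1 <= number <= population_size and number not in picks:
            picks.append(number)
    return picks

picks = picks_from_table(table, 900)
print(picks[:6])  # [712, 839, 210, 335, 748, 509]

# Alternative (assumption): draw 50 distinct student numbers from 1 to 900.
sample = random.sample(range(1, 901), 50)
```

In both cases each selected number is then matched with the student who has that number in the sample frame.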
If we want to estimate the height of 21 year old males, we can use a random
sample of 200 of them. The mean of the heights of this sample is $\bar{x} = 173$ cm.
‘The mean height for 21 year old males is approximately equal to 173 cm’.
With the word “approximately” we want to indicate “more or less”. For example,
either 0.5 cm more or 0.5 cm less.
That is, the mean height for 21 year old males lies within the interval
(172.5, 173.5).
Sure? Of course not. It’s just probable. The statement could be something like
this:
‘The mean height for 21 year old males lies within the interval
(172.5, 173.5) with a confidence level of 90%’.
(This means that the probability that the statement is true is 0.9).
If you want more confidence that an interval contains the true parameter, then
the interval has to be wider. If you want to be 100% sure that an interval
contains the true parameter, it has to contain every possible value, so it must
be very wide. If you are willing to be only 50% sure that the interval contains
the true value, then it can be much narrower.
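The trade-off between confidence and width can be illustrated with a rough sketch. It assumes the usual normal approximation for the mean of a large sample (interval = sample mean ± z · s / √n), which is not derived in these notes, and a hypothetical standard deviation of 7 cm; only the sample mean of 173 cm and the sample size of 200 come from the example.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(mean, sd, n, level):
    """Approximate interval for the population mean (normal approximation)."""
    # z leaves (1 - level) / 2 of the probability in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

# Sample mean and size from the example; sd = 7 cm is a hypothetical value.
intervals = {level: confidence_interval(173, 7, 200, level)
             for level in (0.50, 0.90, 0.99)}
widths = {level: high - low for level, (low, high) in intervals.items()}

for level, (low, high) in intervals.items():
    print(f"{level:.0%}: ({low:.2f}, {high:.2f})")
```

The printed intervals get wider as the confidence level increases. In the same formula a larger sample size n makes the interval narrower.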
Exercise 17
The 64 people in a random sample from a certain population have taken a test.
Using the marks of this sample we draw the following conclusion:
“The mean mark for the whole population lies between 42.7 and 44.1 points, and
we say this with a 95% confidence level”.
• If the confidence interval was 42 – 44.8, then the confidence level would
be …
a) 90% b) 95% c) 98%
• If we wanted a 99% confidence level and an interval with the same length,
what would be the size of the sample?