
Unit 9: STATISTICS

9.1.- DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics
Descriptive statistics is the term given
to the analysis of data that helps
describe, show or summarize data in a
meaningful way, which allows simpler
interpretation of the data.

Descriptive statistics is very important because if we simply presented our raw
data it would be hard to visualize what the data was showing, especially if there
was a lot of it.

In order to deal with the collection, organisation, presentation, analysis and
interpretation of data, descriptive statistics uses frequency tables, diagrams
and charts, measures of central tendency (mean, median, mode) and measures of
spread (range, variance, standard deviation, quartiles), among other tools.

Inferential statistics
We have seen that descriptive statistics provides information about our
immediate group of data. For example, we could calculate the mean and standard
deviation of the height of the 4th ESO students in a Secondary School, and this
could provide valuable information about this group of students. Any group of
data like this, which includes all the data you are interested in, is called a
population.

Often, however, you do not have access to the whole population you are
interested in investigating, but only to a limited amount of data. For
example, you might be interested in the height of 4th ESO students in Spain. It
is not feasible to get the heights of all of them, so you have to use a smaller
sample of students, which is used to represent the larger population of all 4th
ESO students.

Inferential statistics are techniques that allow us to use samples to make
generalizations about the populations from which the samples were drawn. The
process of obtaining these samples is called sampling.

F. Cano Cuenca 1 Mathematics 4º ESO


Inferential statistics arises out of the fact that sampling naturally incurs
sampling error, so a sample is not expected to represent the population perfectly.

The methods of inferential statistics are the estimation of parameters and the
testing of statistical hypotheses.

To sum up:

STATISTICS

DESCRIPTIVE STATISTICS     INFERENTIAL STATISTICS
Collecting                 Making inferences
Organising                 Hypothesis testing
Summarising                Determining relationships
Presenting                 Making predictions

9.2.- FREQUENCY TABLES


When the statistical variable that we are studying has a small number of
possible values, the frequency table is really easy to make. We just have to do
the tally.

However, when the variable has a large number of possible values, it is
convenient to organize the data in groups or intervals.

Grouped data
Example: A teacher marked a set of 32 test papers. The scores earned by the
students were as follows:

90, 85, 74, 86, 65, 62, 100, 95, 77, 82, 50, 83, 77, 93, 73, 72,
98, 66, 45, 100, 50, 89, 78, 70, 75, 95, 80, 78, 83, 81, 72, 75

Because of the large number of different scores, we organize the data into
intervals, which must be equal in size.

Here we will use six intervals whose length is 10:

41-50, 51-60, 61-70, 71-80, 81-90, 91-100



Interval   Tally          Frequency
41-50      lll            3
51-60                     0
61-70      llll           4
71-80      llll llll l    11
81-90      llll lll       8
91-100     llll l         6

This table, containing a set of intervals and the corresponding frequency for
each interval, is an example of a grouped frequency table.

When unorganized data are grouped into intervals, we must follow certain rules
in setting up the intervals:

• The intervals must cover the complete range of values. The range is the
difference between the highest and the lowest values.

• The intervals must be equal in size.

• The number of intervals should be between 6 and 15. The use of too many
or too few intervals is not effective.

• Every data value to be tallied must fall into one and only one interval.
Thus, the intervals should not overlap.

• The intervals must be listed in order, either lowest to highest or highest
to lowest.
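
The tally for the 32 test scores can be reproduced with a short Python sketch. The function name `grouped_frequencies` and its parameters are illustrative choices, not part of the unit, and the sketch assumes integer data and intervals of equal length such as 41-50, 51-60 and so on.

```python
def grouped_frequencies(data, start, length, count):
    """Count how many values fall in each of `count` equal intervals.

    The first interval runs from start to start + length - 1, the next
    one begins at start + length, and so on (integer data assumed).
    """
    table = []
    for k in range(count):
        low = start + k * length
        high = low + length - 1
        frequency = sum(1 for value in data if low <= value <= high)
        table.append((low, high, frequency))
    return table


scores = [90, 85, 74, 86, 65, 62, 100, 95, 77, 82, 50, 83, 77, 93, 73, 72,
          98, 66, 45, 100, 50, 89, 78, 70, 75, 95, 80, 78, 83, 81, 72, 75]

# Six intervals of length 10, starting at 41, as in the example.
for low, high, frequency in grouped_frequencies(scores, 41, 10, 6):
    print(f"{low}-{high}: {frequency}")
```

Because every score belongs to exactly one interval, the frequencies add up to the number of test papers.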

Exercise 1

The following data show the beats per minute of 30 people.

87 85 61 51 64 75 80 70 69 82
80 79 82 74 92 76 72 73 63 65
67 71 88 76 68 73 70 76 71 86

a) Organize the data in a grouped frequency table. Use 6 intervals (from
50.5 to 92.5).

b) Use a histogram to display the data and draw the frequency polygon.

9.3.- STATISTICAL PARAMETERS


In addition to graphs and tables of numbers, statisticians often use common
parameters to describe sets of numbers. There are two major categories of
these parameters:



• Measures of central tendency: they measure how a set of numbers is
centered around a particular point on a line scale. The most important are
the mean, the mode and the median.

• Measures of variation: they tell us how far the numbers are scattered
about the center value of the set. They are also called measures of
dispersion. The most common parameters of variation are: the range, the
standard deviation and the variance.

MEAN:
The mean of a set of data is the total of all the values divided by the number of
values, that is, the average value of all the data in the set. The mean is denoted
by $\bar{x}$.

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$$

where $\sum f_i = N$ is the total number of data and $\sum x_i f_i$ is the sum of
all the data.

VARIANCE:
The variance is the average squared deviation of each number from the mean of
a data set. The variance is denoted by $\sigma^2$.

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$$

This formula is equivalent to the following one:

$$\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2$$

STANDARD DEVIATION

The standard deviation, σ , is the square root of the variance. It is the most
used measure of spread.

The units of the standard deviation are the same as the units of the data.

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}} = \sqrt{\frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2}$$
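
As a rough check of the formulas for the mean, the variance and the standard deviation, the sketch below computes them from a frequency table in Python. The function names and the three-row table are made up for illustration and are not taken from the unit.

```python
from math import sqrt


def mean(values, freqs):
    """Mean of a frequency table: sum(x_i * f_i) / sum(f_i)."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


def variance(values, freqs):
    """Average squared deviation from the mean, weighted by frequency."""
    m = mean(values, freqs)
    return sum((x - m) ** 2 * f for x, f in zip(values, freqs)) / sum(freqs)


def variance_shortcut(values, freqs):
    """Equivalent form: sum(x_i^2 * f_i) / sum(f_i) - mean^2."""
    m = mean(values, freqs)
    return sum(x * x * f for x, f in zip(values, freqs)) / sum(freqs) - m ** 2


def standard_deviation(values, freqs):
    """Square root of the variance."""
    return sqrt(variance(values, freqs))


# Made-up table for illustration: the value 1 once, 2 twice, 3 once.
xs = [1, 2, 3]
fs = [1, 2, 1]
print(mean(xs, fs))                # 2.0
print(variance(xs, fs))            # 0.5
print(variance_shortcut(xs, fs))   # 0.5
print(standard_deviation(xs, fs))  # about 0.707
```

Both variance functions give the same result; the second form is usually quicker by hand because it avoids subtracting the mean from every value.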



COEFFICIENT OF VARIATION

To compare the spread of two sets of data, you use the coefficient of
variation, CV. It is defined as the ratio of the standard deviation to the mean.

$$CV = \frac{\sigma}{\bar{x}}$$

Sometimes, this ratio is expressed as a percentage.
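
A minimal Python sketch of the coefficient of variation follows. The function name, the `as_percentage` parameter and the two sets of values are hypothetical and only illustrate why the ratio is useful for comparing spreads.

```python
def coefficient_of_variation(std_dev, mean, as_percentage=False):
    """Ratio of the standard deviation to the mean (mean must not be 0)."""
    if as_percentage:
        return 100 * std_dev / mean
    return std_dev / mean


# Made-up values: set A has the larger standard deviation,
# but set B has the larger spread relative to its mean.
cv_a = coefficient_of_variation(10, 200)
cv_b = coefficient_of_variation(5, 50)
print(cv_a, cv_b)                                           # 0.05 0.1
print(coefficient_of_variation(5, 50, as_percentage=True))  # 10.0
```

Since the standard deviation and the mean share the same units, the ratio has no units, which is what allows two different kinds of data to be compared.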

Exercise 2

Find the mean, the standard deviation and the coefficient of variation of the
following two sets of data.

xi    fi          Interval       fi
0     12          50.5 - 57.5    1
1     9           57.5 - 64.5    3
2     7           64.5 - 71.5    8
3     6           71.5 - 78.5    8
4     3           78.5 - 85.5    6
5     3           85.5 - 92.5    4

Exercise 3

The mean weight of the boys in a class is 58.2 kg and their standard deviation is
3.1 kg. The mean weight of the girls in that class is 52.4 kg and their standard
deviation is 5.2 kg. Find the coefficient of variation and compare the spread of
both groups.

Exercise 4

The table below shows the weights and the heights of 6 people.

Weight (kg)    Height (m)
65             1.70
60             1.50
63             1.70
63             1.70
68             1.75
68             1.80

Find the coefficient of variation and decide which set of data has more spread.



9.4.- MEASURES OF LOCATION

Median
The median, Me, of a data set is the value in the middle when the data items are
arranged in ascending order.

• For an odd number of data values: the median is the middle value.

• For an even number of data values: the median is the average of the middle
two values.

Whenever a data set has extreme values, the median is the preferred measure
of central location.

Example: The heights, in inches, of 20 students are shown in the following list.
The median is the average of the 10th and 11th data values.

53, 60, 61, 63, 64, 65, 65, 65, 65, 66 (lower half)
66, 67, 67, 68, 69, 70, 70, 71, 71, 73 (upper half)

$$\text{Median} = \frac{66 + 66}{2} = 66$$
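
Python's standard `statistics.median` function follows the same rule (middle value for an odd count, average of the two middle values for an even count), so it can be used to check the example:

```python
from statistics import median

heights = [53, 60, 61, 63, 64, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 69, 70, 70, 71, 71, 73]

# With 20 values, median() averages the 10th and 11th sorted values.
print(median(heights))  # 66.0
```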

Quartiles
When the values in a set of data are listed in numerical order, the median
separates the values into two equal parts. The numbers that separate the set
into four equal parts are called quartiles.

To find the quartile values, we first divide the set of data into two equal parts
and then divide each of these parts into two equal parts.

In the previous example:

53, 60, 61, 63, 64 | 65, 65, 65, 65, 66 | 66, 67, 67, 68, 69 | 70, 70, 71, 71, 73

1st quartile = 64.5     2nd quartile (median) = 66     3rd quartile = 69.5

The quartiles are denoted Q1, Q2 and Q3.
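
The "divide into halves, then halve again" procedure can be sketched in Python as follows. The function name `quartiles` is illustrative, and the handling of an odd number of values (the middle value is left out of both halves) is an assumption, since textbooks differ on that point.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) by splitting the sorted data into two halves.

    Assumption: with an odd number of values the middle value is left
    out of both halves. At least two values are needed.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(ordered), median(upper)


heights = [53, 60, 61, 63, 64, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 69, 70, 70, 71, 71, 73]

q1, q2, q3 = quartiles(heights)
print(q1, q2, q3)  # 64.5 66.0 69.5
```

Library routines for quantiles often interpolate between values in a different way, so their results may differ slightly from this method.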



Q1 is also called the lower quartile and Q3 is also called the upper quartile.

The difference between the upper quartile and the lower quartile is called the
interquartile range. It is a useful way to quantify scatter.

Notice that:

• Q1 is the number such that 25% of data are less than it and 75% are
larger.

• Me = Q2 is the number such that 50% of data are less than it and 50%
are larger.

• Q3 is the number such that 75% of data are less than it and 25% are
larger.

53, 60, 61, 63, 64 | 65, 65, 65, 65, 66 | 66, 67, 67, 68, 69 | 70, 70, 71, 71, 73
       25%         Q1        25%         Me        25%         Q3        25%

Percentiles
Quartiles are useful and they help to describe the distribution of values as we
have seen before.

However, we often want to know how one particular data value compares to the
rest of the data. For example, when you take a standardized test, you want to
know not only your own score, but also how it ranks in relation to all the
other scores. Percentiles are perfect for this situation.

The k-th percentile is the number such that k% of all data values are less than it
and (100 - k)% are larger.

Example: You are the fourth tallest person in a group of 20.

80% of people are shorter than you.

That means that you are at the 80th percentile.
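
The reasoning in this example (count how many values are smaller, then express that count as a percentage of the whole group) can be sketched in Python. The function name `percent_below` and the list of heights are hypothetical.

```python
def percent_below(data, value):
    """Percentage of data values strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


# Hypothetical group of 20 people with heights 150, 151, ..., 169 cm.
heights = list(range(150, 170))
fourth_tallest = sorted(heights)[-4]           # 166
print(percent_below(heights, fourth_tallest))  # 80.0
```

Sixteen of the twenty people are shorter than the fourth tallest, and 16 out of 20 is 80%.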



The k-th percentile is denoted pk.

Notice that: Q1 = p25 Me = p50 Q3 = p75

Exercise 5

The lower quartile for a set of data was 40. These data consisted of the
heights, in inches, of 680 children. At most, how many of these children
measured more than 40 inches?

Exercise 6

Select the correct answer.

On a standardized test, Sally scored at the 80th percentile. This means that

a) Sally answered 80 questions correctly.
b) Sally answered 80% of the questions correctly.
c) Of the students who took the test, about 80% had the same score as
Sally.
d) Of the students who took the test, at least 80% had scores that were less
than or equal to Sally’s score.

Exercise 7

For a set of data consisting of test scores, the 50th percentile is 87. Which of
the following could be false?

a) 50% of the scores are 87.
b) 50% of the scores are 87 or less.
c) Half of the scores are at least 87.
d) The median is 87.

Cumulative frequency

How can we find the median, the quartiles and the percentiles in a set of values?

In order to answer this question, it is useful to know the concept of cumulative
frequency.

The cumulative frequency, Fi , corresponding to the i-th value, xi , is the sum of
the frequency of this value and the frequencies of the previous ones.

Fi = f1 + f2 + ... + fi−1 + fi



Example:

The number of children of 120 couples is shown in the first two columns of the
table below. We can add more columns with the cumulative frequencies and the
percentages that correspond to these frequencies.

xi    fi    Fi     % = (Fi / N) · 100
0     10    10     8.3
1     20    30     25
2     41    71     59.2
3     29    100    83.3
4     14    114    95
5     5     119    99.2
6     1     120    100

95% of the couples have 4 children or fewer.
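
The cumulative columns of this table can be rebuilt with `itertools.accumulate`, which keeps a running total of the frequencies. The variable names in this sketch are illustrative.

```python
from itertools import accumulate

children = [0, 1, 2, 3, 4, 5, 6]         # x_i
couples = [10, 20, 41, 29, 14, 5, 1]     # f_i

cumulative = list(accumulate(couples))   # F_i
total = cumulative[-1]                   # N, the last cumulative frequency

# One row per value: x_i, f_i, F_i and the cumulative percentage.
for x, f, F in zip(children, couples, cumulative):
    print(f"{x}  {f:3d}  {F:3d}  {100 * F / total:5.1f}")
```

The last cumulative frequency always equals the total number of data, here 120.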

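If the cumulative frequencies are expressed as percentages, we can easily find
the percentiles.

The percentile pk is the first value whose cumulative frequency, expressed as a
percentage, is higher than k%. If the cumulative percentage of a value is
exactly k%, then pk is the mean of that value and the next one.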

In the example:

Me = p50 = 2 (because for the value xi = 2, Fi is greater than 50%)

Q1 = p25 = 1.5 (because for the value xi = 1, Fi is exactly 25%)

Q3 = p75 = 3 (because for the value xi = 3, Fi is greater than 75%)

p99 = 5 (because for the value xi = 5, Fi is greater than 99%)

p95 = 4.5 (because for the value xi = 4, Fi is exactly 95%)
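
The five results above follow one rule: take the first value whose cumulative percentage exceeds k%, or the mean of a value and the next one when its cumulative percentage is exactly k%. A Python sketch of that rule is given below; the function name `percentile` is illustrative, and the sketch assumes an integer k and positive frequencies.

```python
from itertools import accumulate


def percentile(values, freqs, k):
    """k-th percentile of a frequency table, using the rule in the text.

    Returns the first value whose cumulative percentage exceeds k%.
    If a value's cumulative percentage is exactly k%, returns the mean
    of that value and the next one. An integer k is assumed, so the
    comparison 100 * F == k * N is exact.
    """
    cumulative = list(accumulate(freqs))
    total = cumulative[-1]
    for i, running in enumerate(cumulative):
        if 100 * running > k * total:
            return values[i]
        if 100 * running == k * total:
            if i + 1 < len(values):
                return (values[i] + values[i + 1]) / 2
            return values[i]  # k = 100: there is no next value
    raise ValueError("k must be between 0 and 100")


children = [0, 1, 2, 3, 4, 5, 6]
couples = [10, 20, 41, 29, 14, 5, 1]

for k in (50, 25, 75, 99, 95):
    print(k, percentile(children, couples, k))
```

The comparison is done on whole numbers (100 · Fi against k · N) rather than on rounded percentages, so the "exactly k%" case is detected reliably.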



Exercise 8

The height, in cm, of a group of students is:

150 169 171 172 172 175 181
182 183 177 179 176 184 158

Find the median and quartiles.

Exercise 9

Find Me, Q1 , Q3 , p80 , p90 and p99 for the following set of marks.

Marks 1 2 3 4 5 6 7 8 9 10
Number of students 7 15 41 52 104 69 26 13 19 14

9.5.- BOX-AND-WHISKER PLOT

A box-and-whisker plot is a diagram that uses the quartile values, together
with the maximum and minimum values, to display information about a set of
data.

Example: The heights, in cm, of 40 students are listed in order:

149, 150, 154, 156, …………………………………………………, 174, 175, 175, 189

The statistical summary for these heights is:

Minimum value = 149 , Q1 = 160.5 , Me = 166 , Q3 = 169.5 , Maximum value = 189

To draw a box-and-whisker plot, we use the following steps.

Step 1: Draw a scale with numbers from the minimum to the maximum values of
a set of data.

[Figure: horizontal scale marked at 150, 160, 170, 180 and 190]



Step 2: Draw a box between the values that represent the lower and upper
quartiles, and a vertical line in the box through the point that represents the
median.

[Figure: the same scale with a box drawn from Q1 to Q3 and a vertical line at Me]

Step 3: Add the whiskers by drawing two line segments that include all the data
with the following condition:

The length of each segment must be less than or equal to 1.5 times the length
of the box.

If one or more data values lie beyond that limit, the corresponding whisker
is drawn up to the limit, and those values are marked individually in their
corresponding places.

In the example: the length of the box is 169.5 − 160.5 = 9 .

One and a half this length is 1.5 ⋅ 9 = 13.5 .

So each of the two whiskers can be at most 13.5 long.

The upper end of the box is 169.5, so the right whisker can reach at most
169.5 + 13.5 = 183, which is less than the maximum value of the data, 189.

The whisker on the right will therefore have length 13.5, and we will add an
asterisk to mark the value 189.

On the left, 160.5 − 13.5 = 147 is below the minimum value, 149, so the left
whisker simply ends at 149.

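[Figure: the finished plot on the same scale, with the box from Q1 to Q3, the line at Me, both whiskers, and an asterisk (*) at 189]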
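
The whisker rule of Step 3 can be sketched in Python from the statistical summary alone. The function name `whisker_ends` is illustrative, and the sketch follows the convention of this unit, in which a whisker that would be too long is drawn up to the limit of 1.5 box lengths.

```python
def whisker_ends(minimum, q1, q3, maximum):
    """Ends of the two whiskers when each is capped at 1.5 box lengths."""
    limit = 1.5 * (q3 - q1)
    left_end = max(minimum, q1 - limit)
    right_end = min(maximum, q3 + limit)
    return left_end, right_end


# Summary of the 40 heights: minimum, Q1, Q3 and maximum.
left, right = whisker_ends(149, 160.5, 169.5, 189)
print(left, right)   # 149 183.0
print(189 > right)   # True: 189 lies beyond the whisker and gets an asterisk
```

With only the summary values it is possible to tell that the maximum lies beyond the whisker, but not whether other data values do; for that the full list of data is needed.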



Exercise 10

The statistical summary for the marks of 87 people is: Q1 = 4.1 , Me = 5.1 ,
Q3 = 6.8 . All the marks lie between 1 and 9. Construct a box-and-whisker plot.

Exercise 11

Construct the box-and-whisker plot for the data in Exercise 8.

Exercise 12

The following data show the number of cars owned by each of the 25 families in
a housing development.

0 1 2 3 1 0 1 1 1
4 3 2 2 1 1 0 1 2
3 1 0 1 1 1 4

a) Find the median and the quartiles.
b) Construct the box-and-whisker plot.

9.6.-INFERENTIAL STATISTICS

Sampling
For any statistical project, you need to find out information about a group of
people or things. This group is called the population.

When you collect data about every member of a population, it’s called a census.

If you have a really big population, or one that is not very well defined, it
can be really hard or even impossible to do a census: it might take too long,
cost too much or be impractical.

Example: if a battery manufacturer that makes thousands of batteries a day
wanted to find out how long their batteries lasted, it wouldn’t be sensible to
test every single battery (a census) because the population is far too big (and
they wouldn’t have any batteries left to sell).

When it’s not sensible to collect information using a census, you have to use
sampling.

Choosing only a few members of a population is called sampling.



Before you can choose a sample from a population you need to make a list or map
of all the members of the population. This is called a sample frame.

Example: A student is trying to find out the average Key Stage 2 SATs score
for Maths in England in 2014.

The population would be all students in England who took the Key Stage 2 Maths
SAT in 2014.

The sample frame would be a list of all students who took the Key Stage 2
Maths SAT in 2014.

Sampling is a cheap and easy alternative to a census.

Example: Peter needs to find out about the heights of trees in a forest for his
biology project. Measuring every tree would take ages, so he surveys a sample of
500 trees to represent the whole population.

You can use the data you collect to make estimates and draw conclusions about
the whole population. The techniques that allow us to use samples to make
generalizations about the populations are called inferential statistics.

Exercise 13

Suppose you want to carry out studies about the following statistical variables:

- The means of transport used by the inhabitants of a neighbourhood to go to
work.
- The type of education that the students in a Secondary School want to pursue
after finishing ESO.
- The age of the people who have seen a play in a city.
- The daily time that children aged 5 to 10 spend watching TV in Castilla-La
Mancha.

a) What is the population in each of these studies?
b) In which of them do you think it is necessary to use a sample? Why?



Sample data must be representative
When you sample a population it’s important to make sure the sample fairly
represents the whole population.

This means any conclusions you draw from the data in the sample can be applied
to the whole population.

A biased study is one that doesn’t fairly represent the whole population. To
avoid bias you need to:

• Sample from the correct population.
• Select your sample at random (see next section).

A bigger sample is better because it’s more likely to be representative. It will
provide more reliable data, but it might be less practical to collect.

However, it is possible to draw really good conclusions about the whole
population using small samples.

Example of a biased sample:

A student wants to find out if students at her school think the tuck shop
provides good value for money. She chooses to sample the first 20 people in the
queue for the tuck shop at break time.

The student is sampling the wrong population: any students who don’t shop at
the tuck shop are excluded. People who strongly think the tuck shop is bad value
for money probably won’t shop there.

The sample is also non-random because she hasn’t randomly selected the
students in the queue, or the time or day of sampling.

Exercise 14

A survey of the incomes of people in a town uses a sample of 10 households in
one street. Do you think this is a sensible sample to choose? Explain your answer.

Exercise 15

A food company wants to find out about the snack preferences of young people in
a city. As their sample frame they use a list of all the students who go to one
particular secondary school. Explain why this isn’t a representative sample.



Exercise 16

A car manufacturer sends out customer satisfaction questionnaires to all the
people who have bought new cars from them in the past year. Will the returned
forms make up a representative sample? Explain your answer.

Choosing things at random


Something is chosen at random when every item in a group has an equal chance
of being chosen.

Example: If you had a bag with 3 different coloured balls (all of the same size)
in it and picked one out without looking, it would be random because all the balls
have an equal chance of being picked.

Simple random sampling


There are different sampling methods: simple random sampling, stratified
sampling, systematic sampling, cluster sampling, …

We will study the first one: simple random sampling.

In a simple random sample, you randomly select your sample from the sample
frame.

It’s easiest to do this type of sampling with a small, well defined population.

To select the sample you need to use random numbers.

Here’s how you do it:

1) Assign a number to every member in the sample frame.
2) Use a computer, calculator or random number table to create a list of
random numbers.
3) Finally, match the numbers of the members in the sample frame to the
numbers on the random list to create the sample.
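
When a computer is used for step 2, the three steps can be sketched in Python with the standard `random` module. The function name `simple_random_sample` and the sizes in the usage line are illustrative choices.

```python
import random


def simple_random_sample(frame_size, sample_size, seed=None):
    """Pick `sample_size` different member numbers from 1..frame_size.

    Members of the sample frame are assumed to be numbered 1, 2, ...,
    frame_size. A seed only makes the result repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, frame_size + 1), sample_size)


# Hypothetical frame of 30 members, sample of 5.
chosen = simple_random_sample(frame_size=30, sample_size=5)
print(sorted(chosen))
```

`sample` never returns the same number twice, so every member of the frame has the same chance of being chosen and nobody can be selected more than once.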

Example: Describe how you would use simple random sampling to select a sample
of 50 students from a population of 900 students at a school.

1) Assign the numbers 1 to 900 to all the students in the sample frame.

2) Use a random number table to get a list of 50 different random numbers
between 1 and 900.



Random number tables are just what they sound like: tables of random
numbers. You need to make your method of choosing numbers from the
table clear and stick to it.

0712 7839 6210 0335 7748
5509 1784 7362 2731 4283
9936 8012 3502 7523 3718
6021 1344 9275 3281 5002
1712 7787 3243 3262 4452
5166 4797 1044 4510 4971
1244 1397 4425 8677 1986
1691 1725 3450 2914 3718
2272 0560 6682 9172 4680

• To get numbers between 1 and 900 you could choose to use the last three
digits of each number and read across the table, e.g. the first number would
be 712.
• If you come to a number that’s outside your range or a repeated number you
just ignore it, e.g. you’d ignore 936.
• So, starting at the first row and using only numbers between 1 and 900 gives
the list 712, 839, 210, 335, 748, 509, …

3) Then match the 50 random numbers from the table with the list to get
your sample.
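
The reading method described for the random number table (last three digits, read across, skip numbers out of range or repeated) can be sketched in Python. The names `TABLE` and `read_table` are illustrative.

```python
TABLE = [
    "0712 7839 6210 0335 7748",
    "5509 1784 7362 2731 4283",
    "9936 8012 3502 7523 3718",
    "6021 1344 9275 3281 5002",
    "1712 7787 3243 3262 4452",
    "5166 4797 1044 4510 4971",
    "1244 1397 4425 8677 1986",
    "1691 1725 3450 2914 3718",
    "2272 0560 6682 9172 4680",
]


def read_table(rows, low, high):
    """Read across each row, keeping the last three digits of each entry.

    Numbers outside low..high and repeated numbers are skipped.
    """
    picked = []
    for row in rows:
        for entry in row.split():
            number = int(entry[-3:])
            if low <= number <= high and number not in picked:
                picked.append(number)
    return picked


numbers = read_table(TABLE, 1, 900)
print(numbers[:6])   # [712, 839, 210, 335, 748, 509]
```

The printed table has only 45 entries, so on its own it cannot supply all 50 numbers; a longer table would be needed to complete the sample.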

Conclusions drawn from a sample


Focus on the following example in order to study the conclusions that we can
draw using a sample and their confidence level.

If we want to estimate the height of 21 year old males, we can use a random
sample of 200 of them. The mean of the heights of this sample is x = 173 cm.

From these data you can make the following estimate:

‘The mean height for 21 year old males is approximately equal to 173 cm’.

With the word “approximately” we want to indicate “more or less”. For example,
either 0.5 cm more or 0.5 cm less.

That is, the mean height for 21 year old males lies within the interval
(172.5 , 173.5).

Sure? Of course not. It’s just probable. The statement could be something like
this:

‘The mean height for 21 year old males lies within the interval
(172.5 , 173.5) with a confidence level of 90%’.

(This means that the probability that the statement is true is 0.9.)



People are often surprised to learn that 99% confidence intervals are wider
than 95% intervals, and 90% intervals are narrower. But this makes perfect
sense.

If you want more confidence that an interval contains the true parameter, then
the interval will be wider. If you want to be 100% sure that an interval contains
the true parameter, it has to contain every possible value, so it must be very
wide. If you are willing to be only 50% sure that the interval contains the true
value, then it can be much narrower.

The higher the confidence level, the wider the interval.

The size of the sample is also important:

For a given interval length, the larger your sample size, the larger your
confidence level.

For a given confidence level, the larger your sample size, the smaller your
confidence interval.
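
These relationships can be explored numerically. The sketch below is not the method of this unit: it assumes the usual normal approximation, in which the interval for a mean is the sample mean plus or minus z · σ / √n, and it assumes a known standard deviation whose value (7 cm) is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def half_width(confidence, sigma, n):
    """Half-width z * sigma / sqrt(n) of a confidence interval for a mean.

    Assumes a known population standard deviation `sigma` and the normal
    approximation; neither assumption comes from the text above.
    """
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sigma / sqrt(n)


SIGMA = 7.0  # hypothetical standard deviation of heights, in cm

# Same sample size, rising confidence level: the interval gets wider.
for level in (0.90, 0.95, 0.99):
    print(level, round(half_width(level, SIGMA, 200), 2))

# Same confidence level, larger sample: the interval gets narrower.
print(round(half_width(0.95, SIGMA, 800), 2))
```

Under these assumptions, multiplying the sample size by four halves the width of the interval.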

Exercise 17

A random sample of 64 people from a certain population took a test. Using
the marks of this sample we draw the following conclusion:

“The mean mark for the whole population lies between 42.7 and 44.1 points, and
we say this with a 95% confidence level”.

• If the confidence interval was 42 – 44.8, then the confidence level would
be …
a) 90% b) 95% c) 98%

• If we wanted a 99% confidence level and an interval with the same length,
what would be the size of the sample?

a) less than 64 b) 64 c) greater than 64
