
2025 Data Notes.docx

The document outlines the curriculum for VCE General Mathematics in 2025, focusing on Data Analysis. It covers various topics including data distributions, associations between variables, linear modeling, and data transformations. Key concepts include types of data, displaying categorical and numerical data, and using logarithmic scales for data representation.

VCE GENERAL MATHEMATICS

2025

Area of Study 1

Data Analysis

Chapter 1: Investigating data distributions.

Chapter 2: Investigating associations between two variables.

Chapter 3: Investigating and modelling linear associations.

Chapter 4: Data Transformations

Chapter 5: Investigating and modelling time series data

Chapter 1:

Investigating Data Distributions

1A Types of Data

Categorical Data
• Categorical variables classify or name a quality or attribute – for example, a person's eye colour, study mode, or fitness level.
• Nominal variables have data values that are simply names.
• Ordinal variables have data values that can be used to both name and order.

Numerical Data
• Numerical data have data values which are quantities, generally arising from counting or measuring.
• Discrete data are those which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4
• Continuous data take an infinite number of possible values and are often associated with measuring.

Deciding …..
• Numerical data can always be used to perform arithmetic computations such as finding the average.
• The way the data are recorded can determine its type, i.e. if the variable weight is recorded in kilograms, it is numerical. However, if the data are recorded as 'underweight', 'normal weight', 'overweight', it is categorical.

Example 1

Classify the following data as nominal, ordinal, discrete or continuous.
a. The number of chocolate chips in each of 10 cookies is counted.
b. The time taken for 20 students to complete a puzzle is recorded in seconds.
c. Members of a football club were asked to rate how they felt about the current coach, 1 = Very satisfied, 2 = Satisfied, 3 = Indifferent, 4 = Dissatisfied, 5 = Very dissatisfied.
d. Students are asked to each choose their preferred colour from the list 1 = Blue, 2 = Green, 3 = Red, 4 = Yellow.
e. Students' weights were classified as 'less than 60kg', '60kg - 80kg' or 'more than 80kg'.

1B Displaying and Describing Categorical Data

Frequency table

$\%\ \text{frequency} = \dfrac{\text{no. of times the value occurs}}{\text{total number of values}} \times 100$

• Find the maximum and the minimum values in the data set.
• Construct a table including all the values between the minimum and the maximum.
• Count the number of values in the data set.
• Record these values in the number column and add the frequencies to find the total.
• Convert the frequencies to percentages, and record in the per cent column.
• Total the percentages and record.

Bar chart
• frequency (or percentage frequency) is shown on the vertical axis
• the variable being displayed is plotted on the horizontal axis
• the height of the bar (column) gives the frequency (number or percentage)
• the bars are drawn with gaps to show that each value is a separate category
• there is one bar for each category.

Segmented bar chart
• In a segmented bar chart, the bars are stacked one on top of another to give a single bar with several parts or segments.
• The lengths of the segments are determined by the frequencies.
• The height of the bar gives the total frequency.

Constructing a Segmented bar chart
• Use a ruler.
• Provide a key.
• Label axes.
• Provide a title.
• Ensure accuracy (100%)

Example 1
A group of 11 preschool children were asked to choose between chocolate and vanilla ice-cream (C = chocolate, V = vanilla):

CVVCCVCCCVV

Construct a frequency table (including percentage frequencies) to display the data.
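The frequency-table steps can be checked with a short Python sketch. This is only an illustration of the counting and percentage steps, not the CAS method used in class; the variable names are made up for this example.

```python
from collections import Counter

# Responses from Example 1 (C = chocolate, V = vanilla).
responses = "CVVCCVCCCVV"

counts = Counter(responses)
total = sum(counts.values())

# Percentage frequency = count / total * 100, to one decimal place.
table = {
    flavour: (count, round(count / total * 100, 1))
    for flavour, count in sorted(counts.items())
}

for flavour, (count, percent) in table.items():
    print(f"{flavour}: frequency {count}, percentage {percent}%")
print(f"Total: {total}")
```

The percentages are rounded, so their total can differ slightly from 100.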

Example 2
Constructing a percentage segmented bar chart from a frequency table.

The climate type of 23 countries is classified as 'cold', 'mild' or 'hot'. Construct a percentage frequency segmented
bar chart to display this information.

1C Displaying and describing Numerical Data

Grouped Frequency table
• Every data value should be in an interval.
• The intervals should not overlap.
• There should be no gaps between the intervals.
• Choose a division which results in about 5 to 15 groups.
• Choose an interval width that is easy to interpret
• intervals of 0–49, 50–99, 100–149 would be preferred over the intervals 1–50, 51–100, 101–150

Example 1
Constructing a frequency table for discrete numerical data taking a small number of values.

The number of bedrooms in each of the 24 properties sold in a certain area over a one month period are as follows:

2 3 4 3 3 4
3 4 4 1 3 2
1 2 2 2 4 5
3 4 4 5 3 4

Construct a table for these data showing both frequency and percentage frequency, to one decimal place.

Example 2
The data below give the average hours worked per week in 23 countries. Construct a grouped frequency table with
five intervals.

35.0 48.0 45.0 43.0 38.2

50.0 39.8 40.7 40.0 50.0

35.4 38.8 40.2 45.0 45.0

40.0 43.0 48.8 43.3 53.1

35.6 44.1 34.8
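
As a rough check of a grouped frequency table, the sketch below sorts the 23 values into five intervals. The choice of intervals (width 5, starting at 30.0) is an assumption made for this illustration; it matches the intervals used in the histogram table later in these notes, but other choices are possible.

```python
# Average hours worked per week in 23 countries (Example 2).
hours = [
    35.0, 48.0, 45.0, 43.0, 38.2,
    50.0, 39.8, 40.7, 40.0, 50.0,
    35.4, 38.8, 40.2, 45.0, 45.0,
    40.0, 43.0, 48.8, 43.3, 53.1,
    35.6, 44.1, 34.8,
]

# Assumed intervals: 30.0-34.9, 35.0-39.9, ..., 50.0-54.9.
start, width, groups = 30, 5, 5

frequencies = [0] * groups
for value in hours:
    index = int((value - start) // width)  # which interval the value falls in
    frequencies[index] += 1

n = len(hours)
for i, freq in enumerate(frequencies):
    low = start + i * width
    print(f"{low:.1f}-{low + width - 0.1:.1f}: {freq} ({freq / n * 100:.1f}%)")
print(f"Total: {n}")
```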

Example 3
The data below give the average hours worked per week in 23 countries. Construct a grouped frequency table with
five intervals.

35.0 48.0 45.0 43.0 38.2 50.0 39.8 40.7 40.0 50.0 35.4 38.8

40.2 45.0 45.0 40.0 43.0 48.8 43.3 53.1 35.6 44.1 34.8

Constructing a Histogram

From a frequency table
• frequency (count or per cent) is shown on the vertical axis
• the values of the variable being displayed are plotted on the horizontal axis
• each bar in a histogram corresponds to a data interval
• the height of the bar gives the frequency (or the percentage frequency).

From raw data
• Create a grouped frequency table first.

Example 1
Construct a histogram from a frequency table

Average hours worked    Frequency
                        Number    %
30.0−34.9               1         4.3
35.0−39.9               6         26.1
40.0−44.9               8         34.8
45.0−49.9               5         21.7
50.0−54.9               3         13.0
Total                   23        99.9

Example 2
Construct a histogram from a frequency table

Average hours worked    Frequency
30.0–34.9               1
35.0–39.9               6
40.0–44.9               8
45.0–49.9               5
50.0–54.9               3
Total                   23

CAS Tips
• Add Lists and Spreadsheets
• Name Column A
• Enter the data
• Add Data and Statistics
• Select x data set
• Menu - plot type – histogram
• Plot properties
• Equal bin widths

Analysing a Histogram
The purpose of constructing a histogram is to help understand the key features of the data distribution. These
features are:

Shape
Symmetrical
Bimodal
Positive skew
Negative Skew

Centre
Middle

Spread
Wide
Narrow
Outliers
Extreme values

Describing the features of a distribution from a histogram

Example 1

The histogram shows the gestation period (completed weeks) for a sample of 1000 babies born in Australia one
year. Describe this histogram in terms of shape, centre, spread and outliers, in that order.

Analysing a histogram
Write a paragraph including all 4
features.

1D Dot Plots and Stem Plots

Constructing a dot plot
• To display small sets of discrete numerical data.
• Draw a number line with each data point marked by a dot.
• Same values are stacked on top of each other.
• Ensure equal space between dots.
• Label number line.

Example 1

Construct a dot plot for the ages (in years) of the 13 members of a cricket team.

22 19 18 19
23 25 22 29
18 22 23 24
22

Constructing a Stem Plot
• Stem plots are used for both discrete and continuous data.
• Display small- to medium-sized sets of data (up to about 50 data values).
• A pen and paper technique.
• Each data value is separated into two parts: its leading digits, which make up the 'stem' of the number, and its last digit, which is called the 'leaf'
• Include a key
• Ensure numbers are spaced out evenly.
• Splitting the stems is useful when there are only a few different values for the stem.

Example 2

Construct a stem plot for the university participation rates (%) in 23 countries given below.

26 3 12 20 36 1 25 26
13 9 26 27 15 21 7 8
22 3 37 17 55 30 1

Example 3

Consider the marks obtained by 17 VCE students on a statistics test.

Split the stem plot in halves and fifths.

2 12 13 9 18

17 7 16 12 10

16 14 11 15 16

15 17

1E Using a logarithmic (base 10) scale to display data
Many numerical variables that we deal with in statistics have values that range over several orders of magnitude or
very small and large numbers that need to be represented on the same scale.

numbers   0.01        0.1         1        10       100      1000     10 000   100 000  1 000 000
powers    $10^{-2}$   $10^{-1}$   $10^0$   $10^1$   $10^2$   $10^3$   $10^4$   $10^5$   $10^6$
logs      −2          −1          0        1        2        3        4        5        6

Logarithmic Transformation
• logs
• Recall indices, i.e. $10^3$
• Logs are the powers
• Logs are used to graph very small or very large data sets on one set of axes.
• We will only use log to the base of 10.

number    1        10       100
indices   $10^0$   $10^1$   $10^2$
logs      0        1        2

Example 1

Write the number 100 as a power of 10, and then write down its logarithm.

a. 1            b. 10
c. 100          d. 1 000
e. 10 000       f. 100 000
g. 1 000 000    h. 10 000 000

Example 2

a. Use your CAS to find the logs of the following numbers to 1 d.p.

a. 45 b. 245 c. 3546

b. Find the numbers whose logarithms are as follows, to one decimal place.

a. 3.1876 b. 2.8517 c. 4.6531

CAS Tips
FROM Number TO Log: $\log_{10}(45)$ enter
FROM Log TO Number: $10^{1.65321}$ enter

Example 3

The histogram shows the distribution of the weights of 27 animal species plotted on a log scale.

a. What body weight (in kg) is represented by the number 4 on the log scale?

b. How many of these animals have body weights more than 10 000 kg?

c. The weight of a cat is 3.3 kg. Use your calculator to determine the log of its weight correct to two significant figures.

d. Determine the weight (in kg) of the animal with a log(body weight) of 3.4 (the elephant). Write your answer correct to the nearest whole number.

CAS Tips
• Add List and Spreadsheets
• Enter x into column A
• Name column B as 'logs'
• Directly below type '= $\log_{10}(x)$' into template
• Add Data and Statistics
• Click to add variable
• Plot Properties > Histogram Properties > Bin Settings > Equal Bin Width and set the column width (bin) to 1 and alignment (start point) to −2 and use menu > Window/Zoom > Zoom-Data to rescale.
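
Parts a, c and d only need the two conversions between a number and its base-10 log, so they can be sketched in Python as below. Part b needs the histogram itself and is not attempted. The helper `round_sig` is a made-up name for this illustration, not a CAS function.

```python
import math


def round_sig(value, figures):
    """Round value to the given number of significant figures."""
    return float(f"{value:.{figures}g}")


# a. From log to number: a log of 4 means 10 to the power 4.
weight_at_log_4 = 10 ** 4

# c. From number to log: log10 of the cat's weight, to 2 significant figures.
cat_log_weight = round_sig(math.log10(3.3), 2)

# d. From log to number: 10 to the power 3.4, to the nearest whole number.
elephant_weight = round(10 ** 3.4)

print(weight_at_log_4, cat_log_weight, elephant_weight)
```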

1F Measures of centre and spread
The mean, the median, the range, the interquartile range, the standard deviation

The median
• Measures centre and is used when data is skewed.
• Middle value in an ordered data set
• Position of the median: $\frac{n+1}{2}$ ($n$ odd); the mean of the values in positions $\frac{n}{2}$ and $\frac{n}{2}+1$ ($n$ even)

The mean
• Measures centre
• $\text{mean} = \dfrac{\text{sum of values}}{\text{total no. of values}}$

The range
• Measures spread and is not used if outliers are present
• $\text{Range} = \text{Max} - \text{Min}$

The IQR
• Measures spread of the middle 50% of the data
• Used if outliers are present
• $IQR = Q_3 - Q_1$

The STANDARD DEVIATION
• Measures spread as the average distance of each value from the mean.
• $S.D. = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n-1}}$

Example 1

Order each of the following data sets, locate the median, and then write down its value.

a. 2 9 1 8 3 5 3 8 1
b. 10 1 3 4 8 6 10 1 2 9

Example 2

Finding the median value from a dot plot

The dot plot opposite displays the age distribution (in years) of the 13 members of a local cricket team.

Example 3

The histogram shows the average number of hours per week a group of 23 people spent on the internet. Find possible values for the median and quartiles of this distribution.
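
The summary measures above can be checked with Python's standard `statistics` module. The sketch below uses the two data sets from Example 1; the function name `summarise` is made up for this illustration. The IQR is left out on purpose, because software quartile methods do not always match the by-hand method taught here.

```python
import statistics

# Data sets from Example 1.
data_a = [2, 9, 1, 8, 3, 5, 3, 8, 1]          # n = 9 (odd)
data_b = [10, 1, 3, 4, 8, 6, 10, 1, 2, 9]     # n = 10 (even)


def summarise(data):
    """Return the median, mean, range and sample standard deviation."""
    return {
        "median": statistics.median(data),   # middle value of the ordered data
        "mean": statistics.mean(data),       # sum of values / number of values
        "range": max(data) - min(data),
        "sd": statistics.stdev(data),        # divides by n - 1
    }


summary_a = summarise(data_a)
summary_b = summarise(data_b)
print(summary_a)
print(summary_b)
```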

Example 4

The stem plot displays the maximum temperature (in °C) for 12 days in January. Determine the median maximum
temperature for these 12 days.

Example 5

Determine the temperature range over these 12 days.

Example 6

Find the interquartile range of the weights of the 18 cats whose weights are displayed in the ordered stem plot
below. (n = even)

1 | 2 represents 1.2 kg

Example 7

The stem plot shows the life expectancy (in years) for 23 countries. Find the IQR for life expectancies.
Stem: 5|2=52 years

Example 8

Calculating the mean from the formula. The following is a set of reaction times (in milliseconds):

38 36 35 43 46 64 48 25

Write down the values of the following, correct to one decimal place.

a. $n$ b. $\sum x$ c. $\bar{x}$

CAS TIPS
• Add List and Spreadsheets
• Enter x into column A
• Enter data
• Menu Stats
• Stats Calculations
• One variable statistics

Symbols
$\bar{x}$ = mean
$s_x$ = standard deviation
$Q_2$ = median
$n$ = number of values in the set
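
As a quick check of Example 8, the three quantities can be computed directly. This is a minimal sketch; the variable names are chosen for this illustration only.

```python
# Reaction times (in milliseconds) from Example 8.
times = [38, 36, 35, 43, 46, 64, 48, 25]

n = len(times)                 # a. number of values
total = sum(times)             # b. sum of the values
mean = round(total / n, 1)     # c. mean, to one decimal place

print(n, total, mean)
```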

1G The Five Number Summary and the Box Plot
Five-number summary
• minimum,
• Q1,
• Q2 or M,
• Q3,
• maximum

Example 1

The stem plot shows the distribution of life expectancies (in years) in 23 countries. Construct and analyse the
boxplot.

Analysing Box Plots with outliers

Example 2

For the boxplot shown, write down the values of:

a. the median
b. the quartiles Q1 and Q3
c. the interquartile range (IQR)
d. the minimum and maximum values
e. the values of any possible outliers
f. the smallest value in the upper end of the data set that will be classified as an outlier
g. the largest value in the lower end of the data set that will be classified as an outlier.

Example 3

For the boxplot shown, estimate the percentage of values:

a. less than 54 b. less than 55

c. less than 59 d. greater than 59

e. between 54 and 59 f. between 54 and 86.

Example 4

Describe the distribution represented by the boxplot in terms of shape, centre and spread.

Example 5

Describe the distributions represented by the boxplot.

Example 6

The boxplot shows the gestation period (completed weeks) for a sample of 1000 babies born in Australia
one year. Describe the distribution of gestation period in terms of shape, centre, spread and outliers.

1H The normal distribution and the 68–95–99.7% rule

68–95–99.7% Rule
• 68% of values lie in the interval ($\bar{x} - s$, $\bar{x} + s$).
• 95% of values lie in the interval ($\bar{x} - 2s$, $\bar{x} + 2s$).
• 99.7% of values lie in the interval ($\bar{x} - 3s$, $\bar{x} + 3s$).

STANDARD SCORES

$z = \dfrac{x - \bar{x}}{s}$

• Meaning of z-score
• a positive z-score = above the mean
• a z-score of zero = equal to the mean
• a negative z-score = below the mean.

Example 1

The heights of a group of young women have a mean of $\bar{x} = 160$ cm and a standard deviation of $s = 8$ cm. Determine the standard or z-scores of a woman who is:
a. 172 cm tall b. 150 cm tall c. 160 cm tall.
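
The standard score formula can be applied to Example 1 with a few lines of Python. This is a minimal sketch; `z_score` is a name chosen for this illustration.

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


# Example 1: mean height 160 cm, standard deviation 8 cm.
mean_height, sd_height = 160, 8
z_scores = {h: z_score(h, mean_height, sd_height) for h in (172, 150, 160)}

for height, z in z_scores.items():
    print(f"{height} cm: z = {z}")
```

A positive result is above the mean, a negative result is below it, and zero is exactly at the mean.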

Example 2

a. Consider a student who obtained a mark of 75 in Psychology and a mark of 70 in Statistics. In which subject did she do better?

b. Another student studying the same two subjects obtained a mark of 55 for both Psychology and Statistics.
Does this mean that she performed equally well in both subjects?

Example 3

Suppose the weight of a certain species of bird is normally distributed with a mean of 42 grams with a standard
deviation of 3 grams. If a bird selected at random from this population has a standardised weight of z=−1, what
percentage of birds in this population weigh more than this bird? Approximately what percentage of birds would
weigh between 39 and 48 grams?

Example 4

A class test (out of 50) has a mean mark of $\bar{x} = 34$ and a standard deviation of $s = 4$. Joe's standardised test mark
was $z = -1.5$. What was Joe's actual mark?

Example 5
Suppose the heights of red flowering gum trees have a mean of 10.2 metres, and 2.5% of these trees grow to more
than 11.4 metres tall. Assuming that the heights of these trees are approximately normally distributed, what is the
standard deviation of the height of the red flowering gum trees?

Example 6

The marks scored in an examination are known to be approximately normally distributed. If 16% of students score
more than 80 marks, and 2.5% of students score less than 20 marks, estimate the mean and standard deviation of
this distribution.
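
One way to reason through Example 6 is sketched below in Python. It relies only on the 68–95–99.7% rule: 16% in the upper tail puts 80 one standard deviation above the mean, and 2.5% in the lower tail puts 20 two standard deviations below it. The variable names are made up for this illustration.

```python
# 68% lie within 1 SD of the mean, so 32% lie outside: 16% in each tail.
# 16% score more than 80  ->  80 = mean + 1 * sd
upper_mark, upper_z = 80, 1

# 95% lie within 2 SD of the mean, so 5% lie outside: 2.5% in each tail.
# 2.5% score less than 20  ->  20 = mean - 2 * sd
lower_mark, lower_z = 20, -2

# Subtracting the two equations: 80 - 20 = (1 - (-2)) * sd
sd = (upper_mark - lower_mark) / (upper_z - lower_z)
mean = upper_mark - upper_z * sd

print(f"mean = {mean}, standard deviation = {sd}")
```

The same idea handles Example 5, where 2.5% above 11.4 m places 11.4 two standard deviations above the mean of 10.2 m.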

Chapter 2: Investigating associations between two variables
2A Bivariate data - Classifying the variables

Bivariate data
• Where the two variables are linked in some way (associated) so that they vary together, thus bivariate
• One of the variables is the explanatory variable. The other variable is the response variable. We use the explanatory variable to explain changes that might be observed in the response variable.

Example 1

For each of the following questions, determine if they involve investigating associations between one numerical variable and one categorical variable or two categorical variables or two numerical variables.

a. Are younger people (age measured in years) more likely to believe in astrology (measured as 'yes' or 'no') than older people?

b. Do people who weigh more (weight measured in kg) tend to have higher blood pressure (blood pressure measured in mmHg)?

c. Are people who have a driver's licence (measured as 'yes' or 'no') more likely to be in favour of lowering the driving age (measured as 'yes' or 'no')?

Example 2
We wish to investigate the question, 'Does the time it takes a student to get to school depend on their mode of
transport?' The variables here are time and mode of transport. Which is the response variable (RV) and which is the
explanatory variable (EV)?

Example 3
a. We wish to investigate the question, 'Can we predict people's height (in cm) from their wrist measurement?'
The variables in this investigation are height and wrist measurement. Which is the response variable (RV)
and which is the explanatory variable (EV)?
b. We wish to investigate the question, 'Can we predict people's wrist measurement from their height?'
Which is the response variable (RV) and which is the explanatory variable (EV)?

Framing the question
The way the question is framed determines the EV and RV.

2B Investigating associations between categorical variables
Constructing a two-way frequency table
Question – Does support for gun control depend on where a person lives?

Constructing a two-way frequency table
• Decide the EV and RV
• EV is placed in the columns.
• RV is placed in the rows.
• Count responses using a tally
• Values in the cells can be raw numbers or percentages.
• To analyse the data compare ONE ROW
• Write a report
• Use comparative language.

Use a tally

                           Residence
Attitude to gun control    Country                             City                             Total
For                        //// //// //// //// //// //// //    //// //// //// //// //// ////    62
Against                    //// //// //// //// //// /          //// //// //                     38
Total                      58                                  42                               100

A two-way frequency table – with raw values

A two-way frequency table – with Percentages

           Residence
           Country    City
For
Against
Total

Yes, a relationship exists. In this sample of 100 people, a higher percentage of city people
were for gun control than country people: 71.4% to 55.2%. This indicates that a person's
attitude to gun control is associated with their place of residence.
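
The percentages quoted in the report are column percentages, which can be reproduced with the sketch below. The cell counts are an assumption: they are read off the tally marks (each group counted as five), and they agree with the row and column totals in the table.

```python
# Cell counts read from the tally marks (assumed: each tally group is 5).
counts = {
    "Country": {"For": 32, "Against": 26},
    "City": {"For": 30, "Against": 12},
}

# Column percentages: each cell as a percentage of its EV (residence) total.
percentages = {}
for residence, column in counts.items():
    column_total = sum(column.values())
    percentages[residence] = {
        attitude: round(count / column_total * 100, 1)
        for attitude, count in column.items()
    }

for residence, column in percentages.items():
    print(residence, column)
```

Comparing one row (for example 'For') across the columns then gives the 55.2% versus 71.4% comparison used in the report.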

Example 1

The following data were obtained when a sample of ten Year 9 students were asked if they intended to go to
university (university). The gender of the student was also recorded. Create a two-way frequency table from these data.

University

male female

yes

no

total

University

male female

yes

no

total

University

male female

yes

no

total

Example 2

Are males and females in Year 9 equally likely to indicate an intention to go to university? Data from
interviews with 200 Year 9 students are summarised in the following table. Write a brief report addressing this
question and quoting appropriate percentages.

Two-way frequency tables for categorical variables taking more than two values

Example 3

The table displays the smoking status for a group of adults (smoker, past smoker, never smoked) by
educational level (Year 9 or less, Year 10 or 11, Year 12, university).

Example 4

A survey was conducted with 1000 males under 50 years old. As part of this survey, they were asked to rate
their interest in sport as 'high', 'medium', or 'low'. Their age group was also recorded as 'under 18', '19–25',
'26–35' and '36–50'. The results are displayed in the table. Is there an association?

The segmented bar chart - from a two-way frequency table

Segmented bar chart
• A segmented bar chart is a visual display of a two-way frequency table.
• It consists of separate bars for each value of the EV.
• Each bar is separated into parts (segments).
• It shows the percentage for each value of the response variable.

Example 5

Construct a segmented bar chart to display the association between interest in sport and age group
displayed in the table and describe the association.

Example 6

The percentage segmented bar chart below shows the association between preferred holiday (country or coast)
and age group (under 40, 40 or over) for a sample of 800 visitors to a travel website.

2C Investigating the association between a numerical and a categorical variable

Data Display
✓ Parallel dot plots,
✓ Back-to-back stem plots or
✓ The parallel boxplots.

Describe the distribution
• shape
• centre
• spread
• outliers

Analyse the distribution
• Medians
• Mean
• IQR
• Range
• Standard deviation

Note:
• If the distribution is skewed, use median and IQR instead of mean and range

Comparative Report
• Comparative language: higher, lower, more, less, greater, smaller

Example 1

The parallel dot plots below display the distribution of the number of sit-ups performed by 15 people before
and after they had completed a gym program. Do the parallel dot plots support the contention that the number
of sit-ups performed is associated with completing the gym program? Write a brief explanation that compares
the distributions.

Example 2

The back-to-back stem plot below displays the distribution of life expectancy (in years) for the same 13 countries in
1970 and 2010. Do the back-to-back stem plots support the contention that life expectancy has changed between
these two time periods?

Example 3

Use the following parallel boxplots to compare the pulse rates (in beats/minute) for a group of 70 male
students and 90 female students.

Example 4

Use the parallel boxplots below to compare the salary distribution for workers in a certain industry across four
different age groups: 20–29 years, 30–39 years, 40–49 years and 50–65 years.

2D Investigating associations between two numerical variables

Scatterplots
• A scatterplot is used when both of the variables are numerical.
• Each point represents a single case.
• The vertical or y-axis is for the response variable (RV).
• The horizontal or x-axis is for the explanatory variable (EV).

Example 1

We wish to investigate the association between university participation rate (the EV) and average hours
worked (the RV) in nine countries. The starting point for this investigation is again a graphical display
of the data. Here our option is to construct a scatterplot. The data for 9 countries are shown below.

CAS TIPS
• Add List and Spreadsheets
• Label column A with EV
• Label column B with RV
• Add Data and Statistics
• Click to add variable
• Create the scatterplot

2E How to interpret a scatterplot - Describing Association
• Direction (positive or negative)
• Strength (strong, moderate, weak)
• Form (linear or non-linear)

Association
Positive association: the value of the response variable tends to increase as the value of the explanatory variable increases.

Negative association: the value of the response variable tends to decrease as the value of the explanatory variable increases.

No association: there is no consistent change in the value of the response variable when the values of the explanatory variable increase.

Form
Linear: a scatterplot is said to have a linear form when the points tend to follow a straight line.

Non-linear: a scatterplot is said to have a non-linear form when the points tend to follow a curved line.

Strength
Strong: when there is a strong association between the variables, there is only a small amount of scatter in the plot, and a pattern is clearly seen.

Moderate: as the amount of scatter in the plot increases, the pattern becomes less clear. We say it is moderate.

Weak: as the amount of scatter increases further, the pattern becomes even less clear. We say it is weak.

Example 1
Classify the association and form

Example 2
Classify association and direction

2F Strength of a linear relationship: the correlation coefficient

Pearson's Correlation Coefficient
• To measure the strength of a linear relationship, Pearson's correlation coefficient, r, is used.
• r has a value between −1 and +1

Use r only if:
1. the variables are numeric
2. the association is linear
3. there are no outliers in the data

Example 1

Classify the strength of each of the following linear associations using the previous table:
a. $r = 0.35$
b. $r = -0.507$
c. $r = 0.992$
d. $r = -0.159$

CAS TIPS
• Add list and spreadsheets
• Menu, stats, stats calculations,
• Look for r in the list of statistics.

2G The coefficient of determination

Coefficient of Determination

Degree of Prediction
The degree to which one variable can be predicted from another linearly related variable is given by a statistic called the coefficient of determination.

Calculating $r^2$
By squaring r, expressed as a %.

Interpreting
The coefficient of determination (as a percentage) tells us the percentage of the variation in the response variable that is explained by the variation in the explanatory variable.

Example 1

If the correlation between weight and height is $r = 0.8$, find the value of the coefficient of determination.
Express your answer as a percentage.

Example 2

It is found the coefficient of determination between height and weight to be 0.64 (or 64%). Interpret this
value in terms of the variables weight and height.

Example 3

The level of carbon monoxide (CO) in the air measured at the roadside, and the traffic volume at the same
location are linearly related, with $r = +0.985$. Determine the value of the coefficient of determination,
write it in percentage terms and interpret. In this relationship, traffic volume is the explanatory variable.

Example 4

Scores on tests of verbal and mathematical ability are linearly related with correlation coefficient
$r = +0.275$. Determine the value of the coefficient of determination, write it in percentage terms and
interpret. In this relationship, verbal ability is the explanatory variable.

Example 5

For the relationship described by this scatterplot, the coefficient of determination = 0.5210. Determine the
value of the correlation coefficient, r, rounded to four decimal places.
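
Moving between r and the coefficient of determination is just squaring and taking a square root, as the Python sketch below shows for the values in Examples 1, 3, 4 and 5. The function names are made up for this illustration. Note that the square root alone cannot give the sign of r: the direction has to be read from the scatterplot, which is not reproduced here, so it is passed in as an assumption.

```python
import math


def coefficient_of_determination(r):
    """Square the correlation coefficient."""
    return r ** 2


def correlation_from_r_squared(r_squared, direction):
    """Recover r from r squared.

    direction is +1 for a positive association and -1 for a negative one;
    it must come from the scatterplot.
    """
    return math.copysign(math.sqrt(r_squared), direction)


# Examples 1, 3 and 4: r to r squared, as a percentage.
for r in (0.8, 0.985, 0.275):
    print(f"r = {r}: r^2 = {coefficient_of_determination(r) * 100:.1f}%")

# Example 5: r squared to r, ASSUMING the scatterplot shows a positive association.
r_if_positive = round(correlation_from_r_squared(0.5210, +1), 4)
print(r_if_positive)
```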

2H Correlation and causality
Correlation does not imply causality.

• A correlation tells you about the strength of the association between the variables, but no more. It tells you
nothing about the source or cause of the association.
• Common response, Confounding factors, coincidence are all possible and often further investigation needs to
be undertaken.

2I Which graph?

Chapter 3: Investigating and modelling linear associations

Linear regression
• The process of modelling an association with a straight line is known as linear regression.
• The resulting line is often called the regression line.
• The equation is $y = a + bx$
• $y$ – the variable represented on the y axis
• $x$ – the variable represented on the x axis
• $a$ and $b$ are constants.
• $a$ is the y-intercept. The intercept ($a$) estimates the average value of the response variable ($y$) when the explanatory variable ($x$) equals 0.
• $b$ is the slope. The slope ($b$) estimates the average change (increase/decrease) in the response variable ($y$) for each one-unit increase in the explanatory variable ($x$).
• Pearson's correlation coefficient determines the strength of the association.
• $r^2$, the coefficient of determination, gives the percentage of the variation in the response variable explained by the variation in the explanatory variable.
• The least squares method assumes that the variables are linearly related.
• It works best when there are no clear outliers in the data.
• Vertical distances away from the fitted line are known as residuals.

Data points
• above the line = positive residual,
• below the line = negative residual
• on the line = zero residual.

Assumptions
• the data is numerical
• the association is linear
• there are no clear outliers.

Finding the equation

Finding the equation by hand
• Equation for a straight line: $y = a + bx$
• The slope: $b = \dfrac{r s_y}{s_x}$
• The y-intercept: $a = \bar{y} - b\bar{x}$
• Pearson's coefficient: $r = \dfrac{b s_x}{s_y}$

Finding the equation using the CAS
• Add list and spreadsheets
• Enter data into Columns A and B
• Find the statistics
• Add Data and statistics
• Click to add the variables
• Create the scatterplot
• Press Menu, Analyse, Regression, $a + bx$

Example 1
The height and weight of 11 people have been recorded, and the values of the following statistics determined.
Calculate the values of the slope and intercept rounded to two significant figures.

                           height      weight
mean                       173.3 cm    65.45 kg
standard deviation         7.444 cm    7.594 kg
correlation coefficient    r = 0.8502
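
The by-hand formulas for the slope and intercept can be applied to these statistics as sketched below. The example does not say which variable is explanatory; the sketch ASSUMES height is the explanatory variable (x) and weight is the response variable (y). The helper `round_sig` is a made-up name for this illustration.

```python
def round_sig(value, figures):
    """Round value to the given number of significant figures."""
    return float(f"{value:.{figures}g}")


# Assumption: height is the EV (x) and weight is the RV (y).
mean_x, s_x = 173.3, 7.444    # height (cm)
mean_y, s_y = 65.45, 7.594    # weight (kg)
r = 0.8502

b = r * s_y / s_x             # slope: b = r * s_y / s_x
a = mean_y - b * mean_x       # intercept: a = y-bar - b * x-bar

print(f"slope b = {round_sig(b, 2)}, intercept a = {round_sig(a, 2)}")
```

The unrounded slope is used to find the intercept, and rounding is left to the final step.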

Example 2
Use the following information to find the value of the correlation coefficient r, rounded to three significant figures.
                          hours studied    exam score
mean                      5.87             68.3
standard deviation        1.34             5.42
least squares equation    exam score = 52. + 2.45 × hours studied
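
Rearranging the slope formula gives r from the slope and the two standard deviations. A minimal sketch for Example 2, where hours studied is x and exam score is y (as given by the least squares equation):

```python
b = 2.45      # slope from the least squares equation
s_x = 1.34    # standard deviation of hours studied (EV)
s_y = 5.42    # standard deviation of exam score (RV)

# r = b * s_x / s_y, rounded to three significant figures.
r = float(f"{b * s_x / s_y:.3g}")
print(r)
```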

3B Using the least squares regression line to model a relationship
We wish to investigate the nature of the association between the price of a second-hand car and its age.
The ultimate aim is to find a mathematical model that will enable the price of a second-hand car to be
predicted from its age.

Interpreting mathematical models

Regression equation
𝑝𝑟𝑖𝑐𝑒 = 35100 − 3940 × 𝑎𝑔𝑒
y-variable = price
x-variable = age
a = 35100
b = −3940
r = −0.9643
$r^2$ = 0.9299

Interpretation:

There is a strong negative linear association between the price of a car and its age, as Pearson's
correlation coefficient, r, is −0.9643. The y-intercept of 35100 predicts the price of the car at age 0,
that is, when it was purchased new. The slope of −3940 means that, on average, the car loses $3940 of its
value for every year it ages. The coefficient of determination, $r^2$, at 0.9299 indicates that 92.99% of the
variation in the price can be explained by the variation in the age. The remaining 7.01% can be explained
by other factors that could determine the price of the car apart from age. These could be, for example,
the model of the car, the colour of the car, or whether the car has been involved in an accident, to name
a few. To determine this, other sets of data have to be collected and further analysis needs to take place.

Predicting using the equation

As a general rule, a regression equation only applies to the range of values of the explanatory variable used to
determine the equation.

Predicting within the range of values of the explanatory variable is called interpolation. Interpolation is generally
considered to give a reliable prediction.

Predicting outside the range of values of the explanatory variable is called extrapolation. Extrapolation is generally
considered to give an unreliable prediction.

Example 1

The equation of a regression line that enables the price of a second-hand car
to be predicted from its age is: 𝑝𝑟𝑖𝑐𝑒 = 35100 − 3940 × 𝑎𝑔𝑒.

a. Use this equation to predict the price of a car that is 5.5 years old.
State the reliability of the prediction.
b. Use this equation to predict the price of a car that is 2 years old.
State the reliability of the prediction.
c. Use this equation to predict the price of a car that is 10 years old.
State the reliability of the prediction.
d. Use this equation to predict the price of a car that is 12 years old. State the reliability of the prediction.
e. Find the coefficient of determination and explain its meaning.
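
The four predictions in parts a to d come from substituting each age into the regression equation, as in the sketch below. The function name is made up for this illustration. Judging reliability still needs the range of ages in the original data, which is not given in these notes, so the sketch only flags predictions that are clearly implausible.

```python
def predict_price(age):
    """Predicted price from the regression equation price = 35100 - 3940 * age."""
    return 35100 - 3940 * age


predictions = {age: predict_price(age) for age in (5.5, 2, 10, 12)}

for age, price in predictions.items():
    note = " (negative price: not a sensible prediction)" if price < 0 else ""
    print(f"age {age} years: predicted price ${price:,.0f}{note}")
```

A negative predicted price is a sign that the age lies well outside the data used to fit the line, which is the situation described above as extrapolation.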

Residual plot-further analysis
A residual plot is a graph of the residuals (plotted on the vertical axis) against the explanatory variable (plotted on
the horizontal axis), where:

𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍 𝒗𝒂𝒍𝒖𝒆 = 𝒀(𝒂𝒄𝒕𝒖𝒂𝒍 𝒅𝒂𝒕𝒂 𝒗𝒂𝒍𝒖𝒆) − 𝒀(𝒑𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅 𝒅𝒂𝒕𝒂 𝒗𝒂𝒍𝒖𝒆)

Remember, the residual value informs us of the distance of individual data values away from the regression line.
However, sometimes the scatterplot is not sensitive enough to reveal the non-linear structure of a relationship. To
gain more information we need to investigate the fit of the regression line to the data, and we do this using a
residual plot. The residual plot is used to check the linearity assumption required for a linear regression.

Association is linear: the relationship is linear as the residual plot is randomly scattered.

Association is non-linear: the relationship is non-linear as the residual plot shows a systematic
pattern or is curved.

Example 1 continued

Calculating a residual

The actual price of the 6-year-old car is $6500

Calculate the residual when its price is predicted using the regression equation and mark it on the graph.

𝑝𝑟𝑖𝑐𝑒 = 35100 − 3940 × 𝑎𝑔𝑒

Steps

1. Write the equation.
2. Substitute the given actual value for x (the age) into the equation.
3. Find y. This is the predicted value.
4. Write down the formula for calculating the residual.
5. Calculate the difference between the actual value and the predicted value.
6. Locate the residual on the residual plot.
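
As a check on the steps above, the residual for the 6-year-old car can be worked through in Python. This is only a sketch of the arithmetic; both numbers come from the example.

```python
actual_price = 6500  # actual price of the 6-year-old car (from the example)
age = 6

# Steps 1 to 3: substitute the age into the regression equation.
predicted_price = 35100 - 3940 * age

# Steps 4 and 5: residual = actual value - predicted value.
residual = actual_price - predicted_price

print(predicted_price)
print(residual)
```

The residual is negative, so the actual data point lies below the regression line.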

Example 2

Which of the following residual plots would call into question the assumption of linearity in a regression analysis?
Give reasons for your answers.

Example 3

Construct a report to describe the association between the price and age of second-hand cars.

• Association
• Strength
• Form
• Direction
• r and r2
• Equation
• Slope
• Y-intercept
• Residual Plot
• Conclusion

Example 4

The table below shows the scores obtained by nine students on two tests. We want to be able to predict test B
scores from test A scores. Perform a regression analysis.

Test A score (x) 18 15 9 12 11 19 11 14 16

Test B score (y) 15 17 11 10 13 17 11 15 19
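
CAS results for this example can be cross-checked with a minimal Python sketch of the least squares calculation. The data are the nine pairs of scores in the table; the function name `least_squares` is just an illustrative choice.

```python
from math import sqrt

test_a = [18, 15, 9, 12, 11, 19, 11, 14, 16]   # explanatory variable (x)
test_b = [15, 17, 11, 10, 13, 17, 11, 15, 19]  # response variable (y)


def least_squares(x, y):
    """Return (intercept, slope, r) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


a, b, r = least_squares(test_a, test_b)

# Residual = actual test B score - predicted test B score.
residuals = [yi - (a + b * xi) for xi, yi in zip(test_a, test_b)]

print(f"test B = {a:.2f} + {b:.2f} x test A")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print([round(res, 2) for res in residuals])
```

Plotting `residuals` against `test_a` gives the residual plot used to check the linearity assumption.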

CHAPTER 4 DATA TRANSFORMATIONS
Data transformation is a process through which non-linear data is linearised. The circle of transformations is an
easy way to determine which transformations to try, depending on the shape of the curve. The possible
transformations of the y or x variable are:

Squared Reciprocal Logarithmic

Once the shape of the curve is identified, each of the possible transformations is applied and its r² value is found.
Based on the highest r² and the residual plot, a transformation is chosen, the regression equation is written in
terms of the transformed variable, the result is compared with the original linear model, and predictions are made.

CAS TIPS Data transformations

• Add List and Spreadsheets
• Enter x and y into columns A and B.
• Label columns as x and y
For an x or y squared transformation
• Name column C as ‘xsqu’ or ‘ysqu’
• Directly below type ‘=x^2’ or ‘=y^2’ into the template
For a reciprocal x or y transformation
• Name column C as ‘recx’ or ‘recy’
• Directly below type ‘=1/x’ or ‘=1/y’ into the template
For a log x or y transformation
• Name column C as ‘logarx’ or ‘logary’
• Directly below type ‘=log10(x)’ or ‘=log10(y)’ into the template
• Add Data and Statistics
• Construct a scatterplot, as required, of
o y against xsqu, y against recx, y against logarx, etc.
• Find the equation and find the r² values
• Make predictions
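
The same workflow can be sketched in Python: build the transformed columns, find r² for each, and compare. The data below are hypothetical values made up purely for illustration (they are not from these notes), and the dictionary labels simply mirror the CAS column names above.

```python
from math import log10

# Hypothetical data for illustration only (not from these notes).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 9, 19, 33, 51, 73, 99, 129]


def r_squared(u, v):
    """Coefficient of determination for a least squares line of v on u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    svv = sum((vi - mv) ** 2 for vi in v)
    return suv ** 2 / (suu * svv)


# The same three x transformations as the CAS columns xsqu, recx and logarx.
candidates = {
    "x (untransformed)": x,
    "xsqu = x^2": [xi ** 2 for xi in x],
    "recx = 1/x": [1 / xi for xi in x],
    "logarx = log10(x)": [log10(xi) for xi in x],
}

results = {name: r_squared(values, y) for name, values in candidates.items()}
for name, r2 in results.items():
    print(f"{name:20s} r^2 = {r2:.4f}")

best = max(results, key=results.get)
print("Highest r^2:", best)
```

The highest r² only suggests a candidate; the residual plot for the chosen transformation should still be checked before it is used for predictions.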

4A The squared transformation
Example 1

A base jumper leaps from the top of a cliff, 1560 metres above the valley
floor. The scatterplot below shows the height (in metres) of the base
jumper above the valley floor every second, for the first 10 seconds of the
jump.

A scatterplot shows that there is a strong negative association between the
height of the base jumper above the ground and time.

Apply a squared transformation to the variable time, and determine the
least squares regression line for the transformed data.

Use the least squares equation to predict, to the nearest metre, the
height of the base jumper after 3.4 seconds.

Example 2

In a study of the effectiveness of fertiliser on the yield of strawberry
plants, differing amounts of liquid fertiliser (in mL) were given to groups of
plants, and their average yield (in kg) measured.

A scatterplot shows that there is a strong positive association between the
fertiliser and yield.

Apply a squared transformation to the variable yield, and determine the least
squares regression line for the transformed data.

Use the least squares equation to predict the yield of a plant given
6.5 mL of fertiliser, giving your answer to 1 decimal place.

4B Logarithmic transformation

Example 1

The general wealth of a country, often measured by its Gross Domestic
Product (GDP), is one of several variables associated with lifespan in
different countries. However, the association is not linear, as can be
seen in the scatterplot below which plots lifespan (in years) against GDP
per person (in dollars) for 13 different countries.

The scatterplot shows that there is a strong positive association
between the lifespan and GDP.

Apply a log transformation to the variable GDP, and determine the least
squares regression line for the transformed data.

Use the least squares equation to predict the lifespan of a country with a
GDP of $20 000 per person, giving your answer rounded to one decimal place.

Example 2

The number of cases of a very infectious disease was recorded over a 12-day
period. The association is not linear, as can be seen in the scatterplot
below which plots cases against days.

The scatterplot shows that there is a strong positive association between the
number of cases and the day.

Apply a log transformation to the variable cases, and determine the least
squares regression line for the transformed data.

Use the least squares equation to predict the cases on day 13.

4C The reciprocal transformation

Example 1

After embarking on a new healthy eating and exercise plan, Ben recorded
his weekly weight loss over a 10-week period. The association is not linear, as can
be seen in the scatterplot below which plots weekly weight loss in kg
against length of diet in weeks.

The scatterplot shows that there is a strong negative association between
weekly weight loss and length of diet.

Apply a reciprocal transformation to the variable length of diet, and
determine the least squares regression line for the transformed data.

Use the least squares equation to predict the weekly weight
loss in week 11, giving your answer to one decimal place.

Example 2

A homeware company makes rectangular sticky labels with a variety of
lengths and widths. The scatterplot opposite displays the width (in cm)
and length (in cm) of eight of the sticky labels. The scatterplot shows
that there is a strong negative association between the width of the
sticky labels and their lengths, but it is clearly non-linear.

Apply a reciprocal transformation to the variable width, and determine
the least squares regression line for the transformed data.

Use the least squares equation to predict the width of a
sticky label which is 5 cm long, giving your answer to two decimal places.
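
When the response variable is the one that has been transformed (yield² in 4A Example 2, log(cases) in 4B Example 2, 1/width in 4C Example 2), the regression equation predicts the transformed value, and the transformation has to be undone to finish the prediction. A minimal Python sketch of that last step follows; the function name `back_transform` and the coefficients in the usage lines are hypothetical, chosen only to show the mechanics.

```python
from math import sqrt


def back_transform(transformed_value, kind):
    """Undo a transformation applied to the response variable.

    kind is "squared" (model predicts y^2), "log" (model predicts log10(y))
    or "reciprocal" (model predicts 1/y).
    """
    if kind == "squared":
        return sqrt(transformed_value)
    if kind == "log":
        return 10 ** transformed_value
    if kind == "reciprocal":
        return 1 / transformed_value
    raise ValueError(f"unknown transformation: {kind}")


# Hypothetical coefficients, for illustration only: log10(cases) = 0.5 + 0.1 * day
log_cases = 0.5 + 0.1 * 13
print(round(back_transform(log_cases, "log")))
```

When the explanatory variable is transformed instead (time² or log(GDP), for example), no back-transformation is needed: the transformed x value is substituted and the equation gives y directly.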

Example 3

The scatterplot shows the age (in years) and diameter at a height of
1.5 metre (in cm) for a sample of 19 trees of the same species. Use an
appropriate transformation to find a regression model which allows
the age of this species of tree to be predicted from its diameter.

5A Time series data
Key features of time series graphs
The features we look for in a time series are:
• trend
• cycles
• seasonality
• structural change
• possible outliers
• irregular (random) fluctuations.

Trend is present when there is a long-term upward or downward movement in a time series.
Cycles are present when there is a periodic movement in a time series. The period is the time it
takes for one complete up and down movement in the time series plot. In practice, this term is
reserved for periods greater than 1 year.
Seasonality is present when there is a periodic movement in a time series that has a calendar
related period, for example a year, a month or a week.
Structural change is present when there is a sudden change in the established pattern of a time
series plot.
Outliers are present when there are individual values that stand out from the data.
Irregular (random) fluctuations are always present in any real-world time series plot. They
include all the unexplained variations in a time series.

Exercise 1
Maximum temperature was recorded each day for a week in a certain town. Construct a time series
plot of the data.

Day M T W T F S S
Temp (∘C) 20 21 25 36 34 25 26

Exercise 2
Identifying trends
Consider the time series plot of the Australian annual birth rates over the years from 1931 to
1990, shown below. Comment on the trend shown in the plot.

Example 3
Sunspots are darker, cooler areas on the surface of the sun. The following plot shows the sunspot
activity for the period 1945 to 2016. Comment on the cycles shown in the plot.

Example 4

The plot below shows the total percentage of hotel rooms
occupied in Australia by quarter, over the years 2012–2016.
Comment on the seasonality shown in the plot.

Example 5

The time series plot below shows the power bill for a rental house
(in kWh) for the 12 months of a year. Comment on any structural
change in the plot.

Example 6

The time series plot below shows the daily power bill for a house
(in kWh) for a fortnight. Comment on any outliers in the plot.

5B Smoothing a time series using moving means
Smoothing is a process which involves replacing individual data points with the mean of the data
point and some adjacent points. This allows trends in the data to be observed more clearly, as the
presence of irregular fluctuations, seasonality or cycles may obscure underlying trends.

Smoothing by finding the mean where n = odd

Three-moving mean: take three values, the data value being smoothed and one value on each side of
it, and find the mean.

$$\bar{y} = \frac{y_1 + y_2 + y_3}{3}$$

Five-moving mean: take five values, the data value being smoothed and two values on each side of
it, and find the mean.

$$\bar{y} = \frac{y_1 + y_2 + y_3 + y_4 + y_5}{5}$$

Example 1
Three- and five-moving mean smoothing of a time series
The following table gives the number of births per month over a calendar year in a country
hospital. Use the three-moving mean and the five-moving mean methods, rounded to one decimal place,
to complete the table.

Month J F M A M J J A S O N D
Raw
3-mean
5-mean
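
A minimal Python sketch of three- and five-moving mean smoothing follows. It uses the monthly births figures tabulated later in section 5B (10, 12, 6, ...); treating those figures as the data for Example 1 is an assumption, and the function name `moving_mean` is only illustrative.

```python
births = [10, 12, 6, 5, 22, 18, 13, 7, 9, 10, 8, 5]  # J to D, as tabulated in these notes


def moving_mean(data, n):
    """n-moving mean for odd n; positions without enough neighbours are None."""
    half = n // 2
    smoothed = []
    for i in range(len(data)):
        if i < half or i + half >= len(data):
            smoothed.append(None)
        else:
            window = data[i - half : i + half + 1]
            smoothed.append(round(sum(window) / n, 1))
    return smoothed


three_mean = moving_mean(births, 3)
five_mean = moving_mean(births, 5)
print(three_mean)
print(five_mean)
```

The `None` entries mark the months at each end of the year that cannot be smoothed because there are not enough values on one side.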

Example 2
The table below gives the temperature (∘C) recorded at a weather
station at 9.00 a.m. each day for a week. Calculate the three- and
five-moving mean smoothed temperature for Tuesday.

Day M T W T F S S
Raw
3-mean
5-mean

Smoothing by finding the mean where n = even

Two-moving mean
1. Locate the data value to be smoothed.
2. Identify 3 values: the data value and one value on each side of it.
3. Find the mean of the first two values.
4. Find the mean of the last two values.
5. Find the mean of the two answers.

Four-moving mean
1. Locate the data value to be smoothed.
2. Identify 5 values: the data value and two values on each side of it.
3. Find the mean of the first four values.
4. Find the mean of the last four values.
5. Find the mean of the two answers.

Six-moving mean
1. Locate the data value to be smoothed.
2. Identify 7 values: the data value and three values on each side of it.
3. Find the mean of the first six values.
4. Find the mean of the last six values.
5. Find the mean of the two answers.

Example 3
Two-moving mean smoothing with centring
The temperatures (∘C) recorded at a weather station at 9 a.m. each day for a week are displayed in
the table. Calculate the two-moving mean smoothed temperature with centring for Tuesday.

Example 4
Four- and six-moving mean smoothing with centring
The table below gives the temperature (°C) recorded at a weather station at 9.00 a.m. each day for
a week. Calculate the four- and six-moving mean smoothed temperature with centring for Thursday.

Month J F M A M J J A S O N D
Births 10 12 6 5 22 18 13 7 9 10 8 5
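
The centring steps for an even number of points can be sketched in Python as follows. The births figures from the table above are used as sample data, since the temperature tables are not reproduced in these notes; `centred_moving_mean` and its `index` argument (counting from 0) are illustrative names.

```python
births = [10, 12, 6, 5, 22, 18, 13, 7, 9, 10, 8, 5]  # J to D, from the table above


def centred_moving_mean(data, index, n):
    """n-moving mean with centring (even n) for the value at position index."""
    half = n // 2
    if index - half < 0 or index + half >= len(data):
        raise ValueError("not enough values on each side to centre")
    first = data[index - half : index + half]         # first n of the n + 1 values
    last = data[index - half + 1 : index + half + 1]  # last n of the n + 1 values
    return (sum(first) / n + sum(last) / n) / 2


print(centred_moving_mean(births, 1, 2))  # February, two-moving mean
print(centred_moving_mean(births, 2, 4))  # March, four-moving mean
print(centred_moving_mean(births, 3, 6))  # April, six-moving mean
```

For March, for example, the two four-point means are 8.25 and 11.25, and centring averages them to give 9.75.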

5C Smoothing a time series plot using moving medians

Smoothing with medians is another technique used to smooth the data by removing fluctuations or
aspects of seasonality, cycles or outliers.

Step 1
Identify the first three data values.
Step 2
Identify the middle data point moving in the x-direction. Draw a vertical line through this value
as shown.
Step 3
Identify the middle data point moving in the y-direction. Draw a horizontal line through this
value as shown.
Step 4
The median value is where the two lines intersect, in this case at the point (3, 3).
Mark this point with a cross (×).

Note: the same method applies for 5 and 7 moving median smoothing.

Construct a three-median smoothed plot of this time series plot.

Construct a five-median smoothed plot of this time series plot.

Construct a seven-median smoothed plot of this time series plot.
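
The graphical steps above pick out the median of the y-values in each group of points. When the data values are known, the same smoothing can be done numerically. The sketch below is one way to do this in Python, using the monthly births figures from section 5B as sample data because the plots for this section are not reproduced here.

```python
from statistics import median

births = [10, 12, 6, 5, 22, 18, 13, 7, 9, 10, 8, 5]  # monthly births from section 5B


def moving_median(data, n):
    """n-moving median for odd n; end positions without enough neighbours are None."""
    half = n // 2
    return [
        median(data[i - half : i + half + 1]) if half <= i < len(data) - half else None
        for i in range(len(data))
    ]


three_median = moving_median(births, 3)
five_median = moving_median(births, 5)
print(three_median)
print(five_median)
```

Unlike the moving mean, the moving median is not pulled upwards by the single large value of 22 in May.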

5D Seasonal indices

• Remember, seasonality is a characteristic of a time series in which the data experiences regular
and predictable fluctuations or patterns that recur every calendar year.
• When the data is considered to have a seasonal component, it has noticeable peaks and troughs,
and it is often necessary to remove these so any underlying trend is clearer. The seasonal index
measures the size of the seasonal component by comparing a season with the seasonal average.
• The process of removing the seasonal component is called deseasonalising the data, or seasonally
adjusting the data, for the purpose of clarity in identifying trends.
• We can use seasonal indices to remove or add the seasonal component, that is, to deseasonalise or
reseasonalise.
• To deseasonalise the data we need to calculate seasonal indices, which tell us how a particular
season (generally a day, month or quarter) compares to the average season.
• Seasonal indices are calculated so that their average is 1 or 100%. This means that the sum of
the seasonal indices equals the number of seasons. E.g. if the seasons are months, the seasonal
indices add to 12. If the seasons are quarters, then the seasonal indices add to 4.
• A seasonal index of 1.2 or 120% means 20% higher than the average. A seasonal index of 0.90 or
90% means 10% lower than the average.
• Once the data is deseasonalised, a trend line can be fitted and used to make predictions or
forecasts, and a regression equation can be calculated. When using deseasonalised data to fit a
trend line, you must remember that the result of any prediction is a deseasonalised value. To be
meaningful, this result must then be reseasonalised by multiplying by the appropriate seasonal
index.
• Seasonal index: $si = \dfrac{\text{value for the season}}{\text{seasonal average}}$
• Deseasonalising data: $dv = \dfrac{\text{actual value}}{\text{seasonal index}}$
• Re-seasonalising data: $\text{actual value} = dv \times si$
• The percentage change required to deseasonalise is calculated by $\% = \dfrac{100}{si} - 100$

Example 1
Interpreting seasonal indices
Suppose that the seasonal indices (SI) for electricity usage in Esse's home are as shown in the
table:
What does the seasonal index for Winter tell us?
What does the seasonal index for Spring tell us?

Example 2
Using seasonal indices
The seasonal indices (SI) for cold drink sales for Imogen's kiosk are as shown in the table. If
the actual cold drink sales last summer totalled $21 653, what is the deseasonalised sales figure
for that time period? If the deseasonalised cold drink sales last spring totalled $10 870, what
were the actual sales for that time period?

Example 3
Using seasonal indices to determine percentage change required to correct for seasonality
Consider the table below which gives the seasonal indices for heater sales at a discount store. By
what percentage should the sales in summer and winter be increased or decreased in order to
de-seasonalise the data? Give your answer as a percentage rounded to one decimal place.
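
The three formulas above can be sketched as small Python functions. The seasonal indices are the ones quoted for Mikki's shop in Example 6 of this section; the sales figure of 580 is a hypothetical value used only to show the round trip, and the function names are illustrative.

```python
# Seasonal indices quoted for Mikki's shop later in these notes.
seasonal_index = {"Summer": 1.16, "Autumn": 0.94, "Winter": 1.26, "Spring": 0.64}


def deseasonalise(actual_value, si):
    """Deseasonalised value = actual value / seasonal index."""
    return actual_value / si


def reseasonalise(deseasonalised_value, si):
    """Actual value = deseasonalised value x seasonal index."""
    return deseasonalised_value * si


def percentage_change(si):
    """Percentage change needed to deseasonalise: 100 / SI - 100."""
    return 100 / si - 100


# The indices should add to the number of seasons (4 quarters here).
print(round(sum(seasonal_index.values()), 2))

for season in ("Summer", "Winter"):
    print(season, round(percentage_change(seasonal_index[season]), 1))

# Hypothetical sales figure of 580, used only to show the round trip.
dv = deseasonalise(580, seasonal_index["Summer"])
print(round(dv), round(reseasonalise(dv, seasonal_index["Summer"])))
```

A negative percentage change means the actual figures must be decreased to deseasonalise them, which is what happens whenever the seasonal index is greater than 1.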

Calculating SI

SI for one year's data
1. Find the seasonal average for the year.
2. Calculate SI = value for the season / seasonal average.

SI for several years' data
1. Find the seasonal average for each year.
2. Calculate the SI for each season in each year.
3. Calculate the average of the SIs for each season.
4. Present the SIs in a table.

Example 4
Calculating seasonal indices (1 year's data)
Mikki runs a shop and she wishes to determine quarterly seasonal indices for the number of
customers to her shop based on last year's figures, which are shown in the table opposite.
Calculate the SI for each season.
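
The calculation steps can be sketched in Python as below. The tables for Examples 4 and 5 are not reproduced in these notes, so the quarterly figures here are hypothetical and only illustrate the method. The several-years version follows the reading that indices are found for each year first and then averaged season by season.

```python
def seasonal_indices_one_year(values):
    """SI for each season = value for the season / seasonal average for that year."""
    seasonal_average = sum(values) / len(values)
    return [v / seasonal_average for v in values]


def seasonal_indices_several_years(years):
    """Average, season by season, the indices calculated for each year."""
    yearly = [seasonal_indices_one_year(values) for values in years]
    return [round(sum(season) / len(season), 2) for season in zip(*yearly)]


# Hypothetical quarterly customer numbers, for illustration only.
year_1 = [1200, 1000, 1300, 500]
year_2 = [1100, 900, 1400, 600]

one_year = seasonal_indices_one_year(year_1)
several_years = seasonal_indices_several_years([year_1, year_2])
print(one_year)
print(several_years)
```

In both cases the indices should add to 4, the number of quarters, which is a quick check on the arithmetic.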

Example 5
Calculating seasonal indices (several years' data)
Suppose that Mikki has 3 years of data, as shown.
Calculate seasonal indices, rounded to two decimal
places.

Example 6
De-seasonalising a time series
The quarterly sales figures for Mikki's shop over a 3-year
period are as shown. Use the seasonal indices shown to
de-seasonalise these sales figures. Write answers rounded to the nearest
whole number.

Summer Autumn Winter Spring
1.16 0.94 1.26 0.64

5E Fitting a trend line and forecasting

Example 1
The table below shows the number of female students in Victoria enrolled in at least one subject in the
Mathematics learning area at year 12 over the period 2010–18.

a. Fit a trend line to the data and interpret the slope.


b. How many female students in Victoria do we predict being enrolled in at least one subject in the
Mathematics learning area at year 12 in 2026 if the same increasing trend continues? Give your answer
rounded to the nearest whole number.

Example 2
Fitting a trend line (seasonality)
The deseasonalised quarterly sales data from Mikki's shop are shown below.

a. Fit a trend line and interpret the slope.


b. What sales do we predict for Mikki's shop in the winter of year 4? (Because many items have to be ordered
well in advance, retailers often need to make such decisions.)
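
Forecasting with seasonal data can be sketched in Python as a three-stage process: fit a trend line to the deseasonalised values, predict, then reseasonalise. The table for Example 2 is not reproduced in these notes, so the deseasonalised sales below are hypothetical. The winter index of 1.26 is the one quoted in section 5D, and numbering winter of year 4 as quarter 15 assumes that each year runs Summer, Autumn, Winter, Spring with summer of year 1 as quarter 1.

```python
# Hypothetical deseasonalised quarterly sales for 3 years (quarters 1 to 12),
# for illustration only; the table for Example 2 is not reproduced in these notes.
quarters = list(range(1, 13))
deseasonalised = [1012, 1018, 1028, 1042, 1050, 1060, 1070, 1080, 1088, 1102, 1112, 1118]

# Least squares trend line: deseasonalised sales = intercept + slope * quarter.
n = len(quarters)
mean_t = sum(quarters) / n
mean_y = sum(deseasonalised) / n
sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(quarters, deseasonalised))
stt = sum((t - mean_t) ** 2 for t in quarters)
slope = sty / stt
intercept = mean_y - slope * mean_t

# Winter of year 4 is quarter 15, assuming year 1 starts with summer as quarter 1.
winter_index = 1.26  # winter seasonal index quoted for Mikki's shop in section 5D
forecast_deseasonalised = intercept + slope * 15
forecast_actual = forecast_deseasonalised * winter_index  # reseasonalise

print(round(slope, 1), round(intercept, 1))
print(round(forecast_deseasonalised), round(forecast_actual))
```

The slope is interpreted as the average change in deseasonalised sales per quarter. Quarter 15 is beyond the 12 quarters of data, so the forecast is an extrapolation and should be treated with caution.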

End of Notes
