s1 s2 Notes by STDGPK
Mathematical Modelling
A model is a simplification of the real thing. It will be both quicker and cheaper to produce
than the real one and will help us to understand the real world object or situation.
A statistical experiment is a test, investigation or some process adopted for collecting data to
provide evidence for or against a hypothesis.
A disadvantage is that a model does not replicate real-world situations in every detail.
www.studyguide.pk
Collecting Data
The way data are collected matters: a method must be chosen that avoids bias.
One source of bias is using data from responses to questions as people may lie about personal
questions such as age and weight.
Another source of bias is using data that do not properly apply to the problem, e.g. using
published unemployment figures to investigate the number of people looking for work: the
figures exclude students, people past retirement age, etc., but may include people who are
not looking for work.
Types of Data
Qualitative Data
These are non-numerical values such as attitudes, gender, colour, football shirt number
Quantitative Data
These data have valid numerical values such as shoe size, number of broken eggs, height,
time
● Discrete data come from variables which can only take particular values such as shoe
size.
● Continuous data come from variables which can take any value within a given range.
Summarising Data
The reason that a sample is taken is to make deductions about the population.
Graphical and numerical summaries are essential in order to help us analyse the data
collected.
The purpose of these summaries is to condense the data to reveal patterns and to enable
comparisons to be made. Summarising can lead to a loss of accuracy.
A stem and leaf diagram gives a quick visual impression of the shape of a distribution. Both
integers and decimals can be represented, though the data are usually given to 2 significant
figures. It may be necessary to round the data to meet this constraint.
If a large number of leaves are associated with one line then it is usual to use two lines. We
can also improve our diagrams by showing the number of leaves on each stem in brackets.
If direct comparison of two data sets is required, a back-to-back stem and leaf diagram can be
drawn.
Grouping makes the information more concise, but the original values are lost.
It allows summaries and estimates to be made.
Both continuous data and discrete data can be grouped. The class boundaries must meet with
no gaps, even if this results in a negative starting point.
Age is a special case: the boundaries are matched to completed years, i.e. the classes 21-24
and 25-28 actually cover 21-25 and 25-29.
To draw a cumulative frequency curve, we plot the upper class boundary (ucb) of each interval
against its cumulative frequency (cf) and join the points with a smooth curve. For a cumulative
frequency polygon, we join the points with straight lines instead.
Histograms
If the data available is for a continuous variable and it is summarised by a grouped frequency
distribution, then the data can be represented by means of a histogram.
There are no gaps between the bars of a histogram. Thus boundaries must be matched.
The area of each histogram bar is proportional to the frequency that it represents, so when
class widths differ the bar heights are frequency densities (frequency ÷ class width).
There are times when it is useful to draw a histogram based on relative frequencies rather
than frequencies. The relative frequencies are obtained by expressing the frequencies as a
proportion of the total frequency.
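As a rough illustration of the area and relative-frequency ideas above, the sketch below computes bar heights for a histogram with unequal class widths. The class boundaries and frequencies are made up for the example, and the function names are illustrative only.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]

def frequency_densities(classes):
    """Bar heights chosen so that bar area (height x width) equals frequency."""
    return [f / (upper - lower) for lower, upper, f in classes]

def relative_frequencies(classes):
    """Each frequency as a proportion of the total frequency."""
    total = sum(f for _, _, f in classes)
    return [f / total for _, _, f in classes]

densities = frequency_densities(classes)
proportions = relative_frequencies(classes)

for (lower, upper, f), d, p in zip(classes, densities, proportions):
    print(f"{lower}-{upper}: frequency {f}, density {d:.2f}, relative {p:.2f}")
```

The widest class (20-40) gets the lowest bar even though its frequency is not the smallest, which is why heights alone should not be read as frequencies.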
Methods of Dispersion
These are used to represent the spread or variation within the data since it is unlikely that all
the values in a data set will be the same.
Measures of Location
The Mode
The mode is the value that occurs most often. It is not always unique (can be bi-modal) and
there may not be a mode. In the case of grouped frequencies, the mode is not always useful,
but there are ways to estimate the mode using a histogram. Usually, the modal class would be
sufficient.
The Median
The median is the middle value of an ordered set of data. If there are n observations arranged
in order of size, the median is the $\frac{n+1}{2}$th observation.
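To find the median of grouped data, we use the cumulative frequency.
$$Q_2 = L + \frac{\frac{n+1}{2} - f_L}{f} \times c$$
where L is the lower boundary of the median class, $f_L$ is the cumulative frequency up to L,
f is the frequency of the median class and c is its class width.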
Other Quantiles
Other quantiles can be found using the formula above, but with (n + 1)/2 replaced by a multiple
of (n + 1)/4 for quartiles, (n + 1)/10 for deciles and (n + 1)/100 for percentiles.
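The interpolation formula can be sketched in code as below. This is only an illustration: it follows the (n + 1) convention used in these notes (some courses use n instead), and the function name and the data are invented for the example.

```python
def grouped_quantile(classes, fraction):
    """Estimate a quantile of grouped data by linear interpolation.

    classes: list of (lower boundary, upper boundary, frequency), in order.
    fraction: 0.5 for the median, 0.25 for Q1, 0.75 for Q3, and so on.
    """
    n = sum(f for _, _, f in classes)
    position = fraction * (n + 1)      # e.g. (n + 1)/2 for the median
    cumulative = 0                     # cumulative frequency below the class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= position:
            # L + (position - fL) / f * c
            return lower + (position - cumulative) / f * (upper - lower)
        cumulative += f
    return classes[-1][1]              # position beyond the data: top boundary

# Hypothetical data with n = 19.
classes = [(0, 10, 4), (10, 20, 10), (20, 30, 5)]
q1 = grouped_quantile(classes, 0.25)
median = grouped_quantile(classes, 0.5)
q3 = grouped_quantile(classes, 0.75)
print(q1, median, q3)
```

For the median here, the position is the 10th value, which lies in the class 10-20 with 4 values below it, so the estimate is 10 + (10 - 4)/10 × 10 = 16.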
Always state the appropriate values in your answer ie. Σfx, Σf, n
When given two means and their frequencies, find the two totals, add them together and
divide by the total frequency to get the combined mean (a weighted mean).
then you consider the first group as 0-10 therefore the midpoint would be 5.
Even if we have grouped frequency distributions of unequal intervals, this makes no difference
to the calculation of the mean. Remember that for grouped data, the mean is only an
estimate.
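A minimal sketch of the grouped-mean estimate, assuming each class is represented by its midpoint. The classes below are invented and deliberately have unequal widths to show that the calculation is unchanged.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) using class midpoints."""
    sum_f = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return sum_fx / sum_f

# Hypothetical classes with unequal widths: midpoints 5, 15 and 35.
classes = [(0, 10, 3), (10, 20, 5), (20, 50, 2)]
estimate = grouped_mean(classes)   # (3*5 + 5*15 + 2*35) / 10
print(estimate)
```

The result is only an estimate because the individual values inside each class are unknown.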
From this we can calculate the mean of y and decode to find the mean of x:
$\bar{x} = b\bar{y} + a$
Weighted Mean
When we wish to place greater emphasis on some of the values we use a weighted mean
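A short sketch of a weighted mean, with made-up numbers. The same function also covers combining two group means, where the weights are the group frequencies.

```python
def weighted_mean(values, weights):
    """Sum of (weight x value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical marks: exam 70 (weight 3) and coursework 55 (weight 2).
overall = weighted_mean([70, 55], [3, 2])

# Combining two groups: mean 60 from 20 people, mean 70 from 30 people.
combined = weighted_mean([60, 70], [20, 30])
print(overall, combined)
```

Note that the combined mean (66) is closer to 70 than to 60 because the second group is larger.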
Measures of Dispersion
Range
● The simplest measure of spread
● Based entirely on extreme values
Interquartile Range
● range of the middle 50%
● IQR = Q3 – Q1
● Not affected by extreme values
● If the median is the measure of location used then the IQR is the appropriate measure
of dispersion
● Often used when data has extreme values or has open-ended classes or is not
symmetrical
● Used extensively in conjunction with box plots
● Can help us identify outliers and examine the skewness of a distribution
Semi-Interquartile Range
SIQR = IQR/2
$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2$
$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$
where $\bar{x} = \frac{\sum x}{n}$
For most distributions, the bulk (roughly 95%) of the distribution lies within 2 standard deviations of the mean
The units of sd are the same as the original data
We can never get a negative variance (as its sqrt is the sd)
For similar sets of data it is useful to compare the sd's
$\sigma = \sqrt{\frac{\sum f x^2}{\sum f} - \bar{x}^2}$
We can code and decode as before, but when decoding the standard deviation you only multiply
by b; you do not need to add a, as this does not alter the spread.
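The sketch below applies the variance formula and checks the decoding rules on invented data. It assumes the coding y = (x - a)/b, so that x = by + a; the values of a and b are arbitrary choices for the example.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """(sum of x^2)/n minus the square of the mean."""
    return sum(x * x for x in xs) / len(xs) - mean(xs) ** 2

def std_dev(xs):
    return sqrt(variance(xs))

# Hypothetical data, coded with y = (x - a)/b where a = 1000 and b = 10.
a, b = 1000, 10
xs = [1010, 1020, 1030, 1050]
ys = [(x - a) / b for x in xs]     # 1, 2, 3, 5

mean_x = b * mean(ys) + a          # decoding the mean needs both b and a
sd_x = b * std_dev(ys)             # decoding the sd needs only b
print(mean_x, sd_x)
```

Working with the coded values 1, 2, 3, 5 keeps the arithmetic small, and the decoded answers agree with those found directly from the original data.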
Skewness
Symmetrical Bell-Shaped Distribution
mean=median=mode
Normal Distribution
Measures of Skewness
Pearson's Measure of Skewness
Pearson's Measure of Skewness = (mean – mode) / standard deviation
Normal Distribution
Q3 - Q2 = Q2 – Q1
Quartile skewness = 0
Box Plots
illustrates the dispersion or spread of the distributions, as well as the average (median)
it uses the highest and lowest values of the data, and the three quartiles
the box encloses the middle 50% (the IQR)
The whiskers extend to the upper and lower values (the range)
Always draw box plots on graph paper and label your axis clearly. Use a suitable scale.
Procedure
Find the value of the quartiles
Evaluate Q1 – 1.5(Q3 – Q1) and Q3 + 1.5(Q3 – Q1) and note any values that fall outside this
range
Draw a box based on the quartile values.
If there are any outliers, label them with crosses. The whisker is usually drawn to the next
value towards the median
Only calculate these outliers if the question specifically asks you to do so
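The outlier check in the procedure above can be sketched as follows. The data are invented, and the quartiles are taken as already found (here by the (n + 1)/4 position rule, which is one of several conventions).

```python
def outlier_fences(q1, q3):
    """Return Q1 - 1.5(Q3 - Q1) and Q3 + 1.5(Q3 - Q1)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(data, q1, q3):
    """Separate the values outside the fences from the rest."""
    low, high = outlier_fences(q1, q3)
    outliers = [x for x in data if x < low or x > high]
    inside = [x for x in data if low <= x <= high]
    return outliers, inside

# Hypothetical ordered data with assumed quartiles Q1 = 16.5 and Q3 = 24.5.
data = [2, 15, 17, 18, 20, 21, 23, 24, 26, 45]
outliers, inside = split_outliers(data, q1=16.5, q3=24.5)
whisker_ends = (min(inside), max(inside))   # next values towards the median
print(outliers, whisker_ends)
```

Here the fences are 4.5 and 36.5, so 2 and 45 would be marked with crosses and the whiskers drawn to 15 and 26.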
Correlation
the relationship between two variables x and y
bi-variate data
produce a bi-variate distribution
There may be a relationship but you cannot necessarily expect to find a law/formula
relating them
We initially look for basic associations
Scatter Diagrams
Bi-variate data is conveniently displayed through scatter diagrams
They help to assess correlation and regression.
We can use them to help show linear correlation
Even if we find a mathematical relationship, this does not imply that there is a relationship in
reality, or indeed that an increase in one variable causes an increase in the other.
Correlation measures the relationship and the strength of this relationship between the two
variables.
If both variables increase together we say that they are positively correlated.
If one variable increases as the other decreases we say that they are negatively correlated.
If no relationship can be seen we say there is no correlation.
When drawing scatter diagrams it doesn't matter which axis is used for which variable,
however it does when measuring regression.
If a horizontal line and a vertical line are drawn through the mean point $(\bar{x}, \bar{y})$, you can see the
association between the two variables in a different way:
For a positive correlation most points lie in the first and third quadrants (top right and bottom
left respectively)
For a negative correlation most points lie in the second and fourth quadrants (top left and
bottom right respectively)
If there is no correlation the points are randomly distributed in all four quadrants.
In practice, r should only be calculated after a scatter diagram has been drawn, and only if
the diagram reveals some degree of linear correlation. If the correlation is non-linear then
the PMCC (product moment correlation coefficient) is not appropriate.
Outliers, or rogue results, should be identified as they may upset the general trend.
Calculation
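$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
where $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$, and similarly $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$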
Coding makes the values of x and y smaller. You can subtract any number from the x values, since
this only moves the axis. You can divide the result by any positive number since this only changes
the scale. The correlation coefficient is unaffected by either of these operations.
$X = \frac{x - a}{b}$
StudyGuide.PK A-Level Maths S1 Notes Page 8
$Y = \frac{y - c}{d}$
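As an illustration, the sketch below computes r from the S-quantities and checks that coding leaves it unchanged. The data and the coding constants (a = 1000, b = 10, c = 2, d = 0.1) are invented, and b and d are assumed positive.

```python
from math import sqrt

def s_xy(xs, ys):
    """Sxy = sum(xy) - sum(x) * sum(y) / n; s_xy(xs, xs) gives Sxx."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    return s_xy(xs, ys) / sqrt(s_xy(xs, xs) * s_xy(ys, ys))

# Hypothetical bivariate data.
xs = [1010, 1020, 1030, 1040, 1050]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

# Coding: X = (x - a)/b and Y = (y - c)/d.
coded_xs = [(x - 1000) / 10 for x in xs]
coded_ys = [(y - 2) / 0.1 for y in ys]

r = pmcc(xs, ys)
r_coded = pmcc(coded_xs, coded_ys)
print(round(r, 4), round(r_coded, 4))
```

Both calculations give the same value of r (about 0.995 for this data), so the coded values can be used to keep the arithmetic manageable.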
Note:
Just because two variables have a linear correlation does not necessarily mean that they are
related. Thus, you should have some reason to believe that there might be a relationship
before calculating the PMCC, unless your aim is to prove that they are unrelated.
Data can be distorted by an outlier, so the information should be plotted on a scatter-graph
first.
Note:
A quadratic relationship can give a PMCC of 0: the variables are clearly related, but the relationship is non-linear.
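A small check of this note, using invented points that lie exactly on y = x² and are symmetric about x = 0. The PMCC function is the same standard formula used for the calculation of r.

```python
from math import sqrt

def pmcc(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy), using the summation formulae."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]      # y is completely determined by x
r = pmcc(xs, ys)
print(r)
```

Here r is 0 even though y depends entirely on x. If the x values were not symmetric about the turning point, r would generally not be 0, so the scatter diagram is still the safer guide.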
Often variables are linked only through a third variable, particularly when the changes take
place over time.
Regression
Purpose: to find a law connecting two variables, so that we can make predictions about the
value of y for any given value of x.
The response variable will be subject to some level of error or natural variation.
The explanatory variable is always plotted horizontally and the response variable is always
plotted vertically.
By examining the scatter diagrams for data, we can see if a straight line would be a good or
appropriate model for the relationship between x and y.
Having assumed the linear regression model, the results are used to find a regression line.
This line is known as the regression line of y on x, since y is the response variable for a given
value of x.
If you assume a linear regression line, each point with coordinates $(x_i, y_i)$ will have a vertical
distance $r_i$ from the regression line. These are known as residuals.
If the residuals are very small, a line may be drawn by eye, however a much better solution is
to find the line of best fit using the method of least squares.
Legendre formulated this method. The resulting line is known as the least squares regression
line.
We substitute the mean point $(\bar{x}, \bar{y})$ into the equation $y - \bar{y} = b(x - \bar{x})$ and rearrange to get
$y = a + bx$, where $a = \bar{y} - b\bar{x}$
The gradient is denoted by the letter b and is called the regression coefficient of y on x. We
calculate b using the formula
$b = \frac{S_{xy}}{S_{xx}}$
$\bar{x} = \frac{\sum x}{n}$
$\bar{y} = \frac{\sum y}{n}$
To draw this line, we choose three points: the mean point and one point whose x value is at
the low end of the observed values and another point whose x value is at the high end of the
observed values.
We can use our regression line to obtain estimates of y given values of x under appropriate
conditions.
We do not know what happens outside the range of our experimental data.
We are assuming a linear relationship within our observed values, and for all we know the
relationship between the variables outside this range may be non-linear. Therefore
it is dangerous to make predictions or estimates for the response variable based on values
outside the range of observed values. Doing so is known as extrapolation.
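The regression calculation and the warning about extrapolation can be sketched together as below. The data are invented, the function names are illustrative, and refusing to extrapolate by raising an error is just one possible design choice.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = mean_y - b * mean_x        # the line passes through the mean point
    return a, b

def estimate(x, xs, ys):
    """Estimate y at x, refusing to go outside the observed x values."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the observed range (extrapolation)")
    a, b = regression_line(xs, ys)
    return a + b * x

# Hypothetical data: x is the explanatory variable, y the response.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
a, b = regression_line(xs, ys)
y_hat = estimate(3.5, xs, ys)
print(round(a, 2), round(b, 2), round(y_hat, 3))
```

For this data b is about 0.97 and a is about 1.09, so in context y rises by roughly 0.97 for each unit increase in x, and a is the value of y predicted at x = 0 (itself an extrapolation here, since 0 is outside the observed range).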
You will also be asked to interpret the values of a and b from your regression line
within the context of the question.
Regression is concerned with finding a linear law between the two variables in question,
where the value of the response variable depends on that of the explanatory variable. Correlation
is concerned with how strongly two variables are linearly associated (not a law).
Probability
Venn Diagrams and Probability Definitions
∩ = intersection (AND)
U = union (OR); in maths, OR means one or the other or both
A' = NOT A
P(A) = 1 - P(A')
P(A') = 1 - P(A)
P(A'∩B') = 1 - P(AUB)
Mutual Exclusivity
Two events A & B are said to be mutually exclusive (m.e) if they cannot occur at the same
time. In this case, in the Venn Diagram, A & B do not overlap
Thus P(A∩B) = 0
P(AUB) = P(A) + P(B) for these events
Exhaustion
If two events A & B are such that AUB makes up all the possible outcomes
P(AUB) = 1
We say that A & B are exhaustive
P(A) + P(B) - P(A∩B) = 1
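These rules can be checked on a small example using Python sets, where `&` is intersection and `|` is union. The sample space (one roll of a fair die) and the events are assumptions made for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))          # one roll of a fair die (assumed)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def complement(event):
    return sample_space - event

A = {2, 4, 6}        # even score
B = {1, 3, 5}        # odd score
C = {1, 2, 3, 4}     # score of 4 or less

mutually_exclusive = prob(A & B) == 0
exhaustive = prob(A | B) == 1
addition_rule = prob(A | C) == prob(A) + prob(C) - prob(A & C)
neither = prob(complement(A) & complement(C))   # P(A'∩C')
print(mutually_exclusive, exhaustive, addition_rule, neither)
```

A and B here are both mutually exclusive and exhaustive, while A and C overlap, so the intersection has to be subtracted when finding P(AUC).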
Note: We can extend the basic conditional probability definition to things like
P(A'|B) = P(A'∩B) / P(B)
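A brief sketch of conditional probability on an invented example, assuming the basic definition P(A|B) = P(A∩B) / P(B) and a fair die as the sample space.

```python
from fractions import Fraction

sample_space = set(range(1, 7))          # one roll of a fair die (assumed)

def prob(event):
    return Fraction(len(event & sample_space), len(sample_space))

def conditional(event, given):
    """P(event | given) = P(event ∩ given) / P(given)."""
    return prob(event & given) / prob(given)

A = {2, 4, 6}                # even score
B = {4, 5, 6}                # score of at least 4
not_A = sample_space - A     # A'

p_A_given_B = conditional(A, B)          # outcomes 4 and 6 out of 4, 5, 6
p_notA_given_B = conditional(not_A, B)   # outcome 5 out of 4, 5, 6
print(p_A_given_B, p_notA_given_B)
```

The two conditional probabilities add up to 1, since given B either A or A' must occur.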
Independent Events
Two events are independent if the probability that one of them occurs is in no way influenced by
whether or not the other has occurred.
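One common way to test independence, assumed here because it is not stated in the notes above, is to check whether P(A∩B) = P(A) × P(B). The die and the events below are invented for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))          # one roll of a fair die (assumed)

def prob(event):
    return Fraction(len(event & sample_space), len(sample_space))

def independent(e1, e2):
    """True when P(e1 ∩ e2) equals P(e1) * P(e2)."""
    return prob(e1 & e2) == prob(e1) * prob(e2)

A = {2, 4, 6}        # even score
B = {1, 2, 3, 4}     # score of 4 or less
C = {4, 5, 6}        # score of at least 4

a_and_b = independent(A, B)   # 1/3 compared with 1/2 * 2/3
a_and_c = independent(A, C)   # 1/3 compared with 1/2 * 1/2
print(a_and_b, a_and_c)
```

Knowing the score is 4 or less leaves the chance of an even score at 1/2, so A and B are independent; knowing it is at least 4 raises that chance to 2/3, so A and C are not.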
The set of all possible values of a r.v. together with their probabilities is called a probability
distribution (probability disn)
Also, the function that describes how the probabilities are assigned is called the probability
function.
Remember Σ P(X=x) = 1
Random variables are denoted by capital letters and the particular values they take are
denoted by lower case letters.
Whatever the question is, always define what the random variable is.
The function that is responsible for allocating the probabilities P(X=x) is also known as the
probability density function (pdf)
Expectation E(X)
E(X) = Σ x P(X=x)
A discrete random variable with pdf P(X=x) = k , for all given values of x, where k is a
constant is said to follow a Uniform Distribution
The definition of expectation can be extended to any function of the r.v. X, such as X², 9X,
X - 4, 3X² - 5X
The following results hold when X is a discrete random variable and when both a and b are
constants
1. E(a) = a
2. E(aX) = aE(X)
3. E(aX + b) = aE(X) + b
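The expectation results can be checked numerically on a small made-up distribution. The dictionary layout (value mapped to probability) and the function name are just one convenient way to set this up.

```python
from fractions import Fraction

# Hypothetical probability distribution: value x -> P(X = x).
dist = {1: Fraction(1, 2), 2: Fraction(3, 10), 3: Fraction(1, 5)}

def expectation(dist, g=lambda x: x):
    """E(g(X)) = sum of g(x) * P(X = x) over all values x."""
    return sum(g(x) * p for x, p in dist.items())

total = sum(dist.values())                         # must equal 1
e_x = expectation(dist)                            # E(X)
e_x_squared = expectation(dist, lambda x: x * x)   # E(X²)
e_linear = expectation(dist, lambda x: 3 * x + 2)  # E(3X + 2)
print(e_x, e_x_squared, e_linear)
```

Here E(X) = 17/10 and E(3X + 2) = 71/10, which agrees with 3E(X) + 2.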
The Variance of X
Var(a) = 0
Var(aX) = a² Var(X)
Var(aX + b) = a² Var(X)
Var(aX ± bY) = a² Var(X) + b² Var(Y), provided X and Y are independent
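The variance results can be verified on invented distributions. The sketch assumes the standard formula Var(X) = E(X²) - [E(X)]², which is not written out in the notes above, and builds the distribution of 2X - 3Y by treating X and Y as independent.

```python
from fractions import Fraction

def expectation(dist, g=lambda x: x):
    return sum(g(x) * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X²) - [E(X)]² (assumed standard formula)."""
    return expectation(dist, lambda x: x * x) - expectation(dist) ** 2

def transform(dist, g):
    """Distribution of g(X), merging probabilities of equal values."""
    out = {}
    for x, p in dist.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

def combine(dist_x, dist_y, g):
    """Distribution of g(X, Y) when X and Y are independent."""
    out = {}
    for x, p in dist_x.items():
        for y, q in dist_y.items():
            key = g(x, y)
            out[key] = out.get(key, 0) + p * q   # independence: multiply
    return out

# Hypothetical distributions.
X = {1: Fraction(1, 2), 2: Fraction(3, 10), 3: Fraction(1, 5)}
Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

var_x = variance(X)
var_linear = variance(transform(X, lambda x: 3 * x + 2))        # Var(3X + 2)
var_diff = variance(combine(X, Y, lambda x, y: 2 * x - 3 * y))  # Var(2X - 3Y)
print(var_x, var_linear, var_diff)
```

Note that Var(2X - 3Y) comes out as 4Var(X) + 9Var(Y): the variances are added even though the variables are subtracted.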
If X is the discrete uniform variable and x n = n (ie. x values start at 1 and progress up
consecutively)
The probability density function of the normal random variable is very complicated. The shape
of the curve depends on two parameters, mean and variance.
X ~ N(μ, σ²)
Z ~ N(0, 1)
This table contains values of z for the normal variable Z ~ N(0, 1) such that Z exceeds z with
probability p, i.e. P(Z > z) = p.
You can use both tables in reverse to find the value of z, given a probability.
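Table lookups can be cross-checked with Python's standard library, as in the sketch below. It assumes the usual standardisation Z = (X - μ)/σ, which is not written out in the notes above, and the values μ = 50 and σ = 4 are invented.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)      # the standard normal variable Z ~ N(0, 1)

def upper_tail(z):
    """P(Z > z)."""
    return 1 - Z.cdf(z)

def z_for_upper_tail(p):
    """Reverse lookup: the value z for which P(Z > z) = p."""
    return Z.inv_cdf(1 - p)

# Hypothetical X ~ N(50, 4²): P(X > 56) = P(Z > (56 - 50)/4) = P(Z > 1.5).
mu, sigma = 50, 4
p_x_above_56 = upper_tail((56 - mu) / sigma)

z_5_percent = z_for_upper_tail(0.05)
print(round(p_x_above_56, 4), round(z_5_percent, 4))
```

In an exam the tables should still be used and quoted, but a check like this is a quick way to confirm that a table has been read in the right direction.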