Correlation of Experimental Data CLIL 2017
Almir Ahmetović
Faculty of Mechanical Engineering,
Fakultetska 1,
Zenica, Bosnia and Herzegovina
SUMMARY
This paper deals with the correlation of experimental data. The concept of correlation, its expression
through the correlation coefficient, and the method of least squares are explained, with a focus on the
most commonly used correlation coefficients (Pearson's correlation coefficient and Spearman's
correlation coefficient). The experimental part uses two examples to show the use of correlation and
the correlation experiment itself. The examples are worked for Pearson's correlation coefficient and
for the method of least squares.
Key words: coefficient, correlation, experiment, chart, Pearson, variables
1. INTRODUCTION
Correlation describes the relationship between two phenomena represented by the values of two
variables. When the variables are correlated, the value of one variable can be predicted, with a certain
probability, from the value of the other. A change in the value of one variable is accompanied by a
change in the value of the other. The variable that affects the other is called the independent variable,
and the variable that is affected is called the dependent variable. In some cases the two variables
influence each other simultaneously, so both are dependent and independent at the same time. The
relationship between two variables can be displayed in a two-dimensional graph called a scatter chart.
The values of one variable are shown on the x axis and the values of the other on the y axis. The
plotted points are scattered around a line called the regression line. The closer the points lie to this
line, the higher the correlation; the more they disperse from it, the lower the correlation. In practice it
is very difficult to judge the degree of correlation visually, except in the case of perfect correlation.
Depending on the relationship between the two variables, the correlation can be linear or nonlinear.
For a linear correlation the points are grouped around a straight line, and for a nonlinear correlation
they are grouped around a curve.
The 2nd Student Conference – Content and English Language Integrated Learning
CLIL 2017, Zenica, BiH, April 2017
$$ \operatorname{corr}(a + bX, Y) = \operatorname{corr}(X, Y), \quad \text{if } b > 0 $$
This property follows from the fact that the variables are in effect z-transformed when the coefficient
is computed. The transformation preserves the shape of their distribution, and the transformed
variables have a mean of zero and a standard deviation of one.
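The invariance stated above can be checked numerically. The following is a minimal sketch, not part of the original paper: the sample values, the constants `a` and `b`, and the helper names `pearson_r` and `z_scores` are assumptions chosen for illustration. It also shows that the coefficient equals the mean product of the z-transformed values when the population standard deviation is used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def z_scores(values):
    """z-transform: subtract the mean, divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Hypothetical sample values, chosen only for illustration.
xs = [1.0, 2.0, 4.0, 7.0, 9.0]
ys = [2.1, 2.9, 5.2, 6.8, 9.5]

a, b = 3.0, 2.5  # any offset a and any positive scale b

r_original = pearson_r(xs, ys)
r_transformed = pearson_r([a + b * x for x in xs], ys)
r_flipped = pearson_r([a - b * x for x in xs], ys)  # a negative scale flips the sign

# Mean product of the z-scores reproduces the same coefficient.
r_from_z = sum(zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))) / len(xs)

print(r_original, r_transformed, r_flipped, r_from_z)
```

The offset and scale cancel because both are removed by the z-transformation, which is why only the sign of `b` matters.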
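If there are two variables x and y, and the results of the experiment are pairs of data (xi, yi), the
linear correlation coefficient can be defined as:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}} \qquad (1) $$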
The resulting value of rxy lies in the range from -1 to +1. A value of +1 indicates an ideal linear
relationship with a positive slope (y increases with increasing x). A value of -1 indicates an ideal
linear relationship with a negative slope (y decreases with increasing x). A value of 0 indicates the
absence of linear correlation between the two variables. Even if there is no correlation, it is unlikely
that rxy will be exactly 0. For a given sample size, statistical theory is used to determine whether the
calculated coefficient rxy is significant or could be due to chance. Harnett and Murphy (1975) and
Johnson (1988) dealt with these issues.
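Equation (1) translates directly into code. The sketch below is an illustration under assumptions: the function name `pearson_r` and the two test lines y = 2x + 1 and y = -3x + 4 are not taken from the paper. It reproduces the two limiting cases described above.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r_xy following Eq. (1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator of Eq. (1): sum of products of the deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator of Eq. (1): square root of the product of the squared-deviation sums.
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
r_rising = pearson_r(xs, [2 * x + 1 for x in xs])    # ideal line, positive slope
r_falling = pearson_r(xs, [-3 * x + 4 for x in xs])  # ideal line, negative slope
print(r_rising, r_falling)
```

Note that the function fails with a division by zero if all x values or all y values are identical, since the coefficient is undefined in that case.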
For practical problems, this process can be simplified by using a table. Critical values of r have been
calculated, so the computed value rxy can be compared with them. For two variables and n data pairs,
the appropriate critical (limit) values of r, denoted rt, are given in Table 1. The value rt is a function of
the number of samples and the level of significance, α. The values in the table are limiting values that
could be expected to arise by pure chance. For each value of rt in the table, there is only a probability
α that an experimental value of rxy will be larger by pure chance. If the experimental value exceeds the
tabulated value, the experimental data can be expected to show a real correlation with a confidence
level of 1 - α. For engineering problems, the confidence level is usually taken as 95%, which
corresponds to α = 0,05. For a given data set, we take rt from the table and compare it with the rxy
computed from the data. If the absolute value of rxy is higher than rt, we can assume that y depends on
x in a way that is not due to chance, and a linear relationship can be expected to approximate the real
functional relationship. A value of rxy smaller than rt means that we cannot be sure that a linear
functional relationship exists. The functional relationship does not have to be linear for a significant
correlation coefficient to be obtained. For example, a parabolic relationship with little scatter will
give a high correlation coefficient. On the other hand, some functions (for example a multivalued
circular function), even though they are strongly deterministic, will result in a low value of the
correlation coefficient rxy.
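The last two remarks can be illustrated with a short sketch. The data below are assumptions made for illustration only: one half of the parabola y = x² sampled at x = 0, 1, ..., 10, and twelve points spaced evenly on the unit circle.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Nonlinear but monotonic relationship: one half of a parabola, no scatter.
parabola_x = [float(x) for x in range(11)]
parabola_y = [x ** 2 for x in parabola_x]
r_parabola = pearson_r(parabola_x, parabola_y)

# Fully deterministic but multivalued relationship: points on a circle.
angles = [2 * math.pi * k / 12 for k in range(12)]
circle_x = [math.cos(t) for t in angles]
circle_y = [math.sin(t) for t in angles]
r_circle = pearson_r(circle_x, circle_y)

print(f"parabola: r = {r_parabola:.3f}")
print(f"circle:   r = {r_circle:.3f}")
```

The parabola gives a coefficient close to 1 although the relationship is not linear, while the circle gives a coefficient of practically zero although y is completely determined by x up to its sign.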
Two additional precautions must be noted before the correlation coefficient is used. First, a single
bad data point can have a strong influence on the value of rxy. If possible, outliers should be removed
from the measurements before the coefficient is evaluated. Second, it is a mistake to conclude from a
significant correlation value alone that a change in one variable causes a change in the other.
Causality must be established on the basis of other information about the problem.
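The first precaution can be demonstrated with a small sketch. The measurements are hypothetical and were invented for this illustration: eight nearly linear readings, and the same readings with the last value corrupted by a misplaced decimal point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings that follow a nearly linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys_clean = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.1]

# Same readings, but the last one was recorded as 1.61 instead of 16.1.
ys_bad = ys_clean[:-1] + [1.61]

r_clean = pearson_r(xs, ys_clean)
r_bad = pearson_r(xs, ys_bad)

print(f"without the bad point: r = {r_clean:.3f}")
print(f"with the bad point:    r = {r_bad:.3f}")
```

One wrong value out of eight is enough to turn an almost ideal correlation into a weak one, which is why the data should be screened before the coefficient is interpreted.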
Example 1:
It is thought that lap times for a race car depend on the ambient temperature. The following data for
the same car with the same driver were measured at different races:
Ambient temperature (°F)    40     47     55     62     66     88
Lap time (s)                65,3   66,5   67,3   67,8   67,0   66,6
First, we plot the data given in the table above. At first glance, the plot suggests a weak positive
correlation between lap time and ambient temperature, but the correlation coefficient must be
computed to determine whether this correlation is real or might be due to pure chance. We can
determine this coefficient using Eq. 3. The computation table is as follows:
x     y      x − x̄     (x − x̄)²    y − ȳ    (y − ȳ)²    (x − x̄)(y − ȳ)
40    65,3   -19,67    386,78      -1,45    2,10        28,52
47    66,5   -12,67    160,44      -0,25    0,06        3,17
55    67,3   -4,67     21,78       0,55     0,30        -2,57
62    67,8   2,33      5,44        1,05     1,10        2,45
66    67,0   6,33      40,11       0,25     0,06        1,58
88    66,6   28,33     802,78      -0,15    0,02        -4,25
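The column sums of the computation table and the resulting coefficient can be reproduced with the following sketch. The data are those of Example 1. The critical value `R_T_ASSUMED = 0.811` is an assumption: it is the commonly tabulated limit for n = 6 and α = 0,05, used here because Table 1 is not reproduced in this text.

```python
import math

# Example 1 data: ambient temperature (°F) and lap time (s).
temperature = [40, 47, 55, 62, 66, 88]
lap_time = [65.3, 66.5, 67.3, 67.8, 67.0, 66.6]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(lap_time) / n

# Column sums of the computation table.
sum_dx2 = sum((x - mean_x) ** 2 for x in temperature)
sum_dy2 = sum((y - mean_y) ** 2 for y in lap_time)
sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, lap_time))

r_xy = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

# Assumed critical value for n = 6, alpha = 0.05 (Table 1 is not shown here).
R_T_ASSUMED = 0.811
is_significant = abs(r_xy) > R_T_ASSUMED

print(f"sum (x - mean_x)^2           = {sum_dx2:.2f}")
print(f"sum (y - mean_y)^2           = {sum_dy2:.3f}")
print(f"sum (x - mean_x)(y - mean_y) = {sum_dxdy:.2f}")
print(f"r_xy = {r_xy:.4f}, significant: {is_significant}")
```

Under the assumed critical value, the computed coefficient of about 0,40 stays well below the limit, so the apparent trend in the plot could be due to pure chance.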
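[Figure: data points and fitted line, showing at a given x the value Yi on the line, the measured value yi, and the error ei between them]
$$ e_i^2 = (Y_i - y_i)^2 = (a x_i + b - y_i)^2 \qquad (6) $$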
The sum of the squared errors for all the data points would be:
$$ E = \sum (Y_i - y_i)^2 = \sum (a x_i + b - y_i)^2, \quad \text{where } \sum = \sum_{i=1}^{n} \qquad (7) $$
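As a minimal sketch of equation (7), the sum of squared errors can be evaluated for any candidate line. The function name `sum_squared_error` and the three sample points are assumptions made for illustration.

```python
def sum_squared_error(a, b, xs, ys):
    """E from Eq. (7): sum over all points of (a*x_i + b - y_i)^2."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

e_exact = sum_squared_error(2.0, 1.0, xs, ys)    # the line through all points
e_shifted = sum_squared_error(2.0, 0.0, xs, ys)  # same slope, intercept off by 1

print(e_exact, e_shifted)  # prints 0.0 3.0
```

The shifted line misses each of the three points by exactly 1, so its squared errors add up to 3, while the exact line gives E = 0, the smallest possible value.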
Now we choose a and b so as to minimize E, by differentiating E with respect to a and b and setting
the results to zero:
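$$ \frac{\partial E}{\partial a} = 0 = \sum 2 (a x_i + b - y_i) x_i $$
$$ \frac{\partial E}{\partial b} = 0 = \sum 2 (a x_i + b - y_i) \qquad (8) $$
Solving these two equations simultaneously for a and b gives
$$ a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2} \qquad (9a) $$
$$ b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2} \qquad (9b) $$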
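Equations (9a) and (9b) can be evaluated directly from four sums. The sketch below is an illustration, not the paper's own worked example: the function name `fit_line` is an assumption, and the Example 1 lap-time data are reused only because they are already available.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b from Eqs. (9a) and (9b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2        # shared by (9a) and (9b)
    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator
    return a, b

# Example 1 data, reused here purely for illustration.
temperature = [40, 47, 55, 62, 66, 88]
lap_time = [65.3, 66.5, 67.3, 67.8, 67.0, 66.6]

a, b = fit_line(temperature, lap_time)
print(f"Y = {a:.5f} x + {b:.3f}")
```

The denominator is zero when all x values are equal, in which case no unique line exists and the function raises a division error.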
The resulting line, Y = ax + b, is called the least-squares best fit to the data represented by (xi, yi).
In linear regression analysis it is desirable to determine how good the fit actually is. Some idea of
this can be obtained by examining a plot of the data together with the fitted line, such as the one
shown in Figure 3. A good measure of the adequacy of a regression model is the coefficient of
determination, given in the form
$$ r^2 = 1 - \frac{\sum (a x_i + b - y_i)^2}{\sum (y_i - \bar{y})^2} \qquad (10) $$
For engineering data, r2 is normally quite high, in the range of 0.8 to 0.9 or more. A low value can
indicate that an important variable that influences the result has not been taken into account.
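The coefficient of determination from equation (10) can be sketched as follows. The function name `coefficient_of_determination`, the fitted coefficients passed to it, and the reuse of the Example 1 data are assumptions made for illustration.

```python
def coefficient_of_determination(a, b, xs, ys):
    """r^2 from Eq. (10) for the line Y = a*x + b."""
    mean_y = sum(ys) / len(ys)
    residual_sum = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    total_sum = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual_sum / total_sum

# Example 1 data, reused here purely for illustration.
temperature = [40, 47, 55, 62, 66, 88]
lap_time = [65.3, 66.5, 67.3, 67.8, 67.0, 66.6]

# Least-squares coefficients for these data, computed from Eqs. (9a) and (9b).
n = len(temperature)
sum_x, sum_y = sum(temperature), sum(lap_time)
sum_xy = sum(x * y for x, y in zip(temperature, lap_time))
sum_x2 = sum(x * x for x in temperature)
denominator = n * sum_x2 - sum_x ** 2
a = (n * sum_xy - sum_x * sum_y) / denominator
b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator

r_squared = coefficient_of_determination(a, b, temperature, lap_time)
print(f"r^2 = {r_squared:.3f}")
```

For these data the value is far below the 0.8 to 0.9 range, which is consistent with the weak correlation coefficient found for the same measurements.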
Another measure of reliability of the obtained coefficients is called the standard error of the estimate,
given by
$$ S_{y,x} = \sqrt{\frac{\sum (y_i - Y_i)^2}{n - 2}} \qquad (11) $$
This is, in effect, the standard deviation of the differences between the data points and the fitted
line. The best-fit line obtained from equation (4) will, in general, not pass through the origin.
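The standard error of the estimate from equation (11) can be sketched in a few lines. The function name `standard_error_of_estimate` and the reuse of the Example 1 data are assumptions made for illustration; the coefficients a and b are obtained from the least-squares formulas (9a) and (9b).

```python
import math

def standard_error_of_estimate(a, b, xs, ys):
    """S_{y,x} from Eq. (11) for the fitted line Y = a*x + b."""
    n = len(xs)
    if n <= 2:
        raise ValueError("Eq. (11) needs more than two data points")
    residual_sum = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(residual_sum / (n - 2))

# Example 1 data, reused here purely for illustration.
temperature = [40, 47, 55, 62, 66, 88]
lap_time = [65.3, 66.5, 67.3, 67.8, 67.0, 66.6]

# Least-squares coefficients from Eqs. (9a) and (9b).
n = len(temperature)
sum_x, sum_y = sum(temperature), sum(lap_time)
sum_xy = sum(x * y for x, y in zip(temperature, lap_time))
sum_x2 = sum(x * x for x in temperature)
denominator = n * sum_x2 - sum_x ** 2
a = (n * sum_xy - sum_x * sum_y) / denominator
b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator

s_yx = standard_error_of_estimate(a, b, temperature, lap_time)
print(f"S_yx = {s_yx:.3f} s")
```

The result has the same unit as y, here seconds, so it can be compared directly with the spread of the measured lap times.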
With the origin fixed at (0,0), the fitted line will have the form
$$ Y = a x \qquad (12) $$
The value of a is calculated from
$$ a = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} \qquad (13) $$
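A minimal sketch of equation (13) follows. The function name `fit_through_origin` and the four sample points are assumptions made for illustration.

```python
def fit_through_origin(xs, ys):
    """Slope a of the line Y = a*x forced through (0, 0), Eq. (13)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return sum_xy / sum_x2

# Hypothetical readings of a quantity that must vanish at x = 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

a = fit_through_origin(xs, ys)
print(f"Y = {a:.2f} x")
```

This form is appropriate only when the physics of the problem requires Y = 0 at x = 0; otherwise the two-parameter line Y = ax + b is the safer choice.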
3. CONCLUSION
Correlation between variables is important for formulating hypotheses in scientific work, where an
observed link between two variables is used in an attempt to establish a causal link, which correlation
alone can never prove. Good knowledge of the rules for using the correlation coefficient is required in
order to avoid wrong conclusions. For the experimental use of correlation it is usually necessary to
find a mathematical function that closely matches the problem and then to check the assumed
correlation by testing. Correlation is often used to verify test results. After a test, the correlation
between the test conditions and the obtained results is determined. After the test is repeated, the
correlation between the new and the previously obtained results is determined. If no such correlation
exists, this usually means that the experiment is unstable, since the repeated experiments cannot
replicate the previous results. Finally, it is important to point out that correlation by itself does not
prove that one variable depends on the other, and that it does not entail causation.
4. LITERATURE
[1] Wheeler A. J. & Ganji A. R. (2010). Introduction to Engineering Experimentation.
[2] Correlation: http://matheguru.com/stochastik/272-korrelation.html (March 2017)
[3] Osnove zdravstvene statistike: Metode istraživanja u fizioterapiji:
https://ldap.zvu.hr/~oliverap/MetodeIstrazivanjaFT/11_Korelacija.pdf (March 2017)
[4] Crashkurs-statistik: Korrelation: http://www.crashkurs-statistik.de/spearman-korrelation-rangkorrelation/
(March 2017)