
The 2nd Student Conference – Content and English Language Integrated Learning

CLIL 2017, Zenica, BiH, April 2017

CORRELATION OF EXPERIMENTAL DATA

Almir Ahmetović
Faculty of Mechanical Engineering,
Fakultetska 1,
Zenica, Bosnia and Herzegovina

SUMMARY
The topic of this paper is the correlation of experimental data. The concept of correlation, the correlation coefficient, and the method of least squares will be explained, with a focus on the most commonly used correlation coefficients (Pearson's correlation coefficient and Spearman's correlation coefficient). The experimental part, with two worked examples, will show the use of correlation and the correlation experiment itself. The examples are made for Pearson's correlation coefficient and for the method of least squares.
Key words: coefficient, correlation, experiment, chart, Pearson, variables

SAŽETAK
Tema ovog rada je Korelacija eksperimentalnih podataka. Pojam korelacije, njegovo predstavljanje
koeficijent korelacije, te metoda najmanjih kvadrata će biti objašnjeni, sa naglaskom na najčešće
korištene koeficijente korelacije (Pearsonov koeficijent korelacije i Spearmanov koeficijent
korelacije). Ekperimentalni dio sa dva korištena primjera će pokazati upotrebu korelacije i samog
korelacijskog eksperimenta. Primjeri su rađeni za Pearsonov koeficijent korelacije i za metodu
najmanjih kvadrata.
Ključne riječi: koeficijent, korelacija, eksperiment, dijagram, Pearson, varijabla

1. INTRODUCTION
Correlation describes the relation and interconnection between different phenomena represented by the values of two variables. It is possible to predict the value of one variable, with a given probability, on the basis of the value of the other variable. Changing the value of one variable affects the value of the other variable. The variable that affects the other is called the independent variable; the variable that is affected is called the dependent variable. There are cases in which the two variables influence each other simultaneously; in those cases both variables are at the same time dependent and independent. The mutual relationship between two variables can be displayed using a two-dimensional graph called a scatter chart. The values of one variable are shown on the x axis and the values of the other variable on the y axis of the chart. The plotted points scatter around a certain line, which is called the regression line. The closer the points lie to this line, the higher the correlation; the more they disperse from it, the lower the correlation. In practice it is visually very difficult, except in the case of perfect correlation, to determine the degree of correlation between the variables. Depending on the mutual relationship between the two variables, the correlation can be linear or nonlinear. For linear correlation the points are grouped around a straight line, and for nonlinear correlation the points are grouped around some other curve.


2. THE CORRELATION COEFFICIENT


Scatter is a common characteristic of virtually all measurements. In some cases the scatter may be so large that it is difficult to detect a trend. Consider, for example, an experiment in which variable x is varied systematically and variable y is then measured. We would like to know whether the value of y depends on the value of x. If the results appeared as in Figure 1(a), an immediate conclusion would be that there is a strong relationship between y and x. If the data appeared as shown in Figure 1(b), we would probably conclude that there is no functional relationship between y and x. If the data appeared as shown in Figure 1(c), there would be some uncertainty: a trend of increasing y with increasing x appears, but the scatter is so large that the trend might be a result of pure chance. There exists a statistical parameter, called the correlation coefficient, that is used to determine whether a trend is real or simply a result of pure chance.
The correlation coefficient r_xy is a number whose value can be used to check for a functional link between the two measured variables x and y. There are multiple correlation coefficients which are used in different cases. The most commonly used are:
1) Pearson's correlation coefficient - most commonly used when working with linear models
2) Spearman's correlation coefficient - most commonly used when working with models that are not linear

Figure 1. Data showing variation in scatter of the dependent variable y

2.1. Pearson's correlation coefficient


Pearson's correlation coefficient is used in cases where a linear relationship and a normal distribution exist between the observed variables. The value of Pearson's correlation coefficient ranges between +1 (perfect positive correlation) and -1 (perfect negative correlation). The sign of the coefficient indicates the direction of the correlation, positive or negative, but not its strength. Pearson's correlation coefficient is based on comparing the actual mutual influence of the observed variables with the maximum possible influence of the two variables.
One of the most important mathematical properties of the correlation coefficient is that it is invariant under linear transformation. Adding, subtracting, or multiplying each value by a constant is a linear transformation. Under a linear transformation the individual values retain their relative positions in the distribution:
corra  b  X,Y)  corr(X,Y ,if b  0

corra  b  X , Y )  corr( X , Y , if b  0

This property follows from the fact that the variables are essentially z-transformed; in this transformation they retain the shape of their distribution, and because of that they have a mean value of zero and a standard deviation equal to one.
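As an informal illustration (not part of the original paper), the following Python sketch uses NumPy to verify this invariance numerically on arbitrary sample data:

```python
import numpy as np

# Arbitrary sample data, used only to illustrate the invariance property
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_xy = np.corrcoef(x, y)[0, 1]

# Linear transformation a + b*x: with b > 0 the coefficient is unchanged,
# with b < 0 only its sign flips.
r_pos = np.corrcoef(10.0 + 2.0 * x, y)[0, 1]   # equals  r_xy
r_neg = np.corrcoef(10.0 - 2.0 * x, y)[0, 1]   # equals -r_xy

print(r_xy, r_pos, r_neg)
```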


If there are two variables x and y, and the experiment yields n pairs of data, the linear correlation coefficient can be defined as:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{1/2}} \qquad (1)$$
The resulting value of r_xy will range from -1 to +1. A value of +1 indicates an ideal linear relationship with a positive slope (y increases as x increases). A value of -1 indicates an ideal linear relationship with a negative slope (y decreases as x increases). A value of 0 indicates the absence of any linear correlation between the two variables. Even if there is no correlation, it is unlikely that the value of r_xy will be exactly 0. For a given sample size, statistical methods can be used to determine whether the calculated coefficient r_xy is significant or could be due to chance. Harnett and Murphy dealt with these issues in 1975, as did Johnson in 1988.
For practical problems, this process can be simplified in the form of a table. Critical values of r have been determined, so it is possible to compare them with the calculated value r_xy. For two variables and n data pairs, the corresponding critical (limit) values of r, denoted r_t, were calculated and are given in Table 1. r_t is a function of the number of samples and the level of significance, α. The values of r_t in the table are the limiting values beyond which the results are unlikely to be pure coincidence: for each r_t in the table, there is only a probability α that an experimental value of r_xy will exceed r_t by pure chance. If the experimental value exceeds the tabulated value, the experimental data can be expected to show a real correlation with confidence 1-α. For engineering problems, the confidence level (a measure of uncertainty) is usually taken as 95%, which corresponds to a value of α of 0.05. For a given data set, we take r_t from the table and compare it with the value of r_xy computed from the data. If the absolute value of r_xy is higher than r_t, the assumption is that y depends on x rather than on coincidence, and the linear relationship is expected to provide an approximation of the real functional connection. If r_xy is smaller than r_t, we cannot be sure that a functional linear relationship exists. The functional relationship does not have to be linear for a significant correlation coefficient to be obtained; for example, a parabolic relationship showing little scatter will give a high correlation coefficient. On the other hand, some relationships (e.g., multivalued circular functions), even though they are obviously strong, will result in a poor value of the correlation coefficient r_xy.

Table 1. Minimum values of correlation coefficients for different values of α [1]

  n      α = 0,20   α = 0,10   α = 0,05   α = 0,02   α = 0,01
  3      0,951      0,988      0,997      1,000      1,000
  4      0,800      0,900      0,950      0,980      0,990
  5      0,687      0,805      0,878      0,934      0,959
  6      0,608      0,729      0,811      0,882      0,917
  7      0,551      0,669      0,754      0,833      0,875
  8      0,507      0,621      0,707      0,789      0,834
  9      0,472      0,582      0,666      0,750      0,798
  10     0,443      0,549      0,632      0,715      0,765
  11     0,419      0,521      0,602      0,685      0,735
  12     0,398      0,497      0,576      0,658      0,708
  13     0,380      0,476      0,553      0,634      0,684
  14     0,365      0,458      0,532      0,612      0,661
  15     0,351      0,441      0,514      0,592      0,641
  16     0,338      0,426      0,497      0,574      0,623
  17     0,327      0,412      0,482      0,558      0,606
  18     0,317      0,400      0,468      0,543      0,590
  19     0,308      0,389      0,456      0,529      0,575
  20     0,299      0,378      0,444      0,516      0,561
  25     0,265      0,337      0,396      0,462      0,505
  30     0,241      0,306      0,361      0,423      0,463
  35     0,222      0,283      0,334      0,392      0,430
  40     0,207      0,264      0,312      0,367      0,403
  45     0,195      0,248      0,294      0,346      0,380
  50     0,184      0,235      0,279      0,328      0,361
  100    0,129      0,166      0,197      0,233      0,257
  200    0,091      0,116      0,138      0,163      0,180

Two additional precautions must be noted before the correlation coefficient can be used. First, a single point with a bad data value can have a strong influence on the value of r_xy. If possible, such outlying values should be removed from the measurements before the coefficient is evaluated. Second, it is a mistake to conclude from a significant correlation alone that a change in one variable causes the change in the other variable. Causality must be determined on the basis of other information about the problem.
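The procedure described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names are invented, Eq. (1) is implemented directly, and the critical values are taken from the α = 0.05 column of Table 1 for selected sample sizes.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient r_xy, computed as in Eq. (1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) *
                    sum((yi - y_mean) ** 2 for yi in y))
    return num / den

# Critical values r_t for alpha = 0.05 (selected n from Table 1);
# extend the dictionary for other sample sizes as needed.
R_T_ALPHA_005 = {3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707}

def trend_is_significant(x, y):
    """True if |r_xy| exceeds the tabulated critical value r_t (alpha = 0.05)."""
    return abs(pearson_r(x, y)) > R_T_ALPHA_005[len(x)]
```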

2.2. Spearman's correlation coefficient


Spearman's correlation coefficient (rank correlation) is used to measure the correlation between variables in cases where it is not possible to apply the Pearson correlation coefficient. It is based on measuring the consistency of the relationship between the ranked variables, and the form of the relationship (e.g., the linear form that is a prerequisite for using the Pearson coefficient) is not essential. The Spearman coefficient is used, for example, where a linear relationship between the variables does not exist and it is not possible to apply a transformation that would translate the relationship into a linear correlation (e.g., the link between seismic attributes and borehole data in petroleum geology). Spearman's coefficient gives an approximate value of the correlation coefficient, which is treated as a sufficiently good approximation. When using the Spearman coefficient, the values of the variables need to be ranked and thus reduced to a common measure. The simplest way of ranking is to assign rank 1 to the smallest value of each variable, rank 2 to the next smallest, and so on until the largest value is ranked. The coefficient is then calculated from the assigned ranks. Spearman's coefficient will be denoted r_s.
The formula for calculating the Spearman correlation coefficient is:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (2)$$
where d_i is the difference between the ranks of the two observed variables for the i-th pair, and n is the number of data pairs.
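The ranking procedure and Eq. (2) can be sketched in Python as follows. This is a simple illustration that assumes no tied values; the function names are not from the original paper.

```python
def ranks(values):
    """Rank 1 for the smallest value, rank 2 for the next, and so on
    (ties are not handled in this simple sketch)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman's rank correlation coefficient r_s, Eq. (2)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((rxi - ryi) ** 2 for rxi, ryi in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# A perfectly monotonic but nonlinear relationship gives r_s = 1
print(spearman_rs([1, 2, 3, 4], [1, 8, 27, 64]))  # 1.0
```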


Example 1:
It is thought that lap times for a race car depend on the ambient temperature. The following data for
the same car with the same driver were measured at different races:
Ambient temperature (°F):   40     47     55     62     66     88
Lap time (s):               65,3   66,5   67,3   67,8   67     66,6
First, we plot the data from the table above (Figure 2). At first glance it seems that there could be a weak positive correlation between lap time and ambient temperature, but the correlation coefficient must be computed to determine whether this correlation is real or might be due to pure chance. We determine this coefficient using Eq. (1). The computation table is as follows:

x     y      x - x̄    (x - x̄)²   y - ȳ    (y - ȳ)²   (x - x̄)(y - ȳ)
40    65,3   -19,67    386,78     -1,45    2,10       28,52
47    66,5   -12,67    160,44     -0,25    0,06       3,17
55    67,3   -4,67     21,78      0,55     0,30       -2,57
62    67,8   2,33      5,44       1,05     1,10       2,45
66    67     6,33      40,11      0,25     0,06       1,58
88    66,6   28,33     802,78     -0,15    0,02       -4,25

Σx = 358   Σy = 400,50   Σ(x - x̄)² = 1417,33   Σ(y - ȳ)² = 3,66   Σ(x - x̄)(y - ȳ) = 28,90

Figure 2. Diagram of the given variables in Excel [1]

The correlation coefficient is then computed using Eq. (1):

$$r_{xy} = \frac{28,9}{(1417,33 \cdot 3,66)^{1/2}} = 0,4013 \qquad (3)$$
For a confidence level of 95%, α equals 1 - 0.95 = 0.05. For 6 pairs of data, from Table 1, we obtain the value r_t = 0.811. Since r_xy is less than r_t, the conclusion is that the apparent trend in the data may be due only to chance.
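The result can be cross-checked with a short Python sketch. It uses scipy.stats.pearsonr, which reports a p-value instead of the tabulated r_t, but it leads to the same conclusion; the variable names are illustrative.

```python
from scipy.stats import pearsonr

temp_f = [40, 47, 55, 62, 66, 88]               # ambient temperature, °F
lap_s = [65.3, 66.5, 67.3, 67.8, 67.0, 66.6]    # lap time, s

r, p_value = pearsonr(temp_f, lap_s)
print(round(r, 4))       # approximately 0.40, as in Eq. (3)
print(p_value > 0.05)    # True: the trend is not significant at alpha = 0.05
```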


2.3. Least-Squares Linear Fit


When performing experiments it is often required to correlate the experimental data with an appropriate mathematical function, such as a straight line, a parabola, or an exponential function. The most commonly used function for this purpose is the straight line. A linear fit is suitable for a large number of data sets, and in some cases the data can be transformed into an approximately linear form. As shown in Figure 3, if there are n data pairs (x_i, y_i), we attempt to fit a straight line of the form

$$Y = ax + b \qquad (4)$$
through the data. We need to determine the values of the constants a and b. If there are only two pairs of data, the solution is simple, because two points completely determine a line. When there are more points, the best fit to the data has to be determined. The experimenter can simply draw a line through the plotted data that passes through or close to the largest number of points, and this is often a satisfactory approximation of the linear relationship. A more systematic and appropriate approach is to use the method of least squares, or linear regression, to find the best fit to the data. Regression is a completely defined mathematical formulation that is easily automated. Assume that the test data consist of n pairs of data. For each value x_i (which is considered to be error-free), it is possible to predict the value of y from the linear relationship Y = ax_i + b.
For each value of x_i, an error occurs:

$$e_i = Y_i - y_i \qquad (5)$$

Figure 3. Fitting a straight line through data [1]

The square of the error would be:

$$e_i^2 = (Y_i - y_i)^2 = (ax_i + b - y_i)^2 \qquad (6)$$
The sum of the squared errors for all the data points would be:

$$E = \sum_{i=1}^{n}(Y_i - y_i)^2 = \sum_{i=1}^{n}(ax_i + b - y_i)^2 \qquad (7)$$
Now we choose a and b so as to minimize E, by differentiating E with respect to a and b and setting the results to zero:

$$\frac{\partial E}{\partial a} = 0 = \sum 2(ax_i + b - y_i)\,x_i$$

$$\frac{\partial E}{\partial b} = 0 = \sum 2(ax_i + b - y_i) \qquad (8)$$

These two equations can be solved simultaneously for a and b:

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (9a)$$

$$b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (9b)$$
The resulting line, Y = ax + b, is called the least-squares best fit to the data represented by the pairs (x_i, y_i).
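Equations (9a) and (9b) translate directly into code. The following Python sketch is an illustrative implementation, not code from the paper; numpy.polyfit(x, y, 1) would return the same slope and intercept.

```python
def least_squares_line(x, y):
    """Least-squares best fit Y = a*x + b, using Eqs. (9a) and (9b)."""
    n = len(x)
    sx = sum(x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    denom = n * sxx - sx ** 2
    a = (n * sxy - sx * sy) / denom
    b = (sxx * sy - sx * sxy) / denom
    return a, b

# Points lying exactly on y = 2x + 1 are recovered exactly
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```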
In linear regression analysis it is desirable to confirm how reliable the obtained fit is; the idea can be visualized in a plot such as Figure 3. A good measure of the adequacy of a regression model is provided by the coefficient of determination, given in the form

$$r^2 = 1 - \frac{\sum (ax_i + b - y_i)^2}{\sum (y_i - \bar{y})^2} \qquad (10)$$

For engineering data, r² will usually be very high, in the range of 0.8 to 0.9 or more; a small value can be an indicator that an important variable which influences the result has not been taken into account.

Figure 4. Least-squares line with forced origin [1]

Another measure of reliability of the obtained coefficients is called the standard error of the estimate, given by

$$S_{y,x} = \left[\frac{\sum (y_i - Y_i)^2}{n - 2}\right]^{1/2} \qquad (11)$$

This is the standard deviation of the differences (the distances) between the data points and the fitted function. If equation (4) is used, the best-fit line will, in general, not pass through the origin.
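Both the coefficient of determination from Eq. (10) and the standard error of the estimate from Eq. (11) can be computed with a few lines of Python. The sketch below assumes the fitted constants a and b are already available (for example from the least-squares sketch above); the function name is illustrative.

```python
import math

def goodness_of_fit(x, y, a, b):
    """Return r^2 (Eq. 10) and S_y,x (Eq. 11) for the fitted line Y = a*x + b."""
    n = len(x)
    y_mean = sum(y) / n
    ss_res = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    s_yx = math.sqrt(ss_res / (n - 2))
    return r_squared, s_yx
```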


With the origin fixed at (0,0), the fitted line will have the form

$$Y = ax \qquad (12)$$
The value of a is calculated from

$$a = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} \qquad (13)$$

3. CONCLUSION
The use of correlation between variables is important when setting up hypotheses in scientific papers, in which the observed link between two variables is used to argue for a causal link that the correlation itself can never prove. Good knowledge of the rules for the use of the correlation coefficient is required in order to avoid arriving at wrong conclusions. For the experimental use of correlation it is usually necessary to find an appropriate mathematical function that closely matches the problem, after which the assumed correlation is checked by testing. Correlation is often used to verify test results: after testing, the correlation between the test and the obtained results needs to be determined, and when the testing is repeated, the correlation between the new and the previously obtained results is determined once again. If no such correlation exists, we usually find that the conducted experiment is very unstable, since the repeated experiments cannot replicate the previous results. Finally, it is important to point out that correlation does not represent the dependence of the variables on each other and that it does not entail causation.

4. LITERATURE
[1] Wheeler A. J. & Ganji A. R. (2010). Introduction to Engineering Experimentation.
[2] Correlation: http://matheguru.com/stochastik/272-korrelation.html (March 2017)
[3] Osnove zdravstvene statistike: Metode istraživanja u fizioterapiji: https://ldap.zvu.hr/~oliverap/MetodeIstrazivanjaFT/11_Korelacija.pdf (March 2017)
[4] Crashkurs-statistik: Korrelation: http://www.crashkurs-statistik.de/spearman-korrelation-rangkorrelation/ (March 2017)
