TAMS65 Assignment 1
Examiner: Jörg-Uwe Löbus
Background
The purpose of this report is to analyze a dataset from a biomedical experiment that relates temperature
and humidity levels to bacterial growth. The report examines how the data are distributed, fits a linear
regression to the log-transformed data, and uses that regression to predict the bacterial growth on a day
with a given set of conditions.
Correlation
The correlation coefficient measures the strength of the linear relationship between the explanatory
variables and the response variable. It ranges from -1 to 1, and a magnitude of 1 indicates a perfectly
linear relationship. The data from the experiment (shown as a scatter plot in Figure 1) visibly do not
follow a linear relationship; instead, the bacterial growth increases exponentially. The correlation
coefficient was calculated to be 0.6459. According to Hinkle et al., a correlation coefficient between
0.50 and 0.70 should only be considered a moderate correlation, while a coefficient of 0.70 or higher
would be preferable for applying linear regression to the data.
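
As a rough illustration of the calculation described above, the following sketch computes a Pearson
correlation coefficient from paired observations and attaches a rule-of-thumb label to it. The function
names `pearson_r` and `hinkle_label` are hypothetical. The report only quotes the 0.50-0.70 and 0.70-0.90
bands; the remaining cut-offs are an assumption based on the commonly quoted version of the Hinkle et al.
table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def hinkle_label(r):
    """Rule-of-thumb label for the size of r (cut-offs partly assumed, see text)."""
    size = abs(r)
    if size >= 0.90:
        return "very high"
    if size >= 0.70:
        return "high"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "low"
    return "negligible"


# The two coefficients reported in this section.
print(hinkle_label(0.6459))
print(hinkle_label(0.7404))
```

The label depends only on the magnitude of r, so a strong negative relationship is labelled the same way
as a strong positive one.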
To obtain values that are better suited to a linear regression, the response is transformed using the
natural logarithm (seen in Figure 2). After the transformation, the correlation coefficient was found to
be 0.7404. A correlation coefficient of 0.70-0.90 is considered a high positive correlation, which makes
the transformed data considerably better suited to a linear model.
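
The effect of the transformation can be sketched on made-up data. The example below assumes a single
explanatory variable and noise-free exponential growth, which is simpler than the model in this report
(temperature and humidity). The helper `fit_line` and the constants 100 and 0.3 are hypothetical and
chosen only for illustration.

```python
import math


def fit_line(xs, zs):
    """Ordinary least squares for z = b0 + b1 * x with one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    b1 = sxz / sxx
    b0 = mean_z - b1 * mean_x
    return b0, b1


# Made-up exponential data: y = 100 * exp(0.3 * x). Not the experimental data.
xs = [float(x) for x in range(10)]
ys = [100.0 * math.exp(0.3 * x) for x in xs]

# Taking the natural logarithm turns the exponential curve into a straight line.
zs = [math.log(y) for y in ys]
b0, b1 = fit_line(xs, zs)

# b1 estimates the growth rate; exp(b0) estimates the multiplicative constant.
print(b0, b1, math.exp(b0))
```

With real, noisy measurements the fitted line would only approximate the transformed points, which is
why the residuals are examined later in the report.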
Figure 1: Data collected from 30 trials of the experiment
Figure 3: Transformed data with fitted linear regression
Prediction
Using the regression, a prediction interval for z = log(Y) was computed under the constraints that x1 is
25 degrees Celsius and x2 is 0. The 95% prediction interval for z is (9.7739, 11.8168). Applying the
inverse of the logarithm (the exponential function) to the lower and upper bounds gives the interval
(17569, 135509) for Y. Thus, the expected value of the
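
The back-transformation of the interval bounds can be reproduced with a short calculation. The sketch
below assumes that log denotes the natural logarithm, so the inverse is the exponential function; the
variable names are hypothetical.

```python
import math

# 95% prediction interval for z = log(Y), as reported in this section.
z_lower, z_upper = 9.7739, 11.8168

# exp is strictly increasing, so the bounds for z map directly to bounds for Y.
y_lower = math.exp(z_lower)
y_upper = math.exp(z_upper)

print(f"{y_lower:.0f} {y_upper:.0f}")
```

Because the bounds for z are rounded to four decimals, the results agree with the reported interval for
Y only up to rounding. The interval for Y is not symmetric around exp of the predicted z, since the
exponential function stretches the upper part of the interval more than the lower part.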
Distribution
To check whether the residuals (errors) are normally distributed, a histogram was created using the
software R. The output (shown in Figure 4) closely resembles the bell-shaped curve of a normal
distribution. This shape suggests that the errors of the model are approximately normally distributed.
Figure 5: The residual values plotted as a histogram
Furthermore, a normal probability plot (Q-Q plot) was generated. It shows a clear linear relationship
between the residuals and the quantiles that would be expected if the residuals were normally
distributed.
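
The points of such a plot can be computed as follows. This is a minimal sketch rather than the procedure
used by R: the function name `normal_qq_points`, the plotting positions (i - 0.5) / n, and the residual
values are all assumptions made for illustration.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are one common convention, not the only one.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up residuals for illustration only; these are not the report's data.
example_residuals = [-0.9, 0.3, -0.2, 1.1, 0.0, -0.5, 0.6, -0.1]
points = normal_qq_points(example_residuals)

for quantile, residual in points:
    print(f"{quantile:7.3f} {residual:6.2f}")
```

If the residuals are approximately normal, these pairs fall close to a straight line when plotted.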
Figure 7 shows the residuals plotted against the fitted values, where the assumed mean is 0. The red
line in the figure, loosely fitted to the data points, barely deviates from the expected mean value of 0,
indicating that the normal distribution suggested by the previous plots is centered at 0. With this final
graph we can conclude that ε ∼ N(0, σ).
Conclusion
The experimental data closely follow an exponential relationship. Once the bacterial growth is
transformed logarithmically, it can be modeled using a linear regression with an error that is normally
distributed around 0. Using this regression, a prediction interval for the bacterial growth during a
fixed time can be obtained: for a temperature of 25 degrees Celsius and low humidity, one can expect
between 17569 and 135509 new bacteria (95% prediction interval).
References
Hinkle DE, Wiersma W, Jurs SG. Applied Statistics for the Behavioral Sciences. 5th ed. Boston:
Houghton Mifflin; 2003.
Liu Z. Lecture slides of the course TAMS65, Mathematical Statistics. Linköping University.