Stat 4
Band Weight
64955 1.749
65318 2.551
64612 1.768
64393 2.327
64092 2.127
... ...
Generate a frequency histogram of male kiwi weights. This distribution represents the population (all possible
observations) of male kiwi weights. Note that this is the statistical population and not a biological population -
obviously a biological population entirely lacking in females would not last long!
Since we have the weights of all male kiwi in the population, it is possible to calculate population parameters (such
as the population mean and standard deviation) directly!
Q1-2. What are the mean (a measure of location) and standard deviation (a measure of spread) of the
population?
a. Mean
b. SD
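As a rough illustration of Q1-2, the sketch below computes a population mean and standard deviation in Python. It uses only the five weights printed in the table above, because the remaining rows are elided, so its output is not the answer for the full data set. The variable names are assumptions made for this example.

```python
from statistics import fmean, pstdev

# Only the five weights shown in the table above; the remaining rows are elided,
# so these values illustrate the calculation rather than answer Q1-2.
weights_kg = [1.749, 2.551, 1.768, 2.327, 2.127]

# Population parameters: pstdev divides by N, not N - 1, because the
# observations are treated as the entire statistical population.
pop_mean = fmean(weights_kg)
pop_sd = pstdev(weights_kg)

print(f"mean = {pop_mean:.3f} kg, SD = {pop_sd:.3f} kg")
```

Note the use of pstdev rather than stdev: the sample version divides by N - 1 and is only appropriate when the observations are a sample from a larger population.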
Assuming the population is normally distributed, it is possible to calculate the probability that a randomly
recaptured male kiwi will weigh more than a particular value, less than a particular value, or between two
particular values. This probability is just the area under the corresponding region of the normal distribution and
can be calculated from the population mean and standard deviation.
Q1-3. Assuming that the population is normally distributed, what is the probability of recapturing a
male little spotted kiwi that weighs greater than 2.9 kg?
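The areas described above can be sketched with Python's standard library. The mean and standard deviation below are hypothetical placeholders, not the values from Q1-2, and the 1.5 kg and 2.5 kg limits are likewise chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical placeholders; substitute the population mean and SD from Q1-2.
pop_mean = 2.0  # kg (assumed)
pop_sd = 0.3    # kg (assumed)

kiwi = NormalDist(mu=pop_mean, sigma=pop_sd)

# cdf(x) is the area under the curve to the left of x.
p_greater = 1 - kiwi.cdf(2.9)              # P(weight > 2.9 kg)
p_less = kiwi.cdf(1.5)                     # P(weight < 1.5 kg)
p_between = kiwi.cdf(2.5) - kiwi.cdf(1.5)  # P(1.5 kg < weight < 2.5 kg)

print(p_greater, p_less, p_between)
```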
For data sets with large numbers of observations, the distribution of observations can be examined via a
histogram - as demonstrated above. However, histograms are only meaningful for summarizing large data sets.
For smaller data sets other exploratory tools (such as boxplots) are necessary. To appreciate the relationship
between boxplots and the underlying distribution of data, construct a boxplot of male kiwi weights.
Before continuing, make sure you are clear on what the observations, variables and populations are.
Construct a boxplot of dissolved organic carbon (DOC) from the sample observations.
Provided the data were collected without bias (ideally random) and with adequate replication, the sample should
reflect the entire population. Therefore sample statistics should be good estimates of the population
parameters.
The mean of a sample is considered to be a location characteristic of the sample. Along with the mean, it is often
desirable to characterize the spread of data in a sample - that is to determine how variable the sample is.
For most purposes, the sample itself is of little interest - it is purely used to estimate the population. Therefore it is
necessary to be able to estimate how well the sample mean estimates the true population mean. The standard
error (SE) of the mean is a measure of the precision of the sample mean.
Following on from the idea of the precision of the mean is the concept of confidence intervals: an interval
is calculated that we are 95% confident will contain the true population mean.
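A minimal sketch of the standard error and an approximate 95% confidence interval follows. The DOC values are hypothetical, and the interval uses the normal critical value (about 1.96) because the Python standard library has no t distribution. For a small sample, statistical software would normally use the t critical value with n - 1 degrees of freedom, which gives a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

# Hypothetical DOC sample (assumed values, for illustration only).
doc_sample = [3.2, 4.1, 2.8, 5.0, 3.7, 4.4, 3.9, 4.6]

n = len(doc_sample)
sample_mean = fmean(doc_sample)
sample_sd = stdev(doc_sample)   # sample SD, divides by n - 1
se = sample_sd / sqrt(n)        # standard error of the mean

# Normal approximation to the 95% critical value.
z = NormalDist().inv_cdf(0.975)
ci = (sample_mean - z * se, sample_mean + z * se)

print(f"mean = {sample_mean:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```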
Many statistical analyses assume that the population from which the sample was collected is normally distributed.
However, biological data are not always normally distributed. If the data are skewed, try a log transformation to bring them closer to normality.
Earlier we identified the presence of an outlier in the DOC variable. To investigate the impact of this outlier on a
range of summary statistics, calculate the following measures of location (mean and median) and spread
(standard deviation and interquartile range) for DOC, with and without the outlying observation and complete
the table below.
Median
Variance
Standard deviation
Inter-quartile range
Q2-10. Which measures of location and spread are most robust to inclusion and exclusion of a single
unusual observation?
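The with-and-without comparison asked for above can be sketched as follows. The DOC values are hypothetical, with the last one playing the role of the outlier, and summarize is a helper name invented for this example. Quartiles are computed with Python's default "exclusive" method, which may differ slightly from the quartile definition used by other software.

```python
from statistics import fmean, median, quantiles, stdev

def summarize(values):
    """Return location and spread measures for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    return {
        "mean": fmean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical DOC values; the final observation is the assumed outlier.
doc = [2.1, 2.4, 2.6, 2.9, 3.1, 3.3, 3.6, 12.0]

with_outlier = summarize(doc)
without_outlier = summarize(doc[:-1])

for name in with_outlier:
    print(f"{name:>6}: {with_outlier[name]:.3f} (with) vs {without_outlier[name]:.3f} (without)")
```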
Q3-1. For percentage plant cover, calculate the following summary statistics separately for each
colony type and complete the table below.
Variance
Standard deviation
Coefficient of variation
b. Which is the most variable when corrected for the mean? (N, R or B)
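The per-group statistics in Q3-1 can be sketched as below. The plant cover values are hypothetical, and the dictionary layout is an assumption made for this example. The coefficient of variation is taken as the standard deviation expressed as a percentage of the mean, which is what "corrected for the mean" refers to.

```python
from statistics import fmean, stdev, variance

# Hypothetical % plant cover for each colony type (N, R, B); assumed values.
cover_by_colony = {
    "N": [12.0, 18.5, 9.0, 22.0, 15.5],
    "R": [30.0, 41.5, 36.0, 28.5, 44.0],
    "B": [5.0, 20.0, 2.5, 35.0, 11.0],
}

summary = {}
for colony, cover in cover_by_colony.items():
    mean = fmean(cover)
    sd = stdev(cover)
    summary[colony] = {
        "mean": mean,
        "var": variance(cover),
        "sd": sd,
        "cv": 100 * sd / mean,  # coefficient of variation, in percent
    }

for colony, stats in summary.items():
    print(colony, {k: round(v, 2) for k, v in stats.items()})
```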
Normality
Before proceeding, make sure you are familiar with the significance of normally distributed sample data and thus
why it is necessary to examine the distribution of sample data as part of routine exploratory data analysis
(EDA) prior to any formal data analysis.
Q3-2. Construct a boxplot for total 1996 beetle abundance for each colony type separately.
c. Now transform the response variable to logs and redraw the boxplots. Does this
change (improve?) the shape of the distributions? (Y or N)
Linearity
Often it is necessary to examine the nature of the relationship or association between variables as part of
routine exploratory data analysis (EDA) prior to any formal data analysis. The nature of relationships/associations
between continuous data is explored using scatterplots.
Q3-3. Construct a scatterplot for beetle abundance against total 1996 plant cover.
b. Note that the boxplots also enable us to explore the normality of both variables
(populations). Is there any evidence of non-normality? (Y or N)
Sánchez-Piñero & Polis (2000) measured a number of continuous variables (% cover of guano, % cover of plants
and abundance of beetles). Therefore, they might be interested in exploring the relationships between each of
these variables. That is, the relationship between guano and plants, guano and beetles, and beetles and plants.
While it is possible to create separate scatterplots for each pair (in this case three separate scatterplots), a
scatterplot matrix is usually more informative and efficient.
Q3-4. Construct a scatterplot matrix or SPLOM for % of guano, % of plant cover and beetle
abundance. Are there any obvious relationships?
Homogeneity of variance
Many statistical hypothesis tests assume that populations are equally varied. For hypothesis tests that compare
populations (such as t-tests - see Question 4), it is important that one of the populations is not substantially more
or less variable than the other population(s). Thus, such tests assume homogeneity of variance.
Q3-5. Construct and examine boxplots of beetle abundance for each of the three colony types.
b. Try square-root transforming the beetle variable (function is sqrt) and using this
transformed variable to reconstruct the boxplots. A square-root transformation is
preferred over a log transformation for count data, since log(0) is undefined. Note that
it may be necessary to perform a fourth-root transformation (that is, applying the
square-root transformation twice) in order to normalize these highly skewed data. This
can be done using the expression sqrt(sqrt(BEETLE96)). If this successfully normalizes
the data, focus on whether there is any evidence that the populations are equally varied.
Is there any evidence that the assumption of homogeneity of variance is violated? (Y or
N)
c. Try calculating the variance or standard deviation of beetle abundance for each
colony type separately (remember to use the transformed data, as the raw data was
obviously non-normal and non-normality often results in unequal variances). Do these
values provide any evidence for unequally varied populations? (Y or N)
d. The primary concern of the equal variance assumption is that there should not be a
relationship between population mean and variance. Use the sample statistics to plot
mean against variance for the transformed beetle abundance data. Any evidence of a
relationship between mean and variance? (Y or N)
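Parts b to d above can be sketched in Python as follows. The beetle counts are hypothetical, and fourth_root and group_stats are names invented for this example. The worksheet's own expression sqrt(sqrt(BEETLE96)) corresponds to the fourth_root helper.

```python
from math import sqrt
from statistics import fmean, variance

def fourth_root(x):
    """Apply the square-root transformation twice; defined for counts of zero."""
    return sqrt(sqrt(x))

# Hypothetical beetle counts per colony type (assumed values, including zeros).
beetle96 = {
    "N": [0, 1, 4, 16, 81],
    "R": [2, 10, 35, 120, 400],
    "B": [0, 0, 3, 9, 50],
}

# Mean and variance of the transformed counts for each colony type.
group_stats = {}
for colony, counts in beetle96.items():
    transformed = [fourth_root(c) for c in counts]
    group_stats[colony] = (fmean(transformed), variance(transformed))

# If variance rises steadily with the mean, equal variance is doubtful.
for colony, (mean, var) in sorted(group_stats.items(), key=lambda kv: kv[1][0]):
    print(f"{colony}: mean = {mean:.3f}, variance = {var:.3f}")
```

Plotting the printed means against the variances gives the mean-variance plot asked for in part d.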
Q4-1. The researchers were interested in testing whether there is a difference in the metabolic rate of
male and female breeding northern fulmars. In light of this, list the following:
The appropriate statistical test for testing the null hypothesis that the means of two independent populations are
equal is a t-test.
Before proceeding, make sure you understand what is meant by normality and equal variance as well as the
principles of hypothesis testing using a t-test.
Q4-2. For the null hypothesis test of interest (that the mean population metabolic rate of males and
females were the same), calculate the Degrees of freedom
Q4-3. Calculate the critical t-values for the following null hypotheses (α = 0.05)
a. The metabolic rate of males is higher than that of females (one-tailed test)
b. The metabolic rate of males is the same as that of females (two-tailed test)
Since most hypothesis tests follow the same basic procedure, confirm that you understand the basic steps of
hypothesis tests.
Q4-4. In the table below, list the assumptions of a t-test along with how violations of each assumption
are diagnosed and/or the risks of violations are minimized.
II.
III.
We wish to investigate whether male and female fulmars have the same metabolic rates, and we
intend to use a t-test to test the null hypothesis that the population mean metabolic rate of males is equal to the
population mean metabolic rate of females. Having identified the important assumptions of a t-test, use the
samples to evaluate whether the assumptions are likely to be violated and thus whether a t-test is likely to be
reliable.
Q4-6. Perform a t-test to examine the effect of sex on the mass of fulmars using either (whichever is
most appropriate) a pooled variance t-test (for when the population variances are very similar) or a
separate variance t-test (for when the variance of one population is likely to be up to 2.5 times greater
or less than that of the other population). Ensure that you are familiar with the output of a t-test.
a. What is the t-value? (Excluding the sign. The sign will depend on whether you
compared males to females or females to males, and thus only indicates which group
had the higher mean).
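To make the difference between the two versions of the test concrete, the sketch below computes the t-value and degrees of freedom by hand for both. The male and female values are hypothetical, and pooled_t and welch_t are names invented for this example. The separate variance version uses the Welch-Satterthwaite approximation for the degrees of freedom. Converting t to a P value needs the t distribution, which the Python standard library does not provide, so that step is left to statistical software.

```python
from math import sqrt
from statistics import fmean, variance

def pooled_t(a, b):
    """Pooled variance t-test: returns (t, df) with df = n1 + n2 - 2."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (fmean(a) - fmean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

def welch_t(a, b):
    """Separate variance t-test: returns (t, df) with Welch-Satterthwaite df."""
    v1, v2 = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(a) - 1) + v2 ** 2 / (len(b) - 1))
    return t, df

# Hypothetical measurements for each sex (assumed values).
males = [1520.0, 1940.0, 2100.0, 1310.0, 1760.0, 2250.0]
females = [1180.0, 1420.0, 990.0, 1650.0, 1300.0, 1510.0]

print("pooled:  ", pooled_t(males, females))
print("separate:", welch_t(males, females))
```

Swapping the order of the two arguments changes only the sign of t, as noted in part a.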
Q4-7. Write the results out as though you were writing a research paper/thesis. For example (select
the phrase that applies and fill in gaps with your results):
The mean metabolic rate of male fulmars was (choose correct option)
(choose correct option) (t = , df = ,P =
)
the mean metabolic rate of female fulmars.
Q4-8. Construct a bar graph showing the mean metabolic rate of male and female fulmars and an
indication of the precision of the means with error bars.
Q5-1. What is an appropriate statistical test for testing a hypothesis about the difference in
dimensions of webs spun in light versus dark conditions? Explain why.
Q5-2. The actual H0 is that the mean of the differences between the pairs (light and dim for each
spider) equals zero. Use a paired t-test to test the H0 that the mean of the differences between the
pairs (light and dim for each spider) equals zero, separately for the vertical diameter and the
horizontal diameter of the web.
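The paired t-test reduces to a one-sample test on the within-spider differences, which the sketch below makes explicit. The web diameters are hypothetical, and paired_t is a name invented for this example. As with any hand calculation using only the standard library, the P value is left to statistical software.

```python
from math import sqrt
from statistics import fmean, stdev

def paired_t(first, second):
    """Paired t-test on matched observations: returns (t, df) with df = n - 1."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(first, second)]  # one difference per spider
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / sqrt(n))     # H0: mean difference = 0
    return t, n - 1

# Hypothetical vertical web diameters for the same spiders under each condition.
vertical_dim = [210.0, 185.0, 240.0, 198.0, 225.0, 260.0]
vertical_light = [190.0, 180.0, 205.0, 200.0, 195.0, 230.0]

t_vertical, df_vertical = paired_t(vertical_dim, vertical_light)
print(f"t = {t_vertical:.3f}, df = {df_vertical}")
```

The same function would be called a second time with the horizontal diameters.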
Q5-3. Write the results out as though you were writing a research paper/thesis. For example (select
the phrase that applies and fill in gaps with your results):
The mean vertical diameter of spider webs in dim conditions was (choose correct option)
(choose correct option) (t = , df = ,P =
)
the vertical dimensions in light conditions.
The mean horizontal diameter of spider webs in dim conditions was (choose correct option)
(choose correct option) (t = , df = ,P =
)
the horizontal dimensions in light conditions.
Since the male and female fulmars were all independent of one another, a t-test would be appropriate to test the
null hypothesis of no difference in mean body weight of male and female fulmars.
Q6-1. Are the assumptions underlying this test met? (Y or N) Hint: check the relative sizes of the two
sample variances and the distribution of body weight for each sex.
When the distributional assumptions are violated, parametric tests are unreliable. Under these circumstances,
non-parametric tests can be very useful.
Q6-2. The Wilcoxon-Mann-Whitney test is described as a non-parametric test for comparing two
groups.
Q6-3. If the assumptions are met, test the null hypothesis of no difference in body weight between
male and female fulmars using a Wilcoxon test. Based on this outcome, what are your conclusions?
a. Statistical:
b. Biological (include trend):
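To show what the Wilcoxon-Mann-Whitney test actually measures, the sketch below computes its U statistic by counting, over all male-female pairs, how often one observation exceeds the other (ties count one half). The body weights are hypothetical, and mann_whitney_u is a name invented for this example. Only the statistic is computed here. The P value, which statistical software obtains from exact tables or a normal approximation, is not.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b): pairwise wins for each group, ties counting 0.5."""
    u_a = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )
    u_b = len(a) * len(b) - u_a  # the two statistics always sum to n_a * n_b
    return u_a, u_b

# Hypothetical body weights (g) for each sex; assumed values.
male_mass = [760.0, 815.0, 890.0, 905.0, 870.0, 940.0]
female_mass = [635.0, 700.0, 745.0, 690.0, 815.0, 720.0]

u_male, u_female = mann_whitney_u(male_mass, female_mass)
print(f"U (males) = {u_male}, U (females) = {u_female}")
```

Because the statistic depends only on the ordering of the observations, not on their actual values, it is unaffected by skew or outliers in body weight.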
Q6-4. Construct a bar graph showing the mean mass of male and female fulmars and an indication of
the precision of the means with error bars.