Chapter 2-Data Preparation

Uploaded by

Alexsandra Tosta
CHAPTER 2

Data preparation

Preparing data before analysis

Before ecological data can be analysed, they need to be prepared and put into the right format. Data that are entered in the wrong format cannot be analysed or will yield wrong results.
Different statistical programs require data in different formats. You should consult the manual of the statistical software to find out how data need to be prepared. Alternatively, you could check example datasets. An example of data preparation for the R package is presented at the end of this chapter.
Before you embark on the data analysis, it is essential to check for mistakes in data entry. If you detect mistakes later in the analysis, you would need to start the analysis again and could have lost considerable time. Mistakes in data entry can often be detected as exceptional values. The best way to begin analysing your results is therefore to check the data.

An example of species survey data

Imagine that you are interested in investigating the hypothesis that soil depth influences tree species diversity. The data that will allow you to test this hypothesis are data on soil depth and data on diversity collected for a series of sample plots. We will see in a later chapter that diversity can be estimated from information on the species identity of every tree. Figure 2.1 shows species and soil depth data for the first four sample plots that were inventoried (to test the hypothesis, we need several sample plots that span the range from shallow to deep soils). For site A, three species were recorded (S1, S2 and S3) and a soil depth of 1 m. For site B, only two species were recorded (S1 with four trees and S3 with one tree) and a soil depth of 2 m.

Figure 2.1 A simplified example of information recorded on species and environmental data.

The species information from Figure 2.1 can be recorded as follows:

Site   Species S1 (count)   Species S2 (count)   Species S3 (count)
A      1                    1                    1
B      4                    0                    1
C      2                    2                    0
D      0                    1                    2

The environmental information from Figure 2.1 can be recorded in a similar fashion:

Site   Soil depth (m)
A      1.0
B      2.0
C      0.5
D      1.5

This chapter deals with the preparation of data matrices such as the two matrices given above. Note that the example of Figure 2.1 is simplified: typical species matrices have more than 100 rows and more than 100 columns. These matrices can be used as input for the analyses shown in the following chapters. They can be generated by a decent data management system. These matrices are usually not the ideal method of capturing, entering and storing data. Recording species data in the field is typically done with data collection forms that are filled for each site separately and that contain tables with a single column for the species name and a single column for the abundance. This is also the ideal method of storing species data.

A general format for species survey data

As seen above, all information can be recorded in the form of data matrices. All the types of data that are described in this manual can be prepared as two matrices: the species matrix and the environmental matrix. Table 2.1 shows a part of the species matrix for a well-studied dataset in community ecology, the dune meadow dataset. This dataset contains 30 species of which only 13 are presented. The data were collected on the vegetation of meadows on the Dutch island of Terschelling (Jongman et al. 1995). Table 2.2 shows the environmental data for this dataset. You can notice that the rows of both matrices have the same names – they reflect the data that were collected for each site or sample unit. Sites could be sample plots, sample sites, farms, biogeographical provinces, or other identities. Sites are defined as the areas from which data were collected during a specific time period. We will use the term “site” further on in this manual. Sites will always refer to the rows of the datasets.
Some studies involve more than one type of sampling unit, often arranged hierarchically. Examples are villages, farms within a village and plots within a farm. Sites of different types (such as plots, villages and districts) should not be mixed within the same data matrix. Each site of the matrix should be of the same type of sampling unit.
The columns of the matrices indicate the variables that were measured for each site. The cells of the matrices contain observations – bits of data recorded for a specific site and a specific variable. We prefer using rows to represent samples and columns to represent variables to the alternative form where rows represent variables. Our preference is simply based on the fact that some general statistical packages use this format. Data can be presented by swapping rows and columns, since the contents of the data will remain the same.
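
The step from field records (one row per site and species) to the two matrices of the Figure 2.1 example can be sketched in R as follows. This is only an illustration in base R: the object and column names (records, site, species, abundance, soil.depth) are assumptions made for the example and are not part of the original dataset.

```r
# Field-form records for the Figure 2.1 example: one row per site and species.
# The column names are illustrative assumptions.
records <- data.frame(
  site      = c("A", "A", "A", "B", "B", "C", "C", "D", "D"),
  species   = c("S1", "S2", "S3", "S1", "S3", "S1", "S2", "S2", "S3"),
  abundance = c(1, 1, 1, 4, 1, 2, 2, 1, 2)
)

# Cross-tabulate to obtain the species matrix (sites as rows, species as columns).
# Species that were not recorded at a site receive an explicit 0.
species.matrix <- as.data.frame.matrix(
  xtabs(abundance ~ site + species, data = records)
)

# Environmental matrix with the same row names as the species matrix.
env.matrix <- data.frame(
  soil.depth = c(1.0, 2.0, 0.5, 1.5),
  row.names  = c("A", "B", "C", "D")
)

print(species.matrix)
print(env.matrix)

# Both matrices should describe the same sites in the same order.
stopifnot(identical(rownames(species.matrix), rownames(env.matrix)))
```

Keeping the long records as the stored data and generating the matrix on demand follows the recommendation above that the matrix is an analysis format rather than a storage format.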
Table 2.1 An example of a species matrix, where rows correspond to sites, columns correspond to species and cell entries are the abundance of the
species at a particular site

Site Achmil Agrsto Airpra Alogen Antodo Belper Brarut Brohor Calcus Chealb Cirarv Elepal Elyrep …
X1 1 0 0 0 0 0 0 0 0 0 0 0 4 …
X2 3 0 0 2 0 3 0 4 0 0 0 0 4 …
X3 0 4 0 7 0 2 2 0 0 0 0 0 4 …
X4 0 8 0 2 0 2 2 3 0 0 2 0 4 …
X5 2 0 0 0 4 2 2 2 0 0 0 0 4 …
X6 2 0 0 0 3 0 6 0 0 0 0 0 0 …
X7 2 0 0 0 2 0 2 2 0 0 0 0 0 …
X8 0 4 0 5 0 0 2 0 0 0 0 4 0 …
X9 0 3 0 3 0 0 2 0 0 0 0 0 6 …
X10 4 0 0 0 4 2 2 4 0 0 0 0 0 …
X11 0 0 0 0 0 0 4 0 0 0 0 0 0 …
X12 0 4 0 8 0 0 4 0 0 0 0 0 0 …
X13 0 5 0 5 0 0 0 0 0 1 0 0 0 …
X14 0 4 0 0 0 0 0 0 4 0 0 4 0 …
X15 0 4 0 0 0 0 4 0 0 0 0 5 0 …
X16 0 7 0 4 0 0 4 0 3 0 0 8 0 …
X17 2 0 2 0 4 0 0 0 0 0 0 0 0 …
X18 0 0 0 0 0 2 6 0 0 0 0 0 0 …
X19 0 0 3 0 4 0 3 0 0 0 0 0 0 …
X20 0 5 0 0 0 0 4 0 3 0 0 4 0 …
Table 2.2 An example of an environmental matrix, where rows correspond to sites and columns correspond to
variables

Site A1 Moisture Management Use Manure


X1 2.8 1 SF Haypastu 4
X2 3.5 1 BF Haypastu 2
X3 4.3 2 SF Haypastu 4
X4 4.2 2 SF Haypastu 4
X5 6.3 1 HF Hayfield 2
X6 4.3 1 HF Haypastu 2
X7 2.8 1 HF Pasture 3
X8 4.2 5 HF Pasture 3
X9 3.7 4 HF Hayfield 1
X10 3.3 2 BF Hayfield 1
X11 3.5 1 BF Pasture 1
X12 5.8 4 SF Haypastu 2
X13 6 5 SF Haypastu 3
X14 9.3 5 NM Pasture 0
X15 11.5 5 NM Haypastu 0
X16 5.7 5 SF Pasture 3
X17 4 2 NM Hayfield 0
X18 4.6 1 NM Hayfield 0
X19 3.7 5 NM Hayfield 0
X20 3.5 5 NM Hayfield 0

The species matrix

The species data are included in the species matrix. This matrix shows the values for each species and for each site (see data collection for various types of samples). For example, the value of 5 was recorded for species Agrostis stolonifera (coded as Agrsto) and for site 13. Another name for this matrix is the community matrix.
The species matrix often contains abundance values – the number of individuals that were counted for each species. Sometimes species data reflect the biomass recorded for each species. Biomass can be approximated by percentage cover (typical for surveys of grasslands) or by cross-sectional area (the surface area of the stem, typical for forest surveys). Some survey methods do not collect precise values but collect values that indicate a range of possible values, so that data collection can proceed faster. For instance, the value of 5 recorded for species Agrostis stolonifera and for site 13 indicates a range of 5-12.5% in cover percentage. The species matrix should not contain a range of values in a single cell, but a single number (the database can contain the range that is used to calculate the coding for the range). An extreme method of collecting data that only reflect a range of values is the presence-absence scale, where a value of 0 indicates that the species was not observed and a value of 1 shows that the species was observed.
A site will often only contain a small subset of all the species that were observed in the whole survey. Species distribution is often patchy. Species data will thus typically contain many zeros. Some statistical packages require that you are explicit that a value of zero was collected – otherwise the software could interpret an empty cell in a species matrix as a missing value. Such a missing value will not be used for the analysis, so you could obtain erroneous results if the data were recorded as zero but treated as missing.
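
The difference between a missing value and an explicit zero, and the presence-absence scale, can be illustrated with a small sketch in base R. The matrix below is hypothetical (it is the Figure 2.1 example with two absences deliberately left empty), and the object names are assumptions made for the illustration.

```r
# Hypothetical species matrix in which two absences were left as empty cells,
# which R represents as missing values (NA).
species.matrix <- data.frame(
  S1 = c(1, 4, 2, NA),
  S2 = c(1, NA, 2, 1),
  S3 = c(1, 1, 0, 2),
  row.names = c("A", "B", "C", "D")
)

# Count the missing cells before deciding how to treat them.
n.missing <- sum(is.na(species.matrix))

# Only if an empty cell truly means "not observed": record an explicit zero.
species.matrix[is.na(species.matrix)] <- 0

# Presence-absence scale: 1 if the species was observed, 0 otherwise.
presence.absence <- as.data.frame((species.matrix > 0) * 1)

# Number of species recorded at each site.
species.per.site <- rowSums(presence.absence)

print(n.missing)
print(presence.absence)
print(species.per.site)
```

Replacing NA by 0 is only appropriate when you are sure the species was looked for and not found; a genuinely unrecorded value should stay missing.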
The environmental matrix

The environmental dataset is more typical of the type of dataset that a statistical package normally handles. The columns in the environmental dataset contain the various environmental variables. The rows indicate the sites for which the values were recorded. The environmental variables can be referred to as explanatory variables for the types of analysis that we describe in this manual. Some people prefer to call these variables independent variables, and others prefer the term x variables. For instance, the information on the thickness of the A1 horizon of the dune meadow dataset shown in Table 2.2 can be used as an explanatory variable in a model that explains where species Agrostis stolonifera occurs. The research hypotheses will have indicated which explanatory variables were recorded, since an infinite number of environmental variables could be recorded at each site.
The environmental dataset will often contain two types of variables: quantitative variables and categorical variables.
Quantitative variables such as the thickness of the A1 horizon of Table 2.2 contain observations that are measured quantities. The observation for the A1 horizon of site 1 was for example recorded by the number 2.8. Various statistics can be calculated for quantitative variables that cannot be calculated for categorical variables. These include:
• The mean or average value
• The standard deviation (this value indicates how close the values are to the mean)
• The median value (the middle value when values are sorted from low to high) (synonyms for this value are the 50% quantile or 2nd quartile)
• The 25% and 75% quantiles = 1st and 3rd quartiles (the values for which 25% or 75% of values are smaller when values are sorted from low to high)
• The minimum value
• The maximum value

For the thickness of the A1 horizon of Table 2.2, we obtain the following summary statistics.

Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
2.800   3.500    4.200   4.850   5.725    11.500

These statistics summarize the values that were obtained for the quantitative variable. Another method by which the values for a quantitative variable can be summarized is a boxplot graph as shown in Figure 2.2. The whiskers show the minimum and maximum of the dataset, except if some values are farther than 1.5 × the interquartile range (the difference between the 1st and 3rd quartile) from the box (the 1st or 3rd quartile); such values are plotted as separate points. Note that various software packages or options within such packages will result in different statistics being portrayed in boxplot graphs – you may want to check the documentation of your particular software package. An important feature of Figure 2.2 is that it shows that there are some outliers in the dataset. If your data are normally distributed, then you would only rarely (less than 1% of the time) expect to observe an outlier. If the boxplot indicates outliers, check whether you entered the data correctly (see the section on checking for exceptional observations).
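
The statistics listed for quantitative variables can be computed for the A1 horizon with a few base R calls. The sketch below types the 20 values in from Table 2.2 rather than loading them from a package, and the object names are assumptions. Note that boxplot.stats() flags values that lie more than 1.5 × the interquartile range beyond the box, which is the rule behind the outliers in a default R boxplot.

```r
# Thickness of the A1 horizon for sites X1 to X20, typed in from Table 2.2.
A1 <- c(2.8, 3.5, 4.3, 4.2, 6.3, 4.3, 2.8, 4.2, 3.7, 3.3,
        3.5, 5.8, 6.0, 9.3, 11.5, 5.7, 4.0, 4.6, 3.7, 3.5)

# Minimum, quartiles, median, mean and maximum in one call.
a1.summary <- summary(A1)

# Standard deviation and the three quartiles computed separately.
a1.sd <- sd(A1)
a1.quartiles <- quantile(A1, probs = c(0.25, 0.50, 0.75))

# Values that a default boxplot would draw as outliers.
a1.outliers <- boxplot.stats(A1)$out

print(a1.summary)
print(a1.sd)
print(a1.quartiles)
print(a1.outliers)

# Uncomment to draw the boxplot itself:
# boxplot(A1, ylab = "Thickness of A1 horizon")
```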
Figure 2.2 Summary of a quantitative variable as a boxplot. The variable that is summarized is the thickness of the
A1 horizon of Table 2.2.

Figure 2.3 Summary of a quantitative variable as a Q-Q plot. The variable that is summarized is the thickness of the
A1 horizon of Table 2.2. The two outliers (upper right-hand side) correspond to the outliers of Figure 2.2.
There are other graphical methods for checking for outliers for quantitative variables. One of these methods is the Q-Q plot. When data are normally distributed, all observations should be plotted roughly along a straight line. Outliers will be plotted further away from the line. Figure 2.3 gives an example. Another method to check for outliers is to plot a histogram. The key point is to check for the exceptional observations.
Categorical variables (or qualitative variables) are variables that contain information on data categories. The observations for the type of management for the dune meadow dataset (presented in Table 2.2) have four values: “standard farming”, “biological farming”, “hobby farming” and “nature conservation management”. The observation for the type of management is thus not a number. In statistical textbooks, categorical variables are also referred to as factors. Factors can only contain a limited number of factor levels.
The only way by which categorical variables can be summarized is by listing the number of observations or frequency of each category. For instance, the summary for the management variable of Table 2.2 could be presented as:

Category       BF   HF   NM   SF
Observations   3    5    6    6

Graphically, the summary can be represented as a barplot. Figure 2.4 shows an example for the management of Table 2.2.

Figure 2.4 Summary of a categorical variable by a bar plot. The management of Table 2.2 is summarized.

Some researchers record observations of categorical variables as a number, where the number represents the code for a specific type of value – for instance code “1” could indicate “standard farming”. We do not encourage the usage of numbers to code for factor levels since statistical software and analysts can confuse the variable with a quantitative variable. The statistical software could report erroneously that the average management type is 2.55, which does not make sense. It would definitely be wrong to conclude that the average management type would be 3 (the integer value closest to 2.55) and thus be hobby farming. A better way of recording categorical variables is to include characters. You are then specific that the value is a factor level – you could for instance use the format of “c1”, “c2”, “c3” and “c4” to code for the four management regimes. Even better techniques are to use meaningful abbreviations for the factor levels – or to just use the entire description of the factor level, since most software will not have any problems with long descriptions and you will avoid confusion of collaborators or even yourself at later stages.
Ordinal variables are somewhere between quantitative and categorical variables. The manure variable of the dune meadow dataset is an ordinal variable. Ordinal variables are not measured on a quantitative scale but the order of the values is informative. This means for manure that progressively more manure is used from manure class 0 until 4. However, since the scale is not quantitative, a value of 4 does not mean that four times more manure is used than for value 1 (if it was, then we would have a quantitative variable). For the same reason manure class 3 is not the average of manure class 2 and 4.
You can actually choose whether you treat ordinal variables as quantitative or categorical variables in the statistical analysis. In many statistical packages, when the observations of a variable only contain numbers, the package will assume that the variable is a quantitative variable. If you want the variable to be treated as a categorical variable, you will need to inform the statistical package about this (for example by using a non-numerical coding system). If you are comfortable to assume for the analysis that the ordinal variables were measured on a quantitative scale, then it is better to treat them as quantitative variables. Some special methods for ordinal data are also available.

Checking for exceptional observations that could be mistakes

The methods of summarizing quantitative and categorical data that were described in the previous section can be used to check for exceptional data. Maximum or minimum values that do not correspond to the expectations will easily be spotted. Figure 2.5 for instance shows a boxplot for the A1 horizon that contained a data entry error for site 3 as the value 43 was entered instead of 4.3. Compare with Figure 2.2. You should be aware of the likely ranges of all quantitative variables.
Some mistakes for categorical data can easily be spotted by calculating the frequencies of observations for each factor level. If you had entered “NN” instead of “NM” for one management observation in the dune meadow dataset, then a table with the number of observations for each management type would easily reveal that mistake. This method is especially useful when the number of observations is fixed for each level. If you designed your survey so that each type of management should have 5 observations, then spotting one type of management with 4 observations and one type with 1 observation would reveal a data entry error.
Some exceptional observations will only be spotted when you plot variables against each other as part of exploratory analysis, or even later when you have started conducting some statistical analysis. Figure 2.6 shows a plot of all possible pairs of the environmental variables of the dune meadow dataset. You can notice the two outliers for the thickness of the A1 horizon, which belong to sites X14 and X15 with moisture class 5 and manure class 0 in Table 2.2 (a pairs plot may show these as the fourth moisture level and the first manure level, because factor levels are plotted by their position), for instance.
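
The pitfalls and checks described in this part of the chapter can be tried out with a short base R sketch. The data are typed in from Table 2.2. The numeric coding (1 = SF, 2 = BF, 3 = HF, 4 = NM) is an assumption that happens to reproduce the value of 2.55 mentioned in the text, the two errors are simulated on purpose, and the upper limit of 20 for a plausible A1 value is likewise only an assumed example of a "likely range".

```r
# Management for sites X1 to X20, typed in from Table 2.2.
management <- c("SF", "BF", "SF", "SF", "HF", "HF", "HF", "HF", "HF", "BF",
                "BF", "SF", "SF", "NM", "NM", "SF", "NM", "NM", "NM", "NM")

# Pitfall: with numeric codes (assumed coding) the software treats the
# variable as a quantity and happily computes a meaningless average.
codes <- c(SF = 1, BF = 2, HF = 3, NM = 4)
numeric.management <- unname(codes[management])
meaningless.mean <- mean(numeric.management)

# Stored as a factor, the variable is summarized by its frequencies.
management.counts <- table(factor(management))

# Simulated typing error: "NN" instead of "NM" for the last site.
with.typo <- management
with.typo[20] <- "NN"
typo.counts <- table(with.typo)   # the extra level "NN" stands out

# Simulated entry error for a quantitative variable: 43 instead of 4.3 (site X3).
A1 <- c(2.8, 3.5, 4.3, 4.2, 6.3, 4.3, 2.8, 4.2, 3.7, 3.3,
        3.5, 5.8, 6.0, 9.3, 11.5, 5.7, 4.0, 4.6, 3.7, 3.5)
A1.error <- A1
A1.error[3] <- 43

# Range check against an assumed plausible maximum of 20.
suspect.sites <- which(A1.error > 20)

print(meaningless.mean)
print(management.counts)
print(typo.counts)
print(suspect.sites)
```

The frequency table and the range check are cheap enough to run every time a dataset is updated, before any further analysis.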

Figure 2.5 Checking for exceptional observations.


After having spotted a potential mistake, you need to record immediately where the potential mistake occurred, especially if you do not have time to directly check the raw data. You can include a text file where you record potential mistakes in the folder where you keep your data. Alternatively, you could give the cell in the spreadsheet where you keep a copy of the data a bright colour. Yet another method is to add an extra variable in your dataset where comments on potential mistakes are listed. However, the best method is to directly check and change your raw data (if a mistake is found). Always record the changes that you have made and the reasons for them. Note that an observation that looks odd but which cannot be traced to a mistake should not be changed or assumed to be missing. If it is clearly a nonsense value, but no explanation can be found, then it should be omitted. If it is just a strange value then various courses are open to you. You can try analysing the data with and without the observation to check if it makes a big difference to results. You might have to go back to the field and take the measurement again, finding a field explanation if the odd value is repeated.
Do not get confused when you have various datasets in various stages of correction. Commonly scientists end up with several versions of each data file and lose track of which is which. The best method is to have only one dataset, of which you make regular backups.

Figure 2.6 Checking for exceptional data by pairwise comparisons of the variables of Table 2.2.
Methods of transforming the values in the matrices

There are many ways in which the values of the species and environmental matrices can be transformed. Some methods were developed to make data conform more closely to the normal distribution. What transformation you use will depend on your objectives and what you want to assume about the data. For several types of analysis described in later chapters you do not need to transform the species matrix, and most analyses do not actually require the explanatory variables to be normally distributed. It is therefore not good practice to always transform explanatory variables to be normally distributed. Moreover, in many cases it will not be possible to find a transformation that will result in normally distributed data.
We recommend only transforming variables if you have a good reason to investigate a particular pattern that will be revealed by the transformation. For example, an extreme way of transforming the species matrix is to change the values to 1 if the species is present and 0 if the species is absent. The subsequent analysis will thus not be influenced by differences in species’ abundances. By comparing the results of the analysis of the original data with the results from the transformed data, you can get an idea of the influence of differences in abundance on the results. If one species dominates and the ordination results are only influenced by that one species, then you could use a logarithmic or square-root transformation to diminish the influence of the dominant species – again this means that there is a good reason for the transformation, and it should not be a standard approach. The fact that the results are influenced by the dominant species is actually a clear demonstration of an important pattern in your dataset.
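
How strongly each transformation reduces the weight of a dominant species can be seen in a small base R sketch. The matrix is hypothetical, and the helper share.of.S1() is an assumption introduced only for this illustration: it reports which fraction of the summed values in the matrix belongs to the dominant species S1.

```r
# Hypothetical species matrix in which species S1 dominates.
species.matrix <- data.frame(
  S1 = c(100, 64, 0),
  S2 = c(4, 0, 9),
  S3 = c(1, 1, 0),
  row.names = c("A", "B", "C")
)

# The transformations discussed above.
presence.absence <- (species.matrix > 0) * 1   # 1 = present, 0 = absent
log.transformed  <- log(species.matrix + 1)    # + 1 because log(0) is undefined
sqrt.transformed <- sqrt(species.matrix)

# Illustrative helper: fraction of the summed values contributed by S1.
share.of.S1 <- function(m) sum(m[, "S1"]) / sum(m)

shares <- c(
  original         = share.of.S1(species.matrix),
  square.root      = share.of.S1(sqrt.transformed),
  logarithmic      = share.of.S1(log.transformed),
  presence.absence = share.of.S1(presence.absence)
)
print(round(shares, 3))
```

Comparing an analysis on the original matrix with one on a transformed matrix, as suggested above, then shows how much of the result depends on the dominant species.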
Examples of the analysis with the menu options of Biodiversity.R


See in chapter 3 how data can be loaded from an external file:
Data > Import data > from text file…
Enter name for dataset: data (choose any name)
Click “OK”
Browse for the file and click on it

To save data to an external file:


Data > Active Dataset > export active dataset…
File name: export.txt (choose any name)

Select the species and environmental matrices:


Biodiversity > Environmental Matrix > Select environmental matrix
Select the dune.env dataset
Biodiversity > Community matrix > Select community matrix
Select the dune dataset

To summarize the data and check for exceptional cases:


Biodiversity > Environmental Matrix > Summary…
Select variable: A1
Click “OK”
Click “Plot”
Examples of the analysis with the command options of Biodiversity.R


To load data from an external file:
data <- read.table(file="D://my files/data.txt")   # example path, adapt to your own file
data <- read.table(file.choose())                  # or browse for the file

To save data to an external file:


write.table(data, file="D://my files/data.txt")   # example path, adapt to your own file
write.table(data, file.choose())                  # or browse for the file

To summarize the data and check for exceptional cases:


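library(vegan)                        # provides the dune meadow data (also loaded by BiodiversityR)
data(dune)
data(dune.env)
summary(dune.env)                     # summary of every environmental variable
boxplot(dune.env$A1)                  # boxplot of the A1 horizon
points(mean(dune.env$A1),cex=1.5)     # add the mean to the boxplot
table(dune.env$Management)            # frequencies of each management type
plot(dune.env$Management)             # bar plot of the frequencies
pairs(dune.env)                       # all pairs of environmental variables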

To transform the data:


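dune.ln.transformed <- log(dune+1)                   # natural logarithm of (abundance + 1)
dune.squareroot.transformed <- dune^0.5              # square root
dune.speciesprofile <- decostand(dune,"total")       # proportions of the site total (vegan)
dune.env$A1.standard <- scale(dune.env$A1)           # mean 0 and standard deviation 1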

Checking whether data is normally distributed:


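library(car)
qqPlot(dune.env$A1)          # called qq.plot() in older versions of the car package
shapiro.test(dune.env$A1)
ks.test(dune.env$A1,"pnorm",mean(dune.env$A1),sd(dune.env$A1))   # compare with a normal distribution that has the mean and standard deviation of the data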
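
The same normality checks can be sketched with base R only, so that no extra packages are needed. The A1 values are typed in from Table 2.2 and the object names are assumptions. Two caveats apply: the Kolmogorov-Smirnov p-value is only approximate when the mean and standard deviation are estimated from the same data, and the test warns about tied values, which is why the call is wrapped in suppressWarnings() here.

```r
# Thickness of the A1 horizon for sites X1 to X20, typed in from Table 2.2.
A1 <- c(2.8, 3.5, 4.3, 4.2, 6.3, 4.3, 2.8, 4.2, 3.7, 3.3,
        3.5, 5.8, 6.0, 9.3, 11.5, 5.7, 4.0, 4.6, 3.7, 3.5)

# Coordinates of the normal Q-Q plot without drawing it.
qq <- qqnorm(A1, plot.it = FALSE)

# Shapiro-Wilk test of normality.
sw <- shapiro.test(A1)

# Kolmogorov-Smirnov test against a normal distribution with the sample
# mean and standard deviation (approximate p-value, see the caveats above).
ks <- suppressWarnings(ks.test(A1, "pnorm", mean = mean(A1), sd = sd(A1)))

print(sw)
print(ks)

# Uncomment to draw the Q-Q plot with a reference line:
# qqnorm(A1); qqline(A1)
```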