Chapter 2-Data Preparation
Data preparation
The species information from Figure 2.1 can be recorded as follows:

Site  Species S1 (count)  Species S2 (count)  Species S3 (count)
A     1                   1                   1
B     4                   0                   1
C     2                   2                   0
D     0                   1                   2

The environmental information from Figure 2.1 can be recorded in a similar fashion:

Site  Soil depth (m)
A     1.0
B     2.0
C     0.5
D     1.5

This chapter deals with the preparation of data matrices as the two matrices given above. Note that the example of Figure 2.1 is simplified: typical species matrices have more than 100 rows and more than 100 columns. These matrices can be used as input for the analyses shown in the following chapters. They can be generated by a decent data management system. These matrices are usually not the ideal method of capturing, entering and storing data. Recording species data in the field is typically done with data collection forms that are filled for each site separately and that contain tables with a single column for the species name and a single column for the abundance. This is also the ideal method of storing species data.

A general format for species survey data

As seen above, all information can be recorded in the form of data matrices. All the types of data that are described in this manual can be prepared as two matrices: the species matrix and the environmental matrix. Table 2.1 shows a part of the species matrix for a well-studied dataset in community ecology, the dune meadow dataset. This dataset contains 30 species of which only 13 are presented. The data were collected on the vegetation of meadows on the Dutch island of Terschelling (Jongman et al. 1995). Table 2.2 shows the environmental data for this dataset. You can notice that the rows of both matrices have the same names – they reflect the data that were collected for each site or sample unit. Sites could be sample plots, sample sites, farms, biogeographical provinces, or other identities.

Sites are defined as the areas from which data were collected during a specific time period. We will use the term “site” further on in this manual. Sites will always refer to the rows of the datasets.

Some studies involve more than one type of sampling unit, often arranged hierarchically. For example, villages, farms in the village and plots within a farm. Sites of different types (such as plots, villages and districts) should not be mixed within the same data matrix. Each site of the matrix should be of the same type of sampling unit.

The columns of the matrices indicate the variables that were measured for each site. The cells of the matrices contain observations – bits of data recorded for a specific site and a specific variable.

We prefer using rows to represent samples and columns to represent variables to the alternative form where rows represent variables. Our preference is simply based on the fact that some general statistical packages use this format. Data can be presented by swapping rows and columns, since the contents of the data will remain the same.
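The link between field records and the two matrices can be sketched in code. The example below is a minimal illustration in Python (standard library only), not the procedure of any particular data management system. It assumes field records stored as (site, species, count) entries that reproduce the simplified example of Figure 2.1; the names `records`, `environment` and `species_matrix` are invented for this sketch.

```python
# Field records in long format: one (site, species, count) entry per observation.
# Values reproduce the simplified example of Figure 2.1; zero counts are not recorded.
records = [
    ("A", "S1", 1), ("A", "S2", 1), ("A", "S3", 1),
    ("B", "S1", 4), ("B", "S3", 1),
    ("C", "S1", 2), ("C", "S2", 2),
    ("D", "S2", 1), ("D", "S3", 2),
]

# Environmental matrix: one row per site, one column per variable.
environment = {
    "A": {"soil_depth_m": 1.0},
    "B": {"soil_depth_m": 2.0},
    "C": {"soil_depth_m": 0.5},
    "D": {"soil_depth_m": 1.5},
}


def species_matrix(records, sites):
    """Build a site-by-species matrix (rows = sites, columns = species)."""
    species = sorted({sp for _, sp, _ in records})
    # Start from zeros so that species not recorded at a site get a count of 0.
    matrix = {site: {sp: 0 for sp in species} for site in sites}
    for site, sp, count in records:
        matrix[site][sp] += count
    return matrix, species


sites = sorted(environment)
matrix, species = species_matrix(records, sites)

# Both matrices must describe the same sites, in the same order.
assert list(matrix) == list(environment)

# Swapping rows and columns changes the layout, not the contents.
transposed = {sp: {site: matrix[site][sp] for site in sites} for sp in species}

print("Site", *species)
for site in sites:
    print(site, *(matrix[site][sp] for sp in species))
```

Storing the records in the long format and generating the matrix on demand keeps data entry close to the field forms, while the analysis still receives sites as rows and species as columns.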
Table 2.1 An example of a species matrix, where rows correspond to sites, columns correspond to species and cell entries are the abundance of the
species at a particular site
Site Achmil Agrsto Airpra Alogen Antodo Belper Brarut Brohor Calcus Chealb Cirarv Elepal Elyrep …
X1 1 0 0 0 0 0 0 0 0 0 0 0 4 …
X2 3 0 0 2 0 3 0 4 0 0 0 0 4 …
X3 0 4 0 7 0 2 2 0 0 0 0 0 4 …
X4 0 8 0 2 0 2 2 3 0 0 2 0 4 …
X5 2 0 0 0 4 2 2 2 0 0 0 0 4 …
X6 2 0 0 0 3 0 6 0 0 0 0 0 0 …
X7 2 0 0 0 2 0 2 2 0 0 0 0 0 …
X8 0 4 0 5 0 0 2 0 0 0 0 4 0 …
X9 0 3 0 3 0 0 2 0 0 0 0 0 6 …
X10 4 0 0 0 4 2 2 4 0 0 0 0 0 …
X11 0 0 0 0 0 0 4 0 0 0 0 0 0 …
X12 0 4 0 8 0 0 4 0 0 0 0 0 0 …
X13 0 5 0 5 0 0 0 0 0 1 0 0 0 …
X14 0 4 0 0 0 0 0 0 4 0 0 4 0 …
X15 0 4 0 0 0 0 4 0 0 0 0 5 0 …
X16 0 7 0 4 0 0 4 0 3 0 0 8 0 …
X17 2 0 2 0 4 0 0 0 0 0 0 0 0 …
X18 0 0 0 0 0 2 6 0 0 0 0 0 0 …
X19 0 0 3 0 4 0 3 0 0 0 0 0 0 …
X20 0 5 0 0 0 0 4 0 3 0 0 4 0 …
Table 2.2 An example of an environmental matrix, where rows correspond to sites and columns correspond to
variables
Figure 2.2 Summary of a quantitative variable as a boxplot. The variable that is summarized is the thickness of the
A1 horizon of Table 2.2.
Figure 2.3 Summary of a quantitative variable as a Q-Q plot. The variable that is summarized is the thickness of the
A1 horizon of Table 2.2. The two outliers (upper right-hand side) correspond to the outliers of Figure 2.2.
variables in the statistical analysis. In many statistical packages, when the observations of a variable only contain numbers, the package will assume that the variable is a quantitative variable. If you want the variable to be treated as a categorical variable, you will need to inform the statistical package about this (for example by using a non-numerical coding system). If you are comfortable assuming for the analysis that the ordinal variables were measured on a quantitative scale, then it is better to treat them as quantitative variables. Some special methods for ordinal data are also available.

Checking for exceptional observations that could be mistakes

The methods of summarizing quantitative and categorical data that were described in the previous section can be used to check for exceptional data. Maximum or minimum values that do not correspond to the expectations will easily be spotted. Figure 2.5 for instance shows a boxplot for the A1 horizon that contained a data entry error for site 3, as the value 43 was entered instead of 4.3. Compare with Figure 2.2. You should be aware of the likely ranges of all quantitative variables.

Some mistakes for categorical data can easily be spotted by calculating the frequencies of observations for each factor level. If you had entered “NN” instead of “NM” for one management observation in the dune meadow dataset, then a table with the number of observations for each management type would easily reveal that mistake. This method is especially useful when the number of observations is fixed for each level. If you designed your survey so that each type of management should have 5 observations, then spotting one type of management with 4 observations and one type with 1 observation would reveal a data entry error.

Some exceptional observations will only be spotted when you plot variables against each other as part of exploratory analysis, or even later when you start conducting some statistical analysis. Figure 2.6 shows a plot of all possible pairs of the environmental variables of the dune meadow dataset. You can notice the two outliers for the thickness of the A1 horizon, which occur at moisture category 4 and manure category 1, for instance.
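The two checks described above, a range check for a quantitative variable and a frequency table for a categorical variable, can be sketched as follows. This is a minimal illustration in Python (standard library only). The A1 values are illustrative and do not reproduce Table 2.2, the plausible range is an assumption that you would replace with your own field knowledge, and the management codes other than “NM” and “NN” are placeholders for a hypothetical balanced design. The function names are invented for this sketch.

```python
from collections import Counter

# Illustrative values only (not the full Table 2.2): A1 thickness per site,
# with 43 typed for site X3 where 4.3 was intended.
a1_thickness = {"X1": 2.8, "X2": 3.5, "X3": 43.0, "X4": 4.2, "X5": 6.3}

# Assumed plausible range for this variable; choose limits from field knowledge.
A1_RANGE = (0.0, 15.0)


def out_of_range(values, low, high):
    """Return the sites whose value falls outside the interval [low, high]."""
    return {site: v for site, v in values.items() if not low <= v <= high}


# Hypothetical balanced design: 5 observations per management type,
# with one "NM" mistyped as "NN".
management = ["BF"] * 5 + ["HF"] * 5 + ["NM"] * 4 + ["NN"] + ["SF"] * 5
EXPECTED_PER_LEVEL = 5


def unexpected_levels(observations, expected):
    """Return the factor levels whose frequency differs from the design."""
    counts = Counter(observations)
    return {level: n for level, n in counts.items() if n != expected}


suspect_values = out_of_range(a1_thickness, *A1_RANGE)
suspect_levels = unexpected_levels(management, EXPECTED_PER_LEVEL)

print(suspect_values)
print(suspect_levels)
```

Both functions only flag candidates. Whether a flagged value is really a mistake still has to be decided by checking the raw data.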
After having spotted a potential mistake, you need to record immediately where the potential mistake occurred, especially if you do not have time to directly check the raw data. You can include a text file where you record potential mistakes in the folder where you keep your data. Alternatively, you could give the cell in the spreadsheet where you keep a copy of the data a bright colour. Yet another method is to add an extra variable in your dataset where comments on potential mistakes are listed. However, the best method is to directly check and change your raw data (if a mistake is found). Always record the changes that you have made and the reasons for them. Note that an observation that looks odd but which cannot be traced to a mistake should not be changed or assumed to be missing. If it is clearly a nonsense value, but no explanation can be found, then it should be omitted. If it is just a strange value then various courses are open to you. You can try analysing the data with and without the observation to check if it makes a big difference to results. You might have to go back to the field and take the measurement again, finding a field explanation if the odd value is repeated.

Do not get confused when you have various datasets in various stages of correction. Commonly scientists end up with several versions of each data file and lose track of which is which. The best method is to have only one dataset, of which you make regular backups.
Figure 2.6 Checking for exceptional data by pairwise comparisons of the variables of Table 2.2.