R package iNEXT
Examples
T. C. Hsieh, K. H. Ma, and Anne Chao
Latest Updates in July 2022: (1) We have modified (in the main function iNEXT) the bootstrap method
used to obtain confidence intervals for the coverage-based rarefaction and extrapolation curves. We have
expanded the iNEXT output ($iNextEst) to include two lists ($size_based and $coverage_based). (2) In the
function estimateD, for a given coverage value, we have refined our algorithm to find the corresponding
sample size (not necessarily restricted to integers) to obtain more accurate diversity estimates. (3) We
have changed some column names in the output in order to conform to our forthcoming iNEXT series
(iNEXT.3D, iNEXT.4steps, iNEXT.link). Please download the latest version of iNEXT available from CRAN or
from Anne Chao’s iNEXT_github, or use the latest version of iNEXT Online available from Shiny iNEXT-Online.
iNEXT (iNterpolation and EXTrapolation) is an R package modified from the original version which was
supplied in the Supplement of Chao et al. (2014). In the latest updated version, we have added more
user‐friendly features, improved some algorithms, and refined the graphic displays. In this document, we
provide a quick introduction demonstrating how to run iNEXT. Detailed information about iNEXT functions is
provided in the iNEXT Manual, also available from CRAN. See Chao & Jost (2012), Colwell et al. (2012) and
Chao et al. (2014) for methodologies. A short review of the theoretical background and a brief description
of methods are included in an application paper by Hsieh, Ma & Chao (2016). An online version, iNEXT Online,
is also available for users without an R background.
iNEXT focuses on three measures of Hill numbers of order q: species richness (q = 0), Shannon diversity (q
= 1, the exponential of Shannon entropy) and Simpson diversity (q = 2, the inverse of Simpson
concentration). For each diversity measure, iNEXT uses the observed sample of abundance or incidence
data (called the “reference sample”) to compute diversity estimates and the associated 95% confidence
intervals for the following two types of rarefaction and extrapolation (R/E):
1. Sample-size-based (or size-based) R/E sampling curves: iNEXT computes diversity estimates for
rarefied and extrapolated samples up to an appropriate size. This type of sampling curve plots the
diversity estimates with respect to sample size.
2. Coverage‐based R/E sampling curves: iNEXT computes diversity estimates for rarefied and
extrapolated samples based on a standardized level of sample completeness (as measured by
sample coverage) up to an appropriate coverage value. This type of sampling curve plots the
diversity estimates with respect to sample coverage.
iNEXT also plots the above two types of sampling curves and a sample completeness curve (which depicts
how sample coverage varies with sample size). The sample completeness curve provides a bridge
between the size- and coverage-based R/E sampling curves.
If you publish your work based on the results from the iNEXT package, you should make references to the
following methodology paper (Chao et al. 2014) and the application paper (Hsieh, Ma & Chao, 2016):
Chao, A., Gotelli, N.J., Hsieh, T.C., Sander, E.L., Ma, K.H., Colwell, R.K. & Ellison, A.M. (2014)
Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species
diversity studies. Ecological Monographs, 84, 45–67.
Hsieh, T.C., Ma, K.H. & Chao, A. (2016) iNEXT: An R package for interpolation and extrapolation of
species diversity (Hill numbers). Methods in Ecology and Evolution, 7, 1451–1456.
Required: R
Suggested: RStudio IDE
The iNEXT package is available from CRAN and can be downloaded with a standard R installation
procedure or can be downloaded from Anne Chao’s iNEXT_github using the following commands. For a
first‐time installation, an additional visualization extension package (ggplot2) must be installed and loaded.
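The installation commands referred to above typically look like the following sketch. The GitHub repository path "AnneChao/iNEXT" and the use of the devtools package are assumptions based on the package's public repository; adjust them if your setup differs.

```r
## install iNEXT from CRAN
install.packages("iNEXT")

## or install the latest version from GitHub (assumed path: AnneChao/iNEXT)
install.packages("devtools")
devtools::install_github("AnneChao/iNEXT")

## load iNEXT and the visualization package
library(iNEXT)
library(ggplot2)
```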
The arguments of the main function iNEXT() are briefly described below, and will be explained in more detail by
illustrative examples in later text. This main function computes diversity estimates of order q, the sample
coverage estimates and related statistics for K (if knots = K) evenly‐spaced knots (sample sizes) between
size 1 and the endpoint, where the endpoint is described below. Each knot represents a particular sample
size for which diversity estimates will be calculated. By default, endpoint = double the reference sample
size (total sample size for abundance data; total sampling units for incidence data). For example, if
endpoint = 10 and knots = 4, diversity estimates will be computed for a sequence of samples with sizes (1, 4,
7, 10). In a later real-data example, we have endpoint = 336 and knots = 40; diversity estimates will be
computed for a sequence of samples with sizes (1, 10, 19, 28, …, 318, 327, 336).
Argument   Description
size       an integer vector of sample sizes for which diversity estimates will be computed. If
           NULL, then diversity estimates will be calculated for those sample sizes determined
           by the specified/default endpoint and knots.
endpoint   an integer specifying the sample size that is the endpoint for R/E calculation; if
           NULL, then endpoint = double the reference sample size.
knots      an integer specifying the number of equally‐spaced knots between size 1 and the
           endpoint; default is 40.
se         a logical variable to calculate the bootstrap standard error and conf confidence
           interval.
conf       a positive number < 1 specifying the level of confidence interval; default is 0.95.
This function returns an "iNEXT" object which can be further used to make plots using the function
ggiNEXT() to be described below.
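As a minimal sketch of how the arguments in the table fit together, the call below sets them explicitly for the spider data that ships with the package. The particular values (endpoint = 400, knots = 20) are arbitrary choices for illustration, not recommendations.

```r
library(iNEXT)

data(spider)

# Explicit endpoint and number of knots; se = TRUE requests bootstrap
# standard errors and confidence intervals at level conf.
out <- iNEXT(spider, q = 0, datatype = "abundance",
             endpoint = 400, knots = 20, se = TRUE, conf = 0.95)

class(out)   # the returned object is of class "iNEXT"
names(out)   # the output lists described in the text
```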
DATA FORMAT/INFORMATION
a. Incidence-raw data (datatype="incidence_raw"): for each assemblage, input data for a reference
sample consisting of a species-by-sampling-unit matrix; each element in the raw matrix is 1 for a
detection, and 0 otherwise. When there are N assemblages, input data consist of N lists of raw
matrices, and each matrix is a species-by-sampling-unit matrix.
b. Incidence-frequency data (datatype="incidence_freq"): input data for each assemblage consist of
species sample incidence frequencies (i.e., row sums of the corresponding incidence raw matrix).
When there are N assemblages, input data consist of an (S+1) by N matrix, or N lists of species
incidence frequencies. The first entry of each column/list must be the total number of sampling units,
followed by the species incidence frequencies.
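The relationship between the two incidence formats can be sketched in base R: the incidence-frequency vector is the number of sampling units followed by the row sums of the raw matrix. The small matrix below is made-up toy data used only to show the layout.

```r
# Toy species-by-sampling-unit matrix (4 species, 3 sampling units);
# 1 = detected, 0 = not detected. Values are invented for illustration.
raw <- matrix(c(1, 0, 1,
                0, 0, 1,
                1, 1, 1,
                0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("sp", 1:4), paste0("unit", 1:3)))

# Incidence-frequency format: first entry is the number of sampling units,
# followed by each species' incidence frequency (row sum).
inc_freq <- c(units = ncol(raw), rowSums(raw))
print(inc_freq)
```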
Four data sets are included in the iNEXT package for illustration. There are two abundance data sets:
spider (list of two vectors) and bird (in data.frame format), and two incidence data sets: ant (list of 5
vectors) and ciliates (list of 3 matrices). The input datatypes are the same for the two abundance data
sets (datatype="abundance"), but the input datatypes are different for the ant data
(datatype="incidence_freq") and the ciliates data (datatype="incidence_raw"). We first use the spider
data for illustration; see Chao et al. (2014) for analysis details and data interpretations. The spider data
consist of abundance data from two canopy manipulation treatments (“Girdled” and “Logged”) of hemlock
trees (Ellison et al. 2010). For these data, the following commands run the iNEXT() function for q = 0.
data(spider)
str(spider)
iNEXT(spider, q=0, datatype="abundance")
The iNEXT() function returns the "iNEXT" object including three output lists: $DataInfo for summarizing data
information; $iNextEst for showing size- and coverage-based diversity estimates along with related
statistics for a series of rarefied and extrapolated samples; and $AsyEst for showing asymptotic diversity
estimates along with related statistics. $DataInfo, as shown below, returns basic data information including
the reference sample size (n), observed species richness (S.obs), sample coverage estimate for the
reference sample (SC), and the first ten frequency counts (f1‐f10). This part of the output can also be
computed by the function DataInfo().
For incidence data, the list $DataInfo includes the reference sample size (T), observed species richness
(S.obs), total number of incidences (U), sample coverage estimate for the reference sample (SC), and the
first ten incidence frequency counts (Q1‐Q10).
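A short sketch of calling DataInfo() directly on the two kinds of example data. The column names checked in the comments follow the description above; they may differ in package versions earlier than 3.0.0.

```r
library(iNEXT)

data(spider)
data(ant)

# Abundance data: reference sample size n, S.obs, SC and frequency counts
info_abun <- DataInfo(spider, datatype = "abundance")
print(info_abun)

# Incidence-frequency data: number of sampling units T, total incidences U, etc.
info_inc <- DataInfo(ant, datatype = "incidence_freq")
print(info_inc)
```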
In the Girdled treatment assemblage, by default, 40 equally spaced knots (sample sizes) between 1 and
336 (= 2 × 168, double the reference sample size, Chao et al. 2014) are selected. Diversity estimates and
related statistics are computed for these 40 knots (corresponding to sample sizes m = 1, 10, 19, …, 327,
336), which locates the reference sample at the mid-point of the selected knots. If the argument se=TRUE,
then the bootstrap method is applied to obtain the 95% confidence intervals for each diversity and sample
coverage estimate.
The list $iNextEst output includes two data frames: $size_based and $coverage_based. (Note the output in
the list $iNextEst is different from that obtained from earlier iNEXT versions < 3.0.0, due to a modification in
the bootstrap method.) For the sample size corresponding to each knot, the first data frame (as shown
under $size_based) includes the name of Assemblage, the sample size (m, i.e., each of the 40 knots), the
method (Rarefaction, Observed, or Extrapolation, depending on whether the size m is less than, equal to, or
greater than the reference sample size), the diversity order (Order.q), the diversity estimate of order q (qD),
the 95% lower and upper confidence limits of diversity (qD.LCL, qD.UCL), and the sample coverage estimate
(SC) along with the 95% lower and upper confidence limits of sample coverage (SC.LCL, SC.UCL). These
sample coverage estimates with confidence intervals are used for plotting the sample completeness curve.
NOTE: The above output only shows five estimates for each assemblage; call
iNEXT.object$iNextEst$size_based to view complete output.
The second data frame (as shown under $coverage_based) includes the name of Assemblage, the
standardized sample coverage (SC), the corresponding sample size for the standardized coverage (m, i.e.,
each of the 40 knots), the method (Rarefaction, Observed, or Extrapolation, depending on whether the
sample coverage SC is less than, equal to, or greater than the reference sample coverage), the diversity
order (Order.q), the diversity estimate of order q (qD), and the 95% lower and upper confidence limits of
diversity (qD.LCL, qD.UCL). These diversity estimates and confidence intervals are used for plotting the
coverage-based R/E curves.
$coverage_based (LCL and UCL are obtained for fixed coverage; interval length is wider due to
varying size in bootstraps.)
NOTE: The above output only shows five estimates for each assemblage; call
iNEXT.object$iNextEst$coverage_based to view complete output.
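To view the complete tables rather than a five-row summary, the two data frames can be extracted from the fitted object. A minimal sketch, assuming iNEXT version 3.0.0 or later (where $iNextEst holds $size_based and $coverage_based):

```r
library(iNEXT)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")

# Full size-based and coverage-based tables
size_tab <- out$iNextEst$size_based
cov_tab  <- out$iNextEst$coverage_based

names(size_tab)   # column names of the size-based table
head(size_tab)    # first rows of the size-based table
head(cov_tab)     # first rows of the coverage-based table
```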
In the above output ($size_based and $coverage_based), the confidence intervals of any standardized
diversity are obtained by a bootstrap method. In the size-based standardization, the sample size is fixed in
each regenerated bootstrap sample. In the coverage-based standardization, for a given standardized
coverage value, the corresponding size needed to attain the same level of coverage may vary with
regenerated bootstrap samples. Thus, the sampling uncertainty is greater in the coverage-based
standardization and the resulting confidence interval is wider than that in the corresponding size-based
standardization. For example, if the size for a future survey will be fixed at a sample size of 84, we can
obtain a 95% CI of (15.9, 21.9) for the expected diversity (q = 0) based on the first data frame ($size_based
output). However, if the coverage of a survey is fixed at the level of 0.9, the size needed for the current
data is 84, but the size needed for a regenerated bootstrap sample may be different from 84; the second
data frame ($coverage_based output) shows a CI of (10.8, 27.1), which is wider than the former one based
on a size of 84. Because we use a random bootstrapping/regeneration process, with 50 replications
(default), to obtain each CI, the output for qD.LCL and qD.UCL may vary slightly each time you enter the
same data.
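If repeatable confidence limits are needed, one option is to fix R's random seed before each call. This is a sketch that assumes the bootstrap draws use R's standard random number generator; the seed value is arbitrary, and nboot = 50 simply restates the default number of replications mentioned above.

```r
library(iNEXT)

data(spider)

# Same seed before each run, so the bootstrap samples should coincide
set.seed(123)
out1 <- iNEXT(spider, q = 0, datatype = "abundance", nboot = 50)

set.seed(123)
out2 <- iNEXT(spider, q = 0, datatype = "abundance", nboot = 50)

# Compare the lower confidence limits of the two runs
same_lcl <- isTRUE(all.equal(out1$iNextEst$size_based$qD.LCL,
                             out2$iNextEst$size_based$qD.LCL))
print(same_lcl)
```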
$AsyEst lists the name of Assemblage, the Diversity (species richness for q = 0, Shannon diversity for q = 1,
and Simpson diversity for q = 2), the observed diversity (Observed), the asymptotic diversity estimate
(Estimator), the s.e. of the asymptotic estimator (s.e.) and the associated 95% lower and upper
confidence limits (LCL, UCL). The estimated asymptotes are calculated via the functions ChaoRichness() for q
= 0, ChaoShannon() for q = 1 and ChaoSimpson() for q = 2; see Chao et al. (2014) for the formulas of all
asymptotic estimators. The output for the spider data is shown below.
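The three asymptotic estimators named above can also be called on their own. A minimal sketch using the spider data; the functions are applied to the whole list so that each assemblage gets one row.

```r
library(iNEXT)

data(spider)

# Asymptotic estimates for q = 0, 1 and 2, one row per assemblage
rich    <- ChaoRichness(spider, datatype = "abundance")
shannon <- ChaoShannon(spider, datatype = "abundance")
simpson <- ChaoSimpson(spider, datatype = "abundance")

print(rich)
print(shannon)
print(simpson)
```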
Further, iNEXT can simultaneously run R/E computation for Hill numbers with q = 0, 1, and 2 by specifying a
vector for the argument q as follows:
data(bird)
str(bird) # 41 obs. of 2 variables
iNEXT(bird, q=c(0, 1, 2), datatype="abundance")
The function ggiNEXT() extends ggplot2 to the "iNEXT" object. It can plot three types of curves, selected by the
argument type:
1. Sample-size-based R/E curve (type=1): see Figs. 1a and 2a in Hsieh et al. (2016). This curve plots
diversity estimates with confidence intervals (if se=TRUE) as a function of sample size up to double
the reference sample size, by default, or a user‐specified endpoint.
2. Sample completeness curve (type=2) with confidence intervals (if se=TRUE): see Figs. 1b and 2b in
Hsieh et al. (2016). This curve plots the sample coverage with respect to sample size for the same
range described in (1).
3. Coverage-based R/E curve (type=3): see Figs. 1c and 2c in Hsieh et al. (2016). This curve plots the
diversity estimates with confidence intervals (if se=TRUE) as a function of sample coverage up to the
maximum coverage obtained from the maximum size described in (1).
The ggiNEXT() function is a wrapper around the ggplot2 package to create a R/E curve using a single line
of code. The resulting object is of class "ggplot", so it can be manipulated using the ggplot2 tools. The
argument facet.var=("None", "Order.q", "Assemblage" or "Both") can be used to create a separate plot
for each value of the specified variable. See the following examples.
The argument facet.var="Assemblage" in the ggiNEXT function creates a separate plot for each assemblage
as shown below:
# Sample‐size‐based R/E curves, separating by "Assemblage"
out <- iNEXT(spider, q=c(0, 1, 2), datatype="abundance", endpoint=500)
ggiNEXT(out, type=1, facet.var="Assemblage")
The arguments facet.var="Order.q" and color.var="Assemblage" create a separate plot for each diversity
order, and within each plot, different colors are used for the two assemblages.
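A sketch of the corresponding call, reusing the spider data and the endpoint of 500 from the previous example:

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = c(0, 1, 2), datatype = "abundance", endpoint = 500)

# One panel per diversity order; assemblages distinguished by color
p_order <- ggiNEXT(out, type = 1, facet.var = "Order.q", color.var = "Assemblage")
p_order
```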
For illustration, we use the tropical ant data (in the dataset ant included in the package) at five elevations
(50m, 500m, 1070m, 1500m, and 2000m) collected by Longino & Colwell (2011) from Costa Rica. The 5
lists of incidence frequencies are shown below. The first entry of each list must be the total number of
sampling units, followed by the species incidence frequencies.
data(ant)
str(ant)
List of 5
$ h50m : num [1:228] 599 330 263 236 222 195 186 183 182 129 ...
$ h500m : num [1:242] 230 133 131 123 78 73 65 60 60 56 ...
$ h1070m: num [1:123] 150 99 96 80 74 68 60 54 46 45 ...
$ h1500m: num [1:57] 200 144 113 79 76 74 73 53 50 43 ...
$ h2000m: num [1:15] 200 80 59 34 23 19 15 13 8 8 ...
The argument color.var = ("None", "Order.q", "Assemblage" or "Both") is used to display curves in
different colors for values of the specified variable. For example, the following code using the argument
color.var="Assemblage" displays the sampling curves in different colors for the five assemblages. Note that
theme_bw() is a ggplot2 function to modify the display setting from a grey to a white background with black
gridlines. The following commands return three types of R/E sampling curves for the ant data.
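A minimal sketch of such commands for the ant data. The default endpoint and knots are used here, and base_size = 18 is only an illustrative font size.

```r
library(iNEXT)
library(ggplot2)

data(ant)
out.inc <- iNEXT(ant, q = 0, datatype = "incidence_freq")

# Sample-size-based R/E curves
p1 <- ggiNEXT(out.inc, type = 1, color.var = "Assemblage") +
  theme_bw(base_size = 18) +
  theme(legend.position = "none")

# Sample completeness curves
p2 <- ggiNEXT(out.inc, type = 2, color.var = "Assemblage") +
  theme_bw(base_size = 18)

# Coverage-based R/E curves
p3 <- ggiNEXT(out.inc, type = 3, color.var = "Assemblage") +
  theme_bw(base_size = 18)
```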
We use the ciliates data collected from three coastal dune habitats to demonstrate the use of the input
datatype="incidence_raw". The data set (ciliates) included in the package is a list of three species-by-plots
matrices. Run the following commands to get the output as shown below.
data(ciliates)
str(ciliates)
List of 3
$ EtoshaPan : int [1:365, 1:19] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:365] "Acaryophrya.collaris" "Actinobolina.multinucleata.n..sp."
"Afroamphisiella.multinucleata.n..sp." "Afrothrix.multinucleata.n..sp." ...
.. ..$ : chr [1:19] "x53" "x54" "x55" "x56" ...
$ CentralNamibDesert : int [1:365, 1:17] 0 0 0 0 0 1 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:365] "Acaryophrya.collaris" "Actinobolina.multinucleata.n..sp."
"Afroamphisiella.multinucleata.n..sp." "Afrothrix.multinucleata.n..sp." ...
.. ..$ : chr [1:17] "x31" "x32" "x34" "x35" ...
$ SouthernNamibDesert: int [1:365, 1:15] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:365] "Acaryophrya.collaris" "Actinobolina.multinucleata.n..sp."
"Afroamphisiella.multinucleata.n..sp." "Afrothrix.multinucleata.n..sp." ...
.. ..$ : chr [1:15] "x9" "x17" "x19" "x20" ...
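With the raw incidence matrices loaded, the R/E computation can be run along the following lines. This is a sketch; the endpoint of 150 sampling units is an illustrative choice.

```r
library(iNEXT)

data(ciliates)

# Raw incidence input: a list of species-by-sampling-unit 0/1 matrices
out.raw <- iNEXT(ciliates, q = 0, datatype = "incidence_raw", endpoint = 150)

out.raw$DataInfo                     # one row per habitat
head(out.raw$iNextEst$size_based)    # first rows of the size-based estimates
```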
We also supply the function estimateD() to compute diversity estimates with q = 0, 1, 2 for any particular level of sample size (base="size") or any
specified level of sample coverage (base="coverage") for abundance data (datatype="abundance") or
incidence data (datatype="incidence_freq" or "incidence_raw"). If base="size" and level=NULL, then this
function computes the diversity estimates for the minimum among all doubled reference sample sizes. If
base="coverage" and level=NULL, then this function computes the diversity estimates for the minimum
among the coverage values for samples extrapolated to double the size of the reference sample.
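A sketch of the size-based default described above, using the spider data. With level=NULL the common size should be the smaller of the two doubled reference sample sizes (2 × 168 = 336 for these data); the column name m assumes the output layout of version 3.0.0 or later.

```r
library(iNEXT)

data(spider)

# base = "size" with level = NULL: estimates at the minimum doubled
# reference sample size across assemblages
est_size <- estimateD(spider, datatype = "abundance",
                      base = "size", level = NULL, conf = 0.95)
print(est_size)

# base = "coverage" with level = NULL: estimates at the minimum coverage
# among samples extrapolated to double the reference size
est_cov <- estimateD(spider, datatype = "abundance",
                     base = "coverage", level = NULL, conf = 0.95)
print(est_cov)
```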
The following command returns the species diversity with a specified level of sample coverage of 98.5%
for the ant data. For some assemblages, this coverage value corresponds to rarefaction (i.e., less than the
coverage of the reference sample), while for the others it corresponds to extrapolation (i.e., greater than
the coverage of the reference sample), as indicated under the method column of the output.
estimateD(ant, datatype="incidence_freq",
base="coverage", level=0.985, conf=0.95)
Because the object returned by ggiNEXT() is of class "ggplot", it can be customized further with the ggplot2 tools. The
following are some useful examples for customizing graphs.
Remove legend
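A minimal sketch of removing or repositioning the legend with a ggplot2 theme setting. The plot is rebuilt here so that the snippet runs on its own.

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")
g <- ggiNEXT(out, type = 1, color.var = "Assemblage")

# Drop the legend entirely
g_nolegend <- g + theme(legend.position = "none")

# Or move it below the plot instead
g_bottom <- g + theme(legend.position = "bottom")
```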
The data visualization package ggplot2 provides scale_ functions to customize how data are mapped
to an aesthetic property of a geom_. The following functions help the user customize ggiNEXT output.
change point shape: scale_shape_manual
change line type : scale_linetype_manual
change line color: scale_colour_manual
change band color: scale_fill_manual
See the quick reference for style settings.
To show how to customize ggiNEXT output, we use the abundance-based data spider as an example.
library(iNEXT)
library(ggplot2)
library(gridExtra)
library(grid)
data("spider")
out <- iNEXT(spider, q=0, datatype="abundance")
g <- ggiNEXT(out, type=1, color.var = "Assemblage")
g
Change shapes, line types and colors
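The four scale_ functions listed earlier can be added to a ggiNEXT plot as in the sketch below. The shape codes and colors are arbitrary illustrative values for the two spider assemblages; ggplot2 may print a message that an existing scale is being replaced.

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")
g <- ggiNEXT(out, type = 1, color.var = "Assemblage")

# Override point shapes, line types, line colors and band colors
g_custom <- g +
  scale_shape_manual(values = c(11, 12)) +
  scale_linetype_manual(values = c(1, 2)) +
  scale_colour_manual(values = c("red", "blue")) +
  scale_fill_manual(values = c("red", "blue"))
```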
In order to change the size of the reference sample point or rarefaction/extrapolation curve, the user needs
to modify the ggplot object.
Change point size:
The reference sample size point is drawn on the first layer by ggiNEXT. Hack the point size by editing the
size value of that layer in the built plot object (see ggplot_build()).
Customize theme
A ggplot object can be themed by adding a theme. The user can run help(theme_grey) to show the
default themes in ggplot2. Further, some extra themes are provided by the ggthemes package. Examples
are shown in the following:
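A minimal sketch using themes that ship with ggplot2 itself (the ggthemes package is not required for these). The legend position is an illustrative choice.

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")
g <- ggiNEXT(out, type = 1, color.var = "Assemblage")

# White background with the legend placed below the plot
g_bw <- g + theme_bw() + theme(legend.position = "bottom")

# Classic theme with axis lines and no gridlines
g_classic <- g + theme_classic() + theme(legend.position = "bottom")
```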
The following are customized themes for black-white figures. To modify the legend, see Cookbook for R for
more details.
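One way to obtain a black-white figure is the grey argument of ggiNEXT(), sketched below together with a white-background theme. The combination of options is an illustrative choice.

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")

# grey = TRUE asks ggiNEXT for a grey-scale figure
g_grey <- ggiNEXT(out, type = 1, grey = TRUE) +
  theme_bw() +
  theme(legend.position = "bottom")
```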
In iNEXT, we provide a S3 ggplot2::fortify method for class iNEXT. The function fortify offers a single
plotting interface for rarefaction/extrapolation curves. Set argument type = 1, 2, 3 to plot the
corresponding rarefaction/extrapolation curves.
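A sketch of using fortify() to obtain a plain data frame and plotting it by hand. The column names x, y and Assemblage are assumptions about the fortified output in version 3.0.0 or later; check names(df) before relying on them.

```r
library(iNEXT)
library(ggplot2)

data(spider)
out <- iNEXT(spider, q = 0, datatype = "abundance")

# Convert the "iNEXT" object to a data frame for the size-based curve
df <- fortify(out, type = 1)
names(df)

# Hand-built curve from the fortified data (assumed columns: x, y, Assemblage)
p_fortify <- ggplot(df, aes(x = x, y = y, colour = Assemblage)) +
  geom_line()
```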
License
The iNEXT package is licensed under the GPLv3. To help refine iNEXT, your comments or feedback would
be welcome (please send them to Anne Chao or report an issue on the iNEXT GitHub page, iNEXT_github).
References
Chao, A., Gotelli, N.J., Hsieh, T.C., Sander, E.L., Ma, K.H., Colwell, R.K. & Ellison, A.M. (2014)
Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species
diversity studies. Ecological Monographs, 84, 45–67.
Chao, A. & Jost, L. (2012) Coverage‐based rarefaction and extrapolation: standardizing samples by
completeness rather than size. Ecology, 93, 2533–2547.
Colwell, R.K., Chao, A., Gotelli, N.J., Lin, S.‐Y., Mao, C.X., Chazdon, R.L. & Longino, J.T. (2012)
Models and estimators linking individual-based and sample-based rarefaction, extrapolation and
comparison of assemblages. Journal of Plant Ecology, 5, 3–21.
Ellison, A.M., Barker-Plotkin, A.A., Foster, D.R. & Orwig, D.A. (2010) Experimentally testing the role
of foundation species in forests: the Harvard Forest Hemlock Removal Experiment. Methods in
Ecology and Evolution, 1, 168–179.
Hsieh, T.C., Ma, K.H. & Chao, A. (2016) iNEXT: An R package for interpolation and extrapolation of
species diversity (Hill numbers). Methods in Ecology and Evolution, 7, 1451–1456.
Longino, J.T. & Colwell, R.K. (2011) Density compensation, species composition, and richness of
ants on a neotropical elevational gradient. Ecosphere, 2:art29.