
Protein Cell 2021, 12(5):315–330

https://doi.org/10.1007/s13238-020-00724-8 Protein & Cell

REVIEW
A practical guide to amplicon and metagenomic analysis of microbiome data
Yong-Xin Liu 1,2,3,&, Yuan Qin 1,2,3,4, Tong Chen 5, Meiping Lu 6, Xubo Qian 6, Xiaoxuan Guo 1,2,3, Yang Bai 1,2,3,4,&
1 State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
2 CAS Center for Excellence in Biotic Interactions, University of Chinese Academy of Sciences, Beijing 100049, China
3 CAS-JIC Centre of Excellence for Plant and Microbial Science, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
4 College of Advanced Agricultural Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
5 National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China
6 Department of Rheumatology Immunology & Allergy, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310053, China
& Correspondence: [email protected] (Y.-X. Liu), [email protected] (Y. Bai)
Received February 4, 2020 Accepted April 10, 2020

Yong-Xin Liu, Yuan Qin and Tong Chen have contributed equally to this work

ABSTRACT

Advances in high-throughput sequencing (HTS) have fostered rapid developments in the field of microbiome research, and massive microbiome datasets are now being generated. However, the diversity of software tools and the complexity of analysis pipelines make it difficult to access this field. Here, we systematically summarize the advantages and limitations of microbiome methods. Then, we recommend specific pipelines for amplicon and metagenomic analyses, and describe commonly-used software and databases, to help researchers select the appropriate tools. Furthermore, we introduce statistical and visualization methods suitable for microbiome analysis, including alpha- and beta-diversity, taxonomic composition, difference comparisons, correlation, networks, machine learning, evolution, source tracing, and common visualization styles to help researchers make informed choices. Finally, a step-by-step reproducible analysis guide is introduced. We hope this review will allow researchers to carry out data analysis more effectively and to quickly select the appropriate tools in order to efficiently mine the biological significance behind the data.

KEYWORDS metagenome, marker genes, high-throughput sequencing, pipeline, reproducible analysis, visualization

INTRODUCTION

Microbiome refers to an entire microhabitat, including its microorganisms, their genomes, and the surrounding environment (Marchesi and Ravel, 2015). With the development of high-throughput sequencing (HTS) technology and data analysis methods, the roles of the microbiome in humans (Gao et al., 2018; Yang and Yu, 2018; Zhang et al., 2018a), animals (Liu et al., 2020), plants (Liu et al., 2019a; Wang et al., 2020a), and the environment (Mahnert et al., 2019; Zheng et al., 2019) have gradually become clearer in recent years. These findings have completely changed our understanding of the microbiome. Several countries have launched successful international microbiome projects, such as the NIH Human Microbiome Project (HMP) (Turnbaugh et al., 2007), the Metagenomics of the Human Intestinal Tract (MetaHIT) (Li et al., 2014), the integrative HMP (iHMP) (Proctor et al., 2019), and the Chinese Academy of Sciences Initiative of Microbiome (CAS-CMI) (Shi et al., 2019b). These projects have made remarkable achievements, which have pushed microbiome research into a golden era.

The framework for amplicon and metagenomic analysis was established in the last decade (Caporaso et al., 2010; Qin et al., 2010). However, microbiome analysis methods

© The Author(s) 2020



and standards have been evolving rapidly over the past few years (Knight et al., 2018). For example, there was a proposal to replace operational taxonomic units (OTUs) with amplicon sequence variants (ASVs) in marker gene-based amplicon data analysis (Callahan et al., 2016). The next-generation microbiome analysis pipeline QIIME 2, a reproducible, interactive, efficient, community-supported platform, was recently published (Bolyen et al., 2019). In addition, new methods have recently been proposed for taxonomic classification (Ye et al., 2019), machine learning (Galkin et al., 2018), and multi-omics integrated analysis (Pedersen et al., 2018).

The development of HTS and analysis methods has provided new insights into the structures and functions of the microbiome (Jiang et al., 2019; Ning and Tong, 2019). However, these new developments have made it challenging for researchers, especially those without a bioinformatics background, to choose suitable software and pipelines. In this review, we discuss the widely used software packages for microbiome analyses, summarize their advantages and limitations, and provide sample code and suggestions for selecting and using these tools.

HTS METHODS OF MICROBIOME ANALYSIS

The first step in microbiome research is to understand the advantages and limitations of specific HTS methods. These methods are primarily used for three types of analysis: microbe-, DNA-, and mRNA-level analyses (Fig. 1A). The appropriate method(s) should be selected based on sample types and research goals.

Culturome is a high-throughput method for culturing and identifying microbes at the microbe-level (Fig. 1A). The microbial isolates are obtained as follows. First, the samples are crushed, empirically diluted in liquid medium, and distributed in 96-well microtiter plates or Petri dishes. Second, the plates are cultured for 20 days at room temperature. Third, the microbes in each well are subjected to amplicon sequencing, and wells with pure, non-redundant colonies are selected as candidates. Fourth, the candidates are purified and subjected to 16S rDNA full-length Sanger sequencing. Finally, the newly characterized pure isolates are preserved (Zhang et al., 2019). Culturome is the most effective method for obtaining bacterial stocks, but it is expensive and labor intensive (Fig. 1B). This method has been used for microbiome analysis in humans (Goodman et al., 2011; Zou et al., 2019), mice (Liu et al., 2020), marine sediment (Mu et al., 2018), Arabidopsis thaliana (Bai et al., 2015), and rice (Zhang et al., 2019). These studies not only expanded the catalog of taxonomic and functional databases for metagenomic analyses, but also provided bacterial stocks for experimental verification. For further information, please see Lagier et al. (2018) and Liu et al. (2019a).

DNA is easy to extract, preserve, and sequence, which has allowed researchers to develop various HTS methods (Fig. 1A). The commonly used HTS methods for microbiome analysis are amplicon

A [Schematic of the levels of analysis: microbes, DNA, and mRNA]

B
Method | Advantages | Limitations
--- | --- | ---
Culturome | High-throughput; targeted selection; provides microbial isolates | Expensive; laborious; influenced by media and the environment
Amplicon (16S/18S/ITS) | Quick analysis; low-biomass requirement; applicable to samples contaminated by host DNA | PCR and primer biases; resolution limited to genus level; false positives in low-biomass samples
Metagenome | Taxonomic resolution to species or strain level; functional potential; uncultured microbial genomes | Expensive; time-consuming analysis; host-derived contamination
Virome | Can identify RNA and DNA viruses; quick diagnosis | Most expensive; difficult to analyze; severe host-derived contamination
Metatranscriptome | Can identify live microbes; can evaluate microbial activity; transcript-level responses | Complex sample collection and analysis; expensive and complex sequencing; host mRNA and rRNA contamination

Figure 1. Advantages and limitations of HTS methods used in microbiome research. A Introduction to HTS methods for
different levels of analysis. At the molecule-level, microbiome studies are divided into three types: microbe, DNA, and mRNA. The
corresponding research techniques include culturome, amplicon, metagenome, metavirome, and metatranscriptome analyses. B The
advantages and limitations of various HTS methods for microbiome analysis.


and metagenomic sequencing (Fig. 1B). Amplicon sequencing, the most widely used HTS method for microbiome analysis, can be applied to almost all sample types. The major marker genes used in amplicon sequencing include 16S ribosomal DNA (rDNA) for prokaryotes and 18S rDNA and internal transcribed spacers (ITS) for eukaryotes. 16S rDNA amplicon sequencing is the most commonly used method, but there is currently a confusing array of available primers. A good way to select primers is to evaluate their specificity and overall coverage using real samples or electronic PCR based on the SILVA database (Klindworth et al., 2012), taking into account host factors including the presence of chloroplasts, mitochondria, ribosomes, and other potential sources of non-specific amplification. Alternatively, researchers can refer to the primers used in published studies similar to their own, which would save time in method optimization and facilitate the comparison of results among studies. Two-step PCR is typically used for amplification and to add barcodes and adaptors to each sample during library preparation (de Muinck et al., 2017). Sample sequencing is often performed on the Illumina MiSeq, HiSeq 2500, or NovaSeq 6000 platform in paired-end 250 bases (PE250) mode, which generates 50,000–100,000 reads per sample. Amplicon sequencing can be applied to low-biomass specimens or samples contaminated by host DNA. However, this technique can only reach genus-level resolution. Moreover, it is sensitive to the specific primers and number of PCR cycles chosen, which may lead to some false-positive or false-negative results in downstream analyses (Fig. 1B).

Metagenomic sequencing provides more information than amplicon sequencing, but it is more expensive. For ‘pure’ samples such as human feces, the accepted amount of sequencing data for each sample ranges from 6 to 9 gigabytes (GB) in a metagenomic project. The corresponding price for library construction and sequencing ranges from $100 to $300. For samples containing complex microbiota or contaminated with host-derived DNA, the required sequencing output ranges from 30 to 300 GB per sample (Xu et al., 2018). In brief, 16S rDNA amplicon sequencing could be used to study bacteria and/or archaea composition. Metagenomic sequencing is advisable for further analysis if higher taxonomic resolution and functional information are required (Arumugam et al., 2011; Smits et al., 2017). Of course, metagenomic sequencing could be used directly in studies with smaller sample sizes, assuming sufficient project funding is available (Carrión et al., 2019; Fresia et al., 2019).

Metatranscriptomic sequencing can profile mRNAs in a microbial community, quantify gene expression levels, and provide a snapshot for functional exploration of a microbial community in situ (Turner et al., 2013; Salazar et al., 2019). It is worth noting that host RNA and other rRNAs should be removed in order to obtain transcriptional information of microbiota (Fig. 1B).

Since viruses have either DNA or RNA as their genetic materials, technically, metavirome research involves a combination of metagenome and metatranscriptome analyses (Fig. 1A and 1B). Due to the low biomass of viruses in a sample, virus enrichment (Metsky et al., 2019) or the removal of host DNA (Charalampous et al., 2019) are essential steps for obtaining sufficient quantities of viral DNA or RNA for analysis (Fig. 1B).

The selection of sequencing methods depends on the scientific questions and sample types. The integration of different methods is advisable, as multi-omics provides insights into both the taxonomy and function of the microbiome. In practice, most researchers select only one or two HTS methods for analysis due to time and cost limitations. Although amplicon sequencing can provide only the taxonomic composition of microbiota, it is cost effective ($20–50 per sample) and can be applied to large-scale research. In addition, the amount of data generated from amplicon sequencing is relatively small, and the analysis is quick and easy to perform. For example, data analysis of 100 amplicon samples could be completed within a day using an ordinary laptop computer. Thus, amplicon sequencing is often used in pioneering research. In contrast to amplicon sequencing, metagenomic sequencing not only extends taxonomic resolution to the species- or strain-level but also provides potential functional information. Metagenomic sequencing also makes it possible to assemble microbial genomes from short reads. However, it does not perform well for low-biomass samples or those severely contaminated by the host genome (Fig. 1B).

ANALYSIS PIPELINES

“Analysis pipeline” refers to a particular program or script that combines several or even dozens of software programs organically in a certain order to complete a complex analysis task. As of January 23, 2020, the words “amplicon” and “metagenome” were mentioned more than 200,000 and 40,000 times in Google Scholar, respectively. Due to their wide usage, we will discuss the current best-practice pipelines for amplicon and metagenomic analysis. Researchers should get acquainted with the Shell environment and R language, which we discussed in our previous review (Liu et al., 2019b).

Amplicon analysis

The first stage of amplicon analysis is to convert raw reads (typically in fastq format) into a feature table (Fig. 2A). The raw reads are usually in paired-end 250 bases (PE250) mode and generated from the Illumina platforms. Other platforms, including Ion Torrent, PacBio, and Nanopore, are not discussed in this review and may not be suitable for the analysis pipelines discussed below. First, raw amplicon paired-end reads are grouped based on their barcode sequences (demultiplexing). Then the paired reads are merged to obtain amplicon sequences, and barcodes and primers are removed. A quality-control step is normally needed to remove low-quality amplicon sequences. All of these steps can be completed using USEARCH (Edgar,


[Figure 2 flowchart]
A Amplicon. Input: raw reads (fastq; paired-end 250/300 bp) → merging, removal of barcodes and primers, and quality control (QIIME/USEARCH) → intermediate files (fastq/fasta): merged clean amplicons → picking representative sequences, either by clustering (USEARCH) into operational taxonomic units (OTUs) or by denoising (DADA2/Deblur) into amplicon sequence variants (ASVs) → quantifying (QIIME2/USEARCH) → output (feature table): taxonomic table (OTU/ASV/genus…) → functional prediction (PICRUSt/Tax4Fun) → functional table (Gene/KO/pathway…).
B Metagenome. Input: raw reads (paired-end 100/150 bp) → quality control and host removal (KneadData / Trimmomatic & Bowtie 2) → clean reads → either reads-based analysis against curated databases, with taxonomic profiling (MetaPhlAn2/Kraken2) and functional profiling (HUMAnN2/MEGAN), or assembly-based analysis (MEGAHIT/metaSPAdes) → contigs → prediction (metaGeneMark/Prokka) → quantifying (Salmon/Bowtie 2) → output (feature table): taxonomic table (OTU/ASV/genus…) and functional table (Gene/KO/pathway…).

Figure 2. Workflow of commonly used methods for amplicon (A) and metagenomic (B) sequencing. Blue, orange, and green
blocks represent input, intermediate, and output files, respectively. The text next to the arrow represents the method, with frequently
used software shown in parentheses. Taxonomic and functional tables are collectively referred to as feature tables. Please see
Table 1 for more information about the software listed in this figure.
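
As a rough illustration of the last steps in Fig. 2A (picking representative sequences by clustering, then quantifying them into a feature table), the following Python sketch may help. It is a toy under stated assumptions, not the UPARSE algorithm or any USEARCH/QIIME command: the function names (`identity`, `pick_otus`, `feature_table`) and the example reads are hypothetical, sequence identity is approximated by position-wise matching of equal-length sequences rather than by alignment, and a simple greedy centroid clustering at a 97% threshold stands in for the real tools.

```python
from collections import Counter, defaultdict


def identity(a, b):
    """Fraction of identical positions between two sequences.

    Toy stand-in for alignment identity: assumes equal-length,
    already-aligned sequences and returns 0.0 otherwise.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_otus(reads_by_sample, threshold=0.97):
    """Greedy centroid clustering of unique sequences, most abundant first."""
    # Dereplicate: count each unique sequence across all samples.
    totals = Counter(seq for reads in reads_by_sample.values() for seq in reads)
    centroids = []    # representative sequence of each OTU
    assignment = {}   # unique sequence -> index of its OTU
    for seq, _count in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        for index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignment[seq] = index
                break
        else:
            # No centroid is similar enough: this sequence founds a new OTU.
            centroids.append(seq)
            assignment[seq] = len(centroids) - 1
    return centroids, assignment


def feature_table(reads_by_sample, assignment):
    """Count reads per OTU in each sample (rows: OTUs, columns: samples)."""
    table = defaultdict(Counter)
    for sample, reads in reads_by_sample.items():
        for seq in reads:
            table[f"OTU_{assignment[seq] + 1}"][sample] += 1
    return {otu: dict(counts) for otu, counts in table.items()}


# Hypothetical merged, quality-filtered amplicons (100 nt each).
base = "ACGT" * 25
variant = base[:10] + "A" + base[11:]   # one mismatch: 99% identity to base
other = "TTGCA" * 20                    # unrelated sequence
reads = {"S1": [base, base, variant, other], "S2": [other, other, base]}

centroids, assignment = pick_otus(reads)
print(feature_table(reads, assignment))
```

In this toy data the one-mismatch variant falls into the same OTU as its parent sequence, which illustrates why clustering at 97% can hide subtle differences that a denoising approach would keep as separate ASVs.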

2010) or QIIME (Caporaso et al., 2010). Alternatively, clean amplicon data supplied by sequencing service providers can be used for the next step of the analysis (Fig. 2A).

Picking the representative sequences as proxies of a species is a key step in amplicon analysis. Two major approaches for representative sequence selection are clustering to OTUs and denoising to ASVs. The UPARSE algorithm clusters sequences with 97% similarity into OTUs (Edgar, 2013). However, this method may fail to detect subtle differences among species or strains. DADA2 is a recently developed denoising algorithm that outputs ASVs as more exact representative sequences (Callahan et al., 2016). Denoising is available through denoise-paired/single by DADA2 and denoise-16S by Deblur in QIIME 2 (Bolyen et al., 2019), and through -unoise3 in USEARCH (Edgar and Flyvbjerg, 2015). Finally, a feature table (OTU/ASV table) can be obtained by quantifying the frequency of the feature sequences in each sample. Simultaneously, the feature sequences can be assigned taxonomy, typically at the kingdom, phylum, class, order, family, genus, and species levels, providing a dimensionality reduction perspective on the microbiota.

In general, 16S rDNA amplicon sequencing can only be used to obtain information about taxonomic composition. However, many available software packages have been developed to predict potential functional information. The principle behind this prediction is to link the 16S rDNA sequences or taxonomy information with functional descriptions in the literature. PICRUSt (Langille et al., 2013), which is based on the OTU table of the Greengenes database (McDonald et al., 2011), could be used to predict the metagenomic functional composition (Zheng et al., 2019) of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa and Goto, 2000). The newly developed PICRUSt2 software package (https://github.com/picrust/picrust2) can directly predict metagenomic functions based on an arbitrary OTU/ASV table. The R package Tax4Fun (Asshauer et al., 2015) can predict KEGG functional capabilities of microbiota based on the SILVA database (Quast et al., 2013). The functional annotation of prokaryotic taxa (FAPROTAX) pipeline performs functional annotation based on published metabolic and ecological functions such as nitrate respiration, iron respiration, plant pathogen, and animal parasites or symbionts, making it useful for environmental (Louca et al., 2016), agricultural (Zhang et al., 2019), and animal (Ross et al., 2018) microbiome research. BugBase is an extended database of Greengenes used to predict phenotypes such as oxygen tolerance, Gram staining,


Table 1. Introduction to software for amplicon and metagenomic analysis

Name | Link | Description and advantages | Reference
--- | --- | --- | ---
QIIME | http://qiime.org | The most highly cited and comprehensive amplicon analysis pipeline, providing hundreds of scripts for analyzing various data types and visualizations | (Caporaso et al., 2010)
QIIME 2 | https://qiime2.org; https://github.com/YongxinLiu/QIIME2ChineseManual | This next-generation amplicon pipeline provides integrated command lines and GUI, and supports reproducible analysis and big data. Provides interactive visualization and Chinese tutorial documents and videos | (Bolyen et al., 2019)
USEARCH | http://www.drive5.com/usearch; https://github.com/YongxinLiu/UsearchChineseManual | Alignment tool includes more than 200 subcommands for amplicon analysis with a small size (1 Mb), cross-platform, high-speed calculation, and free 32-bit version. The 64-bit version is commercial ($1485) | (Edgar, 2010)
VSEARCH | https://github.com/torognes/vsearch | A free USEARCH-like software tool. We recommend using it alone or in addition to USEARCH. Available as a plugin in QIIME 2 | (Rognes et al., 2016)
Trimmomatic | http://www.usadellab.org/cms/index.php?page=trimmomatic | Java based software for quality control of metagenomic raw reads | (Bolger et al., 2014)
Bowtie 2 | http://bowtie-bio.sourceforge.net/bowtie2 | Rapid alignment tool used to remove host contamination or for quantification | (Langmead and Salzberg, 2012)
MetaPhlAn2 | https://bitbucket.org/biobakery/metaphlan2 | Taxonomic profiling tool with a marker gene database from more than 10,000 species. The output is relative abundance of strains | (Truong et al., 2015)
Kraken 2 | https://ccb.jhu.edu/software/kraken2 | A taxonomic classification tool that uses exact k-mer matches to the NCBI database, high accuracy and rapid classification, and outputs reads counts for each species | (Wood et al., 2019)
HUMAnN2 | https://bitbucket.org/biobakery/humann2 | Based on the UniRef protein database, calculates gene family abundance, pathway coverage, and pathway abundance from metagenomic or metatranscriptomic data. Provide species’ contributions to a specific function | (Franzosa et al., 2018)
MEGAN | https://github.com/husonlab/megan-ce; http://www-ab.informatik.uni-tuebingen.de/software/megan6 | A GUI, cross-platform software for taxonomic and functional analysis of metagenomic data. Supports many types of visualizations with metadata, including scatter plot, word clouds, Voronoi tree maps, clustering, and networks | (Huson et al., 2016)
MEGAHIT | https://github.com/voutcn/megahit | Ultra-fast and memory-efficient metagenomic assembler | (Li et al., 2015)
metaSPAdes | http://cab.spbu.ru/software/spades | High-quality metagenomic assembler but time-consuming and large memory requirement | (Nurk et al., 2017)
MetaQUAST | http://quast.sourceforge.net/metaquast | Evaluates the quality of metagenomic assemblies, including N50 and misassemble, and outputs PDF and interactive HTML reports | (Mikheenko et al., 2016)
MetaGeneMark | http://exon.gatech.edu/GeneMark/ | Gene prediction in bacteria, archaea, metagenome and metatranscriptome. Support Linux/MacOSX system. Provides webserver for online analysis | (Zhu et al., 2010)
Prokka | http://www.vicbioinformatics.com/software.prokka.shtml | Provides rapid prokaryotic genome annotation, calls metaProdigal (Hyatt et al., 2012) for metagenomic gene prediction. Outputs nucleotide sequences, protein sequences, and annotation files of genes | (Seemann, 2014)
CD-HIT | http://weizhongli-lab.org/cd-hit | Used to construct non-redundant gene catalogs | (Fu et al., 2012)
Salmon | https://combine-lab.github.io/salmon | Provides ultra-fast quantification of reads counts of genes using a k-mer-based method | (Patro et al., 2017)
metaWRAP | https://github.com/bxlab/metaWRAP | Binning pipeline includes 140 tools and supports conda install, default binning by MetaBAT, MaxBin, and CONCOCT. Provides refinement, quantification, taxonomic classification and visualization of bins | (Uritskiy et al., 2018)
DAS Tool | https://github.com/cmks/DAS_Tool | Binning pipeline that integrates five binning software packages and performs refinement | (Sieber et al., 2018)
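
Table 1 describes Kraken 2 as classifying reads by exact k-mer matches to a reference database. The Python sketch below illustrates that general idea only and is not Kraken 2's implementation: the taxonomy (`PARENT`), the reference sequences, the k-mer length, and all function names are hypothetical. Each k-mer is mapped to the lowest common ancestor (LCA) of the reference taxa that contain it, and a read is then labeled with the hit taxon whose root-to-taxon path collects the most k-mer hits. Ties between sibling taxa are broken alphabetically here, which is a simplification.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent ("root" has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Bacillus": "Bacteria",
    "B. subtilis": "Bacillus",
}


def lineage(taxon):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the LCA of all reference taxa that contain it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in set(kmers(seq, k)):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Label a read by exact k-mer lookups against the index."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return "unclassified"

    def score(taxon):
        # Hits on the taxon itself plus hits on all of its ancestors.
        return sum(hits.get(t, 0) for t in lineage(taxon))

    return max(sorted(hits), key=score)


# Hypothetical reference sequences (far shorter than real genomes).
references = {
    "E. coli": "ACGTACGGTTCA",
    "E. albertii": "ACGTACGGAAGC",
    "B. subtilis": "TTGACCTTGGAC",
}
index = build_index(references, k=4)

for read in ["TACGGTTC", "GTACGG", "GGGGGG"]:
    print(read, "->", classify(read, index, k=4))
```

A read made only of k-mers shared by both Escherichia references can be resolved no further than the genus, whereas a read that also covers species-specific k-mers is assigned to the species.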

and pathogenic potential (Ward et al., 2017); this database is mainly used in medical research (Mahnert et al., 2019).

Metagenomic analysis

Compared to amplicon sequencing, shotgun metagenomic sequencing can provide functional gene profiles directly and reach a much higher resolution of taxonomic annotation. However, the analysis is demanding because of the large amount of data, the fact that most software is only available for Linux systems, and the large amount of computing resources needed. To facilitate software installation and maintenance, we recommend using the package manager Conda with the BioConda channel (Grüning et al., 2018) to deploy metagenomic analysis pipelines. Since metagenomic analysis is computationally intensive, it is better to run multiple tasks/samples in parallel, which requires software such as GNU Parallel for queue management (Tange, 2018).

The Illumina HiSeqX/NovaSeq system often produces PE150 reads for metagenomic sequencing, whereas reads generated by BGI-Seq500 are in PE100 mode. The first crucial step in metagenomic analysis is quality control and the removal of host contamination from raw reads, which requires the KneadData pipeline (https://bitbucket.org/biobakery/kneaddata) or a combination of Trimmomatic (Bolger et al., 2014) and Bowtie 2 (Langmead and Salzberg, 2012). Trimmomatic is a flexible quality-control software package for Illumina sequencing data that can be used to trim low-quality sequences, library primers and adapters. Reads mapped to host genomes using Bowtie 2 are treated as contaminated reads and filtered out. KneadData is an integrated pipeline, including Trimmomatic, Bowtie 2, and related scripts, that can be used for quality control, to remove host-derived reads, and to output clean reads (Fig. 2B).

The main step in metagenomic analysis is to convert clean data into taxonomic and functional tables using reads-based and/or assembly-based methods. The reads-based methods align clean reads to curated databases and output feature tables (Fig. 2B). MetaPhlAn2 is a commonly used taxonomic profiling tool that aligns metagenome reads to a pre-defined marker-gene database to perform taxonomic classification (Truong et al., 2015). Kraken 2 performs exact k-mer matching to sequences within the NCBI non-redundant database and uses lowest common ancestor (LCA) algorithms to perform taxonomic classification (Wood et al., 2019). For a review about benchmarking 20 tools of taxonomic classification, please see Ye et al. (2019). HUMAnN2 (Franzosa et al., 2018), the widely used functional profiling software, can also be used to explore within- and between-sample contributional diversity (species’ contributions to a specific function). MEGAN (Huson et al., 2016) is a cross-platform graphical user interface (GUI) software that performs taxonomic and functional analyses (Table 1). In addition, various metagenomic gene catalogs are available, including catalogs curated from the human gut (Li et al., 2014; Pasolli et al., 2019; Tierney et al., 2019), the mouse gut (Xiao et al., 2015), the chicken gut (Huang et al., 2018), the cow rumen (Stewart et al., 2018; Stewart et al., 2019), the ocean (Salazar et al., 2019), and the citrus rhizosphere (Xu et al., 2018). These customized databases can be used for taxonomic and functional annotation in the appropriate field of study, allowing efficient, precise, rapid analysis.

Assembly-based methods assemble clean reads into contigs using tools such as MEGAHIT or metaSPAdes (Fig. 2B). MEGAHIT is used to assemble large, complex metagenome datasets quickly using little computer memory (Li et al., 2015), while metaSPAdes can generate longer contigs but requires more computational resources (Nurk et al., 2017). Genes present in assembled contigs are then identified using metaGeneMark (Zhu et al., 2010) or Prokka (Seemann, 2014). Redundant genes from separately assembled contigs must be removed using tools such as CD-HIT (Fu et al., 2012). Finally, a gene abundance table can be generated using alignment-based tools such as Bowtie 2 or alignment-free methods such as Salmon (Patro et al., 2017). Millions of genes are normally present in a metagenomic dataset. These genes must be combined into functional annotations, such as KEGG Orthology (KO), modules and pathways, representing a form of dimensional reduction (Kanehisa et al., 2016).

In addition, metagenomic data can be used to mine gene clusters or to assemble draft microbe genomes. The antiSMASH database is used to identify, annotate, and visualize gene clusters involved in secondary metabolite biosynthesis (Blin et al., 2018). Binning is a method that can be used to recover partial or complete bacterial genomes in metagenomic data. Available binning tools include CONCOCT (Alneberg et al., 2014), MaxBin 2 (Wu et al., 2015), and


[Figure 3 panels: A, α diversity (index by group); B, β diversity (PCo 1 vs. PCo 2); C, taxonomy (relative abundance by group); D, difference comparison (−log10(P-value)); E, correlation; F, network analysis; G, classification; H, phylogenetic tree. Center: a taxonomic table (OTU_1 … OTU_n) and a functional table (KO_01 … KO_n) with sample IDs S1–S4 as columns, annotated with taxonomy, feature abundance, and sample metadata.]

Figure 3. Overview of statistical and visualization methods for feature tables. Downstream analysis of microbiome feature
tables, including alpha/beta-diversity (A/B), taxonomic composition (C), difference comparison (D), correlation analysis (E), network
analysis (F), classification of machine learning (G), and phylogenetic tree (H). Please see Table 2 for more details.
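
To make the alpha- and beta-diversity panels (Fig. 3A and 3B) more concrete, the following Python sketch computes two alpha-diversity measures (observed richness and the Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity) from a small feature table. The choice of these particular indices, the function names, and the counts are assumptions made for illustration; the sketch also assumes that every sample has at least one read and that samples have been brought to comparable sequencing depth beforehand.

```python
import math


def relative_abundance(counts):
    """Convert raw counts of one sample into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def richness(counts):
    """Observed richness: number of features with a nonzero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over features with p_i > 0."""
    return -sum(p * math.log(p) for p in relative_abundance(counts) if p > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / (sum(a) + sum(b))."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    return numerator / (sum(a) + sum(b))


# Hypothetical feature table: one list of OTU counts per sample,
# with the same OTU order and the same total depth (20 reads) in each.
table = {
    "S1": [10, 10, 0, 0],
    "S2": [5, 5, 5, 5],
    "S3": [10, 10, 0, 0],
}

for sample, counts in table.items():
    print(sample, richness(counts), round(shannon(counts), 3))

# Pairwise distances like these are the input to PCoA ordination.
samples = sorted(table)
for i, first in enumerate(samples):
    for second in samples[i + 1:]:
        print(first, second, bray_curtis(table[first], table[second]))
```

Here S2 is more even and richer than S1, so its Shannon index is higher, while S1 and S3 have identical profiles and therefore a Bray-Curtis dissimilarity of zero.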

MetaBAT2 (Kang et al., 2015). Binning tools cluster contigs into different bins (draft genomes) based on tetranucleotide frequency and contig abundance. Reassembly is performed to obtain better bins. We recommend using a binning pipeline such as metaWRAP (Uritskiy et al., 2018) or DAS Tool (Sieber et al., 2018), which integrate several binning software packages to obtain refined binning results and more complete genomes with less contamination. These pipelines also supply useful scripts for evaluation and visualization. For a more comprehensive review on metagenomic experiments and analysis, we recommend Quince et al. (2017).

STATISTICAL ANALYSIS AND VISUALIZATION

The most important output files from amplicon and metagenomic analysis pipelines are taxonomic and functional tables (Figs. 2 and 3). The scientific questions that researchers could answer using these techniques include the following: Which microbes are present in the microbiota? Do different experimental groups show significant differences in alpha- and beta-diversity? Which species, genes, or functional pathways are biomarkers of each group? To answer these questions, methods are needed for both overall and detailed statistical analysis and visualization. Overall visualization


Table 2. Introduction to various analysis and visualization methods

Method | Scientific question | Visualization | Description and example reference
--- | --- | --- | ---
Alpha diversity | Within-sample diversity | Boxplot | Distribution (Edwards et al., 2015) or significant difference (Zhang et al., 2019) of alpha diversity among groups (Fig. 3A)
 | | Rarefaction curve | Sample diversity changes with sequencing depth or evaluation of sequencing saturation (Beckers et al., 2017)
 | | Venn diagram | Common or unique taxa (Ren et al., 2019)
Beta diversity | Distance among samples or groups | Unconstrained PCoA scatter plot | Major differences of samples showing group differences (Fig. 3B) or gradient changes with time (Zhang et al., 2018b)
 | | Constrained PCoA scatter plot | Major differences among groups (Zgadzaj et al., 2016; Huang et al., 2019)
 | | Dendrogram | Hierarchical clustering of samples (Chen et al., 2019)
Taxonomic composition | Relative abundance of features | Stacked bar plot | Taxonomic composition of each sample (Beckers et al., 2017) or group (Jin et al., 2017) (Fig. 3C)
 | | Flow or alluvial diagram | Relative abundance (RA) of taxonomic changes among seasons (Smits et al., 2017) or time-series (Zhang et al., 2018b)
 | | Sankey diagram | A variety of Venn diagrams showing changes in RA and common or unique features among groups (Smits et al., 2017)
Difference comparison | Significantly different biomarkers between groups | Volcano plot | A variety of scatter plots showing P-value, RA, fold change, and number of differences (Shi et al., 2019a)
 | | Manhattan plot | A variety of scatter plots showing P-values, taxonomy, and highlighting significantly different biomarkers (Zgadzaj et al., 2016) (Fig. 3D)
 | | Extended bar plot | Bar plot of RA combined with difference and confidence intervals (Parks et al., 2014)
Correlation analysis | Correlation between features and sample metadata | Scatter plot with linear fitting | Shows changes in features with time (Metcalf et al., 2016) or relationships with other numeric metadata (Fig. 3E)
 | | Corrplot | Correlation coefficient or distance triangular matrix visualized by color and/or shape (Zhang et al., 2018b)
 | | Heatmap | RA of features that change with time (Subramanian et al., 2014)
Network Global view correlation of features Colored based Finding correlation patterns of features based on
analysis on taxonomy or taxonomy (Fig. 3F) and/or modules (Jiao et al., 2016)
modules
Colors highlight Highlighting important features and showing their
important positions and connections (Wang et al., 2018b)
features
Machine Classification groups or Heatmap Colored block showing classification results (Fig. 3G)
learning regression analysis for numeric (Wilck et al., 2017) or feature patterns in a time series
metadata prediction (Subramanian et al., 2014).
Bar plot Feature importance, RA (Zhang et al., 2019), and
increase in mean squared error (Subramanian et al.,
2014).
Treemap Phylogenetic tree or taxonomy Phylogenetic Phylogenetic tree (Fig. 3H) shows relationship of OTUs
hierarchy tree or or species (Levy et al., 2018). Taxonomic cladogram
cladogram highlighting interesting biomarkers (Segata et al.,
2011).

322 © The Author(s) 2020


A practical guide to amplicon and metagenomic analysis of microbiome data REVIEW

Table 2 continued

Method Scientific question Visualization Description and example reference

Circular tree map Shows features in a hierarchy color bubble (Carrión


et al., 2019)
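
Several rows of Table 2 rely on P-values computed for many features at once, and the text notes that P-values should be adjusted when multiple comparisons are made. The sketch below illustrates one common adjustment, the Benjamini-Hochberg false discovery rate procedure. The choice of this procedure is an assumption (the text does not prescribe a specific correction), and the feature names and P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    The adjusted value for the P-value of rank r (1 = smallest) among m tests
    is min over all ranks k >= r of p_(k) * m / k, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values from per-feature tests between two groups.
raw = {
    "feature_1": 0.001,
    "feature_2": 0.02,
    "feature_3": 0.03,
    "feature_4": 0.04,
    "feature_5": 0.5,
}

adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for feature, q in adjusted.items():
    print(feature, round(q, 4))
```

Features would then be reported as significant based on the adjusted values rather than the raw P-values.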

can be used to explore differences in alpha/beta-diversity and taxonomic composition in a feature table. Detailed analysis could involve identifying biomarkers via comparison, correlation analysis, network analysis, and machine learning (Fig. 3). We will discuss these methods below and provide examples and references to facilitate such studies (Fig. 3 and Table 2).

Alpha diversity evaluates the diversity within a sample, including richness and evenness measurements. Several software packages can be used to calculate alpha diversity, including QIIME, the R package vegan (Oksanen et al., 2007), and USEARCH. The alpha diversity values of samples in each group could be visually compared using boxplots (Fig. 3A). The differences in alpha diversity among or between groups could be statistically evaluated using analysis of variance (ANOVA), the Mann-Whitney U test, or the Kruskal-Wallis test. It is important to note that P-values should be adjusted for multiple comparisons when a group is compared more than once. Other visualization methods for alpha diversity indices are described in Table 2.

Beta diversity evaluates differences in the microbiome among samples and is normally combined with dimensional reduction methods such as principal coordinate analysis (PCoA), non-metric multidimensional scaling (NMDS), or constrained principal coordinate analysis (CPCoA) to obtain visual representations. These analyses can be implemented in the R vegan package and visualized in scatter plots (Fig. 3B and Table 2). The statistical significance of differences in beta diversity between groups can be tested using permutational multivariate analysis of variance (PERMANOVA) with the adonis() function in vegan (Oksanen et al., 2007).

Taxonomic composition describes the microbiota that are present in a microbial community, which is often visualized using a stacked bar plot (Fig. 3C and Table 2). For simplicity, the microbiota is often shown at the phylum or genus level in the plot.

Difference comparison is used to identify features (such as species, genes, or pathways) with significantly different abundances between groups using Welch's t-test, the Mann-Whitney U test, the Kruskal-Wallis test, or tools such as ALDEx2, edgeR (Robinson et al., 2010), STAMP (Parks et al., 2014), or LEfSe (Segata et al., 2011). The results of difference comparison can be visualized using a volcano plot, Manhattan plot (Fig. 3D), or extended error bar plot (Table 2). It is important to note that this type of analysis is prone to producing false positives because relative abundances are compositional: an increase in the relative abundance of some features necessarily causes an apparent decrease in other features. Several methods have been developed to obtain taxonomic absolute abundance in samples, such as the integration of HTS and flow cytometric enumeration (Vandeputte et al., 2017), and the integration of HTS with spike-in plasmid and quantitative PCR (Tkacz et al., 2018; Guo et al., 2020; Wang et al., 2020b).

Correlation analysis is used to reveal the associations between taxa and sample metadata (Fig. 3E). For example, it is used to identify associations between taxa and environmental factors, such as pH, longitude and latitude, and

Table 3. Useful websites or tools for reproducible analysis

Resource | Links | Description
GSA | http://gsa.big.ac.cn | HTS data deposition and sharing. Fast data transfer, interfaces in both Chinese and English, automated submission, technical support via email or QQ group, and widely recognized by international journals
Qiita | https://qiita.ucsd.edu | Platform for amplicon data deposition, analysis, and cross-study comparisons
MGnify | https://www.ebi.ac.uk/metagenomics | Webserver for amplicon and metagenomic data deposition, sharing, analysis, and cross-study comparisons
gcMeta | https://gcmeta.wdcm.org | Webserver for amplicon and metagenomic data analysis, deposition, and sharing
R Markdown | https://rmarkdown.rstudio.com | Uses a productive notebook interface to weave together narrative text and code to produce an elegantly formatted report in HTML or PDF format. Is becoming increasingly popular in microbiome research
R Graph Gallery | https://www.r-graph-gallery.com | R code for 42 chart types
GitHub | https://github.com | Online code-saving and sharing platforms with version control systems. Supports searching
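
Table 3 lists resources for reproducible analysis. A complementary habit is to record the software versions with which an analysis was run. As a minimal sketch under stated assumptions, the Python snippet below collects the interpreter version, platform, and installed package versions and can write them to a JSON file stored next to the results. The function names and the output file name are hypothetical, and this is only a rough analogue of exporting a full Conda environment, not a replacement for it.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name, version = dist.metadata["Name"], dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name and version:
            packages[name] = version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_environment_report(path="environment_report.json"):
    """Write the captured environment to a JSON file and return it."""
    report = capture_environment()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report


report = capture_environment()
print("Python", report["python"], "with", len(report["packages"]), "packages")
```

Committing such a report together with the analysis scripts makes it easier for others to rebuild a comparable environment.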


clinical indices, or to identify key environmental factors that affect microbiota and dynamic taxa in a time series (Edwards et al., 2018).

Network analysis explores the co-occurrence of features from a holistic perspective (Fig. 3F). The properties of a correlation network might represent potential interactions between co-occurring taxa or functional pathways. Correlation coefficients and their P-values can be computed using the cor.test() function in R or more robust tools that are suitable for compositional data such as the SparCC (sparse correlations for compositional data) package (Kurtz et al., 2015). Networks could also be visualized and analyzed using the R library igraph (Csardi and Nepusz, 2006), Cytoscape (Saito et al., 2012), or Gephi (Bastian et al., 2009). There are several good examples of network analysis, such as studies exploring the distribution of phyla or modules (Fan et al., 2019) or showing trends at different time points (Wang et al., 2019).

Machine learning is a branch of artificial intelligence that learns from data, identifies patterns, and makes decisions (Fig. 3G). In microbiome research, machine learning is used for taxonomic classification, beta-diversity analysis, binning, and compositional analysis of particular features. Commonly used machine learning methods include random forest (Vangay et al., 2019; Qian et al., 2020), AdaBoost (Wilck et al., 2017), and deep learning (Galkin et al., 2018). These methods are used to classify groups by selecting biomarkers, or in regression analysis to show experimental condition-dependent changes in biomarker abundance (Table 2).

Treemap is widely used for phylogenetic tree construction and for taxonomic annotation and visualization of the microbiome (Fig. 3H). Representative amplicon sequences are readily used for phylogenetic analysis. We recommend using IQ-TREE (Nguyen et al., 2014) to quickly build high-confidence phylogenetic trees using big data and online visualization using iTOL (Letunic and Bork, 2019). Annotation files for the tree can easily be generated using the R script table2itol (https://github.com/mgoeker/table2itol). In addition, we recommend using GraPhlAn (Asnicar et al., 2015) to visualize the phylogenetic tree or hierarchical taxonomy in an attractive cladogram.

In addition, researchers may be interested in examining microbial origin to address issues such as the origin of gut microbiota and river pollution, as well as for forensic testing. FEAST (Shenhav et al., 2019) and SourceTracker (Knights et al., 2011) were designed to unravel the origins of microbial communities. If researchers would like to focus on the regulatory relationship between genetic information from the host and microorganisms (Wang et al., 2018a), genome-wide association analysis (GWAS) might be a good choice (Wang et al., 2016).

REPRODUCIBLE ANALYSIS

Reproducible analysis requires that researchers submit their data and code along with their publications instead of merely describing their methods. Reproducibility is critical for microbiome analysis because it is impossible to reproduce results without raw data, detailed sample metadata, and analysis code. If readers can run the code, they will better understand what has been done in the analyses. We recommend that researchers share their sequencing data, metadata, analysis code, and detailed statistical reports using the following steps:

Upload and share raw data and metadata in a data center

Amplicon or metagenomic sequencing generates a large volume of raw data. Normally, raw data must be uploaded to data centers such as NCBI, EBI, and DDBJ during publication. In recent years, several repositories have also been established in China to provide data storage and sharing services. For example, the Genome Sequence Archive (GSA) established by the Beijing Institute of Genomics, Chinese Academy of Sciences (Wang et al., 2017; Members, 2019) has several advantages (Table 3). We recommend that researchers upload raw data to one of these repositories, which not only provides backup but also meets the requirements for publication. Several journals such as Microbiome require that the raw data be deposited in repositories before the manuscript is submitted.

Share pipeline scripts with other researchers

Pipeline scripts could help reviewers or readers evaluate the reproducibility of experimental results. We provide sample pipeline scripts for amplicon and metagenome analyses at https://github.com/YongxinLiu/Liu2020ProteinCell. The running environment and software versions used in analysis should also be provided to help ensure reproducibility. If Conda is used to deploy software, the command “conda env export -n environment_name > environment.yaml” can generate a file containing both the software used and the corresponding versions for reproducible usage. For users who are not familiar with command lines, webservers such as Qiita (Gonzalez et al., 2018), MGnify (Mitchell et al., 2020), and gcMeta (Shi et al., 2019b) could be used to perform analysis. However, webservers are less flexible than the command line mode because they provide fewer adjustable steps and parameters.

Provide detailed statistical and visualization reports

The tools used for statistical analysis and visualization of a feature table include Excel, GraphPad, and SigmaPlot, but these are commercial software tools, and it is difficult to quickly reproduce results with them. We recommend using tools such as R Markdown or Python Notebooks to trace all analysis code and parameters and storing them in a version control management system such as GitHub (Table 3). These tools are free, open-source, cross-platform, and easy-to-use. We recommend that researchers record all scripts and results of statistical analysis and visualization in R Markdown files. An R Markdown document is a fully reproducible report that includes code, tables, and figures in HTML/PDF format. This work mode would greatly improve the efficiency of microbiome analysis and make the analysis process transparent and easier to understand. For R visualization code, researchers can refer to the R Graph Gallery (Table 3). The input files (feature tables + metadata), analysis notebook (*.Rmd), and output results (figures, tables, and HTML reports) of the analysis can be uploaded to GitHub, which would allow peers to repeat your analyses or reuse your analysis code. ImageGP (http://www.ehbio.com/ImageGP) provides more than 20 statistical and visualization methods, making it a good choice for researchers without a background in R.

NOTES AND PERSPECTIVES

It is worth noting that experimental operations have a far greater impact on the results of a study than the pipeline chosen for analysis (Sinha et al., 2017). It is better to record detailed experimental processes as metadata, which includes sampling method, time, location, operators, DNA extraction kit, batch, primers, and barcodes. The metadata can be used for downstream analyses and help researchers to determine whether these operational differences contribute to false-positive results (Costea et al., 2017). Some specific experimental steps could be used to provide a unique perspective on microbiome analysis. For example, the development and use of methods to remove the host DNA can effectively increase the proportion of the microbiome in plant endophytes (Carrión et al., 2019) and human respiratory infection samples (Charalampous et al., 2019). A large amount of relic DNA in soil can be physically removed with propidium monoazide (Carini et al., 2016). In addition, when using samples with low microbial biomass, researchers must be particularly careful to avoid false-positive results due to contamination (de Goffau et al., 2019). For these situations, DNA-free water should be used as a negative control. In human microbiome studies, the major differences in microbiome composition among individuals are due to factors such as diet, lifestyle, and drug use, such that the heritability is less than 2% (Rothschild et al., 2018). For recommendations about information that should be collected, please refer to minimum information about a marker gene sequence (MIMARKS) and minimum information about a metagenome sequence (Field et al., 2008; Yilmaz et al., 2011), minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea (Bowers et al., 2017), and minimum information about an uncultivated virus genome (Roux et al., 2019). In the early stage of microbiome research, data-driven studies provided the basic components and conceptual framework of the microbiome; however, with the development of experimental tools, more hypothesis-driven studies are needed to dissect the causal relationships between the microbiome and host phenotypes.

Shotgun metagenomic sequencing could provide insights into microbial community structure at the strain level, but it is difficult to recover high-quality genomes (Bishara et al., 2018). Single-cell genome sequencing shows very promising applications in microbiome research (Xu and Zhao, 2018). Based on flow cytometry and single-cell sequencing, MetaSort could recover high-quality genomes from sorted sub-metagenomes (Ji et al., 2017). Recently developed third-generation sequencing techniques have been used for metagenome analysis, including Pacific Biosciences (PacBio) single molecule real time sequencing and the Oxford Nanopore Technologies sequencing platform (Bertrand et al., 2019; Stewart et al., 2019; Moss et al., 2020). With the improvement in sequencing data quality and decreasing costs, these techniques will lead to a technological revolution in the field of microbiome sequencing and bring microbiome research into a new era.

CONCLUSION

In this review, we discussed methods for analyzing amplicon and metagenomic data at all stages, from the selection of sequencing methods, analysis software/pipelines, statistical analysis and visualization to the implementation of reproducible analysis. Other methods such as metatranscriptome, metaproteome, and metabolome analysis may provide a better perspective on the dynamics of the microbiome, but these methods have not been widely accepted due to their high cost and the complex experimental and analysis methods required. With the further development of these technologies in the future, a more comprehensive view of the microbiome could be obtained.

ACKNOWLEDGEMENTS

This work was supported by grants from the Strategic Priority Research Program of the Chinese Academy of Sciences (Precision Seed Design and Breeding, XDA24020104), the Key Research Program of Frontier Sciences of the Chinese Academy of Sciences (grant no. QYZDB-SSW-SMC021), and the National Natural Science Foundation of China (grant no. 31772400).

ABBREVIATIONS

ANOVA, analysis of variance; ASV, amplicon sequence variants; CAS-CMI, Chinese Academy of Sciences Initiative of Microbiome; CPCoA, constrained principal coordinate analysis; FAPROTAX, Functional Annotation of Prokaryotic Taxa; GB, gigabyte; GSA, Genome Sequence Archive; GUI, graphical user interface; GWAS, genome-wide association analysis; HMP, Human Microbiome Project; HTS, high-throughput sequencing; iHMP, integrative HMP; KEGG, Kyoto Encyclopedia of Genes and Genomes; KO, KEGG Ortholog; LCA, lowest common ancestor; MetaHIT, Metagenomics of the Human Intestinal Tract; NMDS, non-metric multidimensional scaling; OTU, operational taxonomic units; PacBio, Pacific


Biosciences; PERMANOVA, permutational multivariate analysis of variance; PE250, paired-end 250 bp; PCoA, principal coordinate analysis; RA, relative abundance; rDNA, ribosomal DNA.

COMPLIANCE WITH ETHICS GUIDELINES

Yong-Xin Liu, Xubo Qian and Yang Bai contributed to writing the paper. Yuan Qin designed and drew the figures. Tong Chen tested all the software mentioned in this review and shared the code. All authors read, revised, and approved this paper. Yong-Xin Liu, Yuan Qin, Tong Chen, Xubo Qian, Meiping Lu, Xiaoxuan Guo and Yang Bai declare that they have no conflict of interest. This article does not contain any studies with human or animal subjects performed by any of the authors.

OPEN ACCESS

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

REFERENCES

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C (2014) Binning metagenomic contigs by coverage and composition. Nat Methods 11:1144–1146
Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, Fernandes GR, Tap J, Bruls T, Batto JM et al (2011) Enterotypes of the human gut microbiome. Nature 473:174–180
Asnicar F, Weingart G, Tickle TL, Huttenhower C, Segata N (2015) Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ 3:e1029
Asshauer KP, Wemheuer B, Daniel R, Meinicke P (2015) Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics 31:2882–2884
Bai Y, Müller DB, Srinivas G, Garrido-Oter R, Potthoff E, Rott M, Dombrowski N, Münch PC, Spaepen S, Remus-Emsermann M et al (2015) Functional overlap of the Arabidopsis leaf and root microbiota. Nature 528:364–369
Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring and manipulating networks. In: Third international AAAI conference on weblogs and social media
Beckers B, Op De Beeck M, Weyens N, Boerjan W, Vangronsveld J (2017) Structural variability and niche differentiation in the rhizosphere and endosphere bacterial microbiome of field-grown poplar trees. Microbiome 5:25
Bertrand D, Shaw J, Kalathiyappan M, Ng AHQ, Kumar MS, Li C, Dvornicic M, Soldo JP, Koh JY, Tong C et al (2019) Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat Biotechnol 37:937–944
Bishara A, Moss EL, Kolmogorov M, Parada AE, Weng Z, Sidow A, Dekas AE, Batzoglou S, Bhatt AS (2018) High-quality genome sequences of uncultured microbes by assembly of read clouds. Nat Biotechnol 36:1067–1075
Blin K, Weber T, Lee SY, Medema MH, Pascal Andreu V, de los Santos ELC, Del Carratore F (2018) The antiSMASH database version 2: a comprehensive resource on secondary metabolite biosynthetic gene clusters. Nucleic Acids Res 47:D625–D630
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120
Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F et al (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37:852–857
Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, Schulz F, Jarett J, Rivers AR, Eloe-Fadrosh EA et al (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725–731
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016) DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 13:581–583
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–336
Carini P, Marsden PJ, Leff JW, Morgan EE, Strickland MS, Fierer N (2016) Relic DNA is abundant in soil and obscures estimates of soil microbial diversity. Nat Microbiol 2:16242
Carrión VJ, Perez-Jaramillo J, Cordovez V, Tracanna V, de Hollander M, Ruiz-Buck D, Mendes LW, van Ijcken WFJ, Gomez-Exposito R, Elsayed SS et al (2019) Pathogen-induced activation of disease-suppressive functions in the endophytic root microbiome. Science 366:606–612
Charalampous T, Kay GL, Richardson H, Aydin A, Baldan R, Jeanes C, Rae D, Grundy S, Turner DJ, Wain J et al (2019) Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol 37:783–792
Chen Q, Jiang T, Liu Y-X, Liu H, Zhao T, Liu Z, Gan X, Hallab A, Wang X, He J et al (2019) Recently duplicated sesterterpene (C25) gene clusters in Arabidopsis thaliana modulate root microbiota. Sci China Life Sci 62:947–958
Costea PI, Zeller G, Sunagawa S, Pelletier E, Alberti A, Levenez F, Tramontano M, Driessen M, Hercog R, Jung F-E et al (2017) Towards standards for human fecal sample processing in metagenomic studies. Nat Biotechnol 35:1069–1076
Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJ Complex Syst 1695:1–9


de Goffau MC, Lager S, Sovio U, Gaccioli F, Cook E, Peacock SJ, Parkhill J, Charnock-Jones DS, Smith GCS (2019) Human placenta has no microbiome but can contain potential pathogens. Nature 572:329–334
de Muinck EJ, Trosvik P, Gilfillan GD, Hov JR, Sundaram AYM (2017) A novel ultra high-throughput 16S rRNA gene amplicon sequencing library preparation method for the Illumina HiSeq platform. Microbiome 5:68
Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461
Edgar RC (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 10:996–998
Edgar RC, Flyvbjerg H (2015) Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 31:3476–3482
Edwards J, Johnson C, Santos-Medellín C, Lurie E, Podishetty NK, Bhatnagar S, Eisen JA, Sundaresan V (2015) Structure, variation, and assembly of the root-associated microbiomes of rice. Proc Natl Acad Sci USA 112:E911–E920
Edwards JA, Santos-Medellín CM, Liechty ZS, Nguyen B, Lurie E, Eason S, Phillips G, Sundaresan V (2018) Compositional shifts in root-associated bacterial and archaeal microbiota track the plant life cycle in field-grown rice. PLoS Biol 16:e2003862
Fan K, Delgado-Baquerizo M, Guo X, Wang D, Wu Y, Zhu M, Yu W, Yao H, Zhu Y-g, Chu H (2019) Suppressed N fixation and diazotrophs after four decades of fertilization. Microbiome 7:143
Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV et al (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26:541–547
Franzosa EA, McIver LJ, Rahnavard G, Thompson LR, Schirmer M, Weingart G, Lipson KS, Knight R, Caporaso JG, Segata N et al (2018) Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods 15:962–968
Fresia P, Antelo V, Salazar C, Giménez M, D’Alessandro B, Afshinnekoo E, Mason C, Gonnet GH, Iraola G (2019) Urban metagenomics uncover antibiotic resistance reservoirs in coastal beach and sewage waters. Microbiome 7:35
Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152
Galkin F, Aliper A, Putin E, Kuznetsov I, Gladyshev VN, Zhavoronkov A (2018) Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects. bioRxiv 507780
Gao L, Xu T, Huang G, Jiang S, Gu Y, Chen F (2018) Oral microbiomes: more and more importance in oral cavity and whole body. Protein Cell 9:488–500
Gonzalez A, Navas-Molina JA, Kosciolek T, McDonald D, Vázquez-Baeza Y, Ackermann G, DeReus J, Janssen S, Swafford AD, Orchanian SB et al (2018) Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods 15:796–798
Goodman AL, Kallstrom G, Faith JJ, Reyes A, Moore A, Dantas G, Gordon JI (2011) Extensive personal human gut microbiota culture collections characterized and manipulated in gnotobiotic mice. Proc Natl Acad Sci USA 108:6252–6257
Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J, The Bioconda Team (2018) Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15:475–476
Guo X, Zhang X, Qin Y, Liu Y-X, Zhang J, Zhang N, Wu K, Qu B, He Z, Wang X et al (2020) Host-associated quantitative abundance profiling reveals the microbial load variation of root microbiome. Plant Commun 1:100003
Huang AC, Jiang T, Liu Y-X, Bai Y-C, Reed J, Qu B, Goossens A, Nützmann H-W, Bai Y, Osbourn A (2019) A specialized metabolic network selectively modulates Arabidopsis root microbiota. Science 364:eaau6389
Huang P, Zhang Y, Xiao K, Jiang F, Wang H, Tang D, Liu D, Liu B, Liu Y, He X et al (2018) The chicken gut metagenome and the modulatory effects of plant-derived benzylisoquinoline alkaloids. Microbiome 6:211
Huson DH, Beier S, Flade I, Górska A, El-Hadidi M, Mitra S, Ruscheweyh H-J, Tappu R (2016) MEGAN community edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Comput Biol 12:e1004957
Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC (2012) Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28:2223–2230
Ji P, Zhang Y, Wang J, Zhao F (2017) MetaSort untangles metagenome assembly by reducing microbial community complexity. Nat Commun 8:14306
Jiang X, Li X, Yang L, Liu C, Wang Q, Chi W, Zhu H (2019) How microbes shape their communities? A microbial community model based on functional genes. Genom Proteom Bioinf 17:91–105
Jiao S, Liu Z, Lin Y, Yang J, Chen W, Wei G (2016) Bacterial communities in oil contaminated soils: biogeography and co-occurrence patterns. Soil Biol Biochem 98:64–73
Jin T, Wang Y, Huang Y, Xu J, Zhang P, Wang N, Liu X, Chu H, Liu G, Jiang H et al (2017) Taxonomic structure and functional association of foxtail millet root microbiome. Giga Sci 6:1–12
Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27–30
Kanehisa M, Sato Y, Morishima K (2016) BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol 428:726–731
Kang DD, Froula J, Egan R, Wang Z (2015) MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165
Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, Horn M, Glöckner FO (2012) Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res 41:e1–e1
Knight R, Vrbanac A, Taylor BC, Aksenov A, Callewaert C, Debelius J, Gonzalez A, Kosciolek T, McCall L-I, McDonald D et al (2018) Best practices for analysing microbiomes. Nat Rev Microbiol 16:410–422
Knights D, Kuczynski J, Charlson ES, Zaneveld J, Mozer MC, Collman RG, Bushman FD, Knight R, Kelley ST (2011) Bayesian community-wide culture-independent microbial source tracking. Nat Methods 8:761


Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA (2015) Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 11:e1004226
Lagier J-C, Dubourg G, Million M, Cadoret F, Bilen M, Fenollar F, Levasseur A, Rolain J-M, Fournier P-E, Raoult D (2018) Culturing the human microbiota and culturomics. Nat Rev Microbiol 16:540–550
Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, Clemente JC, Burkepile DE, Vega Thurber RL, Knight R et al (2013) Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol 31:814
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359
Letunic I, Bork P (2019) Interactive tree of life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 47:W256–W259
Levy A, Salas Gonzalez I, Mittelviefhaus M, Clingenpeel S, Herrera Paredes S, Miao J, Wang K, Devescovi G, Stillman K, Monteiro F et al (2018) Genomic features of bacterial adaptation to plants. Nat Genet 50:138–150
Li D, Liu C-M, Luo R, Sadakane K, Lam T-W (2015) MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674–1676
Li J, Jia H, Cai X, Zhong H, Feng Q, Sunagawa S, Arumugam M, Kultima JR, Prifti E, Nielsen T et al (2014) An integrated catalog of reference genes in the human gut microbiome. Nat Biotechnol 32:834–841
Liu C, Zhou N, Du M-X, Sun Y-T, Wang K, Wang Y-J, Li D-H, Yu H-Y, Song Y, Bai B-B et al (2020) The mouse gut microbial Biobank expands the coverage of cultured bacteria. Nat Commun 11:79
Liu Y-X, Qin Y, Bai Y (2019) Reductionist synthetic community approaches in root microbiome research. Curr Opin Microbiol 49:97–102
Liu Y-X, Qin Y, Guo X, Bai Y (2019) Methods and applications for microbiome data analysis. Hereditas (Beijing) 41:1–18
Louca S, Parfrey LW, Doebeli M (2016) Decoupling function and taxonomy in the global ocean microbiome. Science 353:1272–1277
Mahnert A, Moissl-Eichinger C, Zojer M, Bogumil D, Mizrahi I, Rattei T, Martinez JL, Berg G (2019) Man-made microbial resistances in built environments. Nat Commun 10:968
Marchesi JR, Ravel J (2015) The vocabulary of microbiome research: a proposal. Microbiome 3:31
McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P (2011) An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6:610
Members BDC (2019) Database resources of the BIG data center in 2019. Nucleic Acids Res 47:D8–D14
Metcalf JL, Xu ZZ, Weiss S, Lax S, Van Treuren W, Hyde ER, Song SJ, Amir A, Larsen P, Sangwan N et al (2016) Microbial community assembly and metabolic function during mammalian corpse decomposition. Science 351:158–162
Metsky HC, Siddle KJ, Gladden-Young A, Qu J, Yang DK, Brehio P, Goldfarb A, Piantadosi A, Wohl S, Carter A et al (2019) Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nat Biotechnol 37:160–168
Mikheenko A, Saveliev V, Gurevich A (2016) MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32:1088–1090
Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe MR, Kale V, Potter SC, Richardson LJ et al (2020) MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res 48:D570–D578
Moss EL, Maghini DG, Bhatt AS (2020) Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat Biotechnol
Mu D-S, Liang Q-Y, Wang X-M, Lu D-C, Shi M-J, Chen G-J, Du Z-J (2018) Metatranscriptomic and comparative genomic insights into resuscitation mechanisms during enrichment culturing. Microbiome 6:230
Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ (2014) IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–274
Ning K, Tong Y (2019) The fast track for microbiome research. Genom Proteom Bioinf 17:1–3
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834
Oksanen J, Kindt R, Legendre P, O’Hara B, Stevens MHH, Oksanen MJ, Suggests M (2007) The vegan package. Commun Ecol Pack 10:631–637
Parks DH, Tyson GW, Hugenholtz P, Beiko RG (2014) STAMP: statistical analysis of taxonomic and functional profiles. Bioinformatics 30:3123–3124
Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P et al (2019) Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176:649–662.e620
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417–419
Pedersen HK, Forslund SK, Gudmundsdottir V, Petersen AØ, Hildebrand F, Hyötyläinen T, Nielsen T, Hansen T, Bork P, Ehrlich SD et al (2018) A computational framework to integrate high-throughput ‘-omics’ datasets for the identification of potential mechanistic links. Nat Protoc 13:2781–2800
Proctor LM, Creasy HH, Fettweis JM, Lloyd-Price J, Mahurkar A, Zhou W, Buck GA, Snyder MP, Strauss JF, Weinstock GM et al (2019) The integrative human microbiome project. Nature 569:641–648
Qian X, Liu Y-X, Ye X, Zheng W, Lv S, Mo M, Lin J, Wang W, Wang W, Zhang X et al (2020) Gut microbiota in children with juvenile idiopathic arthritis: characteristics, biomarker identification, and usefulness in clinical prediction. BMC Genom 21:286
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T et al (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464:59–65
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene
database project: improved data processing and web-based tools. Nucleic Acids Res 41:D590–596
Quince C, Walker AW, Simpson JT, Loman NJ, Segata N (2017) Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 35:833
Ren Z, Li A, Jiang J, Zhou L, Yu Z, Lu H, Xie H, Chen X, Shao L, Zhang R et al (2019) Gut microbiome analysis as a tool towards targeted non-invasive biomarkers for early hepatocellular carcinoma. Gut 68:1014–1023
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140
Rognes T, Flouri T, Nichols B, Quince C, Mahé F (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584
Ross AA, Müller KM, Weese JS, Neufeld JD (2018) Comprehensive skin microbiome analysis reveals the uniqueness of human skin and evidence for phylosymbiosis within the class mammalia. Proc Natl Acad Sci USA 115:E5786–E5795
Rothschild D, Weissbrod O, Barkan E, Kurilshikov A, Korem T, Zeevi D, Costea PI, Godneva A, Kalka IN, Bar N et al (2018) Environment dominates over host genetics in shaping human gut microbiota. Nature 555:210
Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, Krupovic M, Kuhn JH, Lavigne R, Brister JR, Varsani A et al (2019) Minimum information about an uncultivated virus genome (MIUViG). Nat Biotechnol 37:29–37
Saito R, Smoot ME, Ono K, Ruscheinski J, Wang P-L, Lotia S, Pico AR, Bader GD, Ideker T (2012) A travel guide to cytoscape plugins. Nat Methods 9:1069–1076
Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh H-J, Cuenca M, Field CM, Coelho LP, Cruaud C, Engelen S et al (2019) Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell 179:1068–1083.e1021
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069
Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C (2011) Metagenomic biomarker discovery and explanation. Genome Biol 12:R60
Shenhav L, Thompson M, Joseph TA, Briscoe L, Furman O, Bogumil D, Mizrahi I, Pe’er I, Halperin E (2019) FEAST: fast expectation-maximization for microbial source tracking. Nat Methods
Shi W, Li M, Wei G, Tian R, Li C, Wang B, Lin R, Shi C, Chi X, Zhou B et al (2019) The occurrence of potato common scab correlates with the community composition and function of the geocaulosphere soil microbiome. Microbiome 7:14
Shi W, Qi H, Sun Q, Fan G, Liu S, Wang J, Zhu B, Liu H, Zhao F, Wang X et al (2019) gcMeta: a global catalogue of metagenomics platform to support the archiving, standardization and analysis of microbiome data. Nucleic Acids Res 47:D637–D648
Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF (2018) Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat Microbiol 3:836–843
Sinha R, Abu-Ali G, Vogtmann E, Fodor AA, Ren B, Amir A, Schwager E, Crabtree J, Ma S, Abnet CC et al (2017) Assessment of variation in microbial community amplicon sequencing by the microbiome quality control (MBQC) project consortium. Nat Biotechnol 35:1077–1086
Smits SA, Leach J, Sonnenburg ED, Gonzalez CG, Lichtman JS, Reid G, Knight R, Manjurano A, Changalucha J, Elias JE et al (2017) Seasonal cycling in the gut microbiome of the Hadza hunter-gatherers of Tanzania. Science 357:802–806
Stewart RD, Auffret MD, Warr A, Walker AW, Roehe R, Watson M (2019) Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat Biotechnol 37:953–961
Stewart RD, Auffret MD, Warr A, Wiser AH, Press MO, Langford KW, Liachko I, Snelling TJ, Dewhurst RJ, Walker AW et al (2018) Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat Commun 9:870
Subramanian S, Huq S, Yatsunenko T, Haque R, Mahfuz M, Alam MA, Benezra A, DeStefano J, Meier MF, Muegge BD et al (2014) Persistent gut microbiota immaturity in malnourished Bangladeshi children. Nature 510:417
Tange O (2018) GNU Parallel 2018. Lulu.com
Tierney BT, Yang Z, Luber JM, Beaudin M, Wibowo MC, Baek C, Mehlenbacher E, Patel CJ, Kostic AD (2019) The landscape of genetic content in the gut and oral human microbiome. Cell Host Microbe 26:283–295.e288
Tkacz A, Hortala M, Poole PS (2018) Absolute quantitation of microbiota abundance in environmental samples. Microbiome 6:110
Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N (2015) MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 12:902–903
Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI (2007) The human microbiome project. Nature 449:804–810
Turner TR, Ramakrishnan K, Walshaw J, Heavens D, Alston M, Swarbreck D, Osbourn A, Grant A, Poole PS (2013) Comparative metatranscriptomics reveals kingdom level changes in the rhizosphere microbiome of plants. ISME J 7:2248–2258
Uritskiy GV, DiRuggiero J, Taylor J (2018) MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6:158
Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, Wang J, Tito RY, De Commer L, Darzi Y et al (2017) Quantitative microbiome profiling links gut community variation to microbial load. Nature 551:507–511
Vangay P, Hillmann BM, Knights D (2019) Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks. GigaScience 8:giz042
Wang J, Chen L, Zhao N, Xu X, Xu Y, Zhu B (2018) Of genes and microbes: solving the intricacies in host genomes. Protein Cell 9:446–461
Wang J, Jia Z, Zhang B, Peng L, Zhao F (2019) Tracing the accumulation of in vivo human oral microbiota elucidates microbial community dynamics at the gateway to the GI tract. Gut, gutjnl-2019–318977
Wang J, Thingholm LB, Skiecevičienė J, Rausch P, Kummen M, Hov JR, Degenhardt F, Heinsen F-A, Rühlemann MC, Szymczak S
et al (2016) Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota. Nat Genet 48:1396–1406
Wang J, Zheng J, Shi W, Du N, Xu X, Zhang Y, Ji P, Zhang F, Jia Z, Wang Y et al (2018) Dysbiosis of maternal and neonatal microbiota associated with gestational diabetes mellitus. Gut 67:1614–1625
Wang W, Yang J, Zhang J, Liu Y-X, Tian C, Qu B, Gao C, Xin P, Cheng S, Zhang W et al (2020) An Arabidopsis secondary metabolite directly targets expression of the bacterial type III secretion system to inhibit bacterial virulence. Cell Host Microbe 27:601–613.e607
Wang X, Wang M, Xie X, Guo S, Zhou Y, Zhang X, Yu N, Wang E (2020b) An amplification-selection model for quantified rhizosphere microbiota assembly. Sci Bull
Wang Y, Song F, Zhu J, Zhang S, Yang Y, Chen T, Tang B, Dong L, Ding N, Zhang Q et al (2017) GSA: genome sequence archive*. Genom Proteom Bioinf 15:14–18
Ward T, Larson J, Meulemans J, Hillmann B, Lynch J, Sidiropoulos D, Spear JR, Caporaso G, Blekhman R, Knight R et al (2017) BugBase predicts organism-level microbiome phenotypes. bioRxiv 133462
Wilck N, Matus MG, Kearney SM, Olesen SW, Forslund K, Bartolomaeus H, Haase S, Mähler A, Balogh A, Markó L et al (2017) Salt-responsive gut commensal modulates TH17 axis and disease. Nature 551:585–589
Wood DE, Lu J, Langmead B (2019) Improved metagenomic analysis with Kraken 2. bioRxiv 762302
Wu Y-W, Simmons BA, Singer SW (2015) MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32:605–607
Xiao L, Feng Q, Liang S, Sonne SB, Xia Z, Qiu X, Li X, Long H, Zhang J, Zhang D et al (2015) A catalog of the mouse gut metagenome. Nat Biotechnol 33:1103
Xu J, Zhang Y, Zhang P, Trivedi P, Riera N, Wang Y, Liu X, Fan G, Tang J, Coletta-Filho HD et al (2018) The structure and function of the global citrus rhizosphere microbiome. Nat Commun 9:4894
Xu Y, Zhao F (2018) Single-cell metagenomics: challenges and applications. Protein Cell 9:501–510
Yang J, Yu J (2018) The association of diet, gut microbiota and colorectal cancer: what we eat may imply what we get. Protein Cell 9:474–487
Ye SH, Siddle KJ, Park DJ, Sabeti PC (2019) Benchmarking metagenomics tools for taxonomic classification. Cell 178:779–794
Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G et al (2011) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 29:415–420
Zgadzaj R, Garrido-Oter R, Jensen DB, Koprivova A, Schulze-Lefert P, Radutoiu S (2016) Root nodule symbiosis in Lotus japonicus drives the establishment of distinctive rhizosphere, root, and nodule bacterial communities. Proc Natl Acad Sci USA 113:E7996–E8005
Zhang F, Cui B, He X, Nie Y, Wu K, Fan D, Feng B, Chen D, Ren J, Deng M et al (2018) Microbiota transplantation: concept, methodology and strategy for its modernization. Protein Cell 9:462–473
Zhang J, Liu Y-X, Zhang N, Hu B, Jin T, Xu H, Qin Y, Yan P, Zhang X, Guo X et al (2019) NRT1.1B is associated with root microbiota composition and nitrogen use in field-grown rice. Nat Biotechnol 37:676–684
Zhang J, Zhang N, Liu Y-X, Zhang X, Hu B, Qin Y, Xu H, Wang H, Guo X, Qian J et al (2018) Root microbiota shift in rice correlates with resident time in the field and developmental stage. Sci China Life Sci 61:613–621
Zheng M, Zhou N, Liu S, Dang C, Liu Y-X, He S, Zhao Y, Liu W, Wang X (2019) N2O and NO emission from a biological aerated filter treating coking wastewater: main source and microbial community. J Clean Prod 213:365–374
Zhu W, Lomsadze A, Borodovsky M (2010) Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 38:e132–e132
Zou Y, Xue W, Luo G, Deng Z, Qin P, Guo R, Sun H, Xia Y, Liang S, Dai Y et al (2019) 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 37:179–185
