Bioinformatics A Practical Guide To Next Generation Sequencing Data
This book contains the latest material on the subject, covering next generation sequencing (NGS) applications and meeting the requirements of a complete semester course. The book digs deep into analysis, providing both concept and practice to satisfy the exact needs of researchers seeking to understand and use NGS data preprocessing, genome assembly, variant discovery, gene profiling, epigenetics, and metagenomics. The book does not present the analysis pipelines as a black box, but works through the analysis steps in detail to provide readers with the scientific and technical background required to enable them to conduct analysis with confidence and understanding. The book is primarily designed as a companion for researchers and graduate students using sequencing data analysis but will also serve as a textbook for teachers and students in biology and bioscience.
Chapman & Hall/CRC Computational Biology Series
About the Series: This series aims to capture new developments in computational biology, as well
as high-quality work summarizing or contributing to more established topics. Publishing a broad
range of reference works, textbooks, and handbooks, the series is designed to appeal to students,
researchers, and professionals in all areas of computational biology, including genomics, pro-
teomics, and cancer computational biology, as well as interdisciplinary researchers involved in
associated fields, such as bioinformatics and systems biology.
Hamid D. Ismail
First edition published 2023
by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
The right of Hamid D. Ismail to be identified as author of this work has been asserted in accordance with sections 77
and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any
information storage or retrieval system, without permission in writing from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003355205
Typeset in Minion
by codeMantra
Preface
The use of next-generation sequencing (NGS) data analysis is the only way to make sense of the massive genomic data produced by high-throughput sequencing technologies and accumulated in gigabytes and terabytes on our hard drives and in cloud databases. With the availability of computational resources and elegant algorithms for NGS data analysis, scientists need to master the tools of these analyses to achieve the goals of their research. Learning NGS data analysis techniques has already become one of the most important assets that bioinformaticians and biologists must acquire to keep abreast of progress in modern biology and to take advantage of the genomic technologies and resources that have become the de facto standard in bioscience research and applications, including diagnosis, drug and vaccine discovery, medical studies, and the investigation of pathways that give clues to many biological activities and to the pathogenicity of diseases.
In the last two decades, the progress of next-generation sequencing has had a strong positive impact on human life and has marked a forward stride in human civilization. The introduction of new sequencing technologies has revolutionized bioscience. As a result, a new field of biology called genomics has emerged. Genomics focuses on the composition, structure, functional units, evolution, and manipulation of genomes, and it generates massive amounts of data that need to be ingested and analyzed. As a consequence, bioinformatics has also emerged as an interdisciplinary field of science to address the specific needs in data acquisition, storage, processing, analysis, and integration of that data into a broad pool to enrich genomic research.
This book is designed primarily to be a companion for researchers and graduate students who use sequencing data analysis in their research, and it also serves as a textbook for teachers and students in biology and bioscience. It contains up-to-date material on the subject, covering most NGS applications and meeting the requirements of a complete semester course. The reader will find that this book digs deep into the analysis, providing both concept and practice to satisfy the exact needs of researchers who seek to understand and use NGS data preprocessing, genome assembly, variant discovery, gene profiling, epigenetics, and metagenomics. The book does not present the analysis pipelines as a black box, as existing books do, but works through the analysis steps of each topic in detail to provide readers with the scientific and technical background that enables them to conduct the analysis with confidence and understanding.
The book consists of eight chapters. All chapters include real-world worked examples
that demonstrate the steps of the analysis workflow with real data downloadable from public databases. Most programs used in this book are open-source Unix/Linux-based programs; others can be installed in Anaconda environments.
Chapter 1 discusses sequencing data acquisition from NGS technologies and databases,
FASTQ file format, and Phred base call quality. The chapter covers the quality assessment
of the FASTQ and read quality metrics in some detail so that the readers can diagnose
potential problems in raw data and learn how to fix any possible quality problem before
analysis.
Chapter 2 discusses read alignment/mapping to reference genomes. The strategies of both reference genome indexing algorithms and read mapping algorithms are discussed in detail with illustrations so that readers can understand how the mapping process works, the different indexing and alignment algorithms currently used, and which aligners are suitable for RNA sequencing applications. The chapter discusses indexing and searching algorithms like suffix trees, suffix arrays, the Burrows-Wheeler Transform (BWT), the FM-index, and hashing, which are the algorithms used by aligners. The chapter then discusses the mapping process and aligners like BWA, Bowtie, STAR, etc. The SAM/BAM file format is discussed in detail so that the reader can understand how alignment information is stored in the fields of the SAM/BAM file. Finally, the chapter discusses the manipulation of alignments in SAM/BAM files using Samtools programs for different purposes, including SAM-to-BAM conversion, alignment sorting, indexing BAM files, extracting alignments of a chromosome or a specific region, filtering and counting alignments, removing duplicate reads, and generating descriptive statistics.
Chapter 3 discusses de novo genome assembly and de novo assembly algorithms includ-
ing greedy algorithm, overlap-consensus graphs, and de Bruijn graphs. The quality assess-
ment of the assembled genome is discussed through two approaches: statistical approach
and evolutionary approach.
Chapter 4 covers variant calling (SNPs and InDels) in detail. The introduction of this
chapter discusses variants, variant file format (VCF), and the general workflow of the vari-
ant calling. The chapter then discusses both consensus-based variant calling and hap-
lotype-based variant calling and example callers from each group including BCFTools,
FreeBayes, and GATK best practice variant calling pipelines. Finally, the chapter discusses
variant annotation and prioritization and annotation programs including SIFT, SnpEff,
and ANNOVAR.
Chapter 5 discusses RNA-Seq data analysis. The introduction includes RNA-Seq basics
and applications. The chapter then discusses the steps of RNA-Seq analysis workflow,
including data acquisition, read alignment, alignment quality control, quantification,
RNA-Seq data normalization, statistical modeling and differential expression analysis,
using R packages for differential analysis, and visualization of RNA-Seq data.
Chapter 6 covers ChIP-Seq data analysis. It discusses in detail the workflow of ChIP-Seq
data analysis including data acquisition, quality control, read mapping, peak calling, visu-
alizing peak enrichment and peak distribution, peak annotation, peak functional analysis,
and motif discovery.
Chapter 7 discusses targeted gene metagenomic data analysis (amplicon-based microbial
analysis) for environmental and clinical samples. The chapter covers raw data preprocessing
Author
Hamid D. Ismail, PhD, earned an MSc and a PhD in computational science at North
Carolina Agricultural and Technical State University (NC A&T), USA, and a DVM and a BSc
at the University of Khartoum, Sudan. He also earned several professional certifications,
including SAS Advanced Programmer and SQL Expert Programmer. Currently, he is a
postdoctoral scholar at Michigan Technological University and an adjunct professor of data
science and bioinformatics at NC A&T State University. Dr. Ismail is a bioinformatician,
biologist, data scientist, statistician, and machine learning specialist. He has contributed
widely to the field of bioinformatics by developing bioinformatics tools and methods for
applications of machine learning on genomic data.
Chapter 1

Sequencing and Raw Sequence Data Quality Control

DOI: 10.1201/9781003355205-1
base pair of guanine (C/G)) as shown in Figure 1.1. Adenine and thymine form two hydro-
gen bonds (weak bond), while cytosine and guanine form three hydrogen bonds (strong
bond). These base pairings are specific, so the sequence of one strand can be predicted from the other. The length of a DNA sequence is given in base pairs (bp), kilobase pairs (kbp), or megabase pairs (Mbp). RNA exists as a single strand; however, it sometimes forms double-stranded secondary structures with itself to perform specific functions.
The genome of an organism is the book of life for that organism. It determines the living aspects and biological activities of cells. A genome contains coding regions, known as genes, that carry the information for protein synthesis. Genes are transcribed into messenger RNA (mRNA), which is translated into proteins, and the proteins control most of the biological processes in living organisms.
A gene consists of coding regions, non-coding regions, and a regulatory region. The coding regions in eukaryotic genes are not continuous: non-coding sequences (called introns) are found between the coding sequences (called exons). The introns are removed from the primary transcripts before protein translation, leaving only the exons, which form the coding region called the open reading frame (ORF). Each eukaryotic gene has its own regulatory region that controls its expression. In prokaryotic cells, a group of genes, called an operon, is regulated by a single regulatory region. Viruses, which fall on the margin between living organisms and chemical particles, function and replicate only inside host cells, using host cell machinery such as ribosomes to create their structural and non-structural proteins and to produce new virions.
FIGURE 1.1 Base pairing and hydrogen bonds between pairs of the DNA nucleotides.
1.2 SEQUENCING
DNA/RNA sequencing is the determination of the order of the four nucleotides in a nucleic
acid molecule. The recovered order of the nucleotides in a genome of an organism is called
a sequence. Sequencing of DNA helps scientists investigate the functions of genes, the roles of mutations in traits and diseases, species identification, evolutionary relationships between species, the diagnosis of diseases caused by genetic factors, the development of gene therapy, criminal and legal investigations, and more. Since the nucleotides are distinguished by their bases, DNA and RNA sequences are represented in bioinformatics as strings of the four single-character nucleobase symbols (A, C, G, and T for DNA and A, C, G, and U for RNA).
The attempts to sequence nucleic acids began immediately after the landmark discovery in 1953 of the double-helix structure of DNA by James Watson and Francis Crick. The alanine tRNA was the first nucleic acid sequenced, in 1965, by the Nobel prize winner Robert Holley. Holley used two ribonuclease enzymes to split the tRNA at specific nucleotide positions, and the order of the nucleotides was determined manually [1]. The first gene was sequenced in 1972 by Walter Fiers; it was the gene that codes for the coat protein of the bacteriophage MS2, and the sequencing was performed by using enzymes to break the bacteriophage RNA into pieces and separating the fragments with electrophoresis and chromatography [2]. The sequencing of the alanine tRNA by Robert Holley and the sequencing of the gene for the bacteriophage MS2 coat protein are among the major milestones in the history of genomics and DNA sequencing. They paved the way for first-generation sequencing.
C+T, and C). The order of the nucleotides (A, C, G, and T) in the DNA sequence can then
be solved from the bands on the gel.
On the other hand, the steps of the Sanger sequencing method are similar to those of the polymerase chain reaction (PCR), including denaturation, primer annealing, and complementary strand synthesis by a polymerase. However, in Sanger sequencing, the sample DNA is divided into four reaction tubes labeled ddATP, ddGTP, ddCTP, and ddTTP. In the four reaction tubes, the four types of deoxynucleotide triphosphates (dATP, dGTP, dCTP, and dTTP) are added as in PCR, but one of the four radio-labeled dideoxynucleotide triphosphates (ddATP, ddGTP, ddCTP, or ddTTP) is also added to each reaction, as labeled, to terminate DNA synthesis at positions of known nucleotides. The synthesis termination results in DNA fragments of varying lengths ending with the labeled ddNTPs. Those fragments are then separated by size using gel electrophoresis on a denaturing polyacrylamide-urea gel, with each of the four reactions running in a separate lane labeled A, T, G, and C. The DNA fragments are separated by length; the smaller fragments move faster in the gel. The DNA bands are then visualized by autoradiography, and the order of the nucleotide bases in the DNA sequence can be read directly from the X-ray film or the gel image.
for PCR DNA strand synthesis and a barcode sequence for indexing the sample DNA. This allows multiple samples to be sequenced in a single run (multiplexing); the DNA fragments of each sample will carry a unique barcode. Later, after sequencing, the sample sequences can be separated in the analysis by demultiplexing. In the sequencing of some applications, like gene expression (RNA-Seq) and epigenetics, an enrichment step is usually included to amplify or to separate only the targeted sequences. In RNA applications, enrichment is performed to separate mRNA from the other types of RNA. In epigenetics, the genomic regions where protein interaction takes place can also be enriched. The enrichment is usually performed with PCR, but there are other means as well. The library preparation of the DNA/RNA is similar for all NGS technologies, but the sequencing process differs from one technology to another. The sequences produced by the NGS technologies range between 75 and 400 base pairs (bp) in length. These sequences are called short reads. In general, short-read sequencing (SRS) can be either single-end sequencing, which sequences the forward strand only, or paired-end sequencing, which sequences both forward and reverse strands. The latter reduces the chance of base call errors in the resulting sequence. The DNA or RNA reads consist of the four nucleobase characters A, C, G, and T. However, the sequence may also include N for an unresolved base.
and captured by a sensor or a camera. Figure 1.2 shows the steps of pyrosequencing after adding the DNA fragments to the wells, where sequencing takes place by incorporating a known nucleotide into the growing complementary strand each time and releasing pyrophosphate (PPi), which is translated into a light signal detected by a camera.
what has been discussed above. In the library preparation step, the DNA is broken down
into fragments. The ends of these fragments are repaired and then adaptor sequences are
ligated to the ends. As in Roche 454 and Ion Torrent sequencing, beads are also used, and the
fragments attached to the beads are amplified using PCR. After the amplification, specific
primers are used to synthesize strands complementing the ssDNA templates and provid-
ing a free 5′ phosphate group (instead of 3′ hydroxyl group) that can ligate to one of a set
of fluorescently labeled probes. A probe is an 8-mer oligonucleotide with a fluorescent dye
at the 5′-end and a hydroxyl group at the 3′-end. The first two nucleotides of the probe spe-
cifically complement the nucleotides on the sequenced fragment. The next three nucleotides are universal bases that can bind to any of the four nucleotides. The remaining three nucleotides of the probe (at the 5′-end) are also universal but are fluorescently labeled, and they can be cleaved during sequencing, leaving only the other five nucleotides bound to the template strand. The set of probes covers the 16 possible two-nucleotide combinations that can ligate to the primer sequence on the fragments. Every time two
specific nucleotides ligate, the last three nucleotides of the probe are cleaved and the fluo-
rescence specific to the ligated probe is emitted, captured, and translated into the corre-
sponding two bases. This step is followed by removing the fluorescently labeled nucleotide
and regenerating the 5′ phosphate group. The process (ligation, detection, and cleavage)
is repeated multiple times for extending the complementary strand. The strand extension
products are then removed, and a new primer is used for the second round of ligation,
detection, and cleavage. The cycles are then repeated and every time a new primer is used
for the remainder of the DNA fragments.
nucleotide. The base call is based on the intensity of the signals during chain synthesis and
labeled nucleotide incorporation.
The DNA library preparation, as described above, includes the DNA fragmentation,
DNA end repair, and adaptor ligation. Sequencing is carried out on the solid surface of a
flow cell divided into lanes. On the surface, there are two types of oligonucleotides that
complement the anchor sequence in the adaptors attached to the DNA template. When the
DNA fragments are added to the flow cell, the anchor sequences in the fragments anneal
to (complement) the surface oligonucleotides forming bridges. The oligos also act as prim-
ers that initiate the synthesis of complementary strands generating clusters of the DNA
fragments. The sequencing step begins by denaturing the DNA strands and adding fluo-
rescently labeled nucleotides. Only one nucleotide is incorporated in the DNA template at
a time. After the addition of each nucleotide, the clusters are excited by a light source and
a signal is emitted. The signal intensity determines the base call.
applications like the de novo genome assembly, variant discovery, and epigenetics. Usually,
genomes have long repeated sequences ranging from hundreds to thousands of bases that
are hard to cover with the short reads produced by NGS. The foundation for TGS was laid in 2003, when DNA polymerase was used to obtain a sequence of 5 bp from a single DNA molecule by using fluorescence microscopy [4]. The single-molecule sequenc-
ing (SMS) then evolved to include (i) direct imaging of individual DNA molecules using
advanced microscopy techniques and (ii) nanopore sequencing technologies in which a
single molecule of DNA is threaded through a nanopore and molecule bases are detected
as they pass through the nanopore. Although TGS provides long reads (from a few hun-
dred to thousands of base pairs), that may come at the expense of the accuracy. However,
lately, the accuracy of the TGS has been greatly improved. The TGS provides long reads
that can enhance de novo assembly and enable direct detection of haplotypes and higher
consensus accuracy for better variant discovery. In general, there are two TGS technolo-
gies that are currently available: (i) Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing and (ii) Oxford Nanopore Technologies (ONT) sequencing.
called highly accurate long sequencing reads (HiFi reads). The sequencing is carried out in
parallel on thousands of nanophotonic wells, generating long reads. PacBio can produce
long reads with 99% accuracy (>Q20) and uniform coverage.
Coverage = ( Σ_{i=1}^{n} length of read i ) / genome size (bp)     (1.2)
where n is the number of sequenced reads.
The sequencing coverage is expressed as the number of times the genome is covered (e.g., 1X, 2X, 20X, etc.).
The sequencing depth affects the genomic assembly completeness, accuracy of de novo
assembly and reference-guided assembly, number of detected genes, gene expression lev-
els in RNA-Seq, variant calling, genotyping in the whole genome sequencing, microbial
identification and diversity analysis in metagenomics, and identification of protein–DNA
interaction in epigenetics. Therefore, it is important to investigate sequencing depth before
sequence analysis. The higher the number of times that bases are sequenced, the better the
quality of the data.
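As a quick illustration of formula 1.2, the following minimal sketch sums the read lengths in a FASTQ file and divides the total by an assumed genome size; the file name “reads.fastq” and the genome size of 5,000,000 bp are illustrative placeholders rather than values taken from this chapter.

# total read bases divided by the genome size (both values below are placeholders)
genome_size=5000000
total_bases=$(paste - - - - < reads.fastq | awk -F'\t' '{sum += length($2)} END {print sum}')
echo "scale=1; $total_bases / $genome_size" | bc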
a Phred quality score to measure the accuracy of each base called. The Phred quality score
(Q-score) transforms the probability of calling a base wrongly into an integer score that is
easy to interpret. The Phred score is defined as
Q = −10 log10(p)     (1.4)
where p is the probability of the base call being wrong as estimated by the caller software.
The Phred quality scores are encoded as single ASCII characters. Every ASCII character has a decimal number associated with it. The first 32 ASCII characters are non-printable, and the integer 33 is the decimal number of the exclamation mark character “!”; in the encoding that begins with “!” as zero, called Phred+33 encoding, Q=0 is therefore represented by the exclamation mark. Illumina 1.8 and later versions use this Phred+33 encoding (Q33) to encode the base call quality in FASTQ files. Older Illumina versions (e.g., Solexa) used Phred+64 encoding, in which the character “@”, whose decimal number is 64, corresponds to Q=0. Table 1.1 shows the Phred quality score (Q), the corresponding probability (P), and the decimal number and ASCII code. For instance, when the probability that a base call is wrong is 0.1, the Phred score is 10 (Q=10); instead of writing the number 10, that quality score is encoded in Phred+33 as the plus sign “+”.
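To make the encoding concrete, the short sketch below converts an error probability into a Q score and then into its Phred+33 character; the probability value used here is only an example.

# convert an error probability to a Phred Q score and its Phred+33 ASCII character
p=0.001
q=$(awk -v p="$p" 'BEGIN { printf "%d", -10 * log(p) / log(10) }')
awk -v q="$q" 'BEGIN { printf "Q=%d encodes as the character: %c\n", q, q + 33 }'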
Higher Q scores indicate a smaller probability of error, and lower Q scores indicate lower base call quality, meaning it is more likely that the base was called wrongly. For instance, a quality score of 20 indicates an error rate of 1 in 100 calls, corresponding to 99% call accuracy. In general, a Q-score of 30 is considered a benchmark
TABLE 1.2 Phred Quality Score and Error Probability and Base Call Accuracy
Q Error Probability Base Call Accuracy (%) Interpretation
10 0.1 90 1 error in 10 calls
20 0.01 99 1 error in 100 calls
30 0.001 99.9 1 error in 1,000 calls
40 0.0001 99.99 1 error in 10,000 calls
50 0.00001 99.999 1 error in 100,000 calls
60 0.000001 99.9999 1 error in 1,000,000 calls
for good quality in the high-throughput sequencing (HTS). Table 1.2 shows some of the Q
scores and corresponding error probability, base call accuracy, and interpretation.
Table 1.3 describes the elements of the Illumina FASTQ identifier line and Figure 1.6
shows an example FASTQ file with three read records. The observed index sequence (part of the adaptor) is written to the FASTQ header in place of the sample number. This information can be useful for troubleshooting and demultiplexing. However,
these metadata elements may be altered or replaced by other elements especially when they
are submitted to a database or altered by users.
The second line of the FASTQ file contains the bases inferred by the sequencer. The
bases include A, C, G, and T for Adenine, Cytosine, Guanine, and Thymine, respectively.
The character N may be included if the base in a position is ambiguous (was not deter-
mined due to a sequencing fault).
The third line starts with a plus sign “+”, and it may contain other additional metadata
or the same identifier line elements.
The fourth line of the FASTQ file contains the ASCII-coded string that represents the
per base Phred quality scores. The numeric value of each ASCII character corresponds to
the quality score of a base in the sequence line.
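For illustration, a single FASTQ record with an Illumina-style identifier line looks like the following (the instrument name, run coordinates, index, sequence, and quality string shown here are made up):

@M00001:23:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65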
Researchers usually acquire raw sequencing data for their own research from a sequenc-
ing instrument. Raw sequencing data can also be downloaded from a database, where sci-
entists and research institutions deposit their raw data and make it available for public. In
either case, the raw sequencing data is usually obtained in FASTQ files. The NCBI SRA
database is one of the largest databases of raw data for hundreds of species. The FASTQ
files are stored in Sequence Read Archive (SRA) format, and they can be downloaded and
extracted using SRA-toolkit [9], which is a collection of programs developed by the NCBI
and can be downloaded and installed by the instructions available at “https://trace.ncbi.
nlm.nih.gov/Traces/sra/sra.cgi”.
For the purpose of demonstration, we will download raw data from the NCBI SRA data-
base. We will use a single-end FASTQ file with the run ID “SRR030834”, whose size is
3.5 GB. The FASTQ file contains reads sequenced from an ancient hair tuft of a 4,000-year-old male individual from the extinct Saqqaq Palaeo-Eskimo culture, excavated directly from culturally deposited permafrozen sediments at Qeqertasussuk, Greenland. To keep files organized, you
can create the directory “fastqs” and then download the FASTQ file using “fasterq-dump”
command (make sure that you have installed the SRA-toolkit on your computer and it is
on the path):
mkdir fastqs
cd fastqs
mkdir single
cd single
fasterq-dump --verbose SRR030834
As shown in Figure 1.7, the FASTQ file “SRR030834.fastq” has been downloaded to the
directory, and we will use that file to show how to use some Linux commands to perform
some operations with that file.
FASTQ files may contain up to millions of entries, and their sizes can be several mega-
bytes or gigabytes, which often makes them too large to open in a normal text editor. In general, there is no need to open a FASTQ file unless it is necessary for troubleshooting or out of
curiosity. To display a large FASTQ file, we can use some Unix or Linux commands such
as “less” or “more” to display very large text file page by page or “cat” to display the content
of the file.
less SRR030834.fastq
more SRR030834.fastq
cat SRR030834.fastq
If a FASTQ file name ends with the “.gz” extension, that means the file is compressed with
“gzip” program. In this case, instead of “less”, “more”, and “cat” commands, use “zless”,
“zmore”, and “zcat” commands, respectively, without decompressing the files.
We can also use “head” and “tail” to display the first lines and last lines, respectively.
The following command will display the first 15 lines of the file:
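head -n 15 SRR030834.fastq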
If a FASTQ file is large, we can compress it with the “gzip” program to reduce its size by more than a factor of three. Compressing the “SRR030834.fastq” file with gzip will reduce its size to
less than one gigabyte.
gzip SRR030834.fastq
FIGURE 1.7 Downloading a FASTQ file from the NCBI SRA database.
To decompress the file again, use gzip with the “-d” option:

gzip -d SRR030834.fastq.gz
If you need to know the number of records in a FASTQ file, you can use a combination of
“cat” or “zcat” and “wc -l”, which counts the number of lines in a text file. Remember that
a record in a FASTQ file has 4 lines. We can use the Unix pipe symbol “|” to transfer the
output of the “cat” command to the “wc -l” command. The following command line will
count the number of records stored in the FASTQ files:
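cat SRR030834.fastq | wc -l | awk '{print $1/4}'   # the line count divided by 4 gives the read count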
If we need to display the file name and read count for multiple files, with the “.fastq” file
name extension, in a directory, we can use the following script:
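# one possible form of such a script: loop over the FASTQ files and print each
# file name together with its read count (line count divided by 4)
for f in *.fastq
do
printf "%s\t" "$f"
cat "$f" | wc -l | awk '{print $1/4}'
done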
To display a FASTQ file in a tabular format, you can use the “cat” command and then use
the Unix pipe to transfer the output to the “paste” command, which converts the four lines
of the FASTQ records into tabular format.
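cat SRR030834.fastq | paste - - - - > SRR030834_tab.txt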
The command will store the new tabular file in a new file “SRR030834_tab.txt”. You can
open this file in any spreadsheet, or you can display it as follows:
less -S SRR030834_tab.txt
Creating a tabular file from a FASTQ file will help us to perform several operations such as
sorting of the entries, filtering out the duplicate reads, extracting read IDs, sequences, or
quality scores, and creating a FASTA file. We expect that the format of the identifier lines
of a FASTQ file is consistent. If you display “SRR030834_tab.txt”, you will notice that some
of the identifier line fields are separated by spaces, and if we consider the space as a column
separator, the IDs will be in the first column and the sequence will be in the fourth column.
However, this column order may be different in tabular files extracted from other FASTQ
files. Assume that we wish to extract only the IDs and sequences from “SRR030834_tab.
txt” in a separate text file, then we can use the “awk” command as follows:
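# with whitespace as the separator, $1 holds the read ID and $4 holds the sequence in this file
awk '{print $1 "\t" $4}' SRR030834_tab.txt > SRR030834_seq.txt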
The “awk” command extracts the first column and fourth column from “SRR030834_tab.
txt” and prints the two columns separated by a tab. The output is directed to a new text file
“SRR030834_seq.txt” (Figure 1.8).
Linux commands allow us to do multi-step operations. Assume that we want to create a
FASTA file from the FASTQ file; we can do that in multiple steps. First, we need to extract
both IDs and sequences in a file as we did above, then we can remove “@” symbol leaving
only the IDs, then we need to add “>” in the beginning of each line with no space between
the “>” and the IDs, and finally, we separate the two columns, forming the definition line
(defline) of FASTA and the sequence, store them in a file, and delete the temporary files.
In the FASTA format, as shown in Figure 1.9, each entry contains a definition line and a
sequence. The defline begins with “>” and can contain an identifier immediately after “>”
(no whitespace in between).
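Putting these steps together, one possible sketch in bash (the temporary file name is illustrative) is:

# start from the IDs/sequences file created above
sed 's/^@//' SRR030834_seq.txt > tmp_ids_seqs.txt                 # remove the leading "@" from the IDs
awk '{print ">" $1 "\n" $2}' tmp_ids_seqs.txt > SRR030834.fasta   # form the defline and the sequence line
rm tmp_ids_seqs.txt                                               # delete the temporary file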
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
cd FastQC
This directory contains several files but we will focus only on “fastqc” file, which is the pro-
gram command that we will use and “Configuration” directory which contains three files
“adapter_list.txt”, “contaminant_list.txt”, and “limits.txt”, which can be used to configure
the QC report.
Use the Linux command “chmod” to make “fastqc” file executable:
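chmod +x fastqc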
You may need to add the program to the Linux path so that you can use it from any direc-
tory. To do that, open the file “~/.bashrc” with any text editor and add the program path to
the end of the file, save the file and exit.
export PATH=”/home/mypath/FastQC”:$PATH
You may need to replace “/home/mypath” in the above statement with the path of your
downloaded FastQC.
You need to restart the terminal or you can use “source” command to make that change
active.
source ~/.bashrc
Finally, test the FastQC from any directory with the following:
fastqc -h
If the path was set correctly, you will see the program help screen.
On Linux, the FastQC program can also be installed simply by running “sudo apt-get
install fastqc”. Once it has been installed, you do not need any further configuration as we
did above. Although either of the two installation methods works, with the first method, you will know where you downloaded the program and you can reach the “Configuration”
directory easily to make changes to any of the three configuration files stored there.
The FastQC program can be run in one of two modes: either as (i) an interactive graphi-
cal user interface (GUI) in which we can load FASTQ files and view their results or (ii) as a
non-interactive mode on a command line where we can specify the FASTQ files and then it
will generate an HTML report for each file. However, if we run it non-interactively, to display
the report, we will need to open the HTML files on an Internet browser. The command-line
mode allows FastQC to be run as part of an analysis pipeline. FastQC uses a FASTQ file as
an input and generates quality assessment reports including per base sequence quality, per
tile sequence quality, per sequence quality scores, per base sequence content, per sequence
GC content, per base N content, sequence length distribution, sequence duplication levels,
overrepresented sequences, adaptor content, and k-mer content. FastQC supports all vari-
ants of FASTQ formats and gzip-compressed FASTQ files.
We will download some public single-end FASTQ files from an NCBI BioProject with
an accession “PRJNA176149” for practicing purpose. The SRA files of this project contain
genomic single-end reads of Escherichia coli str. K-12. To keep the files organized, we can
create the directory “ecoli” using “mkdir ecoli” and then move it inside this directory “cd
ecoli” and save the following IDs (each in a line) in a text file with the file name “ids.txt”
using any text editor:
SRR653520
SRR653521
SRR576933
SRR576934
SRR576935
SRR576936
SRR576937
SRR576938
Then, run the following script to create the subdirectory “fastQC” and to download the
FASTQ files associated with the IDs stored in the “ids.txt” file into the directory:
mkdir fastQC
while read f;
do
fasterq-dump \
--outdir fastQC “$f” \
--progress \
--threads 4
done < ids.txt
Once the raw FASTQ files have been downloaded, we can use the command “ls -lh fastQC”
to display the file names as shown in Figure 1.10.
On Linux terminal, we can use FastQC non-interactively and later we will display the
generated reports on an Internet browser. But before running FastQC, it is important to
know about the “limits.txt” file in the “Configuration” directory. This file contains the
default values for the FastQC options, and we can use it to determine which report to gen-
erate. Use a text editor of your choice to open that file and study its content. In most cases,
no change is needed. At this point, we will change only
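the k-mer module setting. In the default “limits.txt” shipped with FastQC, the k-mer module is switched off by a line of the form:

kmer  ignore  1

Changing the value from 1 to 0 turns the k-mer report on (the exact wording of this line is based on FastQC's default configuration and may differ slightly between versions).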
Then, save the file and exit. This change is necessary to include k-mer report when we run
the program.
The following is a simple syntax for running the FastQC program non-interactively on
the command line:
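fastqc [options] seqfile1.fastq seqfile2.fastq ... seqfileN.fastq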
The input can be a single FASTQ file name or multiple file names separated by whitespaces.
The FastQC program has several options that can be displayed using the following
command:
fastqc --help
Since we have downloaded the eight E. coli raw FASTQ files above and stored them in the
“fastQC” directory, we can either run the program for each file or provide all file names
as input as shown in the above syntax. However, the efficient way is to use the bash com-
mands if we are using a Linux/Unix platform. The following bash script creates a directory
“qc”, changes to “fastQC” directory where the FASTQ files are stored, stores the file names
in a variable “filename”, then runs the FastQC program non-interactively, and finally saves
the QC reports in the “qc” directory:
mkdir qc
cd fastQC
filenames=$(ls *.fastq)
fastqc $filenames \
--outdir ../qc \
--threads 3
cd ..
The same steps can also be written more concisely as:

mkdir qc
cd fastQC
fastqc *.fastq --outdir ../qc --threads 3
The QC reports of the FASTQ files will be stored in the “qc” directory. FastQC will gen-
erate an HTML file “*_fastqc.html” and a zipped file “*_fastqc.zip” for each FASTQ file.
The zipped file contains the images and the text files that will be displayed on the HTML
when it is opened on an Internet browser. We can display the HTML files from the Linux
terminal using Firefox. If Firefox is not installed, use “sudo apt-get install firefox” to
install it. If you are using Putty to access a remote computer, you may need to enable “X11
Forwarding”.
Use the following syntax to display the QC reports:
cd qc
htmlfiles=$(ls *.html)
firefox $htmlfiles
cd ..
The Firefox Internet browser will display the QC FastQC reports of all FASTQ files, each
on a separate tab as shown in Figure 1.11. Click a tab to move from a report to another.
A FastQC QC report includes a summary, basic statistics, and the quality control metrics that we will discuss in some detail. The Summary, on the left-hand side, provides a quick visual overview of the different QC metrics available in the report and allows us to identify any existing problem. Next to each metric title, the Summary displays a sign indicating whether there is any potential quality problem: green indicates a normal metric, orange indicates a slightly abnormal metric, and red indicates a quality problem associated with the metric. A metric with an orange warning requires attention and correction if possible, and a metric with a red warning indicates that there is a clear fault in the sequence data, which should be fixed before proceeding to the analysis step. The Summary also acts as a table of contents: clicking a Summary item will take you to that item's graph. In the following, we will discuss each item that may be included in the QC report.
The GC content (%GC) is important for revealing sequencing problems due to bias. There
is a relationship between GC content and read coverage across a genome. Some short-read
sequencers may tend to sequence the region with higher GC content more or less than the
region with the lower GC content [11]. The average of the GC content of bacterial genome
varies from less than 15% to more than 75% [12], and the average genomic GC content of
most eukaryotes lies somewhere within 40%–50% [13]. Very small or very large GC con-
tent may indicate a potential sequencing bias problem that we may need to fix.
analysis, we may need to filter the reads that have low-quality bases or to trim the ends
of the reads beginning from the 34th base. Figure 1.15 shows a per base sequence quality
graph without warning.
bases in a base position and to quickly detect if there is any poor quality associated with
any tile. A graph with a completely blue color indicates an overall good quality base call
from all tiles (Figure 1.17a).
The quality losses associated with the physical positions on the flow cell are of different
causes. A random low quality at different positions is usually due to a technical problem
with the run, such as the flow cell being overloaded (Figure 1.17b).
FIGURE 1.17 Per tile sequence quality graphs, some with quality problems.
Reads with such random scattered hot areas (errors) are hard to fix. Another
cause of the loss of quality is the failure of the imaging system at the edges of the flow
cell to read signals (Figure 1.17b). However, the reads in such a case can still be usable.
The loss of quality that begins from somewhere and continues until the end of the run
(Figure 1.17c) is usually due to an obstruction to the imaging system (because of a dirt
on the surface of the flow cell). The reads with low quality, in this case, can be filtered
out. A final type of errors associated with tiles is a temporary loss of quality in specific
base positions (Figure 1.17d) due to an obstruction to few cycles (because of bubbles). The
obstruction usually stops the imaging and also prevents the reagents from getting to the
DNA template clusters. Such quality problem may introduce false insertion to the reads
and it cannot be fixed with sequence trimming since the low-quality bases are not at the
end of the reads.
Indeed, not every quality problem associated with tiles can be removed by trimming or by filtering out reads originating from the affected tiles. However, the last type of fault may affect several tiles and may extend across several reads without lowering the overall sequence score. We should watch out for this kind of fault, especially for reads used for variant discovery.
content across the reads in a FASTQ file and then compares it to the theoretical normal
distribution of the GC content, which is estimated from the observed data. If there is no
sequencing bias and the library is random, we expect the observed distribution of the GC content of the reads to be approximately normal and roughly similar to the theo-
retical distribution in which the central peak corresponds to the overall GC content of the
underlying genome. Deviation of the distribution of the per sequence GC content from the
normal distribution is an indication of a contaminated library or a fault in the sequencing
process. However, a bell-shaped normal curve that deviates from theoretical curve may or
may not be biased. In this case, there is a chance that the observed distribution may repre-
sent the actual distribution of the genome of the organism; therefore, no warning will be
issued.
A warning sign is displayed if the observed distribution deviates from normal distribu-
tion by a sum of more than 15% of the reads. A failure sign will be displayed if the distribu-
tion deviates by a sum of more than 30% of reads.
FIGURE 1.22 Sequence length distribution graphs (equal length and variable lengths).
length of the reads equal. However, reads with unequal lengths are sometimes generated,
especially if the reads are trimmed to remove low-quality bases at the beginning or ends
of the reads. The sequence length distribution graph shows the read length distribution.
If the reads are of the same length, the graph will be simple with a single peak at a bar
indicating a single value (Figure 1.22a). When reads are of a variable length, the graph will
show the relative read count of each read length (Figure 1.22b). A warning is displayed if
the reads do not have the same length.
due to the gene expression and that overrepresentation can be of a biological importance
rather than a bias. The overrepresented sequences report is a table that shows the over-
represented sequences, counts, percentage, and possible source. To save memory, only the
first 200,000 reads in the FASTQ file are checked; therefore, the list is not exhaustive, and other overrepresented sequences may escape detection. For each overrepresented sequence,
the FastQC program will search on a database of known contaminants and report the best
match that is at least 20 bases in length and has no more than a single mismatch. A warn-
ing will be issued if a sequence is overrepresented more than 0.1% of the total and failure
will occur if the overrepresentation is more than 1% of the total. As shown in Figure 1.24,
five overrepresented sequences are found, three of which are contaminating adaptors and
two sequences have no hits. The count and percentage reflect the significance of each of
these overrepresented sequences. The count of the first sequence in the table represents
29.4% of the total count of the reads in the FASTQ file. It is clear that this sequence originated from primer contamination, and it must be removed before analysis.
the proportion of the reads which contain the adaptor sequences at each position. Known
adaptor sequences and description are stored in the “adapter_list.txt” file as shown in
Figure 1.25.
K-mers (sequences of k size of bases) are formed from the adaptor sequences in the
“adapter_list.txt” file and then the program searches for these k-mers to report the total
percentage of the reads which contain these k-mers. The report may discover the sources of
bias due to contaminating adaptor dimers in the library.
A warning is raised if any sequence is present in more than 5% of all reads, and a failure
occurs if any sequence is present in more than 10% of all reads.
Figure 1.26 shows a FASTQ file with raw reads without adaptor content (left) and a
FASTQ file with reads with failed metric due to significant content of Illumina Universal
Adaptor.
mkdir fastxtoolkit
cd fastxtoolkit
wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
Copy the program files from the “bin” directory to “/usr/local/bin” so that it can be exe-
cuted from any directory on the computer:
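sudo cp ./bin/* /usr/local/bin/   # copying into /usr/local/bin requires administrator privileges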
Figure 1.28 shows the FASTX-toolkit tools in the “bin” directory after downloading and
extracting the compressed archive file.
FASTX-toolkit includes several tools for the processing of FASTQ files as described in
Table 1.4. You can display the usage and options of any of the executable programs by
entering the program name with “-h” option on the command-line prompt. For instance,
to display the help for “fastq_quality_filter”, simply enter the following on the command
line:
fastq_quality_filter -h
To show how FASTQ files are processed, we will download a raw FASTQ file from the NCBI
SRA database and modify its name for the practice. The following commands create the
directory “preprocessing”, then download the FASTQ file from the NCBI SRA d atabase,
and finally rename it to “bad.fastq” file for the practice purpose. The script then generates
the QC FastQC report and displays the report on the Firefox browser.
mkdir preprocessing
cd preprocessing
fasterq-dump --verbose SRR957824
rm SRR957824_1.fastq
mv SRR957824_2.fastq bad.fastq
fqfile=$(ls *.fastq)
fastqc $fqfile
htmlfile=$(ls *.html)
firefox $htmlfile
When all the commands have been executed sequentially without an error, the QC report
will be displayed on the Firefox Internet browser. Study the reports carefully and identify
any potential problems in the quality metrics that we discussed in the previous section. Figure 1.29 shows that the reads in the file have three failures and a single warning. Next, we will try to fix these problems as far as possible.
Using such a FASTQ file in the downstream analysis without fixing some of the quality problems will definitely impact the results negatively and may lead to misleading results. The good strategy whenever there are warnings or failures is to try the available ways to fix the problems, and if there is an unfixable problem, to be aware of it and to know how it may affect the results.
FIGURE 1.29 The QC report summary and per base sequence quality for “bad.fastq” file.
Figure 1.29 shows a common deterioration of the quality of bases toward the end of the
reads produced by short-read sequencing instruments. We can also notice that the quality scores at some positions are as low as 2 on the Phred scale (a probability of error of about 0.6).
The report shows three failures and a single warning: failed per base sequence quality
(Figure 1.29), failed per base sequence content and failed k-mer content (Figure 1.30), and
overrepresented sequences warning (Figure 1.31).
The QC processing strategies differ from one FASTQ file to another, depending on
the failed metrics. Understanding the problem always gives a good idea about which kinds
of QC processing to perform. In our example file, we will begin by filtering the low-quality
reads and clipping the overrepresented sequences and then we will run FastQC again to see
how the quality is improved.
First, we will try to fix the per base quality score of the reads in the FASTQ file by
using “fastq_quality_filter” to keep the reads that have 80% of the bases which have qual-
ity scores equal or greater than 28. The following script performs filtering (the output file
is “filtered.fastq”), runs FastQC to generate the new QC report, and then runs Firefox to
display the QC report on the Internet browser:
fastq_quality_filter \
-i bad.fastq \
-q 28 \
-p 80 \
-o filtered.fastq \
-Q33
fastqc filtered.fastq
firefox filtered_fastqc.html

FIGURE 1.30 Failed per base sequence content and k-mer content.
In the above script, “-i” option specifies the input FASTQ file, “-q” specifies the minimum
Phred quality threshold, “-p” specifies the percentage of bases of the reads that have at least
the specified threshold quality, “-o” specifies a name of the output FASTQ file where the
filtered reads are stored, and “-Q33” is to tell the program that the FASTQ quality encod-
ing is Phred+33 (the default is “-Q64”; therefore, we must use “-Q33” for FASTQ files with
Illumina 1.9 encoding or later).
Figure 1.32 shows the per base sequence quality graph of the filtered FASTQ file. The
filtering process removed 499,970 reads, which did not meet the criteria. The per base
sequence quality, which is the most important metric, has been improved, and the per base sequence content has also been improved. However, some positions at the ends of the reads still have low Phred quality scores. We can trim the low-quality bases from the ends of the
reads by using the “fastq_quality_trimmer” program. Instead of removing the reads that
FIGURE 1.32 A graph of the filtered “bad.fastq” file with low-quality bases at the read ends.
do not meet the criteria entirely, this program cuts only the bases, whose quality scores are
less than the specified threshold, from the ends of the reads.
fastq_quality_trimmer \
-i bad_filt.fastq \
-t 28 \
-o bad_filt_trim.fastq \
-Q33
fastqc bad_filt_trim.fastq
htmlfiles=$(ls *.html)
firefox $htmlfiles
The “-t” option specifies the quality threshold, which is the minimum quality score below
which the bases will be trimmed from the ends of the reads. When trimming is performed, the resulting reads may be of unequal lengths, which may not be accepted by some programs used in later steps of the analysis. As shown in Figure 1.33, although the per base
sequence quality has been improved by trimming, it also raised a sequence length distribu-
tion warning since trimming resulted in reads with unequal lengths. We may need to filter
reads by length.
FIGURE 1.33 The QC report of the filtered and trimmed “bad.fastq” file.
The sequence length distribution warning can also be raised by clipping adaptors or overrepresented sequences. Thus, we can use “fastx_clipper” first to remove the overrepresented sequences (see Figure 1.31). The following script removes a contaminating overrepresented sequence:
fastx_clipper \
-a ATCGGGAGAGGGGCGGGGAGGGGAAGAGGGGAGAATTCGGGGGGGGCCGG \
-i bad_filt_trim.fastq \
-o bad_filt_trim_clip.fastq \
-v \
-Q33
fastqc bad_filt_trim_clip.fastq
htmlfiles=$(ls *.html)
firefox $htmlfiles
Since some aligners in the next step of analysis may not accept sequences with unequal
lengths, we can use a bash script to filter out the short reads. Figure 1.34 shows sequence
length distribution. If the aligner that we intend to use does not accept unequal read
lengths, then we can filter out all reads whose length is less than 150 bases using the fol-
lowing script:
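# one possible form of such a filter: flatten each record with "paste" so that, with a tab
# separator, the sequence is in column 2 ($2) and the quality string in column 4 ($4), keep
# records whose sequence and quality strings are at least 150 bases long, and then rebuild
# the FASTQ records (the output file name is illustrative)
paste - - - - < bad_filt_trim_clip.fastq | \
awk -F'\t' 'length($2) >= 150 && length($4) >= 150' | \
tr '\t' '\n' > bad_filt_trim_clip_150.fastq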
If you use the above script for other FASTQ files, you may need to change the read length
and the numbers of the columns. In our example FASTQ file, “$2” is for the sequence col-
umn and “$4” is for the quality column. The numbers of these columns may vary depend-
ing on the content of the FASTQ definition line.
Figure 1.35 shows the per base sequence quality of the final FASTQ file. You should
remember that you may not be able to fix all quality problems, and that filtering and clip-
ping may compromise the sequencing depth. Fortunately, most of the problems other than
base quality errors can be tolerated by the majority of alignment programs. However, we should try to fix the failed metrics as far as possible before continuing to the subsequent step of the analysis.
The FASTX-toolkit tools listed in Table 1.4 are used for quality assessment and quality
adjustment. The major limitation is that “fastq_quality_filter” of FASTX-toolkit does not
process the two paired-end FASTQ files together, which usually results in singletons (reads without mates) in either of the two paired-end FASTQ files. Most aligners will not process paired-end FASTQ files containing singletons. The FASTX-toolkit solution to the singleton
problem is to mask the low-quality bases instead of removing the reads with low-quality
bases. Thus, “fastq_masker” program is used instead of “fastq_quality_filter” to mask the
bases of Phred quality score less than a user-defined threshold “-q”.
fastq_masker \
-q 20 \
-i bad.fastq \
-o bad_masked.fastq \
-Q33
fastqc bad_masked.fastq
firefox bad_masked_fastqc.html
The above “fastq_masker” command masks the bases with quality lower than 20 Phred
quality score “-q 20”; therefore, they will be ignored by aligners and assemblers.
For paired-end FASTQ files produced by an Illumina instrument, there is another FASTQ processing program, developed for Illumina paired-end data, called Trimmomatic [15]. It is a multithreaded, command-line, Java-based program and is more modern than FASTX-toolkit. It performs several operations, including detecting and removing known adaptor fragments (adapter.clip), trim-
ming low-quality regions from the beginning of the reads (trim.leading), trimming low-
quality regions from the end of the reads (trim.trailing), filtering out short reads (min.
read.length), in addition to other operations with different quality-filtering strategies for
dropping low-quality bases in the reads (max.info and sliding.window). Trimmomatic can
be used in two modes: simple and palindrome modes. In the simple mode, for removing
adaptor sequences, the pairwise local alignment between adaptor sequence and reads is
used to scan reads from 5′ ends to 3′ ends using seed and extend approach. If a score of
a match exceeds a user-defined threshold, both the matched region and the region after
alignment will be removed. The entire read is removed if an alignment covers all the read.
The simple Trimmomatic approach may not be able to detect the short adaptor sequence.
Therefore, the palindrome model is used because it is able to detect and remove short
fragment sequences of adaptors. The palindrome mode is used only for paired-end data. Both forward and reverse reads will have an equal number of valid bases, and each read complements the other. The valid bases are followed by the contaminating bases from the adaptors.
The tool uses the two complementary reads to identify the adaptor fragment or any other
contaminating technical sequence by globally aligning the forward and reverse reads. An
alignment score that is greater than a user-defined threshold indicates that the first parts
of each read reversely complement one another and the remaining read fragments which
match the adaptor sequence will be removed.
$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip
Notice that the version may change in the future. The unzipped directory is
“Trimmomatic-0.39”, where there will be two files (“LICENSE” and “trimmomatic-0.39.
jar”) and a directory (“adapters”). The file “trimmomatic-0.39.jar” is the Java executable
program that performs the preprocessing tasks, and the directory “adapters” contains the known adaptor sequences in FASTA files. The following script uses Trimmomatic to preprocess the paired-end FASTQ files, then runs FastQC to generate QC reports, and finally
displays the reports on the Firefox browser:
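# a sketch of a Trimmomatic command consistent with the options described below; the
# thread count, output file names, adaptor file path, and ILLUMINACLIP thresholds are illustrative
java -jar trimmomatic-0.39.jar PE \
-threads 4 \
SRR957824_1.fastq SRR957824_2.fastq \
SRR957824_1_paired.fastq SRR957824_1_unpaired.fastq \
SRR957824_2_paired.fastq SRR957824_2_unpaired.fastq \
ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 \
SLIDINGWINDOW:5:35 \
MINLEN:35
fastqc SRR957824_1_paired.fastq SRR957824_2_paired.fastq
firefox SRR957824_1_paired_fastqc.html SRR957824_2_paired_fastqc.html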
The option “PE” is used for paired end, and then the two paired-end FASTQ files
“SRR957824_1.fastq” and “SRR957824_2.fastq” were provided as inputs. The adaptors that
were detected and removed from the reads are stored in the “TruSeq3-PE.fa” file in the
“adapters” directory. Hence, “ILLUMINACLIP:TruSeq3-PE.fa” is used to specify the file in which the adaptor sequences are stored. The program removes the leading and trailing edges of reads whose quality falls below a Phred quality score of 3. The “SLIDINGWINDOW:5:35” step is used so that the program scans each read with a 5-base-wide sliding window and cuts the read when the average per base quality score within the window falls below 35. Finally, the program removes the reads that are shorter than 35 bases.
In Figures 1.36 and 1.37, notice how the quality of the two files has been improved and also notice that the total number of sequences is equal in both files. However, the read lengths vary. If for any reason we need reads of the same length, as some aligners may require, we can set "MINLEN:" to the maximum read length. Since the maximum read length here is 151 bases, we can use "MINLEN:151" as follows:
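For example, the same command as above can be rerun with the higher minimum length (file names as in the previous sketch):
java -jar Trimmomatic-0.39/trimmomatic-0.39.jar PE \
SRR957824_1.fastq SRR957824_2.fastq \
SRR957824_1_paired.fastq SRR957824_1_unpaired.fastq \
SRR957824_2_paired.fastq SRR957824_2_unpaired.fastq \
ILLUMINACLIP:Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:5:35 MINLEN:151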
Check out the reports of the two paired-end FASTQ files to see that only the reads with
equal length are left.
There are several other programs that can also be used for improving the quality of FASTQ reads. For instance, fastp can be used for filtering low-quality reads and trimming adaptors. This program is particularly fast and easy to use as part of a pipeline. Moreover, it is able to identify and trim adaptor sequences without requiring the adaptor sequences to be provided [16].
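As a sketch, a typical fastp command for the same paired-end files might be as follows (the output file names are illustrative):
fastp \
-i SRR957824_1.fastq -I SRR957824_2.fastq \
-o SRR957824_1.trimmed.fastq -O SRR957824_2.trimmed.fastq \
--detect_adapter_for_pe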
1.7 SUMMARY
NGS produces short reads that are widely used in different sequencing applications because of their high accuracy and low cost. However, the long reads produced by the TGS technologies (Pacific Biosciences and Oxford Nanopore Technologies) have also gained popularity in applications like de novo assembly, metagenomics, and epigenetics. The accuracy of the long-read technologies has improved substantially, but their cost is still higher and less affordable compared to short-read technologies. The sequencing depth and base call quality are the two crucial factors for most applications, and analysts must check them before proceeding with the analysis. Most HTS instruments perform quality control before delivering raw sequence data in FASTQ files. However, per base qualities and other quality metrics must be assessed before using raw data in any analysis.
There are several programs for quality assessment, but FastQC is the most popular one.
FastQC is a user-friendly program to assess the quality of the reads generated by any of the
sequencing technologies, and it produces a report that summarizes the results in graphs
that are easy to interpret. The potential quality problems include low-quality bases, pres-
ence of adaptor sequences connected to the reads, presence of adaptor dimers or other
technical contaminating sequences, overrepresented PCR sequences, sequence length dis-
tribution, per base sequence content, per sequence GC content, per base N content, and
k-mer content. The per base sequence quality and adaptor content are the most important
metrics that we should look at and act on. The ideal sequencing data are those without warnings or failed metrics. Therefore, we should try to fix the problems as far as possible. However, some problems may not be solvable. If an unsolved problem does not affect the reads severely, the data can still be used in the analysis, but we must be aware that unsolved problems may have some negative impact on the results. Read quality problems can be addressed, based on the failed metrics, by removing low-quality reads, trimming reads from the beginning and the end, and masking the bases
with low-quality scores. There are several programs for the processing of raw sequence
data. FASTX-toolkit is the most popular one for single-end FASTQ files, and Trimmomatic
is more sophisticated and can be used for both single-end and paired-end raw data. Fastp
filters low-quality reads and automatically recognizes and trims adaptor sequences. It is
important to process the paired-end FASTQ files (forward and reverse) together to avoid producing singletons, which most aligners will not accept. In this chapter, we discussed command-line programs for quality control. Those programs, or similar ones, are also implemented in Python, R, and other programming languages, but the general principles for checking raw data quality and solving potential quality problems are the same. Most sequencing applications use these kinds of QC processing,
but when we cover the metagenomic data analysis, you will learn how to preprocess micro-
bial raw data using different programs. Once the raw sequencing data are cleaned, then we
can move safely to the next step of sequence data analysis depending on the application
workflow that we are adopting.
REFERENCES
1. Holley RW, Everett GA, Madison JT, Zamir A: Nucleotide sequences in the yeast alanine
transfer ribonucleic acid. J Biol Chem 1965, 240:2122–2128.
2. Jou WM, Haegeman G, Ysebaert M, Fiers W: Nucleotide sequence of the gene coding for the
bacteriophage MS2 coat protein. Nature 1972, 237(5350):82–88.
3. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman
MS, Chen Y-J, Chen Z et al: Genome sequencing in microfabricated high-density picolitre
reactors. Nature 2005, 437(7057):376–380.
4. Braslavsky I, Hebert B, Kartalov E, Quake SR: Sequence information can be obtained from sin-
gle DNA molecules. Proceedings of the National Academy of Sciences 2003, 100(7):3960–3964.
5. Rhoads A, Au KF: PacBio sequencing and its applications. Genomics, Proteomics &
Bioinformatics 2015, 13(5):278–289.
6. Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW: Zero-mode wave-
guides for single-molecule analysis at high concentrations. Science 2003, 299(5607):682–686.
7. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences
with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010,
38(6):1767–1771.
8. FASTQ Files [https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/
Informatics/BS/FASTQFiles_Intro_swBS.htm]
9. Leinonen R, Sugawara H, Shumway M: The sequence read archive. Nucleic Acids Res 2011,
39(Database issue):D19–21.
10. Andrews S: FastQC: A Quality Control Tool for High Throughput Sequence Data. Babraham
Bioinformatics, Babraham Institute, Cambridge, United Kingdom; 2010.
11. Chen Y-C, Liu T, Yu C-H, Chiang T-Y, Hwang C-C: Effects of GC bias in next-generation-
sequencing data on de novo genome assembly. PLoS One 2013, 8(4):e62856.
12. Lightfield J, Fram NR, Ely B: Across bacterial phyla, distantly-related genomes with similar
genomic GC content have similar patterns of amino acid usage. PLoS One 2011, 6(3):e17677.
13. Romiguier J, Ranwez V, Douzery EJ, Galtier N: Contrasting GC-content dynamics across 33
mammalian genomes: relationship with life-history traits and chromosome sizes. Genome
Res 2010, 20(8):1001–1009.
14. FASTX-toolkit [http://hannonlab.cshl.edu/fastx_toolkit/]
15. Bolger AM, Lohse M, Usadel B: Trimmomatic: A flexible trimmer for Illumina sequence data.
Bioinformatics 2014, 30(15):2114–2120.
16. Chen S, Zhou Y, Chen Y, Gu J: fastp: An ultra-fast all-in-one FASTQ preprocessor.
Bioinformatics 2018, 34(17):i884–i890.
Chapter 2
Mapping of Sequence Reads to the Reference Genomes
DOI: 10.1201/9781003355205-2
Reference genome sequences are available for many organisms, together with their gene annotations, which can be used in the process of read alignment/mapping to act as guides on which new genomes are assembled quickly. A reference genome
of an organism is a curated sequence that is built up using the DNA information of several
normal individuals of that organism. The reference genome curation was pioneered by
the Genome Reference Consortium (GRC), which was founded in 2008 as a collaboration of
the National Center for Biotechnology Information (NCBI), the European Bioinformatics
Institute (EBI), the McDonnell Genome Institute (MGI), and the Wellcome Sanger Institute
to maintain and update the human and mouse genome reference assemblies. Now, GRC
maintains the human, mouse, zebrafish, rat, and chicken reference genomes. Reference
genomes of other organisms are curated by specialized institutions including NCBI and
many others, which manually select genome assemblies that are identified as standard or
representative sequences (RefSeq) against which data of the individuals from those organ-
isms can be compared. All eukaryotes have a single reference genome per species, but pro-
karyotes may have multiple reference genome sequences for a species. The NCBI curates
reference genomes from the assemblies categorized as RefSeq on the GenBank database. If
a eukaryotic species has no assemblies in the RefSeq, then the best GenBank assembly for
that species is selected as a representative genome. Viruses, as well, may have more than one reference genome per species. Generally, updating the reference genome of a species is
a continuous process and a new version, usually called “Build”, may be released whenever
new information emerges. A release of a reference genome may be accompanied by gene
annotations. A well-curated reference genome, like human and other model organisms’
reference genomes, is usually released with annotation information such as gene anno-
tation and variant annotation. Reference genomes are made available at the NCBI web-
site in both FASTA file format and GenBank file format. Several annotation files may be
found including gene annotation in GFF/GTF file format, GenBank format, and tabular
format. The reference transcriptome (whole mRNA of an organism) and proteins may also
be available as shown in Figure 2.1.
For the alignment/mapping of reads produced by sequencing instruments, we may
need to download a reference genome of the species from which the sequencing raw data
are taken. The sequence of the reference genome must be in the FASTA file format. For
example, to download the FASTA file of the human genome, you can copy the link from
“genome” hyperlink on the Genome database web page and on Linux terminal use “wget”
to download the file to the directory of your choice “e.g. refgenome”:
mkdir refgenome
wget \
-O “refgenome/GRCh38.p13_ref.fna.gz” \
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.
fna.gz
This script will create the “refgenome” directory, where it will download the compressed
FASTA sequence of the human reference genome “GRCh38.p13_ref.fna.gz”. The size of
the compressed current FASTA sequence file of the human genome (GRCh38.p13) is only
921M. We can decompress it using the “gunzip” command.
cd refgenome
gunzip -d GRCh38.p13_ref.fna.gz
This command will decompress the reference genome file to “GRCh38.p13_ref.fna” and
the file size now is 3.1G. A large file can be displayed using a Linux command for viewing large text files, such as "less" or "cat". The reference sequences are in the FASTA file
format. A file contains several sequences representing the genomic units such as chromo-
somes. Each FASTA sequence entry consists of two parts: a definition line (defline), which
is a single line that begins with “>” symbol, and a sequence, which may span several lines.
Figure 2.2 shows the beginning of the human genome reference sequence. Notice that the
defline includes the GenBank accession of the sequence, species scientific name, genome
unit (chromosome number), and the human genome Build. A chromosome sequence may
begin with multiple ambiguous bases (Ns). In Figure 2.2, we removed several lines of Ns
intentionally to show the DNA nucleobases.
The following Unix/Linux commands are general-purpose text-processing commands, and here we can use them with FASTA files to collect some useful information.
To display the FASTA file content page by page, you can use “less” command:
less GRCh38.p13_ref.fna
To count the number of FASTA sequences in the FASTA file, use “grep” command:
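grep -c "^>" GRCh38.p13_ref.fna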
FIGURE 2.2 Part of the FASTA sequence of the human reference genome.
To count the total number of bases in the reference file, you can combine “grep”, “wc”, and
“awk” commands as follows:
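One way is to count all characters on the sequence lines and then subtract the newline characters:
grep -v "^>" GRCh38.p13_ref.fna | wc -lc | awk '{print $2 - $1}'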
If for any reason, you want to split the reference sequences into files, you can use the fol-
lowing script that creates the directory, “chromosomes”, and then it splits the main FASTA
file into several FASTA files:
mkdir chromosomes
cd chromosomes
csplit -s -z ../GRCh38.p13_ref.fna '/>/' '{*}'
for i in xx* ; do \
n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ; \
mv "$i" "$n.fa" ; \
done
The annotation files relevant to a reference genome may also be needed for some of the
steps in the downstream analysis. You can download the annotation file as above. The
annotation file describes where genetic elements, also called features (such as genes, introns, and exons), are located in the genome sequence, showing their start and end coordinates and feature names. The annotation files are usually in GFF or GTF file format. The GFF (General Feature Format) is a simple tab-delimited text file for describing genomic
features and mapping them to the reference sequence in the FASTA file. The GTF (Gene
Transfer Format) is similar to GFF but it has additional elements. Figure 2.3 shows the
first part of the human annotation file in the GFF format. An annotation file includes the chromosome name or chromosome GenBank accession in the first column, and the features and other annotations in the other columns.
Both the FASTA reference file and (sometimes) its annotation file are required by the
alignment programs, shortly called aligners, for mapping the reads in the FASTQ files to
FIGURE 2.3 Part of the human annotation file in GTF file format.
the reference genome in a process known as read sequence mapping or alignment. In the
read mapping process, the FASTQ files may contain millions of read sequences that we
wish to align to a sequence of a reference genome to produce aligned reads in a file format
called SAM, which stands for Sequence Alignment Map format. The aligned reads can also
be stored in the SAM binary form called BAM (Binary Alignment Map format). We will
discuss this file format later in some detail.
In general, sequence mapping or alignment requires three elements: a reference file in
the FASTA format, short-sequence reads in FASTQ files, and an aligner, which is a program
that uses an algorithm to align reads to a reference genome sequence. We have already
discussed how to download the sequence of a reference genome of an organism from the
NCBI Genome database. However, before using a reference genome with any aligner, it
may require indexing with the “samtools faidx” command. You can download and install
Samtools by following the instructions available at “http://www.htslib.org/download/”. On
Ubuntu, you can install it using the following command:
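sudo apt-get install samtools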
Once you have installed Samtools successfully, you can use that tool to index the reference
genome and other tasks that you will learn later.
You have already downloaded the human reference genome above. If you didn’t do that,
you can download and decompress it using the following commands:
mkdir refgenome
wget \
-O “refgenome/GRCh38.p13_ref.fna.gz” \
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.
fna.gz
cd refgenome
gunzip -d GRCh38.p13_ref.fna.gz
Notice that the name of the reference genome or its URL may change in the future. The
above commands will create the directory “refgenome” where it downloads the human
reference genome and decompresses it. Once the reference genome has been downloaded,
you can use “samtools faidx” to index it as follows:
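Assuming you are still inside the "refgenome" directory:
samtools faidx GRCh38.p13_ref.fna
This command creates the index file "GRCh38.p13_ref.fna.fai", part of which is shown in Figure 2.4.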
FIGURE 2.4 Part of the fai index file of the human reference genome.
2.2.1 Trie
A trie or a prefix tree is a data structure that is used for fast retrieval on large datas-
ets such as looking up sequencing reads. The name was derived by E. Fredkin in 1960 from the word "retrieval" in the phrase "information retrieval" [5]; however, the idea was
first described by the Norwegian mathematician Axel Thue in 1912. A trie is an ordered
tree that can be used to represent a set of strings (reads) over a finite alphabet set, which
is A, C, G, and T for DNA sequences. It allows reads sharing a common prefix to share that prefix in the tree and stores only the differing ends of the reads to distinguish the sequences. A trie is
represented by nodes and edges. The root node is empty and then each node in the trie
represents a single character. The maximum number of possible children of a node is four
in case of DNA. The children of a node are ordered alphabetically. The leaf nodes are the
nodes that represent the last characters of reads. A read is represented by the path from
the root node to a leaf node.
The trie data structure, as it is, is not suitable for indexing a genome sequence since it stores a set of separate strings. However, the generalized idea of the trie is used in the suffix tree, which is widely used to index a reference genome sequence.
2.2.2 Suffix Tree
A suffix tree is a trie-like data structure built from all suffixes of a sequence rather than from a set of separate reads. For example, the suffixes of the sequence S = "CTTGGCTGGA$" and their starting positions (0-based) are:
$             10
A$             9
GA$            8
GGA$           7
TGGA$          6
CTGGA$         5
GCTGGA$        4
GGCTGGA$       3
TGGCTGGA$      2
TTGGCTGGA$     1
CTTGGCTGGA$    0
Notice that each line includes a suffix (key) and a position (value). Then from the key-value
pairs, we can construct a tree made up of nodes and edges. The positions (values) will be
the nodes and suffixes (keys) will be the edges of the tree. The suffix tree is built as shown
in Figure 2.5, starting from the first suffix on the top and it moves down to make branched
nodes and edges to avoid repeating common characters. This way, we will construct a
suffix tree with nodes and edges. An entire reference genome can be divided into suffixes
and stored this way with both suffixes and indexed positions in the unbranched nodes so
finding a pattern or a position of a read in the reference genome will be easy.
Once the reference sequence is indexed using the suffix tree, one of several searching
algorithms can be used to find the location where a read maps. For instance, to find “TGG”
in Figure 2.5, we will start searching from the root looking for “T”, and from the next node,
we will look for “GG”; thus, since there are two leaf nodes with the indexes 2 and 6, that
means “TGG” is aligned to the reference sequence in the positions 2 and 6 as shown by the
red color (Figure 2.5).
2.2.3 Suffix Array
A suffix array is a sorted array of the starting positions of all suffixes of a given sequence. There is a variety of algorithms used for constructing
the suffix array implemented by software packages [7]. In its simplest form, a suffix array
is constructed for a sequence or a string. For the string (sequence) S with a length N and
the characters (bases) indexed as 1, 2, 3, …, N, we can construct an array for all suffixes or
substrings of the sequence as S[1…N], S[2...N],…, S[N...N] and then sort the suffixes lexi-
cographically. For instance, for the sequence S = “CTTGGCTGGA$”, we can construct the
array of suffixes and sort them as shown in Figure 2.6.
Figure 2.6 shows how the suffix array is constructed from sorted suffixes. The numbers
are the positions in the sequence. The sorting of suffixes lets suffixes beginning with the
same string of characters to appear one after the other and that allows a fast lookup when
we try to find exact matches of a read (substring). For instance, the exact matches for the read "TGG" can be found by jumping to the sorted suffixes that begin with "TGG", and we can quickly locate the positions 7 and 3 (here the coordinates begin from 1, not from 0 as in the suffix tree). When a reference genome is indexed using a suffix array, finding the position of
a pattern or a read will have a linear time complexity. A major drawback of using suffix
arrays is that they require a large memory storage depending on the size of the genome
being used. STAR [8] is an example of the aligners that use suffix arrays.
2.2.4 Burrows–Wheeler Transform (BWT)
The Burrows–Wheeler transform of a sequence is computed by generating all cyclic rotations of the sequence (including the terminal "$" character); the rotations are then sorted alphabetically to form a matrix called BWM (Burrows–Wheeler
Matrix) as shown in the second column of Figure 2.7. The last column of characters (in
red) in the BWM is the BWT of the sequence. As shown in Figure 2.8, the BWT of the
sequence S is “AGG$GGTTCTC”. When an aligner uses BWT, it transforms the entire
genome sequence into a BWT. However, computing a BWT using the above naïve approach
is computationally expensive, with O(n² log n) time complexity. Instead, we can use the
suffix array to compute the BWT of a reference genome in O(n) time complexity. This
strategy is utilized by the aligners that use BWT. Constructing a BWT from a suffix array
is straightforward and very simple. To get the ith character of the BWT of the string S,
whose characters are indexed as i=1, 2, …, n, from the suffix array (A), simply use the
following rule: BWT(S)[i] = S[A[i] - 1] if A[i] > 1; otherwise (when A[i] = 1), BWT(S)[i] = $.
Figure 2.8 shows the indexes i=1, 2, 3, …,11 (first row) of the sequence S (second row) and
suffix array index A (third row), as shown in Figure 2.7. Now to infer the BWT from A,
let us begin from BWT(S)[11]. By applying the rule, BWT(S)[11] = S[A[11] - 1] = S[10] = A.
Likewise, BWT(S)[10] = S[A[10] - 1] = S[9]= G; BWT(S)[6] = S[A[6] - 1]=S[5]=G; but BWT(S)
[1], A[1] is less than 1; hence, BWT(S)[1] = $. For the BWT(S)[i] corresponding to A=9, 5, 8,
4, 7, 3, 2, we will continue using BWT(S)[i] = S[A[i] - 1]. The BWT is shown in Figure 2.9.
The question that comes up is why we need to transform the sequence of a reference genome into a BWT. The BWT serves two purposes: first, BWT groups the characters of
the sequence so that a single character appears many times in a row because the column
is sorted alphabetically and that can be used for sequence compression, which reduces the
memory storage. In this sense, the BWT is compressible and reversible. The second pur-
pose is that BWT is a data structure that can be used for indexing the sequence of a refer-
ence genome to make finding a position of a read fast.
A key property of the BWT is that we can reverse it to obtain the BWM by using the last-to-first column mapping property, or simply LF mapping, where L is the last column of the sorted rotations, which corresponds to the BWT, and F is the first column of the sorted rotations. The L and F columns are separated as shown in Figure 2.9.
We need only the first (F) and last (L) columns to reverse the BWT to the original sorted
suffixes of the string. The BWT can be recovered using the LF mapping.
Since F must begin with “$”, we know that this character is the last one in the original
string. The “$” is preceded by the first character in L, which is “A”; so now we could infer
the last two characters “A$” in the original sequence. To infer the character that comes
before “A”, let us find the first “A” in F, which is in the second row, so the character in the
second row of L which is “G” is the character that comes before “A$”; so now we could
infer the last three characters “GA$” of the original string. Since this “G” is the first “G”
in L, it points to the first “G” in F (5th row); thus, the character that comes before “GA$”
is the character in the 5th row of L, which is “G”. So now the inferred last four characters
are “GGA$”. Since this “G” is the third “G” in L, it points to the third “G” in F, which is in
the 7th row, and the character in the 7th row of L, which is “T”, is the character that comes
before “GGA$” so the inferred last five characters are “TGGA$”. The rank of this “T” is
the first “T” in L which points to the first “T” in F in the 9th row, so the character which
comes before “TGGA$” is “C” in the 9th row of L. Thus, the inferred last 6 characters of
the original string are “CTGGA$”. We can continue using this L-to-F mapping until we
recover the entire sequence “CTTGGCTGGA$” as indicated by the arrows in Figure 2.10.
This reversion process is called backward search mechanism since it begins from the very
end of the string and moves backward. A computer algorithm usually performs this task.
We can recover the original string using the same LF mapping property but in a dif-
ferent way. Given the LF matrix, the string S is reconstructed starting from the end of the
string, $, which is also the first character of F column. The character before $ will be the
first character of L column. In general, the character which comes before can be inferred
with the pairs L and F relationships as shown in Figure 2.10. These relationships are also
illustrated in Figure 2.11 which shows that, first, we recover the first two columns of the
BWM by creating a two-column matrix from columns L and F and sorting it as shown
in Figure 2.11a. The third column is recovered by creating a three-column matrix from
column L and the first two recovered columns and then we sort it (Figure 2.11b). The same
is repeated to recover the fourth column of the sorted rotations as shown in Figure 2.11c.
We will continue like this until we recover all columns and the BWM will be as shown
in Figure 2.11d. The original string is the one that ends with the character "$" in the BWM, as
indicated in the shaded row in Figure 2.11.
The steps shown in Figures 2.10 and 2.11 are to show how a BWT is reversible. However,
there are several algorithms implemented by software for reversing a BWT using LF
mapping. The BWT is used as a data structure for indexing a genome (compressed or
uncompressed) by transforming the entire genome into a BWT. Once a genome has been
transformed, there are several searching algorithms and approaches for finding the posi-
tion of a substring (read) on the reference genome.
2.2.5 FM-Index
FM-index (Full-text Minute-space index) performs a backward search mechanism on the
top of BWT to find exact pattern matches on a string with a linear computational time
complexity O(N), regardless of the length of the string. Moreover, it could achieve high
compression ratios that allow the indexing of the whole human genome into less than 2.0G
of memory storage [2]. An FM-index search is performed by using the LF mapping matrix.
Searching for a read (a substring) with FM-index uses the backward matching in which the
LF mapping is used repeatedly until matches for the substring are found. FM-index intro-
duces the rank table and lookup table. A rank table is generated from the BWT (column L)
using the character’s rank, which is defined as the number of times a character occurs pre-
viously in the prefix L[1…k] (i.e., lists the occurrence and order of each unique character).
The rank table acts as a function B(c, k), which is explained as follows: if a character c and
a prefix of length k are given, then we can infer the number of times that character occurs
in the prefix. A lookup table lists the index of the first occurrence of each character c from
the first column (column F) of the sorted matrix. The first occurrence of the character c in
the lookup table can be described as C[c].
Figure 2.12 shows Burrows–Wheeler matrix (a), the rank table (b), and the lookup table
(c). Using the rank table and the lookup table, the LF mapping can be described as LF(i) = C[c] + B(c, i), where c is the character on row i of column L.
The original string can be recovered using the above rule and the backward mechanism as follows:
The last character in the string is “$” on the first row in column F, preceded by “A” on
the first row in column L. So, the last two characters are "A$". Now we can use the above
rule to find the character that comes before “A$”. Since “A” is on row 1 in column L, we can
locate that character in column F as LF(1) as follows:
LF(1) = C[A] + B(A, 1) = 1 + 1 = 2: The character is on row 2 in column F. So, "A" is preceded by "G", which is on row 2 in column L. The last three characters are "GA$".
LF(2) = C[G] + B(G, 2) = 4 + 1 = 5: The character is on row 5 in column F. So, "G" is preceded by "G", which is on row 5 in column L. The last four characters are "GGA$".
LF(5) = C[G] + B(G, 5) = 4 + 3 = 7: The character is on row 7 in column F. So, "G" is preceded by "T", which is on row 7 in column L. The last five characters are "TGGA$".
LF(7) = C[T] + B(T, 7) = 8 + 1 = 9: The character is on row 9 in column F. So, "T" is preceded by "C", which is on row 9 in column L. The last six characters are "CTGGA$".
We can continue using that rule with the rank table and lookup table until we recover
the entire string "CTTGGCTGGA$".
The FM-index can also be used as a pattern searching technique that operates on the
BWT to map read sequences to locations on the BW-transformed reference genome.
Searching for a pattern is a backward process like recovering the original string; a
character is processed at a time, beginning with the last character of the pattern (read
sequence) [11].
FIGURE 2.12 (a) BWT, (b) rank table, and (c) lookup table.
Reads may not match the reference sequence exactly because of base call errors and genetic variation in the individuals studied. To overcome this challenge, aligners must adopt a strategy to perform gapped and inexact alignment rather than performing only exact matching.
Read mapping is required by most sequencing applications, including reference-based
genome assembly, variant discovery, gene expression, epigenetics, and metagenomics. As
shown in Figure 2.13, the workflow of read mapping/alignment includes downloading the
right FASTA file of the reference genome of the species studied, indexing the sequence of
the reference genome with the "samtools faidx" command, indexing the reference genome
with an aligner, and finally performing mapping of the cleaned reads with the aligner
itself. Remember that, before mapping, the step of quality control must be performed as
discussed in Chapter 1. So, when we move to the step of read mapping, we should have
already cleaned up the reads and fixed most of the failed metrics shown by the QC reports.
Even if we couldn’t fix all errors, we should also be aware of the effect of that in the final
results.
We have also discussed above how to download a reference genome of an organism and
how to index it. To avoid repetition, we will delve into read mapping and generation of
SAM/BAM files without covering the topics that we have already discussed.
Read mapping is the process of finding locations on a reference genome where reads,
contained in FASTQ files, map. The read mapping information is then stored in a
SAM/BAM file format, which is a special file format for storing sequences alignment
information. Most aligners are capable of performing both exact matching and inexact
matching, which are essential to find the locations of reads that may have some base call
errors or varied genetically from the reference genome. The different aligners implement
different algorithms to perform both kinds of lookups in the indexed reference genome
stored in data structures like suffix tree, suffix array, hashing table, and BWT. While the
exact lookup is straightforward, the inexact matching uses sequence similarity to find the
most likely locations where a read is originated. Although there are different ways to mea-
sure sequence similarity, most aligners use Hamming distance [12] or Levenshtein distance [13] to score the similarity between a read and portions of the reference genome
based on a threshold. Some aligners use the seed-and-extend strategy to extend a seed (an
exact matched substring) across multiple mismatched bases to allow mapping reads with
base call errors or variations. Most aligners employ the seed-and-extend strategy with local sequence alignment using the Smith–Waterman (SW) algorithm. Seeds are created by making overlapping k-mers
(substrings or words of length k) from the reference genome sequence. Some aligners like
Novoalign [14] and SOAP [15] index k-mers with the trie or hash table data structures for
a quick search.
Each line in the header section of a SAM file begins with the "@" symbol followed by one of the two-letter header record type codes. A record type code may have two-letter
subtype codes. Table 2.1 lists and describes the two-letter codes of the SAM header section,
and Figure 2.14 shows an example header section. Notice that the SAM file begins with
“@HD VN:1.0 SO:coordinate”, which indicates that the specification of SAM version 1.0
was used and the alignments in the file are sorted by the coordinate. We can also notice
that there are several @SQ header lines, each line is for the reference sequence used for the
alignment. The @SQ header includes the sequence name (SN), which is the chromosome
number, and sequence length (LN). The last two lines in the header section should include
@PG, which describes the program used for the alignment, and @CO, which contains text comments such as the command line used.
The alignment section begins after the header section. Each alignment line has 11
mandatory fields to store the essential alignment information. The alignment section may
have a variable number of optional fields, which are used to provide additional and aligner
specific information.
Figure 2.15 shows a partial alignment section of a SAM file. The columns of the align-
ment section are split because they do not fit the page. Table 2.2 lists and describes 11 man-
datory fields of the SAM alignment section.
These 11 mandatory fields are always present in a SAM file. If the information of
any of these mandatory fields is not available, the value of that field will be replaced
with “0” if its data type is integer or “*” if the data type is string. Most field names are
self-explanatory.
TABLE 2.1 The Two-Letter Codes of the Header Section and Their Description
Code Header Code Description
@HD This header codes for metadata, and if it is present, it must be the first line of the SAM file. This
header line may include subtypes: VN for format version, SO for sorting order, GO for
grouping alignment, and SS for sub-sorting order of alignments
@SQ This is for the reference sequence used for aligning the reads. A SAM file may include multiple
@SQ lines for the reference sequences used. The order of the sequences defines the order of
alignment sorting. The two most common sub-type codes used in this header line include SN
for reference sequence name and LN for reference sequence length
@RG This header line is used to identify read group and it is used by some downstream analysis
programs (e.g., GATK) for grouping files based on the study design. Multiple lines can exist in
a SAM file. This line may include the ID for the unique read group identifier, BC for the
barcode sequence identifying the sample, CN for the name of sequencing facility, DS for
description to be used for the read group, DT for the date of sequencing, LB for the
sequencing library, PG for the programs used for processing the read group, PL for the
platform of the sequencing technology used to generate the reads, PM for the platform model,
PU for the platform unit, which is a unique identifier (e.g., flow cell/slide barcode), and SM
for the sample identifier, which is the pool name where a pool is being sequenced
@PG This is the header line for describing the program used to align the reads. It may include ID for
the program unique record identifier, PN for the program name, CL for the command line
used to run the program, PP for the previous @PG-ID, DS for description, and VN for the
program version
@CO This is the header line for a text comment. Multiple @CO lines are allowed
TABLE 2.2 Eleven Mandatory Fields of the Alignment Section of the SAM File
Column Field Description
1 QNAME The query sequence name (string)
2 FLAG Bitwise flag (integer)
3 RNAME Reference sequence name (string)
4 POS Mapping position from the leftmost first base (integer)
5 MAPQ Mapping quality (integer)
6 CIGAR CIGAR string (string)
7 RNEXT Reference name of the mate or next read (string)
8 PNEXT Position of the mate or next read (integer)
9 TLEN Observed template length (integer)
10 SEQ The read sequence (string)
11 QUAL ASCII-encoded Phred-scaled base quality scores (Phred+33) (string)
The QNAME field is to show the read sequence name, which is obtained from the FASTQ
file. FLAG is a bitwise integer that describes the alignments (e.g., paired, unaligned, and
duplicate). The description is stored as codes as shown in Table 2.3.
The FLAG field in the SAM file may have one of the decimal values listed in Table 2.3
or the sum of any of those decimal values. For instance, Figure 2.16 shows the FLAG for
the read "SRR062634.6862698" is 99, which means that this FLAG combines four conditions: 1+2+32+64=99. This means that the alignment of the read at that position of the reference genome is described as follows: the read is paired (1), the aligner mapped the two mates properly (2), the sequence of the mate (the next read) is on the reverse strand (32), and this is the first read in the pair (64). Instead of
doing mental math to figure out such a FLAG number, we can use the "samtools flags" com-
mand as follows (Figure 2.16):
samtools flags 99
samtools flags
The RNAME field shows the reference sequence name of the alignment such as a chromo-
some name (e.g., 1). This field will be filled with "*" for an unmapped read.
FIGURE 2.16 Using “samtools flags” to convert numeric FLAG representations into textual.
The POS field specifies the leftmost mapping position in the reference genome of the
first base of the aligned read. For the sequence “SRR062634.6862698”, the alignment posi-
tion is 10001 as shown in Figure 2.15. The POS value of the unmapped read is “0”.
The MAPQ field specifies the mapping quality in Phred scoring unit. If the quality score
is not available, the value will be set to 255.
The CIGAR field contains the CIGAR string, which is a sequence of the base length and
the associated operations. It is used to indicate where, in the aligned read, an operation
like match/mismatch, deletion, or insertion took place. Table 2.4 lists the CIGAR opera-
tions that can be associated with a read alignment. For instance, in the SAM file shown in
Figure 2.15, the sequence “SRR062634.6862698” has the CIGAR string “30M70S”, which
means that the first 30 bases of that sequence match (30M) bases in the reference sequence
and the remaining 70 bases are soft-clipped (70S), meaning that they are not aligned to the reference but are retained in the read sequence (SEQ field).
The RNEXT field contains the reference sequence name of the primary alignment of the
NEXT read in the template. If it is not present, the field will be set to “=” or “*”. The PNEXT
field specifies the position of the primary alignment of the NEXT read in the template. This
field will be set to “0” when the information is unavailable.
The TLEN field specifies the signed observed template length. This field will be positive
for the leftmost read, negative for the rightmost, and undefined for any middle read.
The SEQ field contains the read sequence. This field will be set to “*” if the sequence
was not stored. Otherwise, the length of the sequence must be equal to the sum of the lengths of the operations in the CIGAR string. An "=" symbol indicates that the base of the read is identical to the reference base. Mapped reads are always reported on the forward genomic strand (plus strand); therefore, if a read maps to the reverse strand (minus strand) of the reference sequence, the SEQ field will contain the reverse complement of the original read sequence.
The QUAL field contains the Phred quality scores in ASCII character (Phred+33). The
string in this field is the same as the string in the FASTQ file. This field may be set to “*” if
TABLE 2.3 The FLAG Bitwise Decimal and Hexadecimal Numbers and Their Descriptions
Decimal Hexadecimal Description of Read
1 0x1 The read is paired
2 0x2 The aligner mapped the two pairs properly
4 0x4 The read is unmapped
8 0x8 Next segment in the template is unmapped
16 0x10 The sequence in SEQ is a reverse strand (minus strand)
32 0x20 The next sequence (SEQ) is a reverse strand
64 0x40 First read in paired reads
128 0x80 Second read in paired reads
256 0x100 The alignment is secondary
512 0x200 The read fails platform/vendor quality checks
1024 0x400 The read is PCR or optical duplicate (technical sequence)
2048 0x800 The alignment is supplementary
the quality string is not stored. Otherwise, it must be equal to the length of the sequence
in SEQ.
The alignment section of a SAM file may contain a number of optional fields. Each
optional field is defined by a standard tag accompanied with a data type and a value in the
following format:
TAG:TYPE:VALUE
The TAG is a two-character string. There are several predefined standard tags for SAM
optional fields. The complete list is available at “https://samtools.github.io/hts-specs/
SAMtags.pdf”. The user is allowed to add a new tag.
The TYPE is a single character defining the data type of the field. It can be “A” for the
character data type, “B” for general array, “f” for real number, “H” for hexadecimal array,
“i” for integer, and “Z” for string.
VALUE is the value of the field defined by the tag data type.
Notice that the last four columns in the SAM file shown in Figure 2.16 are for optional
fields identified by the four predefined standard tags: “NH”, “HI”, “AS”, and “NM”. The
“NH” tag shows the number of reported alignments (number of hits) that contain the read
sequence in the current record. The “HI” tag shows the index of the query hit. The “AS” tag
shows the alignment score defined by the aligner. The “NM” tag shows the edit distance,
which is defined as the minimal number of single-nucleotide edits (substitutions, inser-
tions, and deletions) needed to transform the read sequence into the aligned segment of
the reference sequence.
For more details about SAM file, read the specification of the Sequence Alignment/Map
file format, which is available at “https://samtools.github.io/hts-specs/”.
To practice read mapping, we can download example paired-end FASTQ files (run SRR769545) from the NCBI SRA database using the "fasterq-dump" utility of the SRA toolkit:
mkdir data
cd data
fasterq-dump --verbose SRR769545
The size of each file is 11G; the two files take around 22G of storage space. To save storage
space, we can compress these files using gzip utility, which will reduce each file to only
2.6G. Most aligners accept the gzipped FASTQ files.
gzip SRR769545_1.fastq
gzip SRR769545_2.fastq
The “.gz” will be added to the name of each file to indicate that the two files were com-
pressed with gzip.
We can also run FastQC and display the QC reports as follows:
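Assuming you are still inside the "data" directory:
fastqc SRR769545_1.fastq.gz SRR769545_2.fastq.gz
firefox SRR769545_1_fastqc.html SRR769545_2_fastqc.html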
The per base quality reports for the two FASTQ files are shown in Figure 2.17. We can
notice that the reads in the two files have a good quality.
FIGURE 2.17 The per base quality reports for the reads in the two FASTQ files.
In Section 2.1, we showed how to download the FASTA file of the reference genome
sequence of an organism and how to index it using “samtools faidx”. So, if you did not do
that, follow the steps in that section to download the human reference and then to index
it. The sequences of reference genomes can also be downloaded from other databases such
as UCSC database. We have also downloaded and compressed example paired-end FASTQ
files for practice. The next step is to show you how to use an aligner (BWA, Bowtie2, and
STAR) for read mapping.
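2.3.2.1 BWA
BWA can be downloaded from its GitHub repository and compiled from the source code. A typical installation, assuming that git and the build tools (gcc and make) are available, is as follows:
git clone https://github.com/lh3/bwa.git
cd bwa
make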
The above will clone the BWA source files into your working directory and then it will
compile it. Once BWA has been installed successfully, you may need to set its path so that
you will be able to use it from any directory. While you are in the “bwa” directory run the
“pwd” command to print the absolute path of BWA, copy it, then change to your home
directory, and open “.bashrc” file using “vim” or any text editor of your choice:
cd $HOME
vim .bashrc
export PATH="your_path/bwa":$PATH
Do not forget to replace “your_path” with the path to the “bwa” directory on your com-
puter. Save the “.bashrc” file, exit, and restart the terminal for the change to take effect.
Type “bwa” on the terminal and press the enter key. If the BWA software was installed and
added to the path correctly, you will see the help screen.
BWA has three alignment algorithms: BWA-MEM “bwa mem”, BWA-SW “bwa bwasw”,
and BWA-backtrack “bwa aln/samse/sampe”. Both “bwa mem” and “bwa bwasw” algo-
rithms are used for mapping short and long sequences produced by any of the sequenc-
ing technologies. The “bwa aln/samse/sampe” also called BWA-backtrack is designed for
Illumina short-sequence reads up to 100 bp. Among the three algorithms, “bwa mem” is
the most accurate and the fastest.
Aligning read sequences to a reference genome with BWA first requires indexing the
reference genome using “bwa index” command. We can use this command to index the
human reference genome which was downloaded and indexed with “samtools faidx” above
as follows:
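bwa index refgenome/GRCh38.p13_ref.fna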
The indexing will take some time depending on the size of the reference genome and the
memory of your computer. When the “bwa index” command finishes indexing, it will dis-
play the information, including the number of iterations, the elapsed time in second, the
indexed FASTA file name, and the real time and CPU time taken for the indexing process.
The indexing of the human genome may take up to six hours on a desktop computer of
32G RAM.
The BWA indexing process creates five bwa index files with extensions “.amb”, “.ann”,
“.bwt”, “.pac”, and “.sa”. The total storage space for the current human reference genome
and its index files is around 9.4G.
The “.amb” file indexes the locations of the ambiguous (unknown) bases in the FASTA
reference file that are flagged as N or another character but not as A, C, G, or T. The ".ann"
file contains annotation information such as sequence IDs and chromosome numbers.
The “.bwt” is a binary file for the Burrows–Wheeler transformed sequence. The “.pac” is a
binary file for the packed reference sequence. The “.sa” is also a binary file containing the
suffix array index. For mapping read sequences to the reference genome, all these five files
must be together in the same directory.
So far, we have two directories: “data” where the FASTQ files were downloaded and
“refgenome” where the FASTA sequence of the reference genome was downloaded and
indexed.
In the next step, we will use any of the alignment algorithms to map the reads to the
human reference genome. The “bwa mem” algorithm can be tried first as recommended
by the BWA.
1-BWA-MEM algorithm
In the following, the “bwa mem” command maps the reads in the FASTQ files to the
human reference genome and saves the output file, which contains the mapped reads, in a new directory called "sam". Before running the commands, make sure that your working directory is the one that contains the "refgenome" and "data" directories (one level above them).
mkdir sam
bwa mem \
-t 4 \
refgenome/GRCh38.p13_ref.fna \
data/SRR769545_1.fastq.gz \
data/SRR769545_2.fastq.gz \
> sam/SRR769545_mem.sam 2> sam/SRR769545_mem.log
The “bwa mem” command has the capability of mapping reads of a length ranging between
70 bp and 1 Mbp. The BWA-MEM algorithm uses seeding alignments with maximal exact
matches (MEMs) and then it extends seeds with the affine-gap SW algorithm. We can run
“bwa mem” on the command line without any option to learn more about “bwa mem”
command. In the above, we use “-t 4” so that the program can use four parallel threads to
perform read mapping. Then, we provided the path of the FASTA file name of the reference
genome and the two FASTQ file names. The output of the aligned reads will be redirected
to a new file “SRR769545_mem.sam” using the Unix/Linux redirection symbol “>”. The
command execution comments are redirected to the log file “SRR769545_mem.log” using
“2>”, which is used in Unix/Linux to redirect stderr (standard error) to a text file. Both the
output files will be saved in the “sam” directory. The log file contains important informa-
tion about the comments and any standard error that occurs during the mapping process.
Always open the log file if the running fails. The log file will tell you if the failure is due to
the wrong syntax, wrong file path, or any other standard error. Even if the program has
finished running normally, it is a good practice to open the log file and look at the statistics
and comments.
cd sam
less SRR769545_mem.log
You can also display the SAM file produced by “bwa mem” with the following:
less -S SRR769545_mem.sam
The size of this SAM file is about 19G. We can convert it to the BAM format using Samtools
to save some storage space. First, you need to change to the “sam” directory and then run
the following:
samtools view \
-uS \
-o SRR769545_mem.bam \
SRR769545_mem.sam
The new BAM file is about 15G. We can then delete the SAM file as follows to save some
storage space:
rm SRR769545_mem.sam
BWA-MEM2 is an optimized BWA-MEM algorithm that has been recently released. This
new version produces alignments identical to BWA-MEM but it is faster and the indexing
occupies less storage space and memory [18]. You can install BWA-MEM2 separately by
following the instructions available at “https://github.com/bwa-mem2/bwa-mem2”.
2-BWA-SW
The BWA-SW [9] algorithm, like BWA-MEM, can also be used for the alignment of
single- and paired-end long reads generated by all platforms. It uses SW local alignment
approach to map reads to a reference genome. BWA-SW has a better sensitivity when
alignments have frequent gaps. However, this algorithm has been deprecated by its developer since BWA-MEM was restructured for better performance. The following "bwa bwasw"
performs read alignment as above:
bwa bwasw \
-t 4 \
refgenome/GRCh38.p13_ref.fna \
data/SRR769545_1.fastq.gz \
data/SRR769545_2.fastq.gz \
> sam/SRR769545_bwasw.sam 2> sam/SRR769545_bwasw.log
You can convert this SAM file to BAM file as we did above or you can just delete it to save
some space.
3-BWA-backtrack
The BWA-backtrack algorithm is designed for aligning Illumina short reads of a length
up to 100 bp with sequencing error rates below 2%. It involves two steps: (i) using “bwa
aln” to find the coordinates of the positions, where the short reads align, on the refer-
ence genome, and then (ii) generating alignments with “bwa samse” for single-end reads
or “bwa sampe” for paired-end reads. The base call quality usually deteriorates toward
the end of reads generated by Illumina instruments. This algorithm optionally trims low-
quality bases from the 3′-end of the short reads before alignment. Therefore, it is able to
align more reads with high error rate toward the 3′-ends of the reads.
In the following, we will use “bwa aln” to perform the first step of the alignment. Run
“bwa aln” without any option on the command line to learn more about the usage and
options. If the quality of reads at the 3′-end is low, we can use the “-q” option with this
command to specify a quality threshold for read trimming down to 35 bp. Run the follow-
ing commands while you are one step out of the “refgenome” and “data” directories:
bwa aln \
refgenome/GRCh38.p13_ref.fna \
data/SRR769545_1.fastq.gz \
> data/SRR769545_1.sai
bwa aln \
refgenome/GRCh38.p13_ref.fna \
data/SRR769545_2.fastq.gz \
> data/SRR769545_2.sai
Then, we can use “bwa sampe” to generate the SAM file for the alignments.
$ bwa sampe \
refgenome/GRCh38.p13_ref.fna \
data/SRR769545_?.sai \
data/SRR769545_?.fastq.gz \
> sam/SRR769545_aln.sam 2> sam/SRR769545_aln.log
2.3.2.2 Bowtie2
Bowtie2 is an aligner that uses BWT and FM-index as data structures for indexing the
reference genome. It is an ultrafast, memory-efficient short read aligner, and it allows
mapping millions of reads to a reference genome on a typical desktop computer. Bowtie2
is the next generation of the original Bowtie, which requires the reads to have equal length and does not align reads with gaps. Bowtie2 was developed to overcome those limita-
tions. It performs read mapping in four steps: (i) extraction of seeds from the reads and
their reverse strands, (ii) using FM-index for exact ungapped alignment of the seeds, (iii)
sorting the alignments by scores and identifying the alignment position on the refer-
ence genome from the index, and (iv) extending seeds into full alignments using paral-
lel dynamic programming [16]. Bowtie2 can be installed on Linux with the following
commands:
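The following is one common approach, downloading a pre-built binary release (the version number and URL are illustrative and may change; check the Bowtie2 releases page for the current release):
wget https://github.com/BenLangmead/bowtie2/releases/download/v2.4.5/bowtie2-2.4.5-linux-x86_64.zip
unzip bowtie2-2.4.5-linux-x86_64.zip
mv bowtie2-2.4.5-linux-x86_64 bowtie2
The last command renames the unzipped directory to "bowtie2" so that it matches the path used below.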
Then, you need to set Bowtie2 path so that you can run it from any directory by editing
“.bashrc” file from your home directory.
cd $HOME
vim .bashrc
export PATH="your_path/bowtie2":$PATH
Do not forget to change “your_path” with the right path on your computer. Save the file
and exit. You may need to restart the terminal or run “source .bashrc” to make the change
active. Then, you can enter “bowtie2” on the terminal. If Bowtie2 is installed and its path
was set, help screen will be displayed.
Before read mapping, we need to use “bowtie2-build” command to index the FASTA
sequence of the reference genome. Enter “bowtie2-build” on the command line of the ter-
minal to display the help screen that shows the usage and options. The general syntax is as
follows:
bowtie2-build \
--threads 4 \
refgenome/GRCh38.p13_ref.fna \
refgenome/bowtie2
The indexing may take around 25 minutes using four processors on a computer with 32G
of memory. The “bowtie2-build” command generates six index files prefixed with the pre-
fix string provided for the command. Pre-built indexes for some organisms can also be
downloaded from the official Bowtie2 website.
After indexing the reference genome, we can use “bowtie2” command to align the
paired-end reads and to generate SAM file:
bowtie2 -x refgenome/bowtie2 \
-1 data/SRR769545_1.fastq.gz \
-2 data/SRR769545_2.fastq.gz \
-S sam/SRR769545_bowtie2.sam
Instead of using the "-S" option to write a SAM file, the output can be piped to "samtools view -b" to produce a BAM file. To learn more about Bowtie2's options, enter "bowtie2" on the command line of the
Linux terminal.
2.3.2.3 STAR
STAR, which stands for Spliced Transcripts Alignment to a Reference, is a fast read aligner
developed to handle the alignment of massive number of RNA-Seq reads. Its alignment
algorithm uses uncompressed suffix array (SA) data structure to perform sequential
maximum mappable seed search, which is defined by the developer as the longest sub-
string of a read that matches exactly one or more substrings of the reference genome. This
search is achieved by mapping seeds to the reference genome. A read with a splice junction
site will not be mapped continuously. The algorithm will try to align the first unmapped
seed to a donor splice site and then it repeats the search and aligns the unmapped portion to an acceptor splice site. The search is performed in both the forward and reverse directions. This kind
of search will help in the detection of base mismatches and InDels. If a single or multiple
mismatches are found, the matched substrings will act as anchors on the genome to allow
extension. The search is then followed by a seed clustering by proximity for determining
the anchor seeds. Then, the aligned seeds around the anchor seeds within a user-defined
window are stitched together using dynamic programming. STAR is capable of detecting
splices and chimeric transcripts and mapping complete RNA transcripts that are formed
from non-contiguous exons in eukaryotes [8].
The STAR software can be installed by following the installation instructions, which are
available at “https://github.com/alexdobin/STAR”. On Ubuntu, you can install STAR using
the following command:
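sudo apt-get install rna-star
In the Ubuntu/Debian repositories, the package that provides the "STAR" executable is named "rna-star".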
As with most read aligners, the STAR basic workflow includes both index generation and read
alignment. However, for index generation, both a reference genome in the FASTA format
and reference annotation file in GTF format are required. Pre-built indexes for genomes
of some species can be downloaded from the STAR official website. As discussed before,
the reference genomes can be downloaded from databases such as NCBI Assembly, UCSC
genome collection, or any other database. For the aligners discussed before, we down-
loaded the human reference genome from the NCBI Genome database. For STAR, we will
download the human reference genome and its GTF annotation file from the UCSC data-
base. The reason is that UCSC maintains the gene annotation file in GTF format. Use the
following command to create a new directory “ucscref” and then download and decom-
press the human reference genome and GTF annotation file:
mkdir ucscref
wget \
-O “ucscref/hg38.fa.gz” \
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/
hg38.fa.gz
wget \
-O "ucscref/hg38.ncbiRefSeq.gtf.gz" \
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
hg38.ncbiRefSeq.gtf.gz
gzip -d ucscref/hg38.fa.gz
gzip -d ucscref/hg38.ncbiRefSeq.gtf.gz
Then, we will build the index for the reference genome using the “STAR” command.
mkdir indexdir
STAR --runThreadN 4 \
--runMode genomeGenerate \
--genomeDir indexdir \
--genomeFastaFiles ucscref/hg38.fa \
--sjdbGTFfile ucscref/hg38.ncbiRefSeq.gtf \
--sjdbOverhang 100
Above, with the “STAR” command, we used “--runThreadN” to specify the number of
threads used for indexing, “--runMode genomeGenerate” to tell the command that
we wish to generate a genome index, “--genomeDir” to specify the directory where the
index files are to be saved, “--genomeFastaFiles” to specify the file path of the reference
genome FASTA file, “--sjdbGTFfile” to specify the file path of the annotation GTF file, and
“--sjdbOverhang” to specify the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. For this option, we can provide the read length minus one (n−1) if all reads have the same length; otherwise, we can provide the maximum read length minus one.
The indexing process may take a long time and may consume more memory and storage space than the other aligners. Several files will be generated, including binary genome sequence files, suffix array files, a text file of the chromosome names and lengths, splice junction coordinates, and transcript/gene information. Those files are for STAR’s internal use; however, the chromosomes can be renamed in the chromosome file if needed.
The next step is to use the STAR command for aligning the reads. This time we will use “--runMode alignReads” to tell the program to run the read mapping mode, “--outSAMtype BAM Unsorted” to generate an unsorted BAM file, “--readFilesCommand zcat” to tell the program that the FASTQ files are compressed, “--genomeDir” to specify the index directory, “--outFileNamePrefix” to specify the prefix for the output files, and “--readFilesIn” to specify the FASTQ file names. You can also set “--outSAMtype BAM SortedByCoordinate” to generate a BAM file sorted by the alignment coordinates; however, that may exhaust the memory of a computer with 32 GB of RAM.
mkdir STARoutput
STAR --runThreadN 4 \
--runMode alignReads \
--outSAMtype BAM Unsorted \
--readFilesCommand zcat \
--genomeDir indexdir \
--outFileNamePrefix STARoutput/SRR769545 \
--readFilesIn data/SRR769545_1.fastq.gz data/SRR769545_2.fastq.gz
STAR alignment mode produces a BAM file containing the read alignment information and four text files: three log files with the names “*Log.out”, “*Log.progress.out”, and “*Log.final.out”, where “*” is the prefix specified by the “--outFileNamePrefix” option, and a tab-delimited file with the extension “*SJ.out.tab”. The “*Log.out” file contains detailed information about the run. The “*Log.progress.out” file contains statistics for the alignment progress in 1-minute intervals. The “*Log.final.out” file contains important summary statistics for the alignment after the alignment is complete; we will discuss this later in the RNA-Seq chapter. The “*SJ.out.tab” file contains information about the detected splice junctions.
2.4.1 Samtools
Samtools is a collection of command-line utilities for the manipulation of alignments in
the SAM/BAM files. These utilities are grouped into indexing, editing, file operations, sta-
tistics, and viewing tools. Samtools can be installed on Linux by running “sudo apt-get
install samtools” or you can follow the installation instructions available at “http://www.
htslib.org/download/”. After installation, you can display the complete list of the Samtools’
utilities by running “samtools” without any option on the command line.
A Samtools utility is executed using the following format:
samtools <command> [options]
where <command> is any Samtools command and [options] is any of the command’s options.
The most used commands include “cat”, “index”, “markdup”, “rmdup”, “sort”, “mpileup”, “coverage”, “depth”, “flagstat”, and “view”.
To learn more about the usage and options of any of these commands, you can use the
following format:
samtools <command>
For instance, to display the usage and options of “index” command, run “samtools index”.
In the following, we will discuss the most common uses of the Samtools utilities. For instance, to convert a SAM file produced by BWA into a BAM file, we can use “samtools view”:
samtools view \
-@ 4 \
-uS -o SRR769545_mem.bam SRR769545_mem.sam
The “-@” option specifies the number of threads, the “-u” option produces uncompressed BAM output, “-S” indicates that the input is in SAM format (in recent Samtools versions this option is ignored because the input format is detected automatically), and “-o” specifies the output file.
If you have already run the above command and the old SAM file is still there, you can
delete it with “rm SRR769545_mem.sam” command and then run the following to convert
the BAM file back to a SAM file:
samtools view \
-@ 4 -h \
-o SRR769545_mem.sam \
SRR769545_mem.bam
To sort the alignments in the BAM file by coordinate, we can use “samtools sort”:
samtools sort \
-@ 4 \
-T mem.tmp.sort \
-o SRR769545_mem_sorted.bam \
SRR769545_mem.bam
This will create a new BAM file with sorted alignments. You can delete the unsorted BAM
file if you need to save some storage space.
You can use any of the reference sequence names in the RNAME field of the SAM/BAM
file, so you may need to display the content of the file to check how the reference sequences/
chromosomes are named.
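For example, assuming the BAM file has been indexed and the reference uses chromosome names such as “chr1” (the chromosome name here is only an illustration), the alignments on a given chromosome can be displayed as follows:
samtools index SRR769545_mem_sorted.bam
samtools view SRR769545_mem_sorted.bam chr1 | less -S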
A chimeric read is one that aligns to two distinct portions of the genome with little or no overlap. To count the number of chimeric reads, we can list them with “samtools view” and pipe the output to the “wc -l” command.
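For example, a minimal sketch using the supplementary-alignment FLAG value (0x800), which marks the parts of chimeric alignments:
samtools view -f 0x800 SRR769545_mem_sorted.bam | wc -l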
We can also use the option “-c” with “samtools view” to count the number of reads in a
BAM file:
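For example, using the sorted BAM file from above:
samtools view -c SRR769545_mem_sorted.bam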
We can use values in FLAG field of the SAM/BAM file to count the number of reads defined
by a specific FLAG value. For instance, since the unmapped reads will be flagged as “0x4”
in BAM files, we can count all mapped reads by excluding the unmapped from counting
using the “-F” option.
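For example:
samtools view -c -F 0x4 SRR769545_mem_sorted.bam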
To count unmapped reads, use the “-f” option instead of the “-F” option as:
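samtools view -c -f 0x4 SRR769545_mem_sorted.bam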
We can also use the “samtools view” command together with some Unix/Linux com-
mands and pipe symbol “|” to perform more complex count. For instance, we can count
the number of insertions (I) or deletion (D) from the CIGAR strings in a BAM file. Since
the CIGAR field is the sixth column, first, we will use “samtools view -F 0x4 SRR769545_
mem_sorted.bam” to extract the mapped records. Then, we can transfer that output to “cut
-f 6” using the pipe symbol “|” to separate the sixth column. The output is then transferred
to “grep -P” to select only the strings that have either the character “D” or “I” using the
class pattern “[ID]” to match any of the two characters. Then, the output is transferred to
the “tr” command to delete any characters other than “I” and “D”. Finally, the output is
transferred to the “wc -c” command to count the remaining characters:
samtools view \
-F 0x4 SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep -P '[ID]' \
| tr -cd '[ID]' \
| wc -c
To count only the insertions or only the deletions, the same pipeline can be restricted to a single CIGAR character:
samtools view \
-F 0x4 SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep -P 'I' \
| tr -cd 'I' \
| wc -c
samtools view \
-F 0x4 SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep -P 'D' \
| tr -cd 'D' \
| wc -c
Refer to Table 2.3 for the different FLAG values and descriptions.
samtools rmdup \
SRR769545_mem_sorted.bam \
SRR769545_mem_dedup.bam
The above command removes the duplicate reads from the sorted BAM file (paired end) and writes them to a new BAM file; the output name “SRR769545_mem_dedup.bam” used here is only an example. If we need paired-end reads to be treated as single end, use the “-S” option.
For other Samtools commands, you can check the Samtools documentation which is avail-
able at “http://www.htslib.org/doc/samtools.html”.
As an example of how the SAM/BAM files feed into downstream analysis, the following commands set up a working directory for a reference-guided genome assembly, download and compress the raw reads of the run “SRR769545”, download the UCSC human reference genome, index it with Samtools, and build a Bowtie2 index:
mkdir ref_guided_ass
cd ref_guided_ass
mkdir data
cd data
fasterq-dump --verbose SRR769545
gzip SRR769545_1.fastq
gzip SRR769545_2.fastq
cd ..
mkdir ref
cd ref
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip -d hg38.fa.gz
samtools faidx hg38.fa
bowtie2-build --threads 4 hg38.fa hg
cd ..
mkdir sam
bowtie2 -x \
ref/hg \
-1 data/SRR769545_1.fastq.gz \
-2 data/SRR769545_2.fastq.gz \
-S sam/SRR769545.sam
This time we have chosen to download the UCSC human reference genome because the chromosome names are given instead of accession numbers.
Once the SAM file has been created, we can convert it into a BAM file, sort it, and index
it using Samtools.
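A possible set of commands for these three steps (the output file names here are only examples):
samtools view -uS -o sam/SRR769545.bam sam/SRR769545.sam
samtools sort -o sam/SRR769545_sorted.bam sam/SRR769545.bam
samtools index sam/SRR769545_sorted.bam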
Then, we can use bcftools to detect variants, such as substitutions, and write that information into a binary variant call format (BCF) file, which is the binary form of the variant call format (VCF). This file includes the differences in genotype between the reference genome and the genome of the sequenced individual. Variant discovery and the VCF file format will be discussed in detail in Chapter 4; however, we use them briefly here to show how the reference-guided genome assembly is performed. The bcftools program can be installed on Linux as follows:
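For example, from the Ubuntu repositories (the package name may vary on other distributions):
sudo apt-get install bcftools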
Now, we can use the “bcftools mpileup” command to collapse the pileup of the reads aligned to the reference genome in the BAM file and then write only the variant information into a variant call file. In the following, the variant information is written into a BCF file, and then that binary file is converted into a VCF file:
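A minimal sketch of these two steps, assuming the reference and sorted BAM file paths used above (adjust the paths and file names to your own layout):
bcftools mpileup -f ref/hg38.fa sam/SRR769545_sorted.bam \
| bcftools call -mv -O b -o SRR769545.bcf
bcftools view SRR769545.bcf -O v -o SRR769545.vcf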
The VCF file can be compressed using the compression program “bgzip”, and the compressed file can then be indexed using the tabix program, which is a tool for indexing large bioinformatics text files. The tabix program can be installed on Linux using:
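sudo apt-get install tabix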
The following commands compress the VCF file and index it:
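Assuming the VCF file produced above is named “SRR769545.vcf”:
bgzip SRR769545.vcf
tabix -p vcf SRR769545.vcf.gz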
The final step in the reference-guided genome assembly is to create a consensus sequence by piping the sequence of the reference genome, which was used to create the original BAM file, to the “bcftools consensus” command, which applies the indexed variant calls to create a new genome sequence for the individual studied.
cat ../ref/hg38.fa \
| bcftools consensus SRR769545.vcf.gz \
> SRR769545_genome.fasta
2.6 SUMMARY
Except for the sequencing applications that use de novo genome assembly, read mapping
to a reference genome is the most fundamental step in the workflow of the sequencing
data analysis. In the NGS or TGS, the DNA molecules are fragmented into pieces in the
library preparation step and then the DNA libraries are sequenced to produce the raw data
in the form of millions of reads of specific lengths. The lengths of the reads produced by
a high-throughput sequencing instrument vary with the technology used and fall into short reads (50–400 bp) or long reads (>400 bp). The quality control step, in which the raw reads are assessed and preprocessed, is essential to reduce base-calling errors and the biases that may arise from technical sequences such as adapters. Indeed, read mapping is the most com-
putationally expensive step in the sequencing data analysis workflow. That is because the
alignment program attempts to determine the points of origin for millions or billions of
reads in a reference genome. The alignment requires even more effort for RNA-Seq read mapping because the aligner should also be able to detect the splice junctions. The process
of alignment is usually complicated by the possible existence of mismatches, which may
be due to base call errors or due to genetic variations in the individual genome. In general,
aligners are required to use a strategy that enables them to perform both an exact search
and an inexact search to allow locating positions of reads with mismatches. Almost all
read aligners perform alignment in two major steps: indexing of the sequence of the refer-
ence genome and finding the most likely locations of the reads in the reference genome.
The FASTA sequence of the reference of an organism can be downloaded from genome
databases such as NCBI Genome and UCSC database. The FASTA genome sequence is
indexed first by the “samtools faidx” command to allow fast processing by the aligners.
The commonly used data structures for genome indexing include BWT, FM-index, suf-
fix arrays, and hash table for their memory efficiency and capability to store a genome
sequence. There are a variety of aligners that use different indexing and lookup algorithms.
We discussed only BWA, Bowtie2, and STAR. However, those are only examples. Before
using an aligner, you may need to know its memory efficiency and whether it is capable of handling short reads, long reads, or both. If you have RNA-Seq reads, you may also need to know whether that aligner is capable of detecting splice junctions. Both BWA and Bowtie2 are
general purpose aligners that can be used for all kinds of reads and they can operate well
on a desktop computer with 32 GB of RAM or more. STAR is better suited for RNA-Seq reads; it can also run on a desktop computer, but it requires much more memory for both indexing and mapping.
Almost all aligners produce SAM/BAM files, which store read mapping information.
A SAM/BAM file consists of a header section and an alignment section. The alignment
section includes 11 mandatory columns; each row contains the mapping information of a read. The alignment information includes the read name, FLAG, reference sequence name (e.g., chromosome name or accession), position in the reference sequence (coordinate), mapping quality, CIGAR string, reference name of the mate, position of the mate, template length, the segment of the read sequence, and Phred base quality. The FLAG field stores
standard codes that describe the alignment (e.g., unmapped reads, duplicate reads, and
chimeric alignments). The CIGAR string describes the operations that took place on the
reads such as matches, mismatches, insertions, and deletions.
SAM/BAM files can be manipulated by some programs like Samtools and PICARD.
The file manipulation includes format conversion, indexing, sorting, displaying, statistics,
viewing, and filtering.
The SAM/BAM files are used in the downstream data analysis such as reference-guided
genome assembly, variant discovery, gene expression (RNA-Seq data analysis), epigenetics
(ChIP-Seq data analysis), and metagenomics as we will discuss in coming chapters.
REFERENCES
1. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453.
2. Chacón A, Moure JC, Espinosa A, Hernández P: n-step FM-index for faster pattern matching.
Proc Comput Sci 2013, 18:70–79.
3. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981,
147(1):195–197.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol
Biol 1990, 215(3):403–410.
5. Fredkin E: Trie memory. Commun ACM 1960, 3:490–499.
6. Ukkonen E: On-line construction of suffix trees. Algorithmica 1995, 14(3):249–260.
7. Shrestha AMS, Frith MC, Horton P: A bioinformatician’s guide to the forefront of suffix array
construction algorithms. Brief Bioinfor 2014, 15(2):138–154.
8. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras
TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15–21.
9. Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform.
Bioinformatics 2009, 25(14):1754–1760.
10. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25.
11. Fernandez E, Najjar W, Lonardi S: String Matching in Hardware Using the FM-Index. In:
2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing
Machines: 1–3 May 2011 2011. 218–225.
12. Sankoff D, Kruskal J, Nerbonne J: Time Warps, String Edits, and Macromolecules: The Theory
and Practice of Sequence Comparison: Cambridge University Press; 2000.
13. Levin LA: Problems of Information Transmission. In: 1973.
14. Mu JC, Jiang H, Kiani A, Mohiyuddin M, Bani Asadi N, Wong WH: Fast and accurate read
alignment for resequencing. Bioinformatics 2012, 28(18):2366–2373.
15. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program.
Bioinformatics 2008, 24(5):713–714.
16. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R:
The sequence alignment/map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079.
17. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nature Methods 2012,
9(4):357–359.
18. Vasimuddin M, Misra S, Li H, Aluru S: Efficient Architecture-Aware Acceleration of BWA-
MEM for Multicore Systems. In: 2019 IEEE International Parallel and Distributed Processing
Symposium (IPDPS): 20–24 May 2019 2019. 314–324.
19. Otto TD, Dillon GP, Degrave WS, Berriman M: RATT: Rapid Annotation Transfer Tool.
Nucleic Acids Res 2011, 39(9):e57.
Chapter 3
De Novo Genome Assembly
DOI: 10.1201/9781003355205-3
The coverage describes how often, on average, a reference sequence is covered by bases
from the reads. The sequencing depth is defined as the total number of reads that align to
or cover a specific locus (Figure 3.1).
The de novo assembly strategy has been greatly enhanced by the single-molecule real-
time (SMRT) sequencing (Pacific Biosciences (PacBio)) and nanopore sequencing (Oxford
Nanopore Technologies (ONT)), which produce long reads. In general, the de novo genome
assembly depends on the read length, sequencing depth, and the assembling algorithm.
Based on the sequencing technology used, reads can be classified into short reads, which
are produced by the most sequencing technologies including Illumina, and long reads pro-
duced by PacBio and ONT. In general, either sequence alignment or graphs approach is
used for the de novo assembly. The algorithms used by the de novo genome assemblers can
be classified into (i) greedy approach, (ii) overlap-layout-consensus with Hamiltonian path,
or (iii) de Bruijn graph with Eulerian path. The last two approaches share the idea that a
genome assembly can be represented as a path in graphs.
The overlap-layout-consensus algorithm performs pairwise alignment and then represents the reads and the overlaps between the reads as a graph, where the reads are the vertices (nodes) and their overlaps (a common prefix or suffix) are the edges. Each node r_i corresponds to a read, and any two reads r_1 and r_2 are connected by an edge e(r_1, r_2) when the suffix of the first read matches the prefix of the second read. The algorithm then attempts to find a Hamiltonian path of the graph that includes each vertex (read) exactly once. Contigs are then created from the consensus sequences of the overlapping suffixes and prefixes. The
downside of this method is that it requires a large sequencing coverage and the complexity
of the Hamiltonian path increases with the increase of reads. An example for an assembler
using this approach is Celera Assembler [4]. To show you how the approach of the overlap-
consensus graphs works, assume that we have the reads:
CAGAACCTAGC
ATAGCAGAACG
GCTAAGCAGTG
AGTGTCACGAC
TAGCAACTATG
AACGTAGCAGA
Figures 3.3 and 3.4 show the overlap graphs in which the reads are the nodes and the over-
lapped prefixes and suffixes are the edges. The Hamiltonian path includes each node
exactly once to form the consensus sequence.
The de Bruijn graph approach instead decomposes the reads into overlapping substrings of length k (k-mers); the k-mers form the nodes of the graph, and any two k-mers that overlap by k-1 bases are connected by an edge. The contiguous reads are merged by finding an Eulerian path, which includes every edge exactly once. Most modern assemblers like ABySS and SPAdes use the de Bruijn graph method because it is less complex than the overlap graphs. The qual-
ity of the de novo assembly is greatly affected by the value of the parameter k (number of
bases). For instance, assume that we have the read TCTACTAAGCTAACATGCAATGCA.
When we implement the de Bruijn algorithm in any programming language, we can obtain the graph showing the nodes (k-mers) and the edges, as shown in Figure 3.5.
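As a minimal illustration (not the actual assembler implementation), the following bash snippet lists the k-mers that would form the nodes of such a graph for this read, with an arbitrary choice of k=7; consecutive k-mers overlapping by k-1 bases would then be joined by edges:
# list all k-mers of the example read
read=TCTACTAAGCTAACATGCAATGCA
k=7
for ((i=0; i<=${#read}-k; i++)); do
    echo "${read:i:k}"
done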
For both overlap and de Bruijn graphs, once we have constructed the graphs, the next
step is to find the relevant paths that are likely to represent the assembly. There should be a
single path for the assembly if there is no error or sequence repeats, which is rare in most
cases.
3.2.1 ABySS
ABySS (Assembly By Short Sequences) [8] is a parallel de novo genome assembler devel-
oped to assemble very large data of short reads produced by NGS technologies. It per-
forms assembly in two stages. First, it generates all possible k-mers from the reads, removes
potential errors, and builds contigs using de Bruijn graphs. Second, it uses mate-pair infor-
mation to extend contigs benefiting from contig overlaps and merges the unambiguously
connected graph nodes. ABySS can be used in two modes: bloom filter mode, which uses
hashing, and MPI mode, which uses message passing interface (MPI) to parallelize the de
novo assembly. It is recommended to use the Bloom filter mode over the legacy MPI mode because it reduces memory usage by about tenfold. ABySS can be installed following the instructions
available at “https://github.com/bcgsc/abyss”. On Ubuntu, we can install it using “sudo
apt-get install abyss”. Once ABySS has been installed, the “abyss-pe” command can be
used for the de novo assembly. For practice, we will use paired-end short reads produced by Illumina MiSeq for whole genome sequencing of Escherichia coli (str. K-12). The files (forward and reverse FASTQ files) are available at the NCBI SRA database. First, using the Linux terminal, create a directory for the exercise and change into it. Then, you can download the raw data files into a “fastq” subdirectory using the SRA toolkit as follows:
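A possible set of commands for these steps; the working directory name “denovo” matches the directory referred to later in this chapter, and the run accession is ERR1007381:
mkdir denovo
cd denovo
mkdir fastq
cd fastq
fasterq-dump --progress ERR1007381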
To save some storage space, you can compress the two FASTQ files with GZIP utility as
follows:
gzip ERR1007381_1.fastq
gzip ERR1007381_2.fastq
The compression will reduce the storage of the two files from 11 G to 3 G.
We will use “abyss-pe” command to perform the de novo genome assembly. Change to
the main exercise directory just a single step out of “fastq” directory by using “cd ..”. Then,
run the following command to construct contigs:
abyss-pe \
name=ecoli \
j=4 \
k=25 \
c=360 \
e=2 \
s=200 \
v=-v \
in='fastq/ERR1007381_1.fastq.gz fastq/ERR1007381_2.fastq.gz' \
contigs \
2>&1 | tee abyss.log
abyss-pe \
name=ecoli \
j=4 \
k=25 \
c=360 \
e=2 \
s=200 \
v=-v \
in='fastq/ERR1007381_1.fastq.gz fastq/ERR1007381_2.fastq.gz' \
scaffolds stats
The “name=ecoli” option specifies the prefix string as “ecoli”, “j=4” specifies the number of processors to be used, “k=25” specifies the length of the k-mer (substring), “c=360” specifies the minimum mean k-mer coverage of a contig (a high-confidence contig), “e=2” specifies the minimum erosion k-mer coverage, “s=200” specifies the minimum contig size (bp) required for building scaffolds, “v=-v” enables verbose output, and “in=” specifies the FASTQ file names as inputs.
When the execution of the above commands on the Linux terminal is complete, the
assembly statistics will be displayed. Assembling a genome with “abyss-pe” command is
performed in three stages: assembling contigs without paired-end information, aligning
the paired-end reads to an initial assembly, and finally merging contigs joined by paired-
end information. Multiple files will be generated for those three stages. The descriptions of
these files are as follows:
• A file with the “*.dot” extension is a graph file in Graphviz format in which the graph (nodes and edges) is defined in the DOT language. We can visualize this file by drawing the graph with the Graphviz program, which can be installed on Linux using “sudo apt install graphviz”, and saving the drawing in PNG format (see the example command after this list).
• A file with “*.hist” extension is a histogram file made of two tab-separated columns:
the fragment size and count. It shows the distribution of the fragment sizes.
• A file with “*.path” extension is an ABySS graph path file, which describes how
sequences should be joined to form new sequences.
• A file with “*.sam.gz” extension is a SAM file that is used by ABySS to describe align-
ments of reads to assemble sequences at different stages of the assembly.
• A file with “*.fa” extension is a FASTA file format for contig and scaffold sequences.
The definition line of the FASTA file consists of three parts: <SEQ_ID> <SEQ_LEN>
<KMERS>, where SEQ_ID is a unique identifier for the sequence assigned by ABySS,
SEQ_LEN is the length of the sequence in bases, and KMERS is the number of
KMERS that mapped to the sequence in the assembly.
• A file with the “*.fai” extension is an index file for the corresponding FASTA sequences. It includes five tab-separated columns (name of the sequence,
length of the sequence, offset of the first base in the file, number of bases in each line,
and number of bytes in each line).
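As an example of drawing one of the “*.dot” graph files mentioned above with Graphviz (the file name used here is only an illustration; the actual names depend on the “name=” prefix and the assembly stage):
dot -Tpng ecoli-contigs.dot -o ecoli-contigs.png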
The contig/scaffold N50 metric is the most widely used metric for describing the quality
of a genome assembly. A contig/scaffold N50 is calculated by first ordering every contig/
scaffold by length from the longest to the shortest. Next, the lengths of contigs are summed
starting from the longest contig until the sum equals one-half of the total length of all con-
tigs in the assembly. The contig/scaffold N50 of an assembly is the length (bp) of the short-
est contig/scaffold from the sequences that form 50% of the assembly. To compare between
assemblies, the longer the N50 and the smaller the L50, the better the assembly.
In Figure 3.6, the scaffolds file (ecoli-scaffolds.fa) contains 836 sequences, of which
107 sequences are more than 500 bp. The shortest sequence has 584 bp and the longest is
267,586 bp. The N50 is the sequence of length 112,320 bp and L50 (the number of scaffolds
that accounts for more than 50% of the genome assembly) is 15.
Figure 3.7 shows a diagram explaining the major metrics of the genome assembly
(N25=55, N50=70, N75=75, L25=4, L50=6, and L75=7) and how they can be computed. In
the figure, there are eight contigs ranked from the smallest to the largest. The total number
of bases is 445 Mb (100%) and the half number is 222.5 Mb (50%).
You can display both contigs and scaffolds file on a Linux terminal using the “less”
Linux command as:
less ecoli-contigs.fa
less ecoli-scaffolds.fa
As an exercise, use the “awk” command to print the length of the longest scaffold in the scaffolds file.
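One possible solution, which relies on the sequence length reported in the second field of each ABySS definition line as described above:
awk '/^>/ { if ($2 + 0 > max) max = $2 + 0 } END { print max }' ecoli-scaffolds.fa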
3.2.2 SPAdes
SPAdes [9] is a de novo genome assembler developed primarily for assembling small
genomes of bacteria. Later, modules were added for assembling small genomes of other
organisms including fungi and viruses. It is not recommended for assembling large mam-
malian genomes. The current SPAdes version works with both Illumina and Ion Torrent
reads, and it can be used for genome hybrid assembly for PacBio, Oxford Nanopore, and
Sanger reads. This assembler can process several paired-end and mate-pair files at the same time. The program also provides separate modules for metagenomic data, plasmid
assembly from the whole genome sequencing data, plasmid from metagenomic data, tran-
scriptome assembly from RNA-Seq data, biosynthetic gene cluster assembly with paired-
end reads, viral genome assembly from RNA viral data, SARS-CoV-2 assembly, and TruSeq
barcode assembly. The assembling process of SPAdes includes four stages. First, de Bruijn
graphs are built from overlapping k-mers generated from the reads. Second, the k-mers are
adjusted to obtain accurate distance estimates between k-mers using both distance histo-
grams and paths in the assembly graphs. The program then constructs paired de Bruijn
graphs, which is a generalization of the de Bruijn graph that incorporates mate-pair infor-
mation into the graph structure [10]. Finally, contigs are constructed from the graphs.
The SPAdes program is made up of Python modules. The installation instructions are
available at “https://cab.spbu.ru/files/release3.15.4/manual.html”. To install the program
on Linux, use the following steps:
Using the Linux terminal, first download and decompress the source program in a local
directory.
wget https://cab.spbu.ru/files/release3.15.4/SPAdes-3.15.4-Linux.tar.gz
tar -xzf SPAdes-3.15.4-Linux.tar.gz
Notice that the program name or path may change in the future.
Then, add the following line to your “~/.bashrc” file so that the SPAdes executables are on your PATH:
export PATH="/home/your_path/SPAdes-3.15.4-Linux/bin:$PATH"
Replace “your_path” with the correct path to the directory on your computer. After adding
the line, restart the terminal or use “source ~/.bashrc” for the change to take effect. You can
test the installation by running the following command on the Linux terminal command
line:
spades.py --test
The above command will assemble a toy read set that comes with the program. The output will be saved in the “spades_test” directory, which will contain the typical SPAdes output files.
Remember that we have downloaded two FASTQ files for E. coli and we used them with
ABySS above. The two files were saved in “denovo/fastq”; run the following command on
the Linux command line while you are in the working directory “denovo”:
spades.py \
--pe1-1 fastq/ERR1007381_1.fastq.gz \
--pe1-2 fastq/ERR1007381_2.fastq.gz \
--isolate \
-o spades_ecoli_ass
The above code assembles contigs and scaffolds from the paired-end reads in the input
FASTQ files and saves the output files in a new directory “spades_ecoli_ass”. The output
files include intermediate file used for the construction of contigs and scaffolds, log files,
contigs’ file, and scaffolds’ file in FASTA format. The latter represents the assembly.
SPAdes can perform hybrid de novo assembly using any of PacBio continuous long reads
(CLR) or Oxford Nanopore reads as input with Illumina or Ion Torrent reads. You will
need to use “--pacbio” option for the PacBio FASTQ file and “--nanopore” option for the
Oxford Nanopore FASTQ file. In the following example, first, we download four PacBio
SMRT FASTQ files “SRR801646”, “SRR801649”, “SRR801652”, and “SRR801638” for E. coli
K12. Second, we will use these four PacBio files with the Illumina short read files to assem-
ble E. coli genome with SPAdes program.
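One way to download and compress these runs (a sketch; the subdirectory name “pacbio” matches the paths used in the command below, and each run is assumed to yield a single FASTQ file):
mkdir pacbio
for id in SRR801646 SRR801649 SRR801652 SRR801638; do
fasterq-dump --progress --outdir pacbio ${id}
gzip pacbio/${id}.fastq
done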
spades.py \
--pe1-1 fastq/ERR1007381_1.fastq.gz \
--pe1-2 fastq/ERR1007381_2.fastq.gz \
--pacbio pacbio/SRR801646.fastq.gz \
--pacbio pacbio/SRR801649.fastq.gz \
--pacbio pacbio/SRR801652.fastq.gz \
--pacbio pacbio/SRR801638.fastq.gz \
--isolate \
-o hyb_spades_ecoli_ass
Then, you can download the FASTQ files of the SARS-CoV-2 run “ERR8314890” from the NCBI SRA database using the SRA toolkit program “fasterq-dump”.
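For example:
fasterq-dump --progress ERR8314890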
Then, you can run the SPAdes program to assemble the SARS-CoV-2 genome using the “--corona” option:
spades.py \
--pe1-1 ERR8314890_1.fastq \
--pe1-2 ERR8314890_2.fastq \
--corona \
-o sarscov2_genome
The output files including FASTA files of contigs and scaffolds will be saved in the specified
output directory “sarscov2_genome”.
The other SPAdes modes work the same.
The quality of a de novo genome assembly can be assessed with two approaches. The first approach computes statistical metrics, such as the number and lengths of the contigs or scaffolds, from the assembly. An example of programs which use the statistical approach is QUAST. The sec-
ond approach depends on evolutionary relatedness to similar organisms to estimate the
number of genes in the genomes and gene completeness. An example of the programs that
use this approach is BUSCO.
QUAST can be downloaded, extracted, and installed as follows:
wget https://downloads.sourceforge.net/project/quast/quast-5.0.2.tar.gz
tar -xzf quast-5.0.2.tar.gz
cd quast-5.0.2
sudo ./setup.py install_full
This will install QUAST and set the path so that you will be able to run it from any direc-
tory. Notice that the above file path or name may change in a future version. QUAST can
be used to assess genome assemblies (contigs or scaffolds) generated by de novo assemblers.
It performs reference-guided assessment, which requires a reference genome sequence for
the species studied, or non-reference assessment, in which no reference sequence is used.
This is usually the case when the organism is unknown. In the following, we will assess the
de novo E. coli genomes assembled with ABySS and SPAdes above without using a refer-
ence sequence. You can copy the FASTA scaffold files created by these two programs into a
separate directory (e.g., “qc”) with new names using Linux command line as follows:
mkdir qc
cp abyss_ecoli_ass/ecoli-scaffolds.fa qc/abyss_ecoli_ass.fasta
cp spades_ecoli_ass/scaffolds.fasta qc/spades_ecoli_ass.fasta
cp hyb_spades_ecoli_ass/scaffolds.fasta qc/spades_hyb_ecoli_ass.fasta
With the above commands, we created a directory “qc” and we copied the scaffolds’ files
created by ABySS and SPAdes into it.
Next, change into the new directory (“cd qc”) and run the following QUAST command to perform quality assessment for the three genome assemblies:
quast.py \
-o quast_Ecoli_ass \
-t 4 \
abyss_ecoli_ass.fasta \
spades_ecoli_ass.fasta \
spades_hyb_ecoli_ass.fasta
When QUAST finishes, you can open the HTML report in a web browser:
cd quast_Ecoli_ass
firefox report.html
This will display a statistics table, which includes assembly assessment statistics (Figure 3.8),
and three interactive plots: cumulative length, Nx, and GC content (two are shown in
Figure 3.9). Use the mouse to display these three plots and also move the mouse to obtain
immediate information about the scaffold.
The assembly metrics shown in Figure 3.6 are the ones used for assembly assessment.
The important metrics are the total length of the assembly, N50, N75, L50, and L75. Figure
3.8 shows the metrics for three assemblies in three columns. The two assemblies generated
by SPAdes (non-hybrid and hybrid) are better than the assembly generated by ABySS. The
largest N50 value, the lowest L50 value, and longest assembly are indicative measures of
better assembly. The total lengths of the three assemblies are 4,609,549 bp (4.61 Mb), 4,686,651 bp (4.69 Mb), and 4,623,212 bp (4.62 Mb), respectively, which are close to the length of the reference genome of E. coli str. K-12 (4,641,650 bp, or 4.64 Mb). Notice also
that SPAdes assemblies have the largest N50 and N75 and the lowest L50 and L75.
The QUAST report also includes Icarus contig viewer, which is a genome viewer based
on QUAST for the assessment and analysis of genomic draft assemblies. Click “View in
Icarus contig browser” on the top right to display the Icarus QUAST contig browser.
In Figure 3.10, the Icarus visualizer shows contig size viewer, on which contigs are ordered
from the longest to the shortest, highlights N50, N75 (NG50, NG75) and long contigs larger
than a user-specified threshold. To learn more about Icarus viewer and QUAST output
reports, refer to the QUAST manual at “http://cab.cc.spbu.ru/quast/manual.html#sec3.4”.
If you need to assess the assembly using a reference as a guide, you can download the
FASTA file of a reference genome (Genome) with its annotation file (GFF) from a database
(e.g., NCBI genome) for the organism, and use “-r” and “-g” options as follows (you may
need to decompress GFF file) (Figures 3.11 and 3.12):
quast.py \
-o ref_quast_Ecoli_ass \
-t 4 \
-r ecolref/GCF_000005845.2_ASM584v2_genomic.fna.gz \
-g ecolref/GCF_000005845.2_ASM584v2_genomic.gff \
abyss_ecoli_ass.fasta \
spades_ecoli_ass.fasta \
spades_hyb_ecoli_ass.fasta
FIGURE 3.12 Icarus contig browser displaying de novo assemblies aligned to a reference genome.
BUSCO software can be cloned and installed by running the following commands:
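A possible set of commands (a sketch; BUSCO is hosted on GitLab, and the exact installation procedure may differ between versions; installation through conda is also common):
git clone https://gitlab.com/ezlab/busco.git
cd busco
sudo python3 setup.py install
cd ..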
If the installation is successful, you can run the following to display the help:
busco --help
BUSCO databases include ortholog databases for several clades of organisms. Before using
BUSCO, you may need to identify the database to use for the assessment. The database list
can be displayed using the following command:
busco --list-datasets
Now, we can use BUSCO to assess the three E. coli assemblies (one generated with ABySS
and two generated by SPAdes). We can save the output of each assessment in a separate
directory.
busco \
-i abyss_ecoli_ass.fasta \
-o abyss_ecoli_ass.out \
-l bacteria \
-m genome
busco \
-i spades_ecoli_ass.fasta \
-o spades_ecoli_ass.out \
-l bacteria \
-m genome
busco \
-i spades_hyb_ecoli_ass.fasta \
-o spades_hyb_ecoli_ass.out \
-l bacteria \
-m genome
The BUSCO assessment output for each assembly will be saved in a separate directory:
“abyss_ecoli_ass.out”, “spades_ecoli_ass.out”, and “spades_hyb_ecoli_ass.out”. Each of
these directories includes an assessment report as a text file and JSON file, in addition to
subdirectories for the predicted genes and used ortholog database.
Comparing between the three assemblies based on BUSCO assessment metrics (Figures
3.13–3.15), the two assemblies generated by SPAdes are better than the one generated by
ABySS. A total of 4085 genomes and 124 genes were used to derive the expected gene content. The E. coli genome assembled by SPAdes shows 100% completeness (C:100%), no duplicates (D:0.0%), no fragments (F:0.0%), and no missing genes (M:0.0%) out of the 124 genes, whereas the BUSCO assessment report for the assembly generated by ABySS shows C:98.4% [S:98.4%, D:0.0%], F:1.6%, M:0.0%, n:124, which indicates 98.4% completeness (122 genes recovered) and 1.6% fragmented (2 partially recovered genes).
Combining both statistical and evolutionary assessment for the de novo assembly will
provide a good idea about the quality of the de novo assembled genome.
FIGURE 3.13 BUSCO assessment for the E. coli assembly generated by ABySS (Illumina reads).
FIGURE 3.14 BUSCO assessment for the E. coli assembly generated by SPAdes (Illumina reads).
FIGURE 3.15 BUSCO assessment for the E. coli hybrid assembly generated by SPAdes (Illumina and PacBio reads).
3.4 SUMMARY
Genome assembly is the process of putting nucleotide reads generated by sequencers into
the correct order as in the genome of the organism. Despite the development of sequencing
technologies and assembling algorithms, the assembled genomes will remain approximate
for the actual genomes. This is because it is hard to avoid sequencing errors and also because the genomes of many organisms include repetitive sequences. But the assembly accuracy can be increased by deep sequencing, paired-end sequencing, and the use of long reads produced by PacBio and Oxford Nanopore Technologies. The de novo genome assembly has been widely used for different kinds of organisms, especially in metagenomics for the
assembly of bacterial, fungal, and viral genomes.
We can use either greedy alignment approach or graph-based approaches for the de novo
genome assembly. The greedy method works as multiple sequence alignment by perform-
ing pairwise alignment and merging reads to build up contigs. The graph theory approach
can be either overlap-consensus graphs or de Bruijn graphs. In the overlap graphs, reads
are represented as nodes and the overlapping regions of the reads as edges. Contigs are
built by finding the Hamiltonian path which includes each node once. On the other hand,
de Bruijn graphs form k-mers from the reads and the k-mers then are represented as nodes.
Contigs are formed by including edges using Eulerian path. De Bruijn graphs are more
efficient than overlap graphs. The assemblers that use de Bruijn graphs include ABySS for
small and large genomes and SPAdes for bacterial, fungal, and viral small genomes. SPAdes
program has several modules, such as a metagenomic module, a viral assembly module, and a SARS-CoV-2 module. SPAdes can also be used to assemble a genome from hybrid reads such as
Illumina reads and PacBio reads or Oxford Nanopore reads.
After assembling, a genome can be assessed using both statistical method and evolu-
tionary method. The statistical method generates the number, lengths, and distributions of
contigs. The assembly with few but long contigs is an indicator of the good quality. Metrics
used to describe statistical quality are N25, N50, N75, L25, L50, and L75. We can com-
pare the performance of assemblers using these metrics. The evolutionary assessment of an
assembly relies on the genomes of the evolutionarily related species to identify the number
of genes in the assembled genome. The completeness of the genome assembly depends on
the complete identified genes. For statistical quality assessment, we can use QUAST, and
for evolutionary assessment, we can use BUSCO.
REFERENCES
1. Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathemati-
cal analysis. Genomics 1988, 2(3):231–239.
2. Pop M, Kosack D: Using the TIGR assembler in shotgun sequencing projects. Methods Mol
Biol 2004, 255:279–294.
3. de la Bastide M, McCombie WR: Assembling genomic DNA sequences with PHRAP. Curr
Protoc Bioinform 2007, Chapter 11:Unit 11.4.
4. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry
CM, Reinert KH, Remington KA et al: A whole-genome assembly of Drosophila. Science
2000, 287(5461):2196–2204.
5. Pevzner PA, Tang H: Fragment assembly with double-barreled data. Bioinformatics 2001,
17(suppl_1):S225–S233.
6. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly.
Proc Natl Acad Sci U S A 2001, 98(17):9748–9753.
7. Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res
2008, 18(2):324–330.
8. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for
short read sequence data. Genome Res 2009, 19(6):1117–1123.
9. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM,
Nikolenko SI, Pham S, Prjibelski AD et al: SPAdes: a new genome assembly algorithm and its
applications to single-cell sequencing. J Comput Biol 2012, 19(5):455–477.
10. Medvedev P, Pham S, Chaisson M, Tesler G, Pevzner P: Paired de bruijn graphs: a novel
approach for incorporating mate pair information into genome assemblers. J Comput Biol
2011, 18(11):1625–1634.
11. Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: quality assessment tool for genome
assemblies. Bioinformatics 2013, 29(8):1072–1075.
12. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM: BUSCO update: novel and
streamlined workflows along with broader and deeper phylogenetic coverage for scoring of
eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol 2021, 38(10): 4647–4654.
13. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM: BUSCO: assessing
genome assembly and annotation completeness with single-copy orthologs. Bioinformatics
2015, 31(19):3210–3212.
Chapter 4
Variant Discovery
A base substitution that changes a codon into one encoding a different amino acid (a missense change) can either lead to no change in the structure and function of the protein (conservative mutation) or it may lead to a deleterious consequence if the change alters the protein structure and function. The base substitution may also have a nonsense con-
sequence if it results in a stop codon that truncates the translated protein leading to an
incomplete and nonfunctional protein.
A deletion mutation is the removal of a single pair of nucleotides or more from a gene
that may result in a frameshift and a garbled message and nonfunctional product. Deletion
may have deleterious consequence or not depending on the part it alters and its impact
on the protein sequence. The insertion mutation is the insertion of additional base pairs
and it may lead to frameshifts depending on whether or not multiples of three base pairs
are inserted. Mutations may include combinations of insertions and deletions leading to a
variety of outcomes.
In general, a gene variant is a permanent change in the nucleotide sequence of a gene
that can be either germline variants, which occur in eggs and sperms of parents and pass
to offspring, or somatic variants, which are present only in specific cells and are generally
not hereditary.
In terms of sequence change, variants can be classified into single-nucleotide variant
(SNV), insertion–deletion (InDel), or structural variation (SV). The SNV is a base substi-
tution of a single nucleotide for another. It is known as single-nucleotide polymorphism
(SNP) if its allelic frequency in a population is more than 1%. InDel refers to insertion and/
or deletion of nucleotides into genomic DNA and it includes events less than 1000 nucleo-
tides in length. InDels are implicated as the driving mechanism underlying many diseases.
The SV involves change in more than 50 base pairs in a sequence of a gene; the change may
include rearrangement of part of the genome, a deletion, duplication, insertion, inversion,
translocation, or a combination of these. A copy number variant (CNV) is a duplication or deletion that changes
the number of copies of a particular DNA segment within the genome. SVs have been
implicated in a number of health conditions.
In this chapter, we will learn about the major steps in the process of variant identifica-
tion and analysis, including variant representation, variant calling workflow, and variant
annotation. The process by which we identify variants from sequence data (reads) is called
variant calling, which is the central topic of this chapter.
A VCF file consists of a metadata section followed by a data section. The metadata section contains lines beginning with “##” that describe the dataset, such as the sequencing, the variant calling software that created the file, or the reference genome used for determining variants. The first few lines of the metadata section describe the file format, file date,
source program, and the reference used. The metadata section also declares and describes
the fields provided at both the site-level (INFO) and sample-level (FORMAT) in the data
lines of the data section. For example, the following metadata lines describe the ID, data
type, and description of some fields that can be found in the INFO and FORMAT columns
in the data section:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
The data section begins with a single tab-delimited header line that has eight mandatory fields representing the columns of each data line (Table 4.1). The column headers of the data section are #CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO.
If there is genotype data, a FORMAT column is also declared, followed by columns with unique sample names. All of these column names must be separated by tabs as well. Each line in
the data section represents a position in the genome. The data corresponds to the columns
specified in the header and must be separated by tabs and ended with a new line. Below are
the columns and their expected values. In all cases, MISSING values should be represented
by a dot (“.”).
As shown in Figure 4.1, the variants are in chromosome 20 on the reference genome
NCBI36 (hg18). The figure shows five positions whose coordinates are 14370, 17330, 1110696, 1230237, and 1234567. The first two variants are single point substitutions. The third position has two alternate alleles (G and T) that replaced the reference nucleotide (A). At the fourth position, the ALT allele is missing (“.”), which indicates a monomorphic site where no alternate allele was called. In the
fifth position, there are two alternative alleles, the first is a deletion of two nucleotides (T
and C) and the second is an insertion of a single nucleotide T.
The QUAL column holds the quality of the call at each position. The FILTER column designates which filters the position passed or failed; the keywords in this column can be used to filter the variants, as we will discuss later. The second row (position 17330) does not pass the filter because its quality is below the Phred quality score threshold of 10.
The INFO column includes position-level information for that data row and can be
thought as aggregate data that includes all of the sample-level information specified.
The FORMAT column specifies the sample-level fields to expect under each sample.
Each row has the same format fields (GT, GQ, DP, and HQ) except for the last row which
does not have HQ. Each of these fields is described in the metadata section. GT (Genotype)
indicates the alleles, separated by “/” if unphased or “|” if phased; GQ is the Genotype Quality, a single integer; DP is the Read Depth, a single integer; and HQ is the Haplotype Quality, which has two integers separated by a comma.
This VCF file has three samples identified by their names (NA00001, NA00002, and
NA00003) in columns 10 through 12.
Genetic variants discovered by researchers are submitted, usually in VCF files, to data-
bases that archive information of the genetic differences with other related information.
Researchers submit data to these databases, which collect, organize, and publicly docu-
ment the evidence supporting links between genetic variants and diseases or conditions.
The variants are usually submitted with their assertions, which are informed assessments
of the association or lack of association between a disease or condition and a genetic vari-
ant based on the current state of knowledge. The variant databases include dbSNP (for
human variants of less than 50 base pairs), dbVar (for human variants of greater than 50 base pairs), and the European Variation Archive (for variants of all species).
A variant submitted to a database is given a unique identifier that can be used to find that variant and its related information in the database. Such identifiers are preferred because they are unambiguous, unique, and stable, unlike descriptive names, which can be used differently by different
people. For example, the NCBI dbSNP assigns an ID with “rs” prefix to the accepted human
variants with asserted positions mapped to a reference sequence as reference variants
(RefSNP) and also it assigns an ID with “ss” prefix for a variant submitted with flanking
sequence. Figure 4.1 shows two dbSNP IDs for reference variants in the VCF ID columns.
Variants may have identifiers from multiple databases. You will see these different types of
identifiers used throughout the literature and in other databases. Different types of identi-
fiers are used for short variants and structural variants.
A consensus can be obtained from the pileup by collapsing the bases at each position and identifying the most frequent base. In the figure, you can notice how depth is important in determining the consensus base.
“bcftools mpileup” is the command used for the pileup. The “-f” option specifies the reference file used by the aligner to produce the BAM file, and the “-b” option specifies a text file listing the BAM files from which we wish to call variants. Piping the output of “bcftools mpileup” into “bcftools call”, as we do below, performs both the pileup and the variant calling in one step.
In the following, we will discuss an example of variant calling pipeline using “bcftools”.
We will use SARS-CoV-2 raw sequence data from the NCBI SRA database to demonstrate
variant calling.
SARS-CoV-2 is the virus that causes COVID-19 which affected millions of people
around the world and caused thousands of deaths. The virus mutates in a short period of
time, producing new strains. The following are the NCBI SRA run unique identifiers of raw
data generated by whole genome sequencing of SARS-CoV-2:
SRR16989486
SRR16537313
SRR16537315
SRR16537317
In the following, we wish to identify variants (SNVs and InDels) from short reads of those
four samples and report that in a VCF file.
The programs used with this example include SRA toolkits, wget, gzip, samtools, bwa,
and bcftool on Linux.
The workflow for variant discovery includes: (i) acquiring the raw data (FASTQ files), (ii)
quality control if required, (iii) downloading and indexing the reference genome, (iv) using
an aligner to map reads in FASTQ files to a reference genome to generate a BAM file, (v)
sorting the aligned reads in BAM files, (vi) removal of duplicate reads from the BAM files,
and (vii) performing variant calling and generation of a single VCF file for genotyping of all
samples. Any new generated BAM files must be indexed before proceeding to the next step.
In the following, we will discuss each of these steps. First, we need to open the Linux ter-
minal, create a directory called “sarscov2” for this project, and make it the current working
directory.
mkdir sarscov2
cd sarscov2
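The general form of the download command for a single run is shown below (replace “id” with an actual run accession):
fasterq-dump --progress --outdir fastq id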
The “--progress” option is to display the downloading progress, “--outdir fastq” specifies
the directory where FASTQ files are downloaded, and “id” is replaced by any of the above
SRA run IDs.
The above “fasterq-dump” form is suitable for a single run, but what if we have multiple
run IDs as above, or in some cases, we may have tens of IDs to download and running that
command for each ID would be tedious. In such a case, a bash “while” loop comes in handy. First, we need to store the above run IDs in the file “ids.txt”, one run ID per line,
and save the file in the current directory and then run the following bash script, which cre-
ates the subdirectory “fastq” and then uses “while loop” to loop over each run ID in the text
file and use it as an argument for the “fasterq-dump” command as follows:
mkdir fastq
while read id;
do
fasterq-dump --progress --outdir fastq “$id”
done < ids.txt
The above script creates the directory “fastq” and downloads the FASTQ files into it. There
are two FASTQ files for each sample since the reads are paired end (forward and reverse
FASTQ files).
FIGURE 4.3 Using fasterq-dump to download FASTQ files from the NCBI SRA database.
When the FASTQ files have been downloaded successfully, as shown in Figure 4.3, we can use “ls fastq” to display the files in the newly created directory.
The next step is to download the SARS-CoV-2 reference genome from the NCBI database, decompress it, and index it with both Samtools and BWA:
mkdir ref
cd ref
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
f=$(ls *.*)
gzip -d ${f}
f=$(ls *.*)
samtools faidx ${f}
bwa index ${f}
cd ..
When you display the content of the “ref” subdirectory, you may see the following files if
you follow the above steps successfully:
GCF_009858895.2_ASM985889v3_genomic.fna
GCF_009858895.2_ASM985889v3_genomic.fna.amb
GCF_009858895.2_ASM985889v3_genomic.fna.ann
GCF_009858895.2_ASM985889v3_genomic.fna.bwt
GCF_009858895.2_ASM985889v3_genomic.fna.fai
GCF_009858895.2_ASM985889v3_genomic.fna.pac
GCF_009858895.2_ASM985889v3_genomic.fna.sa
The reads can then be mapped to the reference genome with BWA-MEM; the general form of the command for paired-end reads is:
bwa mem -M -t 4 \
-R "@RG\tID:….\tSM:….." \
ReferenceSequence \
file1.fastq file2.fastq > output.sam 2> output.log
The “-M” option is to mark shorter split hits as secondary, “-t” option specifies the num-
ber of threads to speed up the mapping, and “-R” to add a header line. For variant calling,
we may need to add or create a read group (RG) header to hold the sample information.
As input, we will provide the FASTA reference sequence and the forward FASTQ file and
reverse FASTQ file in the case of paired-end reads. The output can be directed to a SAM
file.
The above form is suitable for a single run. However, we may have multiple samples and
mapping the reads of a single-sample files at a time may be tedious. Instead, we can use a
bash script with a loop to align all runs. The following script creates the subdirectory “sam”
and then aligns reads of each sample to the reference genome and outputs a SAM file and a
log file. When you run the script, make sure that the main project directory is the working
directory.
mkdir sam
cd fastq
for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq);
do
bwa mem -M -t 4 \
-R "@RG\tID:${i}\tSM:${i}" \
../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
${i}_1.fastq ${i}_2.fastq > \
../sam/${i}.sam 2> ../sam/${i}.log;
done
cd ..
To make sure that the read mapping process has been completed successfully, you can dis-
play the content of the “sam” directory with “ls sam”. You can also display the content of
any of the SAM files by using “less -S SamFileName”.
Notice that we added an RG header “@RG” with “ID” and “SM” to each SAM file to pro-
vide the sample information that will be used later by the variant caller to create a genotype
column for each sample.
mkdir bam
cd sam
for i in $(ls *.sam | rev | cut -c 5- | rev);
do
samtools view -uS -o ../bam/${i}.bam ${i}.sam
done
cd ..
The above script will create the subdirectory “bam” and convert the SAM files into BAM
files and stores them in the “bam” subdirectory.
Next, the alignments in each BAM file are sorted by coordinate:
mkdir sortedbam
cd bam
for i in $(ls *.bam);
do
samtools sort -T ../sortedbam/tmp.sort -o ../sortedbam/${i} ${i}
done
cd ..
Each sorted BAM file is then indexed:
cd sortedbam
for i in $(ls *.bam);
do
samtools index ${i}
done
cd ..
Duplicate reads usually arise from PCR amplification during library preparation or from optical/clustering artifacts, and they are commonly removed before variant calling in whole genome sequencing. However, several studies have shown that retaining PCR and
Illumina clustering duplicates does not cause significant artifacts as long as the library
complexity is sufficient. The duplicate alignments can be removed with “samtools rmdup”.
The following script creates a new subdirectory “dupRemBam” and then removes duplicate
alignments from the BAM files and stores the new BAM file in the new subdirectory:
mkdir dupRemBam
cd sortedbam
for i in $(ls *.bam);
do
samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log
done;
cd ..
Finally, we can perform the pileup and the variant calling, writing the genotypes of all samples into a single compressed VCF file:
mkdir variants
cd dupRemBam
ls *.bam > bam_list.txt
bcftools mpileup -Ou \
-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
-b bam_list.txt \
| bcftools call -vmO z \
-o ../variants/sarscov2.vcf.gz
cd ..
The VCF file that contains the variants (SNPs and InDels) will be saved in “variants” sub-
directory. For further analysis, the VCF file can be indexed using “tabix”, which is a generic
indexer for TAB-delimited genome position files.
tabix variants/sarscov2.vcf.gz
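Once indexed, specific genomic regions can be extracted quickly. For instance, the following prints the variants in the first 5,000 bases of the genome, assuming the reference sequence name in the VCF is the SARS-CoV-2 RefSeq accession NC_045512.2:
bcftools view -r NC_045512.2:1-5000 variants/sarscov2.vcf.gz | less -S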
All steps from downloading the raw data to variant calling can be included in a single bash
file that can be executed on the command-line prompt of the Linux terminal. First, create
a new directory, make that directory the current working directory, and then create a file
with the run IDs "ids.txt" as described above. Then, create a bash file "pipeline_bcftools.sh", copy the following script into it, and execute it as "bash pipeline_bcftools.sh":
#!/bin/bash
#Sars-Cov2 variant calling
#-------------------------
#1- download fastq files from the NCBI SRA database
mkdir fastq
while read f;
do
fasterq-dump --progress --outdir fastq "$f"
done < ids.txt
#2- download and extract the reference genome
mkdir ref
cd ref
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
#Extract the compressed reference FASTA file
f=$(ls *.*)
gzip -d ${f}
#3- Index the reference FASTA file using samtools and bwa
f=$(ls *.*)
samtools faidx ${f}
bwa index ${f}
cd ..
#4- Align the fastq reads (multiple samples) to the reference genome
mkdir sam
cd fastq
for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq);
do
bwa mem -M -t 4 \
-R "@RG\tID:${i}\tSM:${i}" \
../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
${i}_1.fastq ${i}_2.fastq > \
../sam/${i}.sam 2> ../sam/${i}.log;
done
cd ..
#5- convert SAM files into BAM files
mkdir bam
cd sam
for i in $(ls *.sam | rev | cut -c 5- | rev);
do
samtools view -uS -o ../bam/${i}.bam ${i}.sam
done
cd ..
The resulting VCF file, which contains the genotypes of all samples (Figure 4.4), can be further analyzed as we will discuss at the end of this chapter. However, before analysis, the identified variants must be filtered. For example, the following command keeps only the variants with a quality score greater than 60:
bcftools filter -O z \
-o filtered_sarscov2.vcf.gz \
-i '%QUAL>60' sarscov2.vcf.gz
The same results can be obtained with the following command, which filters out variants with a quality score of 60 or less:
bcftools view -O z \
-o filtered_sarscov2.vcf.gz \
-e 'QUAL<=60' sarscov2.vcf.gz
However, we can also filter variants based on other criteria using the statistics or annota-
tions in INFO or FORMAT fields. For example, you may decide to keep only the variants with a depth greater than 300; thus, you can use the following command:
bcftools filter -O z \
-o filtered2_sarscov2.vcf.gz \
-i 'DP>300' filtered_sarscov2.vcf.gz
You can open the filtered VCF file to see the changes; note that the filter command is also added to the VCF header.
Usually, you can implement different filters on the variants in a VCF file to achieve accu-
rate and reliable results.
Another way to filter variants is to use “bcftools isec” with truth variants in a VCF file
as input together with your raw VCF file to create intersections, unions, and complements
of the VCF files.
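A minimal sketch of such an intersection, assuming a truth set "truth.vcf.gz" (both files must be bgzip-compressed and indexed with tabix):
bcftools isec -p isec_out \
variants/sarscov2.vcf.gz truth.vcf.gz
The output directory "isec_out" will then contain the records private to each file and the records shared by both.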
If the measure of variation within a sliding window passes a threshold, that window is identified as an active region. A measure like entropy may be used to quantify the activity in the region.
The haplotypes are constructed from the reassembled reads following the identifica-
tion of the active regions. The de Bruijn-like graph is used to reassemble the active region
and to identify the possible haplotypes present in the alignments. Once the haplotypes are
determined, the original alignment of the reads will be ignored and the candidate haplo-
types are realigned to the haplotypes of the reference genome using the Smith-Waterman
local alignment. The pairwise alignment is also performed using a Pairwise Hidden Markov Model (PairHMM), which generates a matrix of likelihoods of the haplotypes given the reads. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant
site given the read data. The genotype or the most likely pair of alleles is then determined
for each position. For a given genotype (Gi) and a subset of overlapping reads (Ri), the variant callers then use Bayesian statistics to evaluate the posterior probability of the hypothesized genotype (Gi) as follows:
P(Gi|Ri) = P(Ri|Gi) × P(Gi) / P(Ri)    (4.1)
where the posterior probability P(Gi|Ri) is the probability of the genotype (Gi) given that subset of reads (Ri), P(Gi) is the prior probability that we expect to observe the genotype based on previous observations, P(Ri) is the probability of the subset of the reads being true (the probability of observing the evidence), and P(Ri|Gi) is the probability of the reads given the genotype. The Bayesian variant caller writes the above formula as:
P(Gi|Ri) = P(Ri|Gi) × P(Gi) / Σk [P(Ri|Gk) × P(Gk)]    (4.2)
We can ignore the denominator because it is the same for all genotypes. Thus, the posterior is proportional to the numerator:
P(Gi|Ri) ∝ P(Ri|Gi) × P(Gi)    (4.3)
The variant callers use a flat prior probability that can be changed by the users if the
probabilities of the genotypes are known based on previous observations. The important
probability in the above formula is P(Ri|Gi), which can also be described in terms of the likelihood of the hypothesis of the genotype (Gi) given the reads (Ri):
P(Ri|Gi) = L(Gi|Ri) = Πj [ L(Rj|H1)/2 + L(Rj|H2)/2 ]    (4.4)
where H1 and H2 are the two haplotypes (alleles) that form the genotype. The most likely genotype is then called and assigned to the sample, and positions with putative variants (substitutions or InDels) are written into the VCF file.
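As a small illustrative calculation with made-up likelihoods (not from the book's data): suppose two reads overlap a site and the PairHMM gives L(R1|H1) = 0.9, L(R1|H2) = 0.1, L(R2|H1) = 0.2, and L(R2|H2) = 0.8, where H1 is the reference haplotype and H2 the alternate. For the heterozygous genotype (H1, H2), Equation 4.4 gives (0.9/2 + 0.1/2) × (0.2/2 + 0.8/2) = 0.5 × 0.5 = 0.25. For the homozygous reference genotype, both terms use H1, giving 0.9 × 0.2 = 0.18, and for the homozygous alternate genotype, 0.1 × 0.8 = 0.08. With a flat prior, the heterozygous genotype therefore has the highest posterior probability and would be called at this site.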
In the following, we will perform variant calling with both FreeBayes and GATK, which
are examples of haplotype-based variant callers.
Use the following command to read more about FreeBayes usage and options:
freebayes --help
The basic form of a FreeBayes variant-calling run is as follows:
freebayes \
-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
-C 5 \
-L bam_list.txt \
-v ../variants/sarscov2.vcf
The “-f” option specifies the reference file, “-C” specifies the minimum number of observa-
tions supporting an alternate allele within a single individual in order to evaluate the posi-
tion, “-L” will pass the name of the text file that contains the names of the BAM files (each
file name in a line), “-v” will pass the VCF file name.
We will now use FreeBayes to identify variants in the SARS-CoV-2 samples, following the same steps we used for "bcftools" above. First, we will create a
project directory and store the run IDs in a file “ids.txt” as above. Then, we will save the
following script in a file “pipeline_freebayes.sh” and execute it as “bash pipeline_freebayes.
sh”:
#!/bin/bash
#Sars-Cov2 variant calling
#-------------------------
#1- download fastq files from the NCBI SRA database
mkdir fastq
while read f;
do
fasterq-dump --progress --outdir fastq "$f"
done < ids.txt
cd ..
#7- Marking/removing duplicate alignments
mkdir dupRemBam
cd sortedbam
for i in $(ls *.bam);
do
samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log
done;
cd ..
#8- Alignment pileup and variant calling using bcftools
mkdir variants
cd dupRemBam
for i in $(ls *.bam);
do
samtools index ${i}
done;
ls *.bam > bam_list.txt
freebayes \
-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
-C 5 \
-L bam_list.txt \
-v ../variants/sarscov2.vcf
cd ..
#9- Compress and index the VCF file
bgzip variants/sarscov2.vcf
tabix variants/sarscov2.vcf.gz
#to view contents
bcftools view variants/sarscov2.vcf.gz | less -S
You may notice slight differences between the VCF file generated by bcftools and the one generated by FreeBayes. The reason is that each program uses a different algorithm for variant calling, as discussed above.
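One quick way to compare the two call sets is "bcftools stats" with both files as input; this is a sketch, assuming both VCFs have been copied into the current directory, bgzip-compressed, indexed with tabix, and given the placeholder names below:
bcftools stats bcftools_sarscov2.vcf.gz freebayes_sarscov2.vcf.gz > caller_comparison.txt
The report counts the variants unique to each call set and those shared by both.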
In the following example, we will call variants with GATK on sequencing reads of human chromosome 21 from 13 individuals. After creating a project directory and saving the run IDs in a file "ids.txt", the following script downloads the aligned reads of chromosome 21 for each run from the NCBI SRA database using "sam-dump":
mkdir fastq
cd fastq
while read id;
do
sam-dump \
--verbose \
--fastq \
--aligned-region chr21 \
--output-file ${id}_chr21.fastq \
${id}; \
done < ../ids.txt
cd ..
The download may take a long time depending on the speed of the Internet connection and
computer memory and processing units. The above script will create the “fastq” directory
and save the FASTQ files of chromosome 21 of 13 human individuals.
With the FASTA sequence, we can also download the sequence dictionary file. Once we
have downloaded the FASTA sequence of the reference genome, we can index the sequence
with both samtools and bwa. The following script first creates the directory “refgenome”,
and then, in that directory, it downloads the sequence of the current human genome and
its dictionary file from GATK resource bundle and indexes the FASTA:
mkdir refgenome
cd refgenome
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict
#or create a fasta dict for reference
f=$(ls *.fasta)
samtools faidx ${f}
bwa index ${f}
cd ..
Only human sequences are available in the GATK resource bundle. However, if you down-
load the sequence of a reference genome from a different database, you may need to create
a dictionary for that sequence using Picard software as follows:
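A minimal sketch with Picard's CreateSequenceDictionary (the jar location and the FASTA file name are placeholders; the output ".dict" file should sit next to the FASTA file):
java -jar ~/software/picard.jar CreateSequenceDictionary \
R=refgenome/reference.fasta \
O=refgenome/reference.dict
Next, the reads of each sample are aligned to the indexed reference genome with bwa, writing one SAM file and one log file per sample into the "sam" subdirectory: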
mkdir sam
cd fastq
ref=$(ls ../refgenome/*.fasta)
for i in $(ls *.fastq | rev | cut -c 13- | rev);
do
bwa mem -M -t 4 \
-R "@RG\tID:${i}\tSM:${i}" \
${ref} \
${i}_chr21.fastq > \
../sam/${i}_chr21.sam 2> ../sam/${i}_chr21.log;
done
cd ..
mkdir bam
cd sam
for i in $(ls *.sam | rev | cut -c 5- | rev);
do
samtools view -uS -o ../bam/${i}.bam ${i}.sam
done
cd ..
mkdir sortedbam
cd bam
for i in $(ls *.bam);
do
samtools sort -T ../sortedbam/tmp.sort -o ../sortedbam/${i} ${i}
samtools index ../sortedbam/${i}
done
cd ..
The following script extracts the alignments on chromosome 21 from each sorted BAM file into the "chr21" subdirectory and indexes the output:
mkdir chr21
cd sortedbam
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
samtools view -b ${i}.bam chr21 > ../chr21/${i}.bam
samtools index ../chr21/${i}.bam
done
cd ..
Next, duplicate reads are marked in each BAM file with Picard's MarkDuplicates, and the output files are stored in the "dedup" subdirectory:
mkdir dedup
cd chr21
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
java -Xmx7g -jar ~/software/picard.jar MarkDuplicates \
-INPUT ${i}.bam \
-OUTPUT ../dedup/${i}.dedup.bam \
-METRICS_FILE ../dedup/${i}.dedup_metrics.txt
samtools index ../dedup/${i}.dedup.bam
done
cd ..
Read group (RG) tags allow GATK4 to differentiate samples and technical features that are associated with artifacts. Such information is used to reduce the effects of artifacts during the
duplicate marking and base recalibration steps. GATK requires several RG fields to be pres-
ent in input files and will fail with errors if this requirement is not met. The RG fields are
usually set when the reads are aligned to a reference genome by the aligner. However, if the
BAM files do not include the RG fields, we can use the “AddOrReplaceReadGroups” Picard
function to add or modify the RG fields as we will do next. AddOrReplaceReadGroups uses
the following arguments to add or modify the RG fields on a BAM file:
RGID: This adds group read ID, which must be unique across the samples.
RGPU: This argument is optional for GATK, and it holds the flow cell barcode, lane ID,
and sample barcode separated by a period “.”.
RGSM: This holds the sample name that will be used later as a header of sample column
in the VCF file.
RGPL: This argument sets the RG field which holds the sequencing technology name. The value can be ILLUMINA, SOLID, LS454, HELICOS, or PACBIO.
RGLB: This sets the RG field which holds the DNA preparation library identifier. The
“MarkDuplicates” function of GATK uses this field to determine which RGs contain
duplicates.
The following script creates the directory “RG” and uses Picard to add the RG to each
BAM file. We use the run ID as both the read group ID and the sample name.
mkdir RG
cd dedup
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
java -jar ~/software/picard.jar AddOrReplaceReadGroups \
I=${i}.bam \
O=../RG/${i}.RG.bam \
RGID=${i} \
RGLB=lib RGPL=ILLUMINA \
SORT_ORDER=coordinate \
RGPU=bar1 RGSM=${i}
samtools index ../RG/${i}.RG.bam
done
cd ..
The following bash script downloads standard human SNP and InDel VCF files, with their index files, into the "refvcf" subdirectory and then performs the two steps of BQSR. Notice that for the BaseRecalibrator tool, the known variant files are provided with the "--known-sites" option. The outputs of the base recalibration are stored in the "applyBQSR" subdirectory.
mkdir refvcf
cd refvcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
cd ..
##B- Build the BQSR model
#------------------------
mkdir BQSR
cd RG
ref=$(ls ../refgenome/*.fasta)
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
~/software/gatk-4.2.3.0/gatk --java-options \
-Xmx4g BaseRecalibrator \
-I ${i}.bam \
-R ${ref} \
--known-sites ../refvcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--known-sites ../refvcf/Homo_sapiens_assembly38.known_indels.vcf.gz \
-O ../BQSR/${i}.table
done
cd ..
##C- Apply the model to adjust the base quality scores
#----------------------------------------------------
mkdir applyBQSR
cd RG
ref=$(ls ../refgenome/*.fasta)
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
~/software/gatk-4.2.3.0/gatk \
--java-options \
-Xmx4g ApplyBQSR \
-I ${i}.bam \
-R ${ref} \
--bqsr-recal-file ../BQSR/${i}.table \
-O ../applyBQSR/${i}.bqsr.bam
done
cd ..
Next, GATK HaplotypeCaller is run in GVCF mode ("-ERC GVCF") on each recalibrated BAM file, restricted to chromosome 21, producing one gVCF file per sample in the "gvcf" subdirectory:
mkdir gvcf
cd applyBQSR
ref=$(ls ../refgenome/*.fasta)
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
~/software/gatk-4.2.3.0/gatk \
--java-options \
-Xmx10g HaplotypeCaller \
-I ${i}.bam \
-R ${ref} \
-L chr21 \
-ERC GVCF \
-O ../gvcf/${i}.g.vcf.gz
done
cd ..
To combine the per-sample gVCF files, we can use the "-V" option for each one. Instead of using the "-V" option several times for multiple gVCF files, the sample information can be saved in a text file called a cohort sample map file. The file can then be passed with the "--sample-name-map" option. The cohort sample map file is
a plain text file that contains two tab-separated columns; the first column is for the sample
IDs and the second column is for the names of the gVCF files. Each sample ID is mapped
to a sample file name as shown in Figure 4.6.
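For example, the content of the file (tab-separated) might look like the following, where the run IDs and path are placeholders:
SRR1000001	/home/user/project/gvcf/SRR1000001_chr21.dedup.RG.bqsr.g.vcf.gz
SRR1000002	/home/user/project/gvcf/SRR1000002_chr21.dedup.RG.bqsr.g.vcf.gz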
The cohort sample map file can be created manually by the user. However, we can also
use bash script to create it. The following script creates a cohort sample map file for our 13
samples and the file will be as shown in Figure 4.6:
cd gvcf
#a- list file names and absolute paths
find "$PWD"/*_chr21.dedup.RG.bqsr.g.vcf.gz -type f -printf '%f %h/%f\n' > ../tmp.txt
#b- replace the file name suffix in the first column with a comma
awk '{ gsub(/_chr21.dedup.RG.bqsr.g.vcf.gz/,",", $1); print }' ../tmp.txt > ../tmp2.txt
rm ../tmp.txt
#c- remove spaces
cat ../tmp2.txt | sed -r 's/\s+//g' > ../tmp3.txt
rm ../tmp2.txt
#d- replace the comma with a tab
sed -e 's/\,\+/\t/g' ../tmp3.txt > ../cohort.sample_map
rm ../tmp3.txt
Once we have created the cohort sample map file, we can run GenomicsDBImport tool
to import gVCF sample files and GenotypeGVCFs tool to consolidate the variants of 13
samples in a single VCF file.
#create a database
ref=$(ls ../refgenome/*.fasta)
~/software/gatk-4.2.3.0/gatk \
--java-options -Xmx10g \
GenomicsDBImport \
-R ${ref} \
--genomicsdb-workspace-path ../gvcf21db \
--batch-size 50 \
--sample-name-map ../cohort.sample_map \
-L chr21 \
--tmp-dir ../tmp \
--reader-threads 4
cd ..
mkdir vcf
ref=$(ls refgenome/*.fasta)
~/software/gatk-4.2.3.0/gatk \
--java-options -Xmx10g \
GenotypeGVCFs \
-R ${ref} \
-V gendb://gvcf21db \
-O vcf/allsamples_chr21.vcf
The called variants can then be separated into SNPs and InDels with the SelectVariants tool:
#SNPS
~/software/gatk-4.2.3.0/gatk \
--java-options \
-Xmx10g SelectVariants \
-V vcf/allsamples_chr21.vcf \
-select-type SNP \
-O vcf/allsamplesSNP_chr21.vcf
#INDEL
~/software/gatk-4.2.3.0/gatk \
--java-options \
-Xmx4g SelectVariants \
-V vcf/allsamples_chr21.vcf \
-select-type INDEL \
-O vcf/allsamplesIndels_chr21.vcf
Variant quality score recalibration (VQSR) builds a model using a training dataset from validated data resources, such as 1000 Genomes, OMNI, and HapMap, and then it uses the model to filter out the putative artifacts from the called variants. The application of the model assigns a log-odds ratio score (VQSLOD) to each variant that measures how likely that variant is to be real, based on the data used in the training.
The VQSLOD is added to the INFO field of the variant. The variants are then filtered based
on a threshold. SNPs and InDels are recalibrated separately. The variant calibration and
filtering are performed in two steps:
(i) Building of the recalibration model:
The recalibration model is built using VariantRecalibrator tool. The input file for this
tool is the variants to be recalibrated “-V” and the known training dataset “--resource”.
The latter must be downloaded from a reliable source such as GATK resource bundle. The
fitted model is used to estimate the relationship between the probability that a variant is true, rather than an artifact, and continuous covariates that include QD (quality by depth), MQ (mapping quality), and FS (FisherStrand). The VQSLOD is estimated from a Gaussian mixture model of whether a variant is true versus false. Each variant in the input VCF
file is assigned a VQSLOD in INFO field of the VCF file and the variants are ranked by
VQSLOD. A tranche sensitivity threshold can be provided in “-tranche” as a percentage.
Several thresholds can be set. The output of this step is a recalibrated VCF file and other
files including tranches, which will be used by ApplyVQSR, and plot files.
cd refvcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx
cd ..
mkdir VQSR
cd vcf
~/software/gatk-4.2.3.0/gatk --java-options \
-Xmx10g VariantRecalibrator \
-R ../refgenome/Homo_sapiens_assembly38.fasta \
-V allsamplesSNP_chr21.vcf \
--trust-all-polymorphic \
-tranche 100.0 \
-tranche 99.95 \
-tranche 99.90 \
-tranche 99.85 \
-tranche 99.80 \
-tranche 99.00 \
-tranche 98.00 \
-tranche 97.00 \
-tranche 90.00 \
--max-gaussians 6 \
--resource:1000G,known=false,training=true,truth=true,prior=10.0 \
../refvcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=false,training=true,truth=false,prior=2.0 \
../refvcf/Homo_sapiens_assembly38.dbsnp138.vcf \
-an QD -an MQ -an MQRankSum \
-an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O ../VQSR/tranches.out \
--tranches-file ../VQSR/vqsrplots.R
cd ..
#Apply VQSR
mkdir filteredVCF
~/software/gatk-4.2.3.0/gatk \
--java-options -Xmx4G ApplyVQSR \
-V vcf/allsamplesSNP_chr21.vcf \
-O filteredVCF/humanSNP.vcf \
--truth-sensitivity-filter-level 99.7 \
--tranches-file VQSR/vqsrplots.R \
--recal-file VQSR/tranches.out \
-mode SNP \
--create-output-variant-index true
Alternatively, variants can be hard-filtered with fixed thresholds using GATK's VariantFiltration tool; the command begins like the other GATK calls above, with the input VCF passed to "-V". For the SNPs, the filter expressions and their names are:
-filter "QD<2.0" \
--filter-name "QD2" \
-filter "QUAL<30.0" \
--filter-name "QUAL30" \
-filter "SOR>3.0" \
--filter-name "SOR3" \
-filter "FS>60.0" \
--filter-name "FS60" \
-filter "MQ<40.0" \
--filter-name "MQ40" \
-filter "MQRankSum<-12.5" \
--filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum<-8.0" \
--filter-name "ReadPosRankSum-8" \
-O filteredVCF/hardfilteredVCF.vcf
A similar command hard-filters the InDels; it ends with:
--filter-name "ReadPosRankSum-20" \
-O filteredVCF/hardfilteredIndels.vcf
Remember that the consequence may also be beneficial in some cases. For instance, it has
been reported that truncating variants in CARD9, IL23R, and RNF186 proteins may protect
against Crohn’s disease and ulcerative colitis and also truncating variants in ANGPTL4,
APOC3, PCSK9, and LPA proteins may protect against coronary heart disease [9, 10]. The
impact of variants on a protein is also measured by the number of isoforms produced by
the affected gene, the percentage of the protein affected, and moreover, we should put into
consideration that a frameshift may be bypassed by splicing or its impact may be avoided
by another frameshift. In the attempt to annotate genetic variants, the SNVs effect can be
predicted with high accuracy followed by small InDels (1–50 bp) and then medium InDels
(50–100 bp). It is also easy to predict the effect of missense SNV. The variant annotation
tools have several approaches to predict the effect of missense variant. For instance, they
can take into consideration the physicochemical properties of amino acids, whether the
variant is in a conserved region or not, or does it affect the three-dimensional structure of
the protein. A variant in a region conserved across the species or in a region of a secondary
or tertiary structure is more likely to be deleterious. Some tools use homology modeling
to simulate the structure of the new protein to predict the effect of variants and other tools
use machine learning utilizing multiple features to annotate the variants with the right
information and consequences. Figure 4.8 and Table 4.3 show a segment of a eukaryotic
genomic gene and possible variant annotations in each region.
In a typical genome-wide variant study, thousands of variants may be discovered. The
significance of these variants varies based on the type, location, and possible consequence.
Covering all variants is daunting and hence prioritizing these variants with potential
associations to phenotypes of interest is usually the key target of the variant annotation.
During the last decade, numerous GWASs were conducted to identify genomic variants
associated with many complex diseases and traits. However, most studies were focused
on human and some model organisms. Databases for variant mapping and genotype–
phenotype association were developed to serve as rich resources for variant annotation.
Examples of these databases include NHLBI, which contains health information col-
lected by NHLBI’s epidemiological cohorts and clinical trials, dbGaP, which is an NCBI
database of Genotypes and Phenotypes association, the Exome Aggregation Consortium
(ExAC), which includes sequencing data from a variety of large-scale sequencing projects,
Catalogue of Somatic Mutations in Cancer (COSMIC), etc. Numerous similar databases
were developed for specific diseases such as cancer, autoimmune diseases, and Alzheimer’s
disease. Prioritizing genetic variants relevant to human diseases is the top priority. Guidelines
have been developed for investigating variants and their association with human diseases
so that such knowledge can be used for diagnosis in a clinical setting. Indeed, after acquir-
ing high-confidence variants, the next step is to annotate and interpret these variants using
either prior knowledge or functional prediction based on the impact of the variant on the
translated protein. The studies on genetic variants are usually interested in the variants
that are associated with diseases or traits, or that have an effect on protein function. There are a variety of consequences that can be caused by variants. A variant may be pathogenic or implicated in a health condition, may be a damaging variant that alters the normal function of a gene, or may be a deleterious variant that reduces the fitness of the affected individuals. Hence, variant annotation must be conducted after filtering variants as discussed above to avoid misinterpretation, false positives, and false negatives. Generally, we
can define variant annotation as the process of assigning functional or phenotype infor-
mation to genetic variants such as SNPs, InDels, or copy number variants. Based on this
definition, perhaps, the most significant variants are the ones on the coding region of the
genome. This is because mutations on coding region may have a direct impact on the pro-
tein and may be implicated in a disease. The variants on non-coding region of the genome
may also have impact but the challenge is that it is difficult to establish a testable hypoth-
esis. Therefore, statistical methods were developed for variant prioritization by incorpo-
rating diverse functional evidence, so that variants with small effect sizes but possessing
functional features may be prioritized over variants with similar effect sizes but less likely
to be functional.
There are numerous variant annotation tools that attempt to associate variants to knowl-
edge-based information and generate reports. The most commonly used tools include SIFT,
SnpEff, Annovar, and VEP, which we will use to annotate the variants.
4.4.1 SIFT
The SIFT [11], which stands for Sorting Intolerant from Tolerant, was first introduced
in 2001 as an online variant annotation tool that annotates coding region of genes with
the missense variant effects on the translated protein. SIFT relies on the assumption that
substitutions in conserved regions are more likely to be deleterious if the missense SNV
resulted in a nonsynonymous codon that is translated into an amino acid with different
physicochemical properties. For instance, if a hydrophobic amino acid is replaced by
another hydrophobic amino acid, SIFT will predict that change is tolerated; however, if it is
substituted with a polar amino acid, the variant will be predicted as deleterious. SIFT algo-
rithm avails of the NCBI PSI-BLAST as it uses the translated protein as a query sequence
against a database of protein sequences. The search hit sequences are aligned using mul-
tiple sequence alignment (MSA) and the probabilities of all possible substitutions at each
position are computed forming position-specific scoring matrix (PSSM), where each entry
in the matrix represents the probability of observing an amino acid in that column of the
alignment. The probabilities are normalized based on the consensus amino acids. Then, each position has a normalized probability that ranges between 0 and 1. SIFT predicts that a SNV
with a probability between 0.0 and 0.05 on that position is deleterious and will affect the
function of the protein and a probability greater than 0.05 (>0.05) can be tolerated. SIFT
also measures conservation of the sequence using the median sequence conservation,
which ranges from 0 to log2(20) or from 0 to 4.32, where median sequence conservation
of 4.32 indicates that all sequences in the alignment are identical to each other, and hence,
any variant in this region will be predicted as damaging. SIFT also reports the number of
sequences at the variant position. The latest version of SIFT is SIFT 4G (SIFT for genomes),
which is faster and enables practical computations on reference genomes using precom-
puted databases and also it provides SIFT prediction for more organisms. Hundreds of
databases for different organisms are available.
Use the following steps to annotate variants using SIFT 4G on Linux terminal:
First, create a directory with the name of your choice or “sift4g” and change into it.
mkdir sift4g
cd sift4g
wget https://sift.bii.a-star.edu.sg/sift4g/public/Homo_sapiens/GRCh38.78.zip
unzip GRCh38.78.zip
Each chromosome will have three files: a compressed file with “gz” file extension, a region
file with “.region” file extension, and a chromosome statistics file with “.txt” file extension.
Download SIFT 4G Annotator Java executable file (.jar) in a directory or in your work-
ing directory:
wget https://github.com/paulineng/SIFT4G_Annotator/raw/master/SIFT4G_Annotator.jar
Assuming that your VCF file is in the current directory, you can run the following on the
command line:
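A minimal sketch; the VCF name, the database directory name, and the results directory below are placeholders for your own files:
java -jar SIFT4G_Annotator.jar -c \
-i humanSNP.vcf \
-d GRCh38.78 \
-r results \
-t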
The "-jar" option gives the path to the SIFT 4G Annotator jar file, "-c" runs the annotator in command-line mode, "-i" gives the input VCF path, "-d" the database path, "-r" specifies the output folder, and "-t" extracts annotations for multiple transcripts.
The above command generates two files: an Excel annotation file with “.xls” file exten-
sion and a VCF file.
Instead of the command line, you can also use SIFT 4G with a graphic user interface
(GUI) by running the following command:
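Assuming the jar file is in the current working directory:
java -jar SIFT4G_Annotator.jar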
This will open the SIFT interface, where you can browse to select the VCF file and the database
and click Start to annotate the variants. The links of the two output files are given at the
bottom of the GUI (Figure 4.9).
We can open the Excel file to check its content (Figure 4.10). We can notice that some
annotations have been added to the variants, such as Ensemble transcript ID, gene ID,
gene name, region, variant type (synonymous or nonsynonymous), SIFT score, median
conservation score, number of sequences, dbSNP accession, and SIFT prediction whether
the variant is tolerated or deleterious.
4.4.2 SnpEff
SnpEff [12] is another variant annotation tool that categorizes the coding effects of variants
based on their genomic locations such as introns, untranslated region (UTR), upstream,
downstream, and splicing site. SnpEff predicts a variety of variant effects including syn-
onymous or nonsynonymous substitution, start-gain codon, start-loss codon, stop-gain
codon, stop-loss codon, or frameshifts.
In general, SnpEff consists of two main components: (i) database builds and (ii) vari-
ant effect calculation. The SnpEff database builds are usually distributed with SnpEff, and there are hundreds of databases available. A database build is a gzip-compressed
serialized object that is formed of the genome FASTA sequence and an annotation file (in
GTF or GFF format). These database files can be acquired from database resources such as
ENSEMBL and UCSC. The variant effect calculation is performed after building the data-
base. It begins with building a data structure, a hash table of interval trees indexed by chromosome. The data structure indexes intervals and makes searching them efficient. The
SnpEff program uses the VCF file as input and finds the intersections with the annotated
database. The intersecting genomic regions are then identified and the variant effect is
calculated from exonic region only. Simply, SnpEff will take information from the pro-
vided annotation database and populate the input VCF file by adding annotation into the
INFO field name, ANN. Data fields are encoded separated by pipe sign “|”; and the order
of fields is written in the VCF header. As examples, variants may be categorized by SnpEff
as SNP (single-nucleotide polymorphism), Ins (insertion), Del (deletion), MNP (multiple-
nucleotide polymorphism), or MIXED (multiple-nucleotide and InDel). The impacts of
variants are classified into high, moderate, low, or modifier based on the affected region. A
variant will have a high impact when it is disruptive and likely to cause protein truncation,
loss of function, or triggering nonsense mediated decay. The variants with high impact
are frameshift and stop-gain variants. The non-disruptive variants such as missense SNV
and inframe deletion that might change protein effectiveness only are moderate impact
variants. The low-impact variants are the synonymous ones that do not change protein
behavior; however, they may still have some effects. Variants on introns (intron-variants)
and variants in the downstream region of a gene are annotated as MODIFIER since they
affect non-coding regions, where prediction of the impact is difficult or there is no substan-
tial evidence of any effect.
To use snpEff, we should install snpEff software and download the database of interest.
Installation of SnpEff software requires Java V1.8 or later installed on your computer. The
installation instructions are available at “https://pcingola.github.io/SnpEff/download/” or
“http://pcingola.github.io/SnpEff/”. First, download the compressed folder using “wget”
Linux command and then decompress it with “unzip” command. You can also install the
binary.
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_
core.zip
unzip snpEff_latest_core.zip
This will create the directory “snpEff”, which contains two Java executable files with “.jar”
extension (snpEff.jar and snpSift.jar), a snpEff configuration file (snpEff.config), a license
file, and four subdirectories (“examples”, “exec”, “galaxy”, and “script”). By default, the
database will be stored in a subdirectory “data” in the “snpEff” directory but we can change
it by opening “snpEff.config” and editing “data.dir = ./data/” to the path that we need.
The database can be downloaded automatically and this is recommended but you can
also install it manually. For example, if you need to download the human database manu-
ally, you can run the following:
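A sketch of the download step; database names change between SnpEff releases, so list them with "java -jar snpEff.jar databases" and substitute the current human GRCh38 build for the placeholder below:
java -jar snpEff.jar download -v GRCh38.99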
Figure 4.11 shows the directory structure of the snpEff. If you installed the binary files, you
can use any of the executables without the “java” command as follows:
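For example, with the binary wrapper on your PATH (the database name is the same placeholder as above):
snpEff download -v GRCh38.99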
When you use snpEff, you must provide the right file path. For instance, if you are one level above the "snpEff" directory, you prefix the jar file with that directory name. Here, "snpeff" is the current working directory that we created to store the snpEff software, the VCF file, and the output, while the snpEff executable file and database are in the "snpEff" subdirectory. After copying our VCF file "humanSNP.vcf" into the working directory, you can annotate it using the following command:
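A minimal sketch, run from the working directory "snpeff"; GRCh38.99 is a placeholder for the database build you actually downloaded:
java -Xmx8g -jar snpEff/snpEff.jar GRCh38.99 humanSNP.vcf > mySNPanot.vcf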
This command will produce three files: a VCF file (mySNPanot.vcf), gene file (snpEff_
genes.txt), and summary file in html format (snpEff_summary.html). SnpEff adds func-
tional annotations in the ANN keyword in the INFO field of the VCF output file. Figure
4.12 shows the VCF output file, which is modified to show ANN under INFO field. The
INFO field may include the effect of the variant (stop loss, stop gain, etc.), effect impact
on gene (High, Moderate, Low, or Modifier), or functional class of the variant (nonsense,
missense, frameshift, etc.).
Moreover, we can view the summary on the html file to have a general idea about the
type and regions and effects of the variants. If you have “firefox” installed, you can display
the summary on the html file using the “firefox” command or you can open it with an
Internet browser.
firefox snpEff_summary.html
Figure 4.13 shows the summary of the annotation using SnpEff and variant rate details.
Remember that the VCF file contains the variants of the human chromosome 21 only.
Figure 4.14 shows the number of variant effects by impact and by functional class. Only
68 SNVs (0.009%) have high impact. The remaining variants are SNVs with moderate impact (0.149%), SNVs with low impact (0.149%), and modifiers (99.575%).
Figure 4.15 shows the number of effects by type and region. Figure 4.16 shows a bar
chart showing the percentage of effects by region.
4.4.3 ANNOVAR
ANNOVAR [13] is one of the most commonly used annotation tools that is used to anno-
tate SNVs and InDels with different types of annotations including functional conse-
quence on genes, inferring cytogenetic bands, functional importance, impacts of variants
in conserved regions, and identifying variants reported in the NCBI dbSNP and the 1000 Genomes Project.
Ready-built human databases are available at ANNOVAR server, UCSC Genome Browser
website, or third parties and can be downloaded using “annotate_variation.pl” program.
For the non-human species which have no available annotation databases, a database can
be built from the FASTA sequence of the reference genome and the GFF/GTF annota-
tion file of that species. Those two files can be downloaded from databases such as NCBI
Genome database or UCSC database.
As shown in Table 4.4, ANNOVAR consists of six Perl files that can be used as com-
mand-line programs on any computer with Perl installed. The download instructions are
available at “https://annovar.openbioinformatics.org/en/latest/user-guide/download/”.
You may be asked to register with your school email. The download link will be emailed to
you, and then you can download the compressed file onto your computer and decompress
it with “tar xvf” command. If you are using Linux, you can add ANNOVAR to the path by
adding the following line to the end of “.bashrc” file:
export PATH="YOURPATH/annovar:$PATH"
annotate_variation.pl \
[arguments] \
<query-file|table-name> \
<database-location>
To display the help and the full list of arguments, run:
annotate_variation.pl -h
To list the available annotation databases for the hg19 build of the human reference genome,
you can run the following command:
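A sketch using ANNOVAR's "avdblist" keyword (create the "dblist" subdirectory first if it does not exist):
annotate_variation.pl \
-buildver hg19 \
-downdb \
-webfrom annovar avdblist dblist/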
The list of the databases of the hg19 build will be saved in a file “hg19_avdblist.txt” in
“dblist” subdirectory. You can open that file and study the available annotation databases.
The “-webfrom” argument specifies the source of database; the values can be “ucsc”,
“annovar”, or a URL where the database is available.
The “-downdb” argument is the command to download an annotation database from
the source specified by “-webfrom”.
The “-buildver” argument specifies the genome build version (e.g., hg19).
For example, the following commands download some annotation databases for the hg19 build and store them in the "humandb" directory. To download the gene-based annotation database (refGene) for human variants:
annotate_variation.pl \
-buildver hg19 \
-downdb \
-webfrom annovar refGene humandb/
annotate_variation.pl \
-buildver hg19 \
-downdb cytoBand humandb/
annotate_variation.pl \
-buildver hg19 \
-downdb \
-webfrom annovar exac03 humandb/
annotate_variation.pl \
-buildver hg19 \
-downdb \
-webfrom annovar avsnp147 humandb/
To download the filter-based dbNSFP annotation database, which was developed for func-
tional prediction and annotation of all potential nonsynonymous SNVs in the human
genome:
annotate_variation.pl \
-buildver hg19 \
-downdb \
-webfrom annovar dbnsfp30a humandb/
The database files are downloaded into the specified directory "humandb". Save the annotation databases of each organism in a separate directory.
Not all non-human organisms have annotation databases. In this case, you can build an
annotation database for any organism by yourself. The following steps show how to build a
gene-based annotation database. As an example, we will build an annotation database for
SARS-CoV-2 and we will use it later to annotate the variants called in a previous example.
The following are the steps to build SARS-CoV-2 gene-based annotation database:
1. Download the reference genome sequence of the organism in FASTA format and the
sequence annotation file in GFF/GTF format. For SARS-CoV-2, we can download both
files from the NCBI Genome database at
https://www.ncbi.nlm.nih.gov/genome/86693?genome_assembly_id=757732
Use the following commands to create a directory “sarscov2db” and download the ref-
erence FASTA file and GFF file into it:
mkdir sarscov2db
cd sarscov2db
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz
gunzip GCF_009858895.2_ASM985889v3_genomic.fna.gz
gunzip GCF_009858895.2_ASM985889v3_genomic.gff.gz
2. Use the “gff3ToGenePred” tool to convert the GFF file to GenePred file, which is a file
format used to specify the gene track annotations for an imported genome. For the GTF format, use "gtfToGenePred" to convert it into a GenePred file. Both "gff3ToGenePred" and "gtfToGenePred" are UCSC Genome Browser application binaries built for standalone command-line use on Linux and UNIX platforms. They can be downloaded by
choosing the right platform at “http://hgdownload.soe.ucsc.edu/admin/exe/”. For the sake
of simplicity, we can download “gff3ToGenePred” in the same “sarscov2db” directory and
use “chmod” to allow it to run as a program:
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred
chmod +x gff3ToGenePred
If you wish to download all UCSC Genome Browser binaries, run the following:
mkdir ucsc
cd ucsc
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./
gff3ToGenePred \
GCF_009858895.2_ASM985889v3_genomic.gff \
SARSCOV2_refGene.txt
Next, use ANNOVAR's "retrieve_seq_from_fasta.pl" script to generate the transcript FASTA file from the reference sequence and the GenePred file:
retrieve_seq_from_fasta.pl \
--format refGene \
--seqfile GCF_009858895.2_ASM985889v3_genomic.fna \
SARSCOV2_refGene.txt \
--out SARSCOV2_refGeneMrna.fa
convert2annovar.pl -h
The VCF file format is the standard format for variant calling. The VCF file can be con-
verted into ANNOVAR input file by using “-format vcf4” argument.
Figure 4.17 shows the directory tree which includes “annovar” directory that contains
the ANNOVAR scripts and subdirectories and the “input” directory that includes the VCF
files (sarscov2.vcf and humanSNP.vcf) from the previous SARS-CoV-2 and human variant
calling examples. We copied them to this directory for simplicity. The following command
will convert “humanSNP.vcf” file into ANNOVAR input format “humanSNP.avinput”:
convert2annovar.pl \
-format vcf4 input/humanSNP.vcf \
> input/humanSNP.avinput
Figure 4.18 shows the ANNOVAR input file, which includes the first five essential columns
and three additional columns.
For converting other variant calling file formats, run “convert2annovar.pl -h”. This
command is also used with “-dbSNP” option to add the dbSNP accessions.
Variant annotation with ANNOVAR:
The “annotate_variation.pl” script is the core program for ANNOVAR annotation. It
requires ANNOVAR input file. However, “table_annovar.pl” script is also used for annota-
tion and it takes a VCF file as input.
./annotate_variation.pl \
-out ../output/humanSNPannot \
-build hg19 \
../input/humanSNP.avinput \
humandb/
The above command generates three files with the “humanSNPannot” prefix as shown in
Figure 4.19. The “humanSNPannot.variant_function” file contains annotation for all vari-
ants, by adding two columns to the beginning of each input line (Figure 4.20).
The first column is annotated with the affected part of the gene. The parts can be exonic,
splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, or intergenic region. The
second added column annotates the gene name or names.
The second annotation output file, “humanSNPannot.exonic_variant_function”, con-
tains the amino acid changes as a result of the exonic variant (Figure 4.21).
The first column includes the line number in the original input file. The second column
shows the functional consequences of the variant. The possible consequences are nonsyn-
onymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift
insertion, nonframeshift deletion, frameshift block substitution, or nonframeshift block
substitution. The third column includes the gene name, the transcript identifier, and the
sequence change in the corresponding transcript.
The “annotate_variation.pl” tool has numerous arguments. Use “annotate_variation.pl
-h” to display the complete list of arguments.
ANNOVAR provides the "table_annovar.pl" script as an easy way to annotate variants in a VCF file directly, with no need to convert the VCF file into an ANNOVAR input file. It takes a
VCF file as an input and generates a tab-delimited output file with many columns, each
represents one set of annotations. It also generates a new output VCF file with the INFO
field filled with annotation information.
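A minimal sketch of such a run, assuming the databases downloaded above and executed from inside the "annovar" directory as before; the output prefix "humanSNP2" matches the file names discussed below:
./table_annovar.pl ../input/humanSNP.vcf humandb/ \
-buildver hg19 \
-out ../output/humanSNP2 \
-remove \
-protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a \
-operation g,r,f,f,f \
-nastring . \
-vcfinput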
The “-remove” option removes all temporary files. The “-protocol” option is comma-delim-
ited string that specifies an annotation protocol. These strings typically represent data-
base names in ANNOVAR. The “-operation” option tells ANNOVAR which operations
to use for each of the protocols, where “g” means gene-based, “gx” means gene-based
with cross-reference annotation (from -xref argument), “r” means region-based, and
“f ” means filter-based. The above ANNOVAR command generated three output files:
“humanSNP2.avinput”, “humanSNP2.hg19_multianno.txt”, and “humanSNP2.hg19_
multianno.vcf ”. The first one is an ANNOVAR input file. The second one is the annota-
tion file with annotation columns, and the last one is a VCF file with annotation added
to INFO fields. Open each of these files and study their contents.
We can also try the annotation database that we created for SARS-CoV-2. We can anno-
tate “sarscov2.vcf”, which was generated from a previous variant calling example. You can
copy it to the “input” directory for easy use. Thus, we can annotate it using the following
script:
-operation g \
-nastring . \
-vcfinput
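Put together, the complete call might look like the following sketch; the build name "SARSCOV2" follows from the prefix of the database files created earlier, while the database path and output prefix are placeholders:
./table_annovar.pl ../input/sarscov2.vcf ../sarscov2db/ \
-buildver SARSCOV2 \
-out ../output/sarscov2annot \
-remove \
-protocol refGene \
-operation g \
-nastring . \
-vcfinput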
4.5 SUMMARY
The high-throughput sequencing makes variant discovery much easier than the use of
traditional methods like microarrays. Raw data obtained from sequencing technology is
used for the detection of variants including base substitutions, insertions, deletions, and
structural variants. Variants can be in any region of the genome; however, only variants
that affect functions of the genes are studied. The consequences of variants depend on the affected regions; they may be deleterious, implicated in health conditions and diseases like cancers, or may lead to the appearance of a new strain of bacteria or viruses that is more infectious and lethal, like the recent variants of SARS-CoV-2, or a more antibiotic-resistant strain of bacteria. This is why variant discovery using sequencing data has gained importance and is widely used in genetics, medical diagnosis, and drug discovery.
Sequencing depth, paired-end sequencing, and the use of long reads make variant
detection more accurate and allow detection of large-scale variants like structural vari-
ants, insertions, and deletions.
The variant calling pipelines use SAM/BAM files of whole genome, whole transcrip-
tome, or targeted gene sequences to discover the bases in the samples that are different
from the bases on the same locations on the reference genome. Variant calling programs
use two approaches for variant calling. The first approach is used by bcftools and it is based
on consensus sequence which is formed by collapsing the piled-up aligned reads. The sec-
ond approach is used by the recent variant callers like GATK. This approach is based on
haplotypes of the variants that are more likely to be inherited together. GATK 4 is the most
commonly used program for variant calling. It uses an advanced workflow pipeline called
GATK best practice pipeline which leads to the detection of accurate variants.
After variant identification with a variant calling program and filtering, variants can
be annotated by assigning functional information to variants using annotation programs.
REFERENCES
1. Monroe JG, Srikant T, Carbonell-Bejerano P, Becker C, Lensink M, Exposito-Alonso M, Klein
M, Hildebrandt J, Neumann M, Kliebenstein D et al: Mutation bias reflects natural selection
in Arabidopsis thaliana. Nature 2022, 602(7895): 101–105.
2. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter
G, Marth GT, Sherry ST et al: The variant call format and VCFtools. Bioinformatics 2011,
27(15):2156–2158.
3. Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D: Benefits and limitations of genome-
wide association studies. Nat Rev Genetics 2019, 20(8):467–484.
4. Martincorena I, Campbell PJ: Somatic mutation in cancer and normal cells. Science (New
York, NY) 2015, 349(6255):1483–1489.
5. Population genetics [https://www.nature.com/subjects/population-genetics]
6. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T,
McCarthy SA, Davies RM et al: Twelve years of SAMtools and BCFtools. Gigascience 2021,
10(2).
7. Pegueroles C, Mixão V, Carreté L, Molina M, Gabaldón T: HaploTypo: a variant-calling pipe-
line for phased genomes. Bioinformatics 2020, 36(8): 2569–2571.
8. Yun T, Li H, Chang PC, Lin MF, Carroll A, McLean CY: Accurate, scalable cohort variant
calls using DeepVariant and GLnexus. Bioinformatics 2021, 36(24): 5582–5589.
9. Rivas MA, Graham D, Sulem P, Stevens C, Desch AN, Goyette P, Gudbjartsson D, Jonsdottir
I, Thorsteinsdottir U, Degenhardt F et al: A protein-truncating R179X variant in RNF186
confers protection against ulcerative colitis. Nat Commun 2016, 7:12342.
10. Stitziel NO, Stirrups KE, Masca NG, Erdmann J, Ferrario PG, König IR, Weeke PE, Webb TR,
Auer PL, Schick UM et al: Coding Variation in ANGPTL4, LPL, and SVEP1 and the Risk of
Coronary Disease. N Engl J Med 2016, 374(12):1134–1144.
11. Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic
Acids Res 2003, 31(13):3812–3814.
12. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM:
A program for annotating and predicting the effects of single nucleotide polymorphisms,
SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin)
2012, 6(2):80–92.
13. Yang H, Wang K: Genomic variant annotation and prioritization with ANNOVAR and wAN-
NOVAR. Nat Protoc 2015, 10(10):1556–1566.
Chapter 5
RNA-Seq Data Analysis
The coding region of a gene consists of codons that are translated into amino acids. The major difference
between prokaryotic and eukaryotic genes is that the coding regions of a eukaryotic gene
(also known as exons) are found between the non-coding regions (known as introns)
that make gene expression in eukaryotes more complex. In eukaryotes, once mRNA
is transcribed, the introns are removed by splicing and the final mRNA transcript will
form an open reading frame (ORF) that can be translated into a polypeptide. A single
eukaryotic gene may code for several transcripts by so-called alternative splicing, producing multiple proteins called isoforms. In eukaryotic mRNA, introns are spliced out at
specific splicing sites found at the 5′ and 3′ ends of introns. Most commonly, the intron
sequence that is spliced out begins with the dinucleotide GT (donor splice site) at its 5′
end and ends with AG (acceptor splice site) at its 3′ end [1]. Very rarely, splice sites may be found at AT (5′ end) and AC (3′ end) of an intron, and also at GC (5′ end) and AG (3′ end) of an intron [2]. On the other hand, a prokaryotic gene consists of a continuous
sequence of codons that makes gene expression in prokaryotes simpler. A typical func-
tional promoter region of a eukaryotic gene consists of the transcription start site (TSS,
where Polymerase II binds), a core promoter called TATA box (around 40 nucleobases
upstream from the transcription start site), upstream promoters (different in sequence
from a gene to another), and the enhancer (may be found thousands of nucleobases
upstream or downstream). TATA box is consistent among eukaryotic species; it is the
site where transcription factors (TATA box-binding proteins or TBP) bind to form a
complex that is necessary for transcription to take place. The upstream promoters can bind to different types of proteins to activate the gene. The enhancer binds to special types of proteins that bend the DNA sequence so that the enhancer can contact the protein complex on the promoter region.
The prokaryotic genes are organized into blocks called operons. Genes in a single operon
are regulated and expressed together. An operon contains all genes that encode proteins
which carry out together a specific function. For instance, the bacterial genes required for
lactose as energy source are found next to each other in the lactose operon (lac operon).
The gene structure and regulation of prokaryotic genes is different from that of eukaryotic
ones. There is no intron in prokaryotic genes and an operon, which contains several genes,
has a single promoter controlling the gene expression of all genes in that operon. There are
three types of proteins (repressors, activators, and inducers) that regulate expression of
genes in an operon.
In general, the normally functioning cells of a living organism must be able to control
gene transcription by turning on and off genes based on the need of the cells for protein
synthesis and other functions. The process of turning on a gene to be transcribed to syn-
thesize functional gene products is known as the gene expression. The activity of a gene is
measured by the amount of its transcript (mRNA). Laboratory techniques like northern
blot, serial analysis of gene expression (SAGE), quantitative PCR (qPCR), and DNA micro-
array have been used to study gene expression. However, these techniques can be replaced these days by high-throughput RNA sequencing (RNA-Seq), which provides more information than any of the other techniques.
RNA-Seq allows us to determine which genes are highly transcribed and which are lowly transcribed or even not transcribed at all. The RNA-Seq count data is
used as an alternative to microarray data in eQTL analysis. QTL analysis is a statistical
method that links phenotypic data (trait measurements) and genotypic data (markers usu-
ally SNPs) in an attempt to explain the genetic basis of variation in complex traits. On the
other hand, eQTL analysis links markers (genotype) with gene expression levels measured
in a large number of individuals and the data is modeled using generalized linear models.
RNA-Seq is a powerful tool for detecting alternative splice patterns, which are important
to understand the development of human diseases. Paired-end sequencing provides sequence information from both ends of a fragment and helps in detecting splicing patterns without requiring previous knowledge of transcript annotations. The single-molecule, real-time (SMRT)
sequencing is the core technology powering long-read sequencing that allows examination
of splicing patterns and transcript connectivity in a genome-scale manner by generating
full-length transcript sequences.
RNA-Seq is also used for fusion gene detection. A fusion gene is a gene made by join-
ing two different genes. It is usually created when a gene from one chromosome moves to
another chromosome. The fusion gene is transcribed into mRNA that will be translated
into fusion protein. The fusion proteins implicate usually in some types of cancer includ-
ing leukemia; soft tissue sarcoma; cancers of the prostate, breast, lung, bladder, colon,
and rectum; and CNS tumors. Paired-end RNA-Seq data are usually used for fusion gene
detection [4].
Other kinds of RNA-Seq applications include integration of RNA-Seq data analysis with
other technologies.
The library preparation of the mRNA is similar to that of DNA. However, mRNA must
be separated from other types of RNA by enrichment technique which uses either PCR
amplification or the depletion of the other types of RNA. The RNA must be converted
into complementary DNA (cDNA) by reverse transcription before library preparation. As
DNA library, the cDNA library preparation involves fragmentation and adaptor ligation
to each end of the fragments. The cDNA fragments then are sequenced with the sequenc-
ing machine and the sequencing can either be single end (forward strand only) or paired
end (forward and reverse strands). The sequencing generates sequence data in a form of
reads in FASTQ files. Those reads are the sequenced fragments of the expressed genes in
the sample.
Publicly available RNA-Seq raw data can be downloaded and used for research purposes or for learning. Most RNA-Seq raw sequence data are in the FASTQ file format. When analyzing RNA-Seq data, we must pay attention
to the design of the study. For instance, if the purpose is the differential gene expression,
there must be control raw data that we can use for comparison. The control raw data is
determined by the research goal. In conditions like cancers, researchers may use sequenc-
ing raw data of healthy tissue as control against the raw data of the affected tissue and
both from the same individual. However, researchers may also intend to compare gene
expression across individuals or samples. Most researchers include replicate samples in the
design of their study, and thus, there will be multiple raw data for a single sample. Replicate
samples will reduce errors generated by the laboratory technique used and also the possible
errors generated during the sequencing steps.
For practicing, we will use RNA-Seq raw data of a breast cancer study for differential
gene expression in tumor cells. The data is in six FASTQ files (three replicates for tumor
and three replicates for normal) containing paired-end reads of the size 151 bases. For the
sake of simplicity, the files include only the RNA-seq reads of chromosome 22. The data
was adapted to be as simple as possible, so its processing does not take too much time. To
keep the files organized, create a main directory “rnaseq” to be as the project directory and
create inside it the subdirectory “fastq”, and then, inside this subdirectory, download the
raw data from “https://github.com/hamiddi/ngs”. To avoid repetition, assume that the raw
data files have been cleaned from adaptors, duplicates, and the low-quality reads.
Read mapping is the process of mapping reads to a reference genome, producing a SAM/BAM file that contains
the mapping information. Refer to Chapter 2 for the read mapping and the content of
SAM/BAM files. When dealing with RNA-Seq data, we can either align the reads to a
reference genome or a reference transcriptome. When we align RNA-Seq to a eukaryotic
reference genome, we must use an aligning program like STAR that is able to detect the
splice junctions. The reads in this case will map to the exons leaving introns and other
non-coding regions of the genome uncovered. On the other hand, when aligning RNA-Seq
reads to a reference transcriptome, the aligned reads may cover the entire sequence. This
strategy is preferable when reads are very short (less than 50 bases). The downside of align-
ing reads to a transcriptome is that we may miss some novel genes since the transcriptome
is made up of only known transcripts. As discussed in Chapter 2, there are several align-
ers; however, for RNA-Seq, we prefer to use a splice-aware aligner that is able to introduce
long gaps to span introns when aligning reads to a reference genome. The commonly used
aligners for RNA-Seq data include STAR [5], segemehl [6], GEM [7], BWA [8], BWA-MEM
[8], and BBMap [9].
Before deciding on which of the aligners to use with RNA-Seq reads, make sure that the
aligner is splicing-aware and able to distinguish between reads aligned across exon–intron
boundaries and reads with short insertions [10]. The splicing-aware aligners include STAR
[5], GSNAP [11], MapSplice [12], RUM [13], and HISAT2 [14]. Each of these aligners has
different advantages and disadvantages in terms of memory efficiency, performance, and
speed. Refer to the user guide of any of these aligners to learn more about them. We will
use STAR (Spliced Transcripts Alignment to a Reference) as an example aligner for align-
ing RNA-Seq data. Several studies found that STAR is one of the most accurate aligners of
RNA-Seq reads [15]. However, STAR requires a large memory for indexing and mapping.
The reference sequence must be indexed by STAR before alignment. STAR begins the mapping
process by searching, for each read, for the longest portion that exactly matches one or more
locations on the reference sequence. For partially aligned reads, STAR then attempts to align the
unmapped portion to a different region. Those parts of the reads that align to different locations
of the reference sequence are called seeds. If STAR does not find an exact match for part of a
read on the reference sequence, the alignment will be extended by allowing mismatches or
inserting gaps. If the extension does not give a good alignment, that part will be removed
(soft-clipped). In the second step of the STAR alignment
process, multiple seeds will be clustered based on proximity to a set of anchor seeds. The
clustered seeds are stitched together based on the best alignment score [5].
When reads are mapped to a reference sequence, the percentage of mapped reads reflects
the quality of the alignment; a low percentage may indicate contamination or other problems
with the sample or library. Read coverage and depth on exons are other factors that determine
alignment quality.
Above, we have downloaded the FASTQ files in the directory “fastq”. We can map reads
in the FASTQ files to a reference genome using STAR program. STAR is a short read aligner
designed to align RNA-Seq reads to a reference sequence (genome or transcriptome). For
aligning reads in the FASTQ files using STAR, we need to download a reference sequence
together with its annotation file in GTF format. Since our example FASTQ files are from
human samples, we need to download the latest human genome and its annotation file.
To keep the files organized, we will create two subdirectories in our project directory:
“refgenome” where we will store the reference genome and “gtf” where we will save the
GTF annotation file.
The sequences of reference genomes and annotation are available in many sequence data-
bases such as Ensembl, UCSC, and NCBI genome database. iGenomes built by Illumina
has facilitated the process of downloading the reference data for the frequently analyzed
organisms. Genome builds in FASTA and their annotation in GTF/GFF files from the
above major databases are available for download. The iGenomes website that includes
the download links is available at “https://support.illumina.com/sequencing/sequencing_
software/igenome.html”. Reference data can also be downloaded from “https://hgdown-
load.soe.ucsc.edu/goldenPath/hg38/bigZips/”. For aligning with STAR, we will download
the UCSC human reference genome sequence in FASTA and gene annotation in GTF file
because the chromosomes are indicated by names rather than accession numbers. While
you are in the main directory “rnaseq”, run the following bash script to create the subdi-
rectories and to download the human reference genome and its gene annotation:
mkdir refgenome
wget \
    -O "refgenome/hg38.fa.gz" \
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
gzip -d refgenome/hg38.fa.gz
mkdir gtf
wget \
    -O "gtf/hg38.ncbiRefSeq.gtf.gz" \
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz"
gzip -d gtf/hg38.ncbiRefSeq.gtf.gz
Mapping reads to a reference genome using STAR is a two-step process: creating a refer-
ence sequence index and then mapping reads to the reference sequence.
The following command creates the STAR index for the reference genome sequence.
The “--runThreadN” specifies the number of processors to use, “--genomeDir” specifies
the directory where the index files will be saved, “--genomeFastaFiles” and “--sjdbGTFfile”
specify the directories of the reference genome file and annotation file, respectively, and
“--sjdbOverhang” specifies the read length -1 (read length minus one).
mkdir indexes
STAR --runThreadN 4 \
--runMode genomeGenerate \
--genomeDir indexes \
--genomeFastaFiles refgenome/hg38.fa \
--sjdbGTFfile gtf/hg38.ncbiRefSeq.gtf \
--sjdbOverhang 150
For indexing the human genome, STAR may require more than 32GB of RAM and it may
take a long time depending on the computer RAM and the number of processors used.
Once the index has been created, you can align the reads in the FASTQ files to the reference
genome sequence. Since the practice sequence data is paired end, there are two FASTQ files
(r1 and r2) for each sample. STAR requires these two files as inputs in addition to the refer-
ence genome index we created above. Instead of running STAR for each sample, we can use
bash script as follows to align the reads of the sample files in “fastq” directory:
mkdir starout
mkdir bam
cd fastq
for i in $(ls *.gz | rev | cut -c 13- | rev | uniq);
do
STAR --runThreadN 4 \
--runMode alignReads \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand zcat \
--genomeDir ../indexes \
--outFileNamePrefix ../starout/${i} \
--readFilesIn ${i}_r1.fastq.gz ${i}_r2.fastq.gz
done
cd ../starout
for f in $(ls *.bam | rev | cut -c 30- | rev);
do
mv ${f}Aligned.sortedByCoord.out.bam ../bam/${f}.bam
done
cd ..
In the above, first we created a directory “starout” for STAR output and “bam” for the BAM
files. The Linux bash command "$(ls *.gz | rev | cut -c 13- | rev | uniq)" lists the names of
the FASTQ files found in the directory and cuts out the common name (ID) shared by the files of
each sample. The for loop provides that common name as the variable "${i}" to assemble the
FASTQ file names and the output file name prefix for the STAR command each time until
all FASTQ files are processed. The output includes alignment log files and a BAM file for each
sample (six BAM files for the example data). The second for loop, run inside the "starout"
directory, renames the long BAM file names and moves the files to the "bam" subdirectory.
The bash scripting comes in handy whenever there is a task that needs to be repeated
multiple times.
The STAR output log files provide important statistics about the read alignment. You
can refer to STAR manual to read about the output log files.
Before moving to the next step, you need to index the BAM files using “samtools index”
command as follows:
#index bam
cd bam
for i in $(ls *.bam);
do
samtools index ${i}
done
cd ..
As shown in Figure 5.2, the alignment statistics of one of the BAM files show the total
number of reads, the average read length, the number and percentage of uniquely mapped
reads, splicing statistics, statistics of the reads mapped to multiple loci, statistics of the
unmapped reads, and chimeric reads. Pay attention to the reads mapped to multiple loci
and the chimeric reads; when their number is large, that indicates a low-quality alignment.
Remember that this BAM file includes the alignments of chromosome 22 only. The number of
reads will be much larger if the BAM file contains the alignments of all chromosomes.
In addition to the statistics in the STAR log files, there are a variety of programs for
assessing alignments in BAM files. Examples of those programs include Qualimap [16],
RNA-seQC [17], and RSeQC [18]. Those programs compute metrics for RNA-Seq data,
including per-transcript coverage, junction sequence distribution, genomic localization of
reads, 5′–3′ bias, and consistency of the library protocol. As an example, you can download
and use Qualimap to obtain an overall view about the alignment quality on an HTML for-
mat. You can download Qualimap from “http://qualimap.conesalab.org/” and unzip it in
your project directory. Run Qualimap for each sample and study the reports carefully. The
following script is an example of how to use it:
mkdir qc
qualimap_v2.2.1/qualimap rnaseq \
-outdir qc \
-a proportional \
-bam bam/norm_rep1.bam \
-p strand-specific-reverse \
-gtf gtf/hg38.ncbiRefSeq.gtf \
--java-mem-size=8G
The above script creates the directory “qc” where the Qualimap output files will be saved.
The program takes a BAM file and the reference annotation file as inputs and generates
an HTML report that includes summary statistics about read alignments, reads genomic
origin, transcript coverage profile, splice junction analysis, and figures about read genomic
origins, coverage profile along genes, coverage histogram, and junction analysis. As a biol-
ogist, you may need to study these metrics to have a general idea about the sample align-
ment before proceeding.
5.3.4 Quantification
Gene profiling, or studying gene expression, is centered on the quantification of aligned
reads per gene or locus. Quantification of reads begins by counting the number of reads
aligned to each gene annotated on the reference sequence. Given a BAM
file with aligned RNA-Seq reads and a list of genomic features in an annotation file (GTF
format), the task of the read counting program is to count the number of reads mapping
to each feature. In general, a feature, in this case, is a gene which represents a transcript
or unions of exons of a gene for eukaryotic organisms. Some programs can also consider
exons as features. This is especially useful for checking alternative splicing in the eukary-
otic genes. A read in the BAM file may map to a single feature (unique) or may map or
overlap with multiple features (non-unique). A read that maps or overlaps with multiple
features is considered as ambiguous and will not be counted. Only reads mapping unam-
biguously to a single feature are counted.
For RNA-Seq read counts, we can use a variety of programs but the most used free open-
source programs include HTSeq-count [19] (Python-based program) and FeatureCounts
[20], which is used as a Linux command-line program or as a function in the R package
Rsubread. Both of these programs require BAM files and a GTF file (reference annotation file)
as inputs.
In the following, we will use HTSeq-count to count RNA-seq reads aligned to the genes
in chromosome 22. Install HTSeq-count by following the installation instructions avail-
able at “https://htseq.readthedocs.io/en/master/install.html”. The following HTSeq-count
command counts aligned reads in all BAM files stored in the “bam” subdirectory and saves
the output in “features/htcount.txt”. The “-m union” option specifies the union mode to
handle reads overlapping more than one feature. We used “--additional-attr=transcript_
id” to add transcript accession numbers to the output.
mkdir features
htseq-count \
-m union \
-f bam \
--additional-attr=transcript_id \
-s yes bam/*.bam \
gtf/hg38.ncbiRefSeq.gtf \
> features/htcount.txt
sed '/^__/ d' < features/htcount.txt > features/htcount2.txt
The “sed” command has been used to remove the last rows that begin with “__” and to save
that change in a new file “htcount2.txt”, which we will use in the next step of the analysis.
The HTSeq-count output file contains the feature count for each sample as shown in
Figure 5.3. The feature count file includes tab-delimited columns for gene symbols, tran-
script IDs, and a count column for each sample. We can notice that some genes have zero
reads aligned to them. Later, we will filter out the genes that have no aligned reads and those
with low coverage.
5.3.5 Normalization
In general, when we analyze gene expression data, we may need to normalize it to avoid
some biases that may arise due to the gene lengths, GC contents, and library sizes (the total
number of reads aligned to all genes in a sample) [21]. The normalization of count data is
important for comparing between expression of genes within the samples and between
different samples. The normalized gene length fixes the bias that may affect within-sample
gene expression comparison. It is known that a longer gene would have a higher chance to
be sequenced than a shorter gene. Consequently, a longer gene would have a higher number
of aligned reads than a shorter one at the same gene expression level in the same sample.
The GC content also affects within-sample comparison of gene expressions. The GC-rich
and GC-poor fragments tend to be under-represented in RNA-Seq sequencing, and hence,
the gene with the GC content closest to 40% would have higher chance to be sequenced
[22]. The library size affects the comparison between the expressions of the same gene in
different samples (between-sample effect).
There are several normalization methods for adjusting the biases resulting from the
above-mentioned causes. Choosing the right normalization method depends on
whether the comparison is within-sample or between-sample. In the following, we will
discuss some of the normalization methods used by gene expression analysis programs
such as edgeR and DESeq2 [23].
For gene $i$ with read count $k_i$, gene length $l_i$ (in bases), and a total of $N$ reads aligned to the reference sequence in the sample, the reads per kilobase per million (RPKM) is

$$\mathrm{RPKM}_i = \frac{k_i}{\left(\dfrac{l_i}{10^3}\right)\left(\dfrac{N}{10^6}\right)} \qquad (5.1)$$
In the denominator of Formula 5.1, the gene length ($l_i$, in bases) is divided by 1,000 to
express it in kilobases, and the total number of reads aligned to the reference sequence ($N$) is
divided by 1,000,000 (one million). When the number of reads aligned to a gene is divided by this
denominator, we obtain the reads per kilobase per million, or RPKM. The RPKM unit is
for single-end reads. However, for paired-end reads, both forward and reverse reads are
aligned, and thus, “fragment” is used instead of “read” and the normalized unit of gene
expression in this case is FPKM (fragment per kilobase per million).
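As a quick worked example of Formula 5.1, with assumed numbers: a gene of length 2,000 bases with 500 aligned reads in a library of 10 million aligned reads has RPKM = 500/((2,000/1,000) × (10,000,000/1,000,000)) = 500/(2 × 10) = 25.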
The transcripts per million (TPM) for gene $i$ normalizes the read count by the gene length and by the sum of the length-normalized counts of all genes $j$ in the sample:

$$\mathrm{TPM}_i = \frac{k_i}{l_i}\times\frac{1}{\sum_j \dfrac{k_j}{l_j}}\times 10^6 \qquad (5.2)$$

The reads per million (RPM), also called counts per million (CPM), normalizes only for the library size $N$:

$$\mathrm{RPM}_i = \frac{k_i}{N}\times 10^6 \qquad (5.3)$$
The trimmed mean of M-values (TMM) method starts from the gene-wise log-fold change between a sample $j$ and a reference sample $r$:

$$M_g = \log_2\!\left(\frac{Y_{gj}/K_j}{Y_{gr}/K_r}\right) \qquad (5.4)$$
where $Y_{gj}$ is the observed read count of the gene $g$ of interest in the sample $j$, $K_j$ is the
library size of sample $j$ (total number of aligned reads), and $Y_{gr}$ is the observed read count of
the same gene $g$ in the reference sample $r$ of library size $K_r$.
Then, the value of the gene expression fold change, $M_g$, is trimmed by 30%, followed
by taking the weighted average of the trimmed $M_g$ using the inverses of the variances of the gene read
counts ($V_{gi}$) as weights, since the log-fold changes from genes with larger read counts
have lower variance on the logarithm scale. The TMM adjustment, $f_j$, is given as
$$f_j = \frac{\sum_{i}^{n} V_{gi} M_{gi}}{\sum_{i}^{n} V_{gi}} \qquad (5.5)$$
The adjustment $f_j$ is an estimate of the relative RNA production of two samples. The TMM
normalization factor for the sample $j$ with $m$ genes is given by

$$N_j = f_j \sum_{g=1}^{m} Y_{gj} \qquad (5.6)$$
TMM does not correct the observed read counts for the gene length, and hence, it is not
suitable for comparison between the gene expressions in the same sample.
The relative log expression (RLE) method used by DESeq2 computes the normalization factor for sample $j$ as the median, across genes, of the ratio of the gene count to the geometric mean of that gene's counts over all $n$ samples:

$$N_j = \operatorname{median}_{g}\,\frac{Y_{gj}}{\left(\prod_{r=1}^{n} Y_{gr}\right)^{1/n}} \qquad (5.7)$$
The RNA-Seq study design can be as simple as a single factor with two levels (e.g., treated and untreated,
or healthy and diseased), but it can also be a complex study that includes more than a
single factor (factorial design). Once the study design has been determined and recorded as metadata,
inferential statistics is used to identify which genes show a statistically significant change in
expression compared to the same genes in a reference sample. The fundamental step in
the differential expression analysis is to model the association between gene counts (Y)
and the covariates (conditions) of interest (X). The number of replicates is crucial for the
statistical differential analysis. Most of the time, the number of replicates in an RNA-Seq
study is small. Instead of non-parametric statistical analysis, most programs for RNA-Seq
data analysis use generalized linear models (GLMs) by assuming that the count data fol-
lows a certain statistical distribution. That approach also assumes that each RNA-Seq read
is sampled independently from a population of reads and the read is either aligned to the
gene g or not. When the read is aligned to the gene g, we call this a success; otherwise, it
is a failure. A process of random trials with two possible outcomes (success or failure) is
called a Bernoulli process. Thus, according to probability theory, the number of reads
(successes), Yg , for a given gene g from sample j follows a binomial distribution.
Assume $Y_{gj}$ is the number of reads sequenced for gene $g$ from sample $j$, $n_j$ represents the number of
independent trials in the Bernoulli process, $\pi_{gj}$ is the probability of success (a read is aligned
to the gene $g$ in sample $j$), and $1-\pi_{gj}$ is the probability of failure.
Assume also that gene $g$ in sample $j$ has length $l_g$ and read count $Y_{gj}$; all possible positions
in $g$ that can produce a read can then be described by $Y_{gj}\, l_g$ [30].
Thus, the probability of success, $\pi_{gj}$, is given as
$$\pi_{gj} = \frac{Y_{gj}\, l_g}{\sum_{g=1}^{G} Y_{gj}\, l_g} \qquad (5.9)$$
The expected number of reads for gene $g$ in sample $j$ is then

$$\mu_{gj} = n_j \times \pi_{gj} \qquad (5.10)$$

and the probability of observing $y_{gj}$ reads for a given gene is given by the binomial probability

$$P\left(Y_{gj} = y_{gj}\right) = \binom{n_j}{y_{gj}}\,\pi_{gj}^{\,y_{gj}}\left(1-\pi_{gj}\right)^{n_j - y_{gj}} \qquad (5.11)$$
However, since RNA-Seq count data involve a very large number of reads and the probability that
any particular read aligns to a given gene is very small, the Poisson distribution is more
appropriate than the binomial distribution, provided that the mean of a gene's read counts equals
its variance, as the Poisson distribution assumes.
$$Y_g \sim \mathrm{Poi}\left(\lambda_g\right) \qquad (5.12)$$

where $\lambda_g$ is the Poisson parameter, which represents the rate of change in the count of the
gene $g$ in a sample.
The Poisson distribution assumes that the rate of change is equal to the mean, $\lambda_g$, which
is also equal to the variance.

$$p\left(Y_g = y_g;\lambda_g\right) = \frac{\lambda_g^{\,y_g}\, e^{-\lambda_g}}{y_g!} \qquad (5.14)$$
Modeling RNA-Seq count data with the Poisson distribution requires that the mean equal
the variance. A key challenge is the small number of replicates in typical RNA-Seq
experiments (two or three replicates per condition). Therefore, inferential methods
that treat each gene separately may suffer from a lack of power, due to
the high uncertainty of within-group variance estimates. This challenge can be overcome
either by grouping the count data and then calculating the variance and the
mean in each group, or by pooling information across genes by assuming that the variances of
different genes measured in the same experiment are similar. In general, RNA-Seq
count data suffer from over-dispersion, where the variance is greater than the mean. A
variety of software packages use different techniques for modeling RNA-Seq count
data, but most of them use the quasi-Poisson, negative binomial, or quasi-negative binomial
distribution, which handle over-dispersed data.
The quasi-Poisson is similar to the Poisson distribution, but the variance is linearly related
to the mean of the counts [31]:

$$Y_g \sim \mathrm{qPoi}\left(\mu_g,\theta_g\right) \qquad (5.15)$$

The negative binomial distribution instead models the counts with a gene-specific dispersion:

$$Y_g \sim \mathrm{NB}\left(\mu_g,\alpha_g\right) \qquad (5.17)$$

where $\alpha$ is the dispersion parameter and $P$ is an integer, but commonly we use $P = 2$ (NB2
or the quadratic model), so that the variance is

$$\mathrm{var}\left(Y_g\right) = \sigma_g^2\left(\mu_g + \theta\mu_g^2\right) \qquad (5.20)$$
The RNA-Seq study design may include one or several conditions, called factors. A
researcher is usually interested in testing the effect of a condition. For instance,
assume that a researcher wants to study breast cancer in women. She conducted an RNA-
Seq study on samples from healthy and cancer tissues of five affected women. The analysis
programs require a matrix that describes the design called a design matrix. The design
matrix defines the model (structure of the relationship between genes and explanatory
variables), and it is also used to store values of the explanatory variable [32]. The design
matrix will be created from the study metadata as shown in Table 5.1.
The design matrix will include dummy variables setting the level of each factor to either
zero or one as we will see soon.
The generalized linear model will fit the data of this study design so that the expression
of each gene will be described as a linear combination of the dummy explanatory variables.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \qquad (5.21)$$

where $y$ is the response variable that represents the gene expression in a specific unit, $\beta_0$
is the intercept or the average gene expression when the other parameters are zero, and $\beta_1$
and $\beta_2$ are the generalized linear regression parameters that represent the effect of each
explanatory variable. A log-linear model is used as

$$\log \mu_{gi} = X_i^T \beta_g + \log N_i \qquad (5.22)$$

where $X_i^T$ is a vector of covariates (explanatory variables) that specifies the conditions/
factors applied to sample $i$, $\beta_g$ is a vector of regression coefficients for the gene $g$, and
$N_i$ is the library size of sample $i$, so that the mean count is

$$\mu_{gi} = N_i\, p_{gi} \qquad (5.23)$$
For differential expression analysis using the negative binomial regression, the parameters
of interest are the relative abundances of the genes ($p_{gi}$).
As we have discussed above, the negative binomial distribution reduces to the Poisson
distribution when the count data are not dispersed ($\phi_g = 0$), or to the quasi-Poisson
distribution if the variance is linearly related to the mean. EdgeR estimates the dispersion ($\phi_g$)
from the biological coefficient of variation (BCV) between the samples: the dispersion is the BCV
squared, obtained by dividing Formula (5.24) by $\mu_{gi}^2$:

$$\mathrm{CV}^2 = \frac{1}{\mu_{gi}} + \phi_g \qquad (5.25)$$
EdgeR calculates the common dispersion for all genes and it can also calculate gene-wise
dispersions and then it shrinks them toward a consensus value. Differential expression is
then assessed for each gene using an exact test for over-dispersed data [33].
In the following, we will analyze the non-normalized count data obtained by the HTSeq-
count program in the previous step and saved in the "features" directory (we will use the
cleaned file "htcount2.txt"). The analysis will be carried out in R; therefore, R must be installed
on your computer. Instructions for installing R are available at "https://cran.r-project.org/". You
can also run R through Anaconda. We assume that you have R installed and running. In R, you
will also need to install the limma and edgeR Bioconductor packages by following the
installation instructions available at "https://bioconductor.org/packages/release/bioc/html/edgeR.html"
for edgeR and "https://bioconductor.org/packages/release/bioc/html/limma.html" for limma. For
the current versions, open R, and in the R shell, run the following:
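A minimal sketch, assuming the standard BiocManager installation route described on the Bioconductor pages linked above:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("limma", "edgeR"))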
Once you have R, edgeR, and limma installed, you will be ready for the next steps
of the differential analysis, which can be broken down into the following steps.
cd bam
ls *.bam \
    | rev \
    | cut -c 5- \
    | rev > tmp.txt
echo -e "sampleid\tcondition\tpatient" \
    > ../features/sampleinfo.txt
awk -F '_' '{print $1 "_" $2 "\t" $1 "\t" $2}' \
    tmp.txt >> ../features/sampleinfo.txt
rm tmp.txt
cd ../features
This script creates the sample info file from the BAM file names and saves it in the “fea-
tures” subdirectory together with the read count data. For your own data, you may need
to modify this script, or you can create the file manually or with other Linux bash commands.
Then, open R, set the "features" directory as the working directory, and
load both the limma and edgeR packages.
library(limma)
library(edgeR)
Load both the count data file and the sample info file into the R session as data frames:
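A minimal sketch, mirroring the read.delim calls used in the vidger script later in this chapter:

seqdata <- read.delim("htcount2.txt", stringsAsFactors=FALSE)
sampleinfo <- read.delim("sampleinfo.txt", stringsAsFactors=FALSE)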
Run the following command to display the first rows of the count data frame:
head(seqdata)
You will notice that the first two columns are the gene symbol and the transcript IDs. The
other six columns contain the read counts. In the next step, we need to separate the count
columns into a different data frame called "countdata" and then we need to add column
names and row names to that count data frame. The row names can be the transcript IDs
and the column names can be the sample IDs as listed in the sample info file. The sample
IDs in the sample info file must be in the same order as the columns of the read counts in
the read count file.
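A sketch of this first step, mirroring the "Preparing data" block of the vidger script shown later in this chapter:

countdata0 <- seqdata[,-(1:2)]
head(countdata0)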
The “head” function will display the first rows of the “countdata0” data frame. You can
notice that, as shown in Figure 5.4, the data frame is without row names and that the col-
umn names do not indicate the sample names. You can also notice that there are numer-
ous rows with all columns being zeros. This is mainly because we have aligned reads for
chromosome 22 only.
The second step is to make the gene symbols the row names and the sample IDs the col-
umn names of the "countdata0" data frame, and to remove the rows that are zero for all samples
(Figure 5.5).
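A sketch of these two steps, again mirroring the later vidger script:

rownames(countdata0) <- seqdata[,1]
colnames(countdata0) <- sampleinfo$sampleid
countdata <- countdata0[rowSums(countdata0[])>0,]
head(countdata)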
After creating a count data frame as in Figure 5.5, the next step is to create a DGEList
object to hold the read counts that will be analyzed by edgeR. The DGEList object is a
container for the count data and the associated metadata, including sample names, group,
library size, and normalization factors. The DGEList for our example data is created
by the following:
group = factor(sampleinfo$condition)
y <- DGEList(countdata, group=group)
Figure 5.6 shows the created DGEList object. At this time, it holds two slots: counts and
samples. The counts slot contains the count data, and the samples slot contains the sample
info (sample id, group, library size (lib.size), and normalization factors (norm.factors), initially
set to 1). These values will be updated soon, and more slots will be added as well.
FIGURE 5.4 The count data frame without row names and column names.
FIGURE 5.5 The data frame after adding row and column names and removing rows with all zeros.
5.3.7.2 Annotation
The row names of the count data frame are the gene symbols as shown in Figure 5.5. For
some of the downstream analysis, we may need the count data to be annotated with the
NCBI Entrez IDs and full gene names which are not included in the count dataset at this
point. To add these annotations to the DGEList object (y), we need to make the Entrez IDs
as the row names instead of the gene symbols. To obtain the Entrez IDs and gene names,
we need to install and load the “org.Hs.eg.db” Bioconductor package, which is a genome-
wide annotation for human based on mapping using Entrez Gene identifiers [34]. You can
install and load this package by running the following script at the R prompt:
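A minimal sketch of that script, assuming the standard Bioconductor installation:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)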
The available annotation column names are displayed with the following (Figure 5.7):
columns(org.Hs.eg.db)
Each of these annotation columns has a row value corresponding to the gene annotated
in the reference sequence. We can select the annotation columns that we need and add
an annotation slot with the selected columns to the DGEList object. The following script
creates a vector of the Entrez IDs mapped to the gene symbol on the counts data, makes
the Entrez IDs as the row names, selects annotation columns mapped to the count data,
adds the annotation as a slot to the DGEList object, and finally removes any row without
an Entrez ID:
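A sketch of that script, based on the annotation block of the vidger script shown later in this chapter, with the final NA-removal step added as an assumption:

ENTREZID <- mapIds(org.Hs.eg.db, rownames(y),
    keytype="SYMBOL", column="ENTREZID")
rownames(y$counts) <- ENTREZID
ann <- select(org.Hs.eg.db, keys=rownames(y$counts),
    columns=c("ENTREZID","SYMBOL","GENENAME"))
y$genes <- ann
# Remove rows whose gene symbol could not be mapped to an Entrez ID (assumed step)
y <- y[!is.na(rownames(y$counts)), ]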
Figure 5.8 shows the annotation slot “genes” that includes Entrez IDs, gene symbols, and
gene names mapping to the count data “counts”.
The next step is to create the design matrix using the sample information in the "sampleinfo.txt" file that we have created
above. In EdgeR, the design matrix can be defined with or without an intercept. The inter-
cept is used when there is a reference for the differential expression analysis. When the
design matrix is defined without an intercept, the differential analysis can be performed
by using a contrast as we will do. In the following, we define a design matrix without an
intercept (Figure 5.9):
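A minimal sketch of that command, assuming model.matrix is used so that the resulting column names ("conditionnorm" and "conditiontumo") match the contrast constructed later with makeContrasts:

design <- model.matrix(~0 + condition, data=sampleinfo)
design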
This design matrix defines two dummy variables representing the levels of the condition
studied (1 if the sample belongs to that condition and zero otherwise). When we fit the negative
binomial generalized log-linear model described in Formula 5.22, two coefficient estimates will be
calculated, one for each dummy variable.
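The filtering command itself does not appear at this point in the text; a minimal sketch using edgeR's filterByExpr function (whether the author used exactly this call is an assumption), which keeps sufficiently expressed genes and recomputes the library sizes:

keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes=FALSE]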
As shown in Figure 5.10, after filtering, the counts slot contains only genes with sufficient
abundance and the library size in the samples slot has been adjusted. Notice the difference
in the number of genes and library size between Figures 5.10 and 5.6. The new counts slot
contains only 133 genes compared to 632 genes before filtering and the library sizes have
been adjusted to reflect the new ones.
FIGURE 5.10 DGEList object after filtering out genes with low gene expression.
5.3.7.5 Normalization
After filtering the low-expressed genes, we can normalize the count data. EdgeR uses TMM
to compute a normalization factor that corrects sample-specific biases. Without normaliza-
tion, if only a few genes are highly expressed, those genes will account for a substantial
proportion of the library size for a specific sample, causing other genes to be under-rep-
resented. The normalization factor is multiplied by the library size to yield the effective
library size, which is used for normalization. The following function calculates the TMM
normalization factor:
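A minimal sketch of the call, consistent with the "yNorm" object used in the rest of the chapter (calcNormFactors applies TMM by default):

yNorm <- calcNormFactors(y)
yNorm$samples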
Notice that as shown in Figure 5.11, the normalization factor was changed for each sample.
To model the variability of the counts, edgeR assumes that genes can share a dispersion. Three
types of dispersions are calculated: a common estimate across all genes, a trended dispersion
that follows the mean-variance relationship across genes of similar abundance, and gene-wise
dispersion (tagwise dispersion).
The “estimateDisp(DGEList, design)” function estimates the common, trended, and
tagwise negative binomial dispersions by using weighted likelihood empirical Bayes algo-
rithm [33]. This function requires a DGEList object with normalized counts and a design
matrix.
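A minimal sketch of the call, consistent with the yNorm object used below:

yNorm <- estimateDisp(yNorm, design)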
The three dispersions will be estimated and stored in the DGEList object (yNorm) as shown in
Figure 5.12. You can display the common, trended, and tagwise dispersions by using the
following (Figure 5.13):
yNorm$common.dispersion
head(yNorm$trended.dispersion)
head(yNorm$tagwise.dispersion)
We can plot the BCV against gene abundance (in log2 counts per million) using
“plotBCV(y)” function. As described above, the BCV is the square root of the dispersion
parameter in the negative binomial model.
jpeg('myBCVPlot.jpg')
plotBCV(yNorm, pch=16, cex=1.2)
dev.off()
The above script plots the biological coefficient of variations (based on the three types of
dispersions) against the average abundance of each gene. As shown in Figure 5.14, the plot
shows the square root estimates of the common, trended, and tagwise negative binomial
dispersions.
For RNA-Seq count data, the negative binomial dispersions tend to be higher for genes
with very low counts and the dispersion trend tends to decrease smoothly with the abun-
dance (CPM) increase and becomes asymptotic to a constant value for genes with a larger
abundance.
The following script creates a bar plot of the library sizes (sequencing depths) of the samples:
png(file="libsizeplot.png")
x <- barplot(yNorm$samples$lib.size/1e06,
    names=colnames(yNorm),
    las=2, ann=FALSE,
    cex.names=0.75,
    col="lightskyblue",
    space = .5)
mtext(side = 1, text = "Samples", line = 4)
mtext(side = 2, text = "Library size (millions)", line = 3)
title("Barplot of library sizes")
dev.off()
As shown in Figure 5.15, the library sizes or the sequencing depths of the six samples are
similar. This bar chart gives an idea about the distribution of the library sizes and any
potential source of bias from the library sizes.
In the normalization step, we normalized the count data to eliminate composition biases
between libraries. We can assess the TMM normalization by the MD plot (mean-difference
plot), which displays the library size-adjusted log-fold change (difference) between two
libraries against the average log-expression across these libraries (the mean). The points on
the MD plot should be centered at a line of zero log-fold change if the biases between librar-
ies were removed successfully by the normalization. The “plotMD(y, column=i)” function
creates MD plot by converting the count (y) to log2-CPM values and then creating an
artificial array by averaging all samples other than the sample specified (column=i) in the
function. The function then creates the MD plot from the specified sample and the arti-
ficial sample. The following script creates the MD plots for the six samples (Figure 5.16):
jpeg('mdplots.jpg')
par(mfrow=c(2,3))
for (i in 1:6) {
    plotMD(yNorm, column=i,
        xlab="Average log CPM (all samples)",
        ylab="log-ratio (this vs others)")
    abline(h=0, col="red", lty=2, lwd=2)
}
dev.off()
We can also create boxplots for the unnormalized and normalized log-CPM values to show
the expression distributions in each sample using the following script:
png(file="logcpmboxplot.png")
par(mfrow=c(1,2))
logcounts <- cpm(y, log=TRUE)
boxplot(logcounts, xlab="", ylab="Log2 counts per million", las=2)
abline(h=median(logcounts), col="blue")
title("Unnormalized logCPMs")
logcountsNorm <- cpm(yNorm, log=TRUE)
boxplot(logcountsNorm, xlab="", ylab="Log2 counts per million", las=2)
abline(h=median(logcountsNorm), col="blue")
title("Normalized logCPMs")
dev.off()
Figure 5.17 shows the boxplot of the TMM-normalized data for each sample. The black line
dividing each box represents the median of the count data, the top of the box shows the
upper quartile, and the bottom of the box shows the lower quartile. The top and bottom
whiskers show the highest and lowest count values, respectively, and the circles indicate
the outliers.
We can use multidimensional scaling (MDS) [35] for representing relationships graphi-
cally between samples in multidimensional space and showing the overall differences
between the gene expression profiles for the different samples. The MDS uses the pairwise
dissimilarity Euclidean distances between samples in terms of the leading log-fold change
(logFC) for the genes that best characterize the pair of samples. The leading logFC is cal-
culated as the root-mean-square of the top log2-fold changes between the pair of samples
(the default is the top 500 logFCs, top=500). edgeR uses the "plotMDS" function to create the MDS plot.
The general syntax is as follows:
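The syntax line is not preserved in this copy; a sketch of the general call, based on the arguments of limma's plotMDS function:

plotMDS(y, top = 500, labels = NULL, pch = NULL, col = NULL, dim.plot = c(1, 2))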
The samples are then represented graphically in two dimensions such that the distance
between points on the plot approximates their multivariate dissimilarity. The objects that
are closer together on the MDS plot are more similar than the distant ones. In our example
data, if there is a difference between the normal and tumor samples, then we should see clear
patterns in the multivariate dataset.
install.packages("RColorBrewer")
library(RColorBrewer)
png(file="MDSPlot.png")
pseudoCounts <- log2(yNorm$counts + 1)
colConditions <- brewer.pal(3, "Set2")
colConditions <- colConditions[match(sampleinfo$condition,
    levels(factor(sampleinfo$condition)))]
patients <- c(8, 15, 16)[match(sampleinfo$patient,
    levels(factor(sampleinfo$patient)))]
plotMDS(pseudoCounts, pch = patients, col = colConditions, xlim = c(-2,2))
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2],
    legend = levels(factor(sampleinfo$condition)))
legend("bottomright", pch = c(8, 15, 16),
    legend = levels(factor(sampleinfo$patient)))
dev.off()
As shown in Figure 5.18, the pattern is very clear: the samples are clustered by
condition; normal samples are grouped together, and tumor samples are grouped together,
which indicates that the difference between the two groups is much larger than the dif-
ferences within groups, and the between-group difference is likely statistically
significant. The distance between the normal samples and the tumor samples is about 2
log2-fold changes on the x-axis, i.e., about a four-fold difference.
We can also use a heatmap to cluster the most variable genes in the samples. We expect
that some samples may have similar patterns depending on the given condition (normal
or tumor). The following heatmap script will describe the relationships between samples
using hierarchical clustering:
install.packages("gplots")
library("gplots")
png(file="heatmap1.png")
logcountsNorm <- cpm(yNorm, log=TRUE)
var_genes <- apply(logcountsNorm, 1, var)
select_var <- names(sort(var_genes, decreasing=TRUE))[1:10]
highly_variable_lcpm <- logcountsNorm[select_var,]
mypalette <- brewer.pal(11, "RdYlBu")
morecols <- colorRampPalette(mypalette)
col.con <- c(rep("purple",3),
    rep("orange",3))[factor(sampleinfo$condition)]
heatmap.2(highly_variable_lcpm,
    col=rev(morecols(50)), trace="none",
    main="Top 10 most variable genes",
    ColSideColors=col.con, scale="row",
    margins=c(12,8), srtCol=45)
dev.off()
Figure 5.19 shows that samples are clustered into tumor and normal samples based on the
profiles of the genes in these samples.
Fitting the count data to a negative binomial model generates several results, as shown in
Figure 5.20. For instance, the GLM coefficients are in the "coefficients" slot, the fitted values are in
the "fitted.values" slot, and the estimated quasi-likelihood dispersions are in the "dispersion" slot.
The quasi-likelihood extends the negative binomial to account for gene-specific variability
from both biological and technical aspects. We can visualize the quasi-likelihood disper-
sion with “plotQLDisp” function. The quasi-likelihood gene-wise dispersion estimates are
squeezed toward a consensus trend, which will reduce the uncertainty of the estimates
and improves testing power. The following script creates a quasi-likelihood dispersion plot
showing the raw, squeezed, and trend dispersions (Figure 5.21):
jpeg('qlDispplots.jpg')
fitq <- glmQLFit(yNorm, design)
plotQLDisp(fitq, pch=16, cex=1.2)
dev.off()
Once we have fitted the count data to a GLM log-linear model, we can conduct gene-wise
statistical tests for a given coefficient (coef), or we can use a contrast to
perform the differential analysis. Since our primary goal is to study the difference between the
normal and tumor samples, we can construct a contrast using “makeContrasts” function
and then we can conduct the statistical test using the “glmQLFTest” function as follows:
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmQLFTest(fitq, contrast=my.contrasts)
topTags(qlfq, n=10, adjust.method="BH", sort.by="PValue",
    p.value=0.05)
The “qlfq” is a DGELRT object that stores the results of a GLM-based differential expres-
sion analysis for DGE data. The “topTags” function prints the top ten (n = 10) set of the
most significantly differential genes as shown in Figure 5.22. The p-value threshold is set
to "p.value=0.05" so only genes with a p-value less than 0.05 will be listed. Negative log-
fold changes (logFC) represent genes that are downregulated in tumor
samples relative to normal samples; logCPM is the log counts-per-million; F is the test statistic
for the null hypothesis that there is no difference in gene expression between the normal and
tumor samples; PValue is the significance measure (p-value < 0.05 is significant); and FDR
is the false discovery rate.
To use GLM negative binomial model instead of the quasi-negative binomial model for
the differential expression, you can use the following script:
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitg <- glmFit(yNorm, design)
qlfg <- glmLRT(fitg, contrast=my.contrasts)
topTags(qlfg, n=10, adjust.method="BH", sort.by="PValue",
    p.value=0.05)
Notice that we used contrast to conduct differential analysis to study the effect of tumor
on the gene expression. However, the design may be more complex and EdgeR provides
several ways to construct contrast based on the study design. For instance, you can add
the intercept to the design matrix so the first column on the design will be the reference.
We can also use “coef=” with “glmQLFTest” or “glmLRT” functions to indicate that the
specified coefficients are zero. In more complex design like factorial design, an ANOVA-
like analysis can be conducted. Refer to the edgeR users' guide for more information about
the designs.
EdgeR also provides “glmTreat” function to conduct gene-wise statistical tests for a
given coefficient or a contrast relative to a specified fold-change threshold. For instance,
assume that we need to test the hypothesis that the gene expression in normal cells and
tumor cells is equal at a threshold of 2 log-fold change (lfc=2), assuming that a gene with a
log-fold change less than 2 is equally expressed in both normal and tumor cells.
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmTreat(fitq, contrast=my.contrasts, lfc=2)
topTags(qlfq, n=10, adjust.method="BH", sort.by="PValue",
    p.value=0.05)
Figure 5.23 shows the top ten significantly expressed genes using a log-fold change thresh-
old (lfc=2).
Since the rate of incorrectly rejecting the null hypothesis (false discovery) may be inflated in
the case of multiple testing, the p-values are corrected using the Benjamini–Hochberg method [36],
which orders the hypotheses and then rejects or accepts each hypothesis based on the Benjamini–
Hochberg critical value.
The “decideTestsDGE” function can be used to display the total number of differentially
expressed genes identified at an FDR of 5% (upregulated, downregulated, and non-significant).
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmQLFTest(fitq, contrast=my.contrasts)
DEGenes <- decideTestsDGE(qlfq,
    adjust.method="BH", p.value=0.05, lfc=2)
summary(DEGenes)
Figure 5.24 shows the number of downregulated, nonsignificant, and upregulated genes at
the threshold of 2 log-fold change and 0.05 significance level.
The level of the differential gene expression change can be visualized with the mean-
difference plot (MD plot):
jpeg('mdPlotfitted.jpg')
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmQLFTest(fitq, contrast=my.contrasts)
DEGenes <- decideTestsDGE(qlfq,
    adjust.method="BH", p.value=0.05, lfc=2)
plotMD(qlfq, status=DEGenes,
    values=c(1,0,-1),
    col=c("red","black","blue"),
    legend="topright")
dev.off()
The MD plot in Figure 5.25 shows upregulated (red), downregulated (blue), and nonsignificant
(black) genes in the plane of log-fold change versus average log CPM (abundance). The points
are usually dense when all chromosomes are investigated. Remember that, for the sake of
simplicity, we demonstrate the differential analysis of gene expression on a single chromosome
(chromosome 22). Both upregulated and downregulated genes are affected by the condition
studied, which suggests that they may play a role in the condition or that the change may be an
effect of the condition. In this example, the upregulated and downregulated genes may have
implications for breast cancer. These genes can be singled out and studied to determine their
roles in the condition.
We can also use the volcano plots to display the differential gene expression results.
A volcano plot is a scatterplot that shows the p-values versus the fold changes in gene
expression. It enables a quick visual identification of genes with large fold changes that
are also statistically significant. In a volcano plot, the upregulated genes are shown toward
the right, the downregulated genes are shown toward the left, and the most statistically
significant genes are toward the top of each side. We can create a volcano plot of the RNA-Seq
results using the data of the top differentially expressed genes, as shown in Figure 5.26.
FIGURE 5.25 MD plot showing upregulated (red), downregulated (blue), and nonsignificant (black).
#############
#volcano plot
jpeg('volcano.jpg')
fitq <- glmQLFit(yNorm, design)
qlfq <- glmTreat(fitq, contrast=my.contrasts, lfc=2)
resFilt <- topTags(qlfq, n=100, adjust.method="BH",
    sort.by="PValue", p.value=1)
volcanoData <- cbind(resFilt$table$logFC,
    -log2(resFilt$table$PValue))
colnames(volcanoData) <- c("logFC", "negLogPval")
plot(volcanoData, pch=19)
dev.off()
Once again, when all chromosomes are studied, the point distribution on the volcano plot
will be very dense. Moreover, some programs are able to color the upregulated and down-
regulated genes for better visualization.
The heatmap is one of the most popular graphical representations for visualizing complex and
multidimensional data. We can use a heatmap to visualize the differential
expression of the genes. First, we need to use the "cpm" function to convert read abundances
into log2 CPM values. The heatmap can be created for the top differentially expressed
genes. The following script creates a heatmap for the top 20 differentially expressed genes:
jpeg('heatmap2.jpg')
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmQLFTest(fitq, contrast=my.contrasts)
DEGenes <- decideTestsDGE(qlfq,
    adjust.method="BH", p.value=0.05, lfc=2)
logCPM <- cpm(yNorm, prior.count=2, log=TRUE)
rownames(logCPM) <- yNorm$genes$SYMBOL
colnames(logCPM) <- paste(yNorm$samples$group, 1:3, sep="-")
o <- order(qlfq$table$PValue)
logCPM <- logCPM[o[1:20],]
logCPM <- t(scale(t(logCPM)))
col.pan <- colorpanel(100, "blue", "white", "red")
heatmap.2(logCPM, col=col.pan, Rowv=TRUE, scale="none",
    trace="none", dendrogram="both", cexRow=1, cexCol=1.4,
    margin=c(10,9), lhei=c(2,10), lwid=c(2,6))
dev.off()
FIGURE 5.27 Heatmap for the top 20 most differentially expressed genes across the samples.
As shown in Figure 5.27, the heatmap clusters genes and samples based on Euclidean dis-
tance between the expression values. As expected, samples from the same group are clus-
tered together.
We can use the "goana" function to perform a gene ontology (GO) enrichment analysis of the differentially expressed genes and the "topGO" function to extract the top GO terms:
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmTreat(fitq, contrast=my.contrasts, lfc=2)
go <- goana(qlfq, species="Hs")
topGO20 <- topGO(go, sort="up", n=20)
write.csv(topGO20, file="topGO20.csv")
Figure 5.28 shows GO IDs, terms, ontology (ONT), the total number of genes annotated
with each ontology term (N), the number of genes that are significantly upregulated (up)
and downregulated (down), the gene counts of the top significant GO ordered by the sig-
nificance of the p-value, and the p-values for the upregulated (P.Up) and p-values for the
downregulated (P.Down). Since the p-values are not adjusted for multiple testing, it is rec-
ommended to ignore GO terms with p-values greater than about 10−5.
Since this exercise is based on a single chromosome (chromosome 22), we do not expect
much information as when we analyze the entire genome. In general, GO analysis tells
us about the different biological processes, their localizations in the cells, and molecular
functions based on the upregulated and downregulated genes.
In the same way, we can perform KEGG pathway analysis to identify the molecular
pathways and disease signatures (Figure 5.29).
my.contrasts <- makeContrasts(conditiontumo-conditionnorm, levels=design)
fitq <- glmQLFit(yNorm, design)
qlfq <- glmTreat(fitq, contrast=my.contrasts, lfc=2)
keg <- kegga(qlfq, species="Hs")
keg20 <- topKEGG(keg, sort="up", n=20)
write.csv(keg20, file="keg20.csv")
In the following, we will use the "vidger" package to create plots for visualizing the example RNA-
Seq data. Open R and set the "features" directory, where you saved the RNA-Seq count
data file, as your working directory. The vidger functions require a DGEList object with
group information and normalized count data. The following script will create such a DGEList, "yNorm",
that can be used as input for the functions:
#Loading packages
library(edgeR)
library("vidger")
library(org.Hs.eg.db)
#Loading data
seqdata <- read.delim("htcount2.txt", stringsAsFactors=FALSE)
sampleinfo <- read.delim("sampleinfo.txt", stringsAsFactors=FALSE)
#Preparing data
countdata0 <- seqdata[,-(1:2)]
rownames(countdata0) <- seqdata[,1]
colnames(countdata0) <- sampleinfo$sampleid
countdata <- countdata0[rowSums(countdata0[])>0,]
group = factor(sampleinfo$condition)
#Creating DGEList object
y <- DGEList(countdata, group=group)
#Adding annotation
ENTREZID <- mapIds(org.Hs.eg.db, rownames(y),
    keytype="SYMBOL", column="ENTREZID")
rownames(y$counts) <- ENTREZID
ann <- select(org.Hs.eg.db, keys=rownames(y$counts),
    columns=c("ENTREZID","SYMBOL","GENENAME"))
y$genes <- ann
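The script above stops after the annotation step, but the plotting calls below expect a normalized DGEList named "yNorm"; a minimal completion, assuming the same filtering and TMM normalization steps used earlier in the chapter:

#Filtering and TMM normalization (assumed completion)
keep <- filterByExpr(y, group=group)
y <- y[keep, , keep.lib.sizes=FALSE]
yNorm <- calcNormFactors(y)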
Once you have run the above script successfully without any error, then you can use the
“vidger” functions to create plots as follows.
FIGURE 5.30 Box plots showing the distribution of normal and tumor counts in CPM.
FIGURE 5.31 A scatter plot showing comparisons of log10 values of CPM for the two conditions.
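The code that produced the box plot in Figure 5.30 is not preserved in this copy; a minimal sketch assuming vidger's vsBoxPlot interface and a hypothetical output file name (the dev.off() line that follows closes the graphics device):

#Box plot
jpeg('CPMboxplot.jpg')
vsBoxPlot(
    data = yNorm,
    d.factor = NULL,
    type = 'edger',
    title = TRUE, legend = TRUE, grid = TRUE)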
dev.off()
#Scatter plot
jpeg('CPMscatterplot.jpg')
vsScatterPlot(
    x = 'norm',
    y = 'tumo',
    data = yNorm,
    type = 'edger',
    d.factor = NULL,
    title = TRUE, grid = TRUE)
dev.off()
FIGURE 5.32 MA plot showing the log-fold change against mean expression.
jpeg('MAPlot.jpg')
vsMAPlot(
    x = 'norm', y = 'tumo',
    data = yNorm,
    d.factor = NULL,
    type = 'edger',
    padj = 0.05,
    y.lim = NULL,
    lfc = 2,
    title = TRUE, legend = TRUE, grid = TRUE)
dev.off()
As shown in Figure 5.32, the blue points on the graph represent the genes with statistically
significant log-fold changes above the specified fold change indicated in the graph by the
dashed lines. The green points represent the genes with statistically significant log-fold
changes less than the specified fold change. The gray points represent the genes without sta-
tistically significant log-fold changes. Moreover, the values in parentheses for each legend
color show the number of genes that meet the conditions. The triangular shapes represent
values that are not displayed. The point size indicates the magnitude of the log-fold change.
jpeg('volcanoPlot2.jpg')
vsVolcano(
    x = 'norm', y = 'tumo',
    data = yNorm,
    d.factor = NULL,
    type = 'edger',
    padj = 0.05,
    x.lim = NULL,
    lfc = 2,
    title = TRUE,
    legend = TRUE, grid = TRUE,
    data.return = FALSE)
dev.off()
5.4 SUMMARY
PCR allows studying gene expression only on a very narrow scale, since the expression of just
a few known genes can be investigated. Microarrays emerged as the first technique for
studying the expression of a massive number of genes. Recently, however, RNA-Seq has been
replacing microarrays because it is better at detecting gene transcripts and gene isoforms, and
it also provides more accurate and sensitive results for differential gene expression.
Studying gene expression allows researchers to identify the roles of genes in the biologi-
cal activities in the cells and their implications for diseases. This specifically provides broad
insight when massively parallel sequencing is used in the investigation of gene expression
across the whole genome. Genes code for proteins that determine traits and biological
functions. RNA-Seq allows investigation of the entire transcriptome through the profiling
of genes so researchers will know which sets of genes are coregulated during the conditions
studied. RNA-Seq is used to study diseases like cancers and patient’s response to therapeu-
tics and other diseases.
The first steps of the workflow of RNA-Seq are similar to the steps of the other NGS
applications as discussed in the previous chapters. The raw data is obtained in FASTQ
files, which can be single end or paired end. The raw data is subjected to quality con-
trol for the trimming of adaptors and removal of duplicate sequences and low-quality
reads. The cleaned raw data is either aligned to a reference genome or a reference tran-
scriptome. The alignment program should be chosen depending on the read length and
splicing awareness. STAR is a good choice for aligning short RNA-Seq reads. However, it
requires sufficient memory and storage space for the indexing of the reference sequence
and read alignment. The alignment information is stored in SAM/BAM files. Programs
like Samtools and PICARD can be used to manipulate and generate alignment statistics
from the BAM files. Quantification of the reads aligned to each gene is essential in RNA-
Seq data analysis because it reflects the transcript abundance of each gene, which is
an indication of its biological activity. The reads aligned to each gene in the BAM files are
counted using read counting programs like htseq-count and featureCounts. In addition to
the BAM files as inputs, these programs also require a reference annotation file in GTF for-
mat. The read counts produced by the counting programs are stored in a tab-delimited file
containing gene symbols or gene IDs and the corresponding counts of the aligned reads
for each gene. The count file may contain a count column for each sample. Sample counts
in different files can be merged for easy processing. The count data is usually normalized
to be suitable for within-sample gene comparisons and between-sample gene comparisons.
The normalization is made to adjust for the biases arising from the differences in the gene
lengths, GC contents, and library sizes across samples. The normalization methods like
RPKM/FPKM and CPM adjust for the biases arising from the difference in gene lengths;
therefore, they can be used when comparison between the expressions of different genes
within the sample is intended. Other normalization methods like TMM and RLE adjust
for the bias arising from the difference in the library sizes across samples; therefore, they
can be used for between-sample differential expression analysis. The count data generated
by read counting programs are further analyzed by differential expression programs like edgeR,
DESeq2, and Cuffdiff. As an example, we used edgeR to demonstrate the steps of the dif-
ferential analysis of the RNA-Seq data.
The differential analysis requires a metadata file that holds the information of the
samples and the study design from which the design matrix is generated. Genes with low
abundance can be filtered out since they do not contribute to the differential analysis. The
normalized count data as response variable and the design matrix as explanatory vari-
ables are used for a GLM fitting. Since the count data is over-dispersed (variance is greater
than the mean), the negative binomial distribution is assumed to fit the RNA-Seq count
data to a log-linear model to estimate the dispersions and other statistics for the differen-
tial gene expression including log-fold change, p-values, and false discovery rates (FDR).
The expression level of a gene is described by the log-fold change of the gene transcript
abundance with respect to a reference. This reference will be a different gene in the case of
intra-sample comparison or the same gene in a different sample in the case of inter-sample
differential analysis. The significance of differential gene expression is measured by the
fold change, test statistic, p-value, adjusted p-value, and FDR. In the case of multiple com-
parison, the adjusted p-value is used. The expressed genes are usually sorted by p-values
(from the lowest to the largest) so that it will be easy to identify the top upregulated and
downregulated genes. These genes can be further analyzed by annotating them with their
GO and KEGG pathways terms to gain insight into their biological interpretation under
the conditions studied.
Chapter 6
Chromatin Immunoprecipitation Sequencing
The library preparation of the ChIP-Seq DNA fragments follows the same steps as
that of whole-genome sequencing (WGS), which include fragmentation, end repair,
adaptor ligation, and enrichment. The sequencing itself follows the same steps used for
DNA sequencing on the chosen sequencing technology. The raw sequencing data include millions
of ChIP-Seq reads.
The sequencing strategies used for ChIP-Seq are the same as the ones followed for
WGS and RNA-Seq. The design of the ChIP-Seq experiment is usually tailored to the conditions
studied, and that design will guide the subsequent data analysis. The raw data produced
by the sequencer are reads in FASTQ files. Sequencing can be single end or paired end,
short reads (e.g., Illumina) or long reads (e.g., PacBio). However, most ChIP-Seq datasets
have been generated from single-end libraries, and we should be aware that some programs
do not accept paired-end libraries. The run can be for a single sample or multiplexed for
several samples; the fragments of each sample in the run are tagged with a unique barcode.
The shape of the ChIP-Seq signal may vary depending on the binding protein studied. The ChIP-Seq
signal can be sharp, broad, or a mix of sharp and broad signals. The sharp signal character-
izes the binding site of a TF, which binds to a specific site in the DNA sequence called a
motif. Histones form broad ChIP-Seq signals because they span several nucleosomes and
may cover many nucleotides on the DNA. RNA polymerase II (Pol II) initiates the
process of transcription by localizing on the promoter region of a gene and then moves along
the gene during messenger RNA transcription. Therefore, the ChIP-Seq signal of Pol II may
include both sharp and broad signals (Figure 6.1).
Peak-calling programs use sliding windows to scan the genome for these patterns to
locate the binding regions by counting both Watson and Crick tags. However, for these
kinds of tags to fit in a single window, they must be shifted to the center so that Watson tags
are shifted toward the 3′ end and Crick tags are shifted toward the 5′ end to form a peak in
the putative binding site. Peak-calling programs like MACS take advantage of the bimodal
pattern to empirically model the shifting size to precisely locate the binding sites [2] on the
genomic DNA sequences.
Peak calling is a step unique to ChIP-Seq data analysis and it aims to identify the
genomic regions occupied by the protein of interest and enriched due to the ChIP. The
abundance of the aligned reads normalized by input reads in a sliding window is the basis
of the peak calling, which is performed using statistics that determine peak significance.
The ChIP-Seq tags are usually normalized by the input reads (control), but some peak callers
can also call peaks without input reads; instead, they assume an even background signal or
use other strategies.
FIGURE 6.1 Sharp signal (TF), broad signal (histones), and mixed signal (Pol II).
FIGURE 6.2 ChIP-Seq read alignment. (The peaks represent reads aligned to the reference
genome.)
There are several peak-calling programs that use their own algorithms
to define protein-binding sites in the genome by identifying regions where sequence reads
are enriched after mapping to a reference genome. The peak caller assumes that ChIP-Seq
reads should align in larger numbers to the sites of protein binding than to other regions
of the genome (Figure 6.2).
Peak-calling programs use different strategies to compute the statistical significance of
peaks in the binding sites. Some peak callers assume a Poisson or negative binomial distri-
bution to model the read counts and to compute the p-value for the statistical significance
of a peak with respect to the background. Since multiple windows (thousands) are tested,
generating multiple p-values, the chance of making a Type I error (false positive) increases.
Some peak callers adjust the p-values based on the number of windows by computing the
false discovery rate (FDR). Other callers use the height of peaks over the background without
providing a statistical significance metric, and others use machine learning to generate
statistical metrics that allow peak calling. Sequencing depth and library complexity are
crucial for the statistical significance of the fold enrichment.
In general, most peak callers perform well. However, attention should be paid to the
type of binding sites that a peak caller is designed for. For instance, some callers are suited
for TFs (e.g., SISSRs [3]), some for histone modifications, and some can handle both (e.g.,
MACS [2]). Some callers do not support paired-end libraries, although paired-end reads
can be treated as single-end reads by using either the forward or the reverse reads, but not
both. The HOMER [4] caller was developed as a tool for de novo motif identification from
peak regions. JAMM [5] requires replicated samples to improve confidence in peak calling.
We should also pay attention to how a caller handles broad and sharp peaks. Callers may
merge nearby peaks, which can lead to a loss of resolution. ChIP-Seq reads originating from
histone modifications generate broad peak signals spanning large regions; however,
determining a region boundary for histone enrichment is still a challenge for peak callers.
In contrast to histone modifications, ChIP-Seq signals of TFs and Pol II exhibit sharp peaks,
and therefore, the peak callers suited for TFs and Pol II should be able to identify those
narrow regions. In the following, we will discuss the steps of the ChIP-Seq workflow with a
worked example.
During transcription, Pol II separates the two DNA strands and synthesizes RNA from the
template strand, forming an initial DNA–RNA hybrid from which the new mRNA transcript is
separated. The purpose of this exercise is to investigate the promoter regions of the genes
targeted by the DNA-directed RNA polymerase II subunit RPB1 during gene transcription.
The raw data consist of four single-end FASTQ files generated on an Illumina Genome
Analyzer and available in the ENCODE database: three ChIP samples with the accession
numbers ENCFF000XJP, ENCFF000XJS, and ENCFF000XKD, plus the input data (control)
from the experiment with accession number ENCSR000EZM. For the sake of keeping the files
organized, we can create a project directory called "chipseq", and inside that directory, we can
create a subdirectory called "data" where we can download the FASTQ files as follows:
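A minimal sketch of the download step; the "_chp1"/"_inp0" suffixes are this exercise's own renaming convention, and the ENCODE download URLs should be verified on the corresponding file pages:

mkdir -p chipseq/data
cd chipseq/data
wget -O ENCFF000XJP_chp1.fastq.gz https://www.encodeproject.org/files/ENCFF000XJP/@@download/ENCFF000XJP.fastq.gz
wget -O ENCFF000XJS_chp2.fastq.gz https://www.encodeproject.org/files/ENCFF000XJS/@@download/ENCFF000XJS.fastq.gz
wget -O ENCFF000XKD_chp3.fastq.gz https://www.encodeproject.org/files/ENCFF000XKD/@@download/ENCFF000XKD.fastq.gz
wget -O ENCFF000XGP_inp0.fastq.gz https://www.encodeproject.org/files/ENCFF000XGP/@@download/ENCFF000XGP.fastq.gz
cd ..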
The four files will be downloaded into the "data" directory: ENCFF000XJP_chp1.fastq.gz,
ENCFF000XJS_chp2.fastq.gz, ENCFF000XKD_chp3.fastq.gz, and ENCFF000XGP_inp0.fastq.gz.
The last one is the FASTQ file that contains the input (control) data. We can then assess the
read quality with FastQC:
cd data
fastqc \
ENCFF000XJP_chp1.fastq.gz \
ENCFF000XJS_chp2.fastq.gz \
ENCFF000XKD_chp3.fastq.gz \
ENCFF000XGP_inp0.fastq.gz
Then, we can display the reports in an Internet browser using Firefox command as follows:
firefox \
ENCFF000XJP_chp1_fastqc.html \
ENCFF000XJS_chp2_fastqc.html \
ENCFF000XKD_chp3_fastqc.html \
ENCFF000XGP_inp0_fastqc.html
cd ..
To avoid repeating what had been discussed in Chapter 1, we will assume that the four
FASTQ files are cleaned and ready for the next step.
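The Bowtie2 index "ref/hg19" used below must be built beforehand; a minimal sketch, assuming the hg19 FASTA is downloaded from the UCSC download server into a "ref" subdirectory:

mkdir ref
wget -P ref http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip ref/hg19.fa.gz
bowtie2-build ref/hg19.fa ref/hg19      # builds the index files ref/hg19.*.bt2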
Once the above operations have been performed successfully, we can use Bowtie2 to align
both ChIP-Seq reads and control reads to the reference genome; since each file is aligned
separately, four SAM files will be produced.
mkdir bam
bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XGP_inp0.fastq.gz \
-S bam/ENCFF000XGP_inp0.sam \
2> bam/inp0.log
bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XJP_chp1.fastq.gz \
-S bam/ENCFF000XJP_chp1.sam \
2> bam/chp1.log
bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XJS_chp2.fastq.gz \
-S bam/ENCFF000XJS_chp2.sam \
2> bam/chp2.log
bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XKD_chp3.fastq.gz \
-S bam/ENCFF000XKD_chp3.sam \
2> bam/chp3.log
The four SAM files produced by the above commands contain the alignment information
of the reads. However, they may also include alignments that we do not need; removing
them lets us focus only on the regions of interest and also reduces the computational burden.
We can remove the mitochondrial read alignments, which are labeled "chrM" in the
chromosome field of the SAM file, and the unplaced, random, and alternate-haplotype reads,
which are labeled "chrUn", "random", and "*hap*", respectively, keeping only the reads aligned
to the main human chromosomes. We can use the "sed" Linux command to do that; the
filtered alignments are saved in new files.
cd bam
sed ‘/chrM/d;/random/d;/chrUn/d;/hap/d’ ENCFF000XGP_inp0.sam >
ENCFF000XGP_inp0_filt.sam
sed ‘/chrM/d;/random/d;/chrUn/d;/hap/d’ ENCFF000XJP_chp1.sam >
ENCFF000XJP_chp1_filt.sam
sed ‘/chrM/d;/random/d;/chrUn/d;/hap/d’ ENCFF000XJS_chp2.sam >
ENCFF000XJS_chp2_filt.sam
sed ‘/chrM/d;/random/d;/chrUn/d;/hap/d’ ENCFF000XKD_chp3.sam >
ENCFF000XKD_chp3_filt.sam
We can then convert the SAM files into BAM files using the "samtools view" command;
BAM files take less storage space. Afterward, we can delete the SAM files to free up storage
if we need to; just be careful not to delete the BAM files.
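A sketch of the conversion, run inside the "bam" directory (the file names follow the filtered SAM files created above):

for f in ENCFF000XGP_inp0 ENCFF000XJP_chp1 ENCFF000XJS_chp2 ENCFF000XKD_chp3
do
  samtools view -b -o ${f}_filt.bam ${f}_filt.sam   # -b writes BAM output
done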
Now, we have three BAM files for the three ChIP-Seq samples and one file for the control
data. Before proceeding, we need to know the number of alignments in each file and then
draw a sample of control reads approximately equal in size to each ChIP-Seq file to serve
as the input reads for that ChIP-Seq file. We do that to avoid library coverage bias.
The following "samtools view" commands count the alignments in each BAM file:
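A minimal sketch of the counting step (the "-c" option makes "samtools view" report the number of alignment records):

samtools view -c ENCFF000XGP_inp0_filt.bam
samtools view -c ENCFF000XJP_chp1_filt.bam
samtools view -c ENCFF000XJS_chp2_filt.bam
samtools view -c ENCFF000XKD_chp3_filt.bam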
Table 6.1 shows the number of aligned reads in each BAM file and the sampling factor, which is
the read count of a ChIP-Seq file divided by the read count of the control file. This fraction is
used to sample input reads from the control file for that ChIP-Seq file.
The following commands store the counts in bash variables and then use the "samtools
view" command to draw a subsample of reads from the control file and store it in a
separate control file for each ChIP-Seq file. The "-b" option outputs a BAM file, and the "-s"
option draws a subsample (a fraction of the reads) from the file.
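A sketch of this subsampling step; in "-s" the integer part is a random seed and the decimal part is the sampling fraction (here the factors from Table 6.1, rounded), and the output names match the input files used later with MACS3:

# store the read counts in bash variables
chp1=$(samtools view -c ENCFF000XJP_chp1_filt.bam)
chp2=$(samtools view -c ENCFF000XJS_chp2_filt.bam)
chp3=$(samtools view -c ENCFF000XKD_chp3_filt.bam)
inp0=$(samtools view -c ENCFF000XGP_inp0_filt.bam)
# subsample the control once per ChIP-Seq sample
samtools view -b -s 1.2892 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp1.bam
samtools view -b -s 1.4123 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp2.bam
samtools view -b -s 1.4274 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp3.bam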
TABLE 6.1 Read Count in Each BAM File, the Fraction for Sampling Reads from the Control BAM file, and
Number of Reads in the Control File for Each ChIP-Seq File
Sample Read Count Sampling Factor Control Read Count
ENCFF000XGP_inp0_filt.bam 30,923,163 N/A N/A
ENCFF000XJP_chp1_filt.bam 8,942,010 0.289168673 8,941,151
ENCFF000XJS_chp2_filt.bam 12,748,871 0.412275775 12,744,729
ENCFF000XKD_chp3_filt.bam 13,217,349 0.427425519 13,212,672
You can then double-check the read count in the new control files for the ChIP-Seq files.
The read counts for the control files are shown in Table 6.1.
Up to this point, we have three ChIP-Seq BAM files and three control BAM files, one for
each ChIP-Seq file. Before proceeding to the next step, you may wish to view the alignments
in a BAM file with the "samtools tview" command or any other BAM viewer. With
"samtools tview", the "-p" option specifies the position to display. To view a BAM file, you
need to sort the alignments by coordinate and then index the file. As an example, we will
view only one file (Figure 6.3).
FIGURE 6.3 Visualizing reads aligned to the reference genome using “samtools tview” command.
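A minimal sketch of the sorting and indexing step for the file viewed below (the "_so" suffix marks the coordinate-sorted BAM):

samtools sort -o ENCFF000XJP_chp1_filt_so.bam ENCFF000XJP_chp1_filt.bam
samtools index ENCFF000XJP_chp1_filt_so.bam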
samtools tview \
-p chr17:56084561-56084661 \
ENCFF000XJP_chp1_filt_so.bam \
../ref/hg19.fa
"samtools tview" provides a quick way to visualize the alignments in a BAM file and
to inspect the binding site at a specific location. However, when you need to view a specific
gene, you may need to inspect the BAM file header to see how the chromosomes are named.
You can also find gene coordinates in the reference genome from the NCBI Gene database.
Viewing a BAM file is not necessary unless you need to inspect the alignments.
The next step is to use one of the peak-calling programs (or, for short, a peak caller), which
takes a ChIP-Seq BAM file and a control BAM file as inputs and performs the process of peak
calling as discussed above. There are several open-source peak callers, but we will use one of
the oldest and most commonly used ones, MACS3. In the commands below, "-t" specifies the
treatment (ChIP) BAM file, "-c" the control BAM file, "-f" the input file format, "-g" the
effective genome size ("hs" for human), "-n" the prefix for the output file names, and "-q" the
q-value cutoff. The "--bdg" option saves the fragment pileup and local background in
bedGraph files for the display of continuous-valued data in a track in genome browsers like
the UCSC Genome Browser or the Integrated Genome Browser (IGB), and "--outdir" specifies
the directory where the output files are saved.
cd ..
mkdir macs3output
macs3 callpeak \
-t bam/ENCFF000XJP_chp1_filt.bam \
-c bam/ENCFF000XGP_inp0_filt_inp1.bam \
-f BAM \
-g hs \
-n chip1 \
-q 0.05 \
--bdg \
--outdir macs3output
macs3 callpeak \
-t bam/ENCFF000XJS_chp2_filt.bam \
-c bam/ENCFF000XGP_inp0_filt_inp2.bam \
-f BAM \
-g hs \
-n chip2 \
-q 0.05 \
--bdg \
--outdir macs3output
macs3 callpeak \
-t bam/ENCFF000XKD_chp3_filt.bam \
-c bam/ENCFF000XGP_inp0_filt_inp3.bam \
-f BAM \
-g hs \
-n chip3 \
-q 0.05 \
--bdg \
--outdir macs3output
Several output files for each ChIP-Seq dataset are saved in the specified output directory,
as shown in Figure 6.4. The description of these MACS3 output files is as follows:
Excel file: The "*_peaks.xls" file is the most important one; it begins with header information
and then lists the peaks called by MACS3 in ten columns. These columns are chromosome
name (chr), start position of the peak (start), end position of the peak (end), length of the peak
region (length), absolute peak summit position (abs_summit), pileup height at the peak summit
(pileup), -log10(p-value) of the peak summit, fold enrichment of the peak summit against a
random Poisson distribution with local lambda (fold_enrichment), -log10(q-value) at the peak
summit, and name.
BedGraph format: The “*_treat_pileup.bdg” contains the peak enrichment signal and
“*_control_lambda.bdg” contains local biases for each location in the reference genome.
FIGURE 6.4 MACS3 output files for the three ChIP-Seq data.
These two files are the largest, and they are in bedGraph format, which can be visualized in the
UCSC browser. The bedGraph format was developed to display genomic information in a
track on a genome browser. It consists of four tab-separated columns: chromosome, start, end,
and value. The chromosome coordinates are 0-based, and the positions (start and end) are
listed in ascending order. The data displayed in the track are the values in the value column;
they can be integer or real, positive or negative.
BED format: The "*_summits.bed" file is in BED format. The BED (Browser Extensible
Data) format defines the data lines that are displayed in an annotation track of a genome
browser. A BED file contains three required fields (chromosome, start, and end) and nine
additional optional fields (name, score, strand, thickStart, thickEnd, itemRgb (an RGB color
value), blockCount, blockSizes, and blockStarts).
The "*_summits.bed" file contains the summit location of every peak. The fifth column
in this file is the same as in the narrowPeak file. This file can be used for finding
motifs at the binding sites.
R script file: The “*_model.r” file is an R script to produce a PDF file containing peak
model plot and cross-correlation plot. The pdf files for our ChIP-Seq data are generated
using the “Rscript” command as follows:
Rscript chip1_model.r
Rscript chip2_model.r
Rscript chip3_model.r
BED6+4 format: The "*_peaks.narrowPeak" file is in BED6+4 format, which stores infor-
mation about the signal enrichment of the called peaks based on pooled, normalized read
counts. This format consists of ten tab-separated columns: chromosome, start, end, name
(region name), score (peak density score from 0 to 1000 based on the signal value), strand
(+/-), signal value (average enrichment), p-value (int(-10*log10(p-value))), FDR or q-value
(int(-10*log10(q-value))), and peak (0-based summit offset from start). Either the p-value or
the q-value is present in the "*_peaks.narrowPeak" file depending on which threshold option
is used: "-p" or "-q".
To prepare the MACS3 bedGraph output for visualization in a genome browser, we can
download the following utilities from the UCSC download server:
rsync -aP \
rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig ./
rsync -aP \
rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToWig ./
rsync -aP \
rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/fetchChromSizes ./
The “bedGraphToBigWig” utility converts a file from the bedGraph format to the BigWig
format. The “bigWigToWig” utility converts a file from the BigWig format to the Wig for-
mat. The “fetchChromSizes” utility creates a text file (from a reference genome) containing
the chromosome names and sizes in bases. For more information about these commands
and file format, refer to the UCSC website.
Before proceeding, we need to create a text file containing the name of the human chro-
mosomes and their sizes using “fetchChromSizes”.
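A minimal sketch of this step (the downloaded utilities may need execute permission, and either their full path or the current directory has to be used):

fetchChromSizes hg19 > hg19.chrom.sizes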
The file “hg19.chrom.sizes” will be created in the working directory; you can view it using
“less hg19.chrom.sizes”. This file consists of two columns: chromosome name and chromo-
some size in bases.
Now, we can convert the bedGraph files to BigWig format and then from BigWig format
to Wig format using the following commands while the working directory is the one where
the MACS3 output files are found:
mkdir vis
#Convert *control_lambda.bdg from bedGraph to BigWig
bedGraphToBigWig \
chip1_control_lambda.bdg hg19.chrom.sizes \
vis/chip1_control_lambda.bw
bedGraphToBigWig \
chip2_control_lambda.bdg hg19.chrom.sizes \
vis/chip2_control_lambda.bw
bedGraphToBigWig \
chip3_control_lambda.bdg hg19.chrom.sizes \
vis/chip3_control_lambda.bw
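The corresponding conversions for the treatment pileup files, and the BigWig-to-Wig conversions used for the genome browser tracks below, follow the same pattern; a sketch, assuming the UCSC utilities are on the PATH or in the working directory:

#Convert *treat_pileup.bdg from bedGraph to BigWig
for s in chip1 chip2 chip3
do
  bedGraphToBigWig ${s}_treat_pileup.bdg hg19.chrom.sizes vis/${s}_treat_pileup.bw
done
#Convert the BigWig files to Wig
for f in vis/*.bw
do
  bigWigToWig $f ${f%.bw}.wig
done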
We also need to modify the "*_peaks.narrowPeak" files by keeping only columns 1–4.
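A sketch of this step, writing the reduced BED files next to the Wig files in "vis":

for s in chip1 chip2 chip3
do
  cut -f 1-4 ${s}_peaks.narrowPeak > vis/${s}_peaks.bed
done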
The converted files are saved in the "macs3output/vis" directory as shown in Figure 6.5. These
files can be visualized in a genome browser. If you are working on Linux with a graphical
desktop, that works for the next step; otherwise, you can copy the "macs3output" directory,
including "vis", to a Windows or Mac desktop using FileZilla, an open-source FTP
application for file transfer available at "https://filezilla-project.org/". The "macs3output"
directory contains all MACS3 output files that we will use in later analyses, and the "vis"
subdirectory contains the files for visualization in a genome browser.
In the next step, we will use a genome browser. In this exercise, we will use the IGB [8],
which is standalone software available for Linux, Windows, and Mac OS X. It can be
downloaded from "https://www.bioviz.org/" and installed on a local computer. Once it has
been installed, open it, click "H. sapiens", open the directory where the Wig and BED files
are stored, and for each ChIP-Seq sample, drag "*control_lambda.wig", "*treat_pileup.wig",
and "*peaks.bed" into the viewer just above the "RefSeq curated (+)" track, clicking the
"Load Data" button on the top right each time. You can also change the color of the tracks
of each sample by right-clicking the track, selecting "Customize" from the popup menu,
and then choosing a color (Figure 6.6).
As shown in Figure 6.6, each ChIP-Seq sample has three tracks: control (input data),
treated (ChIP-Seq), and peaks. It is clear that the signal of the input data (control) is flat,
while the ChIP-Seq signals show peaks; the peak track shows the peak positions. Compare the
control signal of each sample to its corresponding treated (ChIP-Seq) signal. You can move
along the genome using the grab tool (hand icon), or zoom in and out using the horizontal
zoom slider at the top. You can also use the mouse to move left or right. The IGB comes
with search tools that can be used to search for a gene or to navigate to a specific position
in a chromosome.
Figure 6.7 shows a typical mixed (sharp and broad) Pol II signal for the three samples
after zooming in.
FIGURE 6.5 Wig and BED files ready to be visualized in a genome browser.
FIGURE 6.6 The IGB displaying the tracks of the three ChIP-Seq samples.
For the downstream analysis, we need R; you can download R from the CRAN website
("https://cran.r-project.org/") and install it. You can also download and install RStudio by
following the instructions at "https://www.rstudio.com/products/rstudio/download/". Once
you have R installed on your computer, run R, and at the R prompt, run the following to
install and load the Bioconductor packages required for the remaining ChIP-Seq data analysis:
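A sketch of the installation step, assuming none of the packages are already present:

install.packages("BiocManager")
BiocManager::install(c("ChIPseeker", "clusterProfiler",
    "TxDb.Hsapiens.UCSC.hg19.knownGene", "EnsDb.Hsapiens.v75",
    "AnnotationDbi", "org.Hs.eg.db"))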
library(clusterProfiler)
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(EnsDb.Hsapiens.v75)
library(AnnotationDbi)
library(org.Hs.eg.db)
library("dplyr")
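The column-name vector and the reading of the first sample belong to this step as well; a sketch, with the column names inferred from the narrowPeak format described earlier and R's working directory assumed to be "macs3output":

# assumed names for the ten narrowPeak columns
colnames <- c("chrom", "start", "end", "name", "score",
              "strand", "signal", "pvalue", "qvalue", "peak")
peaks1 <- read.table("chip1_peaks.narrowPeak", header=FALSE)
colnames(peaks1) <- colnames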
peaks2 <- read.table("chip2_peaks.narrowPeak", header=FALSE)
colnames(peaks2) <- colnames
colnames(peaks2)
peaks3 <- read.table("chip3_peaks.narrowPeak", header=FALSE)
colnames(peaks3) <- colnames
#head(peaks1)
peaks1Ranges <- GRanges(seqnames=peaks1$chrom,
                ranges=IRanges(peaks1$start, peaks1$end),
                strand=NULL,
                name=peaks1$name,
                score=peaks1$score,
                signal=peaks1$signal,
                pvalue=peaks1$pvalue,
                qvalue=peaks1$qvalue,
                peak=peaks1$peak)
covplot(peaks1Ranges, weightCol="peak")
peaks2Ranges <- GRanges(seqnames=peaks2$chrom,
                ranges=IRanges(peaks2$start, peaks2$end),
                strand=NULL,
                name=peaks2$name,
                score=peaks2$score,
                signal=peaks2$signal,
                pvalue=peaks2$pvalue,
                qvalue=peaks2$qvalue,
                peak=peaks2$peak)
covplot(peaks2Ranges, weightCol="peak")
peaks3Ranges <- GRanges(seqnames=peaks3$chrom,
                ranges=IRanges(peaks3$start, peaks3$end),
                strand=NULL,
                name=peaks3$name,
                score=peaks3$score,
                signal=peaks3$signal,
                pvalue=peaks3$pvalue,
                qvalue=peaks3$qvalue,
                peak=peaks3$peak)
covplot(peaks3Ranges, weightCol="peak")
Three ChIP-Seq peak coverage plots will be created, but we will display only a single plot
to save space. Figure 6.8 shows the coverage plot for the first sample (chip1); it shows the
distribution of the peaks on each human chromosome.
Rather than the entire genome, "covplot()" can also display the coverage of a single
chromosome, a group of chromosomes, or a specific region of a chromosome, or it can be
FIGURE 6.9 ChIP-Seq peak coverage plot comparing between a region in Chromosomes 1 and 2.
used to compare between chromosomes. The following code displays the peak coverage in the
region between 1.0e8 and 1.5e8 bp on chromosomes 1 and 2 (Figure 6.9):
covplot(peaks1Ranges, weightCol="peak",
        chrs=c("chr1", "chr2"), xlim=c(1.0e8, 1.5e8))
The distribution of peaks in the TSS regions can also be visualized with a line plot that
profiles the average peak signal around the TSS; a sketch of that step follows.
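A sketch of the profiling step with ChIPseeker, assuming the TxDb package loaded above and a ±1 kb window around the TSS:

promoter <- getPromoters(TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene,
                         upstream=1000, downstream=1000)
tagMatrix1 <- getTagMatrix(peaks1Ranges, windows=promoter)  # peak signal around TSSs
plotAvgProf(tagMatrix1, xlim=c(-1000, 1000),
            xlab="Genomic Region (5'->3')", ylab="Read Count Frequency")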
We can notice that, as shown in Figure 6.11, the peak profile is approximately bell-shaped,
with its mean in the TSS region.
FIGURE 6.10 ChIP-Seq peak profiling in the TSS regions of the genes.
A line plot with a confidence interval can also be created; the confidence interval is
estimated by resampling the peaks several times and computing the average and variability
of the profile each time (see the sketch below).
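A sketch of this variant; the "conf" and "resample" arguments of "plotAvgProf()" set the confidence level and the number of resampling iterations:

plotAvgProf(tagMatrix1, xlim=c(-1000, 1000),
            conf=0.95, resample=500)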
Figure 6.12 shows that Pol II localization is centered on the TSS, where most peaks are observed.
Then, we can assign the database of known human genes to a variable so that we can
use its annotation information and associate it with the peaks.
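A sketch of the assignment and annotation calls; "annotatePeak()" is applied to each sample's peak ranges, with the ±1 kb "tssRegion" mentioned below, and the list names are this sketch's own choice:

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
samples <- list(chip1=peaks1Ranges, chip2=peaks2Ranges, chip3=peaks3Ranges)
annotated_peaks <- lapply(samples, annotatePeak,
                          TxDb=txdb, tssRegion=c(-1000, 1000), verbose=FALSE)
annotated_peaks   # prints the annotation summary for each sample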
The above R code applies the "annotatePeak()" function to annotate the peaks of each sample.
The TSS region was set to the range (−1000, 1000) around the TSS of each gene. Figure 6.13
shows the annotation summaries for the three samples. The summary includes the number of
annotated peaks at the top, followed by the peak annotation frequencies based on the genomic
features (gene regions). We can notice that the highest frequencies are in the promoter region,
which in this case is an indication of transcriptional activity in the genes associated with
the peaks.
The ChIPseeker package provides several functions to visualize the annotated peaks. The
"plotAnnoBar()" function creates a bar chart of the peak representation in the different
genomic regions (features).
plotAnnoBar(annotated_peaks)
Figure 6.14 shows a bar plot depicting peak enrichment in the different genomic regions of
the genes. We can notice that most peaks are centered in the promoter regions. This may look
different if the ChIP-Seq targets TFs or histone marks.
Distribution of peaks relative to the TSS:
The binding sites of TFs and the localization of Pol II are found in the promoter regions of
genes. Thus, the distribution of peaks around the TSS gives an idea about the activity of the
protein studied. The "plotDistToTSS()" function creates a plot showing the distribution of
the peaks relative to the TSS.
plotDistToTSS(annotated_peaks,
              title="Distribution of Pol II relative to TSS")
As shown in Figure 6.15, most interaction sites are within 0–1 kb of the TSS of the genes.
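Next, the peak annotations can be written to text files. A sketch of the steps that build the per-sample annotation data frame and retrieve gene names from EnsDb (the object names match the code fragment that follows; the chosen columns are an assumption):

# convert the annotation of sample 2 to a data frame
chip2_annot <- data.frame(annotated_peaks[["chip2"]]@anno)
# retrieve gene names for the annotated Entrez gene IDs from EnsDb
annotations_edb2 <- AnnotationDbi::select(EnsDb.Hsapiens.v75,
                        keys=chip2_annot$geneId,
                        columns=c("GENENAME"),
                        keytype="ENTREZID")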
annotations_edb2$ENTREZID <- as.character(annotations_edb2$ENTREZID)
chip2_annot %>% left_join(annotations_edb2,
                by=c("geneId"="ENTREZID")) %>%
  write.table(file="Chip2_peak_annotation.txt",
              sep="\t", quote=F, row.names=F)
Those files will be saved in the working directory. Open these files in Excel to study their
contents.
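The GO enrichment that follows needs the Entrez gene IDs of the annotated genes; a sketch of that extraction and of the first sample's "enrichGO()" call (variable names chosen to match the code below):

# Entrez gene IDs of the annotated genes for each sample
entrez1 <- unique(data.frame(annotated_peaks[["chip1"]]@anno)$geneId)
entrez2 <- unique(data.frame(annotated_peaks[["chip2"]]@anno)$geneId)
entrez3 <- unique(data.frame(annotated_peaks[["chip3"]]@anno)$geneId)
# GO (biological process) enrichment for the first sample
ego1 <- enrichGO(gene = entrez1,
                 keyType = "ENTREZID",
                 OrgDb = org.Hs.eg.db,
                 ont = "BP",
                 pAdjustMethod = "BH",
                 qvalueCutoff = 0.05,
                 readable = TRUE)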
ego2 <- enrichGO(gene = entrez2,
                 keyType = "ENTREZID",
                 OrgDb = org.Hs.eg.db,
                 ont = "BP",
                 pAdjustMethod = "BH",
                 qvalueCutoff = 0.05,
                 readable = TRUE)
ego3 <- enrichGO(gene = entrez3,
                 keyType = "ENTREZID",
                 OrgDb = org.Hs.eg.db,
                 ont = "BP",
                 pAdjustMethod = "BH",
                 qvalueCutoff = 0.05,
                 readable = TRUE)
#GO output
# Chip1
cluster_summary1 <- data.frame(ego1)
write.csv(cluster_summary1, "chip1_GO.csv")
# Dotplot visualization
dotplot(ego1, showCategory=10)
# Chip2
cluster_summary2 <- data.frame(ego2)
write.csv(cluster_summary2, "chip2_GO.csv")
# Dotplot visualization
dotplot(ego2, showCategory=10)
# Chip3
cluster_summary3 <- data.frame(ego3)
write.csv(cluster_summary3, "chip3_GO.csv")
# Dotplot visualization
dotplot(ego3, showCategory=10)
Figure 6.16 shows the dot plot for the first ChIP-Seq sample.
The IDs, descriptions, and statistics of the significant GO terms are stored in "chip1_GO.csv",
"chip2_GO.csv", and "chip3_GO.csv". In Figure 6.16, we can notice that the top ten GO terms
are associated with gene transcription, which reflects the biological activity of Pol II. The
definitions of the GO terms can be searched at "http://www.informatics.jax.org/vocab/
gene_ontology/". Thus, ChIP-Seq provides information about the functions of the protein
studied.
We can also use the KEGG pathway database to annotate the genes with significant peaks.
The "enrichKEGG()" function returns the enriched KEGG categories with FDR control. The
following code generates the KEGG signaling pathway annotation and creates a dot plot for
each sample (Figure 6.17):
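A sketch of these calls for the first sample; the same pattern applies to "entrez2" and "entrez3", and "hsa" is the KEGG organism code for human:

kegg1 <- enrichKEGG(gene = entrez1,
                    organism = "hsa",
                    pvalueCutoff = 0.05)
dotplot(kegg1, showCategory=10)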
The significant KEGG signaling pathways show the most likely active pathways in the cells.
We can also compare enrichment across samples using the "compareCluster()" function,
which requires a list of the genes from each sample (Figure 6.18).
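A sketch of the comparison, assuming the Entrez ID vectors created above:

gene_lists <- list(chip1 = entrez1, chip2 = entrez2, chip3 = entrez3)
ck <- compareCluster(geneCluster = gene_lists,
                     fun = "enrichKEGG",
                     organism = "hsa")
dotplot(ck)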
FIGURE 6.18 KEGG signaling pathway comparison across the three samples.
For motif discovery, we first extract the peak coordinates (chromosome, start, and end) from
the narrowPeak files into BED files:
mkdir motifs
cut -f 1,2,3 \
macs3output/chip1_peaks.narrowPeak \
> motifs/chip1_peaks.bed
cut -f 1,2,3 \
macs3output/chip2_peaks.narrowPeak \
> motifs/chip2_peaks.bed
cut -f 1,2,3 \
macs3output/chip3_peaks.narrowPeak \
> motifs/chip3_peaks.bed
The above commands create a new directory, "motifs", and store the newly created BED files
in it. We will extract FASTA sequences from each of these three files using bedtools, which
is a collection of programs for manipulating BED files. On Ubuntu, you can install bedtools
using "apt-get install bedtools". Visit the program website
"https://bedtools.readthedocs.io/en/latest/content/installation.html" for more information.
The "bedtools getfasta" command is used to extract a FASTA file from each BED file.
This command requires the FASTA file of the reference sequence and a BED file as inputs.
We will use the same reference sequence that we used to generate the BAM files.
bedtools getfasta \
-fi ref/hg19.fa \
-bed motifs/chip1_peaks.bed \
-fo motifs/chip1_peaks.fasta
bedtools getfasta \
-fi ref/hg19.fa \
-bed motifs/chip2_peaks.bed \
-fo motifs/chip2_peaks.fasta
bedtools getfasta \
-fi ref/hg19.fa \
-bed motifs/chip3_peaks.bed \
-fo motifs/chip3_peaks.fasta
Those three FASTA files contain the sequences of the enriched peaks for each sample, and we
will use them as inputs for the motif detection programs.
There are two approaches for motif detection: the de novo method, when no prior information
is assumed, and the position weight matrix (PWM) method for known motifs.
The de novo approach searches for motifs in the input FASTA sequences without prior
information about the motifs. The search is conducted in a window around the peak. De novo
motif discovery programs either create k-mers from the sequences and perform an exhaustive
search to identify the most frequent consensus substrings as motifs, or use iterative sequence
alignments to build consensus motifs from a PWM, identifying motifs as the consensus
sequences with the most frequent nucleobases. An example of a de novo motif discovery
program is the MEME Suite [11], which includes the DREME, MEME, and STREME
programs for discovering ungapped motifs. DREME is k-mer based, but it is deprecated and
will not be supported in the future. MEME is an alignment-based motif discovery tool,
recommended for motif discovery in fewer than 50 sequences. STREME is k-mer based and
is recommended for detecting motifs in datasets with more than 50 sequences. The MEME
Suite is available as a web server and as command-line programs. To use the web server or to
download and install the MEME Suite, visit "https://meme-suite.org/meme/". On Linux, you
can download and install the MEME Suite using the following steps:
wget https://meme-suite.org/meme/meme-software/5.4.1/meme-5.4.1.tar.gz
tar vxf meme-5.4.1.tar.gz
cd meme-5.4.1
./configure --prefix=$HOME/meme --enable-build-libxml2 --enable-build-libxslt
make
make test
make install
Once you have installed it, you can add the following line to the ".bashrc" file:
export PATH=$HOME/meme/bin:$HOME/meme/libexec/meme-5.4.1:$PATH
The version may change, so the best way is to visit the MEME Suite website for the latest
installation instructions.
After adding the above line to the ".bashrc" file, you may need to restart the terminal or
run "source ~/.bashrc" for the change to take effect.
The MEME Suite programs require the ChIP-Seq dataset in FASTA format (the primary
dataset) and a control dataset (the secondary dataset). If no control dataset is provided, the
MEME Suite programs will create one by shuffling each of the sequences in the primary
input dataset.
The following DREME commands will search for motifs in the FASTA sequences of the
three ChIP-Seq samples. However, because the process may take a long time and this is just
practice, you can run the command for a single sample only to save time. Run the commands
from inside the "motifs" directory, where the FASTA files are found.
dreme -verbosity 2 \
-oc dreme_motifs_chip1 \
-dna \
-p chip1_peaks.fasta \
-t 14400 \
-e 0.05
dreme -verbosity 2 \
-oc dreme_motifs_chip2 \
-dna \
-p chip2_peaks.fasta \
-t 14400 \
-e 0.05
dreme -verbosity 2 \
-oc dreme_motifs_chip3 \
-dna \
-p chip3_peaks.fasta \
-t 14400 \
-e 0.05
The "-oc" option specifies the output directory, "-dna" specifies the sequence type, "-p"
specifies the primary dataset, "-t" specifies an elapsed-time stopping criterion, and "-e"
specifies the E-value threshold.
The output files will be saved in the "dreme_motifs_chip*" directories. The motifs are
reported in an HTML file, an XML file, and a text file. You can open each of these files with
the appropriate program; for example, change into each output directory and display the
HTML file using Firefox as follows:
firefox dreme.html
Figure 6.19 shows the motifs as displayed in the HTML file. The figure shows the motif
sequence, logo, RC logo (reverse complement logo), and E-value. A motif sequence logo is a
graphical representation of the sequence conservation of DNA nucleotides. A DNA sequence
logo consists of the four nucleobase letters A, C, G, and T at each position; the relative sizes
of the letters reflect their frequencies in the aligned sequences. The motif sequence uses the
IUPAC nucleotide codes to represent each of the 15 possible base combinations, as shown in
Table 6.2.
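STREME is run in a similar way; a minimal sketch for the three samples, assuming the same FASTA inputs ("--p" is the primary dataset, "--oc" the output directory, and "--thresh" the reporting threshold):

streme --oc streme_motifs_chip1 --dna --p chip1_peaks.fasta --thresh 0.05
streme --oc streme_motifs_chip2 --dna --p chip2_peaks.fasta --thresh 0.05
streme --oc streme_motifs_chip3 --dna --p chip3_peaks.fasta --thresh 0.05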
Each of the above commands will create a directory, "streme_motifs_*". Four files will be
produced inside each directory: "sequences.tsv", "streme.html", "streme.txt", and "streme.xml".
Figure 6.20 shows how "streme.html" is displayed in a web browser.
HOMER is another motif discovery program; its algorithm performs de novo motif discovery
in the regulatory regions of genes using a primary dataset and a secondary dataset. HOMER
has two programs, findMotifs.pl and findMotifsGenome.pl, that perform motif discovery in
promoter and genomic regions, respectively. However, for ChIP-Seq enrichment sequences,
we will use findMotifs.pl (or homer2). Like the MEME Suite programs, HOMER can also
create a secondary random dataset if you do not have one.
For HOMER installation instructions, visit "http://homer.ucsd.edu/homer/introduction/
install.html". HOMER includes a collection of Perl and C++ programs designed to run in a
UNIX/Linux environment; refer to the website for the program requirements.
On Ubuntu, after making sure that all required programs have been installed, you can
create a directory "homer", download the installation Perl script, and install HOMER
as follows:
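A minimal sketch of that step (the directory location is your choice; the configureHomer.pl script performs the installation):

mkdir homer
cd homer
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install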
The path to the HOMER executables will look like the following:
PATH=$PATH:/home/username/downloads/homer/.//bin/
The path will depend on where you have installed HOMER. Copy it and add it to the end of
the ".bashrc" file as follows:
export PATH=$PATH:/home/username/downloads/homer/.//bin/
You may need to restart the terminal or run "source ~/.bashrc" for the change to take effect.
To test the HOMER installation, run "findMotifs.pl"; that will display HOMER's usage and
options. The "findMotifs.pl" command requires your target sequence file and a background
sequence file in FASTA format specified with the "-fasta" option. However, if you do not
provide a background sequence file, the command will create background sequences from
your target sequences. The general syntax of the "findMotifs.pl" command is as follows:
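A sketch of the general syntax for FASTA input (the bracketed background file is optional):

findMotifs.pl <targets.fasta> fasta <output directory> [-fasta <background.fasta>]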
We can use this command without providing a background sequence file as follows:
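For example (a sketch run from the "motifs" directory, letting HOMER generate its own background; the output directory names are this sketch's own choice):

findMotifs.pl chip1_peaks.fasta fasta homer_motifs_chip1
findMotifs.pl chip2_peaks.fasta fasta homer_motifs_chip2
findMotifs.pl chip3_peaks.fasta fasta homer_motifs_chip3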
Three output directories will be created and several files are produced in each of the output
directories:
The "homerMotifs.motifs*" files contain information about the motifs found by the de novo
motif discovery method, separated by motif length ("*" is simply an integer giving the length
of the motif sequence). The motif information includes the sequence, statistics, and PWM.
The file "homerMotifs.all.motifs" contains the information of all motifs found by the
de novo method; it is the concatenation of all the "homerMotifs.motifs*" files.
The file "motifFindingParameters.txt" contains the command line used to execute the
motif discovery.
The file "knownResults.txt" contains statistics about known motif enrichment, including the
motif name, consensus, p-value, log p-value, q-value (Benjamini), number of target sequences
with the motif, percentage of target sequences with the motif, number of background
sequences with the motif, and percentage of background sequences with the motif.
The file "seq.autonorm.tsv" contains autonormalization statistics for the oligos.
The files "homerResults.html" and "knownResults.html" contain the output of de novo
motif finding and known motif finding, respectively. These HTML files can be displayed in
a web browser.
Figure 6.21 shows the motifs (logos and statistics) found by the de novo method using
"findMotifs.pl". The motifs are ordered by p-value; significant motifs have very small
p-values. Each motif has a link to its motif file.
The PWM-based motif discovery methods use prior information about the binding or
interaction sites, obtained by laboratory means. Motif information is stored in a PWM file
format that describes the probability of finding each of the nucleotides A, C, G, and T at
each position of a motif. The word or PWM search method is used only to detect known
binding sites of transcription factors. PWM files of known motifs are used to scan windows
of a sequence to detect the presence of the known binding sites of interest. PWM files of
known motifs can be downloaded from motif databases such as JASPAR at
"https://jaspar.genereg.net/". We will use MAST, which is one of the MEME Suite programs,
to search for known motifs in our sequences. Assume that we wish to search for the TATA-box
binding site in our example sequences. First, we need to download the motif file from the
database and then run the program as follows:
wget https://jaspar.genereg.net/api/v1/matrix/MA0108.1.meme
mast -mt 5e-02 \
-oc mast_chip1 \
MA0108.1.meme \
chip1_peaks.fasta
Three output files (mast.html, mast.xml, and mast.txt) will be saved in the “mast_chip1”
directory. You can use “firefox mast.html” to view the results.
6.4 SUMMARY
Identification of the binding sites of proteins on genomic DNA is critical for understanding
gene regulation, pathways, the roles of specific proteins in gene regulation, and their
implications in disease. ChIP-Seq is therefore used to study epigenetic changes that affect
gene expression and the impact of such changes on disease. ChIP-Seq is the most effective
way to identify protein-binding sites on genomic DNA. The binding sites of transcription
factors and RNA polymerase II are found in the promoter regions of genes. In a ChIP
experiment, the genomic DNA is cut into fragments, and the DNA regions where the protein
of interest binds are precipitated using a specific antibody. The protein molecules are then
removed from the DNA fragments, and the isolated DNA fragments are sequenced using one
of the sequencing technologies. The DNA library preparation and sequencing are similar to
those of other sequencing applications. The sequence reads (in FASTQ files) produced by the
sequencer are the ChIP-Seq reads that are likely to contain the binding sites of the protein
of interest. The quality control step is carried out to reduce errors and to trim and remove
adaptors and other technical sequences that may affect the analysis results. The cleaned reads
are then aligned to a reference genome to produce BAM files that contain the alignment
information of the ChIP reads. The unaligned, random, and mitochondrial reads are usually
removed from the BAM files to reduce the computational burden. The peak enrichment
regions, where the binding sites are most likely to be found, are called using one of the
peak-calling programs, and the peak information for each sample is saved in a BED file. We
used R Bioconductor packages to visualize the distribution of the peaks and to perform
annotation and functional analysis, including GO and KEGG pathways; GO and KEGG
enrichment analyses provide knowledge-based biological information. Finally, we used motif
discovery programs to identify the motifs in the promoter regions.
REFERENCES
1. Bernstein BE, Meissner A, Lander ES: The mammalian epigenome. Cell 2007, 128(4):669–681.
2. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers
RM, Brown M, Li W et al: Model-based Analysis of ChIP-Seq (MACS). Genome Biol 2008,
9(9):R137.
3. Narlikar L, Jothi R: ChIP-Seq data analysis: identification of protein-DNA binding sites with
SISSRs peak-finder. Methods Mol Biol 2012, 802:305–322.
4. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass
CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory
elements required for macrophage and B cell identities. Mol Cell 2010, 38(4):576–589.
5. Ibrahim MM, Lacadie SA, Ohler U: JAMM: a peak finder for joint analysis of NGS replicates.
Bioinformatics 2015, 31(1):48–55.
6. Feng J, Liu T, Qin B, Zhang Y, Liu XS: Identifying ChIP-seq enrichment using MACS. Nat
Protoc 2012, 7(9):1728–1740.
7. Feng J, Liu T, Zhang Y: Using MACS to identify peaks from ChIP-Seq data. Curr Protoc
Bioinformatics 2011, Chapter 2: Unit 2.14.
8. Freese NH, Norris DC, Loraine AE: Integrated genome browser: visual analytics platform for
genomics. Bioinformatics 2016, 32(14):2089–2095.
9. Yu G, Wang LG, He QY: ChIPseeker: an R/Bioconductor package for ChIP peak annotation,
comparison and visualization. Bioinformatics 2015, 31(14):2382–2383.
10. Yu G, Wang LG, Han Y, He QY: clusterProfiler: an R package for comparing biological themes
among gene clusters. Omics 2012, 16(5):284–287.
11. Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homol-
ogy searches. Bioinformatics 1998, 14(1):48–54.
Chapter 7
Targeted Gene Metagenomic Data Analysis
The organization of the raw data files depends on how the reads and barcodes are stored. The
reads also might be sequenced as single end or paired end; paired-end reads are usually stored
in two FASTQ files.
7.2.2.1 Clustering
There are three approaches for clustering the preprocessed amplicon reads. The first is
de novo clustering [6], in which the reads are clustered into OTUs according to their
similarity at a specified threshold; each read is compared against every other read, and the
sequences are then grouped into OTUs. The second is called open-reference clustering, in
which reads are clustered around previously annotated reference sequences by sequence
classification or searching in a database (e.g., the Greengenes database), and the reads that do
not cluster around reference sequences are clustered with the de novo method. The third is
closed-reference clustering, in which reads are clustered around reference sequences, and the
reads that do not match reference sequences are removed [7]. Closed-reference clustering may
introduce reference bias into the clustering process, since the reads without references are
removed even though they may represent novel species or taxa. De novo clustering is
preferred, although it may be more computationally expensive than the other two clustering
methods. The algorithms for de novo clustering include hierarchical clustering and heuristic
clustering. The hierarchical clustering method requires a matrix of pairwise distances between
all unique sequences; a hierarchical tree is constructed from the distance matrix, and the
OTUs are then constructed from the tree based on a distance threshold.
The heuristic clustering method uses pairwise sequence alignment to construct distances
one by one instead of computing all distances in a single step. It generates OTUs in a greedy
incremental strategy that uses one sequence as a seed to represent a cluster. Each unique read
is then compared with the seeds of existing clusters; a read is assigned to a cluster if the
pairwise distance between the unique read and a seed meets a predefined threshold, and if a
read does not join any cluster, that read becomes the seed of a new cluster. The final clusters
represent the OTUs. The heuristic clustering method is more computationally efficient than
hierarchical clustering. There are a variety of heuristic clustering methods that vary in seed
selection and distance calculation. Examples of tools that use heuristic clustering for
constructing OTUs include USEARCH and CD-HIT; a command sketch is shown below.
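As an illustration of greedy, abundance-ordered clustering at a 97% identity threshold, a sketch using the open-source VSEARCH tool (the file names are hypothetical):

vsearch --cluster_size dereplicated_reads.fasta \
  --id 0.97 \
  --centroids otus.fasta \
  --uc clusters.uc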
In general, an OTU constructed by any of the clustering methods is considered a taxonomic
group. From this point, the analysis can move on to taxonomic group assignment,
construction of the phylogenetic tree, and diversity analysis, but the latest analytic tools
propose an extra step before taxonomic assignment. This step is meant to reduce the potential
errors that may produce noise; therefore, the step is called denoising.
7.2.2.2 Denoising
There are two possible types of errors to consider when deciding whether the variation
within an OTU represents errors or real diversity. The first type is base-calling error, which
may arise from sequencing. Errors of this type may occur due to incorrect base pairing
during PCR amplification, polymerase slippage, or PCR chimeras, which are formed when
DNA strand extension is aborted during the PCR process and the aborted products act as
primers in the next PCR cycle, producing artifacts. The second type of error is the
misclassification of a read into an incorrect taxonomic group. This error can be reduced by
constructing OTUs at a particular similarity threshold such as 97%; however, that may come
at the cost of taxonomic sensitivity. Denoising attempts to handle these errors by using the
reads to infer the correct biological sequences; this way, misclassification can be avoided.
Several computational approaches have been proposed for sequence denoising. The most
commonly used are DADA2, Deblur, and UNOISE3, which are able to infer error-free
biological sequences at single-nucleotide resolution. The inferred sequences that will be used
for taxonomic assignment are called features, zero-radius OTUs (ZOTUs), exact sequence
variants (ESVs), or amplicon sequence variants (ASVs). In the following, we will discuss these
three popular denoising methods.
In DADA2, an error model gives the rate at which reads of sequence i are produced from an
amplicon of true sequence j, as the product, over the L read positions l, of the transition
probabilities given the quality score q_i(l):

\lambda_{ji} = \prod_{l=0}^{L} p\left( j(l) \rightarrow i(l),\ q_i(l) \right)  (7.1)
The abundance p-value of a unique sequence i with abundance a_i, given that it was produced
by the candidate sequence j with n_j reads, is then computed from the Poisson distribution:

p_A(j \rightarrow i) = \frac{1}{1 - p_{\mathrm{pois}}\left(n_j \lambda_{ji},\, 0\right)} \sum_{a = a_i}^{\infty} p_{\mathrm{pois}}\left(n_j \lambda_{ji},\, a\right)  (7.2)
These p-values of the unique sequences are used as the division criteria for an iterative
partitioning. A threshold is specified for partitioning; if the smallest abundance p-value falls
below the threshold, a new partition is formed around that unique sequence, allowing other
similar unique sequences to join it. The division continues iteratively until all unique
sequences within each partition are consistent, with abundance p-values greater than the
specified threshold.
The output of the divisive amplicon denoising algorithm is a collection of ASVs, which are
exact sequences with defined statistical confidence. Because ASVs are exact sequences,
generated without clustering or reference databases, they can be readily compared between
studies that use the same target region. The DADA2 pipeline generates an ASV table that can
be used for downstream analysis.
In the UNOISE algorithm, each cluster has a centroid sequence; for every member sequence
M, the algorithm considers its distance d (number of differences) from the centroid sequence
(C) and the skew of its abundance (a_M) with respect to the centroid sequence abundance
(a_C), which is given as

\mathrm{skew}(M, C) = \frac{a_M}{a_C}  (7.3)
When a member unique sequence has both a sufficiently small distance and a sufficiently
small skew with respect to the centroid sequence, it is likely that the sequence is an incorrect
read of the centroid sequence with d point errors. The maximum skew (β) allowed for a
cluster member with d differences from the centroid sequence is given by
\beta(d) = \frac{1}{2^{\alpha d + 1}}  (7.4)
where α is set to 2 by default.
We can notice that as the distance d between the member sequence and the centroid
increases, the maximum allowed skew β decreases exponentially.
Unique sequences with low abundance are removed by the UNOISE2 algorithm.
The final products of any of the clustering or denoising methods are a feature table and a
list of representative sequences. The feature table provides the feature abundance, i.e., the
number of times a feature has been observed in each sample. A feature is a unit of observation
that can be an OTU or an ASV. The feature table is needed for downstream analysis such as
taxonomy assignment, construction of the phylogenetic tree, and diversity analysis.
7.2.3.2 VSEARCH
VSEARCH [12] is another alignment-based, open-source tool like BLAST. However, unlike
BLAST, which uses a seed-based heuristic search algorithm, it uses a fast heuristic search that
identifies a small set of database sequences that have many k-mer words in common with the
query sequence (a representative sequence). The VSEARCH algorithm counts the number of
shared words between a representative sequence and each database sequence (a word counts
once even if it appears multiple times). Thus, the similarity between a representative sequence
and a target sequence is first estimated from the statistics of shared words rather than
alignment. The algorithm then performs pairwise global alignment (Needleman-Wunsch) of
the query sequence with each database sequence, beginning with the sequences that have the
largest number of words in common with the representative sequence, and the optimal global
alignment score is computed for each alignment. Unlike BLAST's seed-and-extend approach,
exhaustive pairwise global alignment is computationally expensive; however, VSEARCH
employs parallel strategies, using vectorization and multiple threads, that reduce the
computational cost.
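As an illustration, a sketch of such a global-alignment search of representative sequences against a reference database with VSEARCH (the file names are hypothetical):

vsearch --usearch_global rep_seqs.fasta \
  --db reference_16S.fasta \
  --id 0.97 \
  --blast6out taxonomy_hits.txt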
For taxonomic assignment with the naïve Bayesian classifier, let n(w_i) be the number of the
N training sequences that contain the word w_i (a subsequence of fixed length). The prior
probability of observing the word w_i is estimated as

P_i = \frac{n(w_i) + 0.5}{N + 1}  (7.5)

The numbers 0.5 and 1 are used to keep the value of P_i between 0 and 1 (i.e., 0 < P_i < 1).
Assume that m(w_i) is the number of the M training sequences of genus G that contain the
word w_i. Thus, the conditional probability that a sequence of genus G contains the word w_i
is estimated as

P(w_i \mid G) = \frac{m(w_i) + P_i}{M + 1}  (7.6)
Now assume that a sequence S of genus G has the observed set of words V = {v_1, v_2, ..., v_f},
V ⊆ W; thus, the joint probability, or likelihood, of observing the sequence S is the product of
the probabilities of its words given genus G:

P(S \mid G) = \prod_{i} P(v_i \mid G)  (7.7)

By Bayes' theorem, the posterior probability of genus G given the sequence S is

P(G \mid S) = P(S \mid G) \times \frac{P(G)}{P(S)}  (7.8)
where P(G|S) is the posterior probability, P(G) is the class prior probability of a sequence
being a member of genus G, and P(S) is the predictor prior probability.
Since both the class prior probability and the predictor prior probability are constant across
comparisons, we can drop P(G) and P(S), and the formula can be rewritten as

P(G \mid S) \propto P(S \mid G) = \prod_{i} P(v_i \mid G)  (7.9)
Based on the naïve Bayesian classifier, each representative sequence obtained from the
denoising step is assigned to the genus G (or other taxonomic group) that gives the highest
probability score, after the score is calculated for each taxonomic group.
For building a phylogenetic tree, a pairwise evolutionary distance between aligned
representative sequences can be estimated with the Jukes–Cantor correction, where p is the
proportion of positions at which the two sequences differ:

d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p \right)  (7.10)
The tree is then built from the pairwise distances of the representative sequences. Several
tree-building methods are available. The most popular distance-based methods are the
unweighted pair group method with arithmetic mean (UPGMA), weighted pair group
method with arithmetic mean (WPGMA), and neighbor joining (NJ) [14].
Both UPGMA and WPGMA assume a constant molecular clock that measures the evolutionary divergence of sequences. The molecular clock is defined as the average rate at which a sequence accumulates mutations. Both UPGMA and WPGMA also follow a similar algorithm. They use a clustering procedure that starts by treating each representative sequence as a cluster on its own, then joins the closest clusters and recalculates the distance of the joined pair by averaging. These steps are repeated until all sequences are connected in a single cluster. The difference between the two methods is that UPGMA assigns equal weight to the distances, while WPGMA assigns different weights to the distances.
The NJ method does not assume a molecular clock, and it adjusts for rate variation among branches. The algorithm begins with an initial unresolved star-like tree made up of the representative sequences. The distance between each pair is evaluated. The first join is created by joining the closest two neighboring sequences, and a branch is inserted between them and the rest of the star-like tree. The branch length is recalculated on the basis of their average distance. This process is repeated until only one terminal remains from the initial tree.
The tree construction methods briefly described above are distance-based and less computationally expensive. However, there are other methods, including maximum parsimony (MP) and maximum likelihood (ML), which make use of all known evolutionary information (individual substitutions) to determine the most likely ancestral relationships. Refer to a book on phylogenetic trees for more details about the various tree construction methods.
A phylogenetic tree is either rooted (with a common ancestor for all sequences) or unrooted (without a common ancestor). Unrooted trees are constructed when we do not assume that molecular clock evolution holds; they reflect only the relationships among the representative sequences and not the evolutionary path. However, if we can assume that sequences evolve at rates that remain constant through time for different lineages, then the root of a tree is estimated as the midpoint of the longest span across the tree.
H = −∑ pi × ln(pi), summed over i = 1, …, S        (7.11)
where pi is the probability of finding species/taxon i in the sample and S is the number of
taxa in the sample or richness.
The values of Shannon diversity usually fall between 1.5 and 3.5. The higher the value
of H, the higher the diversity of taxa in a sample. The lower the value of H, the lower the
diversity. A sample with a single taxon will have H = 0 (not diverse sample).
Evenness = H/ln(S)        (7.12)
The evenness value ranges from 0 to 1, where 1 indicates complete evenness or the taxo-
nomic groups in the sample have similar abundances.
If two samples share all the taxonomic groups, the percent Jaccard index will be 100% (βjac = 100%); the closer it is to 100%, the more similar the samples are. If the two samples share no species/taxa, they will be 0% similar (βjac = 0). If the percent Jaccard index is 50%, the two samples share half of the taxonomic groups.
βbr = (1 − 2c/(a + b)) × 100        (7.14)
The percent Bray–Curtis dissimilarity is always a number between 0 and 100. If it is 0, then
the two samples share all the same species; if it is 100, that means the two samples do not
share any species.
(qiime2)$ qiime
This will display the program usage, options, and available commands. A command can be a QIIME2 utility tool (info, tools, dev) or a QIIME2 plugin. We will use the “tools” command very often. To use a QIIME2 command or plugin, it must be run as “qiime <command>”. For instance, to use the “tools” command, run the following:
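A minimal example is simply requesting the help text of the “tools” command:
qiime tools --help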
The QIIME2 plugins are software packages developed by programmers to be used for anal-
ysis. Plugins work through two functional units: methods and visualizers. Before any data
processing, QIIME2 converts input raw data into a special format called QIIME2 artifacts.
An artifact is a compressed file with the “.qza” file extension. For the output, QIIME2
FIGURE 7.1 General QIIME2 workflow for amplicon-based sequence data analysis.
generates another file format, with the “.qzv” file extension, called a visualization file. This visualization file is a standalone, sharable file that may contain any kind of output, such as images, tables, and interactive representations. Plugin methods take QIIME2 artifacts as input and produce artifacts as output, while plugin visualizers produce a single visualization file for the purpose of visualizing or sharing. In the following, we will show you how to use QIIME2 to analyze targeted gene metagenomic data. The general workflow of the amplicon-based metagenomic data analysis is shown in Figure 7.1.
We already know that to execute any QIIME2 command, you must use “qiime”. We also mentioned above that “tools” is a command, and that a command or plugin may have methods, visualizers, or both. The “import” method of the “tools” command imports any input data into an artifact. However, for each input file format, different arguments and options will be used. In general, importing an input file into an artifact follows this format:
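The general form below is a template rather than a runnable command; the placeholders in angle brackets must be replaced with the actual data type, paths, and format:
qiime tools import \
--type <data type> \
--input-path <path to the raw data> \
--input-format <input file format> \
--output-path <output artifact .qza>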
The “import” options are self-explanatory, where “--type” specifies the type of the data to
be imported, “--input-path” specifies the path to the raw data files, “--input-format” speci-
fies the input file format, and “--output-path” specifies the path where the artifact file will
be saved.
You must specify the right data type and format with “--type” and “--input-format” for your input files to be imported successfully. In the following, we will show the formats for importing the common types of raw data; for practice and detailed information, visit “https://docs.qiime2.org/”.
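The original import command for a plain FASTA file is not reproduced here; a minimal sketch, assuming the sequences are stored in a file named “sequences.fasta”, might look like the following:
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path sequences.fasta \
--output-path sequences.qza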
The above command will import the FASTA sequences as a QIIME2 artifact “sequences.
qza” that is ready for the next step of the analysis.
EMP-multiplexed raw data consist of the multiplexed reads FASTQ file(s) and a barcode FASTQ file. We can import such multiplexed raw data to QIIME2 and later demultiplex them using a QIIME2 command. To import single-end EMP-multiplexed reads, the forward FASTQ file name must be “sequences.fastq.gz” and the barcode file must be “barcodes.fastq.gz”. These two files can be in a directory, say “data”, for example. Then, you can use “qiime tools import” to import the raw data using the following:
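A minimal sketch of this import (the output artifact name is an assumption):
qiime tools import \
--type EMPSingleEndSequences \
--input-path data \
--output-path emp-single-end-sequences.qza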
Notice that we use “input-path” to specify the directory where the two files are found.
For the paired-end EMP-multiplexed raw data, we can use the following:
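A sketch for the paired-end case, assuming the directory “data” contains “forward.fastq.gz”, “reverse.fastq.gz”, and “barcodes.fastq.gz” (the output artifact name is an assumption):
qiime tools import \
--type EMPPairedEndSequences \
--input-path data \
--output-path emp-paired-end-sequences.qza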
Notice that both artifacts created by the above commands are for multiplexed reads. Those artifacts require demultiplexing, as we will discuss later.
Again, the artifacts created by the above commands contain multiplexed reads. So, they
must be demultiplexed before proceeding. Demultiplexing will be discussed later.
For Casava 1.8 demultiplexed FASTQ files, each file name includes the sample identifier, the barcode sequence, the lane number, the read direction, and the set number. QIIME2 can use that information in the file name to automate importing the files and to create an artifact for that demultiplexed raw data. The following are the commands for importing Casava 1.8 formatted demultiplexed, single-end and paired-end FASTQ files, respectively:
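The following sketch assumes the Casava-formatted FASTQ files are in the directories “casava-single-end” and “casava-paired-end”, respectively; the output artifact names are assumptions:
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path casava-single-end \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-single-end.qza
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path casava-paired-end \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-paired-end.qza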
These commands create artifacts for demultiplexed data that are ready for processing (the
samples are already separated).
FIGURE 7.2 Examples of manifest files for single-end and paired-end FASTQ files.
The artifacts created by the manifest-based import commands are already demultiplexed, and they are ready for processing by QIIME2.
7.3.1.2 Metadata
In the above, we discussed importing the raw sequence data into QIIME2 artifacts. However, we have also mentioned that a sample metadata file describing the study is required for the analysis. A metadata file can be created by the user and will be used later in the analysis to show the sample grouping. A metadata file is a tab-separated values (TSV) file containing sample and study design information in columns; the first column must hold the sample identifiers (the IDs of the samples in the study). The top line is for the column names. The ID column name must be one of the following: id, sampleid, sample id, sample-id, featureid, feature id, or feature-id. The IDs listed in the metadata file should be unique and not empty. The ID column is not optional, and the metadata file must contain at least one ID. The ID column can be followed by additional columns defining metadata associated with each sample or feature, such as groups, treatment, factor, and barcode sequence. A column name should not be one of the reserved words, and the type of the data contained in the metadata file is either categorical or numeric. The sample metadata file is created manually by the user. However, if the data is downloaded from the NCBI SRA database, the metadata can also be downloaded and modified to meet our goal. This will be discussed below in the worked example.
7.3.2 Demultiplexing
Above, we discussed the EMP and non-EMP-multiplexed FASTQ files and how they can
be imported into QIIME2 artifacts. Since the reads are multiplexed, demultiplexing must
be carried out before reads processing and analysis. QIIME2 uses “demux” and “cutadapt”
plugins for demultiplexing EMP-multiplexed reads and non-EMP-multiplexed reads,
respectively.
It is also required to provide the sample metadata file as an input with the “--m-barcodes-file” option and to specify the metadata column that contains the barcode sequences. The output is two artifacts: an artifact for the demultiplexed reads and an artifact for the demultiplexing details.
Figure 7.3 shows the commands and usages for demultiplexing both file formats.
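As an illustration of the EMP case, a sketch of running “demux emp-paired” is shown below; the artifact names and the barcode column name (“BarcodeSequence”) are assumptions that must match your metadata file, and option names may vary slightly between QIIME2 versions:
qiime demux emp-paired \
--i-seqs emp-paired-end-sequences.qza \
--m-barcodes-file sample-metadata.tsv \
--m-barcodes-column BarcodeSequence \
--o-per-sample-sequences demux.qza \
--o-error-correction-details demux-details.qza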
Once the raw data has been imported into a QIIME2 artifact and the multiplexed reads have been demultiplexed, all types of raw data are preprocessed and analyzed in the same way.
Therefore, we will discuss the remaining steps of the analysis with QIIME2 through a
worked example.
mkdir PRJEB24421
cd PRJEB24421
mkdir data
In the next step, we will save the run accessions of the BioProject in a text file in the
“data” subdirectory. To do that, you can open the NCBI SRA database and search for the
BioProject by the above accession number or simply copy and paste the following URL on
the Internet browser:
https://www.ncbi.nlm.nih.gov/sra/?term=PRJEB24421
Then, use “Send to” dropdown menu to download the runinfo text file. After download-
ing the text file, open the file in Excel, delete all columns except the column with the run
accessions and remove the column name as well, and save the file as “runids.txt” in the
“data” subdirectory.
Instead of the above, you can also use the following EDirect script, which extracts the
run accessions and stores them in a file named “runids.txt” in “data” subdirectory (you
should have the NCBI Entrez Direct installed):
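A sketch of such a script (assuming the EDirect tools are on your PATH):
esearch -db sra -query PRJEB24421 | \
efetch -format runinfo | \
cut -d ',' -f 1 | \
grep -v Run > data/runids.txt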
Check to see if the file has been saved successfully by using “ls data/” command or you can
display the file content by using “vim data/runids.txt” command.
After saving the text file with the 86 run accessions in the “data/runids.txt” file, you
can then download the raw FASTQ files from the NCBI SRA database either by saving
the following script in a bash file “download.sh” and then run it as “bash download.sh” or
you can just enter the script on the terminal command-line prompt, while you are in the
project directory:
while read f;
do
fasterq-dump --progress --outdir data “$f”
done < data/runids.txt
You will see the downloading progress. The files require only 771.29MB of storage space.
The 172 FASTQ files will be downloaded in the “data” subdirectory, two files for each
sample. When the files have been downloaded successfully, you can check the content
of the “data” subdirectory and count the number of the FASTQ files using the following
command:
ls data/*.fastq | wc -l
The number of files should be 172. If it is not, you may need to run the download script
again.
FIGURE 7.4 Downloading metadata from SRA database using Run Selector.
Use the SRA Run Selector (Figure 7.4) to download the metadata text file. Open the file in a spreadsheet, remove all
columns except “run”, “group”, and “time”. Then, rename “run” to “SampleID”, insert a
data type row as shown in Figure 7.5, and save the file as a TSV file with the name: “data/
sample-metadata.tsv”.
If you have EDirect installed on your computer, you can search the NCBI database and
retrieve metadata from it. EDirect is a collection of Linux-based command-line E-Utilities
programs. The instructions for installing EDirect programs are available on the NCBI website.
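The manifest-building script itself is not reproduced here; a minimal bash sketch that writes a tab-separated manifest in the V2 paired-end format (assuming the FASTQ files produced by fasterq-dump are named “<run>_1.fastq” and “<run>_2.fastq” and that you start in the project directory) might look like this:
cd data
echo -e "sample-id\tforward-absolute-filepath\treverse-absolute-filepath" > manifest.txt
for r1 in $(ls *_1.fastq); do
id=$(basename $r1 _1.fastq)
echo -e "${id}\t$PWD/${id}_1.fastq\t$PWD/${id}_2.fastq" >> manifest.txt
done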
The “manifest.txt” file will be created in “data” directory, and it looks as shown in
Figure 7.6.
After running the above script, you can display the file content using the text editor of
your choice. Then, move back to the project main directory using “cd ..”.
The next step is to import the FASTQ files into a QIIME2 artifact. To keep the files organized, you can create a new subdirectory “inputs” for the artifact files.
mkdir inputs
The following script uses “qiime tools import” to import the 172 FASTQ files into a single QIIME2 artifact “demux-yoga.qza” in the “inputs” directory:
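A sketch of this import, assuming the manifest created above is in the V2 paired-end format with Phred 33 quality scores:
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path data/manifest.txt \
--input-format PairedEndFastqManifestPhred33V2 \
--output-path inputs/demux-yoga.qza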
mkdir viz
Then, we can run the following script to save the report visualization file in the “viz”
subdirectory:
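A sketch of this step (the visualization file name is an assumption):
qiime demux summarize \
--i-data inputs/demux-yoga.qza \
--o-visualization viz/demux-yoga.qzv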
We can view the visualization file on the Internet browser using “q2-tools view” as follows:
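Assuming the visualization file created above:
qiime tools view viz/demux-yoga.qzv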
The report displayed on the Internet browser will have two tabs: Overview and Interactive
Quality Plot. The Overview report tab shows Demultiplexed sequence count summary,
forward and reverse reads frequency histogram, and Per-sample sequence counts. The
summary count section shows minimum, median, mean, maximum, and total number
of reads across the samples for forward reads and reverse reads (this is only in paired
end). Figure 7.7 shows that the summary statistics for forward reads and reverse reads are
the same. The minimum number of reads in a sample is 7702 and the maximum num-
ber of reads is 95075. The median and mean of reads across all samples are 25588.0 and
30018.709302, respectively. The total number of forward reads in all samples is 2581609
and is the same for reverse reads. The histogram shows the distribution of the number
of reads versus number of samples. The histograms can be downloaded as pdf files. The
Per-sample sequence counts table, below the histogram, lists sample ID and number of
forward reads and reverse reads in each sample. The Interactive Quality Plot section shows
plots for the per base Phred quality score across reads for both forward and reverse reads
(Figure 7.8). These plots are similar to the QC plots discussed in Chapter 1. They will help
us to determine if the sequences need trimming or filtering. If you hover the mouse over a
specific position on the plot, the table below the plot will update itself to display the seven-
number summary statistics for that base position. This table corresponds to what is visu-
ally represented by the box plot at that position.
FIGURE 7.7 Sequence count summary and histograms of reads of yoga data.
You can notice that the overall quality scores of the reads are high, but there are also some bases with quality scores less than 20 (99% accuracy) toward the end of the reads. We can trim the low-quality bases from the ends of the reads. The demultiplexed sequence length summary at the bottom of the Interactive Quality Plot tab shows that the reads have equal length (275 bases). This table helps us determine whether we need to make the read lengths equal.
If you decide to filter out reads with poor quality scores, you can use the “quality-filter” plugin with the “q-score” method. However, this can be done directly only for single-end reads. For paired-end reads, you can join the forward and reverse reads and then run “quality-filter q-score” on the merged reads; this will be discussed later with clustering. Denoising methods also have their own way of filtering low-quality reads, as we will see soon. But if your data consists of single-end reads, you can use “quality-filter q-score” to remove low-quality reads using the following script:
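A sketch for single-end data, assuming a demultiplexed single-end artifact named “demux-single.qza” and a minimum quality of 20 (the names and the threshold are assumptions):
qiime quality-filter q-score \
--i-demux demux-single.qza \
--p-min-quality 20 \
--o-filtered-sequences demux-single-filtered.qza \
--o-filter-stats demux-single-filter-stats.qza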
If primer or adapter sequences are still present in the reads, this would be the right moment to remove them using the “cutadapt” plugin with “trim-single” or “trim-paired” for single-end or paired-end reads, respectively.
As for our yoga data, we now have a clear view of its quality after assessing it with “demux summarize”. Since the data consists of paired-end reads, we will carry out the quality control later with clustering and denoising.
7.3.4.2.1 Clustering
If your plan is to cluster reads into OTUs without denoising, QIIME2 provides the “q2-vsearch” plugin to do just that. This plugin has methods for the three types of clustering: de novo, closed-reference, and open-reference. The “q2-vsearch” plugin can also perform quality control; therefore, before running clustering, you may need to do some preprocessing of the data. The paired-end reads must be merged before processing. In the following, we will walk you through the steps of clustering, up to the point of generating feature tables and OTU representative sequences. The paired-end reads are merged with the “join-pairs” method of the “q2-vsearch” plugin:
qiime vsearch join-pairs \
--i-demultiplexed-seqs inputs/demux-yoga.qza \
--p-allowmergestagger \
--o-joined-sequences inputs/demux-yoga-merged.qza
For checking the merging, run “demux summarize” to create a visualization file for the
summary and quality of the merged reads as discussed above.
If you open Interactive quality plot on the summary report, you will notice that the for-
ward and reverse reads have been joined as shown in Figure 7.9.
If, at this point, you notice that the forward reads (for single-end data) or merged reads
(for paired-end data) require filtering, you can then perform that with “quality-filter
q-score” as follows:
FIGURE 7.9 Per base quality plots of the merged yoga data.
Next, the merged reads are dereplicated with the “dereplicate-sequences” method of the “q2-vsearch” plugin. The input is the last preprocessed data artifact “demux-yoga-merged.qza”.
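A sketch of the dereplication step; the output artifact names match those used in the clustering commands below:
qiime vsearch dereplicate-sequences \
--i-sequences inputs/demux-yoga-merged.qza \
--o-dereplicated-table inputs/derep-yoga-table.qza \
--o-dereplicated-sequences inputs/derep-yoga-seqs.qza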
The outputs from the “dereplicate-sequences” command are two artifacts: (i) feature table
containing the OTU features with their observed abundances (frequencies) for each of the
samples of the study and (ii) feature data in which each feature identifier is mapped to a
feature.
mkdir denovo
qiime vsearch cluster-features-de-novo \
--i-table inputs/derep-yoga-table.qza \
--i-sequences inputs/derep-yoga-seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table denovo/table-yoga-denovo.qza \
--o-clustered-sequences denovo/rep-seqs-yoga-denovo.qza
The outputs are two artifacts: a feature table for the OTUs and feature data that contains
the centroid sequences defining each OTU cluster. De novo clustering usually consumes
more computational resources compared to the other two methods.
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/
gg_13_8_otus.tar.gz
tar vxf gg_13_8_otus.tar.gz
rm gg_13_8_otus.tar.gz
Make sure that the URL is a single line with no white space. You can visit the website to
download the latest release.
The files will be extracted into a directory (gg_13_8_otus). Display the contents of this directory and its subdirectories using the “ls” Linux command. You will find five subdirectories: “otus” (for reference OTUs), “rep_set” (for the reference representative sequences), “rep_set_aligned” (for aligned representative sequences), “taxonomy” (for taxonomy files), and “trees” (for phylogenetic trees). The files in these directories contain data at different identities (e.g., 99%, 97%, and 94%). Keep this database as we will use it for other applications.
To use the reference database for clustering, you need to import the file of the database
representative sequences (FASTA file). You need to choose at which identity you wish to
perform clustering. Assume that you want to cluster your sample sequences at 97% iden-
tity, then you can import “rep_set/97_otus.fasta” onto QIIME2 as artifact using “tools
import”. To keep the files organized, we will create the subdirectory “closed_ref_cl_97” for
closed-reference clustering files.
mkdir closed_ref_cl_97
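A sketch of the import of the reference representative sequences; the output artifact name matches the one used in the clustering command below:
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/97_otus.fasta \
--output-path inputs/97_otus-GG_db.qza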
Then, you can use the “cluster-features-closed-reference” method of the “q2-vsearch” plu-
gin to perform the closed-reference clustering on the features generated in the derepli-
cation steps. The input artifacts are: dereplicated feature table “derep-yoga-table.qza”,
dereplicated representative sequences “derep-yoga-seqs.qza”, and the reference representa-
tive sequences from the database “97_otus-GG_db.qza”.
qiime vsearch cluster-features-closed-reference \
--i-table inputs/derep-yoga-table.qza \
--i-sequences inputs/derep-yoga-seqs.qza \
--i-reference-sequences inputs/97_otus-GG_db.qza \
--p-perc-identity 0.97 \
--o-clustered-table closed_ref_cl_97/table-yoga-closed_cl.qza \
--o-clustered-sequences closed_ref_cl_97/rep-seqs-yoga-close_cl.qza \
--o-unmatched-sequences closed_ref_cl_97/unmatched-yoga-close_cl.qza
The above script outputs three artifacts: a feature table, clustered sequences (the sequences defining the features in the feature table), and unmatched sequences (the sequences that did not match the reference sequences at 97% identity). The unmatched sequences are completely ignored in the downstream analysis.
mkdir open_ref_cl_97
qiime vsearch cluster-features-open-reference \
--i-table inputs/derep-yoga-table.qza \
--i-sequences inputs/derep-yoga-seqs.qza \
--i-reference-sequences inputs/97_otus-GG_db.qza \
--p-perc-identity 0.97 \
--o-clustered-table open_ref_cl_97/table-yoga-open_cl.qza \
--o-clustered-sequences open_ref_cl_97/rep-seqs-yoga-open_cl.qza \
--o-new-reference-sequences open_ref_cl_97/
new-ref-seqs-open_cl.qza
The three clustering methods use dereplicated feature table and representative sequences
and produce a final feature table and OTU representative sequences to be used in the
downstream analysis for phylogeny, diversity analysis, assignment of taxonomic group,
and differential taxonomic analysis.
7.3.4.2.2 Denoising
Like clustering, denoising also produces a feature table and representative sequences.
However, denoising attempts to remove errors and to provide more accurate results.
There are two denoising methods available in QIIME2: DADA2 and deblur. Both meth-
ods output feature tables containing feature abundances and ASVs. Moreover, they also
perform quality filtering and chimera removal. You may not need to perform any quality control prior to using either of these two methods. Only for deblur denoising may you need to merge the paired-end reads, as we did for the clustering.
Now, we can use “q2-dada2” to denoise the yoga data that we imported and saved as
“demux-yoga.qza”. We will use “denoise-paired” method. To keep the files organized, we
will create the “dada2” subdirectory for DADA2 denoising files.
mkdir dada2
qiime dada2 denoise-paired \
--i-demultiplexed-seqs inputs/demux-yoga.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 250 \
--p-trunc-len-r 250 \
--p-n-threads 4 \
--o-representative-sequences dada2/rep-seqs_yoga_dada2.qza \
--o-table dada2/table_yoga_dada2.qza \
--o-denoising-stats dada2/stats_yoga_dada2.qza
The DADA2 stats summary table is a TSV. It shows the ID, number of input reads, number
of reads filtered, percentage of reads passed the filter, number of denoised reads, number of
merged reads, percentage of reads merged, number of non-chimeric reads, and percentage
of non-chimeric reads for each sample in the study.
mkdir deblur
qiime vsearch join-pairs \
--i-demultiplexed-seqs inputs/demux-yoga.qza \
--p-allowmergestagger \
--o-joined-sequences deblur/demux-yoga-merged.qza
Denoising with deblur is carried out with “q2-deblur” plugin, which has two methods:
“denoise-16S” for denoising 16S rRNA gene sequences and “denoise-other” for denoising
other types of targeted gene sequences. When you use “denoise-16S”, deblur performs an initial positive filtering step by discarding any read that does not have a minimum of 60% identity to sequences in the 85% GreenGenes reference database.
You can use “deblur denoise-16S” method, which has “--p-trim-length” parameter that
can be set to truncate the reads at specific position for quality filtering or you can set that
parameter to -1 to disable trimming. The following script will perform deblur denoising:
mkdir deblur
qiime deblur denoise-16S \
--i-demultiplexed-seqs deblur/demux-yoga-merged.qza \
--p-trim-length -1 \
--p-jobs-to-start 4 \
--p-sample-stats \
--o-representative-sequences deblur/rep-seqs_yoga_deblur.qza \
--o-table deblur/table_yoga_deblur.qza \
--o-stats deblur/stats_yoga_deblur.qza
The “--p-jobs-to-start” is an optional parameter and you can set it to the number of jobs to
run in parallel. To learn about more options, use “qiime deblur denoise-16S --help”.
The feature table and representative sequences artifacts will be used in the downstream
analysis.
The deblur stats summary artifact contains useful information about the filtering and denoising. We can use the “visualize-stats” visualizer of the “deblur” plugin to generate a visualization file (Figure 7.10).
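A sketch of this step (the visualization file name is an assumption):
qiime deblur visualize-stats \
--i-deblur-stats deblur/stats_yoga_deblur.qza \
--o-visualization deblur/stats_yoga_deblur.qzv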
Both clustering and denoising (with DADA2 or deblur) generate feature table and representative sequence artifacts that can be used in the following analysis steps. You may need to visualize the feature table and the sequence data artifacts. The “q2-feature-table” plugin is used just for that. The “summarize” and “tabulate-seqs” methods are used to create visualization files for the feature table and the representative sequences, respectively. As examples, the following script creates visualization files for the feature tables produced by the de novo clustering and DADA2 denoising, respectively. The “summarize” method of the “feature-table” plugin takes a feature table artifact and the sample metadata file as input and outputs a visualization file. We created the sample metadata file “sample-metadata.tsv” earlier and saved it in the “data” subdirectory.
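A sketch of the two summaries (the visualization file names are assumptions):
qiime feature-table summarize \
--i-table denovo/table-yoga-denovo.qza \
--m-sample-metadata-file data/sample-metadata.tsv \
--o-visualization denovo/table-yoga-denovo.qzv
qiime feature-table summarize \
--i-table dada2/table_yoga_dada2.qza \
--m-sample-metadata-file data/sample-metadata.tsv \
--o-visualization dada2/table_yoga_dada2.qzv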
The html report of the feature table visualization, as displayed on the Internet browser
(Figure 7.11), has three tabs: Overview, Interactive Sample Details, and Feature Details
tab. Overview tab contains Table summary, Frequency per sample with a histogram, and
Frequency per feature with a histogram. Interactive Sample Detail tab (Figure 7.12) displays
a plot, a table containing sample IDs and feature counts, and Plot controls to control the
plot interactively by changing Metadata Category using the dropdown list and sampling
depth using Sampling Depth sliding bar. The metadata category list is obtained from the
sample metadata file which describes the sample groups. Feature Details tab (Figure 7.13)
shows the list of the features found in the samples. The table includes feature id, feature
frequency (abundance), and number of samples in which that feature is observed.
The “feature-table summarize” method can be used to visualize any feature table generated by any of the clustering or denoising methods.
We can also use “tabulate-seqs” to create a visualization file for the representative
sequence artifact generated by any of the clustering or denoising methods.
The following script creates a visualization file for the representative sequence artifact
produced with deblur denoising:
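A sketch of this step (the visualization file name is an assumption):
qiime feature-table tabulate-seqs \
--i-data deblur/rep-seqs_yoga_deblur.qza \
--o-visualization deblur/rep-seqs_yoga_deblur.qzv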
The tabular report of the representative sequences (Figure 7.14), as displayed on the Internet
browser, contains Sequence Length Statistics, Seven-Number Summary of Sequence
Lengths, and Sequence Table that includes ID, sequence length, and sequence for each fea-
ture. Moreover, each feature sequence is linked to a BLAST report showing the sequences
with significant alignments to the feature sequence.
FIGURE 7.14 The tabular report of the amplicon sequence variants (ASVs).
After you study the feature table visualization on the Internet browser, you may decide
to remove some samples or features from the feature table because of outliers or low abun-
dance. The “q2-feature-table” plugin has the “filter-samples” and “filter-features” methods
for these purposes. For example, you can remove samples based on their minimum total
frequency. The following script removes the samples with a total frequency less than 1000
from feature table created with DADA2 denoising and then it creates a visualization and
views it on the Internet browser:
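A sketch of this filtering step; the filtered artifact and visualization file names are assumptions:
qiime feature-table filter-samples \
--i-table dada2/table_yoga_dada2.qza \
--p-min-frequency 1000 \
--o-filtered-table dada2/table_yoga_dada2_filt.qza
qiime feature-table summarize \
--i-table dada2/table_yoga_dada2_filt.qza \
--m-sample-metadata-file data/sample-metadata.tsv \
--o-visualization dada2/table_yoga_dada2_filt.qzv
qiime tools view dada2/table_yoga_dada2_filt.qzv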
Try to study the feature table report after the above filtering.
The “filter-features” method is used to remove low abundance features from a feature
table. The following script removes features with a total abundance (across all samples) of
less than 20:
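A sketch, applied here to the sample-filtered DADA2 table created above (the artifact names are assumptions):
qiime feature-table filter-features \
--i-table dada2/table_yoga_dada2_filt.qza \
--p-min-frequency 20 \
--o-filtered-table dada2/table_yoga_dada2_filt2.qza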
You can use “qiime feature-table filter-samples --help” and “qiime feature-table filter-
features --help” to learn more about filtering parameters.
The preprocessing steps of the raw data that we followed above end with the creation of feature tables and features (OTUs/ASVs) that we will rely on to move to the downstream analysis (taxonomic classification, phylogenetic relationships, alpha and beta diversity analysis, and differential abundance). There are always questions that come up at this point: Which approach is better (clustering or denoising)? Which clustering method is the best (de novo, closed-reference, or open-reference clustering)? And which denoising method is better (DADA2 or deblur)? The answer from many experts is: try all of them and adopt the one that works for your data. There are a number of articles that discuss the pros and cons of each of these methods.
Alignment-based classifiers assign to each query sequence the consensus taxonomy at a given taxonomic level. Consensus is determined at each taxonomic level, beginning from kingdom and stopping when consensus is no longer met above a specified minimum consensus value. Then, the taxonomy is trimmed at this point.
To use an alignment-based classifier, you need to download, extract, and import a representative sequence database and a reference taxonomy database, as shown above in the de novo clustering section. The following script imports the “99_otus.fasta” and “99_otu_taxonomy.txt” reference files from GreenGenes into QIIME2 artifacts, which will be saved in the “taxonomy” subdirectory. To keep the files organized, we will create the “taxonomy” subdirectory to store the taxonomy files.
mkdir taxonomy
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/99_otus.fasta \
--output-path taxonomy/99_otus.qza
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path gg_13_8_otus/taxonomy/99_otu_taxonomy.txt \
--output-path taxonomy/99_otu_taxonomy.qza
Now, you can run taxonomy classification using the BLAST-based classifier (classify-
consensus-blast). The inputs are any representative sequence artifact (generated by any
of the clustering methods or denoising methods), the reference sequence artifact, and
the reference taxonomy artifact imported in the previous step. The following script uses
BLAST-based classifier to assign taxa to the representative sequences generated by DADA2
denoising:
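A sketch of this classification step; the output artifact name is an assumption, and, depending on the QIIME2 version, additional output options (or “--output-dir”) may be required:
qiime feature-classifier classify-consensus-blast \
--i-query dada2/rep-seqs_yoga_dada2.qza \
--i-reference-reads taxonomy/99_otus.qza \
--i-reference-taxonomy taxonomy/99_otu_taxonomy.qza \
--o-classification taxonomy/taxonomy-yoga-blast.qza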
If you wish to use the VSEARCH-based method instead, you can use the “classify-consensus-vsearch” method.
mkdir classifiers
wget -O “classifiers/gg-nb-99-classifier.qza” \
“https://data.qiime2.org/2021.11/common/gg-13-8-99-nb-
classifier.qza”
Once the download has been completed, use that classifier artifact as an input for “clas-
sify-sklearn” method together with the representative sequence artifact generated in the
clustering or denoising step. In the following, we will assign taxa to the representative
sequences generated by DADA2:
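A sketch of this step (the output artifact name is an assumption):
qiime feature-classifier classify-sklearn \
--i-classifier classifiers/gg-nb-99-classifier.qza \
--i-reads dada2/rep-seqs_yoga_dada2.qza \
--o-classification taxonomy/taxonomy-yoga-sklearn.qza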
Instead of using a pre-fitted one, we can train a classifier using “feature-classifier” plu-
gin, which has two methods for model fitting: “fit-classifier-naive-bayes” for the training
of a naïve bayes classifier and “fit-classifier-sklearn” for the training of any scikit-learn
classifier.
Next, we will train a Naive Bayes classifier using GreenGenes reference sequences and
then we will use the fitted classifier to assign taxa to the representative sequences generated
by a previous clustering or denoising step.
For training any classifier, we need a training dataset with known labels. In the case
of taxonomy classification, we need representative sequences with known taxa. For our
example, we can use GreenGenes 13_8 97% OTU dataset. Remember that we downloaded
GreenGenes database before and stored it in the “gg_13_8_otus” subdirectory. We will use
the representative sequences “gg_13_8_otus/rep_set/97_otus.fasta” and their correspond-
ing taxonomic classifications “gg_13_8_otus/taxonomy/97_otu_taxonomy.txt”. Since the
representative sequences are in a FASTA file and the taxonomy in a text file, we can import both of them as follows:
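A sketch of importing the training data, following the same pattern as the import of the 99% reference files above (the output artifact names are assumptions):
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/97_otus.fasta \
--output-path inputs/97_otus-ref-seqs.qza
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path gg_13_8_otus/taxonomy/97_otu_taxonomy.txt \
--output-path inputs/97_otu-ref-taxonomy.qza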
After importing the training datasets (sequences and taxonomy) into the “inputs” subdirectory, we can use the “fit-classifier-naive-bayes” method of the “feature-classifier” plugin to train the naïve Bayes classifier and save it in the “classifiers” subdirectory that we created earlier.
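A sketch of the training step; the input artifacts are the ones imported above, and the classifier artifact name matches the one used in the next paragraph:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads inputs/97_otus-ref-seqs.qza \
--i-reference-taxonomy inputs/97_otu-ref-taxonomy.qza \
--o-classifier classifiers/nb-gg-97-classifier.qza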
After fitting, you can use the classifier artifact “nb-gg-97-classifier.qza” with the “classify-
sklearn” method to assign taxa to our unclassified representative sequences.
In the above, we used both alignment-based classifiers (BLAST, VSEARCH) and machine learning classifiers (pre-fitted and newly fitted) to assign taxa to the unclassified representative sequences. The output of any of these classification steps is an artifact of the classified sequences. Whichever classifier is used for taxonomy assignment, the following applies to viewing the taxonomy results. A visualization file can be created from the resulting artifact using the “q2-metadata” plugin with the “tabulate” method as follows:
Visualizing the BLAST-based taxonomy assignment:
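A sketch, assuming the BLAST-based classification artifact created earlier:
qiime metadata tabulate \
--m-input-file taxonomy/taxonomy-yoga-blast.qza \
--o-visualization taxonomy/taxonomy-yoga-blast.qzv
qiime tools view taxonomy/taxonomy-yoga-blast.qzv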
You can compare the taxonomy classifications of the different classifiers. The feature table displayed on the Internet browser has three columns: Feature ID, Taxon, and Consensus (for an alignment-based classifier) or Confidence (for a machine learning classifier). The Taxon column indicates the taxonomy assignment for each feature (k__ for kingdom, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for species). For example, in Figure 7.15, the naïve Bayes classifier predicted the taxa of the first feature up to the family level “f__Coriobacteriaceae;” with a confidence of 0.994, but it did not assign a genus or a species to that feature. However, for the second feature, the classifier predicted taxa up to the species level with a confidence of 0.76. The confidence reflects the probability that the taxonomy assignment is correct.
Instead of confidence, alignment-based classifiers provide consensus that is based on
the agreement of the alignment hits. For example, a consensus of 1 means that all hits
aligned to the feature agreed on the taxa.
A taxonomy bar plot visualization can also be created with “taxa barplot” using the
taxonomy predicted by the classifiers and the filtered feature table artifact generated in the
clustering or denoising step as input.
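A sketch of the barplot command, assuming the filtered DADA2 feature table and the naïve Bayes taxonomy artifact created above:
qiime taxa barplot \
--i-table dada2/table_yoga_dada2_filt.qza \
--i-taxonomy taxonomy/taxonomy-yoga-sklearn.qza \
--m-metadata-file data/sample-metadata.tsv \
--o-visualization taxonomy/taxa-barplot-yoga.qzv
qiime tools view taxonomy/taxa-barplot-yoga.qzv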
The taxonomy barplot (Figure 7.16) provides an interactive graphic interface to view the
bacterial taxa distribution in each sample as shown by the color keys. The bar width can
be changed by the slider bar on the top left. You can also use the dropdown menus on
the top (taxonomic levels, color palette, and Sort Sample by) to change the view and view
taxa distribution at specific taxonomic level or sort sample by a taxon or an experimental
group.
Earlier, we discussed how to filter the feature table to remove samples or features (with
low frequency). However, after viewing the taxonomy report, you may decide to focus on
a specific taxon or taxa across all samples. To achieve that you can use “taxa filter-table”
to create a new feature table for the taxa that you wish to study. Filtering with this method can retain only specific taxa using “--p-include”, remove specific taxa using “--p-exclude”, or apply both options together. Multiple taxa are provided to an argument as a comma-separated list. By default, the terms provided to “--p-include” or “--p-exclude” will match if they are contained in a taxonomic annotation. However, if you need the terms to match the complete taxonomic annotation exactly, then you can add the “--p-mode exact” parameter. Run “qiime taxa filter-table --help” to read more about the usage and parameters. The output of the “taxa” plugin is a new feature table with the selected taxa. For example, assume that you wish to retain only features that contain a phylum-level classification. You can use “--p-include p__” as follows:
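A sketch, again assuming the filtered feature table and the taxonomy artifact created above:
qiime taxa filter-table \
--i-table dada2/table_yoga_dada2_filt.qza \
--i-taxonomy taxonomy/taxonomy-yoga-sklearn.qza \
--p-include p__ \
--o-filtered-table taxonomy/table-yoga-with-phyla.qza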
FIGURE 7.16 Taxonomy bar plot (the samples are sorted by group).
Assume that you intend to explore the species under the “p__Bacteroidetes” phylum. Thus, you can use “--p-include p__Bacteroidetes” to retain the features that contain this taxon.
A metagenomic study may have a specific design such as groups, treatment, and facto-
rial design. The groups are to be specified in the sample metadata file as discussed above.
Grouping samples in an analysis by a group or multiple groups can be achieved by “feature-
table group”. The following script creates a feature table in which the samples are grouped
by the “group” column of the sample metadata file:
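A sketch of the grouping step; the grouped-table artifact name is an assumption, and “sum” is used here to combine the feature counts of the samples within each group:
qiime feature-table group \
--i-table dada2/table_yoga_dada2_filt.qza \
--p-axis sample \
--m-metadata-file data/sample-metadata.tsv \
--m-metadata-column group \
--p-mode sum \
--o-grouped-table taxonomy/table-yoga-groupedby-group.qza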
You can then create a barplot visualization for the new feature table; however, you need to create a new metadata file in which the group levels (the values of the metadata column that was used for grouping) serve as the sample IDs. The metadata file “taxonomy/group_metadata.tsv” will look as shown in Figure 7.17.
FIGURE 7.17 Metadata file for sample grouping (group levels as sample ID).
FIGURE 7.18 Taxonomy bar plot for the distribution of taxa in each group.
The barplot can then be created and viewed as follows (Figure 7.18); the beginning of the command is reconstructed here, assuming the grouped feature table and taxonomy artifacts created above:
qiime taxa barplot \
--i-table taxonomy/table-yoga-groupedby-group.qza \
--i-taxonomy taxonomy/taxonomy-yoga-sklearn.qza \
--m-metadata-file taxonomy/group_metadata.tsv \
--o-visualization taxonomy/groupedby-group-yoga-barplot.qzv
qiime tools view taxonomy/groupedby-group-yoga-barplot.qzv
mkdir trees
qiime alignment mafft \
--i-sequences dada2/rep-seqs_yoga_dada2.qza \
--o-alignment trees/aligned-rep-seqs_yoga_dada2.qza
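The remaining de novo tree-building steps (masking the highly variable alignment positions, building the tree with FastTree, and midpoint rooting) are sketched below; the output artifact names are assumptions:
qiime alignment mask \
--i-alignment trees/aligned-rep-seqs_yoga_dada2.qza \
--o-masked-alignment trees/masked-aligned-rep-seqs_yoga_dada2.qza
qiime phylogeny fasttree \
--i-alignment trees/masked-aligned-rep-seqs_yoga_dada2.qza \
--o-tree trees/unrooted-tree-yoga_dada2.qza
qiime phylogeny midpoint-root \
--i-tree trees/unrooted-tree-yoga_dada2.qza \
--o-rooted-tree trees/rooted-tree-yoga_dada2.qza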
Now, we have created both unrooted and rooted phylogenetic trees using the de novo approach. The tree artifacts can be exported into the NEWICK tree file format, which can be viewed in any tree-viewing program, such as the ETE toolkit. In the following, we will use the “export” method of the “q2-tools” plugin to export the phylogenetic tree data into NEWICK tree files:
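A sketch of exporting the trees created in the “trees” directory (the artifact names follow the sketch above):
qiime tools export \
--input-path trees/unrooted-tree-yoga_dada2.qza \
--output-path trees/unrooted
qiime tools export \
--input-path trees/rooted-tree-yoga_dada2.qza \
--output-path trees/rooted
Alternatively, the “align-to-tree-mafft-fasttree” pipeline of the “phylogeny” plugin performs the alignment, masking, tree building, and rooting in a single command; the following commands run the pipeline and then export both trees: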
mkdir trees2
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences dada2/rep-seqs_yoga_dada2.qza \
--o-alignment trees2/rep-seqs_yoga_dada2_alignedTr.qza \
--o-masked-alignment trees2/rep-seqs_yoga_dada2_maskedTr.qza \
--o-tree trees2/unrooted-tree-yoga_dada2.qza \
--o-rooted-tree trees2/rooted-tree-yoga_dada2.qza
qiime tools export \
--input-path trees2/unrooted-tree-yoga_dada2.qza \
--output-path trees2/unrooted
qiime tools export \
--input-path trees2/rooted-tree-yoga_dada2.qza \
--output-path trees2/rooted
Before computing the alpha and beta diversity metrics, we first need to normalize the read counts across samples to adjust for any bias arising from the different sequencing depths and to make the comparisons meaningful. The normalization is performed by rarefying the counts of the feature table to a user-specified depth.
The lowest read count can be chosen as the user-defined depth. The lowest number of reads
is determined from a summary created from the feature table. The lowest count number
is then provided to the “--p-sampling-depth” parameter of the “diversity” plugin as a sam-
pling depth for all samples. Once the plugin command is executed, samples are drawn
without replacement so that each sample in the resulting table will have a total count equal
to that of sampling depth. Then, the alpha and beta metrics are computed. The following
script creates summary from the feature table to determine the lowest read number:
When we study the summary, we can observe that the lowest read number for the samples
is 955 sequences. So, we can set the --p-sampling-depth parameter to 955. This step will
sub-sample the counts in each sample without replacement so that each sample in the
resulting table will have a total count of 955.
The “diversity” plugin requires a phylogenetic tree and feature table artifacts and the
sample metadata file as inputs and it outputs the alpha and beta diversity metrics saved into
the specified output directory.
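A sketch of this step, assuming the rooted tree from the pipeline above, the filtered DADA2 feature table, and “diversity” as the output directory:
qiime diversity core-metrics-phylogenetic \
--i-phylogeny trees2/rooted-tree-yoga_dada2.qza \
--i-table dada2/table_yoga_dada2_filt.qza \
--p-sampling-depth 955 \
--m-metadata-file data/sample-metadata.tsv \
--output-dir diversity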
The metrics will be saved to the output directory. We can use these metrics to explore the microbial composition of the samples in the context of the grouping defined in the sample metadata.
We will test for associations between categorical metadata columns and alpha diversity
data. We will do that here for the Faith Phylogenetic Diversity (a measure of community
richness) and Shannon diversity. The following commands will test for significant differ-
ences in the alpha diversity measures of samples:
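A sketch of the two tests, assuming the metric artifact names produced by the core-metrics command in the “diversity” output directory:
qiime diversity alpha-group-significance \
--i-alpha-diversity diversity/faith_pd_vector.qza \
--m-metadata-file data/sample-metadata.tsv \
--o-visualization diversity/faith-pd-group-significance.qzv
qiime diversity alpha-group-significance \
--i-alpha-diversity diversity/shannon_vector.qza \
--m-metadata-file data/sample-metadata.tsv \
--o-visualization diversity/shannon-group-significance.qzv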
These commands will run all-group and pairwise Kruskal–Wallis tests (non-parametric
analysis of variance). The visualization files show boxplots and test statistics for each meta-
data grouping.
We will analyze sample composition (beta diversity group distances) in the context
of categorical metadata using PERMANOVA. Note: The qiime diversity beta-group-
significance command computes only one metadata grouping at a time, so to test the
differences between groups we have to indicate the appropriate column name from the
metadata file. In addition, if we call this command with --p-pairwise parameter, it will
perform pairwise tests that will allow us to determine which specific pairs of groups are
different from one another in terms of dispersion. We will apply a PERMANOVA to test
for significant differences of the weighted UniFrac metrics between the samples.
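A sketch of the PERMANOVA test on the weighted UniFrac distance matrix, grouping the samples by the “group” metadata column (the artifact names are assumptions based on the core-metrics output):
qiime diversity beta-group-significance \
--i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
--m-metadata-file data/sample-metadata.tsv \
--m-metadata-column group \
--p-pairwise \
--o-visualization diversity/weighted-unifrac-group-significance.qzv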
Finally, we will use the Emperor tool to explore the microbial community composition
using principal coordinate analysis (PCoA) plots in the context of sample metadata.
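The core-metrics command already produces Emperor visualizations; a sketch of generating one explicitly from the weighted UniFrac PCoA results is:
qiime emperor plot \
--i-pcoa diversity/weighted_unifrac_pcoa_results.qza \
--m-metadata-file data/sample-metadata.tsv \
--o-visualization diversity/weighted-unifrac-emperor.qzv
qiime tools view diversity/weighted-unifrac-emperor.qzv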
7.4 SUMMARY
Amplicon-based sequencing targets a specific marker gene that is able to distinguish species. Hence, it is used to identify the species in a sample that contains multiple microbes, such as environmental and clinical samples. The 16S rRNA gene is usually targeted in the
REFERENCES
1. Coughlan L, Cotter P, Hill C, Alvarez-Ordóñez A: Biotechnological applications of functional
metagenomics in the food and pharmaceutical industries. Front Microbiol 2015, 6.
2. Schwartsmann G, Brondani da Rocha A, Berlinck RG, Jimeno J: Marine organisms as a source
of new anticancer agents. Lancet Oncol 2001, 2(4):221–225.
3. Xiong ZQ, Wang JF, Hao YY, Wang Y: Recent advances in the discovery and development of
marine microbial natural products. Mar Drugs 2013, 11(3):700–717.
4. Sun Z, Li J, Dai Y, Wang W, Shi R, Wang Z, Ding P, Lu Q, Jiang H, Pei W et al: Indigo Naturalis
Alleviates Dextran Sulfate Sodium-Induced Colitis in Rats via Altering Gut Microbiota. Front
Microbiol 2020, 11: 731.
5. Blaxter M, Mann J, Chapman T, Thomas F, Whitton C, Floyd R, Abebe E: Defining opera-
tional taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci 2005,
360(1462):1935–1943.
6. Westcott SL, Schloss PD: De novo clustering methods outperform reference-based meth-
ods for assigning 16S rRNA gene sequences to operational taxonomic units. PeerJ 2015,
3:e1487.
7. Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, Chase J, McDonald
D, Gonzalez A, Robbins-Pianka A et al: Subsampled open-reference clustering creates
consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ 2014,
2:e545.
8. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP: DADA2:
High-resolution sample inference from Illumina amplicon data. Nat Methods 2016,
13(7):581–583.
9. Nearing JT, Douglas GM, Comeau AM, Langille MGI: Denoising the Denoisers: an
independent evaluation of microbiome sequence error-correction approaches. PeerJ 2018,
6:e5364.
10. Edgar RC: UNOISE2: improved error-correction for Illumina 16S and ITS amplicon
sequencing. bioRxiv 2016:081257.
11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol
Biol 1990, 215(3):403–410.
12. Rognes T, Flouri T, Nichols B, Quince C, Mahé F: VSEARCH: a versatile open source tool for
metagenomics. PeerJ 2016, 4:e2584.
13. Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, Brown CT, Porras-Alfaro A, Kuske
CR, Tiedje JM: Ribosomal Database Project: data and tools for high throughput rRNA analy-
sis. Nucleic Acids Res 2014, 42(Database issue):D633–642.
14. Olsen GJ: [53] Phylogenetic analysis using ribosomal RNA. In: Methods in Enzymology. vol.
164: Academic Press; 1988: 793–812.
15. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H,
Alm EJ, Arumugam M, Asnicar F et al: Reproducible, interactive, scalable and extensible
microbiome data science using QIIME 2. Nat Biotechnol 2019, 37(8):852–857.
16. Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A,
Gibbons SM, Ackermann G et al: A communal catalogue reveals Earth’s multiscale microbial
diversity. Nature 2017, 551(7681):457–463.
Chapter 8
Shotgun Metagenomic Data Analysis
8.1 INTRODUCTION
In the previous chapter, we discussed the amplicon-based metagenomic data analysis, which is based on the profiling of a single targeted gene, usually the 16S rRNA gene, in environmental or clinical samples. Many researchers argue that this approach is not metagenomic in nature because it focuses only on a single gene rather than on the entire genomes of the microbes in the samples. In this chapter, we will discuss the shotgun sequencing metagenomic approach, which involves the sequencing of the entire genomes of the microbes in the samples and, therefore, provides more insight into the microbial communities, their genetic profiles, their impact on hosts, and their association with the host phenotype. Shotgun sequencing of metagenomes is rather new, but it also emerged as a consequence of the progress in high-throughput sequencing technologies, which was followed by progress in the development of computational resources and tools capable of handling the massiveness and complexity of metagenomic sequencing data. Shotgun whole-genome metagenomic sequencing and data analysis are now used to quantify microbial communities and diversity, to assemble novel microbial genomes, to identify new microbial taxa and genes, to determine the metabolic pathways orchestrated by the microbial community, and more.
The metagenomic raw data produced by a high-throughput sequencer originates from either environmental or clinical samples that contain multiple microbial organisms, including bacteria, fungi, and viruses. Data originating from samples recovered from a host may be contaminated with the host genomic sequences. Multiple samples can also
be sequenced in a single run (multiplexing). In the multiplex sequencing, unique barcode
sequences identifying each sample are ligated to the DNA fragments in the DNA library
preparation step. Some library preparation kits allow multiplexing of hundreds of samples.
Illumina has multiple kits for library preparation, including Illumina DNA Prep, (M) tag-
mentation, which uses bead-linked transposomes in the tagmentation process to randomly
insert transposomes into the metagenomic DNA. The tagmentation is followed by PCR
using index primers which enables amplification and subsequent indexing of the sample
libraries (barcoding) to allow multiplexing. The library preparation is followed by sequenc-
ing and production of the raw data in FASTQ files. The steps of read quality assessment and
processing are, to some extent, similar to the steps discussed in Chapter 1. The purpose of
the quality control is to reduce sequence biases or artifacts by removing sequencing adap-
tors, trimming low-quality ends of the reads, and removing duplicate reads. If the DNA is
extracted from a clinical sample, an additional quality control step is required which is to
remove the contaminating host DNA or non-target DNA sequences. If we need to perform
between-sample differential diversity analysis, we may also need to draw a random sub-
sample from the original sample to normalize read counts.
After the step of quality control, there are two strategies that can be followed for the
metagenomic raw data. The first one is to assemble the metagenomes using de novo genome
assembly method and the second one is an assembly-free approach similar to amplicon-
based method. Each of these strategies may address different kinds of questions. The types
and algorithms of the de novo assembly were discussed in Chapter 3. However, in the
shotgun metagenomics, a new step is introduced. This step is called metagenomic bin-
ning, which aims to separate the assembled sequences by species so that the assembled
contigs in a metagenomic sample will be assigned to different bins in FASTA files. A bin
will correspond to only one genome. A genome built with the process of binning is called
Metagenome-Assembled Genome (MAG). Binning algorithms adopt several ways to per-
form binning. Some algorithms use taxonomic assignment and others use properties of
contigs like GC-content of the contigs, nucleotide composition, or the abundance. Binning
algorithms use two approaches for assigning contigs to species: supervised machine learn-
ing and unsupervised machine learning. Both approaches use similarity scores to assign
a contig to a bin. Since many of the microbial species have not been sequenced and hence
some of the reads may not map to reference genomes, it is good practice to not rely on
mapping to reference genomes. Binning based on the nucleotide composition of contigs has been found useful in separating genomes into possible species. The nucleotide composition of a contig is the frequency of k-mers in the contig, where k can be any reasonable integer (e.g., 3, 4, 5, …). It has been found that different genomes of microbial species have different k-mer frequencies that may discriminate the genomes into potential taxonomic groups. Machine learning algorithms such as naïve Bayes are used for taxonomic group assignment. However, features more powerful than the sequence composition alone are required to deal with the complexity in the sequences of contigs.
The unsupervised machine learning tools cluster contigs into bins without requiring prior
information. There are several binning programs using different algorithms. MetaBAT 2
[1] uses an adaptive binning algorithm that does not require manual parameter tuning as
the case with its previous version. Its algorithm consists of multiple aspects, including nor-
malized tetranucleotide frequency (TNF) scores, clustering, and steps to recruit smaller
contigs. Moreover, the computational efficiency has been increased compared to the previ-
ous version. MaxBin [2] uses nucleotide composition and contig abundance information
to group metagenomic contigs into different bins; each bin represents one species. MaxBin
algorithm uses tetranucleotide frequencies and scaffold coverage levels to estimate the
[Figure: shotgun metagenomics workflow — sequencing produces reads, assembly produces contigs, and binning separates the contigs into bins (Bin 1–Bin 4).]
Six FASTQ files will be saved in the “fastqdir”. Use “ls fastqdir/” to display the content of
that directory to make sure the files are there.
The quality control at this stage aims to trim the low-quality ends of the reads and to remove adaptors and duplicates. Refer to Chapter 1 for detailed information about this step. For multiplexed data, you need to perform demultiplexing before you do the quality control. Multiplexing and demultiplexing are discussed in
Chapter 7. The FASTQ files, which we have downloaded, had already been processed and
they contain reads of good quality. You can check their quality with FastQC as follows:
fqs=$(ls fastqdir/*.fastq)
fastqc $fqs
htmls=$(ls fastqdir/*.html)
firefox $htmls
The above commands will display the quality control report of the six FASTQ files on the
Firefox browser. Check on the six tabs to study the reports.
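The commands that download the human reference genome and build the Bowtie2 index are not shown above; a minimal sketch, assuming the hg19 reference FASTA has already been downloaded and decompressed to “ref/hg19.fa”, is:
bowtie2-build --threads 4 ref/hg19.fa ref/hg19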
mkdir sam
bowtie2 -p 4 \
-x ref/hg19 \
-1 fastqdir/ERR1823587_1.fastq \
-2 fastqdir/ERR1823587_2.fastq \
-S sam/ERR1823587.sam
bowtie2 -p 4 \
-x ref/hg19 \
-1 fastqdir/ERR1823601_1.fastq \
-2 fastqdir/ERR1823601_2.fastq \
-S sam/ERR1823601.sam
bowtie2 -p 4 \
-x ref/hg19 \
-1 fastqdir/ERR1823608_1.fastq \
-2 fastqdir/ERR1823608_2.fastq \
-S sam/ERR1823608.sam
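The conversion of the SAM files into BAM files is not shown above; a sketch using Samtools is:
samtools view -b -@ 4 -o sam/ERR1823587.bam sam/ERR1823587.sam
samtools view -b -@ 4 -o sam/ERR1823601.bam sam/ERR1823601.sam
samtools view -b -@ 4 -o sam/ERR1823608.bam sam/ERR1823608.sam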
Remember that the BAM files contain mapped and unmapped reads.
samtools view \
-b -f 12 \
-F 256 \
sam/ERR1823587.bam \
> sam/ERR1823587_unmapped.bam
samtools view \
-b -f 12 \
-F 256 sam/ERR1823601.bam \
> sam/ERR1823601_unmapped.bam
samtools view \
-b -f 12 \
-F 256 sam/ERR1823608.bam \
> sam/ERR1823608_unmapped.bam
The “-f 12” option is used to extract only the unmapped forward and reverse reads and “-F
256” option is used to exclude secondary alignments. Refer to Chapter 2 for FLAG field of
the SAM/BAM file.
The above Samtools commands separate unmapped reads, which represent the
pure metagenomic data, in the separate BAM files “ERR1823587_unmapped.bam”,
“ERR1823601_unmapped.bam”, and “ERR1823608_unmapped.bam”.
samtools sort \
-n -m 5G \
-@ 2 sam/ERR1823587_unmapped.bam \
-o sam/ERR1823587_unmapped_sorted.bam
samtools sort \
-n -m 5G \
-@ 2 sam/ERR1823601_unmapped.bam \
-o sam/ERR1823601_unmapped_sorted.bam
samtools sort \
-n -m 5G \
-@ 2 sam/ERR1823608_unmapped.bam \
-o sam/ERR1823608_unmapped_sorted.bam
Then, we create FASTQ files from the BAM files and store them in a new directory “fastq_
pure” so that we can use them in the next steps of the downstream analysis.
mkdir fastq_pure
samtools fastq -@ 4 sam/ERR1823587_unmapped_sorted.bam \
-1 fastq_pure/ERR1823587_pure_R1.fastq.gz \
-2 fastq_pure/ERR1823587_pure_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
samtools fastq -@ 4 sam/ERR1823601_unmapped_sorted.bam \
-1 fastq_pure/ERR1823601_pure_R1.fastq.gz \
-2 fastq_pure/ERR1823601_pure_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
samtools fastq -@ 4 sam/ERR1823608_unmapped_sorted.bam \
-1 fastq_pure/ERR1823608_pure_R1.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
The saved FASTQ files contain the reads of the metagenomic data after removing the contaminating human sequences.
If you run FastQC on those files, you will find that the read lengths vary between 35 and 151 bp. We can remove any read pairs shorter than 50 bp. Removing such reads from paired-end FASTQ files requires a program or a script that removes reads from both files of a pair without leaving singletons, which may be rejected by some programs used in the downstream analysis. Thus, we need to filter reads by length using an appropriate program. There may be programs that can do this, but here we use a bash script for this purpose. First, change into “fastq_pure” and run the following script to decompress the FASTQ files:
cd fastq_pure
for i in $(ls *.gz);
do
gzip -d ${i}
done
Then, open a text editor of your choice such as “vim or nano” and save the following bash
script in a file “remove_PE.sh”:
vim remove_PE.sh
Then, copy the bash script into the file, save it, and exit:
#!/bin/sh
#Use: remove_PE.sh R1.fastq R2.fastq 80
#1. Start with inputs
fq_r1=$1
fq_r2=$2
minLength=$3
#2. Find all entries with read length less than the minimum length and print their line numbers, for both R1 and R2
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r1 > temp.lines1
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r2 >> temp.lines1
#3. Combine files into one, sort them numerically, and collapse redundant entries
sort -n temp.lines1 | uniq > temp.lines
rm temp.lines1
outfq1=$(echo $fq_r1 | cut -d'.' -f 1)
outfq2=$(echo $fq_r2 | cut -d'.' -f 1)
#4. Remove the line numbers recorded in "lines" from both fastqs
awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
temp.lines $fq_r1 \
> $outfq1-$minLength.fastq
awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
temp.lines $fq_r2 \
> $outfq2-$minLength.fastq
gzip $outfq1-$minLength.fastq
gzip $outfq2-$minLength.fastq
rm temp.lines
Once you have saved the file, you may need to make the file executable by using the Linux
command “chmod”:
chmod +x remove_PE.sh
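The script is then run on each pair of FASTQ files; a sketch using the 50 bp minimum length mentioned above (the input file names follow the previous steps):
bash remove_PE.sh ERR1823587_pure_R1.fastq ERR1823587_pure_R2.fastq 50
bash remove_PE.sh ERR1823601_pure_R1.fastq ERR1823601_pure_R2.fastq 50
bash remove_PE.sh ERR1823608_pure_R1.fastq ERR1823608_pure_R2.fastq 50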
Up to this step, we have removed the host sequences from the metagenomic data, and the filtered reads are stored in "ERR1823587_pure_R1-50.fastq.gz" and "ERR1823587_pure_R2-50.fastq.gz" for the sample of the healthy person, "ERR1823601_pure_R1-50.fastq.gz" and "ERR1823601_pure_R2-50.fastq.gz" for the moderate sickle cell patient, and "ERR1823608_pure_R1-50.fastq.gz" and "ERR1823608_pure_R2-50.fastq.gz" for the severe sickle cell patient. To save some storage space, you can delete the other FASTQ files using "rm *.fastq" and also delete all files in "fastqdir".
The metagenomic FASTQ files are stored in "fastq_pure" as shown in Figure 8.2. Above, we deleted the original FASTQ files from the "fastqdir" directory and also the intermediate FASTQ files from the "fastq_pure" directory. You can also delete the SAM and BAM files from the "sam" directory and the reference sequences and indexes from the "ref" directory if you want to save storage space. However, you are advised to keep the reference genome files in "ref", as you may need to repeat these steps and indexing usually takes a long time.
Removing the host sequences and the very short reads will allow more accurate taxonomic group assignment. There are several programs for assembly-free classification and profiling of microbial communities in metagenomic samples. Kaiju [3] uses the NCBI taxonomy and RefSeq databases to find maximum matches to the reads at the protein level using the Burrows–Wheeler transform (BWT). CLARK [4] (CLAssifier based on Reduced K-mers) creates a large index of k-mers from all target sequences and then removes the k-mers that are common among targets so that each target is described by unique k-mers, which are used for taxonomic classification. Kraken [5] creates k-mers from the reads and uses a taxonomy tree to discriminate closely related microbes based on the classification tree and paths. These programs are just examples, and there are others with different algorithms. Centrifuge [6] is a rapid classifier that requires little memory and a relatively small index (only 5.8 GB for bacterial, viral, and human genomes) on desktop computers compared to the others. Centrifuge uses an indexing scheme based on the BWT and the Ferragina–Manzini (FM) index.
Most taxonomy classifiers of metagenomic data use genomic databases of known species to construct an index and then use that index to assign taxa to the metagenomic reads. The majority of the classifiers require a large storage space for the database files and a large memory for the indexing and classification processes. Kaiju and Kraken require a lot of memory (around 128 GB–512 GB). Therefore, we recommend using these classifiers only if you have enough computational resources. To use any of these classifiers, you need to download and build an index and then perform the classification.
Kaiju installation instructions are available at "https://github.com/bioinformatics-centre/kaiju". You can install it by running the following commands:
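A typical build from source, following the instructions in the repository (paths may differ on your system), is a minimal sketch like this:
git clone https://github.com/bioinformatics-centre/kaiju.git
cd kaiju/src
make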
Then, you need to add Kaiju to your path by adding the following line to the ".bashrc" file, replacing YOUR_PATH with the path where you installed the program:
export PATH="YOUR_PATH/kaiju/bin":$PATH
You must restart the terminal or use "source ~/.bashrc" to make the change active. Run the "kaiju" command to check whether it has been installed.
Before using kaiju, you need to download the refseq database from the NCBI or you
can download it from the kaiju website at “https://kaiju.binf.ku.dk/server”. To download it
from the NCBI database, use the following:
mkdir kaijudb
cd kaijudb
kaiju-makedb -s refseq
The download will take a long time and require a large amount of storage space. When the database has been built, make sure that the "nodes.dmp", "kaiju_db_refseq.fmi", and "names.dmp" files are present in the "kaijudb" directory. You may need to decompress
“kaiju_db_refseq_xxxx-xx-xx.tgz”. To classify the short reads in our FASTQ files, you need
to run the following:
mkdir kaiju_output
kaiju -t kaijudb/nodes.dmp \
-f kaijudb/kaiju_db_refseq.fmi \
-i fastq_pure/ERR1823587_pure_R1-50.fastq.gz \
-j fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
-o kaiju_output/ERR1823587.out \
-a greedy \
-z 4 -v
kaiju -t kaijudb/nodes.dmp \
-f kaijudb/kaiju_db_refseq.fmi \
-i fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
-j fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
-o kaiju_output/ERR1823601.out \
-a greedy \
-z 4 -v
kaiju -t kaijudb/nodes.dmp \
-f kaijudb/kaiju_db_refseq.fmi \
-i fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
-j fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
-o kaiju_output/ERR1823608.out \
-a greedy \
-z 4 -v
To learn more about these options, run "kaiju" without arguments. The indexing and classification require around 128 GB of RAM. We do not recommend using Kaiju unless you have enough memory and storage space.
After running the program successfully, you will need to convert the kaiju output file
into a summary table using “kaiju2table” command as follows:
kaiju2table -t kaijudb/nodes.dmp \
-n kaijudb/names.dmp \
-r taxonomic_level \
-o kaiju_output/ERR1823587_table.tsv \
kaiju_output/ERR1823587.out \
-l taxonomic,levels,separated,by,commas
kaiju2table -t kaijudb/nodes.dmp \
-n kaijudb/names.dmp \
-r taxonomic_level \
-o kaiju_output/ERR1823601_table.tsv \
kaiju_output/ERR1823601.out \
-l taxonomic,levels,separated,by,commas
kaiju2table -t kaijudb/nodes.dmp \
-n kaijudb/names.dmp \
-r taxonomic_level \
-o kaiju_output/ERR1823608_table.tsv \
kaiju_output/ERR1823608.out \
-l taxonomic,levels,separated,by,commas
Run "kaiju2table" without arguments to learn about the usage and options of this command; for example, "-r genus" produces a summary at the genus level.
Most metagenomic taxonomy classifiers follow the same steps: downloading or building a database and then performing the classification. For almost all of them, these steps require a large storage space and memory that may not be available on a regular desktop computer. However, if we do not have enough computational resources, we can use Centrifuge, which requires a relatively small storage space and memory that fit a personal computer.
The Centrifuge classifier is available at "https://github.com/infphilo/centrifuge". For the updated installation instructions, visit that site. At the time of writing, you can install it on Linux using the following commands:
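A typical build from source, following the Centrifuge manual (the install prefix is only an example and can be changed), looks like this:
git clone https://github.com/infphilo/centrifuge
cd centrifuge
make
sudo make install prefix=/usr/local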
If it has been installed successfully, you do not need to do anything else and can run it from any directory. Run "centrifuge -h" to display the usage and options.
As usual, to use the Centrifuge classifier, we will begin by building an index. There are several ready-to-use indexes available at "http://www.ccb.jhu.edu/software/centrifuge". However, Centrifuge also needs sequence files, taxonomy files, and a sequence ID to taxonomy ID conversion table. Building an index can be simplified by using the "make" command, which can build several standard and custom indexes. To do that, change into the "indices" directory inside the Centrifuge source directory and then run one of the following "make" targets, depending on the index you want:
cd indices
make p+h+v             # bacterial, human, and viral genomes [~12G]
make p_compressed      # bacterial genomes compressed at the species level [~4.2G]
make p_compressed+h+v  # combination of the two above [~8G]
The chosen command will download the reference taxonomy files and the reference genomes at the assembly level. The download may take a while depending on the speed of your Internet connection. It is also easy to download a pre-built database from the Centrifuge home page; the manual is available at "https://ccb.jhu.edu/software/centrifuge/manual.shtml". Centrifuge is then used to assign taxa to the short reads in the FASTQ files. For the "-x" option, make sure that you provide the index name with its path if it is not in the current directory.
mkdir centrifuge_out
centrifuge -x p+h+v \
-1 fastq_pure/ERR1823587_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
--report-file centrifuge_out/ERR1823587-report.txt \
-S centrifuge_out/ERR1823587-results.txt
centrifuge -x p+h+v \
-1 fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
--report-file centrifuge_out/ERR1823601-report.txt \
-S centrifuge_out/ERR1823601-results.txt
centrifuge -x p+h+v \
-1 fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
--report-file centrifuge_out/ERR1823608-report.txt \
-S centrifuge_out/ERR1823608-results.txt
The results are saved in the "*-results.txt" files. Each read classified by Centrifuge results in a single line of output. The output lines consist of eight tab-delimited fields: (1) the read ID (from the FASTQ file); (2) the sequence ID of the database sequence; (3) the taxonomic ID of the database sequence; (4) the classification score (a weighted sum of hits); (5) the score for the next-best classification; (6) the number of base pairs of the read that match the database sequence; (7) the length of the read or the combined length of the mate pair; and (8) the number of classifications reported for this read.
The "*-report.txt" files contain summaries of the identified taxa and their abundances. Each line in the file consists of seven tab-delimited fields: the name of a genome, taxonomic ID, taxonomic rank (kingdom, genus, family, etc.), genome size in bp, number of reads classified to this genomic sequence including multi-classified reads, number of reads uniquely classified to this genomic sequence, and the abundance proportion, as shown in Figure 8.2. The Centrifuge report shows that the metagenomic reads have been assigned to taxonomic groups at different ranks. This report can be further analyzed to filter the most significant taxa based on their summary statistics.
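For instance, since the report is tab-delimited, you can list the most abundant taxa with standard Linux commands; the following minimal sketch skips the header line and sorts by the abundance column (the seventh field):
tail -n +2 centrifuge_out/ERR1823587-report.txt | sort -t$'\t' -k7,7nr | head -20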
To check that metaSPAdes [7] is available on your system, run it without arguments:
metaspades.py
This will display the usage and options of the metaSPAdes program. Otherwise, you may need to install the program following its installation instructions.
The following metaSPAdes commands perform de novo metagenome assembly for the three samples, using the metagenomic FASTQ files as input:
mkdir metag_healthy
metaspades.py \
-o metag_healthy \
-1 fastq_pure/ERR1823587_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51
mkdir metag_moderate
metaspades.py \
-o metag_moderate \
-1 fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51
mkdir metag_severe
metaspades.py \
-o metag_severe \
-1 fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51
Run "metaspades.py --help" to read about the usage and options of this program.
Several files are produced in the output directories: "metag_healthy", "metag_moderate", and "metag_severe". The files that contain the assembled sequences are "contigs.fasta" and "scaffolds.fasta". Contigs are made from read overlaps. The contigs are then ordered, oriented, and connected, with gaps filled with Ns, to form the scaffolds. The "K51" directory contains the individual result files for the assembly with 51-mers. When multiple K directories are present, the best assembled sequences are the ones stored outside these K directories. The directory "misc" contains the broken scaffolds.
The file with the ".gfa" extension is in the Graphical Fragment Assembly (GFA) format, in which the sequences are represented by lines starting with "S" and the overlaps between sequences are represented by link lines starting with "L", as shown in Figure 8.3. The plus (+) and minus (−) signs indicate whether the overlapping sequence is the original or its reverse complement. The value of the form "<N>M" at the end of a link line is a CIGAR string indicating the overlap length.
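To peek at these records from the command line, you can print the first few segment and link lines; the output shown in the comments is only illustrative (segment names will differ in your assembly):
grep -E "^(S|L)" metag_healthy/assembly_graph_with_scaffolds.gfa | head -4
# S   NODE_1_length_52430_cov_11.2   ACGTT...
# L   NODE_1_length_52430_cov_11.2   +   NODE_7_length_9012_cov_8.4   -   50M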
Thus, the file "assembly_graph_with_scaffolds.gfa" generated by metaSPAdes is the GFA file that represents the final assembly of the metagenomes in the sample. SPAdes builds this assembly graph from k-mers formed from the reads (vertices) and their overlaps (edges). Then, the assembler resolves paths across the assembly graph and outputs the non-branching paths as contigs.
This GFA file can be visualized by graph visualization programs like Bandage [8]. The Bandage program is available at "http://rrwick.github.io/Bandage/" and can be downloaded and installed on Linux, Windows, and macOS. Visualizing the graph file will give you an idea about the assembly quality, and you can zoom in and out and perform many other operations on it. Moreover, there are good graph visualization packages in Python and R, such as igraph [9], which is available on both programming platforms. To read more about Bandage and igraph, refer to their user manuals, which are available on their web pages.
We can then assess the quality of the three metagenome assemblies with MetaQUAST [10], passing the scaffold files of the three samples and a minimum contig length of 500 bp:
metaquast.py -t 4 \
-m 500 \
metag_healthy/scaffolds.fasta \
metag_moderate/scaffolds.fasta \
metag_severe/scaffolds.fasta \
-o output
This will generate an HTML report in the "output" directory. The other directories and files are linked to this report when it is displayed in a web browser. You can display the "report.html" file by using "firefox report.html".
The assembly evaluation report contains important statistics that reflect the quality of the assemblies. Figure 8.4 shows a partial evaluation report for the three samples. The colored heatmap indicates the quality from the worst (red) to the best (blue). On the top, there are links that take you to the graph of each sample, as shown in Figure 8.5. The graphs show the key identified bacterial taxa and their abundance. Refer to the program user manual, which is available at "http://cab.cc.spbu.ru/quast/manual.html", to read more about the program use and the different report sections, and refer to Chapter 3 to read more about the de novo assembly evaluation metrics.
Next, we copy the scaffold files of the three assemblies into a new directory "assemblies" with new names so that we can map the reads of each sample back to its own assembly:
mkdir assemblies
cp metag_healthy/scaffolds.fasta assemblies/healthy_scaffolds.fasta
cp metag_moderate/scaffolds.fasta assemblies/moderate_scaffolds.fasta
cp metag_severe/scaffolds.fasta assemblies/severe_scaffolds.fasta
Then, we can use Samtools and Bowtie2 to build an index for the "scaffolds.fasta" file of each sample.
cd assemblies
for i in $(ls *.fasta);
do
samtools faidx ${i}
done
bowtie2-build healthy_scaffolds.fasta healthy
bowtie2-build moderate_scaffolds.fasta moderate
bowtie2-build severe_scaffolds.fasta severe
cd ..
Once the indexes have been built and we have changed back to the parent directory, we can use Bowtie2 to align the FASTQ reads to their respective assemblies.
mkdir sam_assemblies
bowtie2 --sensitive-local \
-p 4 \
-x assemblies/healthy \
-1 fastq_pure/ERR1823587_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
-S sam_assemblies/ERR1823587_healthy.sam
bowtie2 --sensitive-local \
-p 4 \
-x assemblies/moderate \
-1 fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
-S sam_assemblies/ERR1823601_moderate.sam
bowtie2 --sensitive-local \
-p 4 \
-x assemblies/severe \
-1 fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
-S sam_assemblies/ERR1823608_severe.sam
Notice that Bowtie2 prints some alignment statistics when the alignment process finishes for each sample.
Next, we convert the SAM files to BAM files and then sort the alignments in the BAM files.
cd sam_assemblies
samtools view -S -b ERR1823587_healthy.sam > ERR1823587_healthy.bam
samtools view -S -b ERR1823601_moderate.sam > ERR1823601_moderate.bam
samtools view -S -b ERR1823608_severe.sam > ERR1823608_severe.bam
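The sorting, indexing, and statistics steps can be sketched for one sample as follows (we are still inside "sam_assemblies"; the ".bam.sorted" name matches the file names used in the binning commands below, the idxstats output file name is only an example, and the same commands should be repeated for the other two samples):
samtools sort -@ 4 -o ERR1823587_healthy.bam.sorted ERR1823587_healthy.bam
samtools index ERR1823587_healthy.bam.sorted
samtools idxstats ERR1823587_healthy.bam.sorted > ERR1823587_healthy_idxstats.txt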
As shown in the sketch above, we index the sorted BAM files using the "samtools index" command and then use "samtools idxstats" to generate some statistics from them.
The output of the “samtools idxstats” command is a TAB-delimited file with each line con-
sisting of the reference sequence name, sequence length, number of mapped read-segments,
and number of unmapped read-segments. From those files, we can generate an abundance table similar to the OTU (operational taxonomic unit) table generated from clustering of the amplicon-based reads in Chapter 7. For this purpose, we can use the "get_count_table.py" script, which can be cloned from its GitHub repository. The script is written in Python 2, so if you do not have Python 2 installed on your computer, you may need to install it before generating an abundance table for each sample.
We will use the output of this script for binning in the next step.
8.2.7 Binning
Above, we discussed binning as the process of separating the sequences into bins that
represent the most likely taxa. There are many programs that can do this job including
metabat2, CONCOCT, and MaxBin. Here, we will use metabat2 as an example. Metabat2
is easier to install on Anaconda or Miniconda.
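If you use conda, a typical installation from the Bioconda channel (assuming Bioconda has been set up on your system) is:
conda install -c bioconda metabat2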
Before we perform the taxonomic binning, we need to generate the sequence depth from the sorted BAM files produced by mapping the metagenomic FASTQ reads to the de novo assemblies. For this purpose, we use the sorted BAM file of each sample as input to the "jgi_summarize_bam_contig_depths" program, which is distributed with MetaBAT 2 and produces a file of five columns: contig name, contig length, total average depth, mean depth, and variance.
mkdir stats_metabat
jgi_summarize_bam_contig_depths \
--outputDepth stats_metabat/healthy_depth.txt \
sam_assemblies/ERR1823587_healthy.bam.sorted
jgi_summarize_bam_contig_depths \
--outputDepth stats_metabat/moderate_depth.txt \
sam_assemblies/ERR1823601_moderate.bam.sorted
jgi_summarize_bam_contig_depths \
--outputDepth stats_metabat/severe_depth.txt \
sam_assemblies/ERR1823608_severe.bam.sorted
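To verify the format of the depth files, you can inspect the first few lines of one of them; for example:
head -3 stats_metabat/healthy_depth.txt | column -t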
Then, we can perform binning on the "contigs.fasta" files produced by the de novo assembly above. We copy these files into a new directory "binning" with new names.
mkdir binning
cp metag_healthy/contigs.fasta binning/healthy_contigs.fasta
cp metag_moderate/contigs.fasta binning/moderate_contigs.fasta
cp metag_severe/contigs.fasta binning/severe_contigs.fasta
The next step is to separate the contigs in the contigs files into bins; each bin represents a
species. The bins of the three samples are saved in different subdirectories inside “binning”
directory.
mkdir binning/healthy
metabat2 -i binning/healthy_contigs.fasta \
-a stats_metabat/healthy_depth.txt \
-o binning/healthy/healthy \
-t 4 -v --seed 123
mkdir binning/moderate
metabat2 -i binning/moderate_contigs.fasta \
-a stats_metabat/moderate_depth.txt \
-o binning/moderate/moderate \
-v --seed 123
mkdir binning/severe
metabat2 -i binning/severe_contigs.fasta \
-a stats_metabat/severe_depth.txt \
-o binning/severe/severe \
-v --seed 123
The "-i" option specifies the input contigs FASTA file, the "-a" option specifies the depth file that contains the contig depth averages and variances, "-o" specifies the output path and prefix, "-v" turns on verbose output, and "--seed" specifies a seed integer so that the same results can be reproduced.
Up to this point, we have performed the taxonomic binning successfully, and now we have separate genomes for each potential species in the metagenomic sample. However, we do not know the quality of these genomes or to which microbial species they belong. So, in the next step, we must evaluate these genomic sequences and assess their completeness with regard to protein-coding genes and their annotations.
You need to follow the installation instructions on the Pplacer home page, which is available at "https://matsen.fhcrc.org/pplacer/", and add it to the Linux path. Then, you can install CheckM [11] with the following commands:
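A minimal installation sketch using pip is shown below; note that CheckM also requires its reference data files, which must be downloaded separately and registered with the program (see the CheckM documentation for the current data location and setup procedure):
pip3 install checkm-genome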
Now, we can run CheckM to assess the completeness and contamination of the genome bins by using lineage-specific marker sets. This workflow consists of several steps that include placing bins in a reference genome tree, assessing the phylogenetic markers found in each bin, and inferring lineage-specific marker sets for each bin. These steps can be done with multiple CheckM commands, but they can also be done in a single step by using the "lineage_wf" command.
mkdir checkM_out
mkdir checkM_out/healthy
checkm lineage_wf \
-t 4 \
-x fa \
-f checkM_out/healthy_checkm_report.txt \
binning/healthy \
checkM_out/healthy
mkdir checkM_out/moderate
checkm lineage_wf \
-t 4 \
-x fa \
-f checkM_out/moderate_checkm_report.txt \
binning/moderate \
checkM_out/moderate
mkdir checkM_out/severe
checkm lineage_wf \
-t 4 \
-x fa \
-f checkM_out/severe_checkm_report.txt \
binning/severe \
checkM_out/severe
The report (Figure 8.6) shows the bin ID, marker lineage (taxonomic rank), # genomes (number of genomes used to infer the marker sets), # markers (number of marker genes), # marker sets (number of sets within the inferred markers), 0–5+ (number of times each marker gene is identified), completeness (estimated from the presence/absence of the marker genes), contamination (estimated from marker genes found in multiple copies), and strain heterogeneity (high heterogeneity indicates that the contamination is from one or more closely related organisms).
In Figure 8.6, for the moderate sample, we can notice that for the bin "moderate.4", 5449 bacterial genomes were used to infer 104 marker genes; only 6 of these genes were found in the bin, the completeness is 10.34%, and there was no contamination. For more details about the use of CheckM and its report, refer to the program home page at "https://ecogenomics.github.io/CheckM/".
In the final step, the predicted ORFs, their annotations, and the translated polypeptides are written to files. Gene annotation of a new genome assembly is an important step. Since bacteria have no introns, prediction of ORFs is easier than in eukaryotic genomes. There are many programs for ORF prediction, but Prodigal [12] is the most commonly used one. We have installed Prodigal above. Prodigal can predict ORFs in any genomic sequence; thus, we can predict the ORFs in the assemblies separated by binning. In the following, we will predict the ORFs in one of the bins recovered from the sample of the patient with severe sickle cell disease:
mkdir prod_out
prodigal -a prod_out/severe.faa \
-d prod_out/severe.fnt \
-o prod_out/severe.gbk \
-s prod_out/severe_genes.txt \
-i binning/severe/severe.1.fa \
-p single
The "-a" option specifies the FASTA file name for the polypeptides (proteins) translated from the predicted ORFs. The "-d" option specifies the FASTA file name of the nucleotide sequences of the predicted ORFs. The "-o" option specifies the output file with the predicted ORFs as features in GenBank-like format. The "-s" option writes all potential genes, with their scores, to the specified file. The "-i" option specifies the input file, which is the assembly (here, a single bin). The "-p" option specifies the procedure, which is either "single" for a single genome or "meta" for a metagenomic assembly that may include the genomes of multiple species.
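As a quick sanity check (the file name comes from the command above), you can count the number of predicted protein sequences:
grep -c ">" prod_out/severe.faa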
8.3 SUMMARY
The metagenomic DNA is isolated from environmental samples or clinical samples in
which several microbes are present. Unlike targeted gene sequencing, shotgun metage-
nomic sequencing allows researchers to sequence the whole genomes of all organisms pres-
ent in a sample and to evaluate the microbial diversity and abundance.
Shotgun metagenomic sequencing attempts to sequence the whole genomes of a large and diverse number of microbes, each with a different genome size. Long reads produced by PacBio and Oxford Nanopore are preferred; however, they usually have a higher error rate than short reads. Since there are several species in the metagenomic sample, there must be sufficient sequencing depth to allow assembling the genomes of all species in the sample.
Before analysis, we should make sure that we have fixed any quality problem by trim-
ming adaptors, filtering out low-quality reads, and removing technical sequences. In the
case of clinical samples, we should also remove the host DNA by aligning reads to the host
genome and then separate the unaligned reads in new FASTQ files to be used in the analy-
sis. There are two approaches for shotgun metagenomic data analysis: assembly-free analysis and de novo assembly. The assembly-free approach does not require assembling the
genomes of the species in the sample; it uses reads present in the metagenomic samples
to assign taxonomic groups by identifying unique genomic regions in the reads. Most of
the programs used for taxonomy assignment require a large amount of memory and stor-
age space. The second approach uses de novo algorithms to assemble the genomes of the
species in the sample. This method provides information on the total genomic DNA from
all species in a sample. After running de novo assembly, binning is used to cluster con-
tigs that apparently originated from the same species population. The recovered genome
sequences can be annotated by identifying protein-coding sequence of the ORFs.
Shotgun metagenomic sequencing is widely used to study microorganisms that are difficult or impossible to culture and analyze in the laboratory.
REFERENCES
1. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z: MetaBAT 2: an adaptive binning
algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ
2019, 7:e7359.
2. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW: MaxBin: an automated binning
method to recover individual genomes from metagenomes using an expectation-maximiza-
tion algorithm. Microbiome 2014, 2(1):26.
3. Menzel P, Ng KL, Krogh A: Fast and sensitive taxonomic classification for metagenomics with
Kaiju. Nat Commun 2016, 7(1):11257.
4. Ounit R, Wanamaker S, Close TJ, Lonardi S: CLARK: fast and accurate classification of
metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015,
16(1):236.
5. Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biol 2014, 15(3):R46.
6. Kim D, Song L, Breitwieser FP, Salzberg SL: Centrifuge: rapid and sensitive classification of
metagenomic sequences. Genome Res 2016, 26(12):1721–1729.
7. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA: metaSPAdes: a new versatile metagenomic
assembler. Genome Res 2017, 27(5):824–834.
8. Wick RR, Schultz MB, Zobel J, Holt KE: Bandage: interactive visualization of de novo genome
assemblies. Bioinformatics 2015, 31(20):3350–3352.
9. Csárdi G, Nepusz T: The igraph software package for complex network research. InterJournal, Complex Systems 2006, 1695.
10. Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A: Versatile genome assembly
evaluation with QUAST-LG. Bioinformatics 2018, 34(13):i142–i150.
11. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW: CheckM: assessing the
quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome
Res 2015, 25(7):1043–1055.
12. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010, 11(1):119.