Biostar-Workflows
Table of contents
1. Home
1.1.1 Prerequisites
1.7 Alignments
2. Workflows
3. Modules
4. Formats
1. Home
The Biostar Workflows book provides simple, clear and concise instructions on how to use
bioinformatics tools in automated ways. The book complements the other volumes of the
Biostar Handbook and should be used in tandem with those volumes.
In the Biostar Workflows we are providing complete, multistep Makefiles that automate the
analysis of large datasets. Think of the book as a collection of methods and techniques that
you can use in your projects.
Our books are updated quite frequently! We recommend that you read every book via the
website, as the content on the web will always be the most up-to-date. Visit the Biostar
Handbook site to access the most up-to-date version of each book.
1.1.1 Prerequisites
1. First, you will need to set up your computer as instructed in the Biostar Handbook. If you
have a working bioinfo environment you are all set.
2. We assume you have some familiarity with common bioinformatics data formats. If not,
consult our chapter on data formats.
3. Finally, we will extensively use bash shell scripting and Makefiles to create workflows. The
volume titled the Art of Bioinformatics Scripting goes into great detail on these subjects.
At the top of the page there is a navigation bar. You can use it to navigate across sections.
On the right-hand side, we show a table of contents that helps you navigate within the selected
page.
The book is available to registered users. The latest versions can be downloaded from:
• Biostar-Workflows.pdf
Our books are updated frequently, especially during the Spring and Fall semesters, when the
books are used as textbooks.
We recommend accessing the book via the web, as the web version always contains the most
recent and up-to-date content.
A few times a year, we send out emails that describe the new additions.
First, you need to follow the How to set up your computer chapter of the Biostar Handbook. By
the end of that process your computer will contain the software needed to run common
bioinformatics analyses.
• src/run
• src/r
• src/bash
Each folder contains analytics code that we explain and demonstrate in this book. The
command will not overwrite existing files unless explicitly instructed to do so.
For each new project download a separate copy of the code. This is because you may need to
modify the code (even if just a tiny bit) to suit your needs. Thus, it is best if you start out with
a separate copy of the code.
Run one of the modules to test them. Here we run genbank.mk designed to connect to
GenBank to download data:
make -f src/run/genbank.mk
#
# genbank.mk: download sequences from GenBank
#
# ACC=AF086833
# REF=refs/AF086833.fa
# GBK=refs/AF086833.gb
# GFF=refs/AF086833.gff
#
# make fasta genbank gff
#
mkdir -p refs/
bio fetch AF086833 --format fasta > refs/AF086833.fa
-rw-r--r-- 1 ialbert staff 19K Feb 28 10:48 refs/AF086833.fa
You now have a file called refs/AF086833.fa that contains the sequence of the accession
number AF086833 . If you were to run the fasta target again, note how no download is
performed because the file already exists; the workflow tells you that the file is already there.
You can change what gets downloaded by passing an additional parameter to the fasta
target:
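A command along these lines (using the ACC parameter listed in the genbank.mk usage above) triggers the download:
make -f src/run/genbank.mk fasta ACC=NC_045512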
mkdir -p refs/
bio fetch NC_045512 --format fasta > refs/NC_045512.fa
-rw-r--r-- 1 ialbert staff 30K Feb 28 10:46 refs/NC_045512.fa
We provide several data analysis modules that make use of code written in R. We support
running R both at the command line and via RStudio.
If you are using the command line for running R we recommend that you create a separate
conda environment for the statistical analysis. We call our environment stats :
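A minimal sketch, assuming mamba with the bioconda and conda-forge channels configured; the package list is illustrative, not the book's exact list (see src/install-packages.sh for the full set):
# Create a separate environment for statistical analysis in R.
mamba create -y -n stats r-base r-optparse r-tidyverse bioconductor-deseq2 bioconductor-edger
# Switch to the new environment.
mamba activate stats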
Usually, having to switch environments is not a problem, as the R code runs in the final stages
of each protocol, when you are analyzing, plotting, and visualizing the data.
Creating a separate stats environment is not always necessary. We do it to ensure that you
can run all the packages we demonstrate in the book. In practice, when working on a specific
problem, you might not need them all. Start by updating all the packages in bioinfo :
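One way to do that, assuming a mamba based setup:
mamba update -n bioinfo --all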
Next, you may be able to install individual Bioconductor libraries into your bioinfo
environment without interfering with existing packages. When doing so, cherry-pick and
install just a few of the packages rather than all of them as we do above. It is a bit of trial and
error: instruct mamba to install a library, then watch what mamba tells you will happen, in
particular whether any existing packages get downgraded. You would not want to allow a
lower version of, say, samtools or bwa to be installed.
The success rate may vary, as package dependencies change over time.
Check src/install-packages.sh for an overview of the packages you might want to install.
Rscript src/r/simulate_null.r
It should print:
The code above performed an RNA-Seq simulation and created the files design.csv and
counts.csv .
When run from command line, you can see usage information by adding the -h (help) option.
For example:
Rscript src/r/simulate_null.r -h
Should print:
Options:
-d DESIGN_FILE, --design_file=DESIGN_FILE
simulated design file [design.csv]
-o OUTPUT, --output=OUTPUT
simulated counts [counts.csv]
-r NREPS, --nreps=NREPS
the number of replicates [5]
-i INFILE, --infile=INFILE
input file with counts [src/counts/barton_counts.csv]
-h, --help
Show this help message and exit
And that's it! You have all the code you will need to perform the analyses in this book.
Every once in a while (and especially if you run into trouble) you should update all packages
in an environment. You can do so via:
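For example, to update everything in the bioinfo environment (assuming a mamba based setup):
mamba update -n bioinfo --all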
To switch environments within a bash shell script use the following construct:
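A common construct, shown as a sketch that assumes a conda or mamba based installation (adjust to your setup):
# Make the activate command available inside the script.
source $(conda info --base)/etc/profile.d/conda.sh
# Switch to the statistics environment for the R steps.
conda activate stats
# ... run the R scripts here ...
# Switch back to the main environment.
conda activate bioinfo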
When you know specifically which tools you wish to run, you might want to create a custom
environment just for those tools. For example, you could build an environment for the hisat2
based pipeline followed by featurecounts and DESeq2 , as well as for filling in gene names
and plotting the results; a sketch of the commands follows below.
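A sketch, assuming bioconda packages and an illustrative rather than exact package list:
# Create an environment for the RNA-Seq pipeline; subread provides featureCounts.
mamba create -y -n rnaseq hisat2 subread samtools bioconductor-deseq2 bioconductor-org.hs.eg.db r-ggplot2
# Activate it before running the pipeline.
mamba activate rnaseq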
It would create the rnaseq environment with all the tools needed to run the RNA-Seq
pipeline.
We provide several data analysis modules that make use of code written in R. We support
running R both at the command line and via RStudio.
We demonstrate the use cases from the command line, and it would work similarly in RStudio.
In the Installing modules chapter, we saw how to download a copy of the code used in this
book into the following folders:
• src/run
• src/r
• src/bash
Each folder contains analytics code that we explain and demonstrate in this book. The
command will not overwrite existing files unless explicitly instructed to do so.
You will need to have R installed first. Visit the R website and install R.
• https://www.r-project.org/
When using an Apple M1 processor, install the Intel macOS binaries rather than the so-called
native arm version. While both versions of R will run on your Mac, the Bioconductor packages
require the Intel version of R.
Open RStudio, then find and load the src/install-rstudio.r script within RStudio. Press
Source to run the script. The command will churn, download, and print copious amounts of
information, but it should complete without errors.
Note: Always select No or None when asked to update or modify packages that are already
installed.
At the end of the installation above you may be prompted yet again to update existing
packages. I recommend ignoring that message, as the update does not seem to succeed on my
system, and the packages seem to work even if I skip that update.
To run any of our R scripts, open the script in RStudio and execute it from there. The
customizable parameters will always be listed at the top of the script.
Set the working directory to the location that contains the src folder you created during
module installation. Don't select the src folder itself, though; select the parent folder that
contains the src folder.
You can modify the parameters at the top of every script to fit your needs.
We have tried to write clean and clear R code where we separate the process of reading and
formatting the data from the analysis step. This allows you to more readily modify the code to
fit your needs. As you will see, in just about any R script 90% of the code is fiddling with
data: formatting, cleaning, consolidating etc. The actual analysis is just a few lines of code.
Below we loaded the src/r/simulate_null.r script in R and ran it via the source
command:
1. Each program prints information on what it is doing. For example, note how the src/r/edger.r
program reports that it has created edger.csv .
2. Investigate the files that get generated in the work directory.
3. We have set up the tools so that, by default, the file names match; you can change them to
whatever you want.
4. The code is modular; you can easily change the parameters to fit your needs.
We have implemented our workflows as modules written as Makefiles that may be run
individually or in combination.
• src/run
• src/r
• src/bash
Each folder contains analytics code that we explain and demonstrate in this book. The
command will not overwrite existing files unless explicitly instructed to do so.
Run the makefile to have it print information on its usage. For example:
make -f src/run/sra.mk
will print:
#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#
• Lowercase words such as fastq are the so-called targets of the makefile.
• Uppercase words such as SRR=SRR1553425 are the external parameters of the makefile.
Each makefile comes with a self-testing functionality that you can invoke with the test
target.
To run the makefile with the target that the usage showed, for example fastq , execute:
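Based on the usage printed above, the invocation is:
make -f src/run/sra.mk fastq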
Running the makefile with no parameters makes it use the default parameters, in this case
SRR=SRR1553425 . You can pass a value for any variable to the module on the command line;
passing the defaults explicitly is basically equivalent to the plain invocation:
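A command along these lines, spelling out the default values from the usage, produces the output below:
make -f src/run/sra.mk fastq SRR=SRR1553425 N=1000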
mkdir -p reads
fastq-dump -F --split-files -X 1000 -O reads SRR1553425
Read 1000 spots for SRR1553425
Written 1000 spots for SRR1553425
-rw-r--r-- 1 ialbert staff 58K Oct 25 10:00 reads/SRR1553425_1.fastq
-rw-r--r-- 1 ialbert staff 63K Oct 25 10:00 reads/SRR1553425_2.fastq
Note how the module will also list the commands that it executes.
Look at the source code of the file you are running to see all parameters.
The navigation bar on the left also has a documentation page for each module.
1. Look at the complete guide to understand the principles that we rely on.
2. Look inside the makefile to see how it works. We tried our best to write self-documenting code.
3. We also provide separate documentation for each makefile.
The table of contents lists several analyses. Each will rely on using the Makefiles in a
specific order and with certain parameters. Study each to learn more about the decision-
making that goes into each process.
In this section, we describe the design and usage of our Makefiles in more detail. It is not
strictly necessary to consult the content on this page, but it helps in better understanding the
principles we follow.
For each project, start with a new copy of the code and customize that.
You may need to customize the modules or add various flags to them. Thus, it is best to start
with a new copy of the code for each project.
• src/run
• src/r
• src/bash
Each folder contains analytics code that we explain and demonstrate in this book. The
command will not overwrite existing files unless explicitly instructed to do so.
Testing a module
Invoke the test target to have the module execute its self-test. For example:
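For the sra.mk module that would be:
make -f src/run/sra.mk test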
The above will download all the data it needs and will print the commands. A module may
rely on other modules; hence it should be run from the directory in which you installed the
modules.
Module usage
Every module is self-documenting and will print its usage when you run it via make .
Investigate the modules in the src/run folder. We'll pick, say, src/run/sra.mk . Run the
module to get information on its usage:
make -f src/run/sra.mk
#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#
All modules are runnable without passing parameters to them. Some modules list the valid
parameters they can take.
Action targets like fastq are always lowercase. The parameters SRR and N are always
uppercased.
The module's usage indicates the so-called target, in this case fastq . Targets are in lowercase;
parameters will always be uppercased. To see what the makefile will do when run, add the -n
flag (a so-called dry run) to the command line:
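For example, for the sra.mk module (the -n flag is standard GNU make):
make -f src/run/sra.mk fastq -n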
mkdir -p reads
fastq-dump -F --split-files -X 1000 -O reads SRR1553425
ls -lh reads/SRR1553425*
Note how the Makefile is self-documenting. It prints the commands that it will execute.
Running a module
In most cases the default parameters for a module are sufficient for a test run without setting
additional parameters. For example:
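A plain run of the module with its default parameters:
make -f src/run/sra.mk fastq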
mkdir -p reads
fastq-dump --gzip -F --split-files -X 1000 -O reads SRR1553425
Read 1000 spots for SRR1553425
Written 1000 spots for SRR1553425
-rwxrwxrwx 1 ialbert ialbert 58K Oct 22 22:36 reads/SRR1553425_1.fastq
-rwxrwxrwx 1 ialbert ialbert 63K Oct 22 22:36 reads/SRR1553425_2.fastq
This time it runs instantly because the files are already downloaded and prints the location of
the files only:
Some modules, for example, aligners, may need data files to run. Other modules may need an
alignment file to work.
In those cases, we would need to run the appropriate tools first, such as sra.mk or bwa.mk ,
to generate the correct data types.
Setting parameters
Look inside the Makefile (a simple text file) to see the available parameters.
If we wanted to download data for a different SRR number, we could specify command line
parameters like so:
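A command along these lines matches the output below:
make -f src/run/sra.mk fastq SRR=SRR3191542 N=2000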
In our convention, the parameters will always be in uppercase. The command, when run,
prints:
mkdir -p reads
fastq-dump --gzip -F --split-files -X 2000 -O reads SRR3191542
Read 2000 spots for SRR3191542
Written 2000 spots for SRR3191542
-rwxrwxrwx 1 ialbert ialbert 101K Oct 22 22:39 reads/SRR3191542_1.fastq
-rwxrwxrwx 1 ialbert ialbert 110K Oct 22 22:39 reads/SRR3191542_2.fastq
In a nutshell, the sra.mk module is a reusable component that allows us to download reads
from the SRA database.
Undoing a run
In all of our modules, to remove the results generated by a target, invoke the target with an
exclamation sign ! added. For example, to undo the results generated by the fastq target,
run fastq! :
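For the sra.mk module that would be:
make -f src/run/sra.mk fastq!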
Doing so will remove the downloaded files. To force rerunning the module even if the results
already exist, use both targets, fastq! and fastq , in the same command:
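For example:
make -f src/run/sra.mk fastq! fastq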
The reversal may not undo every change. For example, in the above case, it won't delete the
directory that a command might have created but will delete the downloaded files.
Universal parameters
Many modules accept additional parameters that have a common utility. We do not list these
parameters in the module's usage because they are common to all modules.
For example, tools are set up to run in single-end mode ( MODE=SE ); for paired-end data, set
MODE=PE on every module.
Look at the source code; we list all parameters at the start of each module.
Another common parameter is SRR as many of our workflows start with downloading data
from the SRA database. The default file names in each tool will derive from the SRR
parameter unless set otherwise.
For example, the following command would run bwa with 4 threads, in single-end mode, and
additionally apply filtering flags to remove unmapped reads:
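A hypothetical sketch; the NCPU and MODE parameters are described in this section, but the FLAG parameter name for the filtering flags is an assumption, so check the module source:
make -f src/run/bwa.mk align SRR=SRR3191542 NCPU=4 MODE=SE FLAG=4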
Notably, using another aligner works the same way as far as these universal parameters are
concerned:
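For example, assuming a hisat2 module with the same interface (the module file name is an assumption):
make -f src/run/hisat2.mk align SRR=SRR3191542 NCPU=4 MODE=SE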
When not explicitly set, the module will use the SRR number to set the various input and
output parameters. Thus, unless otherwise set, it will read its input from
R1=reads/SRR3191542.fastq and will write the output to BAM=bam/SRR3191542.bam . The
default naming makes it convenient to reproduce published data.
For your own data, which does not have an SRR number, you can set the R1 , R2 and BAM
parameters to point to your files.
Setting a parameter on a tool that does not recognize it has no effect. For example, you can
pass NCPU=4 to genbank.mk and nothing changes.
The easiest way to identify all the parameters is to read the source, either in your copy of the
code or here on the website.
Triggering a rerun
If we attempt to modify the run and get more reads with N=2000 , the module won't do it by
default, since the files are already there and it does not know we want a new file with the same
name. In that case, we need to trigger a rerun by undoing the target with fastq! and then
rerunning it with fastq :
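A command along these lines would do it:
make -f src/run/sra.mk fastq! fastq N=2000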
The same rule applies to all other modules as well. We might need to trigger a rerun whenever
we change parameters that may affect the contents of a file.
For example, to get the entire SRR file, we need to set N=ALL as a parameter, and we
probably want to remove the previous files (if we downloaded these before):
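For example:
make -f src/run/sra.mk fastq! fastq N=ALL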
You need to trigger a rerun if you change parameters that affect the contents of a file that
already exists.
Software requirements
Every workflow that we develop will have a list of software requirements that you can see
with the following:
If you have set up the bioinfo environment as instructed during installation, your system is
already set up and good to go. Otherwise, you need to use the command above to set up the
software. For example, you could do the following:
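Assuming the modules provide an install target that prints their requirements (an assumption; check the module source), the pattern would be:
# Print the software requirements of a module.
make -f src/run/sra.mk install
# Apply the printed installation commands to the current environment.
make -f src/run/sra.mk install | bash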
The above applies the installation commands into the current environment.
Look inside src/run/sra.mk to see what the module does. We've tried making the code as
simple as possible without oversimplifying it.
Our modules are building blocks around some of the most anachronistic aspects of
bioinformatics. Our modules are what bioinformatics would look like in a well-designed
world.
We also document each module individually. On the left-hand side, you can view the
documentation entry and the full source code for sra.mk .
As you run modules you may run into errors. Here is a list of common ones we observed and
how to fix them.
Study the guides a bit, to make sure you understand how the concepts work:
Each module is documented separately. See the section called Modules above.
The most important lesson: don't panic. Most errors are caused by plugging the wrong file into
the wrong location.
Check your file names, read the whole error message and think about what it says.
The vast majority of errors look scary because the wording is unfamiliar. Even scary looking
errors have straightforward solutions as demonstrated below.
Next, generate a so-called dry run (with the -n flag) to see what the module will attempt to do.
Paired-end vs single-end
Most tools will work in paired-end mode by default, but some require single-end data. To
avoid uncertainty, specify the mode explicitly: MODE=SE or MODE=PE .
This error, typically of the form make: *** No rule to make target ... , has many variants
and can be a bit more verbose. It means that make cannot find one of your input files.
For example, an aligner does not know how to make a FASTQ file; that is not its job.
The makefile knows that a FASTQ file is required to produce a BAM file, but if you don't
have that FASTQ file, it gets stumped. It tells you that it does not know how to make
something that it needs later. The error is always about input files, not output files.
reads/SRR7795540_2.fastq_R1.fq
Does that file exist? Go check. I already know the answer. No, the file does NOT exist as
listed.
make *** "### Error! Please use GNU Make 4.0 or later ###". Stop.
The error above means that you have not activated the bioinfo environment! As trivial as it
seems, the error still stumps many people. Solution? Make sure to activate your environment.
If for whatever reason you can't run the newer make , you can replace the leading > symbols
with TAB characters in our modules. In general, though, this workaround should not be
necessary. Activate your environment and you should be fine.
The error means you are passing an incorrect path to make . Use TAB completion to select
the correct file.
Missing operator
Perhaps your makefile is old and does not have the latest changes. You should have make
version 4.0 or later.
It could also mean that one of the actions in the makefile is not properly formatted.
You can also get this error when a comment character # is missing on a line.
Sometimes, if you change an internal parameter (for example, adding different alignment flags
or other filtering parameters), you will need to recreate the files.
Touching files
Another option to force a rerun is to touch a file (update its date). The date update makes
the file look newer and forces recomputing of all dependencies. For example, if you have a
FASTQ file and you want all dependent files to be recreated, you could do:
touch reads/SRR1553425_1.fastq
In this case, touching is preferable to, say, triggering a rerun via fastq! , as touching will not
delete and re-download the reads.
Additional tips
Bioinformatics workflows interconnect complex software packages. When starting any new
project, expect to make several mistakes along the way. Your skill is measured primarily by the
speed with which you recognize and correct your mistakes. Here are a few tips:
The forum is a resource that we started for this book. You are welcome to post your question
as a new discussion topic.
Data is distributed via various repositories. The most commonly used ones, described below,
are GenBank, NCBI Assembly, Ensembl, and the Sequence Read Archive (SRA).
Most of the time it is useful to run a bio search to get information on data before
downloading it. For example:
# Searches GenBank
bio search AF086833
# Searches SRA
bio search SRR1553425
GenBank is the NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences. If your data has a GenBank accession number such as AF086833
use the genbank.mk module.
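Based on the genbank.mk usage shown earlier, the commands would be along these lines:
make -f src/run/genbank.mk fasta gff ACC=AF086833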
mkdir -p refs/
bio fetch AF086833 --format fasta > refs/AF086833.fa
-rw-r--r-- 1 ialbert staff 19K Nov 4 11:26 refs/AF086833.fa
mkdir -p refs/
bio fetch AF086833 --format gff > refs/AF086833.gff
-rw-r--r-- 1 ialbert staff 7.5K Nov 4 11:26 refs/AF086833.gff
The command above downloaded the FASTA and GFF files corresponding to the AF086833
accession in GenBank.
The Assembly resource includes prokaryotic and eukaryotic genomes with a Whole Genome
Shotgun (WGS) assembly, clone-based assembly, or completely sequenced genome (gapless
chromosomes). The main entry point is at:
• https://ftp.ncbi.nlm.nih.gov/genomes/
If your data has an NCBI Assembly accession number such as GCA_000005845 , then run:
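Assuming that bio search also accepts assembly accessions (an assumption), the command would be:
bio search GCA_000005845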
It will print:
[
{
"assembly_accession": "GCA_000005845.2",
"bioproject": "PRJNA225",
"biosample": "SAMN02604091",
"wgs_master": "",
"refseq_category": "reference genome",
"taxid": "511145",
"species_taxid": "562",
"organism_name": "Escherichia coli str. K-12 substr. MG1655",
"infraspecific_name": "strain=K-12 substr. MG1655",
"isolate": "",
"version_status": "latest",
"assembly_level": "Complete Genome",
"release_type": "Major",
"genome_rep": "Full",
"seq_rel_date": "2013/09/26",
"asm_name": "ASM584v2",
"submitter": "Univ. Wisconsin",
"gbrs_paired_asm": "GCF_000005845.2",
"paired_asm_comp": "identical",
"ftp_path": "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/
GCA_000005845.2_ASM584v2",
"excluded_from_refseq": "",
"relation_to_type_materialasm_not_live_date": ""
}
]
• https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2
View and understand what files you may download from there. After that you can copy-paste
the link and download it directly with curl or wget :
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/
GCA_000005845.2_ASM584v2_genomic.fna.gz
Alternatively, you can use the curl.mk module that we provide. One advantage of our
Makefiles is that they avoid downloading data that already exists.
# The curl.mk module does not download data that already exists.
make -f src/run/curl.mk get URL=$URL ACTION=unzip FILE=refs/chr22.fa
NCBI allows the use of the rsync protocol (do note that most sites do not support this
functionality). To use rsync.mk we have to change the protocol from https to rsync ; that
way we can download the complete directory, not just a single file:
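A sketch of the idea; the rsync.mk target and parameter names are assumptions modeled on curl.mk :
# Note the rsync:// protocol and the trailing slash to fetch the whole directory.
URL=rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/
make -f src/run/rsync.mk get URL=$URL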
Ensembl operates on numbered releases. For example, release 104 was published on March
30, 2021:
• http://ftp.ensembl.org/pub/release-104/
Navigate the link above and familiarize yourself with the structure. To collect all the data for
an organism, you’ll have to click around and visit different paths, as Ensembl groups
information by file format.
You can invoke curl or wget directly on each file or use the curl.mk module.
# The curl.mk module does not download data that already exists.
make -f src/run/curl.mk get URL=$URL FILE=refs/chr22.fa ACTION=unzip
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/getting-data.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
# Genome accession.
ACC=AF086833
# Data at ENSEMBL.
URL=http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosom
Refgenie is a command-line tool that can be used to download and manage reference
genomes, and to build and manage custom genome assets.
flowchart TD
REMOTE1[<b>Remote Resource 1</b><br>Human genome] -->|refgenie pull| LOCAL[<b>Local
Computer</b><br>common storage area]
REMOTE2[<b>Remote Resource 2</b><br>BWA index] -->|refgenie pull| LOCAL
REMOTE3[<b>Remote Resource 3</b><br>GTF annotations] -->|refgenie pull| LOCAL
LOCAL -->|refgenie seek| PATH[<b>Path to <br> local resource</b>]
Refgenie installation
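Refgenie is distributed on PyPI and can be installed with pip :
pip install refgenie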
Then you need to create a configuration file that lists the resources.
Every time you run refgenie you can pass that file to each command with the -c option.
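For example, to create a configuration file at a location of your choosing:
refgenie init -c ~/refs/config.yaml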
A simpler approach is to create a default configuration file that is accessed via the REFGENIE
shell environment variable. In that case you can omit the -c option and the configuration file
will be automatically loaded from the file stored in the REFGENIE variable.
export REFGENIE=~/refs/config.yaml
Add the line above to your ~/.bashrc , then run source ~/.bashrc or close and reopen the terminal to reload the configuration.
refgenie init
You are all set. You can now use refgenie to download and manage reference genomes.
Using refgenie
List the remotely available resources:
refgenie listr
Download a resource:
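For example, pulling the GENCODE GTF asset for hg38 (the asset name is inferred from the local path shown below):
refgenie pull hg38/gencode_gtf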
List the locally available resources:
refgenie list
The refgenie seek command returns the local path to an asset; for the GTF asset above it looks like this:
/scratch/refs/alias/hg38/gencode_gtf/default/hg38.gtf.gz
ls -l /scratch/refs/alias/hg38/gencode_gtf/default/hg38.gtf.gz
You can capture that path in a script using the command substitution $() construct:
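A sketch, assuming the same asset as above:
# Capture the local path to the asset in a shell variable.
GTF=$(refgenie seek hg38/gencode_gtf)
echo ${GTF}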
Refgenie in a Makefile
The iGenomes server is a public server that hosts additional reference genomes and genome
assets.
Published FASTQ files are stored in the Sequence Read Archive (SRA).
Of all tasks in bioinformatics nothing is more annoying than the clunkiness of downloading a
simple FASTQ file.
We have lots of options at our disposal, but in reality all are flaky.
Rant: Why is it that when I install a 30 GB Steam game on my home system it takes a single
click and it finishes in two hours, but if I try to download a 30 GB FASTQ file via my
superfast connection at the university it takes 16 hours (and many times fails outright) and I
have to use 3 different tools to figure out which works? Rhetorical question ... I know the
answer. It is because when you can't download a game you purchased there are repercussions.
When you can't download a FASTQ file nobody cares.
ENA (the European Nucleotide Archive) does provide direct links to FASTQ files. Only, if you
are in the USA, that means you have to download the files from Europe, and transfer speeds
are usually lacking.
But what do you do when you do need to get data? You will be hitting up various tools,
fastq-dump , direct URL downloads, aria2c , and so on, until something works. Eventually
something will work. Good luck, comrade, you will need it!
Most of the time you should start by investigating the SRA numbers for additional
information. I have developed bio search because at that time no other tool reported the
metadata in the way I thought was most useful:
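The command matches the search shown at the start of this chapter:
bio search SRR1553425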
will print:
[
{
"run_accession": "SRR1553425",
"sample_accession": "SAMN02951957",
"sample_alias": "EM110",
"sample_description": "Zaire ebolavirus genome sequencing from 2014 outbreak in
Sierra Leone",
"first_public": "2015-06-05",
"country": "Sierra Leone",
"scientific_name": "Zaire ebolavirus",
"fastq_bytes": "111859282;119350609",
"base_count": "360534650",
"read_count": "1784825",
"library_name": "EM110_r1.ADXX",
"library_strategy": "RNA-Seq",
"library_source": "TRANSCRIPTOMIC",
"library_layout": "PAIRED",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2500",
"study_title": "Zaire ebolavirus Genome sequencing",
"fastq_url": [
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/
SRR1553425_1.fastq.gz",
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/
SRR1553425_2.fastq.gz"
],
"info": "112 MB, 119 MB file; 2 million reads; 360.5 million sequenced bases"
}
]
There are more fields to search for; see bio search --help for more information.
If you know the SRR number, you can use the sra.mk module, which uses fastq-dump
behind the scenes. Pass the N parameter to get a subset of the data.
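For example, requesting 10,000 reads to match the output below:
make -f src/run/sra.mk fastq N=10000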
It prints:
mkdir -p reads
fastq-dump -F --split-3 -X 10000 -O reads SRR1553425
Read 10000 spots for SRR1553425
Written 10000 spots for SRR1553425
-rw-r--r-- 1 ialbert staff 577K Nov 4 11:37 reads/SRR1553425_1.fastq
-rw-r--r-- 1 ialbert staff 621K Nov 4 11:37 reads/SRR1553425_2.fastq
Take note of what happened: the module also created a reads folder with FASTQ files in it.
You can change the destination folder, as it is also an input parameter.
Fifty percent of times fastq-dump works every time. If it works for you, show your
gratitude to the deities of Bioinformatics and proceed to the next phase of your research.
However, if luck is not on your side, do not worry and keep on reading. We are gonna get
through it one way or another.
To obtain all data for a project, search for the project number to identify the SRR run numbers,
then use GNU parallel to download all the data. Read the run accessions from the search
output and run the downloads in parallel (we limit it to 3 samples here); a sketch follows.
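A sketch of the process; it assumes the bio search JSON output and the jq tool shown later in this chapter, and uses the Ebola BioProject as an example accession:
# Collect the run accessions for the project into a file.
bio search PRJNA257197 | jq -r '.[].run_accession' > runs.txt
# Download the first three runs in parallel via the sra.mk module.
cat runs.txt | head -3 | parallel make -f src/run/sra.mk fastq SRR={}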
Sometimes people have trouble running fastq-dump . In those cases, locate the URLs in the
bio search output, which contains:
[
{
...
"fastq_url": [
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/
SRR1553425_1.fastq.gz",
"https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/
SRR1553425_2.fastq.gz"
],
"info": "112 MB, 119 MB file; 2 million reads; 360.5 million sequenced bases"
...
}
]
Note the entry called fastq_url . Isolate those URLs, then download them with curl.mk .
You can do the isolation manually or automate it with the jq tool:
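A jq invocation along these lines extracts the URLs:
bio search SRR1553425 | jq -r '.[].fastq_url[]'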
which will print:
https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz
https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_2.fastq.gz
When you have to download large files you should use the aria2c tool. It is a handy
command-line download utility. One of its key benefits is the ability to resume partially
finished downloads. Moreover, the tool can segment the file into multiple parts and leverage
multiple connections to expedite the download process.
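A minimal sketch using one of the ENA URLs shown above; -c resumes a partial download, while -x and -s set the number of connections and segments:
aria2c -c -x 4 -s 4 https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz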
The SRA Explorer tool aims to make datasets within the Sequence Read Archive more
accessible. It is everything that NCBI's SRA website should be. It is a single-page application
that allows you to search for datasets and view metadata:
• https://sra-explorer.info/
You can also visit the SRA website to navigate the data.
• fastq_dl takes an ENA/SRA accession (Study, Sample, Experiment, or Run) and queries
ENA (via Data Warehouse API) to determine the associated metadata. It then downloads
FASTQ files for each Run. For Samples or Experiments with multiple Runs, users can
optionally merge the runs.
• ffq receives an accession and returns the metadata for that accession as well as the metadata
for all downstream accessions following the connections between GEO, SRA, EMBL-EBI,
DDBJ, and Biosample.
• geofetch is a command-line tool that downloads sequencing data and metadata from GEO
and SRA and creates standard PEPs. geofetch is hosted at pypi. You can convert the result of
geofetch into unmapped bam or fastq files with the included sraconvert command.
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/getting-fastq.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
1.7 Alignments
This tutorial will produce alignments using data discussed in the chapter Redo: Genomic
surveillance elucidates Ebola virus origin. We identified that the 1976 Mayinga strain of Ebola
has the accession number AF086833 . We also located multiple sequencing datasets, out of
which we selected SRR1553425 .
Obtain the genome reference file in FASTA and GFF formats from GenBank via the
genbank.mk module.
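As in the earlier chapters, the invocation would be along these lines:
make -f src/run/genbank.mk fasta gff ACC=AF086833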
Download a subset of the FASTQ data for the SRR1553425 accession number.
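For example (the N value is illustrative):
make -f src/run/sra.mk fastq SRR=SRR1553425 N=10000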
We provide modules that can run multiple short read aligners in a similar manner. Each
module is built such that we can pass either SRR numbers or the R1 and R2 files.
We also need to ensure that the indices are built, so we invoke each module with both the
index and align targets.
Pass the BAM parameter to control how the output file is named. Visit the documentation of
each module to learn more about how to override additional parameters.
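For example, for bwa ; this mirrors the invocation in the workflow script at the end of this chapter and assumes paired-end mode:
make -f src/run/bwa.mk index align SRR=SRR1553425 REF=refs/AF086833.fa BAM=bam/SRR1553425.bwa.bam MODE=PE
The command produces the alignment file: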
bam/SRR1553425.bwa.bam
Note how simple and elegant our reusable modules are. The interfaces are identical, and you
can align the FASTQ files with different methods to produce the BAM file.
Use IGV to import the reference file as the genome and the BAM file as the alignment.
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/short-read-alignments.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
# Genome accession.
ACC=AF086833
# SRR number.
SRR=SRR1553425
(Diagram: the variant calling workflow ends in a filtered, normalized, compressed VCF file.)
See the tutorial How to align short reads to learn how to generate the BAM file.
A variant calling workflow starts with a BAM file that you have access to. Here we assume
you ran the previous tutorial that generated a file called:
bam/SRR1553425.bwa.bam
You also need to have a reference genome that was used to generate the alignment. The variant
caller will use your BAM file and the reference genome to identify variants.
Several alternative tools may be used to call variants. All do a good job with easy variants but
may differ quite a bit when calling variants where the alignments are ambiguous or unreliable.
Different schools of thought may collide within each software package.
bcftools
bcftools is a variant caller that is part of the samtools package. It is my favorite tool for
non-human species. I think it strikes a balance between usability and performance.
• Docs: https://samtools.github.io/bcftools/bcftools.html
• Paper: A statistical framework for SNP calling, Bioinformatics 2011
• Paper: The evaluation of Bcftools mpileup, Scientific Reports (2022)
freebayes
• Docs: https://github.com/freebayes/freebayes
• Paper: Haplotype-based variant detection from short-read sequencing, arXiv, 2012
GATK
The most commonly used toolkit for human genome variant calling. It is a complex beast: it is
a platform and a way of thinking, has extensive documentation, and is widely used. It is the
de-facto standard for human genome variation calling.
• Docs: https://gatk.broadinstitute.org/hc/en-us
deepvariant
DeepVariant is a deep learning-based variant caller that takes aligned reads, produces pileup
images, classifies each image using a convolutional neural network, and finally reports the
results in a standard VCF or gVCF file.
• Docs: https://github.com/google/deepvariant
• Paper: A universal SNP and small-indel variant caller using deep neural networks, Nature Biotechnology, 2018
Generate variants
BAM=bam/SRR1553425.bwa.bam
REF=refs/AF086833.fa
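Assuming the bcftools.mk module follows the pattern of the other modules and provides a vcf target (an assumption; check its usage), the call would be along these lines:
make -f src/run/bcftools.mk vcf BAM=bam/SRR1553425.bwa.bam REF=refs/AF086833.fa
It produces the variant call file: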
vcf/SRR1553425.bcftools.vcf.gz
We also provide a module that runs freebayes with the exact same interface:
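Again assuming the same interface (the freebayes.mk name and target are assumptions):
make -f src/run/freebayes.mk vcf BAM=bam/SRR1553425.bwa.bam REF=refs/AF086833.fa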
vcf/SRR1553425.freebayes.vcf.gz
That's it.
Visualize all the files in IGV, load the reference FASTA, the BAM and the VCF files to obtain
the following picture:
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/variant-calling.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
# Genome accession.
ACC=AF086833
# SRR number.
SRR=SRR1553425
# Index the genome then align the reads to the reference genome.
make -f src/run/bwa.mk index align SRR=${SRR} REF=${REF} BAM=${BAM} MODE=${MODE}
In this chapter, we take on the task of reproducing the findings of a paper dedicated to
comparing the accuracy of different variant calling workflows.
• Comparison of three variant callers for human whole genome sequencing, Scientific Reports
volume 8, Article number: 17851 (2018)
As we work through the paper, we learn a lot about variant calling, the pitfalls, and the
challenges of detecting and validating variants in a genome.
Most importantly, we learn that deciding which variant caller is better is not nearly as simple
as it seems.
TLDR
All variant callers perform exceedingly well even with default parameters. bcftools was
the fastest, DeepVariant the most accurate.
Final ranking:
Reproducing the variant calls (VCF files) themselves is beyond the scope of this chapter. We
will present the variant calling process in a different section.
In this chapter, we will begin our study with the variant call (VCF) files published by the
authors.
In addition to the published results, we generated variant calls with workflows published in
this book.
The question of which variant caller is "better" has long preoccupied scientists. The main
challenge is that we don't know what the ground truth is. We don't have a universal and
objective method to decide which variant is accurate and which variant call is invalid.
Consequently, scientists need to reach the next best thing - some consensus. They first need to
identify a set of variant calls considered to be very likely correct and use those as a
benchmark. The problem is that we use variant callers to identify the benchmark data, then use
these results to evaluate the same variant callers.
Genome in a Bottle
The Genome in a Bottle Consortium (GIAB) was created to come up with a set of benchmark
datasets, which can be used to validate the accuracy of variant calls. According to various
resources:
GIAB provides high-confidence calls for a range of variant types, including single nucleotide
variants (SNVs), insertions, deletions, and structural variants (SVs), for various human
genomes. These calls are generated using multiple sequencing technologies, platforms, and
pipelines, making them highly reliable and unbiased. Researchers can compare their variant
calls with the benchmark calls provided by GIAB to assess the accuracy of their variant
calling workflow.
At least that is the theory - the reality is a bit more complicated. A lot more complicated,
actually.
The GIAB evolves, and the variant calls are refined over time as new data gets collected. The
GIAB has had releases in 2014, 2015, 2016, 2017, and 2021.
Naturally, the question arises: how different is the 2014 benchmark from the 2021
benchmark? Would the software considered best when evaluated on the 2014 data also score
best on the 2021 data?
The Precision FDA Truth Challenge was a public competition organized by the US Food and
Drug Administration (FDA) to evaluate the performance of variant calling methods on whole-
genome sequencing data.
If you look at the outcomes on their page above, you'll note that the results submitted for the
challenge are close. The contest winners are only 0.1% better than the second place. The best
variant caller is only 1% better than the worst.
Note that the results of an evaluation are only as good as the correctness of the data itself. If
the "ground" truth is inaccurate within 1%, then any ranking based on a 1% difference can't be
all that meaningful.
Are the winners indeed better, or are the winners better at predicting both the right calls and
the miscalls?
In this chapter, we've set out to compare variant calls from different sources - the process
turned out to be unexpectedly tedious. We clicked, pasted, read, unpacked, filtered, and
corrected VCF files for many days.
We ran into all kinds of quirks and problems. The sample names had to be changed; the
variant annotations would not parse correctly; the VCF files had errors inside them, and so on.
When downloading different GIAB releases, the VCF files are incompatible and need
alteration and fixing to become comparable. Take that reproducibility. We can't even directly
compare the different GIAB releases without some additional work.
We have lost track of how many small but annoying additional steps we had to employ to
make the data comparable.
But finally, we managed to package all the data up in a single archive that contains all you
need to get going.
• http://data.biostarhandbook.com/vcf/snpeval.tar.gz
The archive will create a snpeval folder containing several subdirectories and files.
We have restricted our data to chromosome 1 alone to make the data smaller and quicker to
work with.
The paper reports that DeepVariant by Google is the best variant caller. We have
downloaded the variant calls from the article and stored them as the
DEEP-2017.chr1.vcf.gz file.
We also generated variant calls via the modules from this book that wrap the Google
DeepVariant ( deepvariant.mk ), GATK ( gatk.mk ), and bcftools ( bcftools.mk ) variant callers.
1. GIAB-2017.bed
2. GIAB-2021.bed
The high-confidence regions in GIAB are typically defined as regions where the sequencing
and variant calling are believed to be very accurate. When evaluating variant calls,
benchmarking tools typically only consider the variants within high-confidence regions.
In reality, however, the high-confidence regions are also those that are easier to call, where the
choice of parameters and settings is not as critical. Another way to say this is that variant
calling errors are not uniformly distributed across the genome. Some regions are more error-
prone than others. Some areas may always generate incorrect variant calls. That is a far cry
from the typical error interpretation, where a 1% error is assumed to be uniformly distributed
across the data.
The results folder contains the results of running the rtg vcfeval program on the
variant calls. We will talk about rtg later; for now, know that each folder compares variant
calls relative to GIAB-2021 :
You don't need to run any tool to follow our analysis, but if you want to run the evaluation
yourself, look at the Makefile we include at the end of the chapter to understand how we did
it.
Select the hg38 reference genome in IGV, then select chr1 , then load the MERGED.vcf.gz
file.
Quantifying accuracy
On the merged variant call file, identify by eye a few of the following situations: a variant
present in both the benchmark and a call set, a variant present only in a call set, and a variant
present only in the benchmark.
Suppose we add up these numbers and create the sums typically reported in the literature as
TP , FP , FN , TN . These numbers can be used to evaluate the accuracy of the variant caller.
Precision and recall are two metrics commonly used to evaluate the accuracy of classification
methods.
Precision measures the proportion of identified variants that are true positives.
Recall, also known as sensitivity or true positive rate, measures the proportion of true variants
the tool correctly identified.
In an ideal case, precision and recall would be 1, indicating that the variant caller perfectly
identifies all true variants while not calling any false ones.
To combine precision and recall into a single number, the most commonly used metric is the
F1 score, which is the harmonic mean of precision and recall:
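Written out in terms of these counts (standard definitions, stated here for completeness):
$$\text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN} \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$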
While the two measures appear similar, we want to note a fundamental difference when
dealing with false positives and false negatives.
A variant call file containing false positives can be post-processed with various filters that
could remove the false positives. Hence, in theory, data with false positives can be improved
upon later.
Variant call files containing false negatives cannot be improved via VCF filtering. The variant
is absent, and data about the location is absent. Thus, we cannot improve upon this type of
error.
This is to say that false negatives are more problematic than false positives.
Comparing variants
Comparing two VCF files is not a trivial task. Intuitively, we would think that all we need to
do is line up the two variant files and then compare the variants at each position. For most
cases that might be enough, but there are many edge cases and overlapping variants that may
make the problem more complicated. In those cases, we need to compare the entire region
around the variant (the entire haplotype), not just the variant itself.
Tools like hap.py and rtg vcfeval are designed to compare VCF files.
• https://github.com/Illumina/hap.py
• https://github.com/RealTimeGenomics/rtg-tools
For example the RTG Tools includes several utilities for VCF files and sequence data. The
most interesting is the vcfeval command, which compares VCF files.
Our data includes the results of running vcfeval on various VCF file combinations. See the
Makefile at the end of this chapter on how to run it yourself.
In the following, we report and discuss the results of the output of the rtg vcfeval tool.
The command we run in our Makefile is approximately as follows (we simplified the paths for
brevity). For example, to compare GIAB 2021 to GIAB 2017, we run the following command:
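A sketch of such a command; the reference must first be converted to RTG's SDF format, and the file names are simplified (the SDF name is an assumption):
rtg vcfeval -t hg38.sdf \
    -b GIAB-2021.chr1.vcf.gz \
    -c GIAB-2017.chr1.vcf.gz \
    -e GIAB-2021.bed \
    -o GIAB-2021-vs-GIAB-2017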
The command, when run, generates a directory called GIAB-2021-vs-GIAB-2017 with
several files. The summary.txt file contains the report we are interested in. Here we note that
the GIAB files have been filtered to only contain high-confidence calls, and that over time the
regions considered high-confidence have changed.
The rtg vcfeval tool creates verbose headers that are easy to read but won't fit in our text
without wrapping. We will abbreviate False-pos as FP , False-neg as FN , and so on in
our reports.
GIAB IN 2017
How reliable is the GIAB 2017 benchmark compared to the GIAB 2021 benchmark? Let's
find out.
If we use the 2021 definition of "high-confidence intervals" the results are as follows:
If we use the 2017 version for the "high-confidence intervals" the results are as follows:
All of a sudden, we note the difficulty in evaluating even consecutive GIAB releases. It is not
clear whether missing variants are due to the selection of regions or due to the variant caller.
We do find the imbalance between false positives and false negatives quite unexpected.
DEEPVARIANT IN 2017
The variants below are those published in the paper Comparison of three variant callers for
human whole genome sequencing, Scientific Reports (2018) that we have set out to reproduce.
These variants were generated with the DeepVariant tool. When compared to the GIAB 2021
benchmark, the results are as follows:
The F score we obtain is very close to the number reported in the paper ( F=0.98 ).
We find this quite noteworthy. The variant calls generated by the DeepVariant tool in 2017 are
more accurate even than the GIAB 2017 benchmark!
DEEPVARIANT IN 2023
Since the publication of the paper the DeepVariant tool has been updated to version 1.5.0. We
have generated variants with this latest version as well:
GATK IN 2023
For many years the gold standard in variant calling was held by GATK, the Genome Analysis
Toolkit. Let's compare the results produced by the gatk.mk module in the Biostar Workflows
to the GIAB 2021 benchmark:
We note a more natural balance between false positives and false negatives. Just about the
same number of FP and FN calls are observed.
Have we used GATK to its full potential? Most likely not. Though I have attempted to use the
best practices, I don't doubt that many more options could be used to improve the results.
Yet I am quite pleased; we can match and even slightly exceed the results reported in the
paper.
BCFTOOLS IN 2023
My favorite variant caller is bcftools , and you can use it via the bcftools.mk module in
the Biostar Workflows. It runs incomparably faster, requires fewer resources, and is far, far
easier to use than either DeepVariant or GATK. Let's see what results it produces:
And lo and behold, the variant calls generated via bcftools turn out to be quite accurate,
very much comparable to those generated by the "best methods".
Notably, bcftools runs in a fraction of the time of GATK and requires incomparably fewer
resources.
I did not know what to expect here and was pleased with the results.
Lessons learned
The complete workflow is included with the code you already have and can be run as:
#
# A makefile to evaluate SNP calls.
#
REF_URL = https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/${CHR}.fa.gz
# The high confidence SNP calls for the 2021 GIAB dataset.
GIAB_2021_VCF = vcf/GIAB-2021.${CHR}.vcf.gz
# The high confidence SNP calls for the 2017 GIAB dataset.
GIAB_2017_VCF = vcf/GIAB-2017.${CHR}.vcf.gz
#
# Apply Makefile customizations.
#
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
SHELL := bash
usage:
> @echo "#"
> @echo "# Evaluate SNP calls"
> @echo "#"
> @echo "# TARGET = ${TARGET}"
> @echo "# FLAGS = ${FLAGS}"
> @echo "#"
> @echo "# Usage: make data index eval "
> @echo "#"
clean:
> rm -rf results
In an RNA-Seq study we operate on a concept called gene expression that represents each
gene or transcript. The gene expression is a value assumed to correlate with the number of
transcripts present in the sample. A typical RNA-Seq study is a two-step process.
Gene expression is a bit of a misnomer since only transcripts can be expressed. Often there is
ambiguity in how gene expression is defined. In general gene expression is a count of reads
aligning to or overlapping with a region. Gene expression may be a count of individual
transcripts or a sum of counts over all transcripts that belong to a gene.
Differential expression means detecting a statistically significant change in gene
expression between some grouping of the samples. For example, we may want to know which
genes' expression changes between two treatments, two strains, or two timepoints.
The gene expression number may be computed and expressed in different ways. The so-called
raw counts are preferable to other measures as we can quickly transform raw counts into
different units: FPKM, TPM.
Counts are always expressed relative to a genomic feature (a gene or a transcript) and
represent the number of reads that overlap with that feature.
Due to gene expression's inherent variability, all measurements need to be repeated several
times for each condition. Replicate numbers vary from 3 to 6 or more. The more replicates,
the more reliable the results - though there are diminishing returns at some point, and the cost
of the experiment may become the limiting factor.
A count file is a tab or comma-separated file that contains count data for each feature and each
replicate. For example:
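A minimal sketch of such a file; the gene names and counts are made up for illustration, and the sample names match the design file shown below:
name,run1,run2,run3,run4,run5,run6
GENE-1,12,10,14,50,48,52
GENE-2,0,1,0,0,2,1
GENE-3,230,210,198,225,207,215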
For demonstration purposes, we show the counts and design files in tabular format.
In general, we prefer, and our tools always use, comma-separated files that can readily be
viewed in a spreadsheet program. In addition, tab-separated files cannot be reliably
copy-pasted from a web browser, making them difficult to use as examples.
A design file connects the sample names of the count file to a group. For example:
sample group
run1 WT
run2 WT
run3 WT
run4 KO
run5 KO
run6 KO
In this book, we call the above a design file. Other people may use different terminology such
as metadata, col data, or sample sheet.
In general, genes and samples may have other information (called metadata) associated with
them. For example, a gene may have a description, a common name, or a gene ontology
annotation. A sample may have a tissue or a time point associated with it. Here we show you
the bare minimum of information needed to perform a differential expression analysis.
Where to go next
• RNA-seq simulations
• Generate RNA-Seq counts using HiSat2
• Generate RNA-Seq counts with Salmon
Once you have a count matrix and a design file you can perform differential gene expression
analysis as described in:
We could use realistic data, of course, but the trouble there is that we never know what the
true answer is. So we can't quite tell whether the results are correct or not. A simulated dataset
is generated from a model that we know the answer to; we can then compare the results of the
analysis to the known answer.
The first challenge is always to understand the inputs and outputs of a differential gene
expression analysis. We need to learn to be confident when evaluating the results we produce.
The best way to understand how RNA-Seq differential expression works is to generate count
matrices with different properties and then investigate how the statistical methods perform on
that data. Our modules provide several methods to generate counts.
One kind of simulation generates a null hypothesis data with no changes. Using this
simulation we can validate our methods since any detection of a differentially expressed gene
would be a false positive.
flowchart LR
NULL(<b>Null Simulations</b> <br> No differential expression) --> COUNTS
NULL --> DESIGN
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Expect to Find Nothing</b><br>Effect Sizes and P-Values)
We provide a module named simulate_null.r that generates a null dataset. The null data
was experimentally obtained by sequencing 48 replicates of a wild-type yeast strain.
When we run the simulation, our tool will select replicates at random from the 48 wild-type
columns and then randomly assign them to one of two groups.
Note: Remember that you can run every R module with the -h flag to see help on its use.
Running the simulate_null.r script would generate a count matrix counts.csv and a
design file called design.csv . To perform the simulation execute:
Rscript src/r/simulate_null.r
The data was published in the following papers that also are interesting to read:
• Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
• How many biological replicates are needed in an RNA-seq experiment and which
differential expression tool should you use?
We also provide a method that generates realistic differential expression effects.
The genes may be consistently up- or down-regulated, or they may be differentially expressed
in a more stochastic way. The simulations allow us to evaluate the specificity and sensitivity of
the methods.
flowchart LR
NULL(<b>Realistic Simulations</b> <br> Known effect sizes) --> COUNTS
NULL --> DESIGN
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Recover Known Effects</b><br>Effect Sizes and P-Values)
The module will generate a realistic dataset based on published data, containing rows with
differential expression changes. To run it, use the src/r/simulate_counts.r module:
Rscript src/r/simulate_counts.r
Prints:
The code above performed an RNA-Seq simulation and created the files: design.csv and
counts.csv
Read through the output and think about what each line means. For example, it claims that the
data will have 270 detectable changes. Open both the count and design files in Excel and
study their contents. When I ran the command, the files looked like so:
Identify which samples belong in the same group. Make some notes about what you can
observe in the data. Can you see rows that appear to vary less within the groups and more
across the groups? Are those rows differentially expressed?
What to do next?
Now that you have a simulated data set, you can run the differential expression analysis on it.
Experiment with the various simulation methods and try to verify that the results you get are
what you expect.
This chapter assumes you have a count matrix and a design file that specifies the relationships
between the columns of the count matrix.
flowchart LR
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Results</b><br>Effect Sizes and P-Values)
We will now walk you through the steps of analyzing a count matrix.
We provide two modules that implement different statistical methods: edger.r and
deseq2.r .
Both modules take the same input and produce an output that is formatted the same way. The
only difference is the statistical method they use. By default, the tool will analyze the
counts.csv file using the grouping in the design.csv . You can of course change these
input parameters.
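For example, to run the analysis on explicitly named inputs and choose the output file, you can pass the parameters directly; the flags below are the same ones used later in this book's Airway workflow:

Rscript src/r/edger.r -d design.csv -c counts.csv -f group -o edger.csv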
Note: Remember that you can run every R module with the -h flag to see help on its use.
My example count data was simulated with the simulate_counts.r method described
in the RNA-Seq simulations chapter.
Rscript src/r/simulate_counts.r
Investigate the two files to understand what they contain. You could also use the counts and
design files generated in the other tutorials.
Now let's run the edger.r module that will process the counts.csv and the design.csv
that the simulations produced.
Rscript src/r/edger.r
It shows how many PValues ( 380 ) would be significant at a 0.05 threshold without adding
a multiple testing correction! Once we apply the FDR adjustment, we'll end up with 219
significant rows.
Now, strictly speaking, FDR is not the same concept as the PValue, and we don't need to apply
the same significance level to each. You can apply any other filtering criteria yourself. Read
the statistics chapter to learn more about the differences between PValues and FDRs.
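For instance, to keep only the rows that pass an FDR cutoff you can apply a small shell filter yourself; the sketch below assumes the results file has a header row that contains an FDR column:

# Keep the header plus all rows where the FDR column is below 0.05.
awk -F, 'NR==1 { for (i=1; i<=NF; i++) if ($i=="FDR") c=i; print; next } $c < 0.05' edger.csv > significant.csv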
In a realistic experiment, we would not know which genes have changed, but in a simulated
experiment, we do. If your input counts.csv data was simulated with the
simulate_counts.r method described in RNA-Seq simulations, then it will also
contain an FDR column that indicates which genes were generated to be
differentially expressed.
We have an R script that can evaluate any two files that have FDR columns to find matching
rows. You can use this script to compare the results of your differential expression analysis to
the simulated results or to compare the results of different methods.
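Like our other R modules, the script prints its usage when run with the -h flag. A hypothetical invocation comparing the simulated truth to the edgeR results might look like the lines below (the positional arguments are an assumption; consult the -h output for the actual interface):

Rscript src/r/evaluate_results.r -h
Rscript src/r/evaluate_results.r counts.csv edger.csv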
# Tool: evaluate_results.r
# File 1: 270 counts.csv
# File 2: 219 edger.csv
# 173 match
# 97 found only in counts.csv
# 46 found only in edger.csv
# Summary: summary.csv
Explaining in detail how PCA decompositions work is out of the scope of this book. But you
do not need to fully understand how it gets created to be able to interpret it, just as you don't
need to know how a Burrows-Wheeler transform works to use the bwa aligner.
A PCA plot is informative in that samples that are correlated with one another will bunch up
together. Samples that are different from one another will be more distant from one another.
PCA plots are the first plots you should make of your normalized expression matrix
to demonstrate that replication worked as expected.
Our PCA plot above is informative and convenient to use, but you might want to customize
it at some point. You can edit our code, but we also recommend that you evaluate the
following package:
Generate a heatmap
A heatmap is a high-level overview of the consistency of the statistical results that your
method produced. It allows you to understand inter-replicate and intra-group variability.
The most important thing to note is that the heatmap rescales the values into so-called z-scores for
each row separately. The colors in one row are therefore not related to the magnitudes in another row;
the per-row scaling makes the patterns, though not the absolute values, comparable across rows.
It transforms the numbers by subtracting the average of the row from each value in the row, then
divides the result by the standard deviation of the row. Each number in the heatmap
(basically the color of it) indicates how far away (how many standard deviations) the value is
from the average of its row.
For example, the red (+2) would mean that the counts in that cell were two standard deviations
higher than the average (the gene is up-regulated). A green of (-1) would mean that the counts
in that cell were one standard deviation lower than the average.
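To make this concrete: if a row contains the values 8, 10 and 12, the row average is 10 and the sample standard deviation is 2, so the three cells map to z-scores of -1, 0 and +1.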
Important: z-scores are not fold changes! z-scores are computed relative to the average of the
row.
The heatmap shows only the differentially expressed genes, and we want to see nice, uniform
red/green blocks that indicate consistency within replicates and clear differences between the
groups.
Our heatmap visualization above is informative and convenient; at some point you might want
to customize it. You can edit our code, but we also recommend that you evaluate the following
packages:
• Bioconductor: ComplexHeatmap
• A simple tutorial for a complex ComplexHeatmap
Volcano plots represent a helpful way to visualize the results of differential expression
analyses.
First, we decide what level of FDR we consider the data trustworthy. Say 0.05 . Remember
how the cutoff applies to an FDR (false discovery rate, an error rate) and not a P-value (a
probability of making an error).
All the genes that pass the cutoff are the genes we have found. In this case, the first 219
lines would be the genes we consider differentially expressed.
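For illustration, pulling out just the gene names of those top rows might look like the sketch below (it assumes the results are sorted by FDR and that the gene name is stored in the first column):

cat edger.csv | head -220 | cut -d , -f 1 | head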
Prints:
name
GENE-6491
GENE-4048
GENE-17700
GENE-9879
GENE-18457
GENE-6384
GENE-8179
GENE-6576
GENE-18742
The example above used simulated counts. Now learn how to generate count matrices from
sequencing data.
This tutorial will demonstrate the process of creating transcript abundance counts from RNA-
Seq sequencing data using a method that relies on genome alignments. The process diagram
will look like so:
flowchart LR
The reference file is the genome for the organism that contains all DNA: coding and non-
coding regions. The counting process will require another file, typically in GTF format,
describing the genome's coding regions.
In another tutorial titled RNA-Seq differential expression, we show how to analyze and
interpret the counts produced here.
• Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS
Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies,
Benjamin J. Ainscough, Obi L. Griffith.
The data consists of two commercially available RNA samples. The first is called Universal
Human Reference (UHR) and consists of total RNA isolated from a diverse set of 10 cancer
cell lines. The second dataset, named Human Brain Reference (HBR), is total RNA isolated
from the brains of 23 Caucasians, male and female, of varying ages but mainly 60-80 years
old. The authors also maintain a website for learning RNA-seq that you can access via:
Note: We have greatly simplified the data naming and organization relative to the source. Our
approach is also quite different from that presented in the resource above.
The sequencing data was produced in three replicates for each condition and sequenced in a
paired-end library. For this tutorial, we prepared a smaller subset (about 125 MB download)
filtered to contain only those reads that align to chromosome 22 (and the spike-in control).
You can get our subset by running the following command:
URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz
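# A sketch of the download step using the curl.mk module described in the Modules
# chapter; the exact target and any directory parameter may differ, so run the
# module without arguments to see its usage:
make -f src/run/curl.mk get URL=${URL}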
The module downloads the file into the data directory and then unpacks the contents of the
tar.gz archive.
Study your directories and note the files that have been created for you.
To make the tutorial go faster, we will only use chromosome 22 of the human genome as a
reference. In addition, we have already made this sequence part of the download and placed it
into the refs directory. If you did not have that file, you would need to consult the How to
download genome data tutorial on how to get this data.
Note that we have access to a FASTA genome file, a GTF annotation file, and a transcriptome
file. Investigate each file to understand what each contains.
Prints:
The FASTQ files contain the raw sequencing reads. Look at the files and identify the naming
structure.
That prints:
Look at the reads folder and identify the roots for the names of the samples. You have files
that look like this:
reads/HBR_1_R1.fq
reads/HBR_2_R1.fq
...
reads/UHR_1_R1.fq
reads/UHR_2_R2.fq
...
The data appears to be a single-end library with three replicates per condition.
The root is the unique identifier in the name that allows us to quickly generate the full file
names from it. In this case, the variable part of the names follows the patterns
HBR_1 and UHR_1 . From those patterns, we can generate all the unique file names we might
need. It is okay if your roots are not that short; you could list the whole file name if you wish.
We have to create the design file that lists each sample's root and group. We can do this by
hand or via an automated process (see the sketch after the listing). The design.csv will look like so:
sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR
UHR_2,UHR
UHR_3,UHR
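A possible way to generate the file automatically is sketched below; it assumes the reads/<sample>_R1.fq naming pattern shown above:

# Write the header, then derive sample roots and groups from the R1 file names.
echo "sample,group" > design.csv
ls reads/*_R1.fq | sed 's!reads/!!; s!_R1.fq!!' | awk -F_ '{ print $0 "," $1 }' >> design.csv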
A splice-aware short-read aligner must be used when aligning transcript sequences to the
genome. We have had great success with the HiSat2 aligner, and we provide a module for it.
First, we need to index the reference genome. Indexing is a one-time operation that will take a
few minutes to complete.
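A sketch of the indexing step, using the hisat2.mk module's index target and the same REF value that the alignment commands below use:

make -f src/run/hisat2.mk index REF=refs/chr22.genome.fa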
Take note of how your directory structure changed. The index has been created in the idx
folder.
Once the indexing completes, aligning a single, say, HBR_1 sample would look like so:
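A sketch, modeled on the command printed a little further below for the UHR samples:

make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/HBR_1_R1.fq BAM=bam/HBR_1.bam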
The output is a somewhat confusing wall of text with gems such as "aligned concordantly 0
times" ... what the heck is that supposed to mean ... but we move on and note that the final
alignment rate is 52%.
Where is your alignment file? Look in the bam folder. Note that we set the output BAM file
name from the command line.
We need to repeat the alignments for all six samples. We can automate the process using the
design file, but first, I like to put the roots into a separate file to make the subsequent
commands a lot shorter. The command below will keep only the HBR_1 and HBR_2 ... roots
one per line:
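One way to do it is sketched below, taking the first column of design.csv and dropping the header:

cat design.csv | cut -d , -f 1 | tail -n +2 > ids.txt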
The next task is to use the ids.txt and parallel to create commands. Usually, it takes a
few tries to get the commands just right.
A practical troubleshooting approach is to add an echo in front of the command to see what
it generates to ensure it is correct. Below I am adding an echo to see the command that
parallel wants to generate:
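A sketch of what such a command might look like (the file naming follows the pattern used above):

cat ids.txt | parallel echo make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/{}_R1.fq BAM=bam/{}.bam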
It does not matter if you are new or experienced; it will take some attempts to write everything
right. Keep editing and tweaking. Copy and paste just the first command and see whether it
works. The command above prints:
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/UHR_2_R1.fq BAM=bam/UHR_2.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/UHR_3_R1.fq BAM=bam/UHR_3.bam
Once the first command is correct, all the others should work too, so next, remove the echo
from the start and have parallel execute the commands instead of echoing them on the screen:
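A sketch of the same command with the echo removed:

cat ids.txt | parallel make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/{}_R1.fq BAM=bam/{}.bam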
Boom! It just works on all samples and runs automatically in parallel. Look at your bam
directory; you should see BAM files for each sample.
bam/HBR_1.bam
bam/HBR_2.bam
bam/HBR_3.bam
bam/UHR_1.bam
bam/UHR_2.bam
bam/UHR_3.bam
BAM files are not the easiest to visualize, as IGV will only load them when sufficiently
zoomed in. For large genomes, BAM files can be quite a hassle to load up.
Bioinformaticians invented a new type of file called wiggle (specifically the big-wiggle
variant), allowing them to visualize the coverage more efficiently. The file name extension for
the coverage tracks is .bw . We provide you with a module to make a BigWig wiggle file
from a BAM file:
The command above will generate a wiggle file called wig/HBR_1.bw that can be loaded into
IGV.
Once we figured out how to run one analysis, we can automate the creation of all bigwig files
with parallel :
The resulting files, as well as the refs/chr22.gtf annotation, may be visualized in IGV to produce the
following image:
Note how the alignments cover only the exonic regions of the transcripts.
The featureCounts program will count the number of reads that overlap with each feature
in the annotation file. By default, the program expects a GTF file and will summarize exons
over the genes . We can set up the program differently, but for now, we will use the default
settings:
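For a single sample the command might look like this (a sketch; the -a and -o flags are the same ones used in the combined command later in this section):

featureCounts -a refs/chr22.gtf -o counts.txt bam/HBR_1.bam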
Investigate the counts.txt file to see what it looks like. We can list multiple files to count
all the samples at once; for example, we could write:
The above works and produces a separate column for each BAM file. We could also rely on
shell pattern matches, though some caveats apply:
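For example, a glob pattern along these lines (a sketch; note that the column order will depend on how the shell expands the pattern):

featureCounts -a refs/chr22.gtf -o counts.txt bam/*.bam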
I will admit that I usually use the pattern above. But I know I'm playing with fire there, so I
double-check the column orders.
The most robust solution would be to generate the list of input files from the ids.txt and
pass that to the feature counter.
The --xargs parameter changes the behavior of parallel , collects all the input, and then
passes it to the command all at once. Thus, what we need is to generate the BAM file names
from the roots, then list them all in one shot like so:
cat ids.txt | \
parallel -k echo bam/{}.bam | \
parallel --xargs featureCounts -a refs/chr22.gtf -o counts.txt {}
We are not really double looping since the second parallel waits for input from the first.
We are using a convenient feature of parallel to create the file names.
The first parallel produces the bam file names, and the second parallel collects these
names and passes them to the featureCounts command. I am using the -k (keep order)
option to ensure that files are listed in the same order as in the ids.txt .
So now we have the counts.txt file. Usually, the file needs a little post-processing to make
it more useful. I, for one, like to add the sample names and gene names as a column and turn it
into a CSV file.
The subsequent commands require that you either switch to the stats environment or that your
current environment has the biomart and tximport libraries installed. You could do the
latter with the following:
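A sketch using the bioconda package names (verify the exact names for your setup):

mamba install bioconductor-biomart bioconductor-tximport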
Rscript src/r/format_featurecounts.r -h
Rscript src/r/format_featurecounts.r
will print
# Reformating featurecounts.
# Input: counts.txt
# Output: counts.csv
When you look at the counts.csv file, you'll see that it contains Ensembl gene IDs that are
not easily recognizable. The need to fill in informative gene names is such a common task that
we made all our scripts able to do it automatically.
First, you need to obtain the mapping between the transcript and gene names. You can do it
with the following module src/r/create_tx2gene.r
As you can imagine, each organism has a different mapping; thus, we have to get the correct
mapping for the organisms we are working with. Run the module with the -s flag to show
you all the mappings it can produce.
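For example, saving the listing into a file (a sketch; this is presumably how the names.txt file mentioned below was produced):

Rscript src/r/create_tx2gene.r -s > names.txt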
The listing inside the names.txt file shows that Homo sapiens can be accessed via
hsapiens_gene_ensembl . You only need to generate the transcript to gene mapping
once (it takes a while to get it, so keep it around). Let's create our transcript to gene id
mapping file:
Prints:
The code above generated a file called tx2gene.csv . Investigate the file to see how it
connects the various identifiers.
Now that we have tx2gene.csv , we can use the mapping file to add the informative gene
names into the counts.csv file:
And voila, the counts.csv file carries the correct gene names.
The resulting counts.csv can be used with the How to perform an RNA-Seq differential
expression study tutorial.
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/rnaseq-with-hisat.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
# Genome reference.
REF=refs/chr22.genome.fa
This tutorial will demonstrate the process of creating transcript abundance counts from RNA-
Seq sequencing data using methods that rely on quantification (sometimes called
classification). The process diagram will look like so:
flowchart LR
The reference file used during quantification is the organism's transcriptome that lists all
known transcripts as a FASTA sequence. The quantification process assigns each read in the
FASTQ file to a single transcript in the FASTA file. Since transcripts may share regions
(isoforms are similar to one another), the classifier must employ a sophisticated redistribution
algorithm to ensure that the counts are properly assigned across the various similar transcripts.
In another tutorial titled RNA-Seq differential expression, we show how to analyze and
interpret the counts produced here.
The tutorial at RNA-Seq using HiSat2 describes the origin of the sequencing data and various
details of it.
URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz
The unpack command downloads and automatically extracts tar.gz files. As in that
tutorial, we capture the layout of the data in the design.csv file, which will contain the
following:
sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR
UHR_2,UHR
UHR_3,UHR
Quantification processes operate on transcript sequences rather than genomic sequences. The
downloaded data we provided includes the transcript sequences.
The primary difference relative to the approach in RNA-Seq using a genome is that the
reference sequence will be the transcriptome and not the genome.
The field of quantification offers us two tools, kallisto and salmon . We will demonstrate
the use of the salmon program as it appears to be more actively maintained.
First, we need to index the reference transcriptome. Indexing is a one-time operation that will take a
few minutes to complete.
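A sketch of the indexing step using the salmon.mk module; the REF value matches the transcriptome file listed in the workflow at the end of this chapter:

make -f src/run/salmon.mk index REF=refs/chr22.transcripts.fa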
Once the indexing completes, classifying a single sample, say HBR_1 , would look like so:
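A sketch (the SAMPLE value is an assumption based on the output layout shown below):

make -f src/run/salmon.mk align REF=refs/chr22.transcripts.fa R1=reads/HBR_1_R1.fq SAMPLE=salmon/HBR_1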
Note how salmon places each output in a separate directory named after the sample, where the
abundance file is always named quant.sf . The abundance file contains transcript-level
counts. Note how the Name column contains Ensembl transcript identifiers.
You can manually invoke the commands for each sample. It would be better practice to
automate it via parallel . First, we create the roots for the samples to make the commands
look simpler.
Then we can run the classification for all samples with the following:
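A sketch modeled on the HiSat2 chapter; the roots file is assumed to be named ids.txt and the SAMPLE naming is an assumption:

cat ids.txt | parallel make -f src/run/salmon.mk align REF=refs/chr22.transcripts.fa R1=reads/{}_R1.fq SAMPLE=salmon/{}

# List the abundance files that were produced.
find . -name quant.sf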
Prints:
./salmon/UHR_3/quant.sf
./salmon/HBR_1/quant.sf
./salmon/UHR_2/quant.sf
./salmon/HBR_2/quant.sf
./salmon/HBR_3/quant.sf
./salmon/UHR_1/quant.sf
We have produced a separate count file for each sample. Next, we need to combine these files
into a single output.
The subsequent commands require that you either switch to the stats environment or that
your current environment has the biomart and tximport libraries installed. You could do
the latter with the following:
To recap, we have a large number of abundance files in different folders, and each file has a
column with the counts. What we need to do next is to extract and glue those columns together
into a single file.
We have written a module that does just that, combines the counts into a single file:
Rscript src/r/combine_salmon.r
Note how the name column contains Ensembl transcript identifiers. You can check that each
column in the file above corresponds to the count column of the quant.sf for that sample.
When performing any counting, we want the names to be unique, like ENST00000359963 , so
that there is no misunderstanding about which feature was used.
Informative gene names, on the other hand, are often acronyms that help biologists associate
the gene name with a function. The problem there is that the generic names may be reused
across species and even across different versions of the same species.
Unsurprisingly, life scientists always want to see informative gene names in their results. So
we need to be able to add these names to our unique identifiers. There are quite a few ways to
go about the process, and you are welcome to explore the various alternative solutions.
In the Biostar Handbook, we provide you with two scripts that assist with the process.
First, you need to obtain the mapping between the transcript and gene names. You can do it
with the following module src/r/create_tx2gene.r
As you can imagine, each organism has a different mapping; thus, we have to get the correct
mapping for the organism we are working with. Run the module with the -s flag to show
you all the mappings it can produce.
The listing shows that Homo Sapiens can be accessed as hsapiens_gene_ensembl . You
would only need to generate this file once (it takes a while to get it, so keep it around).
Prints:
Investigate the file tx2gene.csv to see how it connects the various identifiers on each line.
Rerunning the combination script, but this time with the mapping file, will produce gene
names in the second column:
Some transcripts may not be present in the mapping file; other transcripts may not have a
known gene name associated with them.
Finally, our module also allows you to summarize over gene names. To have that work, we
need to have a mapping file that connects the transcript names to the gene names, and then we
need to invoke the module with the -G flag.
Produces:
The resulting count file can be used as input for the differential expression modules.
The complete workflow is included with the code you already have and can be run as:
bash src/scripts/rnaseq-with-salmon.sh
#
# Biostar Workflows: https://www.biostarhandbook.com/
#
# Genome reference.
REF=refs/chr22.transcripts.fa
This chapter assumes we have a differential expression matrix and we wish to find the
interpretation of the results.
flowchart LR
DE(<b>Differential Expression</b><br>Effect Sizes and P-Values) --> FUNC{Functional <br>
Analysis}
FUNC --> RES(<b>Functional Enrichment</b><br>Effect Sizes and P-Values)
Below we assume that we ran the RNA-Seq counts with Salmon tutorial and obtained the
resulting counts.csv file. Then we used the edger.r module as described in RNA-Seq
differential expression to produce the differential expression matrix. We assume that the
resulting file is called edger.csv .
For convenience, we also distribute this file separately; you can download it with:
wget http://data.biostarhandbook.com/books/rnaseq/data/edger.csv
The main volume of the Biostar Handbook describes several functional enrichment tools out
of which we explore a few below.
The bio profile command uses the g:Profiler service to find the functional enrichment of
the genes in the differential expression matrix.
prints:
# Running g:Profiler
# Counts: edger.csv
# Organism: hsapiens
# Name column: gene
# Pval column: FDR < 0.05
The resulting CSV file will list the various functions that are found to be enriched.
Do visit the main g:Profiler service as the results there will always be easier to navigate.
Enrichr
Enrichr integrates knowledge to provide synthesized information about mammalian genes and
gene sets. Read more about the protocols in the publication titled Gene Set Knowledge
Discovery with Enrichr in Curr Protoc. 2021
• https://maayanlab.cloud/Enrichr/
Our tool bio implements the bio enrichr command to facilitate the functional
exploration of the genes in the differential expression matrix. To run the tool execute:
The command will process the file to extract the gene names with FDR<0.05 and submit
these genes to the web service. The output of the tool will look like this:
# Running Enrichr
# Counts: edger.csv
# Organism: mmusculus
# Name column: gene
# Pval column: FDR < 0.05
# Gene count: 279
# Genes: IGLC2,SEPTIN3,SYNGR1,MIAT,SEZ6L,[...]
# Submitting to Enrichr
# User list id: 57476587
# Entries: 95
# Output: enrichr.csv
The resulting CSV file will list the various functions that are found to be enriched.
2. Workflows
The following chapter describes the process of reproducing the results of the paper:
Notably, the workflow we present below is suited to reproducing any other RNA-Seq analysis
that makes use of pairwise comparisons.
We are particularly pleased with the last observation. Far too often, scientific reproducibility is
mistakenly framed as being able to reproduce the same numerical results while using the same
analytical methods.
In reality, scientific reproducibility should mean validating the biological insights and
conclusions using different but equally valid approaches.
The analysis is implemented with the process described in detail in the RNA-Seq with salmon
guide.
We will show the commands as shell commands, but our final product is a Makefile that
allows better reentrant behavior. Basically, a Makefile lets us rerun certain parts and
pick up where we left off. The approach will also allow you to understand how bash
commands are deployed as make instructions.
The complete source code of the Makefile is included at the end of this chapter.
The Makefile is also included in the code distribution you already have and the entire
process described below can be run in one shot as:
The first step of reproducibility is identifying the accession numbers and the metadata that
connects biological information to files. So we had to read the paper and find the accession
numbers.
The paper lists the GEO number GSE52778 ; visiting the GEO page, we can locate the
BioProject number PRJNA229998
• https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
• https://www.ncbi.nlm.nih.gov/bioproject/PRJNA229998
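One way to pull the run metadata is with our bio toolkit (a sketch; check bio search --help for the exact flags available in your version, and note that the book's Makefile may obtain the file differently):

bio search PRJNA229998 --csv > runinfo.csv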
As always, the output is a CSV file with possibly dozens of columns, only some of which are filled in.
Never underestimate the ability of scientists to overcomplicate things. Scientists are nerds,
and nerds love to make things complicated. Complications are a sign of intelligence and
importance - or so the belief goes.
Any time you have to decipher a file made by scientists, you should assume that it is a mess,
like the above file. Hard work and dedication are needed to tease apart the information you
need (swearing helps, at least it helps me a great deal).
run_accession,sample_title
SRR1039508,N61311_untreated
SRR1039509,N61311_Dex
SRR1039510,N61311_Alb
SRR1039511,N61311_Alb_Dex
SRR1039512,N052611_untreated
SRR1039513,N052611_Dex
SRR1039514,N052611_Alb
SRR1039515,N052611_Alb_Dex
SRR1039516,N080611_untreated
SRR1039517,N080611_Dex
SRR1039518,N080611_Alb
SRR1039519,N080611_Alb_Dex
SRR1039520,N061011_untreated
SRR1039521,N061011_Dex
SRR1039522,N061011_Alb
SRR1039523,N061011_Alb_Dex
From the above list, we will focus on a single comparison, wherein the airway smooth muscle
cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-
inflammatory effects.
Thus, we must retain only the untreated samples and those treated with dexamethasone. We
are going to rename untreated samples to ctrl and dexamethasone-treated samples to dex .
And we are going to separate the cell types, such as N61311 , from the sample names.
We simplified and edited the file above and inserted another column for the treatment to create
our design.csv that will govern the analysis. There are many ways to label the data. We
chose to set up the following design matrix:
The command above formats the CSV file more nicely for readability:
We added a group , a celltype and a sample column, as it turns out later that they all
provide useful information. You can add additional columns to the design file at any time.
If you use Excel on Windows and then plan to use the file via the Linux subsystem, you may
need to convert the file endings to Unix format. The simplest way to do this is to open the
CSV file with your editor and change the line endings.
The Makefile we provide can generate the design file for you with:
The study makes use of the human genome. Since we plan to use the salmon tool, we need to
obtain the transcriptome sequence for our target genome. Several resources may be used to
download this type of data. We will choose the Ensembl download site:
• https://ftp.ensembl.org/pub/
With some clicking and poking around we can locate the files we need.
• CDNA: http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/
Homo_sapiens.GRCh38.cdna.all.fa.gz
We can get the data in various ways: we can click, download, unzip and rename the files. Or
we could use curl as demonstrated below. In addition, we need to prepare an index
suitable for processing with salmon . Our Makefile contains rules for both steps; the complete
Makefile is included at the end of this chapter.
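Expressed as plain shell commands, the download and indexing steps are roughly the following (a sketch; the directory and index paths are assumptions, and the Makefile drives the same steps through the CDNA variables shown later in this chapter):

mkdir -p ~/refs/hsapiens
curl http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz | gunzip -c > ~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa
salmon index -t ~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa -i ~/refs/hsapiens/salmon.idx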
Try it out (remember that you can add the -n flag for a dry-run to see what make plans to
do):
The command above will download and index the human transcriptome file. On my iMac it
takes around 5 minutes.
We will use the design.csv file and extract the content of various columns and use those
values to build the commands we want to run. Here is how that works:
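A minimal sketch of the mechanism (the real commands run make targets rather than echo):

cat design.csv | parallel --header : --colsep , echo run={run} sample={sample}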
This builds and runs an echo command in a loop and prints output like the following:
The {run} and {sample} patterns are placeholders for the values in the run and sample
columns of the design.csv file. Using --header : with the parallel command will
extract and replace the placeholders with the values in the columns.
Once you understand what is happening above, you understand the entire automation process.
If you need help with parallel read the Art of Bioinformatics Scripting for additional
information.
Later, we will add the -v and --eta flags to parallel . The first flag shows the
commands that will be run; the second will print estimates for the completion of each task.
Both of these flags are optional, but I found these to be quite handy.
Every automation and looping you will ever do will be a variation of the command above. So
it is worth taking the time to understand it.
The project stores paired-end reads in SRA. We must set the MODE=PE parameter for all of
our workflows. We can automate the download from SRA with the following:
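A sketch only: the parameter names follow the sra.mk module's usage, but the exact file naming in the book's Makefile may differ.

cat design.csv | parallel --header : --colsep , make -f src/run/sra.mk fastq MODE=PE N=1000000 SRR={run} R1=reads/{sample}_1.fastq R2=reads/{sample}_2.fastq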
Look at your reads directory to see what it contains. By selecting 1 million read-pairs per
sample, we reduce the dataset size to a few gigabytes, and the process can run in less than 20
minutes, even on a Macbook Air laptop on a home internet connection. Set N=ALL to
download the entire dataset.
Below we will run the process with 2 CPUs; you may run the process with as many CPUs as
you have available.
CDNA=~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa
Using the Makefile you can run the same commands with:
The samples separate cleanly both by treatment and by cell type. At this point we are
getting more confident that the analysis will produce meaningful results.
You can use the design.csv file to generate the differential expression analysis results:
# Run edgeR.
Rscript src/r/edger.r -d design.csv -f group -c counts.csv -o edger.csv
We can see that the samples are separated by treatment. We can see more genes upregulated in
the Dex samples.
Let's verify that the above genes are differentially expressed in our data. Running such checks
typically requires that you write various programs.
The task is so common that in this book we provide you with a generic script that can compare the
contents of any files that have FDR columns. To use it, we need to create a new file that lists the
genes found as differentially expressed in the paper. The file pub.csv would look like this:
gene,PAdj,FDR
DUSP1,0,0
KLF15,0,0
PER1,0,0
TSC22D3,0,0
C7,0,0
CCDC69,0,0
CRISPLD2,0,0
We add FDR=0 to each row to indicate that the gene was considered to be differentially
expressed. We then use the evaluate_results.r script to compare the genes in
edger.csv to pub.csv :
The results of the script clearly show that we can reproduce all the genes from the paper.
# Tool: evaluate_results.r
# 535 in edger.csv
# 7 in pub.csv
# 7 found in both
# 528 found only in edger.csv
# 0 found only in pub.csv
And we did so by using just 2 million reads per sample, which is about 10% of the total data
on average!
A lesson to remember.
We can reproduce the paper's main findings with just 2 million reads per sample (10% of the
total data).
The paper reports various findings, for example, the functional terms enriched among the differentially expressed genes.
The results are stored in gprofiler.csv and contain terms enriched in the upregulated and
downregulated genes. Let's search the annotations for the terms that are mentioned in the
paper:
Prints:
GO:0007160,cell-matrix adhesion
GO:0030198,extracellular matrix organization
GO:0001952,regulation of cell-matrix adhesion
GO:0048514,blood vessel morphogenesis
GO:0001568,blood vessel development
GO:0031012,extracellular matrix
GO:0062023,collagen-containing extracellular matrix
GO:0050840,extracellular matrix binding
A complete ontology interpretation is beyond the scope of this tutorial, but we can see that the
results are consistent with the paper.
The complete workflow is included with the code you already have and can be run as:
#
# RNA-Seq from the Biostar Workflows
#
# http://www.biostarhandbook.com
#
CDNA_URL ?= http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna
# The name of the ensembl database to use for transcript to gene mapping.
ENSEMBL_DB ?= hsapiens_gene_ensembl
This chapter will demonstrate the use of workflows when attempting to reproduce the results
from the publication titled:
The publication is a so-called "micro-report" that, according to the journal, is designed for
presenting small amounts of data or research that otherwise remain unpublished, including
important negative results, reproductions, or exciting findings that someone wishes to place
rapidly into the public domain.
The paper starts by recapitulating that a mutation in the PRESENILIN 1 gene ( PSEN1 ) is
known to be associated with the early onset of familial Alzheimer's Disease (EOfAD).
The scientists have introduced an EOfAD-like mutation: Q96_K97del into the endogenous
psen1 gene of zebrafish and then analyzed transcriptomes of young adult (6-month-old)
entire brains from a family of heterozygous mutant and wild-type sibling fish.
The main finding of the paper is that according to gene ontology (GO) analysis of the results,
the mutation has an effect on mitochondrial function, particularly ATP synthesis, and on ATP-
dependent processes, including vacuolar acidification.
Here are the lessons we learned while attempting to reproduce the results:
Another way to say this: collecting more data cannot save an inconclusive analysis.
In the chapter titled Airway RNA-Seq in 10 minutes, we have developed a fully reusable
pipeline for RNA-Seq analysis. The current process will follow the same steps but with minor
modifications.
The experiment studies a different organism, and a new design file needs to be set up. In
addition, the reads in this experiment are single-end rather than paired-end; thus, we have to
set MODE=SE for the pipeline. But the vast majority of processes remain the same.
Our new Makefile can simply include the original Makefile and override certain
variables. In a nutshell, our new Makefile will look like this:
What is most important, though, is that we don't need to write any new code. Everything is
ready to go. That is the productivity boost by reusable components.
Note: Run the new study in a separate folder, and do not mix the results with the airways
study. Confusion is the biggest enemy of reproducibility.
The Makefile is also included in the code distribution you already have and can be run right
away as:
Finding metadata is usually the hardest part of automating a bioinformatics project. The
metadata could be distributed in different sources; sometimes, it is located in the
supplementary information; sometimes, it is encoded in the file names, and so on.
The GEO number GSE126096 is provided in the paper. SRA project number PRJNA521018
is given in the paper and may be viewed on the NCBI SRA site:
• GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126096
• SRA: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA521018/
We can obtain the run information for project PRJNA521018 in CSV format:
The resulting file has 128(!) columns, with some filled in and some not. Thankfully some
contain the information that connects the SRR number to the sample name. It appears that we
have two conditions with four replicates each. The information from the data above allows us
to manually generate the design.csv matrix for the workflow.
run,group,sample
SRR8530750,WT,WT_1
SRR8530751,WT,WT_2
SRR8530752,WT,WT_3
SRR8530753,WT,WT_4
SRR8530754,Q96,Q96_1
SRR8530755,Q96,Q96_2
SRR8530756,Q96,Q96_3
SRR8530757,Q96,Q96_4
To generate the design file with our Makefile , we can run the following:
The study makes use of the zebrafish genome. The Latin name for zebrafish is Danio rerio.
Several resources may be used to download this genome data. We will choose the Ensembl
download site:
• https://ftp.ensembl.org/pub/
With some clicking and poking around, we can locate the files we need.
• CDNA: http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/cdna/
Danio_rerio.GRCz11.cdna.all.fa.gz
The steps here are identical to those described in Airway RNA-Seq in 10 minutes
The workflow is nearly identical to that described in the Airway RNA-Seq chapter. We just
added MODE=SE to the command. Let's make it download 4 million reads per sample.
You may need to switch to the stats environment: conda activate stats depending on
how the installation process went for you:
Let's generate the plot; by trial and error, we found how much we need to nudge the label
above the points to make it readable.
The samples in the count matrix do not separate cleanly by treatment. Considering that one of the
groups is a mutant with supposedly large changes in the transcriptome, this is not a good sign.
You can use the design.csv file to generate the differential expression analysis results:
# Run edgeR.
Rscript src/r/edger.r -d design.csv -f group
In the airway analysis, we saw that using a subset of about 2 million reads per sample could
provide us with a good separation of the samples. Here we used 4 million reads (on a much
smaller genome), yet we could not find anything significant.
Up to this point, we only used a subset of the data, and as we have shown before, taking a subset
of the data should show a subset of the results - at the very least, some results. Since we have
not found anything, we can repeat the run with the complete data.
We can do this by deleting the reads folder rm -rf reads , then changing the N parameter
to ALL and rerunning each step.
I will call the resulting counts file full.csv, and all subsequent analyses will be performed
on this full file.
1. The PCA plot of the count file does not separate the samples.
2. A straightforward and standard differential expression analysis does not find any significant
gene.
When faced with problems like this, scientists start window shopping for alternative methods.
They explore tools until something works. This is called p-hacking and counts as one of the
sketchiest practices in science.
On the other hand, the very essence of science is exploration. We have to try different things
and see what works. Biological data is complicated; it is very much possible that the data we
have is not suitable for the analysis we are trying to perform on it. It is possible that a specific
dataset is particularly ill-suited for a certain type of analysis. We ought to explore the data and see
what it tries to tell us.
At what point should we stop exploring? At what point does it become p-hacking? Smart
people are the best p-hackers, and can do so without even realizing it. Someone less skilled
would give up after a few failed attempts. A talented data analyst can come up with the coolest
tricks ever.
As you can see, the answer is far less clear-cut than anticipated.
For example, edgeR has a so-called "classic" method that is supposed to be less performant.
We could try that:
Now we are getting results. Where there was nothing before, we get 13 genes:
Could we do better?
We can remove lines where the expression level is low. When we do so, we reduce the number
of comparisons in the multiple-test correction.
In our code, we already remove lines with fewer than three counts, but we can make the
threshold even stricter. Let's only keep rows where at least three samples have a count of 10 or
higher.
Copy the src/r/deseq2.r script to a local version and find the line where we filter data. It
can be changed to be stricter with the following:
will produce
So we started with no differential expression, but now we have 63 genes that pass the FDR
threshold.
Think about it from a scientist's point of view. Before, they had nothing to show. Now they
have 63 genes that can be talked about and published.
In total, 251 genes were identified as differentially expressed (see Additional file 1). Of these,
105 genes showed increased expression in heterozygous mutant brains relative to wild-type
sibling brains, while 146 genes showed decreased expression.
The authors provide a list of genes in the supplementary material. We downloaded the
supplementary file, stored the genes in one column, and created an FDR column with zeros.
That prints:
gene,FDR
si:ch211-235i11.4,0
psmb5,0
si:ch211-235i11.6,0
si:dkey-206p8.1,0
clcn3,0
gpr155b,0
zgc:154061,0
dph6,0
CABZ01053323.1,0
We can now compare the genes in the supplementary material with the genes we found:
We find that 15 genes overlap, and most do not pass the FDR threshold:
# Tool: evaluate_results.r
# 63 in deseq2.csv
# 251 in pub.csv
# 15 found in both
# 48 found only in deseq2.csv
# 235 found only in pub.csv
# Summary: summary.csv
Gene ontology (GO) analysis implies effects on mitochondria, particularly ATP synthesis,
and on ATP-dependent processes, including vacuolar acidification.
We conclude that the scientific insights presented in this paper cannot be reproduced.
We want to note that, in our opinion, the list of genes provided by the authors is almost
certainly incorrect. And we suspected as much even before we performed the analysis.
The fold changes (effect sizes) reported by the authors in their main report file are extremely
small (indicated by the log2FC column in the Additional file 1). Most genes the authors
report as being differentially expressed appear to change between 10% to 30% between
conditions (as fold change and not log2FC ). The median fold change is just 25% (up or
down), and half the genes have changes smaller than that value.
It is exceedingly unlikely that their methodology, in particular, and RNA-Seq data, in general,
is appropriate to detect changes of such a small magnitude.
In our opinion, even a cursory (but critical) examination of the gene expression file should
have told the authors (and reviewers) that the results, as reported, are very unlikely to be
correct.
This would be neither the first nor the last time that a paper is published with incorrect
results. What about the thousands of other papers just like it?
We understand the pressures and demands put on data analysts better than most. There is no
reason to believe the authors intentionally or deliberately published incorrect results. They just
hacked away at it until it worked. The pressure cooker of academia is a powerful force!
Publish or perish is the name of the game.
At some point, the connection with reality was lost, and all that was left was shuffling endless rows of
numbers through cutoffs and thresholds.
It could also very well be that we did something incorrectly. That's always a possibility.
One of the main challenges, and a reason people don't publicly question research results, is that
the burden of proof for refuting a published result is substantially higher than the burden for
publishing it in the first place.
We did in fact spend significantly more time on the data analysis than what we present
above. We have attempted to validate it in several other ways as well. We wanted the results to
be true so badly!
What a beautiful story it would have been to find that a single 6 base deletion causes massive
changes in the gene expression of the brain, leading to early onset Alzheimer's disease. It
would have been a triumph of science and bioinformatics, and a great story to tell.
The findings could be true ... all we are saying here is that the data in this paper is not
sufficient to support the conclusions.
The complete workflow is included with the code you already have and can be run as:
#
# Presenilin RNA-Seq in the Biostar Workflows
#
# http://www.biostarhandbook.com
#
# The name of the ensembl database to use for transcript to gene mapping.
ENSEMBL_DB = drerio_gene_ensembl
3. Modules
3.1 Introduction
Below we list the makefile modules that wrap the various tools used in the book. The modules
are organized by the type of analysis they perform.
The modules are self-contained and can be run independently but some may require input data
generated by other modules.
Each module may be run independently and will print its usage information when run without
any arguments. For example:
make -f src/run/bwa.mk
will print:
#
# bwa.mk: align reads using BWA
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bwa.bam
#
# make index align
#
3.2.1 sra.mk
The Short Read Archive (SRA) is a public repository of short-read sequencing data in FASTQ
format.
• Home: https://www.ncbi.nlm.nih.gov/sra
The sra.mk module assists with obtaining reads from the SRA repository.
SRA.MK USAGE
make -f src/run/sra.mk
SRA.MK HELP
#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#
SRA.MK EXAMPLES
# Print usage.
make -f src/run/sra.mk
sra.mk code
#
# Downloads sequencing reads from SRA.
#
# SRR number (sequencing run from the Ebola outbreak data of 2014)
SRR ?= SRR1553425
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
test:
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get! get
install::
> @echo mamba install sra-tools
3.2.2 genbank.mk
The genbank.mk module assists with obtaining reads from the GenBank repository.
GENBANK.MK USAGE
make -f src/run/genbank.mk
GENBANK.MK HELP
#
# genbank.mk: download sequences from GenBank
#
# ACC=AF086833
# REF=refs/AF086833.fa
# GBK=refs/AF086833.gb
# GFF=refs/AF086833.gff
#
# make fasta genbank gff
#
GENBANK.MK EXAMPLES
# Print usage.
make -f src/run/genbank.mk
GENBANK.MK CODE
#
# Downloads NCBI data via Entrez API
#
REF ?= refs/${ACC}.fa
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
${GBK}:
> mkdir -p $(dir $@)
> bio fetch ${ACC} > $@
${GFF}:
> mkdir -p $(dir $@)
> bio fetch ${ACC} --format gff > $@
genbank:: ${GBK}
> @ls -lh ${GBK}
test:
> make -f src/run/genbank.mk fasta! fasta ACC=${ACC} REF=${REF}
> make -f src/run/genbank.mk gff! gff ACC=${ACC} GFF=${GFF}
> make -f src/run/genbank.mk genbank! genbank ACC=${ACC} GBK=${GBK}
# Installation instructions
install::
> @echo pip install bio --upgrade
3.3 QC modules
3.3.1 fastp.mk
• Homepage: https://github.com/OpenGene/fastp
The fastp.mk module may be used to apply quality control to sequencing data.
FASTP.MK USAGE
make -f src/run/fastp.mk
FASTP.MK HELP
#
# fastp.mk: trim FASTQ reads
#
# MODE=PE SRR=SRR1553425
#
# Input:
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# Output:
# Q1=trim/SRR1553425_1.fastq
# Q2=trim/SRR1553425_2.fastq
#
# make trim
#
FASTP.MK EXAMPLES
# Print usage
make -f src/run/fastp.mk
# Trim the reads for the default SRR number.
make -f src/run/fastp.mk trim
# Trim the reads for a specific SRR number and read count N.
make -f src/run/fastp.mk trim SRR=SRR030257 N=100000
FASTP.MK CODE
Inputs: R1 , R2 , Outputs: Q1 , Q2
#
# Trims FASTQ files and runs FASTQC on them.
#
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
# Print usage
usage::
>@echo "#"
>@echo "# fastp.mk: trim FASTQ reads"
>@echo "#"
>@echo "# MODE=${MODE} SRR=${SRR}"
>@echo "#"
>@echo "# Input:"
>@echo "# R1=${R1}"
>@echo "# R2=${R2}"
>@echo "# Output:"
>@echo "# Q1=${Q1}"
>@echo "# Q2=${Q2}"
>@echo "#"
>@echo "# make trim"
>@echo "#"
trim: ${Q1}
> @ls -lh ${Q1} ${Q2}
endif
trim: ${Q1}
> @ls -lh ${Q1}
endif
test:
# Get the FASTQ reads.
> make -f src/run/fastp.mk MODE=${MODE} SRR=${SRR} trim! trim
# Installation instructions
install::
>@echo mamba install fastp test
3.4.1 bwa.mk
The bwa.mk module aligns reads to a reference genome using the bwa aligner.
Home: https://github.com/lh3/bwa
BWA.MK USAGE
make -f src/run/bwa.mk
BWA.MK HELP
#
# bwa.mk: align reads using BWA
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bwa.bam
#
# make index align
#
BWA.MK EXAMPLES
# Print usage
make -f src/run/bwa.mk
BWA.MK CODE
#
# Generate alignments with bwa
#
# The genbank accession number
ACC = AF086833
# Number of CPUS
NCPU ?= 2
# Alignment mode.
MODE ?= PE
$(error "### Error! Please use GNU Make 4.0 or later ###")
endif
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
3.4.2 bowtie2.mk
The bowtie2.mk module aligns reads to a reference genome using the bowtie2 aligner.
Home: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
BOWTIE2.MK USAGE
make -f src/run/bowtie2.mk
BOWTIE2.MK HELP
#
# bowtie2.mk: aligns read using bowtie2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bowtie2.bam
#
# make index align
#
BOWTIE2.MK EXAMPLES
# Print usage
make -f src/run/bowtie2.mk
# Downloads sequencing reads from SRA with the default SRR number.
make -f src/run/bowtie2.mk index align
BOWTIE2.MK CODE
#
# Generates alignments with bowtie2
#
SRR=SRR1553425
# Number of CPUS
NCPU ?= 2
# Alignment mode.
MODE ?= PE
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
3.4.3 minimap2.mk
Home: https://github.com/lh3/minimap2
MINIMAP2.MK USAGE
make -f src/run/minimap2.mk
MINIMAP2.MK HELP
#
# minimap2.mk: align read using minimap2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.minimap2.bam
#
# make index align
#
MINIMAP2.MK EXAMPLES
# Print usage
make -f src/run/minimap2.mk
MINIMAP2.MK CODE
#
# Generates alignments with minimap2
# Number of CPUS
NCPU ?= 2
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/minimap2.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}
3.4.4 hisat2.mk
• Home: http://daehwankimlab.github.io/hisat2/
The hisat2.mk module aligns reads to a reference genome using the hisat2 aligner.
HISAT2.MK USAGE
make -f src/run/hisat2.mk
HISAT2.MK HELP
#
# hisat2.mk: align reads using HISAT2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.hisat2.bam
#
# make index align
#
HISAT2.MK EXAMPLES
# Print usage
make -f src/run/hisat2.mk
#
# Generate alignments with hisat2
# Number of CPUS
NCPU ?= 2
# Read groups.
RG ?= --rg-id ${ID} --rg SM:${SM} --rg LB:${LB} --rg PL:${PL}
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
# We do not list the index as a dependency to avoid accidentally triggering the index build.
test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/hisat2.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}
3.5.1 salmon.mk
• Home: https://salmon.readthedocs.io/en/latest/index.html
The salmon.mk module classifies reads against a reference transcriptome using the salmon
quantification tool.
SALMON.MK USAGE
make -f src/run/salmon.mk
SALMON.MK HELP
#
# salmon.mk: classify reads using salmon
#
# MODE=SE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# SAMPLE=sample
#
# make index align
#
SALMON.MK EXAMPLES
# Print usage
make -f src/run/salmon.mk
SALMON.MK CODE
#
# Generates alignments with salmon
#
SRR ?= SRR1553425
# Number of CPUS
NCPU ?= 2
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
3.6.1 bcftools.mk
• Homepage: https://samtools.github.io/bcftools/bcftools.html
The bcftools.mk module may be used to call variants with bcftools.
BCFTOOLS.MK USAGE
make -f src/run/bcftools.mk
BCFTOOLS.MK HELP
#
# bcftools.mk: calls variants using bcftools
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.bcftools.vcf.gz
#
# make vcf
#
BCFTOOLS.MK EXAMPLES
# Print usage
make -f src/run/bcftools.mk test
BCFTOOLS.MK CODE
The module also preemptively sets variable names for alignment and variant calling modules
so that chaining to these modules allows for a seamless interoperation. When chaining you
must include bcftools.mk before the other modules.
#
# Generates SNP calls with bcftools.
#
SRR = SRR1553425
# Number of CPUS
NCPU ?= 2
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
${VCF}.tbi: ${VCF}
> bcftools index -t -f $<
vcf: ${VCF}.tbi
> @ls -lh ${VCF}
vcf!:
> rm -rf ${VCF} ${VCF}.tbi
install::
> @echo mamba install bcftools
3.6.2 freebayes.mk
Homepage: https://github.com/freebayes/freebayes
The freebayes.mk module may be used to call variants with freebayes.
FREEBAYES.MK USAGE
make -f src/run/freebayes.mk
FREEBAYES.MK HELP
#
# freebayes.mk: calls variants using freebayes
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.freebayes.vcf.gz
#
# make vcf
#
FREEBAYES.MK EXAMPLES
# Print usage
make -f src/run/freebayes.mk
# Call variants with the default settings.
make -f src/run/freebayes.mk vcf
# Call variants for a specific SRR number and read count N.
make -f src/run/freebayes.mk vcf SRR=SRR030257 N=100000
FREEBAYES.MK CODE
#
# Generates SNP calls with freebayes
#
# Number of CPUS
NCPU ?= 2
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
3.6.3 gatk.mk
• Homepage: https://gatk.broadinstitute.org/hc/en-us
GATK.MK USAGE
make -f src/run/gatk.mk
GATK.MK HELP
#
# gatk.mk: call variants using GATK4
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# TARGET=AF086833.2
# SITES=vcf/SRR1553425.knownsites.vcf.gz
#
# DUP=bam/SRR1553425.markdup.bam
# TAB=bam/SRR1553425.recal.txt
# RCB=bam/SRR1553425.recal.bam
#
# VCF=vcf/SRR1553425.gatk.vcf.gz
#
# make mark calibrate apply vcf
#
GATK.MK EXAMPLES
# Print usage
make -f src/run/gatk.mk
GATK.MK CODE
#
# Generates SNP calls with gatk4
#
# Number of CPUS
NCPU ?= 2
# Accession number
ACC = AF086833
# GATK target.
TARGET = AF086833.2
# Reference dictionary.
DICT = $(basename ${REF}).dict
# Recalibration table.
TAB = $(basename ${BAM}).recal.txt
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
# Mark duplicates.
${DUP}: ${BAM} ${DICT}
> gatk MarkDuplicates -I ${BAM} -O ${DUP} -M ${DUP}.metrics.txt
# Call variants.
${VCF}: ${BAM} ${DICT}
> mkdir -p $(dir $@)
> gatk HaplotypeCaller --java-options ${JAVA_OPTS} -I ${BAM} -R ${REF} -L ${TARGET} --output ${VCF}
install::
> @echo mamba install gatk4 test
3.6.4 deepvariant.mk
• Homepage: https://github.com/google/deepvariant
Our module uses a Singularity container. Unfortunately, bioconda does not have a runnable
deepvariant package at this time.
DEEPVARIANT.MK USAGE
make -f src/run/deepvariant.mk
DEEPVARIANT.MK HELP
#
# deepvariant.mk: call variants using Google Deepvariant
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.deepvariant.vcf.gz
#
# make vcf
#
DEEPVARIANT.MK EXAMPLES
# Print usage
make -f src/run/deepvariant.mk
# Downloads sequencing reads from SRA with the default SRR number.
make -f src/run/deepvariant.mk vcf
DEEPVARIANT.MK CODE
# Number of CPUS
NCPU ?= 2
# Example:
# CALL_FLAGS = --regions chr1:1-1000000
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
${REF}.fai: ${REF}
> samtools faidx ${REF}
vcf!:
> rm -rf ${VCF} ${VCF}.tbi
# Installation instructions.
install:
> @echo singularity pull docker://google/deepvariant:1.5.0
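For reference, running DeepVariant through the pulled container boils down to an invocation along these lines (a sketch only; the binary path and flags follow the DeepVariant documentation as we recall it, and the model type must match your data):

singularity exec -B ${PWD}:${PWD} docker://google/deepvariant:1.5.0 \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref=refs/AF086833.fa \
    --reads=bam/SRR1553425.bam \
    --output_vcf=vcf/SRR1553425.deepvariant.vcf.gz \
    --num_shards=2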
3.6.5 ivar.mk
• Homepage: https://github.com/andersen-lab/ivar
The ivar.mk module may be used to build consensus sequences and call variants with the ivar suite.
IVAR.MK USAGE
make -f src/run/ivar.mk
IVAR.MK HELP
#
# ivar.mk: runs the ivar suite
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
#
# make cons vars
#
IVAR.MK EXAMPLES
# Print usage
make -f src/run/ivar.mk
# Downloads sequencing reads from SRA with the default SRR number.
make -f src/run/ivar.mk fastq
# Downloads sequencing reads from specific SRR number and read number N
make -f src/run/ivar.mk SRR=SRR030257 N=100000 fastq
IVAR.MK OUTPUT
IVAR.MK CODE
The module also presets the variable names used by the alignment and variant calling modules,
so that chaining these modules together works seamlessly. When chaining, you must include
ivar.mk before the other modules.
#
# Runs the ivar package.
#
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
usage::
> @echo "#"
> @echo "# ivar.mk: runs the ivar suite"
> @echo "#"
> @echo "# REF=${REF}"
test:
> make -f src/run/genbank.mk gff! gff ACC=${ACC} GFF=${GFF}
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
> make -f src/run/ivar.mk BAM=${BAM} REF=${REF} cons! cons vars! vars
3.7 Utilities
3.7.1 curl.mk
The curl.mk module supports downloading data from various URLs via curl .
One might ask: Why even have a makefile for what seems to be a trivial UNIX command?
The main reason to use our Makefile is that it will not leave an incomplete file in its wake
if, for any reason, the download fails to complete. Incomplete downloads can cause subtle and
hard to troubleshoot problems!
In addition, the curl.mk module can unpack .gz and .tar.gz files automatically. Again, that is
a handy feature to have.
A typical usage is to visit a website, locate the URLs for the files or directories of interest, then
set the URL parameter to that location. You may also override the output directory and the
resulting output file names.
Data sources:
1. UCSC: https://hgdownload.soe.ucsc.edu/downloads.html
2. Ensembl: https://ftp.ensembl.org/pub/
CURL.MK USAGE
CURL.MK EXAMPLES
To find the proper URLs navigate the UCSC data server and copy paste the URLs for the data
of interest.
# Print usage
make -f src/run/curl.mk
URL=https://ftp.ensembl.org/pub/release-105/gff3/bubo_bubo/Bubo_bubo.BubBub1.0.105.gff3.gz
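A download is then requested via the module's get target. A sketch of the invocation (the get target and URL variable come from the module's code below; since the file ends in .gz it would be unpacked automatically):

make -f src/run/curl.mk get URL=https://ftp.ensembl.org/pub/release-105/gff3/bubo_bubo/Bubo_bubo.BubBub1.0.105.gff3.gz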
CURL.MK CODE
#
# Download data via the web
#
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
get:: ${FILE}
> @ls -lh $<
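The download rule itself is not shown in the excerpt above; conceptually, it fetches to a temporary name and renames the file only on success, which is what prevents incomplete downloads from lingering. A minimal sketch (FILE and URL follow the module's variables; the real module also handles unpacking):

# Download the file safely: write to a temporary name, then rename on success.
${FILE}:
> mkdir -p $(dir $@)
> curl -sSL ${URL} -o $@.tmp
> mv $@.tmp $@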
# Installation instructions
install::
> @echo "# no installation required"
3.7.2 rsync.mk
The rsync.mk module supports downloading data from various URLs served over the rsync
protocol.
A typical usage is to visit the website, locate the URLs for the files or directories of interest, then
set the URL parameter to that location. You may also override the output directory and the
resulting output file name.
UCSC DOWNLOADS
# RSYNC url
URL = rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
RSYNC.MK USAGE
RSYNC.MK EXAMPLES
To find the proper URLs navigate the UCSC data server and copy paste the URLs for the data
of interest.
# Print usage
make -f src/run/rsync.mk
# RSYNC url
URL=rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
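Under the hood, the module's download target reduces to a plain rsync call. A minimal sketch (the get target name mirrors the module's code below; DIR is a hypothetical output directory variable used only for illustration):

# Mirror the remote file or directory into the output directory.
get::
> mkdir -p ${DIR}
> rsync -avz --progress ${URL} ${DIR}/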
RSYNC.MK CODE
#
# Download data via rsync
#
# You might ask yourself: why have a makefile for what seems to be a trivial command?
#
# In this case to help us remember the correct commands and to make it fit with
# the rest of the workflow building process.
#
# The rsync rules everyone always forgets:
#
# - trailing slash on URL means copy the contents of the directory
# - no trailing slash on URL means copy the directory itself
# - a slash on the destination directory has no effect
#
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
get!::
> @echo "# cannot undo an rsync download"
# Installation instructions
install::
4. Formats
Every tool, workflow, and analysis is about combining and transforming information between
different formats.
Every workflow in this book operates on data in different formats. Understanding how
information is structured and represented, and how tools combine existing data to generate new
information, is the most important ingredient of understanding any workflow.
As you work through any pipeline, do your best to get a firm grasp of the formats that flow into
and out of each step.
In this section we will briefly cover the most common data formats.
Biological knowledge is always greatly simplified when stored in computer-ready form. Files
typically contain one of two classes of information: sequence data (such as FASTA or FASTQ )
or annotations that place features onto sequences (such as GFF or BED ).
As it happens, the GENBANK (and EMBL ) formats contain both sequence information and
annotations in a complex and inefficient representation. Notably, very few tools can operate
directly on GENBANK files. Think of GENBANK as a storage format from which data in FASTA
or GFF formats can be extracted.
Dozens of additional formats are in use. Some formats may even have competing names.
Most coordinate representations will display positions on the forward (positive) strand, even
when describing directional features on the corresponding reverse (negative) strand.
For example, for an interval [100, 200] that describes a transcript on the reverse strand, the
start column will contain 100 . In reality, the functional and actual start coordinate would be
200 as the feature is transcribed in reverse. Interpreting formats in the correct orientation
demands ongoing attention to detail.
GENBANK (and a similar, related format called EMBL ) are formats for storing diverse
biological information in one file. These formats were designed to represent different types of
information in a human-readable fashion and, for that reason, are not optimized for large-scale
data representation.
GENBANK files are best suited for storing diverse and fine-grained information with lots of
details, for example: complete viral or bacterial genomes or individual human genes.
GENBANK files are NOT well suited for storing genomic information for organisms with long
DNA (say larger than 30 million bases).
There are various ways to obtain a GENBANK file from the internet. Here we are using bio :
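For example, a command along these lines fetches the GenBank record (a sketch; the --format gb flag value is our assumption, and the exact invocation in the book may differ):

bio fetch AF086833 --format gb > AF086833.gb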
In general, you should avoid manual conversion and instead obtain the data in a format that is
already suitable for your analysis.
If you have to convert, here are some tips on how to convert GenBank to other formats.
prints:
• https://www.bioinfo.help/
FASTA is a record-based format. Each record starts with the > symbol, followed by a sequence
id and an optional description. The subsequent lines contain the sequence.
will print:
Within a record, the sequence lines should all be the same length (except, possibly, the last one).
The capitalization of the letters may also carry meaning (typically, lowercase letters mark
repetitive, low-complexity regions).
Some FASTA files may follow specific formatting to embed more structured information in
both ID and description.
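For illustration, a minimal (entirely made-up) FASTA record looks like this:

>seq1 an optional description of the sequence
ATGCCCGGGTTTAAACCCGGGATGCCCGGGTTTAAACCCGGG
ATGCCCGGGTTTAAACCC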
The entries in FASTQ format represent individual measurements so-called "reads" produced
by a sequencing instrument. The instrument may produce millions or even billions of such
reads, where each FASTQ record consists of four lines.
Published scientific works that use sequencing are required to deposit the original data at
repositories such as SRA and ENA.
For example, let's get a single record from the sequencing run with the accession number of
SRR5790106 :
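One way to do that (a sketch using sra-tools; the book's exact command, and hence the header formatting, may differ) is:

fastq-dump -X 1 -Z SRR5790106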
@HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
NCGCCCGGTTAGCGATCAACAATGGACTGCATCATTTCATGCAGCTCGAGCCGATTGTAAGTCGCCCGTAACGCG
+HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
#:=AA==EGG>FFCEFGDE1EFF@FEFFBBFGGGGGGDFGGG>@FGEGBGGGGGBGGGGGGGGFDFGGGGGBBGG
1. Header: @HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
2. Sequence: NCGCCCGGTTAGCGATCAACAATGGACTGCATCATTTCATGCAGCT ...
3. Header (again): +HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
4. Quality: #:=AA==EGG>FFCEFGDE1EFF@FEFFBBFGGGGGGDFGGG>@FGEG ...
The header may contain instrument-specific information encoded into the sequence name. The
quality row represents error rates, where each character encodes a "Phred quality" score. The
quality scores are stored as ASCII characters; in the standard (Phred+33) encoding, the character
! represents the lowest quality score. The repeated header on the third line, listed after the +
sign, may be omitted.
• https://www.ncbi.nlm.nih.gov/sra/?term=SRR5790106
or at command line:
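A command that produces such a summary (assuming the bio toolkit used throughout this book; the exact invocation may differ) is:

bio search SRR5790106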
that prints:
[
{
"run_accession": "SRR5790106",
"sample_accession": "SAMN07304757",
"first_public": "2017-12-18",
"country": "",
"sample_alias": "GSM2691575",
"fastq_bytes": "1417909312",
"read_count": "20892203",
"library_name": "",
"library_strategy": "RNA-Seq",
"library_source": "TRANSCRIPTOMIC",
"library_layout": "SINGLE",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2500",
"study_title": "RNA-seq analysis of genes under control of Burkholderia
thailandensis E264 MftR",
"fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR579/006/SRR5790106/SRR5790106.fastq"
}
]
Each SRA record is associated with a BioProject, with names such as PRJNA392446 .
GFF stands for Generic Feature Format. GFF files are nine-column, tab-delimited, plain text
files used to represent coordinates in one dimension (along an axis).
prints:
The last column, called "attributes," may contain multiple values separated by semicolons:
ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;
gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
The structure of this column and the presence of certain attributes are essential for some tools,
notably in RNA-Seq analysis.
The GFF format may optionally include a sequence section at the end of the file in FASTA
format; in general, though, it only includes coordinates. GTF is a precursor of the GFF format,
similar but with some differences in the way attributes are encoded.
GTF files must always include the gene_id and transcript_id attributes and hence are used
when those are required.
BED files are three, six, or twelve-column, tab-delimited, plain text files used to represent
coordinates in one dimension (along an axis).
Originally devised for visualization purposes, BED files carry columns (such as color ) that are
not relevant to data analysis. The six-column BED designations are: chromosome, start, end,
name, score, and strand.
One of the most important differences between GFF and BED formats is their coordinate
systems. GFF files are one-based, while BED files are 0 based. The first coordinate in GFF is
1, while the first coordinate in BED is 0.
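For a concrete (hypothetical) illustration, a feature covering the first 100 bases of a chromosome would be written as:

GFF:  chr1  .  gene  1  100  .  +  .  ID=demo
BED:  chr1  0  100  demo  0  +

Note how the start differs (1 versus 0) while the end looks the same; the interval conventions described next explain why.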
In addition, GFF intervals are closed (they include both coordinates), [100, 200] , while BED
intervals are half-open, [100, 200) , and do not include the last coordinate. Needless to say, these
differences lead to lots of errors and confusion. Let's obtain a BED file:
There is a BED variant called bigBed , a compressed and indexed BED file that may be used to
store much larger amounts of data in an efficient manner. If your interval dataset contains over a
million items, you ought to consider using bigBed instead of BED .
The conversion typically requires sorting the BED file, a chromosome size file, and a conversion
step. Install the converter with:
The sizes.txt file contains the length of each chromosome as a tab-delimited file.
chr1 249250621
chr2 243199373
...
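The conversion itself is a sketch along these lines (bedToBigBed is the UCSC converter; the bioconda package name and the file names are our assumptions):

# Install the converter (package name is an assumption).
mamba install ucsc-bedtobigbed

# Sort the BED file, then convert using the chromosome sizes.
sort -k1,1 -k2,2n data.bed > data.sorted.bed
bedToBigBed data.sorted.bed sizes.txt data.bb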
SAM/BAM files are used to represent alignments of the FASTQ records to a FASTA
reference.
Each row in a BAM file describes properties of the alignment of a read relative to a target
reference, and each column carries specific information about the alignment. The 11 mandatory
columns are the following: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT,
TLEN, SEQ, and QUAL.
BAM files need to be indexed; for each BAM file, a bai file is created and stored next to the
BAM file.
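Creating the index is a single command; for an alignment file named bam/SRR1553425.bam it would be:

samtools index bam/SRR1553425.bam

which produces bam/SRR1553425.bam.bai next to the alignment file.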
• SAM specification
• SAM tag specification
VCF stands for Variant Call Format and is a tab-delimited, column-based file that represents a
set of variants in a genome.
A single VCF file may represent a single sample or multiple samples. VCF files are perhaps the
most complex formats in genome analysis, even though the structure appears to be simple:
CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, plus an optional FORMAT column
followed by one column per sample.
The challenge of VCF is that it needs to represent not what something is but how it is
"different" from a reference. A tremendous amount of information may be crammed into the
various fields of a VCF file, often rendering it almost impossible to read.
Page through the dbSNP VCF file to see the full list of fields.
URL=https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz
RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9902,0.009763
NC_000001.10 10007 rs1639538116 T C,G . . RS=1639538116;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0,0
We commonly convert VCF files to simpler tab-delimited formats. Below we convert a VCF
file to contain only position, reference, and allele:
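A sketch of such a conversion, using bcftools query (the exact command and file name in the book may differ):

bcftools query -f '%POS\t%REF\t%ALT\n' variants.vcf.gz | head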
prints:
• VCF specification
5. Modern Make
Ask a bioinformatician what workflow management system they use, and you will get a variety
of answers. Most often you will hear about Snakemake and Nextflow. Alas, these tools never
worked out for me.
For beginners who are familiar with bash, starting with Makefiles can be a far more accessible
and efficient option, as they have a less steep learning curve compared to advanced tools like
Snakemake or Nextflow.
While more sophisticated workflow requirements might make Snakemake and Nextflow seem
advantageous and promise various benefits, many bioinformaticians never fully realize these
benefits. This is often due to factors such as the scope and complexity of their projects and the
considerable learning curve associated with adopting intricate workflow management systems.
Speaking from over a decade of experience in the field and currently managing a
Bioinformatics Consulting Center at a large university, I find Makefiles to be the most
convenient and user-friendly solution for managing my workflows, serving a wide range of
scientific needs.
I have noticed that all of my colleagues who swear by Snakemake or Nextflow are unaware of
the capabilities of Makefiles .
Most importantly, the approach we champion in this book is what we like to call Modern Make ,
where we make use of tools such as GNU parallel to provide parallel execution of tasks.
Hence, the pipelines and workflows we show you here always operate in stages, where we
invoke make multiple times to complete the workflow.
Consult the Makefile chapters in The Art of Bioinformatics Scripting for more details on how
we recommend using Makefiles in bioinformatics.
A workflow consists of a series of data analysis tasks that process data into a format the user
can interpret. Ideally, workflows are built in a reusable and modular manner that we can adapt
to different needs.
Unfortunately, by 2022 many bioinformatics workflows have become obscure black boxes
that tend to hide the underlying data analysis steps. This is a problem because it makes it
extraordinarily difficult to understand the results and to adapt the workflow to new needs.
I blame the particular concept of easy-to-use software. The reality is that scientific data
analysis has never been and will never be easy - after all, the whole point of science is to work
at the edge of understanding.
Instead, we should aim to develop simple, logical and consistent software. The latter is critical
because it allows us to build on existing knowledge and reuse existing tools. An easy-to-use
software is a bit like a platypus, an over-specialized evolutionary dead-end that we can't
extend and build upon.
The allure of easy-to-use software is powerful for life scientists already overwhelmed with the
amount of work that needs to be done in a wet lab. No wonder a myriad of tools promise them an
easy path where the analysis magically happens. And because biologists are ill-equipped to
decide whether the process is reusable or not, they are easily fooled and will embrace solutions
that, in reality, make life harder.
Type that in; as long as your computer has the right software installed, the claim is that it just
runs and the results will be computed for you automatically. Let's take a look at what happens
when we run the above command:
But wait a minute, the workflow ran and produced an endless stream of results, but what
exactly has happened? Did you realize that we ran 34 different interconnected tasks? We are
left with little to no understanding of the assumptions, decisions, and processes that went into
the analysis itself.
Having run the analysis, what have we learned about the process? Almost nothing.
Did you notice the myriad of parameter settings that just flew by? Each one of those may
matter. Some parameters matter a lot more than others. But which one are those? Are the
results correct? Hard to tell. The resulting directory is humongous, with hundreds of
intermediate files and a wealth of implicit information.
I consider the approach above the curse of over-automation where scientists build needlessly
interconnected tasks, where we can't even tell what and why something is happening. Alas, it
is an endemic problem of bioinformatics workflows, which you will undoubtedly experience.
Criticism aside, Nextstrain's approach is still better than that of many other software packages.
At least Nextstrain shows the commands that get executed; with extra work and digging, we can
figure out what is going on. Other, similar software won't even show you that much.
There are more extreme examples beyond nextstrain . I believe that the entire ecosystem
built around nf-core is a hard-to-comprehend and ever-growing black box. I am saying this
being fully aware that it has a peer-reviewed, highly cited Nature Biotechnology publication ...
Imagine you wanted to download a subset of reads from SRA. You could use the sra-tools
package and run the following command:
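A sketch of such a command (the accession is illustrative; -X limits the download to the first N reads):

fastq-dump -X 1000 --split-files SRR1553425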
Of course, first you had to install the software itself, then learn a bit about what Unix is and
what happens in a terminal. Neither of these is easy in the traditional sense.
But I consider the tasks above simple in that they follow a transparent and logical process. The
Unix command line's behavior has been honed over decades, and it mostly fits into the Unix
philosophy that ties it all together.
Now look at the proposed alternative. To use nf-core to perform the same action, you first
need to study the pages at:
• https://nf-co.re/fetchngs
• https://github.com/nf-core/fetchngs
A very lengthy read indeed. And if you were industrious enough to study the pages above, you
saw that nf-core replaces running:
with:
We now need to understand all the elements of the above incantation, create the proper profile
file that contains the information in the right format, and then we still need to install the tools as
before, and we still need to use UNIX commands. And even after reading the above, I am not
quite sure how to pass the 1000 parameter to get just a subset of the data. Is that even
possible? Probably not without editing the nf-core code itself.
See? All that just to run a simple command like fastq-dump . There is a promise there that if
we buy into the system, doing everything via nf-core , eventually everything will start to
make sense and just work.
Alas, that is not my experience at all. We still need to understand and troubleshoot the
individual tools - only now, we need to do so via yet another layer of configuration and
abstraction.
I believe that the abstraction proposed via nf-core is a fundamentally wrong approach to
bioinformatics because all it does is introduce yet another layer of abstraction on top of an
already complex software ecosystem.
When we learn nf-core we are learning bioinformatics projected onto a different plane, a
specific point of view created just a few years ago by a few people. It is not evident at all that
they have made the right choices.
The result is a fragile system that may work for one specific need but fails radically when we try
to extend it.
I believe that Makefiles provide sufficient automation for most individual projects. Other
scientists disagree, and several alternative approaches exist.
Every once in a while, I get the urge to keep up with the times, that I should be using
something like Snakemake or Nextflow instead of Makefile , so I sit down to rewrite a
workflow in Snakemake.
But then I run into a small problem with Snakemake - it is seemingly trivial yet it takes
twenty minutes to solve.
Then I hit another problem with Snakemake , and another ... Next thing I know is that I have
spent hours reading Snakemake documentation, Googling obtuse errors, scouring GitHub
issues, troubleshooting unexpected behaviors. I don't want to toot my own horn, but I
consider myself quite good at solving computational problems. Yet I am continuously
stumped by Snakemake .
What is most aggravating is that the simplest tasks seem to cause problems, not the actual
bioinformatics analysis. I already know how to solve the problem in an elegant way,
Snakemake just gets in the way.
I run into trouble when renaming a file to match an input, when trying to branch off a
different path upon a parameter, etc. What is extraordinarily annoying is that I always
precisely know what I want to achieve; it is just that I can't get Snakemake to do it.
Invariably, a few hours later, I give up. It is just not worth the effort; everything is more
straightforward with a bash script or a Makefile !
Snakemake is just not built for how I think about data analysis.
So I badmouthed Snakemake , but how about the next best thing, NextFlow ? Frankly, I
think NextFlow is worse than Snakemake .
At least with Snakemake , I understand the principles, and I struggle with the rules getting in
the way. On the other hand, NextFlow seems even less approachable. It asks me to learn
another programming language, Groovy , as if having to deal with bash , UNIX , Python ,
and R weren't enough already.
With Nextflow I can't even get the most direct pipeline to work in a reasonable amount of
time. When I look for examples, most are either overly simplistic and trivial or suddenly
become indecipherably complicated. For example, here is an "official" RNA-Seq workflow
according to NextFlow .
• https://github.com/nf-core/rnaseq/blob/master/workflows/rnaseq.nf
The pipeline above stretches over seven hundred fifty-five lines that, in turn, include many
other files. It is difficult to overstate how intractable, incomprehensible, and hopelessly
convoluted the process presented in that "pipeline" is.
The most significant flaw of many automation approaches is that they impose yet another
level of abstraction, making it even more challenging to understand what is going on.
To be effective with the new workflow engine, you first need to thoroughly understand the
design principles of the pipeline software itself on top of understanding bioinformatics
protocols.
Sometimes I feel that automation software takes on a life of its own to justify its existence,
rapidly gaining seemingly gratuitous features and functionalities that are a burden to deal with.
Let me reiterate. Learn to use Makefiles and achieve everything you want.
The state of workflow engines in bioinformatics is best captured by the XKCD comic:
The following is a list of workflow engines designed explicitly for bioinformatics analysis:
The main differences between the platforms are their distinct approaches to the different
requirements of automation:
Snakemake
Additional links:
NextFlow
BigDataScript (bds)
Homepage: https://pcingola.github.io/bds/
CGAT core is software that runs the Ruffus-enabled workflows on different computational
platforms: clusters, AWS, etc.
BPipe
GenPipe
Engine comparisons
• https://github.com/GoekeLab/bioinformatics-workflows
Workflow managers provide an easy and intuitive way to simplify pipeline development.
These implementations are designed for basic illustrations. Workflow managers provide
many more powerful features than what we use here; please visit the official documentation
to explore those in detail.
This chapter will demonstrate the various tradeoffs via a practical example.
We will build a short-read alignment pipeline with different methods: a bash script, a
Makefile , and a Snakemake file. We will use the bwa and samtools tools.
All our code assumes that you have obtained the Ebola reference genome for accession
AF086833 and the sequencing reads deposited in the SRA for accession SRR1972739 . The
code below initializes the data:
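A sketch of that initialization, using tools introduced earlier in the book (the exact commands and the read subset size may differ from the book's code):

# Get the Ebola reference genome.
mkdir -p refs reads
bio fetch AF086833 --format fasta > refs/AF086833.fa

# Download a subset of the sequencing reads from SRA.
fastq-dump -X 10000 --split-files --outdir reads SRR1972739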
First, we'll create our pipeline as a simple bash script, written in a reusable manner. The first
step in that process is factoring the inputs and outputs into variables listed at the start, as in the
sketch below.
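A sketch of such a script (file names match the initialization above; the book's align.sh may differ in its details):

#!/bin/bash
# align.sh: index the reference, then align and sort the reads.

# Inputs and outputs factored into variables.
REF=refs/AF086833.fa
R1=reads/SRR1972739_1.fastq
R2=reads/SRR1972739_2.fastq
BAM=bam/SRR1972739.bam

# Stop on errors and on undefined variables.
set -ue

# Build the BWA index.
bwa index ${REF}

# Align the reads and sort the alignments.
mkdir -p bam
bwa mem ${REF} ${R1} ${R2} | samtools sort > ${BAM}

# Index the resulting BAM file.
samtools index ${BAM}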
bash align.sh
Positives: Every command is explicitly defined and executed in order. We can see what is
happening, and we can hand someone a script that contains all the information necessary to run
the pipeline.
Negatives: Every step gets executed, even those that do not need to be run multiple times. The
script is not re-entrant. For example, genome indexing may take a long time and needs to be run
only once.
When running the code a second time, we would want to comment out the genome indexing
step. As we collect more steps, we may need to comment out various sections or maintain
separate scripts.
That being said, the script, as written above, is an excellent way to get started. It has minimal
cognitive overhead and allows you to focus on progress. Many Ph.D. theses are written in this
way.
Next, let's build a Makefile version of the pipeline based on the bash script. A potential
solution could look like this:
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
# Print usage.
usage:
> @echo "make index align"
make align
The Makefile approach is my favorite. It strikes the ideal balance between convenience,
automation, and cognitive overhead.
Positives: The best thing about the Makefile is that the commands are identical to those in a
shell script. You can run a command in the shell, copy it to the file, and vice versa.
You also don't have to set up dependencies if you don't want to: you can keep the index and
align targets independent of one another. Makefiles can grow with you.
You can add the optional dependency management. In the case above, running make align
when the alignment is already present will not re-run the alignment; it just reports that it is
already done.
Negatives: There is an overhead in understanding and interpreting the rules. For example,
${BAM}: ${IDX} means that the ${BAM} target first requires that the ${IDX} target is
present, so the rule for ${IDX} is executed first. Our eyes need to be trained to recognize
which patterns are commands and which are targets or other rules.
Finally, you can provide a cleanup rule that you can run.
make clean
To run Snakemake, you must have the snakemake command installed. The full install
mamba install snakemake requires many packages (more than 130). There is a slimmed-
down version of snakemake that can be installed as:
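To the best of our knowledge, the slimmed-down bioconda package is called snakemake-minimal:

mamba install snakemake-minimal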
Now let's translate our script into a Snakefile . Note how much longer the same script is and
how many more rules, quotations, formatting, commas, and indentation it needs. Yet by the
end, it recapitulates the exact commands already shown in the Makefile :
# Indexing rule.
rule index:
    input:
        REF,
    output:
        IDX,
    shell:
        "bwa index {input}"

# Alignment rule.
rule align:
    input:
        IDX, R1, R2
    output:
        BAM,
    shell:
        "bwa mem {REF} {R1} {R2} | samtools sort > {output}"
Any error you make in the indentation would lead to very hard-to-debug errors. I believe that
Snakefile is substantially harder to read and understand. The above Snakefile can be run
as:
Then run:
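A sketch of the invocation (assuming the rules above are saved in a file named Snakefile ; Snakemake accepts rule names as targets):

snakemake --cores 1 align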
When you run snakemake , you can note that the output is quite lengthy by default. It
generates lots and lots of messages that clog your terminal.
We can instruct Snakemake to print the commands it is planning to execute without executing
them using the flags --dryrun ( -n ) and --printshellcmds ( -p ).
You can decipher the commands that snakemake will execute among the many lines that scroll
by. Though far from ideal, the above command can help us understand the commands
executed via Snakefiles.
Positives: The patterns map onto the steps in a natural manner. We mark what is input and what
is output. Snakemake can manage the runtime environment as well (not shown here).
Negatives: A substantial cognitive overhead is associated with writing the rules. Sometimes it
can be hair-raisingly challenging to debug problems. I have spent countless hours debugging
Snakefiles, unable to achieve what I wanted, before giving up.
Snakefile parameters
To pass external configuration parameters into a Snakefile, we need to write the parameters a
little differently. For example, to provide both a default value and an externally changeable
value for the REF parameter, we need to write:
It is a bit more convoluted than the Makefile but not substantially more.
There are many more ways to pass parameters into Snakefiles, and it is one of its defining
features. That said, I find the many methods challenging to understand and follow.
Understanding the above has helped me better understand what Snakemake actually does.
Imagine that we have multiple samples, A , B , C , and D with two files, R1 and R2 , for
each sample. Our file names may look like this.
• A_R1.fq , A_R2.fq
• B_R1.fq , B_R2.fq
• C_R1.fq , C_R2.fq
• D_R1.fq , D_R2.fq
From these, we wish to produce one alignment file per sample:
• A.bam
• B.bam
• C.bam
• D.bam
Where the reads from sample A are aligned into A.bam , the reads from sample B are
aligned into B.bam and so on. Basically the commands we wish to run are:
REF=refs/genome.fa
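# The per-sample commands, spelled out (a sketch that follows the template used with parallel below):
bwa mem -t 8 $REF A_R1.fq A_R2.fq | samtools sort > A.bam
bwa mem -t 8 $REF B_R1.fq B_R2.fq | samtools sort > B.bam
bwa mem -t 8 $REF C_R1.fq C_R2.fq | samtools sort > C.bam
bwa mem -t 8 $REF D_R1.fq D_R2.fq | samtools sort > D.bam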
We could write out the above in a script, and that would work. But imagine that we have more
than four samples, perhaps dozens, or that later we want to change all the lines to something
else. Handling all those changes in a text editor is not very practical and is evidently error-
prone.
Instead, we need a way to generate these commands just by knowing the sample names:
A
B
C
D
Another way to say it is that we wish to write a program that, in turn, generates the command
for each sample. Basically, we need a smart program that writes another, more repetitive
program for us. This is the fundamental concept of automation: write a program that writes the
repetitive program for you.
Both parallel and Snakemake can help us with this automation, but at wildly different
levels of abstraction.
In our opinion, the simplest and most generic way to automate such tasks is to use the GNU
parallel tool.
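For example, assuming the sample names are listed one per line in a file called ids.txt (as used below), running:

cat ids.txt | parallel echo Hello pattern {}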
will print:
Hello pattern A
Hello pattern B
Hello pattern C
Hello pattern D
To run the actual alignments, we reuse the same pattern:
cat ids.txt | parallel 'bwa mem -t 8 $REF {}_R1.fq {}_R2.fq | samtools sort > {}.bam'
See the Art of Bioinformatics Scripting for more details on using GNU parallel.
cat ids.txt | parallel "bwa mem -t 8 $REF {}_R1.fq {}_R2.fq | samtools sort > {}.bam"
would also be possible with Snakemake by using patterns, it is just that the method is very
verbose and requires a lot of boilerplate code. The code below generate the same commands at
the single line above:
# Alignment rule.
rule align:
    input: BAM,
In the example above, we demonstrate the pattern matching rules in Snakemake. Namely, we
list the desired BAM files with:
rule bwa_mem:
    input:
        IDX,
        "reads/{sample}_R1.fq",
        "reads/{sample}_R2.fq",
    output:
        "bam/{sample}.bam",
    shell:
        "bwa mem {input} | samtools sort > {output}"
It is in the {sample} pattern that a lot of implicit magic happens, and it is where most of your
frustrations will come from later on.
There are many additional subtle rules and internal mechanisms at play.
Those are convenient when they work like magic but can be extraordinarily frustrating to deal
with when they don't. Most rules are neither obvious nor well explained in how and why they
work.
I have had lots of frustrations debugging Snakemake files, especially since I knew of much
simpler ways to achieve the same results.
5.6 Alternatives
The complexities of managing interconnecting tools have led scientists to develop fully
automated solutions targeted at very specific use cases.
Many times these workflows are "black-boxes", with users having little control and
understanding of what takes place inside the box. The tacit agreement is that once you accept
the constraints of the methodology, the software promises to produce informative results in an
easy to comprehend visual format.
A computational pipeline for the analysis of short-read re-sequencing data (e.g. Illumina,
454, IonTorrent, etc.). It uses reference-based alignment approaches to predict mutations in a
sample relative to an already sequenced genome. breseq is intended for microbial genomes
(<10 Mb) and re-sequenced samples that are only slightly diverged from the reference
sequence (<1 mutation per 1000 bp).
GitHub: https://github.com/barricklab/breseq
Docs: https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/
documentation/
breseq results
The command above will produce a directory called results that contains a fairly large
number of files with a wide variety of information. It all looks very impressive! If you want
to be amazed, check out the file located at results/output/index.html . Here is a
copy of that file accessible via the web:
• results/output/index.html
# Install breseq.
mamba install breseq sra-tools -y
Having activated the environment, create a new directory for the analysis and switch to it. We
still need to use Biostar Workflow modules to obtain the reference genome and the reads.
RUN BRESEQ
Having obtained the reference genome and the reads, we can run breseq on our data.
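A sketch of the invocation (the file names are illustrative; -r takes the reference as a GenBank file, -j the number of CPUs, and -o the output directory):

breseq -j 4 -o results -r refs/reference.gb reads/read_1.fastq reads/read_2.fastq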
We can also learn a bit about the process as we watch the alignments go by in the console. We
notice that breseq runs bowtie2 behind the scenes, and it seems to do so in an iterative
manner while using a large number of tuning parameters that look like this:
Look up some of these parameters to learn more about how to tweak bowtie2 when looking
for alignments. But note how we are fully at the "mercy" of what the original authors of
breseq decided for us.
Above, we ran breseq ; among the many files it produces there is a VCF file located at
results/output/output.vcf . A nearly identical variant file could have been created with a
more generic workflow like the one we presented in this book.
The few differences that we see are solely a matter of filtering the VCF file based on various
parameters. It is a matter of choosing what to trust.
If the VCF files are the same, and we can get that result with a Makefile , then what does
breseq really do?
The most important result that breseq produces is the summary of mutations in the file
results/output/index.html. That file is the main distinguishing feature of breseq and one of
the primary reasons for anyone using it. It is an amazingly effective display and summary that
cross-references mutations with alignments and with supporting evidence. It allows end users
to investigate any mutation with ease. It is a significant scientific achievement! In addition
breseq also offers structural variation (junction) predictions and other valuable utilities.
Alas, it is all baked into the pipeline. We can't run any of its valuable subfunctions on different
data, even if we had all the necessary information at our disposal. breseq runs only if
everything is named and organized just as breseq expects, and that can be extraordinarily
constraining.
There is no question that breseq is an extremely valuable tool that facilitates discoveries. As
long as you study bacterial genomes that fit its use case, breseq is a boon!
Yet we can't help but wonder why the system was not designed in a more modular fashion.
There is a universal downside to most pipelines built to be "easy-to-use" black boxes: the
pipeline exists as a single, monolithic entity. And monoliths are like dinosaurs; they can't
evolve to adapt to changing needs.
GitHub: https://github.com/nextstrain
Docs: https://docs.nextstrain.org/en/latest/index.html
What is nextstrain, primarily? It is a visualization platform. And with that, we note the emerging
commonality between these bespoke workflows: standard software tools produce data that is
so difficult to interpret that monuments of software need to be built to help with the
process.
Install nextstrain
• https://docs.nextstrain.org/en/latest/
Run nextstrain