
Biostar Workflows


Biostar Handbook Collection (2023)


Table of contents

1. Home
1.1 Welcome to Workflows
1.1.1 Prerequisites
1.1.2 Navigating content
1.1.3 How to download the book?
1.2 Installing the modules
1.2.1 1. Set up your computer
1.2.2 2. Install the modules
1.2.3 3. Test the modules
1.2.4 4. Change parameters
1.3 Setting up R with bioconda
1.3.1 Creating a separate environment
1.3.2 Cherry-pick packages
1.3.3 Test the R installation
1.3.4 Keep environments updated
1.3.5 How to switch environments
1.3.6 Creating a custom environment
1.4 Setting up R with RStudio
1.4.1 Install the modules
1.4.2 Installation process
1.4.3 Running scripts in RStudio
1.4.4 Test your installation
1.5 Module help
1.5.1 Quickstart to modules
1.5.2 Complete guide to modules
1.5.3 How to troubleshoot
1.6 How to get data
1.6.1 How to get genome data
1.6.2 How to use refgenie
1.6.3 How to get FASTQ files
1.7 Alignments
1.7.1 How to align short reads
1.8 Variant calling
1.8.1 How to call genomic variants
1.8.2 How to evaluate variant calls
1.9 RNA-Seq analysis
1.9.1 RNA-Seq basics
1.9.2 RNA-Seq count simulations
1.9.3 RNA-Seq differential expression
1.9.4 RNA-Seq counts with HiSat2
1.9.5 RNA-Seq counts with Salmon
1.9.6 RNA-Seq functional analysis
2. Workflows
2.1 Airway RNA-Seq
2.1.1 Stating the problem
2.1.2 Main findings
2.1.3 Workflow plan
2.1.4 Accession numbers
2.1.5 Create the design file
2.1.6 Obtain the references
2.1.7 Automation pattern
2.1.8 Download the reads
2.1.9 Align the reads
2.1.10 Create the counts
2.1.11 Plot the PCA
2.1.12 Differential expression analysis
2.1.13 Plot a heatmap
2.1.14 Do the results validate?
2.1.15 Good data doesn't need to be big
2.1.16 Gene ontology enrichment
2.1.17 The complete workflow
2.2 Presenilin RNA-Seq
2.2.1 Stating the problem
2.2.2 Main findings
2.2.3 Workflow plan
2.2.4 Accession numbers
2.2.5 Create the design file
2.2.6 Obtain references
2.2.7 Download the reads
2.2.8 Align the reads
2.2.9 Counting the reads
2.2.10 Generate the PCA plot
2.2.11 Differential expression analysis
2.2.12 P-hacking or being smart?
2.2.13 Let's try other methods
2.2.14 Plotting the heatmap
2.2.15 Do the results validate?
2.2.16 Gene ontology enrichment
2.2.17 Personal note
2.2.18 The complete workflow
3. Modules
3.1 Introduction
3.1.1 General tips
3.2 Data modules
3.2.1 sra.mk
3.2.2 genbank.mk
3.3 QC modules
3.3.1 fastp.mk
3.4 Alignment modules
3.4.1 bwa.mk
3.4.2 bowtie2.mk
3.4.3 minimap2.mk
3.4.4 hisat2.mk
3.5 RNA-Seq modules
3.5.1 salmon.mk
3.6 SNP calling modules
3.6.1 bcftools.mk
3.6.2 freebayes.mk
3.6.3 gatk.mk
3.6.4 deepvariant.mk
3.6.5 ivar.mk
3.7 Utilities
3.7.1 curl.mk
3.7.2 rsync.mk
4. Formats
4.1 Introduction to data
4.1.1 Information as data
4.1.2 Biological data is simplified
4.1.3 Unexpected caveats
4.2 GENBANK is for storage
4.2.1 Convert GenBank to FASTA
4.2.2 Convert GenBank to GFF3
4.2.3 Extract sequences from GenBank
4.3 FASTA contains sequences
4.4 FASTQ stores reads
4.4.1 Getting data from SRA
4.4.2 Metadata for SRR numbers
4.4.3 SRA Bioprojects
4.5 GFF represents annotations
4.6 BED is for annotations
4.6.1 The bigBed format
4.6.2 The bigWig format
4.7 SAM/BAM represent alignments
4.8 VCF contains variations
5. Modern Make
5.1 Why makefiles?
5.1.1 What is Modern Make?
5.2 The rise of the black-box
5.2.1 Why did it spiral out of control?
5.2.2 Snakemake and the curious case of over-automation
5.2.3 NextFlow and the curious case of over-abstraction
5.3 Snakemake & Nextflow
5.3.1 Is Nextflow any better?
5.3.2 Accidental complexities
5.3.3 Automation software
5.3.4 Beyond Make
5.4 Makefiles vs Snakefiles
5.4.1 1. The code as a bash script
5.4.2 2. The code as a Makefile
5.4.3 3. The code as a Snakefile
5.5 Snakemake patterns
5.5.1 Stating the problem
5.5.2 Automation with parallel
5.5.3 Automation with Snakemake
5.6 Alternatives
5.6.1 Breseq: Bacterial genome analysis
5.6.2 Nextstrain: Viral pathogen evolution


1. Home

1.1 Welcome to Workflows

Last updated on May 03, 2023

The Biostar Workflows book provides simple, clear and concise instructions on how to use
bioinformatics tools in automated ways. The book complements the other volumes of the
Biostar Handbook and should be used in tandem with those volumes.

[Book covers of the Biostar Handbook Collection: the Biostar Handbook, Bioinformatics Scripting, RNA-Seq by Example, Corona Virus Genome Analysis, and Biostar Workflows]

In Biostar Workflows we provide complete, multistep Makefiles that automate the analysis of large datasets. Think of the book as a collection of methods and techniques that you can use in your projects.

Our books are updated quite frequently! We recommend reading every book via the website, as the content on the web is always the most current. Visit the Biostar Handbook site to access the latest version of each book.

1.1.1 Prerequisites

To get the most out of the workflows:

1. First, you will need to set up your computer as instructed in the Biostar Handbook. If you have a working bioinfo environment, you are all set.

2. We assume you have some familiarity with common bioinformatics data formats. If not, consult our chapter on data formats.

3. Finally, we will extensively use bash shell scripting and Makefiles to create workflows. The
volume titled the Art of Bioinformatics Scripting goes into great detail on these subjects.

1.1.2 Navigating content

The top bar is a navigation bar. You can use it to navigate across sections.

On the left-hand side, we list all the pages within a section.

On the right-hand side, we show a table of contents that helps you navigate within the selected
page.

1.1.3 How to download the book?

The book is available to registered users. The latest versions can be downloaded from:

• Biostar-Workflows.pdf

Our books are updated frequently, especially during the Spring and Fall semesters, when the
books are used as textbooks.

We recommend accessing the book via the web, as the web version always contains the most
recent and up-to-date content.

A few times a year, we send out emails that describe the new additions.

1.2 Installing the modules

1.2.1 1. Set up your computer

First, you need to follow the How to set up your computer chapter of the Biostar Handbook. By the end of that process, your computer will contain the software needed to run common bioinformatics analyses.

1.2.2 2. Install the modules

To initialize the workflow source code in the current directory use:

# Upgrade the bio package.


pip install bio --upgrade

# Download the workflow source code.


bio code

The command above will create directories such as

• src/run
• src/r
• src/bash

where each folder contains the analysis code that we explain and demonstrate in this book. The command will not overwrite existing files unless explicitly instructed to do so.

For each new project, download a separate copy of the code. This is because you may need to modify the code (even if just a tiny bit) to suit your needs. Thus, it is best to start out with a separate copy of the code.

1.2.3 3. Test the modules

Run one of the modules to test them. Here we run genbank.mk, a module designed to connect to GenBank and download data:

make -f src/run/genbank.mk

Running a module without a target prints its usage:

#
# genbank.mk: download sequences from GenBank
#
# ACC=AF086833
# REF=refs/AF086833.fa
# GBK=refs/AF086833.gb
# GFF=refs/AF086833.gff
#
# make fasta genbank gff
#

Now run the fasta target of the module:

make -f src/run/genbank.mk fasta

The command above should print:

mkdir -p refs/
bio fetch AF086833 --format fasta > refs/AF086833.fa
-rw-r--r-- 1 ialbert staff 19K Feb 28 10:48 refs/AF086833.fa

You now have a file called refs/AF086833.fa that contains the sequence of the accession
number AF086833 . If you were to run the fasta target again

make -f src/run/genbank.mk fasta

note how no download is performed because the file already exists. The workflow tells you
the file is already there:

-rw-r--r-- 1 ialbert staff 19K Feb 28 10:48 refs/AF086833.fa

1.2.4 4. Change parameters

You can change what gets downloaded by passing an additional parameter to the fasta
target:

make -f src/run/genbank.mk fasta ACC=NC_045512

Now the workflow should print:

mkdir -p refs/
bio fetch NC_045512 --format fasta > refs/NC_045512.fa
-rw-r--r-- 1 ialbert staff 30K Feb 28 10:46 refs/NC_045512.fa

And that is how all of our workflows work.

The Biostar Workflows are ready to rumble!

1.3 Setting up R with bioconda

We provide several data analysis modules that make use of code written in R. We support
running R both at the command line and via RStudio.

1.3.1 Creating a separate environment

If you are using the command line for running R we recommend that you create a separate
conda environment for the statistical analysis. We call our environment stats :

# Create a new environment.
conda create -n stats python=3.8 -y

# Activate the environment.
conda activate stats

In the new environment type:

# Install the requirements.
bash src/install-packages.sh

Remember to activate stats when you are running the R code.

Usually this is not a problem as the R code runs in the final stages of each protocol when you
are analyzing, plotting and visualizing the data.

1.3.2 Cherry-pick packages

Creating a separate stats environment is not always necessary. We do it to ensure that you can run all the packages we demonstrate in the book. In practice, when working on a specific problem, you might not need them all. Start by updating all your packages in bioinfo:

# Update all installed packages.
mamba update --all -y

Next, you may be able to install individual Bioconductor libraries into your bioinfo environment without interfering with existing packages. When doing so, cherry-pick and install just a few of the packages, rather than all of them as we do above. It is a bit of trial and error. Instruct mamba to install a library, then watch what mamba tells you will happen, in particular whether any existing packages get downgraded. You would not want a lower version of, say, samtools or bwa to be installed.
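For example, a minimal sketch of this trial-and-error approach (the package choice here is just an illustration):

# Ask mamba for a single Bioconductor package; omit the -y flag so that you
# can review the proposed changes (and spot downgrades) before confirming.
mamba install bioconductor-deseq2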

The success rate may vary as package dependencies change over time.

Check src/install-packages.sh for an overview of the packages you might want to install.

1.3.3 Test the R installation

Run src/r/simulate_null.r from the command line:

Rscript src/r/simulate_null.r

It should print:

# Generating null data

# Input: src/counts/barton_counts.csv
# Total rows: 7126
# Above minimum: 6063
# Replicates: 5
# Design: design.csv
# Counts: counts.csv

The code above performed an RNA-Seq simulation and created the files design.csv and
counts.csv .

When run from the command line, you can see usage information by adding the -h (help) option. For example:

Rscript src/r/simulate_null.r -h

Should print:

Usage: src/r/simulate_null.r [options]

Options:
    -d DESIGN_FILE, --design_file=DESIGN_FILE
        simulated design file [design.csv]

    -o OUTPUT, --output=OUTPUT
        simulated counts [counts.csv]

    -r NREPS, --nreps=NREPS
        the number of replicates [5]

    -i INFILE, --infile=INFILE
        input file with counts [src/counts/barton_counts.csv]

    -h, --help
        Show this help message and exit

And that's it! You have all the code you will need to perform the analyses in this book.

1.3.4 Keep environments updated

Every once in a while (and especially if you run into trouble) you should update all packages
in an environment. You can do so via:

# Update all installed packages.
mamba update --all -y

1.3.5 How to switch environments

To switch environments within a bash shell script use the following construct:

# Turn off error checking and tracing.
set +uex

# Must include the following initialization command in the script.
source ~/miniconda3/etc/profile.d/conda.sh

# Activate the stats environment.
conda activate stats

# This command now runs in stats.
echo "Look MA! I am in stats!"

# Activate the bioinfo environment.
conda activate bioinfo

# This command now runs in bioinfo.
echo "Look MA! I am in bioinfo!"

# Turn the error checking and tracing back on.
set -uex

1.3.6 Creating a custom environment

When you know specifically which tools you wish to run, you might want to create a custom environment just for those tools. For example, to run the hisat2-based pipeline followed by featureCounts and DESeq2, as well as filling in gene names and plotting the results:

mamba create -n rnaseq -y -c bioconda hisat2 samtools subread r-optparse \
    r-tibble r-dplyr r-gplots bedtools ucsc-bedgraphtobigwig \
    bioconductor-biomart bioconductor-deseq2 bioconductor-tximport

followed by:

# Activate the environment.
conda activate rnaseq

# Install the bio package.
pip install bio

This creates the rnaseq environment with all the tools needed to run the RNA-Seq pipeline.

Now, to run the RNA-Seq workflow execute:

# Install the code.
bio code

# Run the pipeline.
bash src/scripts/rnaseq-with-hisat.sh

1.4 Setting up R with RStudio

We provide several data analysis modules that make use of code written in R. We support
running R both at the command line and via RStudio.

We demonstrate the use cases from the command line; everything works similarly in RStudio.

Sometimes the installation in RStudio is simpler, as RStudio uses a separate instance of R that is independent of the one running at the command line, hence it has fewer dependencies itself.

1.4.1 Install the modules

In the Installing modules chapter we saw that the command below downloads a copy of the
code used in this book:

# Upgrade the bio package.
pip install bio --upgrade

# Download the workflow source code.
bio code

The command above will create directories such as

• src/run
• src/r
• src/bash

where each folder contains the analysis code that we explain and demonstrate in this book. The command will not overwrite existing files unless explicitly instructed to do so.

1.4.2 Installation process

You will need to have R installed first. Visit the R website and install R.

• https://www.r-project.org/

When using an Apple M1 processor, install the Intel macOS binaries and not the so-called native ARM version. While both versions of R will run on your Mac, the Bioconductor packages require the Intel version of R.

Open RStudio, then find and load the src/install-rstudio.r script within RStudio. Press Source to run the script. The command will churn, download, and print copious amounts of information, but should complete without errors.

Note: Always select No or None when asked to update or modify packages that are already
installed.

At the end of the installation above you may be prompted yet again to update existing packages. I recommend ignoring that message, as the update does not seem to succeed on my system, and the packages seem to work even if I skip the update.

1.4.3 Running scripts in RStudio

To run any of our R scripts, open the script in RStudio and execute it from there. The
customizable parameters will always be listed at the top of the script.

Set the working directory to the location that contains the src folder that you created during module installation. Don't select the src folder itself, though; what you should select is the parent folder that contains the src folder.

Session -> Set Working Directory -> Choose Directory

You can modify the parameters at the top of every script to fit your needs.

We have tried to write clean and clear R code where we separate the process of reading and
formatting the data from the analysis step. This allows you to more readily modify the code to
fit your needs. As you will see, in just about any R script 90% of the code is fiddling with
data: formatting, cleaning, consolidating etc. The actual analysis is just a few lines of code.

1.4.4 Test your installation

Below, we loaded the src/r/simulate_null.r script in RStudio and ran it via the Source command. Then we loaded the src/r/edger.r program and ran that as well.

Important things to note:

1. Each program prints information on what it is doing. For example, note how the src/r/edger.r program reports that it has created the edger.csv file.
2. Investigate the files that get generated in the work directory.
3. We have set up the tools so that, by default, the file names match; you can change them to whatever you want.
4. The code is modular; you can easily change the parameters to fit your needs.

1.5 Module help

1.5.1 Quickstart to modules

We have implemented our workflows as modules written as Makefiles that may be run
individually or in combination.

1. How to install the modules

# Upgrade the bio package.
pip install bio --upgrade

# Download the workflow source code.
bio code

The command above will create directories such as

• src/run
• src/r
• src/bash

where each folder contains the analysis code that we explain and demonstrate in this book. The command will not overwrite existing files unless explicitly instructed to do so.

2. How to see the usage

Run the makefile to have it print information on its usage. For example:

make -f src/run/sra.mk

will print:

#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#

• Lowercase words, such as fastq, are the so-called targets of the makefile.
• Uppercase words, such as SRR=SRR1553425, are the external parameters of the makefile.

3. How to run the makefile

Each makefile comes with a self-testing functionality that you can invoke with the test
target.

make -f src/run/sra.mk test

To run the makefile with the target that the usage showed, for example fastq, execute:

make -f src/run/sra.mk fastq

Running a makefile with no parameters makes it use the default parameters, in this case SRR=SRR1553425, which is equivalent to running:

make -f src/run/sra.mk fastq SRR=SRR1553425

4. How to change parameters

You can pass a value for any variable to the module with:
make -f src/run/sra.mk SRR=SRR1553425 N=1000 fastq

Note the output the command generates:

mkdir -p reads
fastq-dump -F --split-files -X 1000 -O reads SRR1553425
Read 1000 spots for SRR1553425
Written 1000 spots for SRR1553425
-rw-r--r-- 1 ialbert staff 58K Oct 25 10:00 reads/SRR1553425_1.fastq
-rw-r--r-- 1 ialbert staff 63K Oct 25 10:00 reads/SRR1553425_2.fastq

Note how the module will also list the commands that it executes.

6. Where to get help

Look at the source code of the file you are running to see all parameters.

The navigation bar on the left also has a documentation page for each module.

As an example here is the documentation for sra.mk.

7. How to learn more

1. Look at the complete guide to understand the principles that we rely on.
2. Look inside the makefile to see how it works. We tried our best to write self-documenting code.
3. We also provide separate documentation for each makefile.

The table of contents lists several analyses. Each will rely on using the Makefiles in a
specific order and with certain parameters. Study each to learn more about the decision-
making that goes into each process.

1.5.2 Complete guide to modules

In this section, we describe our Makefiles' design and usage in more detail. It is not strictly necessary to read the content on this page, but it helps in better understanding the principles we follow.

Getting the modules

For each project, start with a new copy of the code and customize that.

You may need to customize the modules or add various flags to them. Thus, it is best to start with a new copy of the code for each project.

# Upgrade the bio package.
pip install bio --upgrade

# Download the workflow source code.
bio code

The command above will create directories such as

• src/run
• src/r
• src/bash

where each folder contains the analysis code that we explain and demonstrate in this book. The command will not overwrite existing files unless explicitly instructed to do so.

Testing a module

Every module can self-test.

Invoke the test target to have the module execute its self-test. For example:

make -f src/run/bwa.mk test

The above will download all the data it needs and will print the commands. A module may rely on other modules, hence it should be run from the directory that you have installed the modules in.

Module usage

Run the module, and it will tell you what it does.

Every module is self-documenting and will print its usage when you run it via make .
Investigate the modules in the src/run folder. We'll pick, say, src/run/sra.mk . Run the
module to get information on its usage:

make -f src/run/sra.mk

It prints the following:

#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#

All modules are runnable without passing parameters to them. Some modules list the valid parameters they can take.

Action targets like fastq are always lowercase. Parameters like SRR and N are always uppercased.

Performing a dry run

Add the -n flag to see what a makefile plans to do.

The module's usage indicates the so-called target, in this case fastq. Targets are in lowercase; parameters will always be uppercased. To see what the makefile will do when run, add the -n flag (a so-called dry run) to the command line:

make -f src/run/sra.mk fastq -n

It will print (but not execute the commands!):

mkdir -p reads
fastq-dump -F --split-files -X 1000 -O reads SRR1553425
ls -lh reads/SRR1553425*

Note how the Makefile is self-documenting. It prints the commands that it will execute.

Running a module

Every module is built to be runnable right away

In most cases the default parameters for a module are sufficient for a test run without setting additional parameters. For example:

make -f src/run/sra.mk fastq

The run will print:

mkdir -p reads
fastq-dump --gzip -F --split-files -X 1000 -O reads SRR1553425
Read 1000 spots for SRR1553425
Written 1000 spots for SRR1553425
-rwxrwxrwx 1 ialbert ialbert 58K Oct 22 22:36 reads/SRR1553425_1.fastq
-rwxrwxrwx 1 ialbert ialbert 63K Oct 22 22:36 reads/SRR1553425_2.fastq

Now rerun the same command:

make -f src/run/sra.mk fastq

This time it runs instantly because the files are already downloaded and prints the location of
the files only:

-rwxrwxrwx 1 ialbert ialbert 58K Oct 22 22:36 reads/SRR1553425_1.fastq
-rwxrwxrwx 1 ialbert ialbert 63K Oct 22 22:36 reads/SRR1553425_2.fastq

Some modules, for example aligners, need data files to run. Other modules may need an alignment file to work.

In those cases, we need to run the appropriate tools first, such as sra.mk or bwa.mk, to generate the correct data types.
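As a sketch of such chaining, using the Ebola accession featured later in the book (your reference and run will differ):

# Get the reference, download the reads, then build the index and align.
make -f src/run/genbank.mk fasta ACC=AF086833
make -f src/run/sra.mk fastq SRR=SRR1553425
make -f src/run/bwa.mk index align REF=refs/AF086833.fa SRR=SRR1553425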

Setting parameters

Look inside the Makefile (a simple text file) to see the available parameters.

If we wanted to download data for a different SRR number, we could specify command line
parameters like so:

make -f src/run/sra.mk fastq SRR=SRR3191542 N=2000

In our convention, the parameters will always be in uppercase. The command, when run,
prints:

mkdir -p reads
fastq-dump --gzip -F --split-files -X 2000 -O reads SRR3191542
Read 2000 spots for SRR3191542
Written 2000 spots for SRR3191542
-rwxrwxrwx 1 ialbert ialbert 101K Oct 22 22:39 reads/SRR3191542_1.fastq
-rwxrwxrwx 1 ialbert ialbert 110K Oct 22 22:39 reads/SRR3191542_2.fastq

In a nutshell, the sra.mk module is a reusable component that allows us to download reads
from the SRA database.

Undoing a run

Add the ! to undo the results of a run

Sometimes we need to delete the results of an operation.

In all of our modules, to remove the results generated by a target, invoke the target with an
exclamation sign ! added. For example, to undo the results generated by the fastq target,
run fastq! :

make -f src/run/sra.mk fastq!

Doing so will remove the downloaded files. To force rerunning the module even if the results already exist, use both targets fastq! and fastq in the same command:

make -f src/run/sra.mk fastq! fastq

You can always list multiple targets if those are applicable.

The reversal may not undo every change. For example, in the above case, it won't delete the
directory that a command might have created but will delete the downloaded files.

Universal parameters

Many modules accept additional parameters that have a common utility. We do not list these parameters in the module's usage because they are common to all.

For example:

• NCPU=4 will set the number of threads to 4.
• MODE=SE or MODE=PE will run a tool in single-end or paired-end mode.
• SRR=SRR3191542 will set the SRR number.

Tools are set up to run in single-end mode (MODE=SE); thus, for paired-end data, set MODE=PE on every module.

Look at the source code, we list all parameters at the start of the module.

Another common parameter is SRR as many of our workflows start with downloading data
from the SRA database. The default file names in each tool will derive from the SRR
parameter unless set otherwise.

For example, the following command will run bwa with 4 threads in single-end mode, and additionally applies filtering flags to remove unmapped reads:

make -f src/run/bwa.mk align SRR=SRR3191542 NCPU=4 MODE=SE SAM_FLAGS="-F 4"

Notably, using another aligner will work the same way as far as these universal parameters are concerned:

make -f src/run/bowtie2.mk align SRR=SRR3191542 NCPU=4 MODE=SE SAM_FLAGS="-F 4"

When not explicitly set, the module will use the SRR number to set the various input and output parameters. Thus, unless otherwise set, it will read its input from R1=reads/SRR3191542.fastq and will write the output to BAM=bam/SRR3191542.bam. The default naming makes it convenient when reproducing published data.

For your own data that does not have an SRR number, you can set the R1, R2 and BAM parameters to point to your files.
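For instance, a hypothetical invocation on your own paired-end files could look like this (all file names below are placeholders):

# Align local reads; R1/R2 name the inputs, BAM names the output.
make -f src/run/bwa.mk align MODE=PE \
    R1=reads/sample_R1.fq R2=reads/sample_R2.fq \
    BAM=bam/sample.bam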

Setting a parameter on a tool that does not recognize it will have no effect. So you can set NCPU=4 on, say, genbank.mk and nothing will change.

The easiest way to identify all parameters is to read the source, either in your copy of the code or here on the website.

Triggering a rerun

Add the target with and without a ! to rerun the module.

If we attempt to modify the run and get more reads with N=2000, the module won't do it by default, since the files are already there and it does not know that we want a new file with the same name. In that case, we need to trigger a rerun by undoing the target with fastq! and then rerunning it with fastq:

make -f src/run/sra.mk fastq! fastq SRR=SRR1553498 N=2000 MODE=PE

The same rule applies to all other modules as well. We might need to trigger a rerun whenever
we change parameters that may affect the contents of a file.

For example, to get the entire SRR file, we need to set N=ALL as a parameter, and we
probably want to remove the previous files (if we downloaded these before):

make -f src/run/sra.mk fastq! fastq SRR=SRR1553498 N=ALL MODE=PE

You need to trigger a rerun if you change parameters that affect the contents of a file that
already exists.

Software requirements

You may need to create a new environment for specific tools.

Every workflow that we develop will have a list of software requirements that you can see
with the following:

make -f src/run/sra.mk install

It prints the installation commands (but does not run them):

mamba install sra-tools

If you have set up the bioinfo environment as instructed during installation, your system is
already set up and good to go. Otherwise, you need to use the command above to set up the
software. For example, you could do the following:

make -f src/run/sra.mk install | bash

The above applies the installation commands into the current environment.

View the source code

Use the source, Luke!

Look inside src/run/sra.mk to see what the module does. We've tried making the code as
simple as possible without oversimplifying it.

Our modules are building blocks around some of the most anachronistic aspects of
bioinformatics. Our modules are what bioinformatics would look like in a well-designed
world.

We also document each module individually; on the left-hand side you can view the entry for the documentation and the full source code of sra.mk.

1.5.3 How to troubleshoot

As you run modules you may run into errors. Here is a list of common ones we have observed and how to fix them.

Study the guides a bit to make sure you understand how the concepts work:

• A short guide to modules
• A complete guide to modules

Each module is documented separately. See the section called Modules above.

Don't give up on the first error

The most important lesson, don't panic. Most errors are caused by plugging the wrong file into
the wrong location.

Check your file names, read the whole error message and think about what it says.

The vast majority of errors look scary because the wording is unfamiliar. Even scary looking
errors have straightforward solutions as demonstrated below.

Next, generate a so-called dry run and see what the module will attempt to do:

make -f src/run/sra.mk get -n

Paired-end vs single-end

Most tools will work in paired-end mode by default, but some require single-end data. To avoid uncertainty, specify the mode explicitly: MODE=SE or MODE=PE.

No rule to make target

The error listed in the title has many variants, and is a bit more verbose. It might look like this:

make: *** No rule to make target 'reads/SRR7795540_2.fastq_R1.fq',
needed by 'bam/SRR7795540_2.fastq.bam'. Stop.

The error means that make cannot find one of your input files.

For example, an aligner does not know how to make a FASTQ file; that is not its job.

The makefile knows that a FASTQ file is required to produce a BAM file. But if you don't have that FASTQ file, then it gets stumped. It tells you that it does not know how to make something that it needs later. It is always about input files, not output files.

Note the reported file name:

reads/SRR7795540_2.fastq_R1.fq

Does that file exist? Go check. I already know the answer. No, the file does NOT exist as
listed.

Use GNU Make 4.0 or later

make *** "### Error! Please use GNU Make 4.0 or later ###". Stop.

The error above means that you have not activated the bioinfo environment! As trivial as it
seems, the error still stumps many people. Solution? Make sure to activate your environment.

conda activate bioinfo

Our modules require a newer version of make above version 4.0.

If for whatever reason you can't run a newer make, you can replace the leading > symbols with TAB characters in our modules. In general, though, this workaround should not be necessary. Activate your environment and you should be fine.
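To see why a newer make matters, here is a tiny, self-contained demonstration (the makefile below is illustrative, not one of our modules):

# Create a throwaway makefile that uses '>' as the recipe prefix.
cat > demo.mk <<'EOF'
.RECIPEPREFIX = >
hello:
> @echo "recipes start with '>' instead of a TAB"
EOF

# A modern GNU Make runs it; older versions stop with 'missing separator'.
make -f demo.mk hello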

No targets specified and no makefile found

make: *** No targets specified and no makefile found. Stop.

The makefile does not exist at the location.

No such file or directory

make: src/run/bwa.mk: No such file or directory

The error means you are passing an incorrect path to make. Use TAB completion to enter the correct file path.

No rule to make target

make: *** No rule to make target 'align'. Stop.

You are invoking a target that does not exist.

Missing separator

src/run/sra.mk:40: *** missing separator. Stop.

Perhaps your makefile is old and does not have the latest changes. You should have make version 4.0 or later.

It could also mean that one of the actions in the makefile is not properly formatted.

You can also get this error when a comment character # is missing from a line.

Recreating an existing file

Normally our modules will skip existing files.

Sometimes, if you change an internal parameter, for example adding different alignment flags or other filtering parameters, you will need to recreate the files.

Do so by undoing the target with !, that is, use both align! and align:

make -f src/run/bwa.mk align! align

Touching files

Another option to force a rerun is to touch (update the date on) a file. The date update makes the file look newer and forces recomputation of all dependencies. For example, if you have a FASTQ file and you want all dependent files to be recreated, you could do:

touch reads/SRR1553425_1.fastq
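Then rerun the module; anything that depends on the touched FASTQ file gets rebuilt. A sketch, assuming the bwa.mk alignment from earlier:

# The alignment is recomputed because its input is now newer than the BAM.
make -f src/run/bwa.mk align SRR=SRR1553425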

In such cases, touching is preferable to triggering via get!, as touching will not delete and re-download the reads.

Additional tips

Bioinformatics workflows interconnect complex software packages. When starting any new project, expect to make several mistakes along the way. Your skill is measured primarily by the speed with which you recognize and correct your mistakes. Here are a few tips:

Study the short and long guides again and also:

• Visit Biostar Handbook Discussion Forum

The forum is a resource that we started for this book. You are welcome to post your question as a new discussion topic.

1.6 How to get data

1.6.1 How to get genome data

Data is distributed via various repositories. The most commonly used ones are:

flowchart TD
GB[<b>NCBI</b><br> GenBank/RefSeq] --> FILES(<b>File Formats</b><br>FASTA, GFF, GTF, BED)
ENS[<b>Ensembl</b> <br> Numbered Releases] --> FILES
UCSC[<b>UCSC</b><br>Data downloads] --> FILES
MODEL[<b>Model Organisms</b><br> FlyBase, Wormbase,...] --> FILES

Search for metadata

Most of the time it is useful to run a bio search to get information on data before
downloading it. For example:

# Searches GenBank
bio search AF086833

# Searches SRA
bio search SRR1553425

# Searches NCBI Genomes
bio search GCA_000005845

# Also searches NCBI Genomes
bio search ecoli

How to access GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. If your data has a GenBank accession number such as AF086833, use the genbank.mk module.

make -f src/run/genbank.mk fasta gff ACC=AF086833

The command prints:

mkdir -p refs/
bio fetch AF086833 --format fasta > refs/AF086833.fa
-rw-r--r-- 1 ialbert staff 19K Nov 4 11:26 refs/AF086833.fa
mkdir -p refs/
bio fetch AF086833 --format gff > refs/AF086833.gff
-rw-r--r-- 1 ialbert staff 7.5K Nov 4 11:26 refs/AF086833.gff

The command above downloaded the FASTA and GFF files corresponding to the AF086833
accession in GenBank.

Using NCBI Genomes

The Assembly resource includes prokaryotic and eukaryotic genomes with a Whole Genome
Shotgun (WGS) assembly, clone-based assembly, or completely sequenced genome (gapless
chromosomes). The main entry point is at:

• https://ftp.ncbi.nlm.nih.gov/genomes/

If your data has a NCBI Assembly accession number such as GCA_000005845 then run:

bio search GCA_000005845

It will print:

[
    {
        "assembly_accession": "GCA_000005845.2",
        "bioproject": "PRJNA225",
        "biosample": "SAMN02604091",
        "wgs_master": "",
        "refseq_category": "reference genome",
        "taxid": "511145",
        "species_taxid": "562",
        "organism_name": "Escherichia coli str. K-12 substr. MG1655",
        "infraspecific_name": "strain=K-12 substr. MG1655",
        "isolate": "",
        "version_status": "latest",
        "assembly_level": "Complete Genome",
        "release_type": "Major",
        "genome_rep": "Full",
        "seq_rel_date": "2013/09/26",
        "asm_name": "ASM584v2",
        "submitter": "Univ. Wisconsin",
        "gbrs_paired_asm": "GCF_000005845.2",
        "paired_asm_comp": "identical",
        "ftp_path": "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2",
        "excluded_from_refseq": "",
        "relation_to_type_material": "",
        "asm_not_live_date": ""
    }
]

Now visit the url displayed in the ftp_path field:

• https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2

View and understand what files you may download from there. After that you can copy-paste
the link and download it directly with curl or wget :

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.fna.gz

Alternatively, you can use the curl.mk module that we provide. One advantage of our Makefiles is that they avoid downloading data that already exists.
With curl.mk you can download a single file:

# It is a long URL that we are unable to wrap.


URL=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/GCA_000005845

# You can use curl directly on the url


curl $URL | gunzip -c > refs/chr22.fa

# The curl.mk module that not download data that already exists.
make -f src/run/curl.mk get URL=$URL ACTION=unzip FILE=refs/chr22.fa

NCBI genomes with rsync

NCBI allows the use of the rsync protocol (do note that most sites do not support this functionality). To use rsync.mk we have to change the protocol from https to rsync; that way we can download the complete directory, not just a single file:

# You need to change the protocol to rsync in this case.
URL=rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2

# You can use rsync directly.
rsync -avz $URL .

# Or use our rsync.mk makefile module.
make -f src/run/rsync.mk get URL=$URL

How to access Ensembl

Ensembl operates on numbered releases. For example, release 104 was published on March 30, 2021:

• http://ftp.ensembl.org/pub/release-104/

Navigate the link above and familiarize yourself with the structure. To collect all the data for an organism, you’ll have to click around and visit different paths, as Ensembl groups information by file format.

You can invoke curl or wget directly on each file or use the curl.mk module.

# Get the FASTA file for chromosome 22 of the human genome.
URL=http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz

# You can use curl directly on the URL.
curl $URL | gunzip -c > refs/chr22.fa

# The curl.mk module will not download data that already exists.
make -f src/run/curl.mk get URL=$URL FILE=refs/chr22.fa ACTION=unzip

The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/getting-data.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.
set -uex

# Genome accession.
ACC=AF086833

# Download the reference genome as FASTA.
make -f src/run/genbank.mk fasta ACC=${ACC}

# Data at ENSEMBL.
URL=http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz

# Download a URL as a GZIP file.
make -f src/run/curl.mk get URL=$URL FILE=refs/chrY.fa.gz

# Download and unzip the file (note the different extension).
make -f src/run/curl.mk get URL=$URL FILE=refs/chrY.fa ACTION=UNZIP

# An rsync enabled URL.
URL=rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2

# Download the entire directory via rsync into a destination folder.
make -f src/run/rsync.mk get URL=$URL DEST=refs

1.6.2 How to use refgenie

Refgenie is a command-line tool that can be used to download and manage reference
genomes, and to build and manage custom genome assets.

flowchart TD
REMOTE1[<b>Remote Resource 1</b><br>Human genome] -->|refgenie pull| LOCAL[<b>Local
Computer</b><br>common storage area]
REMOTE2[<b>Remote Resource 2</b><br>BWA index] -->|refgenie pull| LOCAL
REMOTE3[<b>Remote Resource 3</b><br>GFT annotations] -->|refgenie pull| LOCAL
LOCAL -->|refgenie seek| PATH[<b>Path to <br> local resource</b>]

When working with standardized genomic builds (human and mouse especially), we found refgenie to be a very convenient tool.

See also the Refgenie official documentation

Refgenie installation

Install refgenie with:

pip install refgenie

Create a config file

Then you need to create a configuration file that lists the resources.

Every time you run refgenie you can pass that file to each command with the -c option.
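For example (a sketch; the path is whatever you chose for your configuration):

# Point a refgenie command at an explicit configuration file.
refgenie list -c ~/refs/config.yaml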

A simpler approach is to create a default configuration file that is accessed via the REFGENIE
shell environment variable. In that case you can omit the -c option and the configuration file
will be automatically loaded from the file stored in the REFGENIE variable.

We have added the following to ~/.bashrc :

export REFGENIE=~/refs/config.yaml

You can append this line by entering:

echo "export REFGENIE=~/refs/config.yaml" >> ~/.bashrc

Now run source ~/.bashrc or close and open the terminal to reload the configuration.

Initialize the config

Initialize the configuration file; this needs to be done only once:

refgenie init

or if you have a different configuration file:

refgenie init -c ~/refs/otherconfig.yaml

You are all set. You can now use refgenie to download and manage reference genomes.

Using refgenie

List remote genome assets:

refgenie listr

The resources are listed as build/type, for example hg38/gencode_gtf.

Download a resource:

refgenie pull hg38/gencode_gtf

List local genome assets:

refgenie list

Find the path to a resource:

refgenie seek hg38/gencode_gtf

It will print the path:

/scratch/refs/alias/hg38/gencode_gtf/default/hg38.gtf.gz

Check that the file exists:

ls -l /scratch/refs/alias/hg38/gencode_gtf/default/hg38.gtf.gz

You can also use the command substitution $() construct like so:

ls -l $(refgenie seek hg38/gencode_gtf)

Refgenie in a bash script

In a script you can run a program and capture its output with the $() syntax:

# Set the GTF variable to the path of the hg38/gencode_gtf resource.
GTF=$(refgenie seek hg38/gencode_gtf)

Refgenie in a Makefile

In a Makefile you can use the $(shell command) syntax:

# Set the GTF variable to the path of the hg38/gencode_gtf resource.
GTF=$(shell refgenie seek hg38/gencode_gtf)

Using refgenie with a module

A complete example on how to pull the bwa_index and use it in a module:

# Get the hg38/bwa_index resource.
refgenie pull hg38/bwa_index

# Set the IDX variable to the path of the hg38/bwa_index resource.
IDX=$(shell refgenie seek hg38/bwa_index)

# Run bwa with the new index set.
make -f src/run/bwa.mk IDX=${IDX} R1=read1.fq R2=read2.fq MODE=PE align

Subscribing to the iGenomes server

The iGenomes server is a public server that hosts additional reference genomes and genome
assets.

refgenie subscribe -s http://igenomes.databio.org/

1.6.3 How to get FASTQ files

Published FASTQ files are stored in the Short Read Archive (SRA). Access to SRA can be diagrammed like so:

flowchart LR
SRR(<b>SRR numbers</b><br>metadata: bio/ffq) --> SRA
SRR --> ENS
SRA(<b>SRA</b><br> Short Read Archive) --> DUMP{fastq-dump<br>}
DUMP --> FILES(<b>FASTQ FILES</b>)
ENS(<b>Ensembl</b> <br> Sequence Archive) --> DOWN{wget/curl<br>aria2c}
DOWN --> FILES
SRR --> CLOUD
CLOUD(<b>Commercial Cloud</b> <br>Google Storage, Amazon S3<br>User pays for download) --> CLOUD_TOOLS{gsutil<br>aws}
CLOUD_TOOLS --> FILES

The unbearable clunkiness of getting a simple file

Of all tasks in bioinformatics nothing is more annoying than the clunkiness of downloading a
simple FASTQ file.

We have lots of options at our disposal, but in reality all are flaky.

Rant: Why is it that when I install a 30 GB Steam game on my home system it takes a single click and finishes in two hours, but if I try to download a 30 GB FASTQ file via my superfast connection at the university it takes 16 hours (and many times fails outright), and I have to use 3 different tools to figure out which one works? Rhetorical question ... I know the answer. It is because when you can't download a game you purchased, there are repercussions. When you can't download a FASTQ file, nobody cares.

Supposedly a dedicated, full-time team of programmers supported by NCBI develops sra-tools, yet it is one of the most universally reviled, disliked, and flakiest tools in bioinformatics. Sometimes it works, sometimes it doesn't, sometimes it pops odd errors. On some systems it does not want to work at all. It is just a file download, folks; why is it so hard?

ENSEMBL does provide links to FASTQ files. Only, if you are in the USA, that means you have to download the files from Europe, and transfer speeds are usually lacking.

Surviving the clunkiness

But what do you do when you do need to get data? You will be hitting up various tools until something works:

fastq-dump, bio, aria2c, ffq, fastq-dl, geofetch

and so on. Eventually something will work. Good luck comrade, you will need it!

Search for metadata

Most of the time you should start by investigating the SRA numbers for additional information. I developed bio search because, at the time, no other tool reported the metadata in the way I thought was most useful:

# Look up information on an SRR number.
bio search SRR1553425

will print:

[
    {
        "run_accession": "SRR1553425",
        "sample_accession": "SAMN02951957",
        "sample_alias": "EM110",
        "sample_description": "Zaire ebolavirus genome sequencing from 2014 outbreak in Sierra Leone",
        "first_public": "2015-06-05",
        "country": "Sierra Leone",
        "scientific_name": "Zaire ebolavirus",
        "fastq_bytes": "111859282;119350609",
        "base_count": "360534650",
        "read_count": "1784825",
        "library_name": "EM110_r1.ADXX",
        "library_strategy": "RNA-Seq",
        "library_source": "TRANSCRIPTOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2500",
        "study_title": "Zaire ebolavirus Genome sequencing",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_2.fastq.gz"
        ],
        "info": "112 MB, 119 MB file; 2 million reads; 360.5 million sequenced bases"
    }
]

There are more fields to search for; see bio search --help for more information.

# Produce all known fields.
bio search SRR1553425 --all

# Search a project number.
bio search PRJNA257197

# Format the output as a CSV file.
bio search PRJNA257197 --csv

How to download an SRR run

If you know the SRR number, you can use the sra.mk module, which runs fastq-dump behind the scenes. Pass N to get a subset of the data.
make -f src/run/sra.mk get SRR=SRR1553425 N=10000

prints:

mkdir -p reads
fastq-dump -F --split-3 -X 10000 -O reads SRR1553425
Read 10000 spots for SRR1553425
Written 10000 spots for SRR1553425
-rw-r--r-- 1 ialbert staff 577K Nov 4 11:37 reads/SRR1553425_1.fastq
-rw-r--r-- 1 ialbert staff 621K Nov 4 11:37 reads/SRR1553425_2.fastq

Take note of what happened: the module also created a reads folder with FASTQ files in it. You can change the destination folder, as it is also an input parameter.
Pass N=ALL to download all the data.

Fifty percent of times fastq-dump works every time. If it works for you, show your
gratitude to the deities of Bioinformatics and proceed to the next phase of your research.
However, if luck is not on your side, do not worry and keep on reading. We are gonna get
through it one way or another.

How to download multiple runs

To obtain all data for a project, search for the project number to identify the SRR run numbers,
then use GNU parallel to download all the data.

bio search PRJNA257197 --csv > project.csv

Now read the first column and run the downloads in parallel (we limit it to 3 samples here):

cat project.csv | csvcut -c 1 | head -3 | \
    parallel make -f src/run/sra.mk get SRR={} N=1000

When fastq-dump fails

Sometimes people have trouble running fastq-dump. In those cases, locate the URLs in the bio search output:

bio search SRR1553425

that prints:

[
    {
        ...
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz",
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_2.fastq.gz"
        ],
        "info": "112 MB, 119 MB file; 2 million reads; 360.5 million sequenced bases"
        ...
    }
]

Note the entry called fastq_url. Isolate those URLs, then download them with curl.mk. You can do the isolation manually or automate it with the jq tool:

bio search SRR1553425 | jq -r '.[].fastq_url[]'

will print

https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz
https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_2.fastq.gz

Download those files with a method of your choice.
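As one possible way to wire it all together, you could pipe the URLs straight into a downloader; a sketch (the reads destination folder is our choice):

# Extract the FASTQ URLs with jq and fetch each one; -nc skips existing files.
bio search SRR1553425 | jq -r '.[].fastq_url[]' | \
    parallel wget -nc -P reads {}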

How to download large files

When you have to download large files, you should use the aria2c tool. It is a handy command-line download utility. One of its key benefits is the ability to resume partially finished downloads. Moreover, the tool can segment the file into multiple parts and leverage multiple connections to expedite the download process.

mamba install aria2

# The file to download.
URL=https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/005/SRR1553425/SRR1553425_1.fastq.gz

# Download with 5 connections.
aria2c -x 5 -d reads $URL
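Should the transfer get interrupted, rerun the command with the continue flag and aria2c picks up where it left off:

# Resume a partially finished download.
aria2c -c -x 5 -d reads $URL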

Use the SRA Explorer

The SRA Explorer tool aims to make datasets within the Sequence Read Archive more accessible. It is everything that NCBI's SRA website should be. It is a single-page application that allows you to search for datasets and view metadata:

• https://sra-explorer.info/

Use the NCBI website

You can also visit the SRA website to navigate the data.

There are various download strategies described there as well.

Other ways to access fastq data

• fastq_dl takes an ENA/SRA accession (Study, Sample, Experiment, or Run) and queries
ENA (via Data Warehouse API) to determine the associated metadata. It then downloads
FASTQ files for each Run. For Samples or Experiments with multiple Runs, users can
optionally merge the runs.
• ffq receives an accession and returns the metadata for that accession as well as the metadata
for all downstream accessions following the connections between GEO, SRA, EMBL-EBI,
DDBJ, and Biosample.
• geofetch is a command-line tool that downloads sequencing data and metadata from GEO
and SRA and creates standard PEPs. geofetch is hosted at pypi. You can convert the result of
geofetch into unmapped bam or fastq files with the included sraconvert command.

The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/getting-fastq.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.
set -uex

# The SRA archive accession.
SRR=SRR1553425

# Download the short reads.
make -f src/run/sra.mk get SRR=${SRR} N=2500 MODE=PE

1.7 Alignments

1.7.1 How to align short reads

A short read alignment workflow will typically look like this:

flowchart LR
FQ(<b>FASTQ</b> <br> Sequencing Reads) --> BWA{Aligner}
RF(<b>FASTA</b> <br> Genome Reference) --> BWA
BWA --> SAM(<b>SAM</b><br>Alignment file)
SAM --> SAMTOOLS{Sort & <br> Index}
SAMTOOLS --> BAM(<b>BAM</b><br>Alignment file)

This tutorial will produce alignments using data discussed in the chapter Redo: Genomic surveillance elucidates Ebola virus origin. We identified that the 1976 Mayinga strain of Ebola has the accession number AF086833. We also located multiple sequencing datasets, out of which we select one: SRR1553425.

Download the data

Obtain the genome reference file in FASTA and GFF formats from GenBank via the
genbank.mk module.

make -f src/run/genbank.mk fasta gff ACC=AF086833

Download a subset of the FASTQ data for the SRR1553425 accession number.

make -f src/run/sra.mk get SRR=SRR1553425 N=10000

Align the short reads

We provide modules that can run multiple short-read aligners in a similar manner. Each module is built such that we can pass either SRR numbers or specify the R1 and R2 files. We need to ensure that the indices are also built, so we invoke the modules with both the index and align targets.

Pass the BAM parameter to control how the output is named. Visit the documentation of each module to learn more about how to override additional parameters.

A. Create alignments using bwa :

make -f src/run/bwa.mk index align REF=refs/AF086833.fa SRR=SRR1553425 MODE=PE

will produce an indexed BAM file in:

bam/SRR1553425.bwa.bam

B. Create alignments using bowtie2 :

make -f src/run/bowtie2.mk index align REF=refs/AF086833.fa SRR=SRR1553425

C. Create alignments using hisat2 :

make -f src/run/hisat2.mk index align REF=refs/AF086833.fa SRR=SRR1553425

D. Create alignments using minimap2 :

make -f src/run/minimap2.mk index align REF=refs/AF086833.fa SRR=SRR1553425

Note how simple and elegant our reusable modules are. The interfaces are identical, and you
can process the same FASTQ files with different methods to produce the BAM files.

Visualize the alignments

Use IGV to import the reference file as the genome and the BAM file as the alignment.


Explore the interface and the alignments.
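If you prefer a quick look in the terminal before starting IGV, samtools tview can display
the alignments as text (a sketch using the files created above):

# Optional: text-mode view of the alignments in the terminal.
samtools tview bam/SRR1553425.bwa.bam refs/AF086833.fa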

The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/short-read-alignments.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.


set -uex

# Genome accession.
ACC=AF086833

# SRR number.
SRR=SRR1553425

# The reference genome.


REF=refs/${ACC}.fa

# The alignment file.


BAM=bam/${SRR}.bam


# Library layout: PE or SE.


MODE=PE

# Download the reference genome.


make -f src/run/genbank.mk fasta gff ACC=${ACC}

# Download the short reads.


make -f src/run/sra.mk get SRR=${SRR} N=2500 MODE=${MODE}

# Index and align the reads to the reference genome.


make -f src/run/bwa.mk index align SRR=${SRR} REF=${REF} BAM=${BAM} MODE=${MODE}

1.8 Variant calling

1.8.1 How to call genomic variants

A typical variant calling workflow consists of the following steps:

flowchart LR

BAM(<b>BAM</b><br>Alignment file) --> SNP{SNP <br> caller}


RF(<b>FASTA</b> <br> Genome Reference) --> SNP
SNP --> VCF(<b>VCF</b><br>Variant Call File)

VCF --> NORM{Postprocess}

NORM --> BCF( <b>Filtered <br> Normalized <br> Compressed <br> VCF</b>)

See the tutorial How to align short reads to learn how to generate the BAM file.
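The postprocessing step typically normalizes the variant representation, then compresses and
indexes the calls. Our variant calling modules take care of this for us, but a hand-rolled sketch
of the idea (the input file name here is illustrative) could look like:

# Normalize the indel representation against the reference, write compressed VCF.
bcftools norm -f refs/AF086833.fa vcf/raw.vcf.gz -O z -o vcf/normalized.vcf.gz

# Index the compressed VCF.
bcftools index vcf/normalized.vcf.gz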

Start with a BAM file

A variant calling workflow starts with a BAM file that you have access to. Here we assume
you ran the previous tutorial that generated a file called:

bam/SRR1553425.bwa.bam

You also need to have a reference genome that was used to generate the alignment. The variant
caller will use your BAM file and the reference genome to identify variants.

Choose a variant caller

Several alternative tools may be used to call variants. All do a good job with easy variants, but
they may differ quite a bit when calling variants where the alignments are ambiguous or
unreliable. Different schools of thought may collide within each software package.

bcftools


bcftools is a variant caller that is part of the samtools package. It is my favorite tool for
non-human species. I think it strikes a good balance between usability and performance.

• Docs: https://samtools.github.io/bcftools/bcftools.html
• Paper: A statistical framework for SNP calling, Bioinformatics 2011
• Paper: The evaluation of Bcftools mpileup, Scientific Reports (2022)

freebayes

freebayes is a Bayesian genetic variant detector designed to find small polymorphisms,
specifically SNPs, indels, MNPs, and complex events smaller than the length of a short-read
sequencing alignment.

• Docs: https://github.com/freebayes/freebayes
• Paper: Haplotype-based variant detection from short-read sequencing, arXiv, 2012

GATK

GATK is the most commonly used tool for human genome variant calling and the de-facto
standard in that field. It is a complex beast: a platform and a way of thinking, with extensive
documentation and wide adoption.

• Docs: https://gatk.broadinstitute.org/hc/en-us

deepvariant

DeepVariant is a deep learning-based variant caller that takes aligned reads, produces pileup
images, classifies each image using a convolutional neural network, and finally reports the
results in a standard VCF or gVCF file.

• Docs: https://github.com/google/deepvariant
• Paper: A universal SNP and small-indel variant caller using deep neural networks, Nature Biotechnology, 2018

Generate variants

Assuming the following variables are set:

BAM=bam/SRR1553425.bwa.bam
REF=refs/AF086833.fa


We provide a module that runs bcftools .

make -f src/run/bcftools.mk vcf BAM=$BAM REF=$REF

the module produces a VCF file:

vcf/SRR1553425.bcftools.vcf.gz

We also provide a module that runs freebayes with the exact same interface:

make -f src/run/freebayes.mk vcf BAM=$BAM REF=$REF

In that case, the file we obtain is:

vcf/SRR1553425.freebayes.vcf.gz

That's it.
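Before visualizing, it is worth skimming a summary of what was called. A quick sketch using
bcftools stats on the file generated above:

# Print the summary numbers (records, SNPs, indels, and so on).
bcftools stats vcf/SRR1553425.bcftools.vcf.gz | grep "^SN"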

Visualize the results

Visualize all the files in IGV: load the reference FASTA, then the BAM and the VCF files.


The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/variant-calling.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.


set -uex

# Genome accession.
ACC=AF086833

# SRR number.
SRR=SRR1553425

# The reference genome.


REF=refs/${ACC}.fa

# The alignment file.


BAM=bam/${SRR}.bam

# The variants called with bcftools.


VCF=vcf/${SRR}.bcftools.vcf.gz

# Library layout: PE or SE.


MODE=PE

# Download the reference genome.


make -f src/run/genbank.mk fasta gff ACC=$ACC

# Download the short reads.


make -f src/run/sra.mk get SRR=${SRR} N=2500 MODE=${MODE}

# Index the genome then align the reads to the reference genome.
make -f src/run/bwa.mk index align SRR=${SRR} REF=${REF} BAM=${BAM} MODE=${MODE}

# Call variants with bcftools.


make -f src/run/bcftools.mk vcf REF=${REF} BAM=${BAM} VCF=${VCF}

1.8.2 How to evaluate variant calls

In this chapter, we take on the task of reproducing the findings of a paper dedicated to
comparing the accuracy of different variant calling workflows.

• Comparison of three variant callers for human whole genome sequencing, Scientific Reports
volume 8, Article number: 17851 (2018)

As we work through the paper, we learn a lot about variant calling, the pitfalls, and the
challenges of detecting and validating variants in a genome.

Most importantly, we learn that deciding which variant caller is better is not nearly as simple
as it seems.

TLDR

All variant callers perform exceedingly well even with default parameters. bcftools was
the fastest, DeepVariant the most accurate.


Final ranking:

1. DeepVariant in 2023, F=0.997
2. bcftools in 2023, F=0.9846
3. GATK in 2023, F=0.9841
4. DeepVariant in 2017, F=0.977

Generating the variant calls

Reproducing the variant calls (VCF files) themselves is beyond the scope of this chapter. We
will present the variant calling process in a different section.

In this chapter, we will begin our study with the variant call (VCF) files published by the
authors.

In addition to the published results, we generated variant calls with workflows published in
this book.

How to tell if a variant is genuine?

The question of which variant caller is "better" has long preoccupied scientists. The main
challenge is that we don't know what the ground truth is. We don't have a universal and
objective method to decide which variant is accurate and which variant call is invalid.

Consequently, scientists need to reach the next best thing - some consensus. They first need to
identify a set of variant calls considered to be very likely correct and use those as a
benchmark. The problem is that we use variant callers to identify the benchmark data, then use
these results to evaluate the same variant callers.

A bit of a chicken and egg problem.


Genome in a Bottle

The Genome in a Bottle Consortium (GIAB) was created to come up with a set of benchmark
datasets, which can be used to validate the accuracy of variant calls. According to various
resources:

GIAB provides high-confidence calls for a range of variant types, including single nucleotide
variants (SNVs), insertions, deletions, and structural variants (SVs), for various human
genomes. These calls are generated using multiple sequencing technologies, platforms, and
pipelines, making them highly reliable and unbiased. Researchers can compare their variant
calls with the benchmark calls provided by GIAB to assess the accuracy of their variant
calling workflow.

At least that is the theory - the reality is a bit more complicated. A lot more complicated,
actually.

Who is watching the watchers?

The GIAB benchmark evolves, and the variant calls are refined over time as new data gets
collected. The GIAB has had releases in 2014, 2015, 2016, 2017 and 2021.

Naturally, the question arises: how different is the 2014 benchmark from the 2021
benchmark?

Would the software considered best when evaluated on the 2014 data also score best on the
2021 data?

Precision FDA Truth Challenge

The Precision FDA Truth Challenge was a public competition organized by the US Food and
Drug Administration (FDA) to evaluate the performance of variant calling methods on whole-
genome sequencing data.

If you look at the outcomes on their page above, you'll note that the results submitted for the
challenge are close. The contest winners are only 0.1% better than the second place. The best
variant caller is only 1% better than the worst.

Note that the results of an evaluation are only as good as the correctness of the data itself. If
the "ground" truth is inaccurate within 1%, then any ranking based on a 1% difference can't be
all that meaningful.


Are the winners indeed better, or are the winners better at predicting both the right calls and
the miscalls?

Why is it so hard to compare apples to apples

In this chapter, we've set out to compare variant calls from different sources - the process
turned out to be unexpectedly tedious. We clicked, pasted, read, unpacked, filtered, and
corrected VCF files for many days.

We ran into all kinds of quirks and problems. The sample names had to be changed; the
variant annotations would not parse correctly; the VCF files had errors inside them, and so on.

When downloading different GIAB releases, the VCF files are incompatible and need
alteration and fixing to become comparable. Take that, reproducibility! We can't even directly
compare the different GIAB releases without some additional work.

We have lost track of how many small but annoying additional steps we had to employ to
make the data comparable.

But finally, we managed to package all the data up in a single archive that contains all you
need to get going.

Download the evaluation

We packaged the data and deposited it at this link:

• http://data.biostarhandbook.com/vcf/snpeval.tar.gz

To download the data at the command line, run the following:

# Download and unpack the data


curl http://data.biostarhandbook.com/vcf/snpeval.tar.gz | tar zxvf -

The archive will create a snpeval folder containing several subdirectories and files.

Understand the data

The vcf directory contains the variant calls in VCF format.

We have restricted our data to chromosome 1 alone to make the data smaller and quicker to
work with.


The paper reports that DeepVariant by Google is the best variant caller. We have
downloaded the variant calls from the article and stored them as the
DEEP-2017.chr1.vcf.gz file.

We also generated variant calls via the modules from this book that wrap the Google
DeepVariant ( deepvariant.mk ), GATK ( gatk.mk ) and bcftools ( bcftools.mk ) variant callers.

Here are the files you will find:

1. GIAB-2017.chr1.vcf.gz - GIAB benchmark published in 2017
2. GIAB-2021.chr1.vcf.gz - GIAB benchmark published in 2021
3. DEEP-2017.chr1.vcf.gz - Variant calls published in the paper
4. DEEP-2023.chr1.vcf.gz - Variant calls generated with the deepvariant.mk module
5. GATK-2023.chr1.vcf.gz - Variant calls generated with the gatk.mk module
6. BCFT-2023.chr1.vcf.gz - Variant calls generated with the bcftools.mk module

Combined variant calls

1. MERGED.chr1.vcf.gz - Merged variants from all the files above.

High confidence regions in 2017 and 2021

1. GIAB-2017.bed
2. GIAB-2021.bed

The high-confidence regions in GIAB are typically defined as regions where sequencing
and variant calling are believed to be highly accurate. When evaluating variant calls,
benchmarking tools typically only consider the variants within high-confidence regions.

In reality, however, the high-confidence regions are also those that are easier to call, where the
choice of parameters and settings is not as critical. Another way to say this is that variant
calling errors are not uniformly distributed across the genome. Some regions are more error-
prone than others. Some areas may always generate incorrect variant calls. That is a far cry
from the typical error interpretation, where a 1% error is assumed to be uniformly distributed
across the data.


The results folder contains the results of running the rtg vcfeval program on the
variant calls. We will talk about rtg later; for now, know that each folder compares variant
calls relative to GIAB-2021 :

1. GIAB-2021-vs-GIAB-2017 - compares GIAB 2021 to GIAB 2017
2. GIAB-2021-vs-DEEP-2017 - compares GIAB 2021 to DeepVariant 2017
3. GIAB-2021-vs-DEEP-2023 - compares GIAB 2021 to DeepVariant 2023
4. GIAB-2021-vs-GATK-2023 - compares GIAB 2021 to GATK 2023
5. GIAB-2021-vs-BCFT-2023 - compares GIAB 2021 to Bcftools 2023

You don't need to run any tool to follow our analysis, but if you want to run the evaluation
yourself, look at the Makefile we include at the end of the chapter to understand how we did
it.

Visualizing the merged data

First, visualize the MERGED.chr1.vcf.gz file using IGV.

Select the hg38 reference genome in IGV, then select chr1 , then load the
MERGED.chr1.vcf.gz file.

Study the merged VCF file in IGV.

Each track represents calls made by a different variant caller.


Quantifying accuracy

On the merged variant call file, identify by eye a few of the following situations:

1. A true positive call is present in both GIAB-2021 and the track.
2. A false positive call is present in the track but not in GIAB-2021.
3. A false negative call is present in the GIAB-2021 benchmark but not in the track.
4. A true negative call is one that is not present in either the benchmark or the track.

Suppose we tally these events and create the sums typically reported in the literature as TP ,
FP , FN , TN . These numbers can be used to evaluate the accuracy of the variant caller.

Precision and recall are two metrics commonly used to evaluate the accuracy of classification
methods.

Precision measures the proportion of identified variants that are true positives.

precision = TP / (TP + FP)

Recall, also known as sensitivity or true positive rate, measures the proportion of true variants
the tool correctly identified.

recall = TP / (TP + FN)

In an ideal case, precision and recall would be 1, indicating that the variant caller perfectly
identifies all true variants while not calling any false ones.

To combine precision and recall into a single number, the most commonly used metric is the
F1 score, which is the harmonic mean of precision and recall:

F1 score = 2 * (Precision * Recall) / (Precision + Recall)
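As a quick sanity check, we can plug in the DeepVariant 2017 numbers reported later in this
chapter (TP=247460, FP=2705, FN=8703) and recover the published metrics. A small R snippet
you can paste into an R console:

# Worked example with the DeepVariant 2017 call counts from this chapter.
TP <- 247460
FP <- 2705
FN <- 8703

precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)

# Prints: precision 0.9892, recall 0.9660, F1 0.9775
round(c(precision = precision, recall = recall, F1 = f1), 4)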

False Positives vs Negatives

While the two measures appear similar, we want to note a fundamental difference when
dealing with false positives and false negatives.

A variant call file containing false positives can be post-processed with various filters that
remove the false positives. Hence, in theory, data with false positives can be improved
upon later.


Variant call files containing false negatives cannot be improved via VCF filtering. The variant
is absent, and data about the location is absent. Thus, we cannot improve upon this type of
error.

This is to say that false negatives are more problematic than false positives.

Comparing variants

Comparing two VCF files is not a trivial task. Intuitively, we would think that all we need to do
is line up the two variant files and then compare the variants at each position. In most cases
that might be enough, but there are many edge cases and overlapping variants that make the
problem more complicated. In those cases, we need to compare the entire region around the
variant (the entire haplotype), not just the variant itself.

Tools like hap.py and rtg vcfeval are designed to compare VCF files.

• https://github.com/Illumina/hap.py
• https://github.com/RealTimeGenomics/rtg-tools

For example, RTG Tools includes several utilities for VCF files and sequence data. The
most interesting is the vcfeval command, which compares VCF files.

Our data includes the results of running vcfeval on various VCF file combinations. See the
Makefile at the end of this chapter on how to run it yourself.

Evaluating the results

In the following, we report and discuss the output of the rtg vcfeval tool.

The commands we run in our Makefile are approximately as follows (we simplified the paths
for brevity). For example, to compare GIAB 2021 to GIAB 2017, we run:

rtg vcfeval -b GIAB-2021.chr1.vcf.gz \
    -c GIAB-2017.chr1.vcf.gz \
    -e GIAB-2021.bed \
    -t hg38.sdf \
    -o GIAB-2021-vs-GIAB-2017

The command, when run, generates the directory called GIAB-2021-vs-GIAB-2017 with
several files. The summary.txt file contains the report we are interested in. Here we note that

the GIAB files have been filtered to only contain high-confidence calls, and with time the
regions that are considered high-confidence have changed.

The rtg vcfeval tool creates verbose headers that are easy to read but won't fit in our text
without wrapping. We will abbreviate False-pos as FP , False-neg as FN , and so on in
our reports.

GIAB IN 2017

How reliable is the GIAB 2017 benchmark compared to the GIAB 2021 benchmark? Let's
find out.

If we use the 2021 definition of "high-confidence intervals", the results are as follows:

Base     TP       FP    FN     Prec    Recall  F
239977   239979   21    16183  0.9999  0.9368  0.9673

• 21 false positives and
• 16183 false negatives relative to 2021
• 0.96 F score.

If we use the 2017 version of the "high-confidence intervals", the results are as follows:

Base     TP       FP    FN    Prec    Recall  F
242326   242327   5125  170   0.9793  0.9993  0.9892

• 5125 false positives and
• 170 false negatives relative to 2021
• 0.99 F score.

All of a sudden, we note the difficulty in evaluating even consecutive GIAB releases. It is not
clear whether missing variants are due to the selection of regions or due to the variant caller.

We find the imbalance between false positives and false negatives quite unexpected.

DEEPVARIANT IN 2017

The variants below are those published in the paper Comparison of three variant callers for
human whole genome sequencing, Scientific Reports (2018) that we have set out to reproduce.


These variants were generated with the DeepVariant tool. When compared to the GIAB 2021
benchmark, the results are as follows:

Base     TP       FP    FN    Prec    Recall  F
247457   247460   2705  8703  0.9892  0.9660  0.9775

The F score we obtain is very close to the number reported in the paper ( F=0.98 ).

We find this quite noteworthy. The variant calls generated by the DeepVariant tool in 2017 are
more accurate than even the GIAB 2017 benchmark!

DEEPVARIANT IN 2023

Since the publication of the paper, the DeepVariant tool has been updated to version 1.5.0. We
have generated variants with this latest version as well:

Base     TP       FP   FN    Prec    Recall  F
253872   253875   749  2288  0.9971  0.9911  0.9941

GATK IN 2023

For many years the gold standard in variant calling was held by GATK, the Genome Analysis
Toolkit. Let's compare the results produced by the gatk.mk module in the Biostar Workflows
to the GIAB 2021 benchmark:

Base     TP       FP    FN    Prec    Recall  F
252702   252705   4692  3458  0.9818  0.9865  0.9841

We note a more natural balance between false positives and false negatives. Just about the
same number of FP and FN calls are observed.

Have we used GATK to its full potential? Most likely not. Though I have attempted to use the
best practices, I don't doubt that many more options could be used to improve the results.

Yet I am quite pleased; we can match and even slightly exceed the results reported in the
paper.


BCFTOOLS IN 2023

My favorite variant caller is bcftools , and you can use it via the bcftools.mk module in
the Biostar Workflows. It runs incomparably faster, requires fewer resources, and is far, far
easier to use than either DeepVariant or GATK. Let's see what results it produces:

Base     TP       FP    FN    Prec    Recall  F
250844   250846   2519  5317  0.9901  0.9792  0.9846

And lo and behold, the variant calls generated via bcftools turn out to be quite accurate,
very much comparable to those generated by the "best" methods.

Notably, bcftools runs in a fraction of the time of GATK and requires incomparably fewer
resources.

I did not know what to expect here and was pleased with the results.

Lessons learned

1. We were able to reproduce the results of the paper.
2. Modern methods perform better than those reported in the paper.
3. Deciding which variants to trust is still not a solved problem.
4. Modern variant callers are quite accurate and can be used with confidence. There is not
much need to worry about tuning them; the default parameters work wonders.

The complete workflow

The complete workflow is included with the code you already have and can be run as:

make -f src/workflows/snpeval.mk eval

The source code for the workflow is:

#
# A makefile to evaluate SNP calls.
#

# Use a subset of the chromosome for testing.


CHR = chr1

# The chromosome FASTA file


REF_URL = https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/${CHR}.fa.gz

# The reference genome.


REF = refs/${CHR}.fa.gz

# The rtg index for the reference genome.


SDF = refs/sdf/${CHR}

# The high confidence SNP calls for the 2021 GIAB dataset.
GIAB_2021_VCF = vcf/GIAB-2021.${CHR}.vcf.gz

# The high confidence regions for the 2021 GIAB dataset.


GIAB_2021_BED = vcf/GIAB-2021.bed

# The high confidence regions for the 2017 GIAB dataset.


GIAB_2017_BED = vcf/GIAB-2017.bed

# The high confidence SNP calls for the 2017 GIAB dataset.
GIAB_2017_VCF = vcf/GIAB-2017.${CHR}.vcf.gz

# The Coriell index (the published paper)


CORIEL_VCF = vcf/DEEP-2017.${CHR}.vcf.gz

# The deep variant calls made in 2023.


DEEP_VCF = vcf/DEEP-2023.${CHR}.vcf.gz

# GATK calls performed with the Biostar Workflows.


GATK_VCF = vcf/GATK-2023.${CHR}.vcf.gz

# BCFTOOLS calls performed with the Biostar Workflows.


BCF_VCF = vcf/BCFT-2023.${CHR}.vcf.gz

# What target to compare it against


TARGET = ${GIAB_2021_VCF}

# Additional flags for rtg vcfeval.


FLAGS =

# Limit the evaluation to the high confidence regions.


FLAGS = -e ${GIAB_2021_BED}

#
# Apply Makefile customizations.
#
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
MAKEFLAGS += --warn-undefined-variables --no-print-directory
SHELL := bash

# Print some usage information


usage:
> @echo "#"
> @echo "# Evaluate SNP calls"
> @echo "#"
> @echo "# TARGET = ${TARGET}"
> @echo "# FLAGS = ${FLAGS}"
> @echo "#"
> @echo "# Usage: make data index eval "
> @echo "#"

# Get the reference genome.


${REF}:
> mkdir -p $(dir ${REF})
> wget -q -nc ${REF_URL} -O ${REF}

# Trigger the reference download.


refs:${REF}
> @ls -lh ${REF}

# Index genome for rtg


${SDF}: ${REF}
> rtg format -o ${SDF} ${REF}

# Create the SDF index for rtg vcfeval.


index: ${SDF}
> @ls -lh ${SDF}

# Obtain the data


data:
> curl http://data.biostarhandbook.com/vcf/snpeval.tar.gz | tar zxvf -

# Run the evaluation.


eval: ${SDF}
>
> @printf "#\n# LIMIT=${FLAGS}\n#\n"
>
> @printf "#\n# ${TARGET} vs GIAB-2017\n#\n"
> @rtg vcfeval -b ${TARGET} -c ${GIAB_2017_VCF} -t ${SDF} -e ${GIAB_2017_BED} -o results/GIAB-2021-vs-GIAB-2017
>
> @printf "#\n# ${TARGET} vs DEEP-2017\n#\n"
> @rtg vcfeval -b ${TARGET} -c ${CORIEL_VCF} -t ${SDF} ${FLAGS} -o results/GIAB-2021-vs-DEEP-2017
>
> @printf "#\n# ${TARGET} vs DEEP-2023\n#\n"
> @rtg vcfeval -b ${TARGET} -c ${DEEP_VCF} -t ${SDF} ${FLAGS} -o results/GIAB-2021-vs-DEEP-2023
>
> @printf "#\n# ${TARGET} vs GATK-2023\n#\n"
> @rtg vcfeval -b ${TARGET} -c ${GATK_VCF} -t ${SDF} ${FLAGS} -o results/GIAB-2021-vs-GATK-2023
>
> @printf "#\n# ${TARGET} vs BCFTOOLS-2023\n#\n"
> @rtg vcfeval -b ${TARGET} -c ${BCF_VCF} -t ${SDF} ${FLAGS} -o results/GIAB-2021-vs-BCFT-2023


clean:
> rm -rf results

.PHONY: usage refs index eval

1.9 RNA-Seq analysis

1.9.1 RNA-Seq basics

In an RNA-Seq study, we operate on a concept called gene expression, a value assigned to
each gene or transcript. The gene expression value is assumed to correlate with the number of
transcripts present in the sample. A typical RNA-Seq study is a two-step process.

1. First we produce a gene expression (count) matrix:

flowchart LR

FASTQ(<b>FASTQ</b><br> Raw Sequencing Data<br>) --> METHOD{Counting <br> Method}


REF(<b>Reference</b> <br>Genome/Transcriptome) --> METHOD
METHOD --> COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>)

2. In the next stage we analyze the gene expression matrix

flowchart LR

COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}


DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Results</b><br>Effect Sizes and P-Values)

What is a gene expression

Gene expression is a bit of a misnomer, since only transcripts can be expressed, and there is
often ambiguity in how gene expression is defined. In general, gene expression is a count of
reads aligning to or overlapping with a region. Gene expression may be a count over an
individual transcript or a sum of counts over all transcripts that belong to a gene.

What is differential expression

Differential expression means detecting a statistically significant change in gene
expression between some grouping of the samples. For example, we may want to know which
genes' expression changes between two treatments, two strains, or two time points.

The gene expression number may be computed and expressed in different ways. The so-called
raw counts are preferable to other measures, because we can readily transform raw counts
into other units such as FPKM or TPM.
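To make the unit conversion concrete, here is a minimal R sketch of turning raw counts into
TPM; the counts and feature lengths below are made up for illustration:

# Hypothetical raw counts and feature lengths (in base pairs) for one sample.
counts <- c(geneA = 100, geneB = 250, geneC = 50)
lengths <- c(geneA = 1000, geneB = 2500, geneC = 500)

# Reads per kilobase, then rescale so that the sample sums to one million.
rpk <- counts / (lengths / 1000)
tpm <- rpk / sum(rpk) * 1e6
tpm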


Counts are always expressed relative to a genomic feature (a gene or a transcript) and
represent the number of reads that overlap with that feature.

What are replicates

Due to gene expression's inherent variability, all measurements need to be repeated several
times for each condition. Replicate numbers vary from 3 to 6 or more. The more replicates,
the more reliable the results - though there are diminishing returns at some point, and the cost
of the experiment may become the limiting factor.

What is a count file

A count file is a tab or comma-separated file that contains count data for each feature and each
replicate. For example:

name sample1 sample2 sample3 sample4 sample5 sample6


geneA 10 29 11 45 55 89
geneB 12 23 34 45 56 67
geneC 13 24 35 46 57 68
...

For demonstration purposes, we show the counts and design files in tabular format.

In practice, we prefer comma-separated files, and our tools always use them, since they can be
readily viewed in a spreadsheet program. In addition, tab-separated files cannot be reliably
copy-pasted from a web browser, which makes them awkward to use as examples.

What is a design file

A design file connects the sample names of the count file to a group. For example:

sample group
run1 WT
run2 WT
run3 WT
run4 KO
run5 KO
run6 KO

In this book, we call the above a design file. Other people may use different terminology such
as metadata, col data, or sample sheet.


In general, genes and samples may have other information (called metadata) associated with
them. For example, a gene may have a description, a common name, or a gene ontology
annotation. A sample may have a tissue or a time point associated with it. Here we show you
the bare minimum of information needed to perform a differential expression analysis.

Where to go next

To learn how to generate a count matrix see the tutorials at:

• RNA-seq simulations
• Generate RNA-Seq counts using HiSat2
• Generate RNA-Seq counts with Salmon

Once you have a count matrix and a design file you can perform differential gene expression
analysis as described in:

• RNA-Seq differential expression

1.9.2 RNA-Seq count simulations

Why use simulated counts

We could use real data, of course, but the trouble there is that we never know what the true
answer is, so we can't quite tell whether the results are correct. Simulated data, in contrast,
is generated from a model to which we know the answer. We can then compare the results of
the analysis to the known answer.

The first challenge is always to understand a differential gene expression analysis's inputs
and outputs. We need to learn to be confident when evaluating the results we produce.

The best way to understand how RNA-Seq differential expression works is to generate count
matrices with different properties and then investigate how the statistical methods perform on
that data. Our modules provide several methods to generate counts.

Generating NULL data

One kind of simulation generates null-hypothesis data with no changes. Using this
simulation, we can validate our methods, since any detection of a differentially expressed gene
would be a false positive.

flowchart LR
NULL(<b>Null Simulations</b> <br> No differential expression) --> COUNTS
NULL --> DESIGN
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Expect to Find Nothing</b><br>Effect Sizes and P-Values)

We provide a module named simulate_null.r that generates a null data set. The null data
was experimentally obtained by sequencing 48 replicates of the wild-type yeast strain. When
we run the simulation, our tool selects replicates at random from the 48 wild-type columns
and then randomly assigns them to one of two groups.
Note: Remember that you can run every R module with the -h flag to see help on its use.


Running the simulate_null.r script would generate a count matrix counts.csv and a
design file called design.csv . To perform the simulation execute:

Rscript src/r/simulate_null.r

that will print:

# Generating null data


# Input: src/data/barton_counts.csv
# Total rows: 7126
# Above minimum: 6063
# Replicates: 5
# Design: design.csv
# Counts: counts.csv

The data was published in the following papers, which are also interesting to read:

• Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
• How many biological replicates are needed in an RNA-seq experiment and which
differential expression tool should you use?

Generating realistic data

We also provide a method that generates realistic effects of differential expression. The genes
may be consistently up- or down-regulated, or they may be differentially expressed in a more
stochastic way. The simulations allow us to evaluate the specificity and sensitivity of the
methods.

flowchart LR
NULL(<b>Realistic Simulations</b> <br> Known effect sizes) --> COUNTS
NULL --> DESIGN
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Recover Known Effects</b><br>Effect Sizes and P-Values)

We provide a module that simulates counts based on the method called:

• PROPER: comprehensive power evaluation for differential expression using RNA-seq,
Bioinformatics (2015)


The module will generate a realistic dataset based on published data, containing rows with
differential expression changes. To run it, use the src/r/simulate_counts.r module:

Rscript src/r/simulate_counts.r

Prints:

# Initializing PROPER ... done


# PROspective Power Evaluation for RNAseq
# Error level: 1 (bottomly)
# All genes: 20000
# Genes with data: 4706
# Genes that changed: 1000
# Changes we can detect: 270
# Replicates: 3
# Design: design.csv
# Counts: counts.csv

The code above performed an RNA-Seq simulation and created the files: design.csv and
counts.csv

Read through the output and think about what each line means. For example, it claims that the
data will have 270 detectable changes. Open both the count and design files in Excel and
study their contents: the counts file has one row per gene and one column per sample, while
the design file assigns three samples to each of the two groups.


Identify which samples belong in the same group. Make some notes about what you can
observe in the data. Can you see rows that vary little within a group yet differ markedly
between the groups? Are those rows differentially expressed?

What to do next?

Now that you have a simulated data set, you can run the differential expression analysis on it.

• RNA-Seq differential expression

Experiment with the various simulation methods and try to verify that the results you get are
what you expect.

1.9.3 RNA-Seq differential expression

This chapter assumes you have a count matrix and a design file that specifies the relationships
between the columns of the count matrix.

flowchart LR
COUNTS(<b>Counts</b><br> Gene Expression Matrix<br>) --> STATS{Statistical <br> Method}
DESIGN(<b>Design</b> <br>Samples & Replicates<br>) --> STATS
STATS --> DE(<b>Results</b><br>Effect Sizes and P-Values)

We will now walk you through the steps of analyzing a count matrix.

Finding differential expression

We provide two modules that implement different statistical methods: edger.r and
deseq2.r .

• Empirical Analysis of Digital Gene Expression Data in R (edger)


• Moderated estimation of fold change with DESeq2 (deseq2)

Both modules take the same input and produce an output that is formatted the same way. The
only difference is the statistical method they use. By default, the tool will analyze the
counts.csv file using the grouping in the design.csv . You can of course change these
input parameters.

Note: Remember that you can run every R module with the -h flag to see help on its use.

My example count data was simulated with the simulate_counts.r method described in
the RNA-Seq simulations chapter.

Rscript src/r/simulate_counts.r

The tool when run as above will generate two files:

• counts.csv and design.csv

Investigate the two files to understand what they contain. You could also use the counts and
design files generated in the other tutorials.


Now let's run the edger.r module that will process the counts.csv and the design.csv
that the simulations produced.

Rscript src/r/edger.r

It produces the output:

# Initializing edgeR Tibble dplyr tools ... done


# Tool: edgeR
# Design: design.csv
# Counts: counts.csv
# Groups: A, B
# A : 3 samples
# B : 3 samples
# Method: classic
# Input: 20000 rows
# Removed: 15268 rows
# Fitted: 4732 rows
# Significant PVal: 380 ( 8.00 %)
# Significant FDRs: 219 ( 4.60 %)
# Results: edger.csv

Note the most important lines:

# Significant PVal: 380 ( 8.00 %)
# Significant FDRs: 219 ( 4.60 %)

It shows how many PValues ( 380 ) would be significant at a 0.05 threshold without adding
a multiple testing correction! Once we apply the FDR adjustment, we'll end up with 219
significant rows.

Now, strictly speaking, FDR is not the same concept as the PValue, and we don't need to apply
the same significance level to each. You can apply any other filtering criteria yourself. Read
the statistics chapter to learn more about the differences between PValues and FDRs.
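For example, here is a sketch of a stricter filter applied directly to the results file; it assumes
the FDR values are in the fifth column of edger.csv , so check the header and adjust the
index as needed:

# Keep the header plus the rows with an FDR below 0.01.
cat edger.csv | awk -F , 'NR == 1 || $5 < 0.01' > strict.csv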

Evaluate the results

In a realistic experiment, we would not know which genes have changed, but in a simulated
experiment, we do. If your input counts.csv data was simulated with the
simulate_counts.r method described in RNA-Seq simulations, then it also contains an
FDR column that indicates which genes were simulated to be differentially expressed.


We have an R script that can evaluate any two files that have FDR columns to find matching
rows. You can use this script to compare the results of your differential expression analysis to
the simulated results or to compare the results of different methods.

Rscript src/r/evaluate_results.r -a counts.csv -b edger.csv

On my system, it prints the following:

# Tool: evaluate_results.r
# File 1: 270 counts.csv
# File 2: 219 edger.csv
# 173 match
# 97 found only in counts.csv
# 46 found only in edger.csv
# Summary: summary.csv

So let's look at the summary:

1. Out of the 270 changes, edger found 173 that match.
2. There are 97 false negatives: genes that were differentially expressed but that edger did
not find.
3. There are 46 false positives: genes that edger found to be differentially expressed but that
were not.

Generate a PCA plot

Explaining in detail how PCA decomposition works is beyond the scope of this book. But you
do not need to fully understand how a PCA plot is created to be able to interpret it, just as you
don't need to know how a Burrows-Wheeler transform works to use the bwa aligner.

In a nutshell, a PCA transformation attempts to approximate each column of a matrix with a
weighted combination of variables called the principal components. The same components are
used to approximate each column, but for each column, the components may be weighted
differently. A PCA plot is a scatter plot of the weights of the first two principal components
for each column (sample).

A PCA plot is informative in that samples that are correlated with one another bunch up
together, while samples that differ end up more distant from one another.

The PCA plot is the first plot you should make from your normalized expression matrix, to
demonstrate that replication worked as expected.
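To demystify the mechanics a little, here is a minimal R sketch of the idea using the built-in
prcomp function on a made-up matrix standing in for a normalized count matrix (our
plot_pca.r module does considerably more than this):

# A made-up matrix: 100 genes (rows) by 6 samples (columns).
set.seed(42)
mat <- matrix(rnorm(600), nrow = 100, ncol = 6)

# PCA treats each sample as an observation, hence the transpose.
pca <- prcomp(t(mat), scale. = TRUE)

# Scatter plot of the first two principal components, one point per sample.
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")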


We can generate a PCA plot out of a results file ( edger.csv ) by running:

Rscript src/r/plot_pca.r -c edger.csv

# Initializing DESeq2 Tibble dplyr tools ... done


# Tool: PCA Plot
# Design: design.csv
# Counts: edger.csv
# Groups: A B
# A : 3 samples
# B : 3 samples
converting counts to integer mode
# PCA plot: pca.pdf

Our PCA plot above is informative and convenient to use, but you might want to customize it
at some point. You can edit our code, but we also recommend that you evaluate the following
package:

• PCAtools: everything Principal Component Analysis

Generate a heatmap

A heatmap is a high-level overview of the consistency of the statistical results that your
method produced. It allows you to assess the variability among replicates and the differences
between groups.


The most important thing to note is that the heatmap rescales values into so-called z-scores for
each row separately. The colors in one row are not related to the magnitudes in another row. It
normalizes the data so that the values are comparable across rows.

It transforms the numbers by subtracting the average of the row from each value in the row,
then dividing the result by the standard deviation of the row. Each number in the heatmap
(basically its color) indicates how far away (how many standard deviations) the value is from
the average of the row.

For example, a red (+2) would mean that the counts in that cell were two standard deviations
higher than the row average (the gene is up-regulated). A green (-1) would mean that the
counts in that cell were one standard deviation lower than the average.

Important: z-scores are not fold changes! z-scores are computed relative to the average of the
row.
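A minimal R sketch of the transformation for a single row (the numbers are made up):

# Counts for one gene across six samples.
x <- c(10, 12, 11, 45, 50, 48)

# Center on the row average, scale by the row standard deviation.
z <- (x - mean(x)) / sd(x)

# Negative values fall below the row average, positive values above it.
round(z, 2)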

The heatmap shows only the differentially expressed genes, and we want to see nice, uniform
red/green blocks showing that the data is consistent within replicates and differs between
groups.

Rscript src/r/plot_heatmap.r -c edger.csv

# Initializing gplots Tibble dplyr tools ... done


# Tool: Create heatmap
# Design: design.csv
# Counts: edger.csv
# Groups: A B
# A : 3 samples
# B : 3 samples
# Output: heatmap.pdf


Our heatmap visualization above is informative and convenient; at some point you might want
to customize it. You can edit our code, but we also recommend that you evaluate the following
packages:

• Bioconductor: ComplexHeatmap
• A simple tutorial for a complex ComplexHeatmap

Generate a volcano plot

This section is not yet written. (TODO)

Volcano plots represent a helpful way to visualize the results of differential expression
analyses.

• EnhancedVolcano: publication-ready volcano plots with enhanced colouring and labeling


So what are my differentially expressed genes?

First, we decide at what level of FDR we consider the data trustworthy, say 0.05 . Remember
that the cutoff applies to an FDR (false discovery rate, an error rate) and not a P-value (a
probability of making an error).

Since the results file is sorted by significance, all the genes above the cutoff line are the genes
we have found. In this case, the first 219 rows are the genes we consider differentially
expressed.

cat edger.csv | cut -f 1 -d , | head -10

Prints:

name
GENE-6491
GENE-4048
GENE-17700
GENE-9879
GENE-18457
GENE-6384
GENE-8179
GENE-6576
GENE-18742
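Since the results file is sorted by significance, a short sketch that extracts just those gene
names (the header plus the first 219 rows):

# Keep the first 219 genes, dropping the header line.
cat edger.csv | head -220 | cut -f 1 -d , | sed 1d > de-genes.txt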

What are the next steps?

The example above used simulated counts. Now learn how to generate count matrices from
sequencing data.

1.9.4 RNA-Seq counts with HiSat2

This tutorial will demonstrate the process of creating transcript abundance counts from RNA-
Seq sequencing data using a method that relies on genome alignments. The process diagram
will look like so:

flowchart LR

FASTQ(<b>FASTQ</b> <br> Sequencing Reads) --> ALN{RNA-Seq<br>Aligner};


GENOME(<b>FASTA</b> <br> Genomic Sequences) --> ALN
ALN --> BAM(<b>BAM</b><br>Alignment file)
BAM --> COUNT{Feature <br> Counter}
GTF(<b>GTF/GFF</b><br>Gene Annotations) --> COUNT
COUNT --> CSV(<b>Counts</b><br>Count matrix)

The reference file is the genome for the organism that contains all DNA: coding and non-
coding regions. The counting process will require another file, typically in GTF format,
describing the genome's coding regions.

In another tutorial titled RNA-Seq differential expression, we show how to analyze and
interpret the counts we produce as the results here.

The sequencing data

For this tutorial, we will use data from the publication:

• Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS
Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies,
Benjamin J. Ainscough, Obi L. Griffith.

The data consists of two commercially available RNA samples. The first is called Universal
Human Reference (UHR) and consists of total RNA isolated from a diverse set of 10 cancer
cell lines. The second dataset, named Human Brain Reference (HBR), is total RNA isolated
from the brains of 23 Caucasians, male and female, of varying ages but mainly 60-80 years
old. The authors also maintain a website for learning RNA-seq that you can access via:

• Introduction to bioinformatics for RNA sequence analysis

Note: We have greatly simplified the data naming and organization relative to the source. Our
approach is also quite different from that presented in the resource above.


Download the data

The sequencing data was produced in three replicates for each condition and sequenced in a
paired-end library. For this tutorial, we prepared a smaller subset (about 125 MB download)
filtered to contain only those reads that align to chromosome 22 (and the spike-in control).
You can get our subset by running the following command:

URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz

make -f src/run/curl.mk get URL=$URL ACTION=UNPACK

The module downloads the file into the data directory and then unpacks the contents of the
tar.gz archive.

Study your directories and note the files that have been created for you.

Understand the reference

To make the tutorial go faster, we will only use chromosome 22 of the human genome as a
reference. In addition, we have already made this sequence part of the download and placed it
into the refs directory. If you did not have that file, you would need to consult the How to
download genome data tutorial on how to get this data.

Note that we have access to a FASTA genome file, a GTF annotation file, and a transcriptome
file. Investigate each file to understand what each contains.

seqkit stats refs/*.fa

Prints:

file                  format  type  num_seqs  sum_len     min_len     avg_len     max_len
chr22.genome.fa       FASTA   DNA   1         50,818,468  50,818,468  50,818,468  50,818,468
chr22.transcripts.fa  FASTA   DNA   4,506     7,079,970   33          1,571.2     84,332


Understand your FASTQ files

The FASTQ files contain the raw sequencing reads. Look at files and identify the naming
structure.

seqkit stats reads/*.fq

That prints:

file format type num_seqs sum_len min_len avg_len max_len


reads/HBR_1_R1.fq FASTQ DNA 118,571 11,857,100 100 100 100
reads/HBR_2_R1.fq FASTQ DNA 144,826 14,482,600 100 100 100
reads/HBR_3_R1.fq FASTQ DNA 129,786 12,978,600 100 100 100
reads/UHR_1_R1.fq FASTQ DNA 227,392 22,739,200 100 100 100
reads/UHR_2_R1.fq FASTQ DNA 162,373 16,237,300 100 100 100
reads/UHR_3_R1.fq FASTQ DNA 185,442 18,544,200 100 100 100

Create the design file

Look at the reads folder and identify the roots for the names of the samples. You have files
that look like this:

reads/HBR_1_R1.fq
reads/HBR_2_R1.fq
...
reads/UHR_1_R1.fq
reads/UHR_2_R2.fq
...

The data appears to be a single-end library with three replicates per condition.

The root is the unique identifier in the name that allows us to quickly generate the full file
name from it. In this case, we note that the variable part of the names follows the patterns
HBR_1 and UHR_1 . From those patterns, we can generate all the unique file names we might
need. It is okay if your roots are not that short; you could list the whole file name if you wish.

We have to create the design file that lists each sample's root and group. We can do this by
hand or via an automated process (a sketch of the latter follows the listing). The design.csv
will look like so:

sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR


UHR_2,UHR
UHR_3,UHR

This file is your minimal metadata to get the process working.
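As an example of the automated route mentioned above, here is a sketch that derives the
design file from the FASTQ file names; it assumes the layout shown earlier, where the group
is the part of the root before the underscore:

# Build design.csv from the file names (HBR_1 -> group HBR).
ls reads/*_R1.fq | sed 's!reads/!!; s!_R1.fq!!' | \
    awk -F _ 'BEGIN { print "sample,group" } { print $0 "," $1 }' > design.csv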

Generate the index

A splice-aware short-read aligner must be used when aligning transcript sequences to the
genome. We had great success with the HiSat2 aligner, and we provide a module for it.

First, we need to index the reference genome. Indexing is a one-time operation that will take a
few minutes to complete.

make -f src/run/hisat2.mk index REF=refs/chr22.genome.fa

Take note of how your directory structure changed. The index has been created in the idx
folder.

Generate a single alignment

Once the indexing completes, aligning a single sample, say HBR_1 , would look like so:

make -f src/run/hisat2.mk align \
    REF=refs/chr22.genome.fa R1=reads/HBR_1_R1.fq BAM=bam/HBR_1.bam

The command prints:

118571 reads; of these:


118571 (100.00%) were paired; of these:
56203 (47.40%) aligned concordantly 0 times
61871 (52.18%) aligned concordantly exactly 1 time
497 (0.42%) aligned concordantly >1 times
----
56203 pairs aligned concordantly 0 times; of these:
156 (0.28%) aligned discordantly 1 time
----
56047 pairs aligned 0 times concordantly or discordantly; of these:
112094 mates make up the pairs; of these:
112035 (99.95%) aligned 0 times
58 (0.05%) aligned exactly 1 time
1 (0.00%) aligned >1 times
52.76% overall alignment rate
[bam_sort_core] merging from 0 files and 2 in-memory blocks...
samtools index bam/HBR_1.bam
-rw-r--r-- 1 ialbert staff 15M Dec 1 12:59 bam/HBR_1.bam


The output is a somewhat confusing wall of text with gems such as "aligned concordantly 0
times" ... what the heck is that supposed to mean ... but we move on and note that the final
alignment rate is 52%.

Where is your alignment file? Look in the bam folder. Note that we set the output BAM file
name from the command line.

Generate all alignments

We need to repeat the alignments for all six samples. We can automate the process using the
design file, but first, I like to put the roots into a separate file to make the subsequent
commands a lot shorter. The command below will keep only the HBR_1 and HBR_2 ... roots
one per line:

cat design.csv | cut -f 1 -d , | sed 1d > ids.txt

Where ids.txt will contain:

HBR_1
HBR_2
HBR_3
UHR_1
UHR_2
UHR_3

The next task is to use the ids.txt and parallel to create commands. Usually, it takes a
few tries to get the commands just right.

A practical troubleshooting approach is to add an echo in front of the command to see what
it generates to ensure it is correct. Below I am adding an echo to see the command that
parallel wants to generate:

cat ids.txt | parallel echo make -f src/run/hisat2.mk align \
    REF=refs/chr22.genome.fa R1=reads/{}_R1.fq BAM=bam/{}.bam

It does not matter if you are new or experienced; it will take some attempts to write everything
right. Keep editing and tweaking. Copy and paste just the first command and see whether it
works. The command above prints:

make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/HBR_1_R1.fq BAM=bam/HBR_1.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/HBR_2_R1.fq BAM=bam/HBR_2.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/HBR_3_R1.fq BAM=bam/HBR_3.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/UHR_1_R1.fq BAM=bam/UHR_1.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/UHR_2_R1.fq BAM=bam/UHR_2.bam
make -f src/run/hisat2.mk align REF=refs/chr22.genome.fa R1=reads/UHR_3_R1.fq BAM=bam/UHR_3.bam

Looks good now.

Once the first command is correct, all the others should work too, so next, remove the echo
from the start and have parallel execute the commands instead of echoing them on the screen:

cat ids.txt | parallel make -f src/run/hisat2.mk align \
    REF=refs/chr22.genome.fa MODE=SE \
    R1=reads/{}_R1.fq BAM=bam/{}.bam

Boom! It just works on all samples and runs automatically in parallel. Look at your bam
directory; you should see BAM files for each sample.

bam/HBR_1.bam
bam/HBR_2.bam
bam/HBR_3.bam
bam/UHR_1.bam
bam/UHR_2.bam
bam/UHR_3.bam

Make coverage tracks

BAM files are not the easiest to visualize, as IGV will only load them when sufficiently
zoomed in. For large genomes, BAM files can be quite a hassle to load up.

Bioinformaticians invented a file type called wiggle (specifically, its binary BigWig variant)
that allows them to visualize the coverage more efficiently. The file name extension for these
coverage tracks is .bw . We provide you with a module to make a BigWig file from a BAM
file:

make -f src/run/bigwig.mk wiggle \
    REF=refs/chr22.genome.fa BAM=bam/HBR_1.bam

The command above will generate a wiggle file called wig/HBR_1.bw that can be loaded into
IGV.


Once we figured out how to run one analysis, we can automate the creation of all bigwig files
with parallel :

cat ids.txt | parallel make -f src/run/bigwig.mk \
    wiggle REF=refs/chr22.genome.fa BAM=bam/{}.bam

The resulting files, as well as the refs/chr22.gtf annotation, may be visualized in IGV.
Note how the alignments cover only the exonic regions of the transcripts.

Counting reads over features

Now, finally, we get to produce the count matrix.

The featureCounts program will count the number of reads that overlap with each feature
in the annotation file. By default, the program expects a GTF file and will summarize exon-level
counts over genes . We can set up the program differently, but for now, we will use the
default settings:

featureCounts -a refs/chr22.gtf -o counts.txt bam/HBR_1.bam


Investigate the counts.txt file to see what it looks like. We can list multiple files to count
all the samples at once; for example, we could write:

featureCounts -p -a refs/chr22.gtf -o counts.txt \
    bam/HBR_1.bam bam/HBR_2.bam bam/HBR_3.bam \
    bam/UHR_1.bam bam/UHR_2.bam bam/UHR_3.bam

The above works and produces a separate column for each BAM file. We could also rely on
shell pattern matches, though some caveats apply:

# Works if we don't have other bam files there.


featureCounts -p -a refs/chr22.gtf -o counts.txt bam/*.bam

I will admit that I usually use the pattern above. But I know I'm playing with fire there, so I
double-check the column orders.

The most robust solution would be to generate the list of input files from the ids.txt and
pass that to the feature counter.

The --xargs parameter changes the behavior of parallel , collects all the input, and then
passes it to the command all at once. Thus, what we need is to generate the BAM file names
from the roots, then list them all in one shot like so:

cat ids.txt | \
parallel -k echo bam/{}.bam | \
parallel --xargs featureCounts -a refs/chr22.gtf -o counts.txt {}

We are not really double looping since the second parallel waits for input from the first.
We are using a convenient feature of parallel to create the file names.

The first parallel produces the bam file names, and the second parallel collects these
names and passes them to the featureCounts command. I am using the -k (keep order)
option to ensure that files are listed in the same order as in the ids.txt .

So now we have the counts.txt file. Usually, the file needs a little post-processing to make
it more useful. I, for one, like to add the sample names and gene names as a column and turn it
into a CSV file.


Post-processing the counts

The subsequent commands require that you either switch to the stats environment or
install the biomart and tximport libraries into your current environment. You could do
the latter with the following:
mamba install bioconductor-biomart bioconductor-tximport

Just watch out for any conflicts with other packages.

The resulting counts.txt file can be made more useful. We provide a
format_featurecounts.r module that rewrites the file to a CSV format. Run it with the
-h flag to see the help:

Rscript src/r/format_featurecounts.r -h

By default, it expects a file called counts.txt as input.

Rscript src/r/format_featurecounts.r

will print

# Reformating featurecounts.
# Input: counts.txt
# Output: counts.csv

When you look at the counts.csv file, you'll see that it contains Ensembl gene IDs that are
not easily recognizable. The need to fill in Ensembl gene names is such a common task that
we made all our scripts able to do it automatically.

Transcript to gene mapping

First, you need to obtain the mapping between the transcript and gene names. You can do it
with the following module src/r/create_tx2gene.r

As you can imagine, each organism has a different mapping; thus, we have to get the correct
mapping for the organisms we are working with. Run the module with the -s flag to show
you all the mappings it can produce.

Rscript src/r/create_tx2gene.r -s > names.txt


The listing inside the names.txt file shows that Homo sapiens can be accessed via
hsapiens_gene_ensembl . You only need to generate the transcript to gene mapping
once (it takes a while to get it, so keep it around). Let's create our transcript to gene id
mapping file:

Rscript src/r/create_tx2gene.r -d hsapiens_gene_ensembl

Prints:

# Create tx2gene mapping
# Dataset: hsapiens_gene_ensembl
# Connecting to ensembl.
# Submitting the query.
# Output: tx2gene.csv

The code above generated a file called tx2gene.csv . Investigate the file to see how it
connects the various identifiers.
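If you are curious how such a mapping is produced, below is a minimal sketch of the idea using the biomaRt Bioconductor package. This is an illustration of the approach, not the exact code of create_tx2gene.r, which adds option parsing and supports other datasets:

library(biomaRt)

# Connect to the Ensembl BioMart and select the human gene dataset.
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Fetch the transcript id, the gene id, and the informative gene name.
tx2gene <- getBM(
  attributes = c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"),
  mart = mart
)

write.csv(tx2gene, "tx2gene.csv", row.names = FALSE)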

Add informative gene names

Now that we have tx2gene.csv , we can use the mapping file to add the informative gene
names into the counts.csv file:

Rscript src/r/format_featurecounts.r -t tx2gene.csv

And voila, the counts.csv file carries the correct gene names.

The resulting counts.csv can be used with the How to perform an RNA-Seq differential
expression study tutorial.

The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/rnaseq-with-hisat.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.
set -uex

# The URL for the data.
URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz

# Genome reference.
REF=refs/chr22.genome.fa

# Genome annotation file.
GTF=refs/chr22.gtf

# The counts in tab delimited format.
COUNTS_TXT=counts.txt

# Final combined counts in CSV format.
COUNTS_CSV=counts.csv

# Making the design file.
cat << EOF > design.csv
sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR
UHR_2,UHR
UHR_3,UHR
EOF

# Download and unpack a tar archive.
make -f src/run/curl.mk get URL=$URL ACTION=UNPACK

# Generate the HISAT2 index.
make -f src/run/hisat2.mk index REF=${REF}

# Create the roots of the directory tree.
cat design.csv | cut -f 1 -d , | sed 1d > ids.txt

# Run the alignments.
cat ids.txt | parallel make -f src/run/hisat2.mk align \
    MODE=SE REF=${REF} BAM=bam/{}.bam \
    R1=reads/{}_R1.fq

# Generate the coverage tracks.
cat ids.txt | parallel make -f src/run/bigwig.mk \
    wiggle REF=${REF} BAM=bam/{}.bam

# Count the features.
cat ids.txt | parallel -k echo bam/{}.bam | \
    parallel -u --xargs featureCounts -a ${GTF} -o ${COUNTS_TXT} {}

# Reformat feature counts.
Rscript src/r/format_featurecounts.r -c ${COUNTS_TXT} -o ${COUNTS_CSV}


1.9.5 RNA-Seq counts with Salmon

This tutorial will demonstrate the process of creating transcript abundance counts from RNA-
Seq sequencing data using methods that rely on quantification (sometimes called
classification). The process diagram will look like so:

flowchart LR
  FASTQ(<b>FASTQ</b> <br> Sequencing Reads) --> ALN{Quantify};
  TRANS(<b>FASTA</b> <br> Transcriptome Sequences) --> ALN
  ALN --> CSV(<b>Counts</b><br>Count matrix)
  CSV --> CLEAN{Combine}
  GENE(<b>Gene Names</b><br>Transcript to Gene Mapping) --> CLEAN{Combine}
  CLEAN --> BETTER(<b>Counts</b><br>More informative<br>count matrix)

The reference file used during quantification is the organism's transcriptome that lists all
known transcripts as a FASTA sequence. The quantification process assigns each read in the
FASTQ file to a single transcript in the FASTA file. Since transcripts may share regions
(isoforms are similar to one another), the classifier must employ a sophisticated redistribution
algorithm to ensure that the counts are properly assigned across the various similar transcripts.

In another tutorial titled RNA-Seq differential expression, we show how to analyze and
interpret the counts we produce as the results here.

The sequencing data

The tutorial at RNA-Seq using HiSat2 describes the origin of the sequencing data and various
details of it.

URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz

make -f src/run/curl.mk get URL=$URL ACTION=UNPACK

The unpack command downloads and automatically extracts tar.gz files. As before, we
capture the layout of the data in the design.csv file, which will contain the following:

sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR
UHR_2,UHR
UHR_3,UHR

Obtaining the reference

Quantification processes operate on transcript sequences rather than genomic sequences. The
downloaded data we provided includes the transcript sequences.

The primary difference relative to the approach in RNA-Seq using a genome is that the
reference sequence will be the transcriptome and not the genome.

Generate a single classification

The field of quantification offers us two tools, kallisto and salmon . We will demonstrate
the use of the salmon program as it appears to be more actively maintained.

First, we need to index the reference transcriptome. Indexing is a one-time operation that will
take a few minutes to complete.

make -f src/run/salmon.mk index REF=refs/chr22.transcripts.fa

Once the indexing completes, aligning a single sample, say HBR_1, would look like so:

make -f src/run/salmon.mk align REF=refs/chr22.transcripts.fa \
    R1=reads/HBR_1_R1.fq SAMPLE=HBR_1

Running the code above will produce the file salmon/HBR_1/quant.sf, which we can inspect with:

cat salmon/HBR_1/quant.sf | column -t | head

Note how salmon places each output in a separate directory named after the sample; the
abundance file inside is always called quant.sf. The abundance file contains the following:

Name               Length  EffectiveLength  TPM       NumReads
ENST00000615943.1  113     5.501            0.000000  0.000
ENST00000618365.1  139     16.232           0.000000  0.000
ENST00000623473.1  54      54.000           0.000000  0.000
ENST00000624155.1  120     8.363            0.000000  0.000
ENST00000422332.1  1241    1067.519         0.000000  0.000
ENST00000612732.1  151     21.616           0.000000  0.000
ENST00000614148.1  115     6.274            0.000000  0.000
ENST00000614087.1  73      73.000           0.000000  0.000
ENST00000621672.1  82      82.000           0.000000  0.000

These are transcript-level counts. Note how the Name column contains Ensembl transcript identifiers.
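A quick way to sanity-check the quantification is to list the most abundant transcripts; a sketch, relying on TPM being the fourth column of quant.sf:

# Skip the header, sort by TPM (numeric, descending), show the top five.
sed 1d salmon/HBR_1/quant.sf | sort -k4,4gr | head -5 | column -t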

Generate all classifications

You can manually invoke the commands for each sample. It would be better practice to
automate it via parallel . First, we create the roots for the samples to make the commands
look simpler.

cat design.csv | cut -f 1 -d , | sed 1d > ids.txt

Then we can run the classification for all samples with the following:

cat ids.txt | parallel make -f src/run/salmon.mk align \
    REF=refs/chr22.transcripts.fa R1=reads/{}_R1.fq SAMPLE={}

Let's see all the files that have been created:

find . | grep quant.sf

Prints:

./salmon/UHR_3/quant.sf
./salmon/HBR_1/quant.sf
./salmon/UHR_2/quant.sf
./salmon/HBR_2/quant.sf
./salmon/HBR_3/quant.sf
./salmon/UHR_1/quant.sf

We have produced a separate count file for each sample. Next, we need to combine these files
into a single output.


Combine the counts

The subsequent commands require that you either switch to the stats environment or that
your current environment has the biomart and tximport libraries installed. You could do
the latter with the following:

mamba install bioconductor-biomart bioconductor-tximport

Just watch out for any conflicts with other packages.

To recap, we have a large number of abundance files in different folders, and each file has a
column with the counts. What we need to do next is to extract and glue those columns together
into a single file.

We have written a module that does just that, combines the counts into a single file:

Rscript src/r/combine_salmon.r

Produces the output:

# Combine salmon quantifications.
# Sample: design.csv
# Data dir: salmon
# Gene counts: FALSE
reading in files with read_tsv
1 2 3 4 5 6
# Results: counts.csv

The resulting file is:

name HBR_1 HBR_2 HBR_3 UHR_1 UHR_2 UHR_3
ENST00000359963.4 0 0 0 1 2 0
ENST00000430910.1 0 0 0 0 0 0
ENST00000558085.6 0 0 0 1 1.026 2.085
ENST00000592918.5 0 0 0 0 4.809 9.107
ENST00000400593.6 0 1.721 0 1.526 4.338 4.362
ENST00000592107.5 0 2.043 3.758 9.439 3.746 8.278
ENST00000426585.5 16 12 11 98.943 55.505 73.007
ENST00000591299.1 1 1.941 0 5.757 3.102 0
ENST00000588548.1 0 2.295 1.242 10.604 5.825 6.703
ENST00000438850.1 0 0 0 1.998 3 1
ENST00000383140.7 0 0 0 0 0 1.097


Note how the name column contains Ensembl transcript identifiers. You can check that each
column in the file above corresponds to the count column of the quant.sf for that sample.
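Under the hood, gluing salmon outputs together is exactly what the tximport Bioconductor package does. Below is a minimal sketch of the idea; it is an assumption about the module's internals, not its exact code:

library(tximport)

# One quant.sf file per sample, in the order given by ids.txt.
ids <- readLines("ids.txt")
files <- file.path("salmon", ids, "quant.sf")
names(files) <- ids

# txOut = TRUE keeps transcript-level counts; passing a tx2gene data frame
# with txOut = FALSE would summarize the counts over genes instead.
txi <- tximport(files, type = "salmon", txOut = TRUE)

write.csv(txi$counts, "counts.csv")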

Informative gene names

When performing any counting, we want the names to be unique, like ENST00000359963 , so
that there is no misunderstanding about which feature was used.

Informative gene names, on the other hand, are often acronyms that help biologists associate
the gene name with a function. The problem there is that the generic names may be reused
across species and even across different versions of the same species.

For example, the transcript named ENST00000463053.1 corresponds to gene
ENSG00000100288 and has the generic name of CHKB , which stands for Choline Kinase
Beta.

Unsurprisingly, life scientists always want to see informative gene names in their results. So
we need to be able to add these names to our unique identifiers. There are quite a few ways to
go about the process, and you are welcome to explore the various alternative solutions.

In the Biostar Handbook, we provide you with two scripts that assist with the process.

Transcript to gene mapping

First, you need to obtain the mapping between the transcript and gene names. You can do it
with the following module src/r/create_tx2gene.r

As you can imagine, each organism has a different mapping; thus, we have to get the correct
mapping for the organisms we are working with. Run the module with the -s flag to show
you all the mappings it can produce.

Rscript src/r/create_tx2gene.r -s > names.txt

The listing shows that Homo sapiens can be accessed as hsapiens_gene_ensembl . You
only need to generate this file once (it takes a while to get it, so keep it around).

Rscript src/r/create_tx2gene.r -d hsapiens_gene_ensembl


Prints:

# Create tx2gene mapping
# Dataset: hsapiens_gene_ensembl
# Connecting to ensembl.
# Submitting the query.
# Output: tx2gene.csv

Investigate the file tx2gene.csv to see how it connects the various identifiers on each line.

Add gene names to transcripts

Rerunning the combination script, but this time with the mapping file, will produce gene
names in the second column:

Rscript src/r/combine_salmon.r -t tx2gene.csv

Will generate a counts.csv file that has an additional column:

name gene HBR_1 HBR_2 HBR_3 UHR_1 UHR_2 UHR_3
ENST00000615943.1 U2 0 0 0 0 0 0
ENST00000618365.1 ENST00000618365.1 0 0 0 0 0 0
ENST00000422332.1 ACTR3BP7 0 0 0 0 1 0
ENST00000623543.1 ENSG00000280007 71.397 102.571 72.169 30.96 40.539 16.696
ENST00000330423.7 TUBA8 41.384 29.643 16.229 4.598 0 0
ENST00000416740.1 TUBA8 0 0 8.691 0 0 0
ENST00000608634.1 TUBA8 2.512 4.257 4.742 1 1 1
ENST00000215794.7 USP18 4 8 5 35.581 19.491 23
ENST00000434390.1 FAM230D 0 0 0 0 0 0
ENST00000617623.1 GGTLC5P 0 0.985 1.078 1.809 1.757 0
ENST00000619172.4 FAM230J 2 0 0 0 1.178 1.231
ENST00000618517.4 FAM230J 0 0 0 0 0 0

Some transcripts may not be present in the mapping file; other transcripts may not have a
known gene name associated with them.

Summarize over genes

Finally, our module also allows you to summarize over gene names. To have that work, we
need to have a mapping file that connects the transcript names to the gene names, and then we
need to invoke the module with the -G flag.

Rscript src/r/combine_salmon.r -t tx2gene.csv -G


Produces:

name             gene      HBR_1   HBR_2    HBR_3    UHR_1    UHR_2    UHR_3
ENSG00000008735  MAPK8IP2  402     541.101  475      14       18       8
ENSG00000015475  BID       68      86       55.105   112      74.001   54.71
ENSG00000025708  TYMP      13      11       19.001   48       5.397    26
ENSG00000025770  NCAPH2    79      83       79.001   212.001  123.081  166.001
ENSG00000040608  RTN4R     70      91.999   81       18       10       11
ENSG00000054611  TBC1D22A  36      46       42.001   73       50       67
ENSG00000056487  PHF21B    11      10       4        12       13       9
ENSG00000063515  GSC2      0       0        0        0        0        0
ENSG00000069998  HDHD5     28.001  41.999   33.001   125      79.001   138.001
ENSG00000070010  UFD1      46.813  70.001   57.003   262      183.001  188.853
ENSG00000070371  CLTCL1    28      34       32       87       61       70.999
ENSG00000070413  DGCR2     299     350      311.001  302      168.001  267
ENSG00000073146  MOV10L1   3       5        1        0        6        1

The resulting count file can be used as input for the differential expression modules.

The complete workflow

The complete workflow is included with the code you already have and can be run as:

bash src/scripts/rnaseq-with-salmon.sh

The source code for the workflow is:

#
# Biostar Workflows: https://www.biostarhandbook.com/
#

# Bash strict mode.
set -uex

# The URL for the data.
URL=http://data.biostarhandbook.com/data/uhr-hbr.tar.gz

# Transcriptome reference.
REF=refs/chr22.transcripts.fa

# Genome annotation file.
GTF=refs/chr22.gtf

# Making the design file.
cat << EOF > design.csv
sample,group
HBR_1,HBR
HBR_2,HBR
HBR_3,HBR
UHR_1,UHR
UHR_2,UHR
UHR_3,UHR
EOF

# Download and unpack a tar archive.
make -f src/run/curl.mk get URL=$URL ACTION=UNPACK

# Generate the Salmon index.
make -f src/run/salmon.mk index REF=${REF}

# Create the roots of the directory tree.
cat design.csv | cut -f 1 -d , | sed 1d > ids.txt

# Run the salmon alignments.
cat ids.txt | parallel make -f src/run/salmon.mk align \
    REF=${REF} R1=reads/{}_R1.fq SAMPLE={}

# Combine the counts into a single file.
Rscript src/r/combine_salmon.r -d design.csv -o counts.csv


1.9.6 RNA-Seq functional analysis

This chapter assumes we have a differential expression matrix and we wish to find the
interpretation of the results.

flowchart LR
  DE(<b>Differential Expression</b><br>Effect Sizes and P-Values) --> FUNC{Functional <br> Analysis}
  FUNC --> RES(<b>Functional Enrichment</b><br>Effect Sizes and P-Values)

Below we assume that we ran the RNA-Seq counts with Salmon tutorial and obtained the
resulting counts.csv file. Then we used the edger.r module as described in RNA-Seq
differential expression to produce the differential expression matrix. We assume that the
resulting file is called edger.csv .

For convenience, we also distribute this file separately, you can download it with

wget http://data.biostarhandbook.com/books/rnaseq/data/edger.csv

The main volume of the Biostar Handbook describes several functional enrichment tools out
of which we explore a few below.

Interpretation of biological data is best done in an interactive/exploratory manner. The tools
described below are meant to be used as a starting point for further exploration.

Gene Ontology enrichment

The bio gprofiler command uses the g:Profiler service to find the functional enrichment of
the genes in the differential expression matrix.

bio gprofiler -c edger.csv -d hsapiens

prints:

# Running g:Profiler
# Counts: edger.csv
# Organism: hsapiens
# Name column: gene
# Pval column: FDR < 0.05
# Gene count: 279
# Genes: IGLC2,SEPTIN3,SYNGR1,MIAT,SEZ6L,[...]
# Submitting to gProfiler
# Found 346 functions
# Output: gprofiler.csv

The resulting CSV file will list the various functions that are found to be enriched.

Do visit the main g:Profiler service as the results there will always be easier to navigate.
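If you would rather paste the gene list into the web service yourself, you can extract the significant genes from the shell. A sketch, assuming edger.csv carries gene and FDR columns; verify the column names in your file first:

# Keep the gene names where FDR < 0.05.
csvcut -c gene,FDR edger.csv | awk -F, 'NR > 1 && $2 < 0.05 { print $1 }' > genes.txt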

Enrichr

Enrichr integrates knowledge to provide synthesized information about mammalian genes and
gene sets. Read more about the protocols in the publication titled Gene Set Knowledge
Discovery with Enrichr in Curr Protoc. 2021

You can also use the web service directly.

• https://maayanlab.cloud/Enrichr/

Our tool bio implements the bio enrichr command to facilitate the functional
exploration of the genes in the differential expression matrix. To run the tool, execute:

bio enrichr -c edger.csv


The command processes the file, extracts the gene names with FDR < 0.05, and submits
these genes to the web service. The output of the tool will look like this:

# Running Enrichr
# Counts: edger.csv
# Organism: mmusculus
# Name column: gene
# Pval column: FDR < 0.05
# Gene count: 279
# Genes: IGLC2,SEPTIN3,SYNGR1,MIAT,SEZ6L,[...]
# Submitting to Enrichr
# User list id: 57476587
# Entries: 95
# Output: enrichr.csv

The resulting CSV file will list the various functions that are found to be enriched.


2. Workflows

2.1 Airway RNA-Seq

2.1.1 Stating the problem

The following chapter describes the process of reproducing the results of the paper:

• RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells PLoS One. 2014; 9(6): e99625.

There is also a Bioconductor-based workflow titled RNA-seq workflow: gene-level
exploratory analysis and differential expression for this same study. You might want to
contrast their approach to ours. The data from the paper can also be accessed in Bioconductor
via the airway package.

Notably, the workflow we present below is suited to reproduce any other RNA-Seq analysis
that makes use of pairwise comparisons.


2.1.2 Main findings

Here are the lessons we learned when we reproduced the results:

1. The most time-consuming part is obtaining the RNA-Seq data.
2. A small subset (just 2 million reads per sample) can recapitulate the main findings.
3. The results are reproducible even when applying different methods.

We are particularly pleased with the last observation. Far too often, scientific reproducibility is
mistakenly framed as being able to reproduce the same numerical results while using the same
analytical methods.

In reality, scientific reproducibility should mean validating the biological insights and
conclusions drawn using different and valid ways of study.

2.1.3 Workflow plan

The analysis is implemented with the process described in detail in the RNA-Seq with salmon
guide.

We will show the commands as shell commands, but our final product is a Makefile that
allows better reentrant behavior. Basically, a makefile lets us rerun certain parts and pick up
where we left off. The approach will also allow you to understand how bash commands are
deployed as make instructions.

The complete source code of the Makefile is included at the end of this chapter.

The Makefile is also included in the code distribution you already have and the entire
process described below can be run in one shot as:

make -f src/workflows/airway.mk airway all

Running the above will generate the file called design.csv


2.1.4 Accession numbers

The first step of reproducibility is identifying the accession numbers and the metadata that
connects biological information to files. So we had to read the paper and find the accession
numbers.

• RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells

The paper lists the GEO number GSE52778 ; visiting the GEO page, we can locate the
BioProject number PRJNA229998

• https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
• https://www.ncbi.nlm.nih.gov/bioproject/PRJNA229998

2.1.5 Create the design file

We start by obtaining the metadata deposited for the project:

bio search PRJNA229998 -H --csv --all > PRJNA229998.csv

As always, the output is a CSV file with possible dozens of columns with only some filled in.

Never underestimate the ability of scientists to overcomplicate things. Scientists are nerds,
and nerds love to make things complicated. Complications are a sign of intelligence and
importance - or so the belief goes.


Any time you have to decipher a file made by scientists, you should assume that it is a mess,
like the above file. Hard work and dedication are needed to tease apart the information you
need (swearing helps, at least it helps me a great deal).

Eyeballing the file in Excel, we identify columns of interest:

cat PRJNA229998.csv | csvcut -c run_accession,sample_title

The command above prints:

run_accession,sample_title
SRR1039508,N61311_untreated
SRR1039509,N61311_Dex
SRR1039510,N61311_Alb
SRR1039511,N61311_Alb_Dex
SRR1039512,N052611_untreated
SRR1039513,N052611_Dex
SRR1039514,N052611_Alb
SRR1039515,N052611_Alb_Dex
SRR1039516,N080611_untreated
SRR1039517,N080611_Dex
SRR1039518,N080611_Alb
SRR1039519,N080611_Alb_Dex
SRR1039520,N061011_untreated
SRR1039521,N061011_Dex
SRR1039522,N061011_Alb
SRR1039523,N061011_Alb_Dex

From the above list, we will focus on a single comparison, wherein the airway smooth muscle
cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-
inflammatory effects.

Thus, we must retain only the untreated samples and those treated with dexamethasone. We
are going to rename untreated samples to ctrl and dexamethasone-treated samples to dex ,
and we are going to separate cell types such as N61311 from the sample names.

We simplified and edited the file above and inserted another column for the treatment to create
our design.csv that will govern the analysis. There are many ways to label the data. We
chose to set up the following design matrix:

cat design.csv | column -t -s ,


The above formats the CSV file more nicely for readability:

run group celltype sample
SRR1039508 ctrl N61311 N61311_Ctrl
SRR1039512 ctrl N052611 N052611_Ctrl
SRR1039516 ctrl N080611 N080611_Ctrl
SRR1039520 ctrl N061011 N061011_Ctrl
SRR1039509 dex N61311 N61311_Dex
SRR1039513 dex N052611 N052611_Dex
SRR1039517 dex N080611 N080611_Dex
SRR1039521 dex N061011 N061011_Dex

We added a group , a celltype , and a sample column, as it later turns out they provide
useful information. You can add additional columns to the design file at any time.

If you use Excel on Windows and then plan to use the file via the Linux subsystem, you may
need to convert the file endings to Unix format. The simplest way to do this is to open the
CSV file with your editor and change the line endings.
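The conversion can also be done from the command line; a sketch using sed (on macOS the -i flag needs an empty string argument, as noted in the comment):

# Strip Windows carriage returns; on macOS use: sed -i '' 's/\r$//' design.csv
sed -i 's/\r$//' design.csv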

The Makefile we provide can generate the design file for you with:

make -f src/workflows/airway.mk airway

The command above will generate the design.csv file automatically.

2.1.6 Obtain the references

The study makes use of the human genome. Since we plan to use the salmon tool, we need to
obtain the transcriptome sequence for our target genome. Several resources may be used to
download this type of data. We will choose the Ensembl download site:

• https://ftp.ensembl.org/pub/

With some clicking and poking around we can locate the files we need.

• CDNA: http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/
Homo_sapiens.GRCh38.cdna.all.fa.gz


We can get the data in various ways: we can click, download, unzip, and rename the files, or
we can use curl as demonstrated below. In addition, we need to prepare the index to be
suitable for processing with salmon . Our Makefile will need to have the following lines:

# The URL for the CDNA.
CDNA_URL=http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# The name of the CDNA file locally.
CDNA=~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa

# Create the directory for the reference.
mkdir -p $(dir ${CDNA})

# Download and unzip the reference transcriptome.
curl ${CDNA_URL} | gunzip -c > ${CDNA}

# Trigger the reference indexing.
make -f src/run/salmon.mk index REF=${CDNA}

Try it out (remember that you can add the -n flag for a dry-run to see what make plans to
do):

make -f src/workflows/airway.mk genome

The command above will download and index the human transcriptome file. On my iMac it
takes around 5 minutes.

2.1.7 Automation pattern

The main automation pattern will have the following logic.

We will use the design.csv file and extract the content of various columns and use those
values to build the commands we want to run. Here is how that works:

cat design.csv | parallel --colsep , --header : \
    echo Process {run}.fastq and save to {sample}.txt

Will build and run the echo command in a loop, and makes it print the following:

Process SRR1039508.fastq and save to N61311_Ctrl.txt
Process SRR1039512.fastq and save to N052611_Ctrl.txt
Process SRR1039516.fastq and save to N080611_Ctrl.txt
Process SRR1039520.fastq and save to N061011_Ctrl.txt
Process SRR1039509.fastq and save to N61311_Dex.txt
Process SRR1039513.fastq and save to N052611_Dex.txt
Process SRR1039517.fastq and save to N080611_Dex.txt
Process SRR1039521.fastq and save to N061011_Dex.txt

The {run} and {sample} patterns are placeholders for the values in the run and sample
columns of the design.csv file. Using --header : with the parallel command will
extract and replace the placeholders with the values in the columns.

Once you understand what is happening above, you understand the entire automation process.
If you need help with parallel read the Art of Bioinformatics Scripting for additional
information.

Later, we will add the -v and --eta flags to parallel . The first flag shows the
commands that will be run; the second will print estimates for the completion of each task.
Both of these flags are optional, but I found these to be quite handy.

Every automation and looping you will ever do will be a variation of the command above. So
it is worth taking the time to understand it.

2.1.8 Download the reads

The project stores paired-end reads in SRA. We must set the MODE=PE parameter for all of
our workflows. We can automate the download from SRA with the following:

cat design.csv | parallel -v --eta --colsep , --header : \
    make -f src/run/sra.mk get \
    SRR={run} N=1000000 MODE=PE

Look at your reads directory to see what it contains. By selecting 1 million read-pairs per
sample, we reduce the dataset size to a few gigabytes, and the process can run in less than 20
minutes, even on a Macbook Air laptop on a home internet connection. Set N=ALL to
download the entire dataset.

Using the Makefile you can run the commands with:

make -f src/workflows/airway.mk fastq


2.1.9 Align the reads

Below we will run the process with 2 CPUs; you may run the process with as many CPUs as
you have available.

CDNA=~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa

# Run salmon alignments for all samples
cat design.csv | parallel -v --eta --colsep , --header : -j 2 \
    make -f src/run/salmon.mk align \
    REF=${CDNA} SAMPLE={sample} MODE=PE \
    R1=reads/{run}_1.fastq R2=reads/{run}_2.fastq

The process takes about 3 minutes on my iMac.

Using the Makefile you can run the same commands with:

make -f src/workflows/airway.mk align

2.1.10 Create the counts

You may need to switch to the stats environment to do the statistics

# Create a directory for the transcript to gene mapping
mkdir -p ~/refs/tx2gene

# Create the ensemble transcript to gene mapping
Rscript src/r/create_tx2gene.r -d hsapiens_gene_ensembl -o ~/refs/tx2gene/hsapiens.csv

# Combine salmon outputs into the final counts.
Rscript src/r/combine_salmon.r -d design.csv -G -t ~/refs/tx2gene/hsapiens.csv -o counts.csv

Using the Makefile you can run the same commands with:

# You may need to switch to the stats environment to do the statistics
conda activate stats

# Run the count file generation
make -f src/workflows/airway.mk counts


2.1.11 Plot the PCA

# PCA plot by group.
Rscript src/r/plot_pca.r -f group -c counts.csv -o group.pdf

# PCA plot by celltype.
Rscript src/r/plot_pca.r -f celltype -c counts.csv -o celltype.pdf

View the PCA plot. It looks promising.

The samples separate cleanly both by treatment and by cell type. At this point, we are
getting more confident that the analysis will produce meaningful results.

2.1.12 Differential expression analysis

You can use the design.csv file to generate the differential expression analysis results:

# Run edgeR.
Rscript src/r/edger.r -d design.csv -f group -c counts.csv -o edger.csv

the report states:

# Initializing edgeR tibble dplyr tools ... done
# Tool: edgeR
# Design: design.csv
# Counts: counts.csv
# Sample column: sample
# Factor column: group
# Factors: ctrl dex
# Group ctrl has 4 samples.
# Group dex has 4 samples.
# Method: glm
# Input: 38254 rows
# Removed: 29064 rows
# Fitted: 9190 rows
# Significant PVal: 1795 ( 19.50 %)
# Significant FDRs: 535 ( 5.80 %)
# Results: edger.csv

The results are stored in edger.csv.

2.1.13 Plot a heatmap

Rscript src/r/plot_heatmap.r -c edger.csv -d design.csv -f group -o heatmap.pdf

The heatmap looks very lovely and clean.

We can see that the samples are separated by treatment. We can see more genes upregulated in
the Dex samples.


2.1.14 Do the results validate?

From the paper:

Based on a Benjamini-Hochberg corrected p-value <0.05, we identified 316 differentially
expressed genes, including both well-known (DUSP1, KLF15, PER1, TSC22D3) and less
investigated (C7, CCDC69, CRISPLD2) glucocorticoid-responsive genes. CRISPLD2, which
encodes a secreted protein previously implicated in lung development and endotoxin
regulation [...]

Let's verify that the above genes are differentially expressed in our data. Running such checks
typically requires that you write various programs.

The task is so common that in this book we provide you with a generic script that can compare
the contents of any files with FDR columns. To use it, we need to create a new file that lists the
genes found to be differentially expressed in the paper. The file pub.csv would look like this:

gene,PAdj,FDR
DUSP1,0,0
KLF15,0,0
PER1,0,0
TSC22D3,0,0
C7,0,0
CCDC69,0,0
CRISPLD2,0,0
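To create the file from the shell, a here-document works, following the same pattern the workflows use for design files:

cat << EOF > pub.csv
gene,PAdj,FDR
DUSP1,0,0
KLF15,0,0
PER1,0,0
TSC22D3,0,0
C7,0,0
CCDC69,0,0
CRISPLD2,0,0
EOF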

We add FDR=0 to each row to indicate that the gene was considered to be differentially
expressed. We then use the evaluate_results.r script to compare the genes in
edger.csv to pub.csv :

Rscript src/r/evaluate_results.r -a edger.csv -b pub.csv -c gene

The results of the script clearly show that we can reproduce all the genes from the paper.

# Tool: evaluate_results.r
# 535 in edger.csv
# 7 in pub.csv
# 7 found in both
# 528 found only in edger.csv
# 0 found only in pub.csv


And we did so by using just 2 million reads per sample, which is about 10% of the total data
on average!

2.1.15 Good data doesn't need to be big

A lesson to remember.

Good data does not need to be big data.

We can reproduce the paper's main findings with just 2 million reads per sample (10% of the
total data).

2.1.16 Gene ontology enrichment

The paper reports various findings; for example, regarding the functions enriched among the differentially expressed genes:

We identified 316 significantly differentially expressed genes representing various functional
categories such as glycoprotein/extracellular matrix, vasculature, and lung development,
regulation of cell migration, and extracellular matrix organization.

Generate g:Profiler results:

bio gprofiler -c edger.csv -d hsapiens

The results are stored in gprofiler.csv and contain terms enriched in the upregulated and
downregulated genes. Let's search the annotations for the terms that are mentioned in the
paper:

cat gprofiler.csv | csvcut -c 2,3 | grep -iE 'vessel|lung|matrix'

Prints:

GO:0007160,cell-matrix adhesion
GO:0030198,extracellular matrix organization
GO:0001952,regulation of cell-matrix adhesion
GO:0048514,blood vessel morphogenesis
GO:0001568,blood vessel development
GO:0031012,extracellular matrix
GO:0062023,collagen-containing extracellular matrix
GO:0050840,extracellular matrix binding
GO:0005201,extracellular matrix structural constituent
HPA:0301361,lung; alveolar cells type I[≥Low]
HPA:0300201,lung; endothelial cells[≥Low]
HPA:0301362,lung; alveolar cells type I[≥Medium]
HPA:0300000,lung
HPA:0300411,lung; macrophages[≥Low]
KEGG:05222,Small cell lung cancer
REAC:R-HSA-1474244,Extracellular matrix organization

A complete ontology interpretation is beyond the scope of this tutorial, but we can see that the
results are consistent with the paper.

2.1.17 The complete workflow

The complete workflow is included with the code you already have and can be run as:

make -f src/workflows/airway.mk airway all

The source code for the workflow is:

#
# RNA-Seq from the Biostar Workflows
#
# http://www.biostarhandbook.com
#

# How many reads to download. N=ALL for all.
N ?= 1000000

# The number of CPUs to use.
NCPU ?= 2

# Library type. SE=Single End, PE=Paired End.
MODE ?= PE

# Check the make version.
ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "#This makefile needs GNU Make 4.0 or later")
endif

# Apply Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The URL for the CDNA.
CDNA_URL ?= http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# The name of the CDNA file locally.
CDNA ?= ~/refs/hsapiens/Homo_sapiens.GRCh38.cdna.all.fa

# The name of the ensembl database to use for transcript to gene mapping.
ENSEMBL_DB ?= hsapiens_gene_ensembl

# The transcript to gene mapping file.
TX2GENE ?= ~/refs/tx2gene/${ENSEMBL_DB}.csv

# The counts file.
COUNTS ?= counts.csv

# The design matrix.
DESIGN ?= design.csv

# Print help by default.
usage:
> @echo "#"
> @echo "# RNA-Seq pipeline from the Biostar Workflows (MODE=${MODE})"
> @echo "#"
> @echo "# Usage: make genome fastq align counts edger"
> @echo "#"

# Trigger all steps at once.
all: genome fastq align edger

# Download and unzip the reference transcriptome.
${CDNA}:
> mkdir -p $(dir ${CDNA})
> curl ${CDNA_URL} | gunzip -c > ${CDNA}

# Trigger the reference indexing.
genome: ${CDNA}
> make -f src/run/salmon.mk index REF=${CDNA} NCPU=${NCPU}

# Create the design matrix.
airway:
> @cat << EOF > ${DESIGN}
>run,group,celltype,sample
>SRR1039508,ctrl,N61311,N61311_Ctrl
>SRR1039512,ctrl,N052611,N052611_Ctrl
>SRR1039516,ctrl,N080611,N080611_Ctrl
>SRR1039520,ctrl,N061011,N061011_Ctrl
>SRR1039509,dex,N61311,N61311_Dex
>SRR1039513,dex,N052611,N052611_Dex
>SRR1039517,dex,N080611,N080611_Dex
>SRR1039521,dex,N061011,N061011_Dex
> EOF
> @echo "# Created airway design: ${DESIGN}"

# Download the reads. Set N=ALL to get all.
fastq: ${DESIGN}
> cat design.csv | parallel -v --eta --colsep , --header : \
>   make -f src/run/sra.mk get \
>   SRR={run} N=${N}

# Run alignments for all samples.
align: ${DESIGN}
> # Run the alignment on each sample.
> cat design.csv | parallel -v --eta --colsep , --header : -j ${NCPU} \
>   make -f src/run/salmon.mk align \
>   REF=${CDNA} SAMPLE={sample} MODE=${MODE} \
>   R1=reads/{run}_1.fastq R2=reads/{run}_2.fastq

# Create the ensemble transcript to gene mapping.
${TX2GENE}:
> mkdir -p $(dir $@)
> Rscript src/r/create_tx2gene.r -d ${ENSEMBL_DB} -o $@

# Combine salmon outputs into the final counts.
${COUNTS}: ${DESIGN} ${TX2GENE}
> Rscript src/r/combine_salmon.r -d ${DESIGN} -G -t ${TX2GENE} -o ${COUNTS}

# Single command to generate the counts.
counts: ${COUNTS}

# Run edger differential expression analysis.
edger: ${COUNTS} ${DESIGN}
> Rscript src/r/edger.r -d design.csv


2.2 Presenilin RNA-Seq

2.2.1 Stating the problem

This chapter will demonstrate the use of workflows when attempting to reproduce the results
from the publication titled:

• Brain transcriptome analysis of a familial Alzheimer’s disease-like mutation in the zebrafish presenilin 1 gene implies effects on energy production Molecular Brain volume 12, Article number: 43 (2019)

The publication is a so-called "micro-report" that, according to the journal, is designed for
presenting small amounts of data or research that otherwise remain unpublished, including
important negative results, reproductions, or exciting findings that someone wishes to place
rapidly into the public domain.

The paper starts by recapitulating that a mutation in the PRESENILIN 1 gene ( PSEN1 ) is
known to be associated with the early onset of familial Alzheimer's Disease (EOfAD).


The scientists have introduced an EOfAD-like mutation: Q96_K97del into the endogenous
psen1 gene of zebrafish and then analyzed transcriptomes of young adult (6-month-old)
entire brains from a family of heterozygous mutant and wild-type sibling fish.

The main finding of the paper is that according to gene ontology (GO) analysis of the results,
the mutation has an effect on mitochondrial function, particularly ATP synthesis, and on ATP-
dependent processes, including vacuolar acidification.

2.2.2 Main findings

Here are the lessons we learned while attempting to reproduce the results:

1. We had trouble reproducing the results of this publication.
2. We have come to believe that the data does not support the conclusions of the paper.
3. A small subset of the experimental data that can be processed in just 10 minutes demonstrates the same effects as the entire study.

Another way to say this: collecting more data cannot save an inconclusive analysis.

2.2.3 Workflow plan

In the chapter titled Airway RNA-Seq in 10 minutes, we have developed a fully reusable
pipeline for RNA-Seq analysis. The current process will follow the same steps but with minor
modifications.

The experiment studies a different organism, and a new design file needs to be set up. In
addition, the reads in this experiment are single-end rather than paired-end; thus, we have to
set MODE=SE for the pipeline. But the vast majority of processes remain the same.

Our new Makefile can simply include the original Makefile and override certain
variables. In a nutshell, our new Makefile will look like this:

# Set the layout mode to single end.
MODE=SE

# The URL for the CDNA.
CDNA_URL = http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz

# The name of the CDNA file locally.
CDNA = ~/refs/drerio/Danio_rerio.cdna.fa

# The name of the ensemble database.
ENSEMBL_DB = drerio_gene_ensembl

# Run the original Makefile.
include src/workflows/airway.mk

What is most important, though, is that we don't need to write any new code. Everything is
ready to go. That is the productivity boost of reusable components.

Note: Run the new study in a separate folder, and do not mix the results with the airways
study. Confusion is the biggest enemy of reproducibility.

The Makefile is also included in the code distribution you already have and can be run
right away as:

make -f src/workflows/presenilin.mk psen all

2.2.4 Accession numbers

Finding metadata is usually the hardest part of automating a bioinformatics project. The
metadata could be distributed in different sources; sometimes, it is located in the
supplementary information; sometimes, it is encoded in the file names, and so on.

The GEO number GSE126096 is provided in the paper. SRA project number PRJNA521018
is given in the paper and may be viewed on the NCBI SRA site:

• GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126096
• SRA: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA521018/

2.2.5 Create the design file

We can obtain the run information for project PRJNA521018 in CSV format:

bio search PRJNA521018 --header --all --csv > PRJNA521018.csv

Inspect the resulting file to get a sense of its contents.
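For example, csvcut from csvkit (used in the airway chapter) can list the column names; a quick sketch:

# Show the names of the columns present in the metadata file.
csvcut -n PRJNA521018.csv | head -20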


The resulting file has 128(!) columns, with some filled in and some not. Thankfully some
contain the information that connects the SRR number to the sample name. It appears that we
have two conditions with four replicates each. The information from the data above allows us
to manually generate the design.csv matrix for the workflow.

A copy-paste-ready version for creating the design matrix is provided below:

run,group,sample
SRR8530750,WT,WT_1
SRR8530751,WT,WT_2
SRR8530752,WT,WT_3
SRR8530753,WT,WT_4
SRR8530754,Q96,Q96_1
SRR8530755,Q96,Q96_2
SRR8530756,Q96,Q96_3
SRR8530757,Q96,Q96_4

To generate the design file with our Makefile , we can run the following:

make -f src/workflows/presenilin.mk psen

2.2.6 Obtain references

The study makes use of the zebrafish genome. The Latin name for zebrafish is Danio rerio.
Several resources may be used to download this genome data. We will choose the Ensembl
download site:

• https://ftp.ensembl.org/pub/

With some clicking and poking around, we can locate the files we need.

• CDNA: http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/cdna/
Danio_rerio.GRCz11.cdna.all.fa.gz


The steps here are identical to those described in Airway RNA-Seq in 10 minutes

make -f src/workflows/presenilin.mk genome

2.2.7 Download the reads

The workflow is nearly identical to that described in the Airway RNA-Seq chapter; we just
added MODE=SE and set the number of reads to download:

make -f src/workflows/presenilin.mk fastq N=4000000

This downloads 4 million reads per sample.

2.2.8 Align the reads

Nothing needs to be changed from before.

make -f src/workflows/presenilin.mk align

The process takes around 4 minutes to complete on an iMac.

2.2.9 Counting the reads

You may need to switch to the stats environment: conda activate stats depending on
how the installation process went for you:

make -f src/workflows/presenilin.mk counts

2.2.10 Generate the PCA plot

Let's generate the plot; by trial and error, we found how much we need to nudge the label
above the points to make it readable.

Rscript src/r/plot_pca.r -f group --nudge 0.2


The samples in the counts matrix do not separate cleanly by treatment. Considering that one
group is a mutant with supposedly large changes in the transcriptome, this is not a good sign.

2.2.11 Differential expression analysis

You can use the design.csv file to generate the differential expression analysis results:

# Run edgeR.
Rscript src/r/edger.r -d design.csv -f group

We obtain no significant FDRs.

# Significant PVal: 851 ( 5.50 %)
# Significant FDRs: 0 ( 0.00 %)

In the airway analysis, we saw that using a subset of about 2 million reads per sample could
provide us with a good separation of the samples. Here we used 4 million reads (on a much
smaller genome), yet we could not find anything significant.


Up to this point, we only used a subset of the data, and as we have shown before, taking a subset
of the data should show a subset of the results - at the very least, some results. Since we have
not found anything, we can repeat the run with the complete data.

We can do this by deleting the reads folder ( rm -rf reads ), then changing the N parameter
to ALL and rerunning each step.

make -f src/workflows/presenilin.mk psen all N=ALL NCPU=8

I will call the resulting counts file full.csv, and all subsequent analyses will be performed
on this full file.

Rscript src/r/edger.r -c full.csv -f group

We get the same results as before:

# Significant PVal: 1377 ( 6.50 %)
# Significant FDRs: 0 ( 0.00 %)

2.2.12 P-hacking or being smart?

So far, things are not looking good!

1. The PCA plot of the count file does not separate the samples.
2. A straightforward and standard differential expression analysis does not find any significant
gene.

Both of these signs indicate that something is not going right.

When faced with problems like this, scientists start window shopping for alternative methods.
They explore tools until something works. This is called p-hacking and counts as one of the
sketchiest practices in science.

On the other hand, the very essence of science is exploration. We have to try different things
and see what works. Biological data is complicated; it is very much possible that the data we
have is not suitable for the analysis we are trying to perform on it. A specific dataset may be
particularly ill-suited for a certain type of analysis. We ought to explore the data and see
what it tries to tell us.


At what point should we stop exploring? At what point does it become p-hacking? Smart
people are the best p-hackers and can do so without even realizing it. Someone less skilled
would give up after a few failed attempts; a talented data analyst can come up with the coolest
tricks ever.

As you can see, the answer is far less clear-cut than anticipated.

2.2.13 Let's try other methods

For example, edgeR has a so-called "classic" method that is supposed to be less performant.
We could try that:

Rscript src/r/edger.r -c full.csv -f group -m classic

Now we are getting results. Where there was nothing before, we get 13 genes:

# Significant PVal: 613 ( 2.90 %)
# Significant FDRs: 13 ( 0.10 %)

Can we do better? Let's try deseq2 :

Rscript src/r/deseq2.r -d design.csv -c full.csv -f group

Running that, we get 45 genes that pass the FDR threshold:

# Significant PVal: 1377 ( 5.3 %)
# Significant FDRs: 45 ( 0.2 %)

Could we do better?

We can remove lines where the expression level is low. When we do so, we reduce the number
of comparisons in the multiple-test correction.

In our code, we already remove lines with fewer than three counts, but we can make the
threshold even stricter. Let's only keep rows where at least three samples have a count of 10 or
higher.


Copy the src/r/deseq2.r script to a local version and find the line where we filter data. It
can be changed to be stricter with the following:

# At least 3 samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= 3

After this change, the command:

Rscript deseq2.r -d design.csv -c full.csv -f group

will produce

# Significant PVal: 1262 ( 5.8 %)
# Significant FDRs: 63 ( 0.3 %)

So we started with no differential expression, but now we have 63 genes that pass the FDR
threshold.

Think about it from a scientist's point of view. Before, they had nothing to show. Now they
have 63 genes that can be talked about and published.

But are the results correct?

2.2.14 Plotting the heatmap

Let us generate the heatmap of the 63 differentially expressed genes:

Rscript src/r/plot_heatmap.r -c deseq2.csv

Inspect the resulting heatmap.


2.2.15 Do the results validate?

The authors in the paper state that:

In total, 251 genes were identified as differentially expressed (see Additional file 1). Of these,
105 genes showed increased expression in heterozygous mutant brains relative to wild-type
sibling brains, while 146 genes showed decreased expression.

The authors provide a list of genes in the supplementary material. We downloaded the
supplementary file, stored the genes in one column, and created an FDR column filled with zeros.

cat pub.csv | head

that prints

gene,FDR
si:ch211-235i11.4,0
psmb5,0
si:ch211-235i11.6,0
si:dkey-206p8.1,0
clcn3,0
gpr155b,0
zgc:154061,0
dph6,0
CABZ01053323.1,0

We can now compare the genes in the supplementary material with the genes we found:

Rscript src/r/evaluate_results.r -a deseq2.csv -b pub.csv -c gene

We find that only 15 genes overlap; most of the published genes do not pass the FDR threshold in our analysis:

# Tool: evaluate_results.r
# 63 in deseq2.csv
# 251 in pub.csv
# 15 found in both
# 48 found only in deseq2.csv
# 235 found only in pub.csv
# Summary: summary.csv

To see which genes are in common:

Rscript src/r/evaluate_results.r -a deseq2.csv -b pub.csv -c gene --show_common

2.2.16 Gene ontology enrichment

The authors in the paper state that:

Gene ontology (GO) analysis implies effects on mitochondria, particularly ATP synthesis,
and on ATP-dependent processes, including vacuolar acidification.

bio gprofiler -c deseq2.csv -d drerio

None of the reported functions appear to match.

We conclude that the scientific insights presented in this paper cannot be reproduced.

2.2.17 Personal note

We want to note that, in our opinion, the list of genes provided by the authors is almost
certainly incorrect. And we suspected as much even before we performed the analysis.


The fold changes (effect sizes) reported by the authors in their main report file are extremely
small (indicated by the log2FC column in Additional file 1). Most genes the authors
report as being differentially expressed appear to change between 10% and 30% between
conditions (as fold change, not log2FC ). The median fold change is just 25% (up or
down), and half the genes have changes smaller than that value.
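To make these magnitudes concrete, recall how the two scales convert; in R:

# A 25% increase expressed as a log2 fold change, and back again.
log2(1.25)       # 0.32
2 ^ log2(1.25)   # 1.25, i.e., a 25% change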

It is exceedingly unlikely that their methodology, in particular, and RNA-Seq data, in general,
is appropriate to detect changes of such a small magnitude.

In our opinion, even a cursory (but critical) examination of the gene expression file should
have told the authors (and reviewers) that the results, as reported, are very unlikely to be
correct.

But now what? Well ... nothing.

This wouldn't be the first nor the last time that a paper is published with incorrect
results. What about the other thousands of papers just like it?

We understand the pressures and demands put on data analysts better than most. There is no
reason to believe the authors intentionally or deliberately published incorrect results. They just
hacked away at it until it worked. The pressure cooker of academia is a powerful force!
Publish or perish is the name of the game.

At some point, the connection with reality was lost, and all that was left was shuffling endless
rows of numbers through cutoffs and thresholds.


We move on and learn from it.

It could also very well be that we did something incorrectly. That's always a possibility.

One of the main challenges, and a reason people don't publicly question research results, is that
the burden of proof for refuting a published result is substantially higher than it was for
publishing it in the first place.

We did, in fact, spend significantly more time with the data analysis than what we present
above. We have attempted to validate it in several other ways as well. We wanted the results to
be true so badly!

What a beautiful story it would have been to find that a single 6 base deletion causes massive
changes in the gene expression of the brain, leading to early onset Alzheimer's disease. It
would be a triumph of science, bioinformatics and a great story to tell.

Alas, we could not make it work with the data at hand.

The findings could be true ... all we are saying here is that the data in this paper is not
sufficient to support the conclusions.

2.2.18 The complete workflow

The complete workflow is included with the code you already have and can be run as:

make -f src/workflows/presenilin.mk psen all

The source code for the workflow is:

#
# Presenilin RNA-Seq in the Biostar Workflows
#
# http://www.biostarhandbook.com
#

# Single end mode.
MODE = SE

# How many reads to download. N=ALL for all.
N = 4000000

# The number of CPUs to use.
NCPU = 2

# Check the make version.
ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "#This makefile needs GNU Make 4.0 or later")
endif

# Apply Makefile customizations.
.RECIPEPREFIX = >

# The URL for the CDNA.
CDNA_URL=http://ftp.ensembl.org/pub/release-107/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz

# The name of the CDNA file locally.
CDNA=~/refs/drerio/Danio_rerio.cdna.fa

# The name of the ensembl database to use for transcript to gene mapping.
ENSEMBL_DB = drerio_gene_ensembl

# The transcript to gene mapping file.
TX2GENE = ~/refs/tx2gene/drerio_tx2gene.csv

# The counts file.
COUNTS = counts.csv

# The design matrix.
DESIGN = design.csv

# Include the RNA-Seq pipeline.
include src/workflows/airway.mk

# Create the design matrix for this workflow.
psen:
> @cat << EOF > ${DESIGN}
> run,group,sample
> SRR8530750,WT,WT_1
> SRR8530751,WT,WT_2
> SRR8530752,WT,WT_3
> SRR8530753,WT,WT_4
> SRR8530754,Q96,Q96_1
> SRR8530755,Q96,Q96_2
> SRR8530756,Q96,Q96_3
> SRR8530757,Q96,Q96_4
> EOF
> @echo "# Created presenilin design: ${DESIGN}"


3. Modules

3.1 Introduction

Below we list the makefile modules that wrap the various tools used in the book. The modules
are organized by the type of analysis they perform.

See the installation for instructions on how to install the modules.

The modules are self-contained and can be run independently but some may require input data
generated by other modules.

3.1.1 General tips

Each module may be run independently and will print its usage information when run without
any arguments. For example:

make -f src/run/bwa.mk

will print:

#
# bwa.mk: align reads using BWA
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bwa.bam
#
# make index align
#

Each module may be run in test mode:

make -f src/run/bwa.mk test


3.2 Data modules

3.2.1 sra.mk

The Sequence Read Archive (SRA) is a public repository of short-read sequencing data in FASTQ
format.

• Home: https://www.ncbi.nlm.nih.gov/sra

The sra.mk module assists with obtaining reads from the SRA repository.

SRA.MK USAGE

make -f src/run/sra.mk

SRA.MK HELP

#
# sra.mk: downloads FASTQ reads from SRA
#
# MODE=PE
# SRR=SRR1553425
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# N=1000
#
# make fastq
#

SRA.MK EXAMPLES

# Print usage.
make -f src/run/sra.mk

# Download the default SRR number.


make -f src/run/sra.mk fastq

# Download data for a specific SRR number and read number N


make -f src/run/sra.mk fastq SRR=SRR030257 N=1000

# Delete the downloaded files.


make -f src/run/sra.mk fastq!


# Redownload all the reads for a SRR number.


make -f src/run/sra.mk fastq! fastq SRR=SRR030257 N=ALL

SRA.MK CODE

#
# Downloads sequencing reads from SRA.
#

# The directory that stores FASTQ reads that we operate on.


OUT_DIR ?= reads

# SRR number (sequencing run from the Ebola outbreak data of 2014)
SRR ?= SRR1553425

# The default operation mode is paired end


MODE ?= PE

# The name of the reads.


R1 ?= ${OUT_DIR}/${SRR}_1.fastq
R2 ?= ${OUT_DIR}/${SRR}_2.fastq

# How many reads to download (N=ALL downloads everything).


N ?= 1000

# Check the make version.


ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "### Error! Please use GNU Make 4.0 or later ###")
endif

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage information.


usage::
> @echo "#"
> @echo "# sra.mk: downloads FASTQ reads from SRA"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# SRR=${SRR}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# N=${N}"
> @echo "#"
> @echo "# make fastq"
> @echo "#"


# Set the flags for the download


ifeq ($(N), ALL)
SRA_FLAGS = -F --split-files
else
SRA_FLAGS = -F --split-files -X ${N}
endif

# Obtain the reads from SRA.


${R1}:
> mkdir -p ${OUT_DIR}
> fastq-dump ${SRA_FLAGS} -O ${OUT_DIR} ${SRR}

# List the data.


get:: ${R1}
> @ls -lh ${OUT_DIR}/${SRR}*

# Removes the SRA files.


get!::
> rm -f ${OUT_DIR}/${SRR}*.fastq

# Aliases: fastq and fastq! are the documented names for the get targets.


fastq: get
fastq!: get!

test:
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get! get

install::
> @echo mamba install sra-tools

# Targets that are not files.


.PHONY: usage install get get! fastq fastq! test

3.2.2 genbank.mk

The genbank.mk module assists with obtaining sequences from the GenBank repository.

GENBANK.MK USAGE

make -f src/run/genbank.mk

GENBANK.MK HELP

#
# genbank.mk: download sequences from GenBank
#
# ACC=AF086833
# REF=refs/AF086833.fa
# GBK=refs/AF086833.gb
# GFF=refs/AF086833.gff
#
# make fasta genbank gff
#

GENBANK.MK EXAMPLES

# Print usage.
make -f src/run/genbank.mk

# Download a specific accession number.


make -f src/run/genbank.mk fasta ACC=NC_045512

# Get data in all three formats.


make -f src/run/genbank.mk fasta genbank gff

# Delete the downloaded files.


make -f src/run/genbank.mk fasta!

GENBANK.MK CODE

#
# Downloads NCBI data via the Entrez API
#

# Accession number at NCBI


ACC ?= AF086833

# The accession as a fasta file.


REF ?= refs/${ACC}.fa

# The accession as a genbank file.


GBK ?= refs/${ACC}.gb

# The accession as a gff file.


GFF ?= refs/${ACC}.gff

# Check the make version.


ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "### Error! Please use GNU Make 4.0 or later ###")
endif

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage information.


usage::
> @echo "#"
> @echo "# genbank.mk: download sequences from GenBank"
> @echo "#"
> @echo "# ACC=${ACC}"
> @echo "# REF=${REF}"
> @echo "# GBK=${GBK}"
> @echo "# GFF=${GFF}"
> @echo "#"
> @echo "# make fasta genbank gff"
> @echo "#"

# Obtain a sequence from NCBI.


${REF}:
> mkdir -p $(dir $@)
> bio fetch ${ACC} --format fasta > $@

${GBK}:
> mkdir -p $(dir $@)
> bio fetch ${ACC} > $@

${GFF}:
> mkdir -p $(dir $@)
> bio fetch ${ACC} --format gff > $@

# Download FASTA file.


fasta:: ${REF}
> @ls -lh ${REF}

# Download GenBank file.


genbank:: ${GBK}
> @ls -lh ${GBK}

# Download GFF file.


gff:: ${GFF}
> @ls -lh ${GFF}

# Remove FASTA file.


fasta!::
> rm -rf ${REF}

# Remove GenBank file.


genbank!::
> rm -rf ${GBK}

# Remove GFF file.


gff!::
> rm -rf ${GFF}

test:
> make -f src/run/genbank.mk fasta! fasta ACC=${ACC} REF=${REF}
> make -f src/run/genbank.mk gff! gff ACC=${ACC} GFF=${GFF}
> make -f src/run/genbank.mk genbank! genbank ACC=${ACC} GBK=${GBK}

# Installation instructions
install::
> @echo pip install bio --upgrade

# Targets that are not files.


.PHONY: usage install fasta fasta! genbank genbank! gff gff! test

3.3 QC modules

3.3.1 fastp.mk

• Homepage: https://github.com/OpenGene/fastp

The fastp.mk module may be used to apply quality control to sequencing data.

FASTP.MK USAGE

make -f src/run/fastp.mk

FASTP.MK HELP

#
# fastp.mk: trim FASTQ reads
#
# MODE=PE SRR=SRR1553425
#
# Input:
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# Output:
# Q1=trim/SRR1553425_1.fastq
# Q2=trim/SRR1553425_2.fastq
#
# make trim
#

FASTP.MK EXAMPLES

# Print usage
make -f src/run/fastp.mk

# Trim reads for the default SRR number.
make -f src/run/fastp.mk trim

# Trim reads for a specific SRR number.
make -f src/run/fastp.mk trim SRR=SRR030257

# Delete the trimmed files.


make -f src/run/fastp.mk trim!


# Trim reads in single end mode.


make -f src/run/fastp.mk trim MODE=SE SRR=SRR030257

FASTP.MK CODE

Inputs: R1, R2. Outputs: Q1, Q2.

#
# Trims FASTQ files and runs FASTQC on them.
#

# Library layout: single-end (SE) or paired-end (PE)


MODE ?= PE

# SRR numbers may also be used as input.


SRR ?= SRR1553425

# The input read pairs.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# The quality controlled read pairs


Q1 ?= trim/$(notdir ${R1})
Q2 ?= trim/$(notdir ${R2})

# Number of CPUS to use.


NCPU ?= 2

# The adapter sequence for trimming.


ADAPTER ?= AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

# The statistics on the reads files.


READ_STATS ?= $(basename ${Q1}).stats

# FASTP html report.


FASTP_HTML = $(basename ${Q1}).html

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage
usage::
>@echo "#"
>@echo "# fastp.mk: trim FASTQ reads"
>@echo "#"
>@echo "# MODE=${MODE} SRR=${SRR}"


>@echo "#"
>@echo "# Input:"
>@echo "# R1=${R1}"
>@echo "# R2=${R2}"
>@echo "# Output:"
>@echo "# Q1=${Q1}"
>@echo "# Q2=${Q2}"
>@echo "#"
>@echo "# make trim"
>@echo "#"

# Other trimming options, more like Trimmomatic


FASTP_FLAGS ?= --adapter_sequence ${ADAPTER} --cut_right --cut_right_window_size 4 --cut_right_mean

# Trim in paired end mode.


ifeq ($(MODE), PE)
${Q1} ${Q2}: ${R1} ${R2}
> mkdir -p $(dir ${Q1})
> fastp -i ${R1} -I ${R2} -o ${Q1} -O ${Q2} -w ${NCPU} ${FASTP_FLAGS}

trim: ${Q1}
> @ls -lh ${Q1} ${Q2}
endif

# Trim in single end mode.


ifeq ($(MODE), SE)
${Q1}: ${R1}
> mkdir -p $(dir ${Q1})
> fastp -i ${R1} -o ${Q1} -w ${NCPU} ${FASTP_FLAGS}

trim: ${Q1}
> @ls -lh ${Q1}
endif

# Removes the trimmed files.


trim!:
> rm -f ${Q1} ${Q2}

test:
# Get the FASTQ reads.
> make -f src/run/fastp.mk MODE=${MODE} SRR=${SRR} trim! trim

# Installation instructions
install::
>@echo mamba install fastp

# Targets that are not valid files.


.PHONY: trim install usage

3.4 Alignment modules

3.4.1 bwa.mk

The bwa.mk module aligns reads to a reference genome using the bwa aligner.

Home: https://github.com/lh3/bwa

BWA.MK USAGE

make -f src/run/bwa.mk

BWA.MK HELP

#
# bwa.mk: align reads using BWA
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bwa.bam
#
# make index align
#

BWA.MK EXAMPLES

# Print usage
make -f src/run/bwa.mk

# Align with default parameters.


make -f src/run/bwa.mk index align

# Delete alignment files.


make -f src/run/bwa.mk align!

# Align to specific reference genome and reads


make -f src/run/bwa.mk align REF=refs/mygenome.fa R1=reads/r1.fq R2=reads/r2.fq

# Align as single end reads.


make -f src/run/bwa.mk align MODE=SE REF=refs/mygenome.fa R1=reads/r1.fq


BWA.MK CODE

#
# Generate alignments with bwa
#
# The genbank accession number
ACC = AF086833

# The reference genome.


REF ?= refs/${ACC}.fa

# The indexed reference genome.


IDX ?= $(dir ${REF})/idx/$(notdir ${REF})

# A file in the index directory.


IDX_FILE ?= ${IDX}.ann

# A root to derive output default names from.


SRR=SRR1553425

# Number of CPUS
NCPU ?= 2

# Additional flags to pass to BWA.


BWA_FLAGS ?= -t ${NCPU}

# Sam filter flags to filter the BAM file before sorting.


SAM_FLAGS ?=

# FASTQ read pair.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# The alignment file.


BAM ?= bam/${SRR}.bwa.bam

# Alignment mode.
MODE ?= PE

# Set the values for the read groups.


ID ?= run1
SM ?= sample1
LB ?= library1
PL ?= ILLUMINA

# Build the read groups tag.


RG ?= '@RG\tID:${ID}\tSM:${SM}\tLB:${LB}\tPL:${PL}'

# Check the make version.


ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "### Error! Please use GNU Make 4.0 or later ###")
endif

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage information.


usage::
> @echo "#"
> @echo "# bwa.mk: align reads using BWA"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "#"
> @echo "# make index align"
> @echo "#"

# Index the reference genome.


${IDX_FILE}:
> @mkdir -p $(dir $@)
> bwa index -p ${IDX} ${REF}

# Create the index.


index: ${IDX_FILE}
> @ls -lh ${IDX_FILE}

# Remove the index.


index!:
> rm -rf ${IDX_FILE}

# Paired end alignment.


ifeq ($(MODE), PE)
# Unsorted BAM file.
${BAM}: ${R1} ${R2}
> mkdir -p $(dir $@)
> bwa mem ${BWA_FLAGS} -R ${RG} ${IDX} ${R1} ${R2} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Single end alignment.


ifeq ($(MODE), SE)
${BAM}: ${R1}
> mkdir -p $(dir $@)

> bwa mem ${BWA_FLAGS} -R ${RG} ${IDX} ${R1} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Create the BAM index file.


${BAM}.bai: ${BAM}
> samtools index ${BAM}

# Display the BAM file path.


align: ${BAM}.bai
> @ls -lh ${BAM}

# Remove the BAM file.


align!:
> rm -rf ${BAM} ${BAM}.bai

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bwa.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}

# Install required software.


install::
> @echo mamba install bwa samtools

# Targets that are not files.


.PHONY: align install usage index test

3.4.2 bowtie2.mk

The bowtie2.mk module aligns reads to a reference genome using the bowtie2 aligner.

Home: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

BOWTIE2.MK USAGE

make -f src/run/bowtie2.mk

BOWTIE2.MK HELP

#
# bowtie2.mk: aligns reads using bowtie2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bowtie2.bam
#
# make index align
#

BOWTIE2.MK EXAMPLES

# Print usage
make -f src/run/bowtie2.mk

# Align with default parameters.
make -f src/run/bowtie2.mk index align

# Delete the alignment files.
make -f src/run/bowtie2.mk align!

# Align reads in single end mode.
make -f src/run/bowtie2.mk align MODE=SE SRR=SRR030257

BOWTIE2.MK CODE

#
# Generates alignments with bowtie2
#

# A root to derive default file names from.


SRR=SRR1553425

# Number of CPUS
NCPU ?= 2

# Additional bowtie2 options.


BOWTIE2_FLAGS = --sensitive-local

# Sam flags to filter the BAM file before sorting.


SAM_FLAGS ?=

# FASTQ read pair.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# The reference genome.


REF ?= refs/AF086833.fa

# The alignment file.


BAM ?= bam/${SRR}.bowtie2.bam

# The indexed reference genome.


IDX ?= $(dir ${REF})/idx/$(notdir ${REF})

# File in the index directory.


IDX_FILE = ${IDX}.1.bt2

# Alignment mode.
MODE ?= PE

# Set the values for the read groups.


ID ?= run1
SM ?= sample1
LB ?= library1
PL ?= ILLUMINA
RG ?= "@RG\tID:${ID}\tSM:${SM}\tLB:${LB}\tPL:${PL}"

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# bowtie2.mk: aligns reads using bowtie2"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "#"
> @echo "# make index align"
> @echo "#"

# Index the reference genome.


${IDX_FILE}:
> mkdir -p $(dir ${IDX_FILE})
> bowtie2-build ${REF} ${IDX}

# Generate the index.


index: ${IDX_FILE}
> @ls -lh ${IDX_FILE}

# Remove the index.


index!:
> rm -f ${IDX_FILE}

# Paired end alignment.


ifeq ($(MODE), PE)
${BAM}: ${R1} ${R2}
> @mkdir -p $(dir $@)
> bowtie2 ${BOWTIE2_FLAGS} -p ${NCPU} -x ${IDX} -1 ${R1} -2 ${R2} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Single end alignment.


ifeq ($(MODE), SE)
${BAM}: ${R1}
> @mkdir -p $(dir $@)
> bowtie2 ${BOWTIE2_FLAGS} -p ${NCPU} -x ${IDX} -U ${R1} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Create the BAM index file.


${BAM}.bai: ${BAM}
> samtools index ${BAM}

# Target to trigger the alignment.


align: ${BAM}.bai
> @ls -lh ${BAM}

# Remove the BAM file.


align!:
> rm -rf ${BAM} ${BAM}.bai


# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bowtie2.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}

# Install required software.


install::
> @echo mamba install bowtie2 samtools

# Targets that are not files.


.PHONY: align align! install usage index test

3.4.3 minimap2.mk

Home: https://github.com/lh3/minimap2

The minimap2.mk module aligns reads to a reference genome using minimap2.

MINIMAP2.MK USAGE

make -f src/run/minimap2.mk

MINIMAP2.MK HELP

#
# minimap2.mk: align reads using minimap2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.minimap2.bam
#
# make index align
#

MINIMAP2.MK EXAMPLES

# Print usage
make -f src/run/minimap2.mk

# Align with default parameters.


make -f src/run/minimap2.mk index align

# Delete alignment files.


make -f src/run/minimap2.mk align!

# Align to specific reference genome and reads


make -f src/run/minimap2.mk align REF=refs/mygenome.fa R1=reads/r1.fq R2=reads/r2.fq

# Align as single end reads.


make -f src/run/minimap2.mk align MODE=SE REF=refs/mygenome.fa R1=reads/r1.fq

MINIMAP2.MK CODE

#
# Generates alignments with minimap2
#

# A root to derive output default names from.


SRR=SRR1553425

# Number of CPUS
NCPU ?= 2

# Additional minimap2 options. Also common: --secondary=no --sam-hit-only


MINI2_FLAGS = -x sr

# Sam flags to filter the BAM file before sorting.


SAM_FLAGS ?=

# FASTQ read pair.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# Test the entire pipeline.


ACC ?= AF086833
REF ?= refs/${ACC}.fa

# The alignment file.


BAM ?= bam/${SRR}.minimap2.bam

# The indexed reference genome.


IDX ?= $(dir ${REF})/idx/$(notdir ${REF}).mmi

# Alignment mode: paired end (PE) and single end (SE)


MODE ?= PE

# Set the values for the read groups.


ID ?= run1
SM ?= sample1
LB ?= library1
PL ?= ILLUMINA
RG ?= "@RG\tID:${ID}\tSM:${SM}\tLB:${LB}\tPL:${PL}"

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# minimap2.mk: align reads using minimap2"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "#"
> @echo "# make index align"
> @echo "#"

# Index the reference genome.


${IDX}:
> mkdir -p $(dir ${IDX})
> minimap2 -t ${NCPU} -x sr -d ${IDX} ${REF}

# Generate the index.


index: ${IDX}
> @ls -lh ${IDX}

# Remove the index.


index!:
> rm -f ${IDX}

# Paired end alignment.


ifeq ($(MODE), PE)
${BAM}: ${R1} ${R2}
> mkdir -p $(dir $@)
> minimap2 -a --MD -t ${NCPU} ${MINI2_FLAGS} -R ${RG} ${IDX} ${R1} ${R2} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Single end alignment.


ifeq ($(MODE), SE)
${BAM}: ${R1}
> mkdir -p $(dir $@)
> minimap2 -a --MD -t ${NCPU} ${MINI2_FLAGS} -R ${RG} ${IDX} ${R1} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Create the BAM index file.


${BAM}.bai: ${BAM}
> samtools index ${BAM}

# Target to trigger the alignment.


align: ${BAM}.bai
> @ls -lh ${BAM}

# Remove the BAM file.


align!:
> rm -rf ${BAM} ${BAM}.bai


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/minimap2.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}

# Install required software.


install::
> @echo mamba install minimap2 samtools

# Targets that are not files.


.PHONY: align align! install usage index

3.4.4 hisat2.mk

• Home: http://daehwankimlab.github.io/hisat2/

The hisat2.mk module aligns reads to a reference genome using the hisat2 aligner.

HISAT2.MK USAGE

make -f src/run/hisat2.mk

HISAT2.MK HELP

#
# hisat2.mk: align reads using HISAT2
#
# MODE=PE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.hisat2.bam
#
# make index align
#

HISAT2.MK EXAMPLES

# Print usage
make -f src/run/hisat2.mk

# Align with default parameters.


make -f src/run/hisat2.mk align

# Delete alignment files.


make -f src/run/hisat2.mk align!

# Align to specific reference genome and reads


make -f src/run/hisat2.mk REF=refs/mygenome.fa R1=reads/r1.fq R2=reads/r2.fq align

# Align as single end reads.


make -f src/run/hisat2.mk MODE=SE REF=refs/mygenome.fa R1=reads/r1.fq align

HISAT2.MK SOURCE CODE

#
# Generate alignments with hisat2
#

# A root to derive output default names from.


SRR=SRR1553425

# Number of CPUS
NCPU ?= 2

# Additional flags to pass to HISAT.


HISAT2_FLAGS ?= --threads ${NCPU} --sensitive

# Read groups.
RG ?= --rg-id ${ID} --rg SM:${SM} --rg LB:${LB} --rg PL:${PL}

# Sam filter flags to filter the BAM file before sorting.


SAM_FLAGS ?=

# Default alignment mode is paired end.


MODE ?= PE

# FASTQ read pair.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# The accession number for the reference genome.


ACC = AF086833

# The reference genome.


REF ?= refs/${ACC}.fa

# The indexed reference genome.


IDX ?= $(dir ${REF})/idx/$(notdir ${REF})

# A file in the index directory.


IDX_FILE ?= ${IDX}.1.ht2

# The alignment file.


BAM ?= bam/${SRR}.hisat2.bam

# Set the values for the read groups.


ID ?= run1
SM ?= sample1
LB ?= library1
PL ?= ILLUMINA

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory


# Print usage information.


usage::
> @echo "#"
> @echo "# hisat2.mk: align reads using HISAT2"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "#"
> @echo "# make index align"
> @echo "#"

# Build the index for the reference genome.


${IDX_FILE}:
> mkdir -p $(dir $@)
> hisat2-build --threads ${NCPU} ${REF} ${IDX}

# Create the index.


index: ${IDX_FILE}
> @ls -lh ${IDX_FILE}

# Remove the index.


index!:
> rm -rf ${IDX_FILE}

# We do not list the index as a dependency to avoid accidentally triggering the index build.

# Paired end alignment.


ifeq ($(MODE), PE)
${BAM}: ${R1} ${R2}
> @mkdir -p $(dir $@)
> hisat2 ${HISAT2_FLAGS} ${RG} -x ${IDX} -1 ${R1} -2 ${R2} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Single end alignment.


ifeq ($(MODE), SE)
${BAM}: ${R1}
> @mkdir -p $(dir $@)
> hisat2 ${HISAT2_FLAGS} ${RG} -x ${IDX} -U ${R1} | \
samtools view -h ${SAM_FLAGS} | \
samtools sort -@ ${NCPU} > ${BAM}
endif

# Create the BAM index file.


${BAM}.bai: ${BAM}
> samtools index ${BAM}

# Display the BAM file path.


align: ${BAM}.bai
> @ls -lh ${BAM}

# Remove the BAM file.


align!:
> rm -rf ${BAM} ${BAM}.bai

test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/hisat2.mk MODE=${MODE} BAM=${BAM} REF=${REF} index align! align
# Generate alignment report
> samtools flagstat ${BAM}

# Install required software.


install::
> @echo mamba install hisat2 samtools

# Targets that are not files.


.PHONY: align install usage index

3.5 RNA-Seq modules

3.5.1 salmon.mk

• Home: https://salmon.readthedocs.io/en/latest/index.html

The salmon.mk module quantifies transcript abundances from sequencing reads using salmon.

SALMON.MK USAGE

make -f src/run/salmon.mk

SALMON.MK HELP

#
# salmon.mk: quantify transcripts using salmon
#
# MODE=SE
# R1=reads/SRR1553425_1.fastq
# R2=reads/SRR1553425_2.fastq
# REF=refs/AF086833.fa
# SAMPLE=sample
#
# make index align
#

SALMON.MK EXAMPLES

# Print usage
make -f src/run/salmon.mk

# Align with default parameters.


make -f src/run/salmon.mk index align

# Delete alignment files.


make -f src/run/salmon.mk align!

# Align to specific reference genome and reads


make -f src/run/salmon.mk align REF=refs/mygenome.fa R1=reads/r1.fq R2=reads/r2.fq

# Align as single end reads.


make -f src/run/salmon.mk align MODE=SE REF=refs/mygenome.fa R1=reads/r1.fq


SALMON.MK CODE

#
# Generates alignments with salmon
#
SRR ?= SRR1553425

# FASTQ read pair.


R1 ?= reads/${SRR}_1.fastq
R2 ?= reads/${SRR}_2.fastq

# Accession number for the reference genome.


ACC ?= AF086833

# Normally it should be a transcriptome file!


REF ?= refs/AF086833.fa

# The name of the sample that is processed


SAMPLE = sample

# Name of the output directory.


PREFIX ?= salmon/${SAMPLE}

# The name of the quantification file.


QUANT ?= ${PREFIX}/quant.sf

# The indexed reference genome.


IDX ?= $(dir ${REF})/idx/$(notdir ${REF})

# Default alignment mode is single end.


MODE ?= SE

# Number of CPUS
NCPU ?= 2

# Additional salmon options.


SALMON_FLAGS = -l A

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# salmon.mk: quantify transcripts using salmon"
> @echo "#"
> @echo "# MODE=${MODE}"
> @echo "# R1=${R1}"
> @echo "# R2=${R2}"
> @echo "# REF=${REF}"
> @echo "# SAMPLE=${SAMPLE}"
> @echo "#"
> @echo "# make index align"
> @echo "#"

# Index the reference.


${IDX}:
> mkdir -p $(dir ${IDX})
> salmon index --threads ${NCPU} -t ${REF} -i ${IDX}

# Generate the index.


index:: ${IDX}
> @ls -lh ${IDX}/info.json

# Remove the index.


index!:: ${IDX}
> rm -rf ${IDX}

# Paired end mode classification.


ifeq ($(MODE), PE)
${QUANT}: ${R1} ${R2}
> mkdir -p ${PREFIX}
> salmon quant -q ${SALMON_FLAGS} -i ${IDX} -1 ${R1} -2 ${R2} --threads ${NCPU} -o ${PREFIX}
endif

# Single end mode classification.


ifeq ($(MODE), SE)
${QUANT}: ${R1}
> mkdir -p ${PREFIX}
> salmon quant ${SALMON_FLAGS} -i ${IDX} -r ${R1} -p ${NCPU} -o ${PREFIX}
endif

# Target to trigger the alignment.


align:: ${QUANT}
> @ls -lh ${QUANT}

# Remove the BAM file.


align!::
> rm -rf ${QUANT}

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/salmon.mk MODE=${MODE} REF=${REF} index align! align

# Targets that are not files.


.PHONY: align align! install usage index test

# Install required software.


install::
> @echo mamba install salmon

3.6 SNP calling modules

3.6.1 bcftools.mk

• Homepage: https://samtools.github.io/bcftools/bcftools.html

The bcftools.mk module may be used to call variants from alignment files.

BCFTOOLS.MK USAGE

make -f src/run/bcftools.mk

BCFTOOLS.MK HELP

#
# bcftools.mk: calls variants using bcftools
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.bcftools.vcf.gz
#
# make vcf
#

BCFTOOLS.MK EXAMPLES

# Print usage
make -f src/run/bcftools.mk test

# Generate VCF file


make -f src/run/bcftools.mk vcf

BCFTOOLS.MK CODE

The module also preemptively sets variable names used by the alignment and variant calling modules,
so that chaining these modules together allows for seamless interoperation. When chaining, you
must include bcftools.mk before the other modules.
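
For instance, here is a sketch that mirrors the module's own test target, overriding the shared BAM and REF variables on the command line:

# Align the reads first, then call variants on the resulting BAM file.
make -f src/run/bwa.mk index align BAM=bam/SRR1553425.bam REF=refs/AF086833.fa
make -f src/run/bcftools.mk vcf BAM=bam/SRR1553425.bam REF=refs/AF086833.fa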

#
# Generates SNP calls with bcftools.
#

# A root to derive output default names from.


SRR = SRR1553425

# Number of CPUS
NCPU ?= 2

# Genbank accession number.


ACC ?= AF086833

# The reference genome.


REF ?= refs/${ACC}.fa

# The alignment file.


BAM ?= bam/${SRR}.bam

# The variant file.


VCF ?= vcf/$(notdir $(basename ${BAM})).bcftools.vcf.gz

# Additional bcf flags for pileup annotation.


PILE_FLAGS = -d 100 --annotate 'INFO/AD,FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP'

# Additional bcf flags for calling process.


CALL_FLAGS = --ploidy 2 --annotate 'FORMAT/GQ'

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# bcftools.mk: calls variants using bcftools"
> @echo "#"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "# VCF=${VCF}"
> @echo "#"
> @echo "# make vcf"
> @echo "#"

${VCF}: ${BAM} ${REF}


> mkdir -p $(dir $@)
> bcftools mpileup ${PILE_FLAGS} -O u -f ${REF} ${BAM} | bcftools call ${CALL_FLAGS} -mv -O u | bcftools norm -f ${REF} -d all -O z > ${VCF}

${VCF}.tbi: ${VCF}
> bcftools index -t -f $<

vcf: ${VCF}.tbi
> @ls -lh ${VCF}


vcf!:
> rm -rf ${VCF} ${VCF}.tbi

install::
> @echo mamba install bcftools

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
# Call the variants.
> make -f src/run/bcftools.mk VCF=${VCF} BAM=${BAM} REF=${REF} vcf! vcf

# Targets that are not files.


.PHONY: vcf vcf! usage

3.6.2 freebayes.mk

Homepage: https://github.com/freebayes/freebayes

The freebayes.mk module may be used to call variants with freebayes.

FREEBAYES.MK USAGE

make -f src/run/freebayes.mk

FREEBAYES.MK HELP

#
# freebayes.mk: calls variants using freebayes
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.freebayes.vcf.gz
#
# make vcf
#

FREEBAYES.MK EXAMPLES

# Print usage
make -f src/run/freebayes.mk

# Call variants with the default settings.
make -f src/run/freebayes.mk vcf

# Call variants for a specific SRR number.
make -f src/run/freebayes.mk vcf SRR=SRR030257

# Delete the variant calls.
make -f src/run/freebayes.mk vcf!

# Run the full test pipeline.
make -f src/run/freebayes.mk test

FREEBAYES.MK CODE

#
# Generates SNP calls with freebayes
#


# A root to derive output default names from.


SRR=SRR1553425

# Number of CPUS
NCPU ?= 2

# The alignment file.


BAM ?= bam/${SRR}.bam

# Genbank accession number.


ACC = AF086833

# The reference genome.


REF ?= refs/${ACC}.fa

# Name of the VCF with all variants.


VCF ?= vcf/$(notdir $(basename ${BAM})).freebayes.vcf.gz

# Additional flags passed to freebayes


BAYES_FLAGS = --pvar 0.5

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# freebayes.mk: calls variants using freebayes"
> @echo "#"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "# VCF=${VCF}"
> @echo "#"
> @echo "# make vcf"
> @echo "#"

# Call SNPs with freebayes.


${VCF}: ${BAM} ${REF}
> mkdir -p $(dir $@)
> freebayes ${BAYES_FLAGS} -f ${REF} ${BAM} | bcftools norm -f ${REF} -d all -O z > ${VCF}

# The VCF index file.


${VCF}.tbi: ${VCF}
> bcftools index -t -f $<

# The main action.


vcf: ${VCF}.tbi
> @ls -lh ${VCF}

# Undo the main action.


vcf!:
> rm -rf ${VCF}

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
# Call the variants.
> make -f src/run/freebayes.mk VCF=${VCF} BAM=${BAM} REF=${REF} vcf! vcf

# Print installation instructions.


install::
> @echo mamba install bcftools freebayes

# Targets that are not files.


.PHONY: vcf usage install

3.6.3 gatk.mk

• Homepage: https://gatk.broadinstitute.org/hc/en-us

The gatk.mk module may be used to call variants.

GATK.MK USAGE

make -f src/run/gatk.mk

GATK.MK HELP

#
# gatk.mk: call variants using GATK4
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# TARGET=AF086833.2
# SITES=vcf/SRR1553425.knownsites.vcf.gz
#
# DUP=bam/SRR1553425.markdup.bam
# TAB=bam/SRR1553425.recal.txt
# RCB=bam/SRR1553425.recal.bam
#
# VCF=vcf/SRR1553425.gatk.vcf.gz
#
# make mark calibrate apply vcf
#

GATK.MK EXAMPLES

# Print usage
make -f src/run/gatk.mk

# Run the test suite


make -f src/run/gatk.mk test

GATK.MK CODE

#
# Generates SNP calls with gatk4
#

# A root to derive output default names from.


SRR=SRR1553425


# Number of CPUS
NCPU ?= 2

# Accession number
ACC = AF086833

# GATK target.
TARGET = AF086833.2

# The reference genome.


REF ?= refs/${ACC}.fa

# Reference dictionary.
DICT = $(basename ${REF}).dict

# The known sites in VCF format.


SITES ?= vcf/${SRR}.knownsites.vcf.gz

# The alignment file.


BAM ?= bam/${SRR}.bam

# Mark duplicates BAM file.


DUP = $(basename ${BAM}).markdup.bam

# Recalibration table.
TAB = $(basename ${BAM}).recal.txt

# Recalibrated BAM file.


RCB = $(basename ${BAM}).recal.bam

# The directory for the variant files.


VCF_DIR = vcf

# The variant file.


VCF ?= ${VCF_DIR}/$(notdir $(basename ${BAM})).gatk.vcf.gz

# Check the make version.


ifeq ($(origin .RECIPEPREFIX), undefined)
$(error "# This file requires GNU Make 4.0 or later: mamba install make")
endif

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory


JAVA_OPTS = '-Xmx4g -XX:+UseParallelGC -XX:ParallelGCThreads=4'

# The first target is always the help.


usage::
> @echo "#"
> @echo "# gatk.mk: call variants using GATK4"
> @echo "#"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "# TARGET=${TARGET}"
> @echo "# SITES=${SITES}"
> @echo "#"
> @echo "# DUP=${DUP}"
> @echo "# TAB=${TAB}"
> @echo "# RCB=${RCB}"
> @echo "#"
> @echo "# VCF=${VCF} "
> @echo "# "
> @echo "# make mark calibrate apply vcf"
> @echo "#"

# Generate the FASTA sequence dictionary.


${DICT}: ${REF}
> gatk --java-options ${JAVA_OPTS} CreateSequenceDictionary --REFERENCE ${REF}

# Shortcut to FASTA sequence dictionary.


dict: ${DICT}

# Mark duplicates.
${DUP}: ${BAM} ${DICT}
> gatk MarkDuplicates -I ${BAM} -O ${DUP} -M ${DUP}.metrics.txt

# Shortcut to mark duplicates.


mark: ${DUP}
> ls -l $<

# Generate the recalibration table.


${TAB}: ${DUP} ${SITES}
> gatk BaseRecalibrator -R ${REF} -I ${BAM} --known-sites ${SITES} -O ${TAB}

# Compute the calibration table.


calibrate: ${TAB}
> ls -l $<

# Remove the recalibration table.


calibrate!:
> rm -rf ${TAB}

# Apply the recalibration table.


${RCB}: ${TAB}
> gatk ApplyBQSR -R ${REF} -I ${BAM} --bqsr-recal-file ${TAB} -O ${RCB}

# Apply a recalibration onto the BAM file.


apply: ${RCB}
> ls -l $<

# Remove recalibrated BAM file.


apply!:
> rm -rf ${RCB}

# Call variants.
${VCF}: ${BAM} ${DICT}
> mkdir -p $(dir $@)
> gatk HaplotypeCaller --java-options ${JAVA_OPTS} -I ${BAM} -R ${REF} -L ${TARGET} --output ${VCF}

# Shortcut to call variants.


vcf: ${VCF}
> @ls -lh ${VCF}

# Delete variant call.


vcf!:
> rm -rf ${VCF}

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
# Create known sites with bcftools
> make -f src/run/bcftools.mk BAM=${BAM} REF=${REF} VCF=${SITES} vcf
# Call the variants.
> make -f src/run/gatk.mk TARGET=${TARGET} VCF=${VCF} BAM=${BAM} REF=${REF} mark calibrate apply vcf

install::
> @echo mamba install gatk4

# Targets that are not files.


.PHONY: vcf vcf! usage test mark calibrate apply

3.6.4 deepvariant.mk

• Homepage: https://github.com/google/deepvariant

The deepvariant.mk module may be used to generate variant calls.

Our module uses a Singularity container; unfortunately, bioconda does not provide a runnable
deepvariant package at this time.
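
As a sketch, pull the container image once, then run the module; the image name below matches the module's default SIF setting.

# One-time download of the Singularity image used by the module.
singularity pull docker://google/deepvariant:1.5.0

# Then call variants as usual.
make -f src/run/deepvariant.mk vcf BAM=bam/SRR1553425.bam REF=refs/AF086833.fa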

DEEPVARIANT.MK USAGE

make -f src/run/deepvariant.mk

DEEPVARIANT.MK HELP

#
# deepvariant.mk: call variants using Google Deepvariant
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
# VCF=vcf/SRR1553425.deepvariant.vcf.gz
#
# make vcf
#

DEEPVARIANT.MK EXAMPLES

# Print usage
make -f src/run/deepvariant.mk

# Call variants with the default settings.
make -f src/run/deepvariant.mk vcf

DEEPVARIANT.MK CODE

# Genbank accession number.


ACC ?= AF086833

# A root to derive output default names from.


SRR ?= SRR1553425

# Number of CPUS
NCPU ?= 2


# The reference genome.


REF ?= refs/${ACC}.fa

# The alignment file.


BAM ?= bam/${SRR}.bam

# The variant file.


VCF ?= vcf/$(notdir $(basename ${BAM})).deepvariant.vcf.gz

# The temporary intermediate results.


TMP ?= tmp/$(notdir ${VCF})

# The model type in deep variant.


MODEL ?= WGS

# Directory to mount in the singularity container.


MNT ?= $(shell pwd):$(shell pwd)

# Deepvariant singularity image


SIF ?= deepvariant_1.5.0.sif

# The deepvariant command line run


CMD ?= singularity run -B ${MNT} ${SIF}

# Additional bcf flags for calling process.


CALL_FLAGS ?=

# Example:
# CALL_FLAGS = --regions chr1:1-1000000

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The first target is always the help.


usage::
> @echo "#"
> @echo "# deepvariant.mk: call variants using Google Deepvariant"
> @echo "#"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "# VCF=${VCF}"
> @echo "#"
> @echo "# make vcf"
> @echo "#"

${REF}.fai: ${REF}
> samtools faidx ${REF}

# Generate the variant calls.


${VCF}: ${BAM} ${REF} ${REF}.fai
> mkdir -p $(dir $@)
> ${CMD} \
/opt/deepvariant/bin/run_deepvariant \
--model_type=${MODEL} \
--ref ${REF} \
--reads ${BAM} \
--output_vcf ${VCF} \
--num_shards ${NCPU} \
--intermediate_results_dir ${TMP} \
${CALL_FLAGS}

# Create the VCF index


${VCF}.tbi: ${VCF}
> bcftools index -t -f $<

# Generating a VCF file.


vcf: ${VCF}.tbi
> @ls -lh ${VCF}

vcf!:
> rm -rf ${VCF} ${VCF}.tbi

# Test the entire pipeline.


test:
# Get the reference genome.
> make -f src/run/genbank.mk ACC=${ACC} fasta
# Get the FASTQ reads.
> make -f src/run/sra.mk SRR=${SRR} get
# Align the FASTQ reads.
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
# Call the variants.
> make -f src/run/deepvariant.mk VCF=${VCF} BAM=${BAM} REF=${REF} vcf! vcf

# Installation instructions.
install:
>@echo singularity pull docker://google/deepvariant:1.5.0

.PHONY: usage vcf test install

3.6.5 ivar.mk

• Homepage: https://github.com/andersen-lab/ivar

The ivar.mk module generates consensus sequences and variant calls with the ivar suite.

IVAR.MK USAGE

make -f src/run/ivar.mk

IVAR.MK HELP

#
# ivar.mk: runs the ivar suite
#
# REF=refs/AF086833.fa
# BAM=bam/SRR1553425.bam
#
# make cons vars
#

IVAR.MK EXAMPLES

# Print usage
make -f src/run/ivar.mk

# Generate the consensus sequence.
make -f src/run/ivar.mk cons

# Generate the variant calls.
make -f src/run/ivar.mk vars

# Delete the generated files.
make -f src/run/ivar.mk cons! vars!

# Run the full test pipeline.
make -f src/run/ivar.mk test

IVAR.MK OUTPUT

Obtains the files and produces read statistics:

file                   format  type  num_seqs     sum_len  min_len  avg_len  max_len
reads/SRR1553591_1.fq  FASTQ   DNA    100,000  10,100,000      101      101      101
reads/SRR1553591_2.fq  FASTQ   DNA    100,000  10,100,000      101      101      101


IVAR.MK CODE

The module also preemptively sets variable names used by the alignment and variant calling modules,
so that chaining these modules together allows for seamless interoperation. When chaining, you
must include ivar.mk before the other modules.
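
A chained sketch that mirrors the module's own test target:

# Fetch the reference and its GFF, align the reads, then run ivar.
make -f src/run/genbank.mk fasta gff ACC=AF086833
make -f src/run/bwa.mk index align BAM=bam/SRR1553425.bam REF=refs/AF086833.fa
make -f src/run/ivar.mk cons vars BAM=bam/SRR1553425.bam REF=refs/AF086833.fa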

#
# Runs the ivar package.
#

# Command line flags for ivar.


IVAR_FLAGS = -t 0.5 -m 1

# The accession number


ACC=AF086833

# The SRR number


SRR=SRR1553425

# The alignment file.


BAM ?= bam/${SRR}.bam

# The reference genome.


REF ?= refs/${ACC}.fa

# The reference as GFF.


GFF ?= refs/${ACC}.gff

# The prefix for the output files.


PREFIX = ivar/${ACC}.consensus

# The ivar consensus file.


CONS = ${PREFIX}.fa

# The ivar variants file.


VARS = ${PREFIX}.tsv

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

usage::
> @echo "#"
> @echo "# ivar.mk: runs the ivar suite"
> @echo "#"
> @echo "# REF=${REF}"
> @echo "# BAM=${BAM}"
> @echo "#"
> @echo "# make cons vars"
> @echo "#"

# Generates the ivar consensus sequences.


${CONS}: ${BAM}
> mkdir -p $(dir $@)
> samtools mpileup ${BAM} -aa -A -d 0 -Q 20 | ivar consensus -p ${PREFIX} ${IVAR_FLAGS}

# Generates the ivar variant table.


${VARS}: ${BAM} ${REF} ${GFF}
> mkdir -p $(dir $@)
> samtools mpileup ${BAM} -aa -A -B -d 0 -Q 20 | ivar variants -p ${PREFIX} ${IVAR_FLAGS} -r ${REF} -g ${GFF}

# Prints the consensus file location.


cons: ${CONS}
> @ls -lh ${CONS}

# Remove the consensus.


cons!:
> @rm -rf ${CONS}

# Prints the variant file location.


vars: ${VARS}
> @ls -lh ${VARS}

# Remove the variants.


vars!:
>@rm -rf ${VARS}

test:
> make -f src/run/genbank.mk gff! gff ACC=${ACC} GFF=${GFF}
> make -f src/run/bwa.mk BAM=${BAM} REF=${REF} index align
> make -f src/run/ivar.mk BAM=${BAM} REF=${REF} cons! cons vars! vars

# Install required software.


install::
>@echo mamba install ivar samtools

# Targets that are not files.


.PHONY: cons cons! vars vars! usage

3.7 Utilities

3.7.1 curl.mk

The curl.mk module supports downloading data from various URLS via curl .

One might ask: Why even have a makefile for what seems to be a trivial UNIX command?

The main reason to use our Makefile is that it will not leave an incomplete file in its wake
if, for any reason, the download fails to complete. Incomplete downloads can cause subtle and
hard-to-troubleshoot problems!

In addition, the curl.mk file can unpack .gz and .tar.gz files automatically. Again, that is a
handy feature to have.
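
The equivalent shell pattern, as a minimal sketch: download to a temporary name, then move the file into place only when the transfer succeeds (the URL below is one of the example files used later in this section).

# Download to a temporary name; rename only on success.
URL=http://data.biostarhandbook.com/data/pairs.tar.gz
curl -L ${URL} -o pairs.tar.gz.tmp && mv pairs.tar.gz.tmp pairs.tar.gz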

A typical usage is to visit a website, locate the URLs for the files or directories of interest, then set the
URL parameter to that data. You may also override the output directory and the resulting
output file names.

Data sources:

1. UCSC: https://hgdownload.soe.ucsc.edu/downloads.html
2. Ensembl: https://ftp.ensembl.org/pub/

CURL.MK USAGE

make -f src/run/curl.mk get URL=$URL

CURL.MK EXAMPLES

To find the proper URLs, navigate the UCSC data server and copy the URLs for the data
of interest.

# Print usage
make -f src/run/curl.mk

# Obtain default data


make -f src/run/curl.mk get

# Ensembl URL to gzipped file


URL=https://ftp.ensembl.org/pub/release-105/gff3/bubo_bubo/Bubo_bubo.BubBub1.0.105.gff3.gz

# Automatically unpack a gzipped file.


make -f src/run/curl.mk get URL=$URL FILE=bubo.gff ACTION=UNZIP

# Link to a tar.gz file.


URL=http://data.biostarhandbook.com/data/pairs.tar.gz

# Automatically unpack a tar.gz file


make -f src/run/curl.mk get URL=$URL ACTION=UNPACK

CURL.MK CODE

#
# Download data via the web
#

# The URL to be downloaded


URL ?= https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/bigZips/wuhCor1.fa.gz

# Data destination directory.


DEST ?= data

# Additional curl flags.


CURL_FLAGS = -L

# The resulting file.


FILE ?= ${DEST}/$(notdir ${URL})

# The temporary download


TMP=${FILE}.tmp

# A sentinel file used only when unpacking tar.gz files.


SENTINEL = ${FILE}.files.txt

# What to do with the file.


ACTION = NONE

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# General usage information.


usage::
> @echo "#"
> @echo "# curl.mk: downloads data"
> @echo "#"

> @echo "# make get URL=? FILE=? ACTION=NONE|UNZIP|UNPACK"
> @echo "#"

# Download a temporary file.


${TMP}:
> mkdir -p $(dir $@)
> curl ${CURL_FLAGS} ${URL} > $@

# Deals with regular downloads. No compression.


ifeq ($(ACTION), NONE)
${FILE}:
> mkdir -p $(dir $@)
> curl ${CURL_FLAGS} ${URL} > $@

get:: ${FILE}
> @ls -lh $<
endif

# Dealing with GZIP files.


ifeq ($(ACTION), UNZIP)
${FILE}: ${TMP}
> gunzip -c $< > ${FILE}

get:: ${FILE}
> @ls -lh $<
endif

# Dealing with TAR.GZ files.


ifeq ($(ACTION), UNPACK)
# The sentinel file keeps track on whether the file has been unpacked.
# So that we don't unpack twice. Not a perfect solution, but it works.
${SENTINEL}: ${TMP}
> tar zxvkf $<
> tar tzf $< > $@

# Moves the file to its final destination.


get:: ${SENTINEL}
> @ls -lh $<
endif

# For all cases remove the file and temporary file.


get!::
> rm -rf ${FILE} ${TMP} ${SENTINEL}

# Installation instructions
install::
> @echo "# no installation required"

3.7.2 rsync.mk

The rsync.mk module supports downloading data from various URLs with different protocols.

A typical usage is to visit a website, locate the URLs for the files or directories of interest, then set
the URL parameter to that data. You may also override the output directory and the resulting
output file name.

UCSC DOWNLOADS

Rsync help: https://genome.ucsc.edu/goldenPath/help/ftp.html

rsync is the recommended protocol as it can resume downloads.

# RSYNC url
URL = rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz

NCBI GENOME RSYNC:

TODO:

RSYNC.MK USAGE

make -f src/run/rsync.mk get

RSYNC.MK EXAMPLES

To find the proper URLs navigate the UCSC data server and copy paste the URLs for the data
of interest.

# Print usage
make -f src/run/rsync.mk

# Obtain default data via HTTP protocol


make -f src/run/rsync.mk get

# RSYNC url
URL=rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz

# Get data via RSYNC protocol


make -f src/run/rsync.mk get URL=$URL


The recommended choice is to use the rsync protocol.

RSYNC.MK CODE

#
# Download data via rsync
#
# You might ask yourself: why have a makefile for what seems to be a trivial command?
#
# In this case to help us remember the correct commands and to make it fit with
# the rest of the workflow building process.
#
# The rsync rules everyone always forgets:
#
# - trailing slash on URL means copy the contents of the directory
# - no trailing slash on URL means copy the directory itself
# - a slash on the destination directory has no effect
#

# The remote location we wish to download


URL ?= rsync://hgdownload.soe.ucsc.edu/goldenPath/eboVir3/bigZips

# The destination that the download will be placed in.


DEST ?= refs

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# General usage information.


usage::
> @echo ""
> @echo "# rsync.mk: downloads data via rsync protocol"
> @echo ""
> @echo "# make get URL=? FILE=?"
> @echo ""

# rsync will automatically detect updated files.


get::
> mkdir -p ${DEST}
> rsync --times --copy-links --recursive -vz -P ${URL} ${DEST}
> find ${DEST}/*

get!::
> @echo "# cannot undo an rsync download"

# Installation instructions
install::
> @echo "# no installation required"

4. Formats

4.1 Introduction to data

Every tool, workflow, and analysis is about combining and transforming information between different
formats.

4.1.1 Information as data

Every workflow in this book will operate on data in different formats. Understanding how
information is structured and represented, and how tools combine existing data to
generate new information, is the most important ingredient of understanding any workflow.

As you work through any pipeline, do your best to get a firm grasp on:

1. What kinds of data are needed at the start?
2. What kind of data is produced in each step?
3. What information do we get at the end?

In this section we will briefly cover the most common data formats.

4.1.2 Biological data is simplified

Biological knowledge is always greatly simplified when stored in computer-ready form. Files
typically contain one of two classes of information:

1. sequence information, stored in file types such as FASTA and FASTQ
2. genomic regions (coordinates), stored in file types such as BED, GFF, and VCF

As it happens, the GENBANK (and EMBL) formats contain both sequence information and
annotations in a complex and inefficient representation. Notably, very few tools can operate
directly on GENBANK files. Think of GENBANK as a storage from which data in FASTA or
GFF formats can be extracted.

Dozens of additional formats are in use. Some may even have competing names:

• MAF: multiple alignment format, represents alignments of multiple sequences.
• MAF: mutation annotation format, represents variants.

4.1.3 Unexpected caveats

Most coordinate representations will display positions on the forward (positive) strand, even
when describing directional features on the corresponding reverse (negative) strand.

For example, for an interval [100, 200] that describes a transcript on the reverse strand, the
start column will contain 100 . In reality, the functional and actual start coordinate would be
200 as the feature is transcribed in reverse. Interpreting formats in the correct orientation
demands ongoing attention to detail.
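
A small sketch of this bookkeeping, using a hypothetical GFF line: the functional start of a reverse-strand feature is its end column.

# A made-up, tab-delimited GFF line for a transcript on the - strand.
printf 'chr1\t.\ttranscript\t100\t200\t.\t-\t.\tID=tx1\n' > demo.gff

# Print the functional start: column 5 on the - strand, column 4 otherwise.
awk -F '\t' '{ if ($7 == "-") print $5; else print $4 }' demo.gff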

4.2 GENBANK is for storage

GENBANK (and a similar, related format called EMBL) are formats for storing diverse
biological information in one file. These formats were designed to represent different types of
information in a human-readable fashion and, for that reason, are not optimized for large-scale
data representation.

GENBANK files are best suited for storing diverse and fine-grained information with lots of
details, for example: complete viral or bacterial genomes or individual human genes.

GENBANK files are NOT well suited for storing genomic information for organisms with long
DNA (say larger than 30 million bases).

There are various ways to obtain a GENBANK file from the internet. Here we are using bio :

bio fetch NC_045512 > NC_045512.gb

the file NC_045512.gb contains the following information:

LOCUS NC_045512 29903 bp ss-RNA linear VRL 18-JUL-2020
DEFINITION Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
complete genome.
ACCESSION NC_045512
VERSION NC_045512.2
DBLINK BioProject: PRJNA485481
KEYWORDS RefSeq.
SOURCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
ORGANISM Severe acute respiratory syndrome coronavirus 2
Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
Betacoronavirus; Sarbecovirus.
REFERENCE 1 (bases 1 to 29903)
AUTHORS Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y.,
Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H.,
Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.
TITLE A new coronavirus associated with human respiratory disease in
China
JOURNAL Nature 579 (7798), 265-269 (2020)
...

In general, you should avoid manual conversion and instead obtain the data in a format that is
already suitable for your analysis.

If you have to convert, here are some tips on how to convert GENBANK to other formats.

4.2.1 Convert GenBank to FASTA

cat NC_045512.gb | bio fasta > NC_045512.fa

4.2.2 Convert GenBank to GFF3

cat NC_045512.gb | bio gff > NC_045512.gff

4.2.3 Extract sequences from GenBank

Extract the coding sequence for the gene S

cat NC_045512.gb | bio fasta --gene S | head

prints:

>YP_009724390.1 {"type": "CDS", "gene": "S", "product": "surface glycoprotein", "locus": "GU280_gp02"}
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA

Extract the coding sequence for all coding features:

cat NC_045512.gb | bio fasta --type CDS

For a full list of the features of bio see:

• https://www.bioinfo.help/

4.3 FASTA contains sequences

FASTA is a record-based format. Each record starts with the > symbol, followed by a sequence id
and an optional description. The subsequent lines contain the sequence.

bio fetch NC_045512 -format fasta | head -5

will print:

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC

The sequence ID above is NC_045512.2. The description is Severe acute respiratory syndrome coronavirus 2 isolate.

FASTA sequence lines should all be the same length. The capitalization of the letters may also carry
meaning (typically, lowercase letters are used to mark repetitive, low complexity regions).

Some FASTA files may follow specific formatting to embed more structured information in
both ID and description.

Reference genomes are always represented in FASTA format.
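
As a quick example of working with a FASTA reference, indexing the file lets you extract arbitrary regions by coordinate; the sketch below assumes the NC_045512.fa file created in the previous section.

# Index the FASTA file, then slice out the first 50 bases.
samtools faidx NC_045512.fa
samtools faidx NC_045512.fa NC_045512.2:1-50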

4.4 FASTQ stores reads

The entries in FASTQ format represent individual measurements, so-called "reads", produced
by a sequencing instrument. The instrument may produce millions or even billions of such
reads, where each FASTQ record consists of four lines.

Published scientific works that use sequencing are required to deposit the raw data at
repositories such as SRA and ENA.

4.4.1 Getting data from SRA

For example, let's get a single record from the sequencing run with the accession number of
SRR5790106 :

fastq-dump -F -X 1 --split-files SRR5790106

the resulting SRR5790106_1.fastq file contains:

@HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
NCGCCCGGTTAGCGATCAACAATGGACTGCATCATTTCATGCAGCTCGAGCCGATTGTAAGTCGCCCGTAACGCG
+HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
#:=AA==EGG>FFCEFGDE1EFF@FEFFBBFGGGGGGDFGGG>@FGEGBGGGGGBGGGGGGGGFDFGGGGGBBGG

FASTQ file structure:

1. Header: @HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
2. Sequence: NCGCCCGGTTAGCGATCAACAATGGACTGCATCATTTCATGCAGCT ...
3. Header (again): +HWI-D00653:77:C6EBMANXX:7:1101:1429:1868
4. Quality: #:=AA==EGG>FFCEFGDE1EFF@FEFFBBFGGGGGGDFGGG>@FGEG ...

The header may contain instrument-specific information encoded into the sequence name. The
quality row represents error rates, where each character is a "Phred quality" score. The quality
scores are encoded as ASCII characters; in the standard Sanger/Illumina 1.8+ (Phred+33) encoding,
the character ! represents the lowest quality score. The repeated header in the third line, listed
after the + sign, may be omitted.
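
As a quick sketch, the numeric scores can be recovered by subtracting the ASCII offset (33 in the Phred+33 encoding):

# Decode the first few quality characters of the record above.
python3 -c 'print([ord(c) - 33 for c in "#:=AA=="])'
# prints: [2, 25, 28, 32, 32, 28, 28]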

4.4.2 Metadata for SRR numbers

You may investigate the metadata either via the web

• https://www.ncbi.nlm.nih.gov/sra/?term=SRR5790106

or at the command line:

bio search SRR5790106

that prints:

[
{
"run_accession": "SRR5790106",
"sample_accession": "SAMN07304757",
"first_public": "2017-12-18",
"country": "",
"sample_alias": "GSM2691575",
"fastq_bytes": "1417909312",
"read_count": "20892203",
"library_name": "",
"library_strategy": "RNA-Seq",
"library_source": "TRANSCRIPTOMIC",
"library_layout": "SINGLE",
"instrument_platform": "ILLUMINA",
"instrument_model": "Illumina HiSeq 2500",
"study_title": "RNA-seq analysis of genes under control of Burkholderia
thailandensis E264 MftR",
"fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR579/006/SRR5790106/SRR5790106.fastq"
}
]
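Since the output is JSON, individual fields may be pulled out with a JSON-aware tool. A sketch using jq (assuming jq is installed; it is available via conda and most package managers):

# Print only the FTP path to the FASTQ data.
bio search SRR5790106 | jq -r '.[].fastq_ftp'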

4.4.3 SRA Bioprojects

Each SRA record is associated with a BioProject, with accession numbers such as PRJNA392446 :

bio search PRJNA392446

See the Biostar Handbook for additional details.



4.5 GFF represents annotations

GFF stands for Generic Feature Format. GFF files are nine-column, tab-delimited, plain text
files used to represent coordinates in one dimension (along an axis).

You may produce a GFF file with:

bio fetch NC_045512 -format gff | head -5

prints:

NC_045512.2  RefSeq  region          1      29903  .  +  .  ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
NC_045512.2  RefSeq  five_prime_UTR  1      265    .  +  .  ID=id-NC_045512.2:1..265;gbkey=5'UTR
NC_045512.2  RefSeq  gene            266    21555  .  +  .  ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512.2  RefSeq  CDS             266    13468  .  +  0  ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2  RefSeq  CDS             13468  21555  .  +  0  ID=cds-YP_00

The last column, called "attributes," may contain multiple values separated by semicolons:

ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;
gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01

The structure of this column and the presence of certain attributes are essential for some tools,
notably in RNA-Seq analysis.
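The tab-delimited layout also makes GFF files easy to slice with standard Unix tools. A sketch that keeps only the gene features (column 3 holds the feature type) and prints their coordinates and strand:

bio fetch NC_045512 -format gff | awk -F '\t' '$3 == "gene" { print $1, $4, $5, $7 }'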




The GFF format may optionally include the sequences at the end of the file as a FASTA section; in general, it only includes coordinates. GTF is an older variant of the GFF format, similar in structure but with some differences in the way attributes are encoded.

GTF files must always include the gene_id and transcript_id attributes, hence they are used when those attributes are required.

• More information: Generic Feature Format Version 3 (GFF3): https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

See the Biostar Handbook for additional details.




4.6 BED is for annotations

BED files are three, six, or twelve-column, tab-delimited, plain text files used to represent coordinates in one dimension (along an axis).

They were originally devised for visualization purposes and thus carry columns (such as color ) that are not relevant to data analysis. The six-column BED designations are:

1. chrom - the chromosome name
2. chromStart - the start coordinate (0-based)
3. chromEnd - the end coordinate (not included in the interval)
4. name - the name of the feature
5. score - a score between 0 and 1000
6. strand - the strand, + or - (or . when not relevant)

One of the most important differences between the GFF and BED formats is their coordinate systems. GFF files are one-based, while BED files are zero-based: the first coordinate in GFF is 1, while the first coordinate in BED is 0.

In addition, GFF intervals are closed (include both coordinates), [100, 200] , while BED intervals are half-open, [100, 200) , and do not include the last coordinate. Needless to say, these differences lead to lots of errors and confusion. Let's obtain a BED file:

wget -nc https://genome.ucsc.edu/goldenPath/help/examples/bedExample.txt

where the bedExample.txt file contains:

#chrom chromStart chromEnd
chr21 9434178 9434609
chr21 9434178 9434609
chr21 9508110 9508214
...
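One practical consequence of the half-open system is that the length of a BED interval is simply end minus start, with no off-by-one correction. A sketch on the file above (skipping the header line):

awk '!/^#/ { print $3 - $2 }' bedExample.txt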

4.6.1 The bigBed format

There is a BED file variant called bigBed that is a compressed and indexed BED file that may be used to store much larger amounts of data in an efficient manner. If your interval dataset contains over a million items, you ought to consider using bigBed instead of BED . The conversion typically requires sorting the BED file, a chromosome size file, and a conversion step. Install the converter with:

mamba install ucsc-bedtobigbed -y




Now obtain the data:

wget -nc https://genome.ucsc.edu/goldenPath/help/examples/bedExample.txt
wget -nc https://genome.ucsc.edu/goldenPath/help/hg19.chrom.sizes

sort -k1,1 -k2,2n bedExample.txt > data.bed

bedToBigBed data.bed hg19.chrom.sizes data.bb

The hg19.chrom.sizes file contains the length of each chromosome as a tab-delimited file.

chr1 249250621
chr2 243199373
...

• More information: BED format specification at UCSC

4.6.2 The bigWig format

The bigWig format is the analogous compressed, indexed format for continuous-valued (wiggle) data, such as coverage signals. A typical conversion relies on the bedGraphToBigWig tool, which may be installed with:

mamba install ucsc-bedgraphtobigwig -y



4.7 SAM/BAM represent alignments

SAM/BAM files are used to represent alignments of FASTQ records against a FASTA reference.

Each row in a BAM file describes the properties of the alignment of a read relative to a target reference. Each column carries information about the alignment. The 11 mandatory columns are the following:

1. QNAME - the query (read) name
2. FLAG - a bitwise flag describing the alignment
3. RNAME - the reference sequence name
4. POS - the 1-based leftmost position of the alignment
5. MAPQ - the mapping quality
6. CIGAR - the CIGAR string describing the alignment
7. RNEXT - the reference name of the mate read
8. PNEXT - the position of the mate read
9. TLEN - the observed template length
10. SEQ - the read sequence
11. QUAL - the read quality

BAM files may be accessed from a file or over the internet:

# View a remote BAM file.
URL=https://genome.ucsc.edu/goldenPath/help/examples/bamExample.bam

samtools view $URL | head -1

prints a single alignment from the BAM file.

SRR010939.15011799 35 21 33019936 99 76M = 33019947 87
TAAAGATATAATCAGTAACTAAACTAATCCCAACACTAGGATTATTTGCCTAAATCATATATATGTGTATAGAGAA
=789=7D69;<E<5?A%7.D=;1AC1,ED,CA;<:0E>?>&BB6<H@?*7A/99C82<141C6C8575-.-$+$*,
RG:Z:SRR010939 MF:i:18 Aq:i:77 NM:i:2 UQ:i:15 H0:i:1 H1:i:0
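The second column ( 35 above) is the FLAG, a bitwise combination of alignment properties. The samtools flags subcommand will decode any flag value; for example, a sketch:

samtools flags 35

should report that the read is paired, in a proper pair, and has a reverse-complemented mate.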




BAM files need to be indexed; for each BAM file, a bai file is created and stored next to the
BAM file.
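A sketch of creating such an index, using a local copy of the example file from above:

# Download the example BAM file, then index it.
wget -nc https://genome.ucsc.edu/goldenPath/help/examples/bamExample.bam
samtools index bamExample.bam

The command writes the index to bamExample.bam.bai .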

• SAM specification
• SAM tag specification



4.8 VCF contains variations

VCF stands for Variant Call Format and is a tab-delimited, column-based file that represents a
set of variants in a genome.

A single VCF file may represent a single sample or multiple samples. VCF files are perhaps the most complex formats in the whole of genome analysis, even though the structure appears to be simple. The fixed columns are:

1. CHROM - the chromosome
2. POS - the 1-based position of the variant
3. ID - a variant identifier (for example, a dbSNP rs number)
4. REF - the reference allele
5. ALT - the alternate allele(s)
6. QUAL - the quality of the call
7. FILTER - the filter status
8. INFO - additional annotations as key=value pairs

An optional FORMAT column, followed by one column per sample, describes the genotype calls.

The challenge of VCF is that it needs to represent not what something is but how it is
"different" from a reference. A tremendous amount of information may be crammed into the
various fields of a VCF file, often rendering it almost impossible to read.

Page through the dbSNP VCF file to see the full list of fields.

URL=https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz

bcftools view $URL | more

prints sections like:

NC_000001.10 10001 rs1570391677 T A,C . . RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
NC_000001.10 10002 rs1570391692 A C . . RS=1570391692;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9944,0.005597
NC_000001.10 10003 rs1570391694 A C . . RS=1570391694;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9902,0.009763
NC_000001.10 10007 rs1639538116 T C,G . . RS=1639538116;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=dbGaP_PopFreq:1,0,0

We commonly convert VCF files to simpler tab-delimited formats. Below we convert a VCF file to contain only the chromosome, position, reference, and alternate alleles:

bcftools query -f '%CHROM %POS %REF %ALT\n' $URL | head

prints:

NC_000001.10 10001 T A,C


NC_000001.10 10002 A C
NC_000001.10 10003 A C
NC_000001.10 10007 T C,G
NC_000001.10 10008 A C,G,T
NC_000001.10 10009 A C,G
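Individual INFO tags may also be extracted by name with the %INFO/TAG syntax. A sketch pulling out the FREQ tag seen in the records above (reusing the $URL variable):

bcftools query -f '%CHROM\t%POS\t%INFO/FREQ\n' $URL | head -3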

• VCF specification



5. Modern Make

5.1 Why makefiles?

Ask a bioinformatician what workflow management system they use, and you will get a variety of answers. Most often, you will hear about Snakemake and Nextflow. Alas, these tools never worked out for me.

For beginners who are familiar with bash, starting with Makefiles can be a far more accessible
and efficient option, as they have a less steep learning curve compared to advanced tools like
Snakemake or Nextflow.

While more sophisticated workflow requirements might make Snakemake and Nextflow seem
advantageous and promise various benefits, many bioinformaticians never fully realize these
benefits. This is often due to factors such as the scope and complexity of their projects and the
considerable learning curve associated with adopting intricate workflow management systems.

Speaking from over a decade of experience in the field and currently managing a
Bioinformatics Consulting Center at a large university, I find Makefiles to be the most
convenient and user-friendly solution for managing my workflows, serving a wide range of
scientific needs.

In the following sections, I'll share my personal experiences of attempting to switch to Snakemake and Nextflow, which led to nothing but headaches and frustrations. Despite multiple attempts, I have never been able to successfully implement them in a satisfactory manner.

5.1.1 What is Modern Make?

I have noted that all of my colleagues who swear by Snakemake or Nextflow are not aware of the capabilities of Makefiles .

Most importantly, the approach we champion in this book is what we like to call Modern Make , where we make use of tools like GNU parallel to provide parallel execution of tasks.

We also believe that in a bioinformatics context, complete and full dependency management is counterproductive, as different steps in the workflow have staggeringly different runtimes.

Hence, the pipelines and workflows we will show you here always operate in stages, where we invoke make multiple times to complete the workflow.
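For example, a staged run with the index and align targets of the Makefile shown later in this chapter might look like this:

# Stage 1: build the reference index (slow, run once).
make index

# Stage 2: align the reads (re-run as often as needed).
make align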




Consult the Makefile chapters in The Art of Bioinformatics Scripting for more details on how we recommend using Makefiles in bioinformatics.



5.2 The rise of the black-box

A workflow consists of a series of data analysis tasks that process data into a format the user
can interpret. Ideally, workflows are built in a reusable and modular manner that we can adapt
to different needs.

Unfortunately, by 2022 many bioinformatics workflows have become obscure black boxes
that tend to hide the underlying data analysis steps. This is a problem because it makes it
extraordinarily difficult to understand the results and to adapt the workflow to new needs.

5.2.1 Why did it spiral out of control?

I blame the particular concept of easy-to-use software. The reality is that scientific data
analysis has never been and will never be easy - after all, the whole point of science is to work
at the edge of understanding.

Instead, we should aim to develop simple, logical and consistent software. The latter is critical
because it allows us to build on existing knowledge and reuse existing tools. An easy-to-use
software is a bit like a platypus, an over-specialized evolutionary dead-end that we can't
extend and build upon.

The allure of easy-to-use is powerful for life scientists already overwhelmed with the amount
of work that needs to be done in a wet lab. No wonder myriad solutions promise them that
easy solution where the analysis magically happens. And because biologists are ill-equipped
to decide if the process is reusable or not, they are easily fooled and will embrace solutions
that, in reality, make life harder.

5.2.2 Snakemake and the curious case of over-automation

An example of an over-automated bioinformatics black box is the nextstrain workflow. In principle, to run the pipeline, all you need is to install the package and then run the following:

nextstrain build . --cores 1 --configfile parameters.yaml

Type that in; as long as your computer has the right software installed, the claim is that it just runs and the results will be computed for you automatically. Let's take a look at what happens when we run the above command.




Oof! That was easy. Wasn't it?

But wait a minute, the workflow ran and produced an endless stream of results, but what
exactly has happened? Did you realize that we ran 34 different interconnected tasks? We are
left with little to no understanding of the assumptions, decisions, and processes that went into
the analysis itself.

Having run the analysis, what have we learned about the process? Almost nothing.

Did you notice the myriad of parameter settings that just flew by? Each one of those may
matter. Some parameters matter a lot more than others. But which ones are those? Are the
results correct? Hard to tell. The resulting directory is humongous, with hundreds of
intermediate files and a wealth of implicit information.

I consider the approach above the curse of over-automation where scientists build needlessly
interconnected tasks, where we can't even tell what and why something is happening. Alas, it
is an endemic problem of bioinformatics workflows, which you will undoubtedly experience.

Criticism aside, Nextstrain's approach is still better than that of many other software packages. At least Nextstrain shows the commands that get executed. With extra work and digging, we can figure out what is going on. Other, similar software won't even print the commands.

5.2.3 NextFlow and the curious case of over-abstraction

There are more extreme examples beyond nextstrain . I believe that the entire ecosystem built around nf-core is a hard-to-comprehend and ever-growing black box. I am saying this being fully aware that it has a peer-reviewed, highly cited Nature Biotechnology publication ...

• The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 2020

I'll give you an example of what I mean.

Imagine you wanted to download a subset of reads from SRA. You could use the sra-tools
package and run the following command:

fastq-dump -X 1000 --split-3 SRR123456




Of course, first you had to install the software itself. Then learn a bit about what Unix is and what happens in a terminal. Neither of these is easy in the traditional sense.

But I consider the tasks above simple in that they follow a transparent and logical process. The Unix command line behavior has been honed over decades, and it mostly fits into the Unix philosophy that ties it all together.

Now look at the proposed alternative. To use nf-core to perform the same action, first you need to study the pages at:

• https://nf-co.re/fetchngs
• https://github.com/nf-core/fetchngs

A very lengthy read indeed. And if you were industrious enough to study the pages above, you saw that nf-core replaces running:

fastq-dump -X 1000 --split-3 SRR123456

with:

nextflow run nf-core/fetchngs -profile YOURPROFILE --outdir <OUTDIR>

We now need to understand all the elements of the above incantation, make the proper profile file that contains the information in the right format, and then we still need to install the tools as before, and we still need to use UNIX commands. And even after reading the above, I am not quite sure how to pass the 1000 parameter to get just a subset of the data. Is that even possible? Probably not without editing the nf-core code itself.

See? All that just to run a simple command like fastq-dump . There is a promise there that if
we buy into the system, doing everything via nf-core , eventually everything will start to
make sense and just work.

Alas, that is not my experience at all. We still need to understand and troubleshoot the
individual tools - only now, we need to do so via yet another layer of configuration and
abstraction.

I believe that the abstraction proposed via nf-core is a fundamentally wrong approach to bioinformatics because all it does is introduce yet another layer of abstraction on top of the already complex software ecosystem.




When we learn nf-core we are learning bioinformatics projected onto a different plane, a
specific point of view created just a few years ago by a few people. It is not evident at all that
they have made the right choices.

I believe nf-core is more of a Rube Goldberg machine:

A fragile system that may work for one specific need, but fails radically when we try to extend
it.

The problem of over-engineering is endemic and rampant in bioinformatics. So much so that we even have a separate chapter titled How not to waste your time in the main volume of the Biostar Handbook.



5.3 Snakemake & Nextflow

I believe that Makefiles provide sufficient automation for most individual projects. Other
scientists disagree, and several alternative approaches exist.

Here is my personal anecdote:

Every once in a while, I get the urge to keep up with the times, that I should be using
something like Snakemake or Nextflow instead of Makefile , so I sit down to rewrite a
workflow in Snakemake.

But then I run into a small problem with Snakemake - it is seemingly trivial yet it takes
twenty minutes to solve.

Then I hit another problem with Snakemake , and another ... Next thing I know is that I have
spent hours reading Snakemake documentation, Googling obtuse errors, scouring GitHub
issues, troubleshooting unexpected behaviors. I don't want to toot my own horn, but I
consider myself quite good at solving computational problems. Yet I am continuously
stumped by Snakemake .

What is most aggravating is that the simplest tasks seem to cause problems, not the actual bioinformatics analysis. I already know how to solve the problem in an elegant way; Snakemake just gets in the way.

I run into trouble when renaming a file to match an input, when trying to branch off onto a different path depending on a parameter, etc. What is extraordinarily annoying is that I always know precisely what I want to achieve; it is just that I can't get Snakemake to do it.

Invariably, a few hours later, I give up. It is just not worth the effort; everything is more
straightforward with a bash script or a Makefile !

Snakemake is just not built for how I think about data analysis.

5.3.1 Is Nextflow any better?

So I badmouthed Snakemake , but how about the next best thing, NextFlow ? Frankly, I
think NextFlow is worse than Snakemake .




At least with Snakemake , I understand the principles, and I struggle with the rules getting in
the way. On the other hand, NextFlow seems even less approachable. It asks me to learn
another programming language, Groovy , as if having to deal with bash , UNIX , Python ,
and R weren't enough already.

With Nextflow I can't even get the most direct pipeline to work in a reasonable amount of
time. When I look for examples, most are either overly simplistic and trivial or suddenly
become indecipherably complicated. For example, here is an "official" RNA-Seq workflow
according to NextFlow .

• https://github.com/nf-core/rnaseq/blob/master/workflows/rnaseq.nf

The pipeline above stretches over seven hundred fifty-five lines that, in turn, include many
other files. It is difficult to overstate how intractable, incomprehensible, and hopelessly
convoluted the process presented in that "pipeline" is.

5.3.2 Accidental complexities

The most significant flaw of many automation approaches is that they impose yet another
level of abstraction, making it even more challenging to understand what is going on.

To be effective with the new workflow engine, you first need to thoroughly understand the
design principles of the pipeline software itself on top of understanding bioinformatics
protocols.

Sometimes I feel that automation software takes on a life of its own to justify its existence,
rapidly gaining seemingly gratuitous features and functionalities that are a burden to deal with.

Let me reiterate. Learn to use Makefiles and achieve everything you want.

5.3.3 Automation software

The state of workflow engines in bioinformatics is best captured by the XKCD comic.



5.3.4 Beyond Make

The following is a list of workflow engines designed explicitly for bioinformatics analysis:

The main differences between the platforms are their distinct approaches to the different
requirements of automation:

1. Reusability: how to write instructions that we can reuse in different contexts.
2. Dependency management: how to skip stages that are already completed.
3. Parallelism: how to run multiple identical analyses in parallel.
4. Computational platform management: how to run analyses on different hardware platforms.

Snakemake

Additional links:

• Home page: https://snakemake.github.io/


• Catalog: https://snakemake.github.io/snakemake-workflow-catalog/




NextFlow

• Home page: https://www.nextflow.io/


• Catalog: https://nf-co.re/

Big Data Script

bds is intended as a scripting language for big data pipelines.

Homepage: https://pcingola.github.io/bds/

Ruffus & CGATCore

One of the first bioinformatics-oriented workflow engines:

• Ruffus home page: https://ruffus.readthedocs.io/

CGAT core is software that runs the Ruffus-enabled workflows on different computational
platforms: clusters, AWS, etc.

• CGAT core: https://cgat-core.readthedocs.io/en/latest/index.html

BPipe

• Home page: http://docs.bpipe.org/

GenPipes

• Home page: https://genpipes.readthedocs.io/




Engine comparisons

A GitHub repository offers a comparison of workflow engines:

• https://github.com/GoekeLab/bioinformatics-workflows

Workflow managers provide an easy and intuitive way to simplify pipeline development.

Here, we provide basic proof-of-concept implementations for selected workflow managers. The analysis workflow is based on a small portion of an RNA-seq pipeline, using fastqc for quality controls and salmon for transcript quantification.

These implementations are designed for basic illustrations. Workflow managers provide many more powerful features than what we use here; please visit the official documentation to explore those in detail.



5.4 Makefiles vs Snakefiles

This chapter will demonstrate the various tradeoffs via a practical example.

We will build a short-read aligner pipeline with three different methods: a bash script, a Makefile , and a Snakemake file. We will use the bwa and samtools tools.

All our code assumes that you have obtained the Ebola reference genome for accession
AF086833 and the sequencing reads deposited in the SRA for accession SRR1972739 . The
code below initializes the data:

# Bash strict mode.
set -uex

# Make the directories for the files.
mkdir -p reads refs

# Get the Ebola Mayinga reference genome.
bio fetch AF086833 -format fasta > refs/AF086833.fa

# Get the sequencing data for the 2014 outbreak.
fastq-dump -F -X 10000 -O reads --split-3 SRR1972739

5.4.1 1. The code as a bash script

First, we'll create our pipeline as a simple bash script, written in a reusable manner. The first step in that process is factoring inputs and outputs into variables listed at the start.

# Run bash in strict mode.
set -uex

# The reference genome.
REF=refs/AF086833.fa

# The input reads.
R1=reads/SRR1972739_1.fastq
R2=reads/SRR1972739_2.fastq

# The alignment file.
BAM=bam/SRR1972739.bam

# Index the reference genome.
bwa index $REF

# Make the directory for the BAM file.
mkdir -p bam

# Align the reads to the reference genome.
bwa mem -t 8 ${REF} ${R1} ${R2} | samtools sort > ${BAM}

# Generate a report on the BAM file.
samtools flagstat ${BAM}

To execute the script, run the following:

bash align.sh

Let's list the positives and negatives.

Positives: Every command is explicitly defined and executed in order. We can see what is happening. We can provide someone with a script that contains all the information necessary to run the pipeline.

Negatives: Every step gets executed, even those that do not need to be run multiple times. The script is not re-entrant. For example, genome indexing may take a long time and needs to be run only once.

When running the code a second time, we would want to comment out the genome indexing step. As we collect more steps, we may need to comment out various sections or maintain separate scripts.

That being said, the script, as written above, is an excellent way to get started. It has minimal
cognitive overhead and allows you to focus on progress. Many Ph.D. theses are written in this
way.

5.4.2 2. The code as a Makefile

Next, let's build a Makefile version of the pipeline based on the bash script. A potential
solution could look like this:

# The reference genome.
REF = refs/AF086833.fa

# The input reads.
R1 = reads/SRR1972739_1.fastq
R2 = reads/SRR1972739_2.fastq

# The alignment file.
BAM = bam/SRR1972739.bam

# The BWA index file.
IDX = ${REF}.amb

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
.ONESHELL:
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage.
usage:
> @echo "make index align"

# Rule to create the BWA index.
${IDX}: ${REF}
> bwa index ${REF}

# Rule to create the BAM file.
${BAM}: ${IDX}
> mkdir -p bam
> bwa mem -t 8 ${REF} ${R1} ${R2} | samtools sort > ${BAM}

# The index target depends on the IDX file.
index: ${IDX}
> ls -l ${IDX}

# The align target depends on the BAM file.
align: ${BAM}
> samtools flagstat ${BAM}

# Remove created files.
clean:
> rm -f ${BAM} ${IDX}

We can invoke the resulting Makefile as:

make align

The Makefile approach is my favorite. It strikes the ideal balance between convenience,
automation, and cognitive overhead.

Positives: The best thing about the Makefile is that the commands are identical to those in a
shell script. You can run a command in the shell, copy it to the file, and vice versa.




Then you don't have to set up dependencies if you don't want to. You can keep index and
align targets independent of one another. Makefiles can grow with you.

You can add the optional dependency management. In the case above, running make align
when the alignment is already present will not re-run the alignment; it just reports that it is
already done.

Negatives: There is an overhead in understanding and interpreting the rules. For example, ${BAM}: ${IDX} means that the ${BAM} target will first require that the ${IDX} target is present, so the rule for ${IDX} is executed first. Our eyes need to be trained to recognize which patterns are commands and which are targets or other rules.

Makefiles are reusable!

One of my favorite features of Makefiles is quickly passing named parameters. For example, to write the alignment into a different BAM file, I can just run the following:

make align BAM=bam/results.bam

Finally, you can provide a cleanup rule that you can run.

make clean

to remove the files created by the Makefile .

5.4.3 3. The code as a Snakefile

To run Snakemake, you must have the snakemake command installed. The full install, mamba install snakemake , requires many packages (more than 130). There is a slimmed-down version of snakemake that can be installed as:

mamba install snakemake-minimal

Now let's translate our script into a Snakefile . Note how much longer the same script is and
how many more rules, quotations, formatting, commas, and indentation it needs. Yet by the
end, it recapitulates the exact commands already shown in the Makefile :

# The reference genome.
REF = "refs/AF086833.fa"

# The BWA index file.
IDX = REF + ".amb"

# The input reads.
R1 = "reads/SRR1972739_1.fastq"
R2 = "reads/SRR1972739_2.fastq"

# The alignment file.
BAM = "bam/SRR1972739.bam"

# Print usage information.
rule usage:
    shell:
        "echo 'Usage: snakemake -c1 index align'"

# Indexing rule.
rule index:
    input:
        REF,
    output:
        IDX,
    shell:
        "bwa index {input}"

# Alignment rule.
rule align:
    input:
        IDX, R1, R2
    output:
        BAM,
    shell:
        "bwa mem {REF} {R1} {R2} | samtools sort > {output}"

# Removes created files.
rule clean:
    shell:
        "rm -rf {BAM} {IDX}"

Any error you make in the indentation will lead to very hard-to-debug errors. I believe that the Snakefile is substantially harder to read and understand. The above Snakefile can be run as:

snakemake -c1 align

Then run:

snakemake -c1 clean




When you run snakemake , you will notice that the output is quite lengthy by default. It generates lots and lots of messages that clog your terminal.

We can instruct Snakemake to print the commands it is planning to execute without executing
them using the flags --dryrun ( -n ) and --printshellcmds ( -p ).

snakemake -c1 align -np

You can decipher the commands that snakemake will execute among the many lines that scroll
by. Though far from ideal, the above command can help us understand the commands
executed via Snakefiles.

Positives: The patterns map across the steps in a natural manner. We mark what is input and what is output. Snakemake can manage the runtime environment as well (not shown here).

Negatives: A substantial cognitive overhead is associated with writing the rules. Sometimes it can be hair-raisingly challenging to debug problems. I have spent countless hours debugging Snakefiles, unable to achieve what I wanted, and giving up.

Snakefile parameters

To pass external configuration parameters into a Snakefile, we need to write the parameters a
little differently. For example, to provide both a default value and an externally changeable
value for the REF parameter, we need to write:

REF = config.get("REF") or "refs/genome.fa"

Then during invocation, we can write:

snakemake -c1 align --config REF=refs/othergenome.fa

It is a bit more convoluted than the Makefile but not substantially more.

There are many more ways to pass parameters into Snakefiles, and it is one of its defining
features. That said, I find the many methods challenging to understand and follow.



5.5 Snakemake patterns

For a long time, I thought of Snakemake as I do of make : as a dependency management and command execution engine. Today I think it is best to evaluate it as a pattern matching engine that generates and launches tasks based on various patterns in the data names.

Understanding the above has helped me better understand what Snakemake actually does.

5.5.1 Stating the problem

Imagine that we have multiple samples, A , B , C , and D with two files, R1 and R2 , for
each sample. Our file names may look like this.

• A_R1.fq , A_R2.fq
• B_R1.fq , B_R2.fq
• C_R1.fq , C_R2.fq
• D_R1.fq , D_R2.fq

From each pair, we want to generate a new file:

• A.bam
• B.bam
• C.bam
• D.bam

Where the reads from sample A are aligned into A.bam , the reads from sample B are
aligned into B.bam and so on. Basically the commands we wish to run are:

REF=refs/genome.fa

bwa mem -t 8 $REF A_R1.fq A_R2.fq | samtools sort > A.bam
bwa mem -t 8 $REF B_R1.fq B_R2.fq | samtools sort > B.bam
bwa mem -t 8 $REF C_R1.fq C_R2.fq | samtools sort > C.bam
bwa mem -t 8 $REF D_R1.fq D_R2.fq | samtools sort > D.bam

We could write out the above in a script, and that would work. But imagine that we have more than four samples, perhaps dozens, or that later we want to change all the lines to something else. Handling all those changes by text editor is not very practical and is evidently error-prone.

Instead, we need a way to generate identical commands just by knowing the sample names:

A
B
C
D

Another way to say it is that we wish to write a program that, in turn, generates the command for each sample. Basically, we need a smart program that writes us another, more repetitive program. This fundamental concept of automation can be summarized as:

Write programs that write other programs.

Both parallel and Snakemake can help us with this automation, but at wildly different
levels of abstraction.

5.5.2 Automation with parallel

In our opinion, the simplest and most generic way to automate tasks is using the GNU parallel tool.
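A sketch of how the ids.txt file used below might be created, one sample name per line:

printf "A\nB\nC\nD\n" > ids.txt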

cat ids.txt | parallel echo "Hello pattern {}"

will print:

Hello pattern A
Hello pattern B
Hello pattern C
Hello pattern D

Thus generating the alignment commands we wanted is as simple as writing:

cat ids.txt | parallel "bwa mem -t 8 $REF {}_R1.fq {}_R2.fq | samtools sort > {}.bam"

where the commands that get executed are:



5.5.3 Automation with Snakemake

bwa mem -t 8 refs/genome.fa A_R1.fq A_R2.fq | samtools sort > A.bam
bwa mem -t 8 refs/genome.fa B_R1.fq B_R2.fq | samtools sort > B.bam
bwa mem -t 8 refs/genome.fa C_R1.fq C_R2.fq | samtools sort > C.bam
bwa mem -t 8 refs/genome.fa D_R1.fq D_R2.fq | samtools sort > D.bam

See the Art of Bioinformatics Scripting for more details on using GNU parallel.

5.5.3 Automation with Snakemake

The automation that we have achieved via:

cat ids.txt | parallel "bwa mem -t 8 $REF {}_R1.fq {}_R2.fq | samtools sort > {}.bam"

would also be possible with Snakemake by using patterns; it is just that the method is very verbose and requires a lot of boilerplate code. The code below generates the same commands as the single line above:

# The reference genome.
REF = "refs/genome.fa"

# The sample names.
SAMPLES = [ "A", "B", "C", "D" ]

# Generate the pattern for the BAM files.
BAM = expand("bam/{patt}.bam", patt=SAMPLES)

# Running bwa mem.
rule bwa_mem:
    input:
        REF,
        "reads/{sample}_R1.fq",
        "reads/{sample}_R2.fq",
    output:
        "bam/{sample}.bam",
    shell:
        "bwa mem {input} | samtools sort > {output}"

# Alignment rule.
rule align:
    input: BAM,




In the example above, we demonstrate the pattern matching rules in Snakemake. Namely, we
list the desired BAM files with:

# The sample names.
SAMPLES = [ "A", "B", "C", "D" ]

# Generate the pattern for the BAM files.
BAM = expand("bam/{patt}.bam", patt=SAMPLES)

and later in the bwa_mem rule, we describe the output as:

rule bwa_mem:
    input:
        REF,
        "reads/{sample}_R1.fq",
        "reads/{sample}_R2.fq",
    output:
        "bam/{sample}.bam",
    shell:
        "bwa mem {input} | samtools sort > {output}"

It is the pattern listed as {sample} where a lot of implicit magic happens and where most of your frustrations will come from later on.

There are many additional subtle rules and internal mechanisms at play.

Those are convenient when they work like magic but can be extraordinarily frustrating to deal
with when they don't. Most rules are neither obvious nor well explained in how and why they
work.

I have had lots of frustrations debugging Snakemake files, especially since I knew of much simpler ways to achieve the same results.



5.6 Alternatives

5.6.1 Breseq: Bacterial genome analysis

The complexities of managing interconnecting tools have led scientists to develop fully
automated solutions targeted at very specific use cases.

Many times these workflows are "black boxes", with users having little control over and understanding of what takes place inside the box. The tacit agreement is that once you accept the constraints of the methodology, the software promises to produce informative results in an easy-to-comprehend visual format.

According to its description, breseq is:

A computational pipeline for the analysis of short-read re-sequencing data (e.g. Illumina,
454, IonTorrent, etc.). It uses reference-based alignment approaches to predict mutations in a
sample relative to an already sequenced genome. breseq is intended for microbial genomes
(<10 Mb) and re-sequenced samples that are only slightly diverged from the reference
sequence (<1 mutation per 1000 bp).

GitHub: https://github.com/barricklab/breseq

Docs: https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/
documentation/

Running breseq (the full command is shown below) will produce a directory called results that contains a fairly large number of files with a wide variety of information. It all looks very impressive! If you want to be amazed, check out the file located at results/output/index.html . Here is a copy of that file accessible via the web:

• results/output/index.html




How to install breseq

# Create a new environment.
conda create -n breseq

# Activate the environment.
conda activate breseq

# Install breseq.
mamba install breseq sra-tools -y

# Install the bio package.
pip install bio --upgrade

OBTAIN THE DATA

Having activated the environment, create a new directory for the analysis and switch to it. We will still use the Biostar Workflow modules to obtain the reference genome and the reads.

# Create the working directory.
mkdir -p work

# Switch to the working directory.
cd work

# Download the reference genome.
make -f src/run/genbank.mk fasta ACC=NC_012967

# Download the reads for SRR030257.
make -f src/run/sra.mk get SRR=SRR030257 N=1000000

RUN BRESEQ

Having obtained the reference genome and the reads, we can run breseq on our data.

breseq -j 4 -p -o results -r refs/NC_012967.gb reads/SRR030257_1.fq reads/SRR030257_2.fq

We can also learn a bit about the process as we watch the alignments go by in the console. We
notice that breseq runs bowtie2 behind the scenes, and it seems to do so in an iterative
manner while using a large number of tuning parameters that look like this:

bowtie2 -t --no-unal -p 4 -L 18 --ma 1 --mp 3 --np 0 --rdg 2,3 --rfg 2,3 \
    --ignore-quals --local -i S,1,0.25 --score-min L,1,0.9 -k 2000 --reorder




Look up some of these parameters to learn more about how to tweak bowtie2 when looking for alignments. But note how we are at the full "mercy" of what the original authors of breseq decided for us.

WHAT DOES BRESEQ DO?

Above, we ran breseq , and among the many files it produces there is a VCF file located at results/output/output.vcf . A nearly identical variant file could have been created with a more generic workflow, like the one we presented in this book.

The few differences that we see are solely a matter of filtering the VCF file based on various
parameters. It is a matter of choosing what to trust.

If the VCF files are the same, and we can get that result with a Makefile , then what does breseq really do?

The most important result that breseq produces is the summary of mutations in the file results/output/index.html . That file is the main distinguishing feature of breseq and one of the primary reasons anyone uses it. It is an amazingly effective display and summary that cross-references mutations with alignments and with supporting evidence. It allows end users to investigate any mutation with ease. It is a significant scientific achievement! In addition, breseq also offers structural variation (junction) predictions and other valuable utilities.

Alas, it is all baked into the entire pipeline. We can't run any of its valuable subfunctions on different data, even if we had all the necessary information at our disposal. breseq runs only if everything is named and organized just as breseq expects, and that can be extraordinarily constraining.

There is no question that breseq is an extremely valuable tool that facilitates discoveries. As long as you need to study bacterial genomes that fit the use case, breseq is a boon!

Yet we can't help but wonder why the system was not designed in a more modular fashion. There is a universal downside to most pipelines built to be "easy-to-use" black boxes. The pipeline exists as a single, monolithic entity. And monoliths are like dinosaurs: they can't evolve to adapt to changing needs.



5.6.2 Nextstrain: Viral pathogen evolution

(TODO: currently incomplete)

According to the authors, Nextstrain is:

A pathogen surveillance tool to virologists, epidemiologists, public health officials, and community scientists.

GitHub: https://github.com/nextstrain

Docs: https://docs.nextstrain.org/en/latest/index.html

What is nextstrain, primarily? It is a visualization platform. And with that, we note the emerging commonality between these bespoke workflows: standard software tools produce data that is so difficult to interpret that monuments of software need to be built to help with the process.

Install nextstrain

• https://docs.nextstrain.org/en/latest/

Run nextstrain

