Getting Started With HISAT, StringTie, and Ballgown

DAVE TANG'S BLOG

COMPUTATIONAL BIOLOGY AND GENOMICS

Bioinformatics | Davo | October 25, 2017
A popular toolset used for analysing RNA-seq data is the tuxedo suite, which consists of TopHat and Cufflinks. The
suite provided a start-to-finish pipeline that allowed users to map reads, assemble transcripts, and perform
differential expression analyses. A newer “tuxedo suite” has been developed and is made up of three tools: HISAT,
StringTie, and Ballgown. A Nature Protocols article provides a summary of the new suite as well as a tutorial; this
post was written while I was going through the tutorial.

I worked through the tutorial on a MacBook Pro, which means that I downloaded binaries for OS X. If you’re using
some flavour of Linux, download the Linux binaries instead. The data for the tutorial is available at
ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol; you can perform a recursive download using wget to download all
the files on the FTP server. You can use your own data, but you’ll have to index the relevant reference file and
prepare your own sample text file. For this post, I used the same data as the tutorial.

# recursive download
wget -c -r ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol

# move the data tarball to directory root
mv ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz .

# extract
tar xzf chrX_data.tar.gz

# check out the directory structure
tree --charset=ascii chrX_data
chrX_data
|-- genes
|   `-- chrX.gtf
|-- genome
|   `-- chrX.fa
|-- geuvadis_phenodata.csv
|-- indexes
|   |-- chrX_tran.1.ht2
|   |-- chrX_tran.2.ht2
|   |-- chrX_tran.3.ht2
|   |-- chrX_tran.4.ht2
|   |-- chrX_tran.5.ht2
|   |-- chrX_tran.6.ht2
|   |-- chrX_tran.7.ht2
|   `-- chrX_tran.8.ht2
|-- mergelist.txt
`-- samples
    |-- ERR188044_chrX_1.fastq.gz
    |-- ERR188044_chrX_2.fastq.gz
    |-- ERR188104_chrX_1.fastq.gz
    |-- ERR188104_chrX_2.fastq.gz
    |-- ERR188234_chrX_1.fastq.gz
    |-- ERR188234_chrX_2.fastq.gz
    |-- ERR188245_chrX_1.fastq.gz
    |-- ERR188245_chrX_2.fastq.gz
    |-- ERR188257_chrX_1.fastq.gz
    |-- ERR188257_chrX_2.fastq.gz
    |-- ERR188273_chrX_1.fastq.gz
    |-- ERR188273_chrX_2.fastq.gz
    |-- ERR188337_chrX_1.fastq.gz
    |-- ERR188337_chrX_2.fastq.gz
    |-- ERR188383_chrX_1.fastq.gz
    |-- ERR188383_chrX_2.fastq.gz
    |-- ERR188401_chrX_1.fastq.gz
    |-- ERR188401_chrX_2.fastq.gz
    |-- ERR188428_chrX_1.fastq.gz
    |-- ERR188428_chrX_2.fastq.gz
    |-- ERR188454_chrX_1.fastq.gz
    |-- ERR188454_chrX_2.fastq.gz
    |-- ERR204916_chrX_1.fastq.gz
    `-- ERR204916_chrX_2.fastq.gz

4 directories, 36 files

A description of the data set is provided by geuvadis_phenodata.csv. Normally, you will have to prepare this file
yourself; it will be used later in the Ballgown step.

cat chrX_data/geuvadis_phenodata.csv
"ids","sex","population"
"ERR188044","male","YRI"
"ERR188104","male","YRI"
"ERR188234","female","YRI"
"ERR188245","female","GBR"
"ERR188257","male","GBR"
"ERR188273","female","YRI"
"ERR188337","female","GBR"
"ERR188383","male","GBR"
"ERR188401","male","GBR"
"ERR188428","female","GBR"
"ERR188454","male","YRI"
"ERR204916","female","YRI"

Now let’s download the programs; have a look at the HISAT2 page to find the appropriate binary to download. I
like to download programs in a src directory and link them to a bin directory, which is in my PATH.

# for OS X
cd ~/src
wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-OSX_x86_64.zip
unzip hisat2-2.1.0-OSX_x86_64.zip

# provide link to binaries in my bin directory
cd ~/bin/
ln -s ~/src/hisat2-2.1.0/hisat2* .
# some files were already linked
ln -s ~/src/hisat2-2.1.0/*.py .
ln: ./hisat2_extract_exons.py: File exists
ln: ./hisat2_extract_snps_haplotypes_UCSC.py: File exists
ln: ./hisat2_extract_snps_haplotypes_VCF.py: File exists
ln: ./hisat2_extract_splice_sites.py: File exists
ln: ./hisat2_simulate_reads.py: File exists

Again, take a look at the StringTie page to find the appropriate binary to download.

# for OS X
cd ~/src
wget -c http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.3b.OSX_x86_64.tar.gz
tar xzf stringtie-1.3.3b.OSX_x86_64.tar.gz

# provide link to binary in my bin directory
cd ~/bin/
ln -s ~/src/stringtie-1.3.3b.OSX_x86_64/stringtie

The gffcompare tool needs to be compiled.

cd ~/src/
git clone https://github.com/gpertea/gclib
git clone https://github.com/gpertea/gffcompare
cd gffcompare
make release

# link again
cd ~/bin/
ln -s ~/src/gffcompare/gffcompare

Download SAMtools from http://www.htslib.org/download/ and compile.

# unzip and compile
tar xjf samtools-1.6.tar.bz2
cd samtools-1.6
./configure
make

# link samtools
cd ~/bin
ln -s ~/src/samtools-1.6/samtools

Ballgown is a Bioconductor package, so we need to install that using R. While we are at it, we will install various
dependencies too.

install.packages("devtools")
install.packages("dplyr")

source("https://www.bioconductor.org/biocLite.R")
biocLite(c("alyssafrazee/RSkittleBrewer", "ballgown", "genefilter"))

Now that we have downloaded and prepared all the required programs, we can start the analysis!

Mapping
Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the
reference genome. The indices are provided in the data folder but let’s create them again.

mkdir my_index
cd my_index

# use the Python scripts to extract splice-site and exon information from a gene annotatio
extract_splice_sites.py ../chrX_data/genes/chrX.gtf > chrX.ss
extract_exons.py ../chrX_data/genes/chrX.gtf > chrX.exon

head -3 chrX.ss
chrX 276393 281481 +
chrX 281683 284166 +
chrX 284313 288732 +

head -3 chrX.exon
chrX 276323 276393 +
chrX 281393 281683 +
chrX 284166 284313 +

# now to build the index
# the --ss and --exon options can be omitted if annotation data is not available
time hisat2-build -p 8 --ss chrX.ss --exon chrX.exon ../chrX_data/genome/chrX.fa chrX_tran
# screen output not shown to save space
Total time for call to driver() for forward index: 00:03:34

real 3m33.870s
user 10m10.778s
sys 1m9.074s

ls -1
chrX.exon
chrX.fa
chrX.ss
chrX_tran.1.ht2
chrX_tran.2.ht2
chrX_tran.3.ht2
chrX_tran.4.ht2
chrX_tran.5.ht2
chrX_tran.6.ht2
chrX_tran.7.ht2
chrX_tran.8.ht2

Despite creating our own indices, we’ll use the ones provided by the tutorial for reproducibility’s sake. From
geuvadis_phenodata.csv we saw that there are 12 samples; each sample has two FASTQ files since this is
paired-end data. Let’s start the mapping.

# create directory to store mapping results
mkdir map

# map each sample using 8 threads
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fas
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fas

# mapping took around two and a half minutes
# real 2m36.509s
# user 15m17.815s
# sys 3m29.939s
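
The twelve commands differ only in the sample ID, so a loop over the IDs in geuvadis_phenodata.csv can generate them. The sketch below is one possible way to do that, not the tutorial's own script: the -2 and -S arguments and the map/<id>_chrX.sam output names are assumptions (the listing above is cut off after the -1 argument), and run, sample_ids, map_sample, and the DRY_RUN switch are hypothetical helpers. By default it only prints the commands.

```shell
# Sketch: build one hisat2 command per sample ID found in the phenotype table.
# ASSUMPTIONS (not in the original post): the -2 and -S arguments, the
# map/<id>_chrX.sam output names, and the helpers run, sample_ids, map_sample.

# Print a command; execute it as well unless DRY_RUN=1 (the default here).
run() {
  echo "$*"
  if [ "${DRY_RUN:-1}" != "1" ]; then "$@"; fi
}

# Print the sample IDs: first CSV column, header skipped, quotes stripped.
sample_ids() {
  tail -n +2 "$1" | cut -d, -f1 | tr -d '"'
}

# Map the paired-end reads of one sample with 8 threads.
map_sample() {
  id="$1"
  run hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran \
    -1 "chrX_data/samples/${id}_chrX_1.fastq.gz" \
    -2 "chrX_data/samples/${id}_chrX_2.fastq.gz" \
    -S "map/${id}_chrX.sam"
}

pheno=chrX_data/geuvadis_phenodata.csv
if [ -f "$pheno" ]; then
  mkdir -p map
  for id in $(sample_ids "$pheno"); do
    map_sample "$id"
  done
fi
```

Setting DRY_RUN=0 in the environment makes the same loop execute each command after printing it.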

Store only sorted BAM (or CRAM) files and delete the SAM files after conversion.

# sort mapping results using SAMtools on 8 threads
samtools sort -@ 8 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam
samtools sort -@ 8 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam
samtools sort -@ 8 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam
samtools sort -@ 8 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam
samtools sort -@ 8 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam
samtools sort -@ 8 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam
samtools sort -@ 8 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam
samtools sort -@ 8 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam
samtools sort -@ 8 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam
samtools sort -@ 8 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam
samtools sort -@ 8 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam
samtools sort -@ 8 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam

# remove SAM files
rm map/*.sam

# sorting and converting took just over a minute
# real 1m14.533s
# user 5m44.637s
# sys 0m9.590s
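
A blanket rm map/*.sam also deletes SAM files whose sort failed. One cautious alternative, sketched below, derives each BAM name from its SAM name and removes a SAM file only after samtools sort has exited successfully for it. The helpers bam_name and sort_and_clean and the DRY_RUN switch are hypothetical names introduced for this illustration; with the default DRY_RUN=1 the commands are printed rather than run.

```shell
# Sketch: sort every SAM file in map/ and delete it only if sorting succeeded.
# ASSUMPTION: bam_name, sort_and_clean and DRY_RUN are hypothetical helpers,
# not part of SAMtools or the original post.

# Derive the BAM path from a SAM path by swapping the file extension.
bam_name() {
  printf '%s\n' "${1%.sam}.bam"
}

sort_and_clean() {
  sam="$1"
  bam="$(bam_name "$sam")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "samtools sort -@ 8 -o $bam $sam && rm $sam"
  else
    # rm runs only when samtools exits successfully
    samtools sort -@ 8 -o "$bam" "$sam" && rm "$sam"
  fi
}

for sam in map/*.sam; do
  [ -e "$sam" ] || continue   # nothing to do when the pattern matches no file
  sort_and_clean "$sam"
done
```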

Assembly
Now we need to assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without
annotation; as noted in the protocol, annotation can be helpful when the number of reads for a transcript is too
low for an accurate assembly.

# store assembly results in a new directory
mkdir assembly

# create assembly per sample using 8 threads
stringtie map/ERR188044_chrX.bam -l ERR188044 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188104_chrX.bam -l ERR188104 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188234_chrX.bam -l ERR188234 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188245_chrX.bam -l ERR188245 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188257_chrX.bam -l ERR188257 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188273_chrX.bam -l ERR188273 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188337_chrX.bam -l ERR188337 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188383_chrX.bam -l ERR188383 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188401_chrX.bam -l ERR188401 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188428_chrX.bam -l ERR188428 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR188454_chrX.bam -l ERR188454 -p 8 -G chrX_data/genes/chrX.gtf -o assembly
stringtie map/ERR204916_chrX.bam -l ERR204916 -p 8 -G chrX_data/genes/chrX.gtf -o assembly

# assembly and quantification took a minute and a half
# real 1m30.893s
# user 1m58.455s
# sys 0m9.860s

# before merging we need to modify mergelist.txt
# this is because I created a new directory to store the results
# the modified mergelist.txt should look like this
cat chrX_data/mergelist.txt
assembly/ERR188044_chrX.gtf
assembly/ERR188104_chrX.gtf
assembly/ERR188234_chrX.gtf
assembly/ERR188245_chrX.gtf
assembly/ERR188257_chrX.gtf
assembly/ERR188273_chrX.gtf
assembly/ERR188337_chrX.gtf
assembly/ERR188383_chrX.gtf
assembly/ERR188401_chrX.gtf
assembly/ERR188428_chrX.gtf
assembly/ERR188454_chrX.gtf
assembly/ERR204916_chrX.gtf

# merge all transcripts from the different samples
stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/merge

# check out the transcripts
cat stringtie_merged.gtf | head
# stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mer
# StringTie version 1.3.3b
chrX StringTie transcript 322514 323718 1000 . . gene_id "M
chrX StringTie exon 322514 323718 1000 . . gene_id "MSTRG.1";
chrX StringTie transcript 319145 321319 1000 + . gene_id "M
chrX StringTie exon 319145 321319 1000 + . gene_id "MSTRG.2";
chrX StringTie transcript 319145 321319 1000 + . gene_id "M
chrX StringTie exon 319145 319551 1000 + . gene_id "MSTRG.2";
chrX StringTie exon 320208 321319 1000 + . gene_id "MSTRG.2";
chrX StringTie transcript 304750 318701 1000 - . gene_id "M

# how many transcripts?
cat stringtie_merged.gtf | grep -v "^#" | awk '$3=="transcript" {print}' | wc -l
3491
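
Rather than editing mergelist.txt by hand, the list of per-sample GTF files can be generated from whatever is in the assembly directory. This is a minimal sketch under the assumption that every .gtf file in that directory is a per-sample assembly that should be merged; write_mergelist is a hypothetical helper, not a StringTie feature.

```shell
# Sketch: write one line per GTF file found in a directory.
# ASSUMPTION: write_mergelist is a hypothetical helper, and every *.gtf file
# in the directory is a per-sample assembly that belongs in the merge.
write_mergelist() {
  dir="$1"
  out="$2"
  : > "$out"                    # create or truncate the list
  for gtf in "$dir"/*.gtf; do
    [ -e "$gtf" ] || continue   # skip the unexpanded pattern when nothing matches
    printf '%s\n' "$gtf" >> "$out"
  done
}

# Regenerate the list only when both directories from the post exist.
if [ -d assembly ] && [ -d chrX_data ]; then
  write_mergelist assembly chrX_data/mergelist.txt
fi
```

Because the paths are written relative to the current directory, the stringtie --merge step has to be run from the same place.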

Let’s compare the StringTie transcripts to known transcripts using gffcompare.

# compare the assembled transcripts to known transcripts
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf

cat merged.stats
# gffcompare v0.10.1 | Command line was:
#gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
#

#= Summary for dataset: stringtie_merged.gtf
#     Query mRNAs :    3281 in    1521 loci  (2651 multi-exon transcripts)
#            (535 multi-transcript loci, ~2.2 transcripts per locus)
# Reference mRNAs :    2102 in    1086 loci  (1856 multi-exon)
# Super-loci w/ reference transcripts:      998
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    77.6    |
        Exon level:   100.0     |    85.4    |
      Intron level:    99.8     |    91.0    |
Intron chain level:    99.6     |    69.7    |
  Transcript level:    99.6     |    63.8    |
       Locus level:   100.0     |    70.9    |

     Matching intron chains:    1848
       Matching transcripts:    2094
              Matching loci:    1086

          Missed exons:       0/8804  (  0.0%)
           Novel exons:     971/10608 (  9.2%)
        Missed introns:      14/7946  (  0.2%)
         Novel introns:     219/8714  (  2.5%)
           Missed loci:       0/1086  (  0.0%)
            Novel loci:     421/1521  ( 27.7%)

 Total union super-loci across all input datasets: 1521
3281 out of 3281 consensus transcripts written in merged.annotated.gtf (0 discarded as red

The high sensitivity means that almost all of the known transcripts are matched by a StringTie transcript, i.e. there
are few false negatives. The precision is much lower, indicating that many of the StringTie transcripts are not in
the list of known transcripts; these are either false positives or truly de novo transcripts. The novel exons,
introns, and loci indicate how many of those features were not found in the list of known transcripts.
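
As a rough check on how to read these numbers, the transcript-level and intron-chain-level figures in merged.stats can be reproduced from counts in the same file: sensitivity as matches divided by reference features, and precision as matches divided by query features. This is a back-of-the-envelope sketch rather than gffcompare's actual procedure; percent is a hypothetical helper, and the tool's matching rules are more involved (the locus-level precision of 70.9, for instance, is not simply 1086/1521).

```shell
# Sketch: recompute two rows of merged.stats from the counts it reports.
# ASSUMPTION: percent is a hypothetical helper; the ratios are a simplified
# reading of the report, not gffcompare's internal calculation.

# percent NUM DEN prints NUM/DEN as a percentage with one decimal place.
percent() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Transcript level: 2094 matching, 2102 reference mRNAs, 3281 query mRNAs.
echo "Transcript sensitivity:   $(percent 2094 2102)"   # 99.6
echo "Transcript precision:     $(percent 2094 3281)"   # 63.8

# Intron chain level: 1848 matching, 1856 and 2651 multi-exon transcripts.
echo "Intron chain sensitivity: $(percent 1848 1856)"   # 99.6
echo "Intron chain precision:   $(percent 1848 2651)"   # 69.7
```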

All known transcripts were assembled by StringTie, including a few novel ones.

Now that we have our assembled transcripts, we can estimate their abundances.

stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf map/
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf map/

# estimation took just over a minute and a half
# real 1m39.661s
# user 2m0.179s
# sys 0m9.223s

# check out the files
ls -1 ballgown/ERR188044
ERR188044_chrX.gtf
e2t.ctab
e_data.ctab
i2t.ctab
i_data.ctab
t_data.ctab
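
Before moving to R, it can be worth confirming that every sample directory contains the five .ctab tables shown in the listing above. The following is a small sketch of such a check; missing_ctab is a hypothetical helper, and the ballgown/<sample>/ layout is taken from the -o paths used in this post.

```shell
# Sketch: report Ballgown input tables missing from a sample directory.
# ASSUMPTION: missing_ctab is a hypothetical helper; the five file names are
# the ones listed in the ls output above.
missing_ctab() {
  for f in e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab; do
    [ -e "$1/$f" ] || echo "$1/$f"
  done
}

# Check every sample directory under ballgown/; no output means nothing is missing.
for dir in ballgown/*/; do
  [ -d "$dir" ] || continue   # skip the unexpanded pattern when nothing matches
  missing_ctab "${dir%/}"
done
```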

Differential expression
To perform the expression analyses, we need to use R and Ballgown; I recommend using RStudio. To get started,
load the required libraries and the data.

library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)

# change this to the directory that contains all the StringTie results
setwd("~/muse/tuxedo")

# load the sample information
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")

# create a ballgown object
bg_chrX <- ballgown(dataDir = "ballgown",
                    samplePattern = "ERR",
                    pData = pheno_data)

class(bg_chrX)
[1] "ballgown"
attr(,"package")
[1] "ballgown"

bg_chrX
ballgown instance with 3491 transcripts and 12 samples

What methods are available for ballgown objects?

methods(class="ballgown")
 [1] dirs       eexpr      expr       expr<-     geneIDs    geneN
 [8] iexpr      indexes    indexes<-  mergedDate pData      pData
[15] seqnames   show       structure  subset     texpr      trans
see '?methods' for accessing help and source code

# we can get the gene, transcript, exon, and intron expression levels using
# gexpr(), texpr(), eexpr(), and iexpr()
head(gexpr(bg_chrX), 2)
         FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.E
MSTRG.1        7.169349       10.42652       13.83639       1.050201       5.677819      1
MSTRG.10      21.428192       13.13144       14.11443      18.454338      10.182308
         FPKM.ERR188383 FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916
MSTRG.1        4.732841      11.424809       5.733899       6.688090       5.061143
MSTRG.10      11.815677       8.196958       9.578302       9.961549      10.997639

head(texpr(bg_chrX), 2)
  FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257 FPKM.ERR18827
1        23.9694       18.49576       39.70492       14.06822       25.51846       23.8477
2         0.0000        0.00000       27.79636       13.96464       44.97094        0.0000
  FPKM.ERR188401 FPKM.ERR188428 FPKM.ERR188454 FPKM.ERR204916
1       28.03131       24.97612        28.2617       20.24706
2       25.81932        0.00000         0.0000        0.00000

Next we filter out transcripts with low variance.

# note that this subset function is not the base R function but a ballgown one
# to see the order in which R looks for functions in packages use search()
# search()
# [1] ".GlobalEnv" "package:bindrcpp" "package:devtools" "package
# [5] "package:genefilter" "package:RSkittleBrewer" "package:ballgown" "tools:r
# [9] "package:stats" "package:graphics" "package:grDevices" "package
# [13] "package:datasets" "package:methods" "Autoloads" "package
#
# the rowVars is from the genefilter package and calculates the row variance
bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

# 1,264 transcripts were filtered out
bg_chrX_filt
ballgown instance with 2227 transcripts and 12 samples

Perform the differential expression analysis using the stattest() function; confounders are specified using the
adjustvars parameter, which has to match a column name in pheno_data. We are testing for transcripts and genes
that are differentially expressed between males and females, hence sex is our covariate of interest. In addition to
testing transcripts and genes, we can also test differential expression at exons and introns; just change the feature
parameter accordingly.

head(pData(bg_chrX_filt), 3)
        ids    sex population
1 ERR188044   male        YRI
2 ERR188104   male        YRI
3 ERR188234 female        YRI

# test on transcripts
results_transcripts <- stattest(bg_chrX_filt,
                                feature="transcript",
                                covariate="sex",
                                adjustvars = c("population"),
                                getFC=TRUE, meas="FPKM")

# results are in a data frame
class(results_transcripts)
[1] "data.frame"

dim(results_transcripts)
[1] 2227    5

head(results_transcripts)
     feature id        fc      pval      qval
1 transcript  1 0.9386481 0.7208669 0.9454480
2 transcript  2 1.2073309 0.8670656 0.9756579
3 transcript  3 1.0058534 0.9964598 0.9997816
4 transcript  4 0.3847566 0.5214029 0.9290666
5 transcript  5 0.6089373 0.3247825 0.9278154
6 transcript  6 0.6449469 0.3062408 0.9253708

table(results_transcripts$qval < 0.05)

FALSE  TRUE
 2215    12

# test on genes
results_genes <- stattest(bg_chrX_filt,
                          feature="gene",
                          covariate="sex",
                          adjustvars = c("population"),
                          getFC=TRUE, meas="FPKM")

class(results_genes)
[1] "data.frame"

dim(results_genes)
[1] 1013    5

table(results_genes$qval<0.05)

FALSE  TRUE
 1002    11

The results_transcripts data frame doesn’t contain any identifiers; we will create a new data frame with this
information.

# the order is the same so we can simply combine the information
results_transcripts <- data.frame(geneNames = geneNames(bg_chrX_filt),
                                  geneIDs = geneIDs(bg_chrX_filt),
                                  results_transcripts)

# now we have the identifiers
head(results_transcripts)
  geneNames geneIDs    feature id        fc      pval      qval
1         . MSTRG.4 transcript  1 0.9386481 0.7208669 0.9454480
2    PLCXD1 MSTRG.4 transcript  2 1.2073309 0.8670656 0.9756579
3         . MSTRG.4 transcript  3 1.0058534 0.9964598 0.9997816
4         . MSTRG.4 transcript  4 0.3847566 0.5214029 0.9290666
5         . MSTRG.5 transcript  5 0.6089373 0.3247825 0.9278154
6    PLCXD1 MSTRG.4 transcript  6 0.6449469 0.3062408 0.9253708

# which transcripts are detected as differentially expressed at qval < 0.05?
results_transcripts %>% filter(qval < 0.05)
   geneNames   geneIDs    feature   id          fc         pval         qval
1     PNPLA4  MSTRG.64 transcript  186 0.592477057 2.119474e-04 4.290970e-02
2          . MSTRG.140 transcript  421 3.141219608 6.096529e-05 1.508552e-02
3      KDM6A MSTRG.255 transcript  734 0.054166544 1.208983e-04 2.692404e-02
4      RPS4X MSTRG.511 transcript 1605 0.598737678 2.560509e-04 4.751878e-02
5       TSIX MSTRG.522 transcript 1648 0.078029979 1.743580e-06 7.765906e-04
6          . MSTRG.523 transcript 1649 0.016057740 3.872369e-10 2.874589e-07
7       XIST MSTRG.523 transcript 1650 0.002997908 1.849406e-10 2.059314e-07
8          . MSTRG.523 transcript 1651 0.030714646 1.360867e-10 2.059314e-07
9          . MSTRG.523 transcript 1652 0.028289665 6.782559e-08 3.776190e-05
10         . MSTRG.605 transcript 1843 7.378759461 1.285917e-05 4.772897e-03
11         . MSTRG.612 transcript 1847 9.154881892 4.889775e-05 1.361191e-02
12         . MSTRG.766 transcript 2333 0.272425415 1.909634e-05 6.075365e-03

Let’s create an MA plot.

library(ggplot2)
library(cowplot)

results_transcripts$mean <- rowMeans(texpr(bg_chrX_filt))

ggplot(results_transcripts, aes(log2(mean), log2(fc), colour = qval<0.05)) +
  scale_color_manual(values=c("#999999", "#FF0000")) +
  geom_point() +
  geom_hline(yintercept=0)

Summary
The new tuxedo package is very fast; I realise that the tutorial only used a small subset of reads that were already
determined to map to chromosome X. Despite this, the mapping and assembly took mere minutes. A recent
benchmark of RNA-seq aligners did demonstrate that HISAT or HISAT2 was the fastest splice-aware mapper out of
14 algorithms. However, HISAT or HISAT2 had a low recall percentage when mapping reads with high complexity,
i.e. more polymorphic sites and higher error rates, on the default settings; mapping accuracy was vastly improved
after tuning the parameters.

I plan to set up a Snakemake pipeline for running the new tuxedo suite and will compare it with other pipelines,
such as this STAR and Cufflinks/RSEM pipeline.

This work is licensed under a Creative Commons Attribution 4.0 International License.

11 COMMENTS

NANDITA
January 31, 2018 at 7:59 am

Hi Dave- I was wondering if you could comment on an observation we made when we ran this pipeline
as described here.

We did an experiment in mouse, knockout vs WT. For alignment we used hisat2, default parameters.
Followed by stringtie, and ballgown. We got a large number of significantly D.E. “transcripts”, but, when
we conducted a gene level analysis, we got barely any D.E. genes. The D.E. transcripts list mostly has the
same gene showing D.E. of different splice forms in each condition. Since we are dealing with the same
tissue, we really don’t expect such a huge splicing effect. I wonder if many of the splice variants could be
mapping artifacts, because, in some cases, I look at the aligned reads in a browser and it shows no
difference between the two samples in terms of # of reads mapped.

DAVO
February 1, 2018 at 12:56 am

Hi Nandita,

I recall that a former colleague had a similar problem to what you are describing, which is the
discrepancy in DE between genes and transcripts. Regarding your example, I guess the obvious
thing to do (which you may have already done) is to create an expression table of the gene and
another of the transcripts belonging to the same gene. Perhaps in the knockout, it has switched
to another splice variant, therefore there is DE on the transcript level. However, when you
collapse expression onto a gene level they are expressed similarly. I’m not so sure about what
you meant about mapping artifacts though. If there was a systematic artifact, it should affect
both samples equally and you shouldn’t have a discrepancy only in one sample.

Cheers,
Dave

NANDITA
February 1, 2018 at 7:37 am

Thanks for responding, Dave.

“Perhaps in the knockout, it has switched to another splice variant, therefore there is DE on the
transcript level.” This would be really cool if true, and when we use the “plot transcripts” function in
ballgown (Fig 5. page 1664 in the Nature Protocols 2016 paper) to look at one of these cases, it indeed
implies different transcripts are expressed.

However, when using the UCSC genome browser to view bigwig files generated from the aligned bams,
we cannot see any unique splice junction being covered in one condition versus the other. So we are not
sure why these reads have been assigned to different splice isoforms. Additionally, like I said, we really
don’t expect so many events where isoforms are switched in the system we are examining.

“If there was a systematic artifact, it should affect both samples equally and you shouldn’t have a
discrepancy only in one sample.”

Agreed. I am unable to explain it either. Short of eyeballing every such event in a genome browser, or
asking the lab to validate via qPCR, I’m not able to assign confidence in the differential transcripts
results, even though the fc, p-val and q-val look very good.

UPENDRA KUMAR DEVISETTY
June 9, 2018 at 4:13 am

Hi Dan,
Very nice blog. I have one quick question. Is there a way to get logFC in addition to FC in the ballgown
output?
Thanks,
Upendra

RAMAN SETHI
January 3, 2019 at 10:30 am

Nice blog. I want to ask how Ballgown compares with DESeq2? And which is the best tool to plot heat
maps, GO and Pathway Analysis, PCA Analysis? Thank you!

DAVO
January 23, 2019 at 8:32 am

I haven’t made the comparison yet. For heatmaps, pathway analysis, and PCA I like pheatmap,
fgsea, and FactoMineR, respectively.

JOSRUIROD5
January 9, 2019 at 10:25 am

Hi, very nice blog entry.

Thanks for the comprehensive explanations. In the end, did you compare the pipeline of the new tuxedo
suite with others such as STAR/Cufflinks? I couldn’t find any other entry. I wonder if you have any
comment on this.

Thank you!

DAVO
January 23, 2019 at 8:26 am

I haven’t done the comparison yet. It’s on my TODO list.

FAWZI YASSINE
March 16, 2019 at 3:45 pm

How should the fold change (fc) in the ballgown results be interpreted? A made-up example calculation
would be appreciated.
Regards,

JOSE BASILIO
March 26, 2019 at 2:49 pm

Thank you for your post. I would like to know if you have the possibility to get, and send to my email,
the paper which you have mentioned at the end of your post:
https://link.springer.com/protocol/10.1007%2F978-1-4939-4035-6_14

Thank you once again.

Best Regards, José
DAVID HUELS
April 4, 2019 at 11:46 am

Hi Dave,
thank you for your detailed blog post. Very helpful!
Which files did you load in the IGV to visualise the known as well as the novel transcripts?
Cheers
David

Leave a Reply

Your email address will not be published. Required elds are marked *

Comment

Name *

Email *

Website

POS T C OM M E N T

Notify me of follow-up comments by email.

Notify me of new posts by email.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

We use cookies to ensure that we give you the best experience on our website. If you continue to use this site we will
assume that you are happy with it.

Ok
Search … S E A RC H

WHO ' S O N L I N E

31 visitors online now


27 guests, 4 bots, 0 members
Map of Visitors

SUP P O R T

Buy me a co ee

L I CE N SE

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2019 Dave Tang's blog. All Rights Reserved.

