Brief Guide For NGS Transcriptomics: From Gene Expression To Genetics
by
Aureliano Bombarely
Lectures:
1.Basics of the Next Generation Sequencing (NGS).
1.1. The sequencing revolutions.
1.2. Strengths and weaknesses of the different technologies.
1.3. Inputs and outputs.
2.RNAseq experiment design.
2.1. Reference vs Non-reference.
2.2. High heterozygosity and the polyploid problem.
2.3. Tissue selection and treatments.
2.4. Sequencing technology.
3.RNAseq expression analysis.
3.1. Reference preparation and read mapping.
3.2. Gene expression.
3.3. Analysis and visualization.
4.Use of RNAseq reads for phylogeny and genetics.
4.1. Recovering full length mRNA: Reference guided assembly.
4.2. Phylogeny through RNAseq: From gene tree to species tree.
4.3. From reads to markers: SNP calling.
4.4. Population genetics and NGS.
1.Basics of the Next Generation Sequencing (NGS).
DNA Sequencing:
“Process of determining the precise order of nucleotides within a DNA molecule.”
-Wikipedia
[Figure: fields that use DNA sequencing: Genetics, Medicine, Molecular Biology, Forensics, Taxonomy, Biology, Breeding, Ecology.]
[Figure: Sanger sequencing. Chain termination (STOP) with ddATP, ddGTP, ddTTP and ddCTP; fragments separated over time; 3) chromatogram read: GTCACCCTGAAT.]
Technology | Run Time | Sequence Length | Reads/Run | Total nucleotides sequenced
Capillary Sequencing (ABI37000) | ~2.5 h | 800 bp | 386 | 0.308 Mb
Timeline of sequenced genomes:
1977: MS2 Bacteriophage (3.658 Kb)
1984: Epstein-Barr Virus (170 Kb)
1995: Haemophilus influenzae (1.83 Mb)
1996: Saccharomyces cerevisiae (12.1 Mb)
1998: Caenorhabditis elegans (100 Mb)
2000: Arabidopsis thaliana (157 Mb)
2001: Homo sapiens (3.2 Gb)
2002: Oryza sativa (420 Mb)
2006: Populus trichocarpa (550 Mb)
2007: Vitis vinifera (487 Mb)
2008: Physcomitrella (480 Mb); Carica papaya (372 Mb)
2009: Sorghum bicolor (730 Mb); Zea mays (2.3 Gb); Cucumis sativus (367 Mb)
2010: Ectocarpus (214 Mb); Malus x domestica (742 Mb); Glycine max (1.1 Gb)
2012: Mei, Benthamiana, Tomato, Setaria, Melon, Flax, T. salsuginea, Banana, Cotton, Orange, Pear
Technologies marked along the timeline: Sanger Sequencing, 454, SOLiD, Solexa / Illumina, PB.
1.1 The sequencing revolutions.
3. Error rate equal to or higher than traditional sequencing (accuracy from 87 to 99.9%).
• Pyrosequencing (454/Roche).
• Illumina sequencing
• SOLID sequencing
• Ion semiconductor sequencing (IonTorrent)
• Single Molecule SMRT sequencing (PacBio)
Technology | Run Time | Sequence Length | Reads/Run | Total nucleotides sequenced per run
Capillary Sequencing (ABI37000) | ~2.5 h | 800 bp | 386 | 0.308 Mb
454 Pyrosequencing (GS FLX Titanium XL+) | ~23 h | 700 bp | 1,000,000 | 700 Mb (0.7 Gb)
Illumina (HiSeq 2500) | 264 h (11 days) / 27 h | 2 x 100 bp / 2 x 150 bp | 2 x 3,000,000,000 / 2 x 600,000,000 | 600,000 / 120,000 Mb (600 / 120 Gb)
Illumina (MiSeq) | 39 h | 2 x 250 bp | 2 x 17,000,000 | 8,500 Mb (8.5 Gb)
SOLID (5500xl system) | 48 h (2 days) | 75 bp | 400,000,000 | 30,000 Mb (30 Gb)
Ion Torrent (Ion Proton I) | 2 h | 100 bp | 100,000,000 | 10,000 Mb (10 Gb)
PacBio (PacBioRS) | 1.5 h | ~3,000 bp | 25,000 | 100 Mb (0.1 Gb)
1.2 Strengths and weaknesses of the different technologies.
SOLID (5500xl system)
Strengths:
- 2-base encoding reduces the observed raw error rate (0.06%)
Weaknesses:
- 2-base color coding makes sequence manipulation and assembly difficult
- Short reads (75 bp)
PacBio (PacBioRS)
Strengths:
- Long reads (3000 bp)
- Fast run (2 hours)
Weaknesses:
- Really high observed raw error rate (12.7%)
- High instrument cost (~ $700K)
- No pair end/mate pair reads
1.3 Inputs and Outputs.
Illumina (HiSeq 2500) and Illumina (MiSeq)
Inputs:
- Single Reads Library.
- Pair End Library (170-800 bp insert size).
- Mate Pair Library (2 to 10 Kb insert size).
- Multiplexed sample.
Outputs:
- fastq files (Phred+64)
- fastq files (Phred+33, Illumina 1.8+)
PacBio (PacBioRS)
Inputs:
- Single Reads Library.
1.3 Inputs and Outputs.
★ Library types:
• Single reads
• Pair ends (PE) (from 150-800 bp)
R F Illumina
F R 454/Roche
[Figure: sequences joined across gaps (NNNNN) into a Scaffold (or Supercontig); scaffolds joined into a Pseudomolecule (or ultracontig).]
★ Multiplexing:
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
[Figure: reads from two samples, tagged AGTCGT and TGAGCA, are pooled for sequencing and then separated by tag.]
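The tag-based separation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the tag sits at the start of each read and must match exactly; the sample names and read sequences are hypothetical, and only the two tags come from the slide.

```python
from collections import defaultdict

# Tag -> sample. The tags are the ones shown in the slide;
# the sample names are hypothetical.
TAGS = {"AGTCGT": "sample_1", "TGAGCA": "sample_2"}


def demultiplex(reads, tags=TAGS):
    """Group reads by their leading tag and strip the tag from assigned reads."""
    tag_len = len(next(iter(tags)))  # assumes all tags share one length
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        if tag in tags:
            bins[tags[tag]].append(insert)
        else:
            # Reads without a known tag are kept whole for inspection.
            bins["unassigned"].append(read)
    return dict(bins)


# Made-up reads for illustration only.
reads = ["AGTCGTACGTACGT", "TGAGCATTTTGGGG", "AGTCGTCCCCAAAA", "NNNNNNACGT"]
result = demultiplex(reads)
```

Real demultiplexers usually tolerate one mismatch in the tag; exact matching is used here only to keep the sketch short.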
1.3 Inputs and Outputs.
Sff files:
Standard flowgram format (SFF) is a binary file format used to encode results of
pyrosequencing from the 454 Life Sciences platform for high-throughput sequencing. SFF files
can be viewed, edited and converted with DNA Baser SFF Workbench (graphic tool), or
converted to FASTQ format with sff2fastq or sff_extract.
-Wikipedia
Fasta files:
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores.
-Wikipedia
2. There should be no space between the “@” symbol and the first letter of the identifier.
4. A single line with a plus symbol (“+”) in the first column precedes the quality line.
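The rules above can be checked with a small parser. The following is a sketch, assuming four-line records (no wrapped sequences) and Phred+33 qualities by default; the function names and the example record are illustrative, not part of any library.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality) tuples from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # "@" must be followed directly by the identifier (no space).
        if not header.startswith("@") or len(header) < 2 or header[1].isspace():
            raise ValueError(f"bad identifier line: {header!r}")
        # The third line starts with "+" in the first column.
        if not plus.startswith("+"):
            raise ValueError(f"bad separator line: {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string; use offset=64 for old Phred+64 files."""
    return [ord(char) - offset for char in qual]


# Hypothetical record: the read from the chromatogram slide with made-up qualities.
example = "@read1\nGTCACCCTGAAT\n+\nIIIIIIIIII5!\n"
records = list(parse_fastq(example))
scores = phred_scores(records[0][2])
```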
Fastq files:
2. Sequencing evaluation.
5. Basic R commands
6. Functional annotation.
[Figure: RNAseq applications: Novel Gene Discovery; Alternative Splicing Discovery; NcRNA Profiling and Discovery; Gene Expression Analysis.]
2. RNAseq Experiment Design
* Regulation: space (where) and time (when) (promoters ...)
Translate to protein
* Regulation (glycosylations, phosphorylations ...)
Synthesize compounds
* Karp et al. Multidimensional annotation of the Escherichia coli K-12 genome. Nucleic Acids Research. 2007:35:7577-7590
** http://www.arabidopsis.org/portals/genAnnotation/genome_snapshot.jsp
*** http://rice.plantbiology.msu.edu/riceInfo/info.shtml
**** http://www.sanger.ac.uk/Info/Press/2004/041020.shtml
2. RNAseq Experiment Design
Transcriptome Complexity:
[Figures: sources of transcriptome complexity. Simple system: one genome, one gene, one mRNA population in a single cell. Added layers: a pathogen genome in the sample, several mRNA populations, several mRNAs from one gene (mRNA-1, mRNA-2), and several time points (1, 2, 3).]
2. RNAseq Experiment Design
Experimental Design:
Genomic Considerations: Number of Species; Polyploidy/Heterozygosity
Biological Considerations: Organ/Tissue/Cell Type; Developmental Stage; Treatments
Economical Considerations: Budget
Technical Considerations: Skills/Hardware
Design decisions: Controls; Replicates; Technology Used; Library Preparation; Sequencing Amount; Analysis Pipeline
2.1 Reference vs. Non-reference
Reference:
Plant Genomes:
http://chibba.agtec.uga.edu/duplication/index/home
Yes
Same genus
For most of them, but some losses are expected for the most polymorphic genes.
Same family
Probably not. Some reads will still map to the most conserved genes.
Species | Accession | SRA | Reads | % Mapped Reads | Time
Arabidopsis thaliana | Ler | SRR392121 | 9752382 | 71% | 00:07:05
Arabidopsis lyrata | - | SRR072809 | 9214967 | 69% | 00:10:11
[Figure: reads mapped to two similar reference genes.]
Reference Gene 1: ATGCGCGCTAGACGACATGACGACA
Reference Gene 2: ATGCCCGCTAGACGACATGACGACAGCGTGTCGTAG
Polymorphic region reads: CCGCTA, CCCGCT, GCCCGC
Non-polymorphic region reads: TGACGA, ATGACG (reads assigned randomly)
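The difference between the two kinds of reads can be reproduced with the sequences from the figure. This is a minimal sketch that uses exact substring matching as a stand-in for a real read mapper; the function name is illustrative.

```python
# Reference genes and reads taken from the figure.
REFERENCES = {
    "gene1": "ATGCGCGCTAGACGACATGACGACA",
    "gene2": "ATGCCCGCTAGACGACATGACGACAGCGTGTCGTAG",
}
READS = ["CCGCTA", "CCCGCT", "GCCCGC", "TGACGA", "ATGACG"]


def find_hits(reads, references):
    """Return {read: sorted list of reference names containing the read}."""
    return {
        read: sorted(name for name, seq in references.items() if read in seq)
        for read in reads
    }


hits = find_hits(READS, REFERENCES)
unique = [read for read, names in hits.items() if len(names) == 1]
ambiguous = [read for read, names in hits.items() if len(names) > 1]
```

Reads in `ambiguous` match both genes equally well, so a mapper has to place them at random (or report both hits), which distorts the read counts of highly similar gene copies.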
2.2 High heterozygosity and the polyploid problem
http://chibba.agtec.uga.edu/duplication/index/home
2.3 Tissue Selection and Treatments
Best Practices:
1) Compare samples with same number of amplification rounds
2) Use software to measure and correct the bias
(example: seqbias from R/Bioconductor, Jones DC et al. 2012)
[Figure: Control (-) and Treatment (+) samples go through mRNA extraction, then library preparation and multiplexing with the tags CGATCG and ATCGTA.]
2.4 Sequencing technology
http://www.rna-seqblog.com/information/how-many-reads-are-enough/
2.4 Sequencing technology
Tarazona S. et al. (2012) Differential expression in RNA-seq: a matter of depth. Genome Res.21:2213-23
2.4 Sequencing technology
For de-novo assembly, or a reference with a recent WGD:
‣ Longer is better (at least 100 bp)
‣ Pair ends recommended
[Figure: preprocessed Fastq reads go either to Assembling or to Mapping.]
• FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
3.1 Reference preparation and read mapping
• Suggested Q30.
• Fastx-Toolkit (http://hannonlab.cshl.edu)
• Ea-Utils (http://code.google.com/p/ea-utils/)
• PrinSeq (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi)
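To make the Q30 suggestion concrete, here is a sketch of 3' end quality trimming. It is a simplified illustration, not the algorithm of any of the tools listed above; it assumes Phred+33 qualities, and the function name and example read are hypothetical.

```python
def trim_3prime(seq, qual, min_q=30, offset=33):
    """Drop bases from the 3' end until one reaches min_q (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: ten Q40 bases ("I") followed by a Q20 ("5") and a Q0 ("!") base.
trimmed_seq, trimmed_qual = trim_3prime("GTCACCCTGAAT", "IIIIIIIIII5!")
```

The dedicated tools also offer sliding-window trimming and minimum-length filters, which are usually preferable for real data.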
Tophat (Bowtie2)
Bowtie2
BWA
Software | Sequencing technology | Features | URL
bwa | Illumina | Mapping | http://bio-bwa.sourceforge.net/bwa.shtml
bowtie | Illumina, SOLID | Mapping | http://bowtie-bio.sourceforge.net/index.shtml
bowtie2 | Illumina, 454 (fastq) | Mapping | http://bowtie-bio.sourceforge.net/bowtie2
novoalign | Illumina, SOLID | Mapping | http://www.novocraft.com/main/index.php
gsMapper | 454 (sff) | Mapping, annotation | http://454.com/products/analysis-software/index.asp#reference-tabbing
SOAPaligner | Illumina | Mapping | http://soap.genomics.org.cn/soapaligner.html
TopHat (bowtie) | Illumina | Mapping, splicing | http://tophat.cbcb.umd.edu/index.html
3.1 Reference preparation and read mapping
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
http://www.sequenceontology.org/resources/gff3.html
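A GFF3 feature line has nine tab-separated columns, the last one holding `key=value` attributes separated by semicolons. The sketch below parses one such line; the function name and dictionary keys are illustrative choices, and URL-escaped characters and multi-valued attributes are not handled.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict (columns must be tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line needs 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# The gene line from the example above, written with explicit tabs.
gene = parse_gff3_line("ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN")
gene_length = gene["end"] - gene["start"] + 1
```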
3.1 Reference preparation and read mapping
bowtie2-build merged_reference.fasta merged_reference
bowtie2 -N 1 -M 0 -x merged_reference -U seq1.fastq -S results.sam
[Figure: Read Mapping takes the Fasta (reference), the Gff (reference) and the preprocessed Fastq as inputs and produces Sam/Bam.]
http://samtools.sourceforge.net/SAM1.pdf
3.1 Reference preparation and read mapping
Sam file
http://samtools.sourceforge.net/SAM1.pdf
Cigar String
http://samtools.sourceforge.net/SAM1.pdf
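A CIGAR string is a run of `<length><operation>` pairs. As a sketch of how it is read, the code below splits a CIGAR string and computes how many bases it spans on the reference (operations M, D, N, =, X) and on the read (operations M, I, S, =, X), following the SAM specification linked above. The function names and the example string are illustrative, and the `*` placeholder for a missing CIGAR is not handled.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")    # operations that consume reference bases
QUERY_OPS = set("MIS=X")  # operations that consume read bases


def parse_cigar(cigar):
    """Return [(length, op), ...]; raise if the string is not a valid CIGAR."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    if not ops or "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (reference_span, read_length) implied by a CIGAR string."""
    ops = parse_cigar(cigar)
    ref_span = sum(length for length, op in ops if op in REF_OPS)
    read_len = sum(length for length, op in ops if op in QUERY_OPS)
    return ref_span, read_len


# 8 matches, 2 inserted bases, 4 matches, 1 deleted base, 3 matches.
ref_span, read_len = cigar_lengths("8M2I4M1D3M")
```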
Flags String
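The FLAG column of a SAM record is a bitwise combination of properties. The sketch below decodes it using the bit values defined in the SAM specification; the short property names are my own labels, not official identifiers.

```python
# Bit values from the SAM specification; the names are informal labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the list of properties set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
properties = decode_flag(99)
```

For example, a single-end read mapped to the reverse strand has flag 16, and an unmapped single-end read has flag 4.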
Schliesky S, Gowik U, Weber AP, Bräutigam A. (2012) RNA-Seq Assembly – Are We There Yet? Front Plant Sci doi: 10.3389/fpls.2012.00220
3.1 Reference preparation and read mapping
Software | Sequencing technology | Type | Features | URL
MIRA | Sanger, 454 | Overlap-layout-consensus | Highly configurable | http://sourceforge.net/apps/mediawiki/mira-assembler
gsAssembler | Sanger, 454 | Overlap-layout-consensus | Splicings | http://454.com/products/analysis-software/index.asp
iAssembler | Sanger, 454 | Overlap-layout-consensus | Improves MIRA | http://bioinfo.bti.cornell.edu/tool/iAssembler
Trans-ABySS* | 454 or Illumina | Bruijn graph | Splicings, Gene fusions | http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
SOAPdenovo-trans* | 454 or Illumina | Bruijn graph | Fastest | http://soap.genomics.org.cn/SOAPdenovo-Trans.html
Velvet/Oases | 454 or Illumina or SOLiD | Bruijn graph | SOLiD | http://www.ebi.ac.uk/~zerbino/oases/
Trinity* | 454 or Illumina | Bruijn graph | Downstream expression | http://trinityrnaseq.sourceforge.net/
What is a Kmer ?
Specific n-tuple or n-gram of nucleic acid or amino acid sequences.
-Wikipedia
n-tuple: ordered list of elements.
n-gram: contiguous sequence of n items from a given sequence of text.
The five 20-mers of a 24-base read (ATGCGCAGTGGAGAGAGAGCGATG):
ATGCGCAGTGGAGAGAGAGC
TGCGCAGTGGAGAGAGAGCG
GCGCAGTGGAGAGAGAGCGA
CGCAGTGGAGAGAGAGCGAT
GCAGTGGAGAGAGAGCGATG
N_kmers = L_read - Kmer_size + 1
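The k-mer listing can be reproduced by sliding a window of size k along the read, one base at a time. A read of length L yields L - k + 1 k-mers, so the 24-base read behind the slide gives five 20-mers. A minimal sketch (the function name is illustrative):

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# The 24-base read that the five 20-mers in the slide were taken from.
read = "ATGCGCAGTGGAGAGAGAGCGATG"
read_kmers = kmers(read, 20)
```

De Bruijn graph assemblers build their graph from such k-mers instead of from whole reads.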
3.1 Reference preparation and read mapping
Li Z. et al. (2011) Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Brief. Funct. Genomics 11: 25-37. doi: 10.1093/bfgp/elr035
1. A brief history of the sequence assembly.
Compeau PEC. et al. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 2011. 29:287-291
3.2 Gene Expression
Oshlack A, Robinson MD, and Young MD, Genome Biology 2010, 11:220
3.2 Gene Expression
Gene expression in RNAseq analysis is based on how many reads map to a specific gene. For comparison purposes the counts need to be normalized. There are different methodologies.
Software | Normalization | Features | URL
ERANGE | RPKM | Python | http://woldlab.caltech.edu/wiki/RNASeq
Scripture | RPKM | Java | http://www.broadinstitute.org/software/scripture
BitSeq* | RPKM | R/Bioconductor, Calculate DE | http://www.bioconductor.org/packages/2.12/bioc/html/BitSeq.html
EdgeR | TMM | R/Bioconductor, Calculate DE | http://www.bioconductor.org/packages/2.11/bioc/html/edgeR.html
Cufflinks* | FPKM | Isoforms, Calculate DE | http://cufflinks.cbcb.umd.edu/
MMSEQ* | FPKM | Isoforms, Haplotypes | http://bgx.org.uk/software/mmseq.html
RSEM* | FPKM | Calculate DE (EBSeq) | http://deweylab.biostat.wisc.edu/rsem/README.html
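As a worked example of one of these normalizations, RPKM (reads per kilobase of transcript per million mapped reads) divides the read count of a gene by its length in kilobases and by the library size in millions of mapped reads. The sketch below implements that formula directly; the function name and the numbers are made up for illustration.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)  # 500 / (2 kb * 10 million) = 25 RPKM
```

FPKM applies the same formula to fragments (read pairs) instead of single reads.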
Software | Normalization | Need Replicas | Input | URL
DESeq | Library Size | No | Raw Counts | http://bioconductor.org/packages/release/bioc/html/DESeq.html
baySeq | Library Size | Yes | Raw Counts | http://www.bioconductor.org/packages/2.11/bioc/html/baySeq.html
NOISeq | Library Size / RPKM / UpperQ | No | Raw or Normalized Counts | http://bioinfo.cipf.es/noiseq/doku.php?id=start
Tarazona S. et al. (2012) Differential expression in RNA-seq: a matter of depth. Genome Res.21:2213-23
3.3 Analysis and Visualization
http://en.wikipedia.org/wiki/Data_mining
For gene expression there are some common tasks and associated methods for the data mining:
▪ Clustering of the expression values and principal component analysis to reduce the variables.
▪ Classification using Gene Ontology terms and metabolic annotations.
▪ Summarization, visualizing the expression data through heat maps.
3.3 Analysis and Visualization
http://en.wikipedia.org/wiki/Cluster_analysis
Stats (R package): HC ( hclust() function ), KMC ( kmeans() function ), visualization ( gplots package ). http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stats-package.html
GENE-E: HC, visualization. http://www.broadinstitute.org/cancer/software/GENE-E/
3.3 Analysis and Visualization
One of the most common classification data mining methods is the use of gene annotations such as GO terms or metabolic annotations. These methodologies compare two groups to find whether some terms are more represented in one group than in the other. Some examples are:
Gene ontologies:
biological processes,
molecular functions
in a species-independent manner
http://www.geneontology.org/GO.doc.shtml
3.3 Analysis and Visualization
Biological processes,
Recognized series of events or molecular functions. A process is
a collection of molecular events with a defined beginning and end.
Cellular components,
Describes locations, at the levels of subcellular structures and
macromolecular complexes.
Molecular functions
Describes activities, such as catalytic or binding activities, that occur at the molecular
level.
http://www.geneontology.org/GO.doc.shtml
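The group comparison described for GO terms is often done with a hypergeometric (one-sided Fisher) test: given N annotated genes of which K carry a term, how likely is it to see at least k genes with that term in a selected group of n genes? The sketch below computes that probability with the standard library only (Python 3.8+ for `math.comb`); the function name and the numbers are illustrative, and real analyses also need multiple-testing correction.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N genes, K with the term, n selected)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 20 genes in total, 5 annotated with the term;
# 3 of the 5 differentially expressed genes carry it.
p_value = enrichment_p_value(k=3, n=5, K=5, N=20)
```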
4. Use of RNAseq reads for phylogeny and genetics.
[Figure: an unrooted tree of A, B and C, and the same tree rooted with an outgroup.]
Character state methods: Maximum Likelihood (ML), Bayesian Inference (BI)
PhyML | ML | http://www.atgc-montpellier.fr/phyml/
RAxML | ML | http://sco.h-its.org/exelixis/software.html
MrBayes | BI | http://mrbayes.sourceforge.net/index.php
[Figure: Sequence Alignment, then Pairwise Distance Matrix, then Tree; alternative trees for A, B and C.]
Bootstrap values are like the error bars of a phylogenetic tree. A tree without bootstrap values gives incomplete information about how reliable each of its branches is.
[Figure: tree of A, B, C and D with bootstrap values 100, 100 and 80.]
4. Use of RNAseq reads for phylogeny and genetics.
1. Use the CDS sequence, from the start codon to the codon before the stop codon. Use full-length sequences if they are available.
2. The consensus sequence should be supported by enough reads to avoid sequencing errors.
4.1 Recovering full length mRNA: Reference guided assembly.
1. De-novo assembly
2. Reference guided assembly
Reference: ATGCCCGCTAGACGACATGACGACAGCGTGTCGTAG
Mapped reads: TCGCTA, ACGCTA, TCGCTA, CTCGCT, CTCGCT, GCTCGC, TGACGA, TGACGA, ATGACG, ATGACG
Reference guided consensus: NNGCTCGCTANNNNNNATGACGANNNNNNNNNNNNN
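The consensus in this example can be rebuilt with a simple pileup: count the bases that the mapped reads place at each reference position, take the most frequent one, and write N where no read maps. The sketch below uses the reads from the figure; their 0-based start positions are my assumption of what a mapper would report, and the function name and `min_depth` parameter are illustrative.

```python
from collections import Counter

REFERENCE = "ATGCCCGCTAGACGACATGACGACAGCGTGTCGTAG"

# (0-based start on the reference, read). Reads come from the figure;
# the start positions are assumed, as a mapper would report them.
MAPPED_READS = [
    (2, "GCTCGC"), (3, "CTCGCT"), (3, "CTCGCT"),
    (4, "TCGCTA"), (4, "ACGCTA"), (4, "TCGCTA"),
    (16, "ATGACG"), (16, "ATGACG"), (17, "TGACGA"), (17, "TGACGA"),
]


def consensus(ref_length, mapped_reads, min_depth=1):
    """Majority base per position; 'N' where coverage is below min_depth."""
    columns = [Counter() for _ in range(ref_length)]
    for start, read in mapped_reads:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    called = []
    for column in columns:
        if sum(column.values()) < min_depth:
            called.append("N")
        else:
            called.append(column.most_common(1)[0][0])
    return "".join(called)


result = consensus(len(REFERENCE), MAPPED_READS)
```

At position 4 (0-based) five reads carry T and one carries A, so the consensus calls T where the reference has C. Raising `min_depth` masks positions that are supported by too few reads.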
4.1 Recovering full length mRNA: Reference guided assembly.
BCF (Binary variant Call Format) stores the variant call for the mapped reads at each
reference position.
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
4.1 Recovering full length mRNA: Reference guided assembly.
DiAlign | Segment-based method | Global and Local | http://bibiserv.techfak.uni-bielefeld.de/dialign
TCoffee | Sensitive progressive alignment | Global and Local | http://www.tcoffee.org
http://www.kuleuven.be/aidslab/phylogenybook/Table3.1.html
4.2 Phylogeny through RNAseq: From gene tree to species tree.
๏PhygOmics
https://github.com/solgenomics/PhygOmics
~ 1,000 gene trees for the allotetraploid Nicotiana tabacum and its diploid progenitors, N. sylvestris and N. tomentosiformis, were analyzed to identify the origin of each homoeolog.
Samtools/Bcftools | SNPs, INDELS | http://samtools.sourceforge.net/mpileup.shtml
http://www.broadinstitute.org/gatk/about
4.3 From reads to markers: SNP calling.
After SNP calling, and before using the SNP data for other analyses, it is recommended to perform SNP filtering.
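A filtering step of this kind can be sketched as follows. The slides do not give filtering criteria, so the thresholds used here (QUAL of at least 30, depth DP of at least 10, single-base alleles only) are assumptions chosen for illustration, as are the function name and the example VCF lines.

```python
def filter_snps(vcf_lines, min_qual=30.0, min_depth=10):
    """Keep single-base variants passing assumed QUAL and INFO/DP thresholds."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
        if alt == "." or len(ref) != 1 or any(len(a) != 1 for a in alt.split(",")):
            continue  # not a SNP (no alternative allele, or an indel)
        info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        depth = int(info_map.get("DP", 0))
        if qual != "." and float(qual) >= min_qual and depth >= min_depth:
            kept.append((chrom, int(pos), ref, alt))
    return kept


# Made-up VCF content for illustration only.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "gene1\t5\t.\tC\tT\t50\t.\tDP=25",   # kept
    "gene1\t12\t.\tA\tG\t12\t.\tDP=30",  # low quality
    "gene1\t20\t.\tG\tGA\t60\t.\tDP=40", # indel
    "gene1\t30\t.\tT\tC\t45\t.\tDP=4",   # low depth
]
snps = filter_snps(vcf)
```

Suitable thresholds depend on the sequencing depth and the ploidy of the samples, so they should be tuned for each data set.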
Structure | Population structure analysis | http://pritch.bsd.uchicago.edu/structure.html
Phase | Genetic phasing of alleles | http://stephenslab.uchicago.edu/software.html#phase
4.4 Population Genetics and RNAseq
GenoToolBox
MultiVcfTool Hapmap2Structure
https://github.com/aubombarely/GenoToolBox
4.4 Population Genetics and RNAseq
G. canescens (A):
G1232
G. syndetika (D4):
G1300,G2073,G2321 G. clandestina (A)
G1126, G1253
G. dolichocarpa G. tomentella
(T2): (T5):
G1134,G1188,G1286,
A39, A58, G1487,G1969
Structure analysis:
1. Number of clusters optimization (Evanno G. et al. 2005)
[Figure: two plots against K, Mean Ln(K) and Delta K, with K = 6 and (K = 16) marked.]
4.4 Population Genetics and RNAseq
3. Run FineStructure without homoeologous read separation
[Figure: admixture model with the groups T2, D4, T1, A, T5 and D1.]
4.4 Population Genetics and RNAseq
[Figure: groups D4, A, T2, D3 and D1.]
It agrees with previous data from nuclear genes.