09.05.23_Sequencing Technology and Development_Canvas
Learning Objectives…
• Know how twin studies and sequencing approaches have been used to identify the genetic basis of traits/diseases.
• Recall single nucleotide polymorphisms (SNPs) and know how SNP arrays have been used to conduct Genome Wide Association Studies (GWAS).
• Be able to briefly describe the core features of Sanger sequencing used to complete whole genome sequencing (WGS) of the first human genome as well as modern Next-gen Sequence by Synthesis protocols.
• Compare and contrast bulk-tissue RNAseq with recently developed single-cell (scRNAseq) approaches.
Copyrighted Material 2022
Molecular Regulation of Development: By the Numbers
• Human development generates 30-40 trillion cells
• Each cell contains 19,975 protein-coding genes (60,603 total genes)
• Typically, some combination of 12,262 ± 1,007 protein-coding genes is expressed in any given cell
• Estimates suggest ~40% are expressed in all cells and ~60% are tissue specific.
• What is the role of these molecules during development?
[Figure labels: GENES (~19,975 protein-coding); 5-100 mRNAs per gene; PROTEINS (1,000,000,000-3,000,000,000 per cell)]
https://bionumbers.hms.harvard.edu/search.aspx
The “Omics”
• Life scientists are now able to comprehensively study many biological questions at the scale of ALL genes (genomics), RNA (transcriptomics) and proteins (proteomics) in many organisms.
• These technologies have allowed for an unprecedented view of biological processes including those relevant to developmental biology.
Benson et al. (2020) J. Neurosci. 40:81-88
Genetic Basis of Disease: Classic Twin Studies
Twin Studies
• Quantify phenotypic differences in monozygotic (“MZ”, genetically identical) and dizygotic (“DZ”, ~50% genetic similarity) twin pairs to calculate concordance and estimate heritability.
http://www.nature.com/doifinder/10.1038/ng.3285
Genetic Basis of Disease: Classic Twin Studies
• Twin studies use estimates of heritability (h²), a statistical measure describing how much of the variation in a given trait can be attributed to genetic variation. Values range from 0 to 1 (for a detailed description see https://pubmed.ncbi.nlm.nih.gov/18319743/).
• If h² = 1, all variation is due to differences between genotypes.
• If h² = 0, all variation is due to differences in the environments experienced by twins.
https://doi.org/10.1016/j.cell.2019.01.015
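One widely taught way to turn MZ/DZ twin similarity into a heritability estimate is Falconer's formula, h² = 2(r_MZ − r_DZ), where r is the trait correlation within twin pairs. The slides do not say which estimator they have in mind, so the sketch below is an assumption for illustration only; the function name and the example correlations are hypothetical.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate h^2 = 2 * (r_MZ - r_DZ), clipped to the range 0..1."""
    h2 = 2.0 * (r_mz - r_dz)
    # Sampling noise can push the raw estimate outside 0..1, so clip it.
    return max(0.0, min(1.0, h2))


# Hypothetical within-pair correlations for some trait (not data from the slides).
estimate = falconer_heritability(r_mz=0.80, r_dz=0.50)
print(f"h^2 is roughly {estimate:.2f}")
```

If MZ and DZ twins are equally similar (r_MZ = r_DZ), the estimate is 0, matching the "all variation is environmental" case above.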
Father of Genomics: Frederick Sanger
So what are the exact changes in the sequence of affected genes?
• In 1977, Sanger developed a clever approach using chain-terminating dideoxynucleotides (ddNTPs) to sequence a simple 5,386 bp bacterial virus.
• “Sanger sequencing” became the gold standard for >40 years and is still used.
• He won his second Nobel Prize in 1980, shared with Paul Berg and Walter Gilbert, "for their contributions concerning the determination of base sequences in nucleic acids."
Sanger Sequencing
• Sanger developed chain-terminating dideoxynucleotide triphosphates (ddNTPs) that stop DNA polymerization during PCR to cleverly resolve sequence by measuring size.
• Regular deoxynucleotide triphosphates (dNTPs, “N” = A, G, C, or T) and a small amount of four fluorescently-conjugated ddNTPs (“N” = A, G, C, or T) are mixed with the template: a genomic fragment, PCR product, or plasmid of interest.
• The PCR reaction will generate fragments of all lengths, but when a fluorescently-conjugated ddNTP is incorporated, extension of that strand stops and the fragment is color-coded by its final base.
• Following size separation by gel electrophoresis, the sequence of fragments up to 1000 bp can then be determined.
https://www.sigmaaldrich.com/US/en/technical-documents/protocol/genomics/sequencing/sanger-sequencing
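The logic of reading sequence from fragment size can be sketched as a toy simulation. This is a minimal sketch under an idealized assumption (exactly one terminated fragment for every position of the new strand); the function name and example template are hypothetical, and real reads involve signal noise and length limits that are ignored here.

```python
import random

# Watson-Crick pairing used by the polymerase when copying the template.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_read(template_3to5: str, seed: int = 0) -> str:
    """Reconstruct the synthesized strand from chain-terminated fragments."""
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    # One fragment per position: (length in bases, base of its terminal ddNTP).
    fragments = [(n + 1, base) for n, base in enumerate(new_strand)]
    # In the tube the fragments are mixed together in no particular order.
    random.Random(seed).shuffle(fragments)
    # Electrophoresis separates by size; reading the terminal labels from the
    # shortest fragment to the longest spells out the sequence.
    return "".join(base for _, base in sorted(fragments))


print(sanger_read("TACGGTCA"))
```

The key point is that the color of each fragment's last base, ordered by fragment length, is the sequence.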
The Human Genome Project
• Whole Genome Sequencing (WGS): the comprehensive sequencing of the entire genome.
• Using Sanger sequencing, and taking >10 yrs, $3 billion, and ~12 individuals from Buffalo, the Human Genome Project completed the first WGS of ~3 billion base-pairs in 2003, covering ~90% of the genome.
• The Telomere-to-Telomere (T2T) Consortium finished filling in the remaining gaps in 2022: https://www.science.org/doi/10.1126/science.abj6987
[Figure caption: Dr. Francis Collins analyzing an autoradiogram displaying the results of a Sanger DNA sequencing experiment, such as that used in the early years of the Human Genome Project.]
https://www.youtube.com/watch?v=qOW5e4BgEa4
https://www.nature.com/articles/s41576-020-0275-3
Functional Gene Expression & Transcriptomics
Studying the expression of all genes (aka functional genomics or transcriptomics) has been achieved with a range of approaches.
• Typically requires extraction of mRNA transcripts from a tissue, or more recently, single cells!
• Technologies continue to evolve…
• Microarrays
• Next-generation sequencing (NGS, RNAseq)
• Single-cell RNA-seq
• Spatial Transcriptomics
Nice review from the text…
https://learninglink.oup.com/access/content/barresi-12e-student-resources/barresi-12e-further-development-3-17-6-microarrays-and-macroarrays
https://www.nature.com/articles/nrg.2016.49
Microarrays
• Once model genomes were sequenced, approaches were developed to attach small oligonucleotides complementary to multiple genes of interest onto a chip (microarrays).
• By the 2000s, microarrays containing ~20,000 oligos complementary to all known protein-coding mRNAs were regularly used to study the transcriptome.
Microarrays
• Tissue-derived RNA samples are added to the chip, and the intensity of a hybridized “spot” on the array provides an estimate of RNA abundance for that specific oligonucleotide.
Single Nucleotide Polymorphism Arrays & GWAS
• A single nucleotide polymorphism (abbreviated SNP, pronounced ‘snip’) is a genomic variant at a single base position in the DNA.
• By 2005, the Human Genome Project and International HapMap Project (~270 individuals) generated maps of ~3 million SNPs across the human genome.
• Inexpensive SNP microarrays based on these maps were created to conduct Genome Wide Association Studies (GWAS) that detect associations between genetic loci and traits/disease.
[Figure caption: In one of the first uses of SNP arrays, the SNP in this Manhattan plot (https://www.youtube.com/watch?v=Pdic7p_dk0I ) was detected in a 146-person case-control study of age-related macular degeneration (AMD, a leading cause of blindness) and ultimately led to the discovery that mutations in the CFH gene increase risk of AMD >5-fold.]
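At its core, a GWAS repeats a simple association test at every SNP on the array. The sketch below shows one such test, a Pearson chi-square on a 2x2 table of allele counts in cases versus controls. It is an illustrative assumption, not the method of any study cited in these slides; the function name and counts are hypothetical, and real analyses also correct for ancestry, covariates, and multiple testing.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square (1 d.f.), p-value and odds ratio for a 2x2 allele-count table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / product of margins.
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
                   * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    chi2 = numerator / denominator
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts for one SNP (not data from the slides).
chi2, p, odds = allelic_chi_square(case_alt=30, case_ref=10, ctrl_alt=10, ctrl_ref=30)
print(f"chi2={chi2:.1f}  p={p:.2e}  odds ratio={odds:.1f}")
```

A Manhattan plot is then the −log10 of each SNP's p-value plotted against its genomic position, so strongly associated loci stand out as tall peaks.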
GWAS
• Estimates suggest 90-95% of SNPs detected in GWAS are located in non-coding genomic regions that indirectly modulate gene expression ~40-50% of the time.
https://www.nature.com/articles/s43586-021-00056-9
GWAS Catalogs
• In response to the massive increase in the number of published GWAS, the GWAS Catalog was founded by NIH-NHGRI in 2008. https://www.ebi.ac.uk/gwas/
Case Study: GWAS Catalog
• GWAS using SNP arrays have provided unprecedented insight into the genetic basis of many traits/diseases, such as the data from this paper on autism.
https://www.nature.com/articles/s41598-021-95447-z#Sec26
Genetic Basis of Disease: GWAS
https://doi.org/10.1016/j.cell.2019.01.015
CONS
• Depending on the disease/trait, GWAS loci are often found to confer only a small increase in disease risk (many diseases are strongly associated with environmental factors, not genetics) and to explain only a fraction of the heritability.
• Only ~10-20% of all GWAS participants are of non-European descent.
Next Generation Sequencing Technologies
• Sanger sequencing and SNP arrays gave way to high-throughput sequencing technologies that are faster and cheaper.
• These massively parallel, highly multiplexed protocols are now typically referred to as Next-generation Sequencing (NGS) approaches, often termed DNAseq or RNAseq.
• Current ‘next-gen’ based human WGS or whole exome sequencing (WES) now costs <$1000 and takes a few weeks.
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
RNAseq
NGS Example Tech: Sequence by synthesis
Reversible dye-terminator (RDT)-based sequencing-by-synthesis (SBS)
(please note this is but one commonly used chemistry; there are many others)
1. Fragments of the sample DNA/RNA template of interest (800–1000K clusters per mm²) are anchored on a chip or “flowcell”.
2. RDT-ddNTPs with fluorescent tags are added during synthesis (no dNTPs) and, at the end of each cycle, a picture is taken.
3. An enzyme cleaves the fluorescent tag and turns the RDT-ddNTP into a regular dNTP to reverse termination and allow another synthesis cycle to continue.
4. Repeat steps 2-3 for 100-300 bp: each “colored dot” in the picture reveals the sequence.
https://www.youtube.com/watch?v=fCd6B5HRaZ8
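The cycle-by-cycle logic of the four steps above can be sketched as a toy model in which every cluster adds exactly one labeled base per cycle and a "picture" records the color at each cluster. The dye colors, function names, and templates here are hypothetical placeholders; real instruments use their own channel schemes and must handle noise and clusters falling out of step.

```python
# Hypothetical dye colours, chosen only for illustration.
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def image_cycles(templates, n_cycles):
    """Return one 'picture' per cycle: the dye colour seen at each cluster."""
    pictures = []
    for cycle in range(n_cycles):
        # Step 2: one terminated, labelled nucleotide is added per cluster,
        # then the flowcell is imaged.
        pictures.append([DYE_FOR_BASE[COMPLEMENT[t[cycle]]] for t in templates])
        # Step 3 (dye cleavage and un-blocking) would happen here.
    return pictures


def call_bases(pictures):
    """Read each cluster's sequence by following its colour across pictures."""
    n_clusters = len(pictures[0])
    return ["".join(BASE_FOR_DYE[pic[i]] for pic in pictures)
            for i in range(n_clusters)]


pictures = image_cycles(["TACG", "GGCA"], n_cycles=4)
print(call_bases(pictures))
```

Because every cluster is imaged in every cycle, millions of reads are extended in parallel, which is where the throughput advantage over Sanger sequencing comes from.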
Bioinformatics
• A typical next-gen sequencing experiment on a single tissue-derived mRNA sample will sequence >500 million fragments, or “reads”, ~100-300 bp in length.
• Often stored in a >250 GB “.fastq” formatted file with quality control info for each “read” that looks something like…
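A FASTQ file stores each read as four lines: an identifier starting with "@", the base sequence, a "+" separator, and one quality character per base. The record below is made up for illustration (it is not the example from the original slide), and the parser assumes the common Phred+33 quality encoding and strictly four-line records.

```python
# A made-up FASTQ record (four lines per read); not taken from the slides.
EXAMPLE_FASTQ = """@read_1
GATTACA
+
IIIIH#5
"""


def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        # Assumes Phred+33 encoding: quality score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


for read_id, sequence, scores in parse_fastq(EXAMPLE_FASTQ):
    print(read_id, sequence, scores)
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(−Q/10), so a score of 40 means roughly 1 error in 10,000 calls.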
Bioinformatics
• Resequencing: reads are aligned back to a known genomic sequence to identify genetic variants, including…
• SNPs
• insertions/deletions
• structural variants
• copy number variants (CNVs)
• Reference genome: Genome Reference Consortium Human Build 38 (GRCh38) https://www.ncbi.nlm.nih.gov/grc
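For the simplest variant class, SNPs, resequencing boils down to stacking aligned reads over the reference and looking for positions where the reads consistently disagree with it. The sketch below is a deliberately naive illustration: it assumes the reads are already aligned without gaps, and the function name, depth threshold, and sequences are hypothetical. Production variant callers use statistical models of sequencing error and genotype.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=2):
    """Report positions where the majority read base differs from the reference.

    aligned_reads: list of (start, sequence) pairs, 0-based, gap-free alignments.
    """
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1

    snps = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust a call here
        majority_base, _ = counts.most_common(1)[0]
        if majority_base != reference[position]:
            snps.append((position, reference[position], majority_base))
    return snps


# Toy reference and reads; three reads support an A at position 3 (reference T).
reference = "ACGTACGT"
reads = [(0, "ACGAAC"), (2, "GAACGT"), (1, "CGAAC")]
print(call_snps(reference, reads))
```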
Example Transcriptome Analytic Pipeline
• How do we convert TBs of sequencing data from mRNA samples (RNAseq) into information on changes in gene expression?
• MANY different analytic packages have been developed, here is but one example pipeline/workflow…
Map reads to genome: e.g. STAR
Count reads: e.g. featureCounts
Compute Differential Gene Expression: e.g. DESeq2
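To make the last two stages of such a workflow concrete, here is a toy sketch that starts from reads already mapped to genomic positions, counts them per gene, and computes a library-size-scaled log2 fold change. This is not how STAR, featureCounts, or DESeq2 work internally; the gene coordinates, function names, and pseudocount are hypothetical, and real tools add annotation handling, replicates, and statistical testing.

```python
import math

# Hypothetical gene intervals (start, end) on one chromosome; illustrative only.
GENES = {"geneA": (0, 100), "geneB": (200, 300)}


def count_reads(mapped_positions, genes=GENES):
    """Count reads whose mapped start position falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for position in mapped_positions:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
    return counts


def log2_fold_change(control, treated, pseudocount=1):
    """Per-gene log2(treated / control) after scaling by total counted reads.

    Assumes each sample has at least one counted read.
    """
    n_control = sum(control.values())
    n_treated = sum(treated.values())
    return {
        gene: math.log2(((treated[gene] + pseudocount) / n_treated)
                        / ((control[gene] + pseudocount) / n_control))
        for gene in control
    }


control = count_reads([10, 210, 220, 230])
treated = count_reads([10, 20, 30, 250])
print(control, treated, log2_fold_change(control, treated))
```

The pseudocount keeps genes with zero reads from producing a division by zero or an infinite fold change.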
Public Repositories
• MANY different publicly accessible repositories provide RNAseq-based information on gene expression.
RPKM = Reads Per Kilobase of transcript, per Million mapped reads.
A commonly used normalized unit of transcript expression.
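The RPKM definition translates directly into arithmetic: divide a gene's read count by the transcript length in kilobases and by the number of mapped reads in millions. The function name and example numbers below are hypothetical and serve only to illustrate the unit.

```python
def rpkm(gene_reads, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript, per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return gene_reads / (kilobases * millions_mapped)


# Hypothetical example: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(gene_reads=500, transcript_length_bp=2_000, total_mapped_reads=10_000_000))
```

Normalizing by length and library size is what allows a long and a short gene, or a deeply and a shallowly sequenced sample, to be compared on the same scale.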
Public Repositories
• The Gene Expression Omnibus (GEO) portal is an NCBI-managed public repository for sequencing data.
• https://www.ncbi.nlm.nih.gov/geo/
• Most journals now require data to be uploaded to GEO prior to publication.
Limitations of Bulk-tissue Analyses
• DNAseq/RNAseq aggregate information from all cells within a sample.
• What if the tissue is highly heterogeneous?
If a gene is known to be expressed in multiple cell types, bulk-tissue RNAseq may be unable to tell you which cell type is responsible for detected changes in expression.
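A small numeric sketch makes the averaging problem concrete. Bulk signal is modeled here as the abundance-weighted mean of per-cell-type expression; the cell types, fractions, and expression values are invented for illustration and are not measurements from any dataset.

```python
def bulk_expression(cell_type_fractions, expression_by_type):
    """Bulk signal = abundance-weighted average of per-cell-type expression."""
    return sum(cell_type_fractions[cell_type] * expression_by_type[cell_type]
               for cell_type in cell_type_fractions)


# Hypothetical tissue: half neurons, half glia (made-up numbers).
fractions = {"neuron": 0.5, "glia": 0.5}
control = {"neuron": 10.0, "glia": 10.0}
disease = {"neuron": 16.0, "glia": 4.0}  # up in neurons, down in glia

print(bulk_expression(fractions, control), bulk_expression(fractions, disease))
```

In this made-up case the opposite changes in the two cell types cancel, so the bulk measurement shows no difference even though expression changed in every cell type.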
Single Cell RNAseq (scRNAseq)
• Is there a way to sequence nucleic acids from a single cell in a complex multicellular tissue?
• Yes, by gently digesting/dissociating tissue, single cells can be extracted and individual cells (sometimes nuclei) can be sequenced.
• Bioinformatics analysis is much more data intensive and complex, and continues to evolve.
• Sequencing “depth” (# of detected transcripts) is often limited relative to bulk RNAseq.
Single Cell RNAseq (scRNAseq)
tSNE = t-distributed stochastic neighbor embedding: a statistical method for visualizing and clustering cells defined by a global pattern of gene expression that gives each cell a single location in a two- or three-dimensional map
Case Study: Single nuclei RNAseq & Autism
• Performed snRNAseq analysis of 104,559 single nuclei isolated from postmortem forebrain regions of 15 ASD and 16 control individuals.
• Used a statistical clustering method to identify patterns of gene expression to classify cells.
• Able to untangle neuron- vs glia-specific changes in gene expression in autism.
https://www.science.org/doi/10.1126/science.aav8130
Allen Brain Atlas Transcriptomic Explorer
• A single-cell RNAseq database that provides RNA expression information from 49,495 randomly selected cells from the human cortex
• https://celltypes.brain-map.org/rnaseq/human_ctx_smart-seq?
Spatial Transcriptomics
• Isolation of single cells in scRNA-seq destroys information on spatial localization and proximity to other cells.
• What if we could examine gene expression in situ?
• “Spatial transcriptomics” is still nascent, but represents the next frontier in building 3D cellular and gene expression atlases of entire tissues/organisms.
Integrative “Omics”…
[Figure legend: Circles = relevant molecules; Arrows = potential interactions]