Read Processing and Mapping:
From Raw to Analysis-ready Reads
Ben Passarelli
Stem Cell Institute Genome Center
NGS Workshop
12 September 2012
Samples to Information
Variant calling
Gene expression
Chromatin structure
Methylome
Immunorepertoires
De novo assembly
…
Many Analysis Pipelines Start with Read Mapping
Genotyping (GATK)
http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg
http://www.broadinstitute.org/gatk/guide/topic?name=intro
RNA-seq (Tuxedo)
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
From Raw to Analysis-ready Reads
Raw reads
Read assessment and prep
Mapping
Local realignment
Duplicate marking
Base quality recalibration
Analysis-ready reads
Session Topics
• Understand read data formats and quality scores
• Identify and fix some common read data problems
• Find and prepare a genomic reference for mapping
• Map reads to a genome reference
• Understand alignment output
• Sort, merge, index alignment for further analysis
• Locally realign at indels to reduce alignment artifacts
• Mark/eliminate duplicate reads
• Recalibrate base quality scores
• An easy way to get started
Instrument Output
Illumina MiSeq, Illumina HiSeq: Images (.tiff), Cluster intensity file (.cif), Base call file (.bcl)
IonTorrent PGM, Roche 454: Standard flowgram file (.sff)
Pacific Biosciences RS: Movie, Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
All instruments: Sequence Data (FASTQ Format)
FASTQ Format (Illumina Example)
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
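Read Record (four lines)
• Header
• Read Bases
• Separator (with optional repeated header)
• Read Quality Scores
Header fields in the first record
• Flow Cell ID: D17DBACXX
• Lane: 2
• Tile: 1101
• Tile Coordinates: 12432:5554
• Barcode: AGTCAA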
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads
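The four-line record layout above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a tested parser such as Biopython; the names `FastqRecord`, `read_fastq` and `parse_illumina_header` are hypothetical, and the header layout is assumed to match the Illumina example shown on the slide.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str      # header line without the leading "@"
    bases: str
    qualities: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines: header, bases, separator, qualities."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(qualities):
            raise ValueError(f"base/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], bases, qualities)


def parse_illumina_header(header: str) -> Dict[str, object]:
    """Split a header shaped like the slide's example into named fields."""
    location, _, read_info = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control_bits, barcode = read_info.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "barcode": barcode,
    }


# Header copied from the slide; bases and qualities are shortened to the
# first 10 positions to keep the example small.
example = io.StringIO(
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "CAGGAGTCTT\n"
    "+\n"
    "BCCFFFDFHH\n"
)
records = list(read_fastq(example))
fields = parse_illumina_header(records[0].header)
print(fields["flowcell"], fields["lane"], fields["barcode"])
```

For a compressed file, the same function should work on a handle opened with `gzip.open(path, "rt")`.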
Phred* quality score Q with base-calling error probability P
Q = -10 log10(P)
* Name of the first program to assign accurate base quality scores. From the Human Genome Project.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger Phred+33 range: 0 to 40
I - Illumina 1.3+ Phred+64 range: 0 to 40
L - Illumina 1.8+ Phred+33 range: 0 to 41
Q score | Probability of base error | Base confidence | Sanger-encoded (Q score + 33) ASCII character
10 | 0.1 | 90% | “+”
20 | 0.01 | 99% | “5”
30 | 0.001 | 99.9% | “?”
40 | 0.0001 | 99.99% | “I”
Base Call Quality: Phred Quality Scores
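The relationship Q = -10 log10(P) and the ASCII encodings in the table can be checked with a short Python sketch. The function names are hypothetical; the default offset of 33 assumes Sanger / Illumina 1.8+ encoding, and 64 would be passed for Illumina 1.3+ data.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def encode_quality(q: int, offset: int = 33) -> str:
    """ASCII character for quality score q at the given offset."""
    return chr(q + offset)


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Quality scores for a FASTQ quality line at the given offset."""
    return [ord(character) - offset for character in quality_string]


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q), encode_quality(q))
```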
File Organization
[benpass@solexalign]$ ls
Sample_FS53_EPCAM+_CD10-_IL2270-18
Sample_FS53_EPCAM+_CD10+_IL2269-19
Sample_COH77_CD49F-_IL2275-13
Sample_COH77_CD49F+_CD66-_IL2274-14
Sample_COH77_CD49F+_CD66+_IL2273-15
Sample_COH74_EPCAM+_CD10-_IL2272-16
Sample_COH74_EPCAM+_CD10+_IL2271-17
Sample_COH69_EPCAM+_CD10-_IL2268-20
Sample_COH69_EPCAM+_CD10+_IL2267-21
[benpass@solexalign]$ ls Sample_COH77_CD49F-_IL2275-13
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz
File name fields highlighted on the slides
• Barcode: AGTCAA
• Read: R1 or R2
• Format: fastq
• gzip compressed: .gz
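The file names in the listing can be split into their fields and the R1/R2 files paired up with a small Python sketch. The pattern below is an assumption inferred from the names shown in the listing (sample, barcode, lane, read, chunk number); `parse_fastq_name` and `pair_chunks` are hypothetical helper names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed layout: <sample>_<barcode>_L<lane>_R<read>_<chunk>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename: str) -> Dict[str, object]:
    """Return the fields encoded in one FASTQ file name."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "sample": match.group("sample"),
        "barcode": match.group("barcode"),
        "lane": int(match.group("lane")),
        "read": int(match.group("read")),
        "chunk": int(match.group("chunk")),
    }


def pair_chunks(filenames: Iterable[str]) -> Dict[Tuple[str, int, int], Dict[int, str]]:
    """Group files so each (sample, lane, chunk) maps read number to file name."""
    pairs: Dict[Tuple[str, int, int], Dict[int, str]] = defaultdict(dict)
    for name in filenames:
        info = parse_fastq_name(name)
        pairs[(info["sample"], info["lane"], info["chunk"])][info["read"]] = name
    return dict(pairs)


names = [
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz",
]
info = parse_fastq_name(names[0])
pairs = pair_chunks(names)
print(info)
```

Pairing by chunk matters because, for paired-end runs, each R1 chunk has a matching R2 chunk with one-to-one corresponding reads.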
Initial Read Assessment
Common problems that can affect analysis
• Low confidence base calls
– typically toward ends of reads
– criteria vary by application
• Presence of adapter sequence in reads
– poor fragment size selection
– protocol execution or artifacts
• Over-abundant sequence duplicates
• Library contamination
Initial Read Assessment: FastQC
• Free Download
Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
• Samples reads (200K default): fast, low resource use
Read Assessment Examples
Sanger Quality Score by Cycle
Median, interquartile range, 10-90 percentile range, mean
Read Duplication
http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc
• ~8% of sampled sequences occur twice
• ~6% of sequences occur more than 10x
• ~71.48% of sequences are duplicates
Note: Duplication is based on read identity, not alignment, at this point
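Duplication by read identity can be illustrated by counting identical sequences. This is a simplified sketch and not FastQC's exact procedure (FastQC works on a sample of the reads); `duplication_summary` is a hypothetical name and the reads are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_summary(sequences: Iterable[str]) -> Dict[str, object]:
    """Summarise duplication by exact sequence identity (no alignment involved)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    # How many distinct sequences occur once, twice, three times, ...
    level_histogram = Counter(counts.values())
    return {
        "total": total,
        "distinct": distinct,
        # Reads beyond the first copy of each distinct sequence
        "percent_duplicate": 100.0 * (total - distinct) / total if total else 0.0,
        "level_histogram": dict(level_histogram),
    }


# Made-up reads: "ACGT" occurs three times, the others once.
reads = ["ACGT", "ACGT", "TTTT", "GGGG", "ACGT"]
summary = duplication_summary(reads)
print(summary)
```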
Read Assessment Example (Cont’d)
Per base sequence content should resemble this…
TruSeq Adapter, Index 9 5’
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
Trim for base quality or adapters
(run or library issue)
Trim leading bases
(library artifact)
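The two 3'-end fixes named above, adapter clipping and quality trimming, can be sketched as follows. This is a deliberately simple illustration under stated assumptions: exact matching only (real trimmers tolerate mismatches), Phred+33 qualities, and hypothetical function names. The adapter constant is the TruSeq Index 9 sequence from the slide.

```python
from typing import Tuple

TRUSEQ_INDEX9 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG"


def clip_adapter(bases: str, qualities: str, adapter: str,
                 min_overlap: int = 5) -> Tuple[str, str]:
    """Remove the adapter and everything after it (exact matches only)."""
    start = bases.find(adapter)
    if start == -1:
        # The read may end partway through the adapter: look for the longest
        # adapter prefix (at least min_overlap bases) at the read's 3' end.
        for overlap in range(min(len(adapter) - 1, len(bases)), min_overlap - 1, -1):
            if bases.endswith(adapter[:overlap]):
                start = len(bases) - overlap
                break
    if start == -1:
        return bases, qualities
    return bases[:start], qualities[:start]


def trim_low_quality_tail(bases: str, qualities: str, min_q: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while their quality is below min_q."""
    end = len(bases)
    while end > 0 and ord(qualities[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], qualities[:end]


# Made-up insert sequence followed by the first 12 adapter bases.
read = "ACGTACGTAC" + TRUSEQ_INDEX9[:12]
clipped = clip_adapter(read, "I" * len(read), TRUSEQ_INDEX9)
trimmed = trim_low_quality_tail("ACGTACGT", "IIIIII##")
print(clipped, trimmed)
```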
Selected Tools to Process Reads
Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/
(partial list)
FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise).
FASTQ Clipper: Removing sequencing adapters
FASTQ Quality Filter: Filters sequences based on quality
FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
*defaults to old Illumina fastq (ASCII offset 64). Use the -Q33 option.
SeqPrep https://github.com/jstjohn/SeqPrep
Adapter trimming
Merge overlapping paired-end reads
Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html
(for Python programmers)
Especially useful for implementing custom/complex sequence analysis/manipulation
Galaxy http://galaxy.psu.edu
Great for beginners: upload data, point and click
Just about everything you’ll see in today’s presentations
Read Mapping
http://www.broadinstitute.org/igv/
Read Mapping: Aligning to a Reference
Feature | SOAP2 (2.20) | Bowtie (0.12.8) | BWA (0.6.2) | Novoalign (2.07.00)
License | GPL v3 | LGPL v3 | GPL v3 | Commercial
Mismatch allowed | exactly 0, 1, 2 | 0-3 max in read | user specified; max is function of read length and error rate | up to 8 or more
Alignments reported per read | random/all/none | user selected | user selected | random/all/none
Gapped alignment | 1-3bp gap | no | yes | up to 7bp
Pair-end reads | yes | yes | yes | yes
Best alignment | minimal number of mismatches | minimal number of mismatches | minimal number of mismatches | highest alignment score
Trim bases | 3’ end | 3’ and 5’ end | 3’ and 5’ end | 3’ end
BWA Features
• Uses Burrows Wheeler Transform
— fast
— modest memory footprint (<4GB)
• Accurate
• Tolerates base mismatches
— increased sensitivity
— reduces allele bias
• Gapped alignment for both single- and paired-end reads
• Automatically adjusts parameters based on read lengths and error rates
• Native SAM output (the de facto standard); convert to BAM with samtools
• Large installed base, well-supported
• Open-source (no charge)
Read Mapping: BWA
Sequence References and Annotations
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
Comprehensive reference information
http://hgdownload.cse.ucsc.edu/downloads.html
Comprehensive reference, annotation, and translation information
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
Reference and SNP data from GATK
Human only
http://cufflinks.cbcb.umd.edu/igenomes.html
Pre-indexed references and gene annotations for Tuxedo suite
Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast
http://www.repeatmasker.org/
Fasta Sequence Format
>chr1
…
TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
taaccatatttgggatatactggattttaaaaatatatcactaatttcat
…
>chr2
…
>chr3
…
• One or more sequences per file
• “>” denotes beginning of sequence or contig
• Subsequent lines up to the next “>” define sequence
• Lowercase base denotes repeat masked base
• Contig ID may have comments delimited by “|”
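The FASTA rules above translate directly into a small reader. This is a minimal sketch with hypothetical names (`read_fasta`, `masked_fraction`); the sequences in the example are made up, and for real genomes an indexed reader is usually a better fit than loading whole chromosomes into memory.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; ">" starts a new sequence or contig."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            if name is None:
                raise ValueError("sequence data found before the first '>' line")
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(sequence: str) -> float:
    """Fraction of bases that are lowercase, i.e. repeat masked."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Made-up miniature reference with two contigs.
example = io.StringIO(">chr1\nTGGAct\ntt\n>chr2\nACGT\n")
contigs = list(read_fasta(example))
print(contigs, masked_fraction(contigs[0][1]))
```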
Running BWA
Input files:
reference.fasta, read1.fastq.gz, read2.fastq.gz
Step 1: Index the genome (~3 CPU hours for a human genome reference):
bwa index -a bwtsw reference.fasta
Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates:
bwa aln reference.fasta read1.fastq.gz > read1.sai
bwa aln reference.fasta read2.fastq.gz > read2.sai
Apply option -q <quality threshold> to trim poor-quality bases at the 3' ends of reads
Step 3: Generate alignments in the SAM format (paired-end):
bwa sampe reference.fasta read1.sai read2.sai \
  read1.fastq.gz read2.fastq.gz > alignment_output.sam
http://bio-bwa.sourceforge.net/bwa.shtml
Running BWA (Cont’d)
Simple Form:
bwa sampe reference.fasta read1.sai read2.sai \
  read1.fastq.gz read2.fastq.gz > alignment.sam
Output to BAM:
bwa sampe reference.fasta read1.sai read2.sai \
  read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam
With Read Group Information:
bwa sampe -r "@RG\tID:readgroupID\tLB:libraryname\tSM:samplename\tPL:ILLUMINA" \
  reference.fasta \
  read1.sai read2.sai \
  read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam
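The read group string is easy to get wrong because its fields are tab-separated. A small Python sketch, with hypothetical helper names, that builds the @RG line and the escaped form passed to `bwa sampe -r`:

```python
def read_group_header(rg_id: str, library: str, sample: str,
                      platform: str = "ILLUMINA") -> str:
    """Build a tab-separated @RG header line from its fields."""
    fields = {"ID": rg_id, "LB": library, "SM": sample, "PL": platform}
    for tag, value in fields.items():
        if not value or "\t" in value or "\n" in value:
            raise ValueError(f"invalid value for {tag}: {value!r}")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


def as_command_line_argument(header_line: str) -> str:
    """Replace real tabs with the two characters backslash and t, as typed on the command line."""
    return header_line.replace("\t", "\\t")


header = read_group_header("readgroupID", "libraryname", "samplename")
print(as_command_line_argument(header))
```

The same sample name (SM) should be used for every read group that belongs to one biological sample, since downstream tools group reads by it.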
SAM (BAM) Format
Sequence Alignment/Map format
– Universal standard
– Human-readable (SAM) and compact (BAM) forms
Structure
– Header: version, sort order, reference sequences, read groups, program/processing history
– Alignment records
[benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
…
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
…
@RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
@RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
@RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
@RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
@RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
…
@RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
@PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.list
LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15
maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30
maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null
generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null
statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:bwa PN:bwa VN:0.6.2-r126
• samtools view -H: view the BAM header
• @HD: sort order
• @SQ: reference sequence names with lengths
• @RG: read groups with platform, library and sample information
• @PG: program (analysis) history
SAM/BAM Format: Header
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188
TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC
=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/
RG:Z:86-191
HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119
ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA
8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0
RG:Z:SDH023
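* Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
SAM/BAM Format: Alignment Records
http://samtools.sourceforge.net/SAM1.pdf
Columns labeled on the slide: 1 (read name), 3 (reference name), 4 (position), 5 (mapping quality), 6 (CIGAR), 8 (mate position), 9 (insert size), 10 (read bases), 11 (base qualities)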
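Column 2 of each alignment record is the bitwise FLAG (147 and 83 in the records above). A short Python sketch, with hypothetical names, that decodes the flag and splits a tab-delimited record into the 11 mandatory SAM columns; the bit meanings follow the SAM specification (the 0x800 bit comes from later revisions of the specification).

```python
from typing import Dict, List

SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

MANDATORY_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def decode_flag(flag: int) -> List[str]:
    """Names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def parse_alignment(line: str) -> Dict[str, object]:
    """Split one tab-delimited alignment record into named columns."""
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(MANDATORY_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]  # optional fields such as RG:Z:...
    return record


# First record from the slide, with SEQ and QUAL shortened to 5 positions.
line = ("HW-ST605:127:B0568ABXX:2:1201:10933:3739\t147\tchr1\t27675\t60\t101M"
        "\t=\t27588\t-188\tTCATT\t=7;:;\tRG:Z:86-191")
record = parse_alignment(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```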
Preparing for Next Steps
• Subsequent steps require sorted and indexed BAMs
– Sort orders: karyotypic, lexicographical
– Indexing improves analysis performance
• Picard tools: fast, portable, free
http://picard.sourceforge.net/command-line-overview.shtml
Sort: SortSam.jar
Merge: MergeSamFiles.jar
Index: BuildBamIndex.jar
• Order: sort, merge (optional), index
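The two sort orders differ only in how contig names are ranked. In this sketch, "karyotypic" is assumed to mean the order in which contigs appear in the reference (the @SQ order in the header); the function names are hypothetical.

```python
from typing import List, Sequence


def lexicographic_order(contigs: Sequence[str]) -> List[str]:
    """Plain string sort: chr10 sorts before chr2."""
    return sorted(contigs)


def karyotypic_order(contigs: Sequence[str], reference_order: Sequence[str]) -> List[str]:
    """Sort contigs by their position in the reference's own contig list."""
    rank = {name: index for index, name in enumerate(reference_order)}
    return sorted(contigs, key=lambda name: rank[name])


# Reference order as in a UCSC-style hg19 header (shortened for illustration).
reference_order = ["chrM", "chr1", "chr2", "chr10"]
contigs = ["chr10", "chr2", "chrM", "chr1"]
print(lexicographic_order(contigs))
print(karyotypic_order(contigs, reference_order))
```

Mixing the two orders between a BAM and its reference is a common source of tool errors, which is one reason to sort before merging and indexing.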
Local Realignment
• BWT-based alignment is fast for matching reads to reference
• Individual base alignments often sub-optimal at indels
• Approach
– Fast read mapping with BWT-based aligner
– Realign reads at indel sites using gold standard (but much slower) Smith-Waterman [1] algorithm
• Benefits
– Refines location of indels
– Reduces erroneous SNP calls
– Very high alignment accuracy in significantly less time, with fewer resources
[1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
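For reference, a textbook Smith-Waterman local alignment fits in a few dozen lines. This sketch uses a linear gap penalty and made-up scoring values (match 2, mismatch -1, gap -2); it illustrates the algorithm only and is not the realigner's actual implementation.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        substitution = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + substitution:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


# Made-up reference and a read with one base deleted.
result = smith_waterman("ACGTACGTAC", "ACGTCGTAC")
print(result)
```

The matrix fill costs time proportional to the product of the two sequence lengths, which is why it is applied only around candidate indel sites rather than to every read against the whole genome.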
Local Realignment
Figure panels: Raw BWA alignment; Post re-alignment at indels
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Duplicate Marking
• Covered in genotyping presentation
• Note that this is done after alignment
Base Quality Recalibration
STEP 1: Find covariates at non-dbSNP sites using:
• Reported quality score
• The position within the read
• The preceding and current nucleotide (sequencer properties)
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -I alignment.bam \
  -R hg19/ucsc.hg19.fasta \
  -knownSites hg19/dbsnp_135.hg19.vcf \
  -o alignment.recal_data.grp
STEP 2: Generate BAM with recalibrated base scores:
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T PrintReads \
  -R hg19/ucsc.hg19.fasta \
  -I alignment.bam \
  -BQSR alignment.recal_data.grp \
  -o alignment.recalibrated.bam
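The idea behind STEP 1 can be sketched as a tally: at sites that are not known variants, count how often bases with a given set of covariates disagree with the reference, then convert that empirical error rate to a Phred score. This is a toy illustration only, not GATK's model; the function name, the input tuple layout and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

# One observed base: (site, reported quality, position in read, dinucleotide, mismatch?)
Observation = Tuple[Hashable, int, int, str, bool]
CovariateKey = Tuple[int, int, str]


def empirical_qualities(observations: Iterable[Observation],
                        known_sites: Set[Hashable]) -> Dict[CovariateKey, float]:
    """Empirical Phred score for each (reported quality, cycle, context) bin."""
    tallies: Dict[CovariateKey, list] = defaultdict(lambda: [0, 0])
    for site, reported_q, cycle, context, is_mismatch in observations:
        if site in known_sites:
            continue  # mismatches at known variant sites are not counted as errors
        tally = tallies[(reported_q, cycle, context)]
        tally[0] += int(is_mismatch)
        tally[1] += 1
    table = {}
    for key, (mismatches, total) in tallies.items():
        # Add-one smoothing (an assumption here) keeps the rate above zero.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Made-up data: 98 bases reported as Q30 at cycle 5 after "AC", 9 of them
# mismatching, plus one mismatch at a known variant site that is skipped.
observations = [(("chr1", i), 30, 5, "AC", i < 9) for i in range(98)]
observations.append((("chr1", 1000), 30, 5, "AC", True))
table = empirical_qualities(observations, known_sites={("chr1", 1000)})
print(table)
```

In this made-up bin the bases were reported as Q30 but behave like roughly Q10, which is the kind of discrepancy recalibration is meant to correct.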
Base Quality Recalibration (Cont’d)
Getting Started
Is there an easier way to get started?!!
http://galaxy.psu.edu/ Click “Use Galaxy”
Q&A
Sri Ambati
 
A Survey of NGS Data Analysis on Hadoop
A Survey of NGS Data Analysis on HadoopA Survey of NGS Data Analysis on Hadoop
A Survey of NGS Data Analysis on Hadoop
Chung-Tsai Su
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
QIAGEN
 
Ngs workshop passarelli-mapping-1

  • 1. Read Processing and Mapping: From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 12 September 2012
  • 2. Samples to Information: Variant calling, Gene expression, Chromatin structure, Methylome, Immunorepertoires, De novo assembly, …
  • 3. Many Analysis Pipelines Start with Read Mapping
      Genotyping (GATK): http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg http://www.broadinstitute.org/gatk/guide/topic?name=intro
      RNA-seq (Tuxedo): http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
  • 4. From Raw to Analysis-ready Reads
      Pipeline: Raw reads → Read assessment and prep → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Analysis-ready reads
      Session Topics
      • Understand read data formats and quality scores
      • Identify and fix some common read data problems
      • Find and prepare a genomic reference for mapping
      • Map reads to a genome reference
      • Understand alignment output
      • Sort, merge, and index alignments for further analysis
      • Locally realign at indels to reduce alignment artifacts
      • Mark/eliminate duplicate reads
      • Recalibrate base quality scores
      • An easy way to get started
  • 5. Instrument Output
      Instruments: Illumina MiSeq, Illumina HiSeq, IonTorrent PGM, Roche 454, Pacific Biosciences RS
      Illumina: Images (.tiff), Cluster intensity file (.cif), Base call file (.bcl)
      IonTorrent / Roche 454: Standard flowgram file (.sff)
      Pacific Biosciences: Movie, Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
      All converge on: Sequence Data (FASTQ Format)
  • 6. FASTQ Format (Illumina Example)
      @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
      CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
      +
      BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
      AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
      +
      @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
      @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
      CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
      +
      CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
      @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
      GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
      +
      CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
      Each record has four lines: Read Record Header; Read Bases; Separator (with optional repeated header); Read Quality Scores.
      Header fields labeled on the slide (values from the first record): Flow Cell ID (D17DBACXX), Lane (2), Tile (1101), Tile Coordinates (12432:5554), Barcode (AGTCAA).
      NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
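The four-line record layout above can be read with a few lines of Python. This is a minimal sketch, not a tool used in the slides: the names `FastqRecord`, `read_fastq`, and `parse_illumina_header` are hypothetical, and the header parser assumes the Illumina 1.8+ style header shown in the example (seven colon-separated fields, a space, then four more).

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str  # header line without the leading "@"
    bases: str   # read bases
    quals: str   # ASCII-encoded quality scores, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed, four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(bases) != len(quals):
            raise ValueError(f"bases/qualities length mismatch in: {header!r}")
        yield FastqRecord(header[1:], bases, quals)


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split an Illumina 1.8+ header into named fields (assumed layout)."""
    left, right = header.split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, is_filtered, control, barcode = right.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read, "is_filtered": is_filtered,
        "control": control, "barcode": barcode,
    }


# Usage with a shortened, made-up record that reuses the header from the slide.
example = io.StringIO(
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n"
    "ACGT\n"
    "+\n"
    "IIII\n"
)
for record in read_fastq(example):
    print(parse_illumina_header(record.header)["flowcell"], record.bases)
```

For gzip-compressed files, the same generator should work on a handle opened with `gzip.open(path, "rt")`.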
  • 7. Base Call Quality: Phred Quality Scores
      Phred* quality score Q with base-calling error probability P:
      Q = -10 log10(P)
      * Name of first program to assign accurate base quality scores. From the Human Genome Project.
      SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
      ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
      LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      |                         |    |        |                              |                     |
      33                        59   64       73                             104               126
      S - Sanger         Phred+33  range: 0 to 40
      I - Illumina 1.3+  Phred+64  range: 0 to 40
      L - Illumina 1.8+  Phred+33  range: 0 to 41
      Q score | Probability of base error | Base confidence | Sanger-encoded (Q score + 33) ASCII character
      10      | 0.1                       | 90%             | “+”
      20      | 0.01                      | 99%             | “5”
      30      | 0.001                     | 99.9%           | “?”
      40      | 0.0001                    | 99.99%          | “I”
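The relationship between Q, P, and the ASCII encoding can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the default offset of 33 corresponds to the Sanger / Illumina 1.8+ encoding listed above (pass 64 for Illumina 1.3+).

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the Phred formula: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def char_to_phred(ch: str, offset: int = 33) -> int:
    """Decode one quality character into its integer Q score."""
    return ord(ch) - offset


def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode an integer Q score as one quality character."""
    return chr(q + offset)


# Reproduce the table rows: Q score, error probability, encoded character.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q), phred_to_char(q))
```

The same Q score maps to different characters under the two offsets, which is why a tool that assumes the wrong encoding will misread every quality value.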
  • 8. File Organization
      [benpass@solexalign]$ ls
      Sample_FS53_EPCAM+_CD10-_IL2270-18
      Sample_FS53_EPCAM+_CD10+_IL2269-19
      Sample_COH77_CD49F-_IL2275-13
      Sample_COH77_CD49F+_CD66-_IL2274-14
      Sample_COH77_CD49F+_CD66+_IL2273-15
      Sample_COH74_EPCAM+_CD10-_IL2272-16
      Sample_COH74_EPCAM+_CD10+_IL2271-17
      Sample_COH69_EPCAM+_CD10-_IL2268-20
      Sample_COH69_EPCAM+_CD10+_IL2267-21
      [benpass@solexalign]$ ls Sample_COH77_CD49F-_IL2275-13
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz
      COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz
      Filename parts highlighted on the slide: Barcode (AGTCAA), Read (R1 or R2), Format (fastq), gzip compressed (.gz)
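Because the file names encode barcode, lane, and read, a short script can split them into fields and pair each R1 file with its R2 mate. This is a sketch under assumptions: the pattern follows the names in the listing above, the sample part is taken to be everything before the barcode, and the names `parse_fastq_name`, `pair_files`, and the field name `chunk` (for the trailing `001` to `004`) are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# <sample>_<barcode>_L<lane>_R<read>_<chunk>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename: str) -> Dict[str, str]:
    """Split a FASTQ file name into sample, barcode, lane, read, and chunk."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.groupdict()


def pair_files(filenames: List[str]) -> List[Tuple[str, str]]:
    """Return (R1, R2) pairs that share sample, barcode, lane, and chunk."""
    grouped: Dict[Tuple[str, str, str, str], Dict[str, str]] = {}
    for name in filenames:
        fields = parse_fastq_name(name)
        key = (fields["sample"], fields["barcode"], fields["lane"], fields["chunk"])
        grouped.setdefault(key, {})[fields["read"]] = name
    return [
        (reads["1"], reads["2"])
        for _, reads in sorted(grouped.items())
        if "1" in reads and "2" in reads
    ]


listing = [
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz",
]
print(parse_fastq_name(listing[1]))
print(pair_files(listing))
```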
  • 9. Initial Read Assessment
      Common problems that can affect analysis
      • Low confidence base calls
        – typically toward ends of reads
        – criteria vary by application
      • Presence of adapter sequence in reads
        – poor fragment size selection
        – protocol execution or artifacts
      • Over-abundant sequence duplicates
      • Library contamination
  • 10. Initial Read Assessment: FastQC
      • Free download
        Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
        Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
      • Samples reads (200K default): fast, low resource use
  • 11. Read Assessment Examples
      Sanger Quality Score by Cycle [plot: median, interquartile range, 10-90 percentile range, mean]
      Read Duplication [plot]
      – ~8% of sampled sequences occur twice
      – ~6% of sequences occur more than 10x
      – ~71.48% of sequences are duplicates
      Note: at this point, duplication is based on read identity, not alignment
      http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc
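Duplication "based on read identity" means counting reads whose base strings are exactly equal, before any alignment. The sketch below illustrates that idea only; it is not FastQC's method (which samples reads and reports its own statistic), and `duplication_summary` with its output keys is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_summary(sequences: Iterable[str]) -> Dict[str, float]:
    """Summarize exact-sequence duplication among reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences given")
    distinct = len(counts)
    return {
        "total_reads": total,
        "distinct_sequences": distinct,
        # Fraction of reads that are extra copies of an already-seen sequence.
        "duplicate_fraction": 1 - distinct / total,
        "sequences_seen_twice": sum(1 for n in counts.values() if n == 2),
        "sequences_seen_over_10x": sum(1 for n in counts.values() if n > 10),
    }


reads = ["ACGT", "ACGT", "TTTT", "GGGG", "GGGG", "GGGG"]
print(duplication_summary(reads))
```

Holding every read in memory is only practical for small files or a sample of reads, which is one reason a tool might examine a subset rather than the whole run.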
  • 12. Read Assessment Example (Cont’d): Per base sequence content should resemble this…
  • 13. Read Assessment Example (Cont’d)
  • 14. Read Assessment Example (Cont’d)
      TruSeq Adapter, Index 9: 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
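Adapter sequence typically shows up at the 3' end of a read when the insert is shorter than the read length. As a rough illustration (exact matching only, no mismatches allowed, so much less tolerant than real clippers), the hypothetical `clip_adapter` below removes a full or partial copy of the adapter from the end of a read; the `min_overlap` parameter is an assumption chosen for the example.

```python
TRUSEQ_INDEX9 = (
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG"
)


def clip_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove an exact adapter match (full, or a prefix at the read's end)."""
    for start in range(len(read)):
        tail = read[start:]
        if len(tail) < min_overlap:
            break  # remaining tail is too short to call confidently
        # Either the whole adapter begins here, or the read ends partway
        # through the adapter.
        if tail.startswith(adapter) or adapter.startswith(tail):
            return read[:start]
    return read


insert = "ACGTACGT"
print(clip_adapter(insert + TRUSEQ_INDEX9[:10], TRUSEQ_INDEX9))  # ACGTACGT
print(clip_adapter("ACGTACGTAC", TRUSEQ_INDEX9))                 # unchanged
```

A short `min_overlap` removes more adapter remnants but also clips more reads that merely end in a few adapter-like bases by chance.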
  • 15. Read Assessment Example (Cont’d)
      – Trim for base quality or adapters (run or library issue)
      – Trim leading bases (library artifact)
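Both kinds of trimming reduce to slicing the base and quality strings together. The following is a simplified sketch, assuming Phred+33 qualities and a plain "drop trailing bases below a threshold" rule; the function names and the default threshold of 20 are illustrative choices, not settings from the slides.

```python
from typing import Tuple


def trim_low_quality_3prime(
    bases: str, quals: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Drop trailing bases whose quality is below min_q."""
    end = len(bases)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


def trim_leading(bases: str, quals: str, n: int) -> Tuple[str, str]:
    """Drop the first n bases (and their qualities) from a read."""
    return bases[n:], quals[n:]


# '#' encodes Q2 and 'I' encodes Q40 under Phred+33.
print(trim_low_quality_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
print(trim_leading("NNACGT", "##IIII", 2))          # ('ACGT', 'IIII')
```

For paired-end data, the two files must stay in one-to-one correspondence, so a read trimmed to nothing should be handled together with its mate rather than dropped from one file only.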
  • 16. Selected Tools to Process Reads
      Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/ (partial list)
      – FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
      – FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise)
      – FASTQ Clipper: Removing sequencing adapters
      – FASTQ Quality Filter: Filters sequences based on quality
      – FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
      – FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
      * Defaults to old Illumina FASTQ (ASCII offset 64). Use the -Q33 option.
      SeqPrep https://github.com/jstjohn/SeqPrep
      – Adapter trimming
      – Merge overlapping paired-end reads
      Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html (for Python programmers)
      – Especially useful for implementing custom/complex sequence analysis/manipulation
      Galaxy http://galaxy.psu.edu
      – Great for beginners: upload data, point and click
      – Just about everything you’ll see in today’s presentations
  • 17. Read Mapping http://www.broadinstitute.org/igv/
  • 18. Read Mapping: Aligning to a Reference
      | Feature | SOAP2 (2.20) | Bowtie (0.12.8) | BWA (0.6.2) | Novoalign (2.07.00) |
      |---|---|---|---|---|
      | License | GPL v3 | LGPL v3 | GPL v3 | Commercial |
      | Mismatch allowed | exactly 0, 1, 2 | 0-3 max in read | user specified; max is a function of read length and error rate | up to 8 or more |
      | Alignments reported per read | random/all/none | user selected | user selected | random/all/none |
      | Gapped alignment | 1-3bp gap | no | yes | up to 7bp |
      | Paired-end reads | yes | yes | yes | yes |
      | Best alignment | minimal number of mismatches | minimal number of mismatches | minimal number of mismatches | highest alignment score |
      | Trim bases | 3’ end | 3’ and 5’ end | 3’ and 5’ end | 3’ end |
  • 19. Read Mapping: BWA
      BWA Features
      • Uses Burrows-Wheeler Transform
        – fast
        – modest memory footprint (<4GB)
      • Accurate
      • Tolerates base mismatches
        – increased sensitivity
        – reduces allele bias
      • Gapped alignment for both single- and paired-end reads
      • Automatically adjusts parameters based on read lengths and error rates
      • Native SAM output (SAM/BAM is the de facto standard)
      • Large installed base, well-supported
      • Open-source (no charge)
  • 20. Sequence References and Annotations
      – http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml, http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome: Comprehensive reference information
      – http://hgdownload.cse.ucsc.edu/downloads.html: Comprehensive reference, annotation, and translation information
      – ftp://[email protected]/bundle: Reference and SNP data from GATK; human only
      – http://cufflinks.cbcb.umd.edu/igenomes.html: Pre-indexed references and gene annotations for the Tuxedo suite; Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast
      – http://www.repeatmasker.org/
  • 21. Fasta Sequence Format
      >chr1
      …
      TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
      agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
      attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
      aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
      taaccatatttgggatatactggattttaaaaatatatcactaatttcat
      …
      >chr2
      …
      >chr3
      …
      • One or more sequences per file
      • “>” denotes beginning of sequence or contig
      • Subsequent lines up to the next “>” define sequence
      • Lowercase base denotes repeat-masked base
      • Contig ID may have comments delimited by “|”
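The FASTA rules listed on this slide translate directly into a short parser. The Python sketch below uses only the standard library, and the function names and toy records are illustrative assumptions. It takes the text up to the first whitespace after ">" as the sequence ID and treats lowercase bases as repeat-masked.

```python
def parse_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line starts a new sequence; keep the first token as its ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Every line up to the next ">" belongs to the current sequence.
            records[name].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def masked_fraction(seq):
    """Fraction of bases that are lowercase, i.e. soft-masked repeats."""
    return sum(1 for base in seq if base.islower()) / len(seq) if seq else 0.0


# Hypothetical toy input; a real reference would be read from a file.
toy = [">chr1 example", "TGGAct", "gaAA", ">chr2", "NNNN"]
genome = parse_fasta(toy)
print(genome)
print(masked_fraction(genome["chr1"]))
```

For whole genomes, Biopython's `SeqIO` module or an indexed reader is a better fit than holding every sequence in a dictionary.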
  • 22. Running BWA (http://bio-bwa.sourceforge.net/bwa.shtml)
      Input files: reference.fasta, read1.fastq.gz, read2.fastq.gz
      Step 1: Index the genome (~3 CPU hours for a human genome reference):
        bwa index -a bwtsw reference.fasta
      Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates:
        bwa aln reference.fasta read1.fastq.gz > read1.sai
        bwa aln reference.fasta read2.fastq.gz > read2.sai
        Apply option -q <quality threshold> to trim poor-quality bases at the 3' ends of reads
      Step 3: Generate alignments in the SAM format (paired-end):
        bwa sampe reference.fasta read1.sai read2.sai read1.fastq.gz read2.fastq.gz > alignment_output.sam
  • 23. Running BWA (Cont’d)
      Simple Form:
        bwa sampe reference.fasta read1.sai read2.sai read1.fastq.gz read2.fastq.gz > alignment.sam
      Output to BAM:
        bwa sampe reference.fasta read1.sai read2.sai read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam
      With Read Group Information:
        bwa sampe -r "@RG\tID:readgroupID\tLB:libraryname\tSM:samplename\tPL:ILLUMINA" reference.fasta read1.sai read2.sai read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam
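The read group argument is a single string whose fields are separated by a literal backslash followed by "t", which is easy to mistype on a command line. As a hedged illustration, the Python helper below builds that string from its parts. The function name is hypothetical, and the assumption that bwa converts the two-character "\t" sequence into a tab should be checked against the bwa manual for the version in use.

```python
def read_group_arg(rg_id, library, sample, platform="ILLUMINA"):
    """Build an @RG string with literal backslash-t separators for `bwa sampe -r`."""
    fields = ["@RG", f"ID:{rg_id}", f"LB:{library}", f"SM:{sample}", f"PL:{platform}"]
    # Join with the two characters '\' and 't', not with a real tab character.
    return "\\t".join(fields)


# Hypothetical identifiers, mirroring the placeholders on the slide.
print(read_group_arg("readgroupID", "libraryname", "samplename"))
```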
  • 24. SAM (BAM) Format
      Sequence Alignment/Map format
        – Universal standard
        – Human-readable (SAM) and compact (BAM) forms
      Structure
        – Header: version, sort order, reference sequences, read groups, program/processing history
        – Alignment records
  • 25. SAM/BAM Format: Header
      [benpass align_genotype]$ samtools view -H allY.recalibrated.merge.bam
      @HD VN:1.0 GO:none SO:coordinate
      @SQ SN:chrM LN:16571
      @SQ SN:chr1 LN:249250621
      @SQ SN:chr2 LN:243199373
      @SQ SN:chr3 LN:198022430
      …
      @SQ SN:chr19 LN:59128983
      @SQ SN:chr20 LN:63025520
      @SQ SN:chr21 LN:48129895
      @SQ SN:chr22 LN:51304566
      @SQ SN:chrX LN:155270560
      @SQ SN:chrY LN:59373566
      …
      @RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1
      @RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1
      @RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1
      @RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1
      @RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1
      …
      @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023
      @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
      @PG ID:bwa PN:bwa VN:0.6.2-r126
      Slide callouts:
        – samtools to view BAM header
        – @HD: sort order
        – @SQ: reference sequence names with lengths
        – @RG: read groups with platform, library and sample information
        – @PG: program (analysis) history
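Each header line is a record type followed by tab-separated TAG:VALUE fields, so it is straightforward to parse. Here is a minimal Python sketch with illustrative function names. It assumes real tabs between fields, as in an actual SAM file, whereas the slide text shows them as spaces.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {tag: value})."""
    record_type, *fields = line.rstrip("\n").split("\t")
    # Split on the first ':' only, since values such as CL may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type.lstrip("@"), tags


def reference_lengths(header_lines):
    """Collect {reference_name: length} from the @SQ lines."""
    lengths = {}
    for line in header_lines:
        record_type, tags = parse_header_line(line)
        if record_type == "SQ":
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


header = [
    "@HD\tVN:1.0\tGO:none\tSO:coordinate",
    "@SQ\tSN:chrM\tLN:16571",
    "@SQ\tSN:chr1\tLN:249250621",
    "@RG\tID:86-191\tPL:ILLUMINA\tLB:IL500\tSM:86-191-1",
]
print(reference_lengths(header))
```

In practice a library such as pysam exposes the header directly. The sketch only shows how little structure there is to decode.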
  • 26. SAM/BAM Format: Alignment Records (http://samtools.sourceforge.net/SAM1.pdf)
      [benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
      HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191
      HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023
      * Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
      Columns marked on the slide: 1, 3, 4, 5, 6, 8, 9, 10, 11
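The second column of each alignment record is a bitwise FLAG. The Python sketch below decodes it using the bit meanings from the SAM specification, with descriptive labels of my own choosing. It shows that the two records on this slide (flags 147 and 83) are the second and first reads of properly paired fragments, both aligned to the reverse strand.

```python
# Bit values follow the SAM specification; the label strings are illustrative.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for flag in (147, 83):
    print(flag, decode_flag(flag))
```

The 0x400 bit is the one set by the duplicate marking step later in the pipeline.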
  • 27. Preparing for Next Steps
      • Subsequent steps require sorted and indexed BAMs
        – Sort orders: karyotypic, lexicographical
        – Indexing improves analysis performance
      • Picard tools: fast, portable, free (http://picard.sourceforge.net/command-line-overview.shtml)
        – Sort: SortSam.jar
        – Merge: MergeSamFiles.jar
        – Index: BuildBamIndex.jar
      • Order: sort, merge (optional), index
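The difference between the two sort orders is easiest to see on contig names. In the Python sketch below, lexicographical order is plain string sorting, while the karyotypic key is an illustrative assumption (autosomes numerically, then X, Y, and the mitochondrial contig last). Real references define their own contig order in the @SQ header lines, and tools generally expect BAMs to follow that order.

```python
def karyotypic_key(name):
    """Sort key: numbered chromosomes first (numerically), then X, Y, M, then the rest."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    special = {"X": 1, "Y": 2, "M": 3, "MT": 3}
    return (special.get(label, 4), 0, label)


contigs = ["chr10", "chr2", "chrM", "chr1", "chrX"]
print(sorted(contigs))                      # lexicographical: chr10 sorts before chr2
print(sorted(contigs, key=karyotypic_key))  # karyotypic under the assumed key
```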
  • 28. Local Realignment
      • BWT-based alignment is fast for matching reads to reference
      • Individual base alignments often sub-optimal at indels
      • Approach
        – Fast read mapping with BWT-based aligner
        – Realign reads at indel sites using gold standard (but much slower) Smith-Waterman [1] algorithm
      • Benefits
        – Refines location of indels
        – Reduces erroneous SNP calls
        – Very high alignment accuracy in significantly less time, with fewer resources
      [1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
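For readers unfamiliar with Smith-Waterman, the following Python sketch implements the textbook dynamic-programming recurrence with a linear gap penalty and a traceback. The scoring values are arbitrary assumptions, and this is not how GATK's realigner is implemented. It only illustrates why the method is exact but costs time proportional to the product of the two sequence lengths.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```

Because every cell of the matrix is filled, running this genome-wide would be impractical, which is why it is reserved for the small windows around candidate indels.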
  • 29. Local Realignment: Raw BWA alignment vs. post re-alignment at indels. Source: DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
  • 30. Duplicate Marking
      • Covered in genotyping presentation
      • Note that this is done after alignment
  • 31. Base Quality Recalibration
      STEP 1: Find covariates at non-dbSNP sites using:
        – Reported quality score
        – The position within the read
        – The preceding and current nucleotide (sequencer properties)
        java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I alignment.bam -R hg19/ucsc.hg19.fasta -knownSites hg19/dbsnp_135.hg19.vcf -o alignment.recal_data.grp
      STEP 2: Generate BAM with recalibrated base scores:
        java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -R hg19/ucsc.hg19.fasta -I alignment.bam -BQSR alignment.recal_data.grp -o alignment.recalibrated.bam
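The idea behind recalibration can be sketched as follows: group bases by their covariates, count how often each group disagrees with the reference at sites not known to be variant, and convert the observed error rate back into a Phred score. The Python below is a simplified illustration under stated assumptions. The input tuple layout, the function name, and the add-one smoothing are hypothetical and are not GATK's actual model.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map (reported_q, cycle, dinucleotide) bins to empirical Phred scores.

    `observations` yields (reported_q, cycle, dinucleotide, is_mismatch) tuples,
    assumed to be already restricted to sites outside known variants.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [observations, mismatches]
    for reported_q, cycle, dinucleotide, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle, dinucleotide)]
        bin_counts[0] += 1
        bin_counts[1] += int(is_mismatch)
    table = {}
    for key, (n, errors) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) in error-free bins.
        error_rate = (errors + 1) / (n + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Hypothetical data: 98 bases reported as Q30 in one bin, none mismatching.
toy = [(30, 5, "AC", False)] * 98
print(empirical_qualities(toy))
```

In this toy case the instrument reported Q30 but the smoothed evidence supports only about Q20, so a recalibrator would lower the scores of bases in that bin.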
  • 32. Base Quality Recalibration (Cont’d)
  • 33. Getting Started: Is there an easier way to get started?!!
  • 34. Getting Started: http://galaxy.psu.edu/ Click “Use Galaxy”
  • 35. Getting Started: http://galaxy.psu.edu/ Click “Use Galaxy”
  • 36. Q&A

Editor's Notes

  • #3: More samples, more data, more runs, and more customers. As HTS is adopted more widely, the average user is less computationally sophisticated, even though the amount of data and the variety of data types are greater than ever.
  • #4: Whether genotyping, RNA-seq, ChIP-seq, or methylation analysis, the data requires processing. The number and makeup of the steps are evolving asynchronously.