Bioinformatics Experimental Design
Where do I start?
Use the following guide to plan experiments that support sound statistical analysis and suit
your research needs. Statistics cannot rescue a bad experimental design.
§ Platform choice:
| Platform | Genome Sequencer FLX Titanium System | Genome Analyzer IIx | HiSeq 2000 | SOLiD 4 system | HeliScope |
| --- | --- | --- | --- | --- | --- |
| Company | Roche | Illumina | Illumina | Applied Biosystems | Helicos Biosciences |
| Read length | 400-600bp | 2x100bp | 2x100-150bp | 50+25bp | ~30bp |
| Samples per run | 16 | 8 | 16 | 16 | 50 |
| Reads per run | ~1 million | ~300 million | ~800 million | >700 million | ~500 million |
| Run time | 10 h | 8 days | 8 days | 11-13 days | 8 days |
| Website | www.454.com | www.illumina.com | www.illumina.com | www.appliedbiosystems.com | www.helicosbio.com |
These numbers change rapidly as technology improves. Please note that these numbers are
based on data from Oct. 2010. Please refer to the websites listed under each platform for the
latest numbers.
Center for Research Informatics Bioinformatics Core last updated May 2015
| Paired End | Single End |
| --- | --- |
| RNASeq - De novo Assembly | RNASeq - Counting |
| RNASeq - Splicing | ChIP-Seq - Counting |
| ChIP-Seq - Epigenetic modifications | |
| DNA - SNP identification | |
| DNA - Indel identification | |
| DNA - Structural variants | |
§ Read Length:
50bp reads are typically sufficient for mapping reads to a reference genome and for
RNASeq counting experiments. Reads longer than 100bp are useful for whole-genome and
transcriptome studies, depending on the application.
§ Replication:
Samples must be sequenced with replicates to identify sources of variance and increase
statistical power to separate true biological variance from technical variance. Biological
replicates are critical whereas technical replicates are typically not required.
Cutting back on replicates to reduce cost might seem like a good option, but remember that a
sample or sequencing run can fail, and without spare replicates the experiment may have to be
repeated.
§ Randomization:
Assign individuals at random to different groups to reduce bias. We recommend
randomization of samples such that each sequencing lane contains samples from all
experimental groups. Please refer to Blocking and Multiplexing below to understand how to
do this.
“Block what you can and randomize what you cannot.” – Box, Hunter, & Hunter (1978)
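The random assignment of individuals to groups described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the function name `assign_groups`, the individual labels, and the fixed seed are all assumptions made for the example.

```python
import random


def assign_groups(individuals, groups, seed=None):
    """Randomly assign individuals to groups of near-equal size.

    `seed` is optional; fixing it makes the assignment reproducible,
    which is useful for record keeping.
    """
    rng = random.Random(seed)
    shuffled = list(individuals)
    rng.shuffle(shuffled)

    # Deal the shuffled individuals out round-robin so that group sizes
    # differ by at most one.
    assignment = {group: [] for group in groups}
    for index, individual in enumerate(shuffled):
        assignment[groups[index % len(groups)]].append(individual)
    return assignment


if __name__ == "__main__":
    # Hypothetical labels for six individuals and two groups.
    individuals = [f"individual{i}" for i in range(1, 7)]
    for group, members in assign_groups(individuals, ["A", "B"], seed=2015).items():
        print(group, members)
```

Recording the seed alongside the sample sheet lets the assignment be regenerated later if the design needs to be audited.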
[Figure: two sequencing layouts for groups A and B, each with biological replicates 1-3 and RNA extractions R1-R3, distributed across flowcell lanes 1-6. The first layout is marked ✗ and the second is marked ✔.]
If
I = Number of groups/treatments
J = Number of biological replicates per treatment
s = Number of unique barcodes that can be added in one lane
L = Number of lanes sequenced
T = Total number of technical replicates
then
$$T = \frac{sL}{JI}$$
If s < I, a complete block design is not possible [1].
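As a rough sketch, the relationship between barcodes, lanes, and technical replicates can be checked in Python before samples are submitted. The function names `technical_replicates` and `balanced_lane_layout` and the sample labels are hypothetical. The layout function additionally assumes there are enough barcodes to place every sample in every lane (s >= I x J), which is one simple way to make each lane contain all experimental groups; it is not the only valid design.

```python
import random


def technical_replicates(s, L, J, I):
    """Return T = s*L / (J*I) for s barcodes, L lanes, J replicates, I groups."""
    if s < I:
        raise ValueError("s < I: a complete block design is not possible")
    return s * L / (J * I)


def balanced_lane_layout(groups, J, s, L, seed=None):
    """Place every barcoded sample in every lane, in random order per lane.

    Assumption for this sketch: s >= len(groups) * J, so one lane can
    hold all samples at once.
    """
    samples = [f"{g}{rep}" for g in groups for rep in range(1, J + 1)]
    if s < len(samples):
        raise ValueError("not enough barcodes to fit all samples in one lane")

    rng = random.Random(seed)
    layout = {}
    for lane in range(1, L + 1):
        order = samples[:]
        rng.shuffle(order)  # randomize barcode/sample order within the lane
        layout[f"Lane{lane}"] = order
    return layout


if __name__ == "__main__":
    # Two groups (A, B), three biological replicates each, six barcodes, six lanes.
    print("T =", technical_replicates(s=6, L=6, J=3, I=2))
    for lane, samples in balanced_lane_layout(["A", "B"], J=3, s=6, L=6, seed=1).items():
        print(lane, samples)
```

With two groups, three biological replicates, six barcodes, and six lanes, each sample is sequenced in all six lanes, so lane effects are spread evenly over both groups rather than confounded with one of them.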
§ Sequencing depth:
The following table provides general recommendations for coverage and read counts
(https://genohub.com/recommended-sequencing-coverage-by-application/) at typical
read lengths for the human genome. For the typical number of reads per lane on
commonly used NGS platforms, please visit
https://genohub.com/next-generation-sequencing-guide/#reads.
Illumina provides a useful resource for coverage estimates specific to its
instruments and to genomes of different sizes:
http://support.illumina.com/downloads/sequencing_coverage_calculator.html
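For a quick back-of-the-envelope check, average coverage is commonly approximated as (number of reads x read length) / genome size. The sketch below applies that general relationship; it does not reproduce the Genohub table or the Illumina calculator. The function names are hypothetical, and the human genome size of about 3.1 Gb is an approximation assumed for the example.

```python
import math

# Assumed approximate haploid human genome size in base pairs.
HUMAN_GENOME_BP = 3_100_000_000


def expected_coverage(n_fragments, genome_size, read_length, paired=False):
    """Average coverage = sequenced bases / genome size.

    For paired-end data, n_fragments counts read pairs and each pair
    contributes two reads of `read_length`.
    """
    bases_per_fragment = read_length * (2 if paired else 1)
    return n_fragments * bases_per_fragment / genome_size


def reads_required(coverage, genome_size, read_length, paired=False):
    """Number of reads (or read pairs, if paired) needed for a target coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(coverage * genome_size / bases_per_fragment)


if __name__ == "__main__":
    pairs = reads_required(30, HUMAN_GENOME_BP, read_length=100, paired=True)
    print(f"Read pairs for 30x with 2x100bp reads: {pairs:,}")
    print("Coverage check:", expected_coverage(pairs, HUMAN_GENOME_BP, 100, paired=True))
```

This estimate ignores duplicates, unmapped reads, and uneven coverage, so real experiments usually need more reads than the formula alone suggests.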
DNA:
Microarray Experiments
A very useful resource for microarray design is:
http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp
• Balanced samples
o Same number of cases and controls
o Matched phenotypes: gender, age, etc.
• Biological replicates
o Pure background to limit unwanted biological variation
o More replicates are needed when variation between individuals is large and the
difference between groups is small
• Avoid technical variation
o Process samples under the same conditions as much as possible
o Keep technician, reagents, time, and procedures consistent
• Randomize samples on array
o Avoid confounding technical and biological factors
o Place samples at random across array slides and positions
Frequently Asked Questions
§ What if I do not have replicates of data points?
Understand the limitations of unreplicated data. Without replicates you cannot separate
technical variance from biological variance, so the results apply only to the samples
sequenced and cannot be extrapolated to the population.
References
1. P. L. Auer and R. W. Doerge. 2010. Statistical design and analysis of RNA sequencing data.
Genetics 185:405-416.