Genomics Deployments - How to Get Right with Software Defined Storage

2018 Storage Developer Conference India © All Rights Reserved.
Genomics Deployments: How To Get Right
With Software Defined Storage
Sandeep Patil
IBM
Acknowledgement: Ulf Troppens, Piyush Chaudhary, Kumaran Rajaram, Theodore Hoover Jr,
Kevin Gildea, Sasikanth Eda, Smita J Raut, Luis Bolinches, Monica Lemay, Carl Zetie,
Joanna Wong

Agenda
• Genomic Introduction
• What is Genomics
• Genomics – An Emerging Market
• Understanding Genomic Sequencing Workloads
• Requirements on Infrastructure
• Solution Approach
• Solution Architecture
• Performance for GATK based on proposed solution

Genomics - Introduction
• Genomics is a branch of biotechnology that focuses on genomes.
• Genomics applies the techniques of genetics and molecular biology
to sequence, analyze, or modify the DNA of an organism.
• It is used in a number of fields, such as diagnostics, personalized
healthcare, agricultural innovation, and forensic science.
Source: Frost & Sullivan: Global Precision Medicine Growth Opportunities, Forecast to 2025

Genomics – An Emerging Market
Why is genomics becoming more relevant?
• Feasibility: decreased cost of sequencing.
- The first sequencing of the whole human genome, in 2003, cost roughly
$2.7 billion.
- Today, sequencing a genome can cost around 1,000 to 1,500 USD.
- DNA sequencing vendors aim to bring the cost down to 100 USD.
• Value:
- Genomics is expected to bring in an era of proactive and personalized
medicine (among other fields), with potential for disruption.
• Investment:
- The market for genomic products and services is growing at 10%
and is predicted to become a $20 billion opportunity by 2020.
- Growth in the genomics market is largely attributed to increasing
government initiatives and increasing research.
Sources: https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
MarketsandMarkets Report

Questions for Storage Community
Questions that come to mind:
• Customer: Is there a reference architecture or approach?
• Solution Architect: How do I design storage for
genomics? What are the workload requirements?
• Storage Developer: Does what I developed meet the genomics
requirements? What does the workload look like?
• Storage Tester: Did my testing cover the requirements of a
genomic workload? What is the workload, and what are the tools?

Questions for Storage Community (Cont.)
Answer to these questions:
We need to understand the genomic sequencing
workload from a storage perspective!

Genomic Sequencing Workload – High Level
Key Characteristics of Genomic Workload
• Genomics requires a significant focus on big data
management, because sequencing a genome
produces a large amount of data.
• Genomic data analysis requires 3 process steps:
1. Sequencers convert the physical sample to
raw data.
2. Raw data is put in a sequence
corresponding to the genome.
3. Analytics (for example, matching mutations with
certain diseases) is then performed.
This requires an easy-to-use and scalable
IT infrastructure for:
1) Owning, managing, and
accessing PBs of file storage
2) High-throughput batch
processing to analyze data

Genomic Sequencing Workload to Storage Requirement Mapping
(based on GATK3 pipeline reference)
Sequencing:
• Need to ingest millions of files (small to medium size)
• Continuous guaranteed writes from multiple sequencers
• Ingest protocols used by sequencer systems: SMB/NFS
Processing Raw Data:
• CPU intensive
• Pattern of writes followed by reads
• Predominantly sequential I/O
• Access to a few large files (GB file size)
• Access protocols: POSIX/NFS
Genome Alignment:
• Memory intensive
• Write intensive; write I/O is predominantly sequential
• Read I/O is random access
• Few output files (MB to GB file size)
• Access protocols: POSIX/NFS
Variant Detection:
• CPU intensive
• Memory intensive
• Mix of sequential and random file access
• Read and write I/O to many files with varying file sizes (KB to GB)
• Access protocols: POSIX/NFS
Collaboration:
• Read intensive
• Multi-region/multi-site
• Authenticated and secure
• Metadata support
• Faster access across WAN
• Access protocols: SMB/NFS/Object
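The stage-to-requirement mapping on this slide can be written down as data, which makes it easy to check a candidate storage system against it. The following is a minimal sketch, not part of the original deck: the class and function names are hypothetical, and it assumes that a stage is served when at least one of its listed protocols is offered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageProfile:
    """Storage-facing profile of one pipeline stage (field names are illustrative)."""
    name: str
    intensive: frozenset   # resources the stage stresses, as listed on the slide
    io_pattern: str        # short summary of the access pattern
    protocols: frozenset   # access protocols listed for the stage


STAGES = (
    # "write" for Sequencing is inferred from "continuous guaranteed writes".
    StageProfile("Sequencing", frozenset({"write"}),
                 "continuous writes of millions of small to medium files",
                 frozenset({"SMB", "NFS"})),
    StageProfile("Processing Raw Data", frozenset({"cpu"}),
                 "writes followed by reads, mostly sequential, few GB-sized files",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Genome Alignment", frozenset({"memory", "write"}),
                 "sequential writes, random reads, few MB to GB output files",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Variant Detection", frozenset({"cpu", "memory"}),
                 "mixed sequential and random I/O to many KB to GB files",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Collaboration", frozenset({"read"}),
                 "read-heavy, multi-site access across WAN",
                 frozenset({"SMB", "NFS", "Object"})),
)


def required_protocols(stages=STAGES):
    """Union of every protocol named by any stage."""
    needed = set()
    for stage in stages:
        needed |= stage.protocols
    return needed


def unserved_stages(offered, stages=STAGES):
    """Names of stages for which none of the listed protocols is offered."""
    offered = set(offered)
    return [s.name for s in stages if not (s.protocols & offered)]


if __name__ == "__main__":
    print(sorted(required_protocols()))
    print(unserved_stages({"SMB"}))
```

A storage system that only speaks SMB would leave the three compute-side stages unserved, which is one way to see why the deck asks for multiprotocol file storage.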

Need for Optimal Solution
• Need to think end to end, which includes compute, network, and storage
as the key building blocks.
• The infrastructure (compute, network, and storage) should be elastic,
allowing building blocks to be scaled in and out like “Lego” blocks.
• For the storage building block: need for high-performance file storage with
multiple access interfaces/protocols. This is not typical Network Attached
Storage (NAS), because a genomic sequencing workload is not a NAS workload
but a technical computing workload.

Need for Elasticity like ‘Lego’ Blocks
… Choosing the Composable Infrastructure Principle
• Composable solutions are built in a way that disaggregates the
underlying building blocks, namely compute, storage, and network
services.
• These disaggregated services provide the required granularity,
allowing the infrastructure to be sliced, diced, expanded, and
contracted at will, based on actual need.
• This eases deployment, with well-defined configuration
and tuning templates per building block.
• Genomic workloads benefit from composable principles because one
can grow and shrink the building blocks based on need.
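The grow-and-shrink idea can be illustrated with a small model in which each building block is resized independently of the others. This is a sketch under stated assumptions, not a description of any product: the class names, the unit counts, and the rule that every block keeps at least one unit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BuildingBlock:
    """One disaggregated service: compute, storage, or network."""
    kind: str
    units: int = 1                                  # identical units deployed
    template: dict = field(default_factory=dict)    # per-block config/tuning template


class Infrastructure:
    """Holds the three building blocks and resizes each one independently."""

    def __init__(self):
        self.blocks = {kind: BuildingBlock(kind)
                       for kind in ("compute", "storage", "network")}

    def resize(self, kind, delta):
        """Scale out (delta > 0) or scale in (delta < 0) a single block."""
        block = self.blocks[kind]
        if block.units + delta < 1:
            # Assumption for this sketch: a block never shrinks below one unit.
            raise ValueError(f"{kind} must keep at least one unit")
        block.units += delta
        return block.units


if __name__ == "__main__":
    infra = Infrastructure()
    infra.resize("compute", 3)    # add compute for a burst of analysis jobs
    infra.resize("compute", -2)   # release it afterwards
    print({kind: block.units for kind, block in infra.blocks.items()})
```

Resizing compute leaves storage and network untouched, which is the property the composable approach relies on.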

IBM Storage & SDI
For Genomics – A Composable Building Block Approach
Shared Network
• High-speed NFS, SMB, and Object data access,
connected to the shared campus network.
• User login to submit and manage batch jobs and to
access interactive applications.
Compute Services
• Scalable compute cluster to analyze genomics
data.
Storage Services
• Scalable storage cluster to store, manage, and
access genomic data.
Private Network Services
• High-speed data network,
not connected to the data center network.
• Provisioning network and service network
for administrative login and hardware services,
optionally connected to the shared campus network.

Storage Service: Need for High Performance File Storage
Aligning to Composable Infrastructure Principles
• Key requirements of a genomic sequencing workload:
- High performance and high throughput are key: this is a technical computing workload,
HPC-like, not a typical NAS workload.
- Should support scale-in and scale-out to adhere to composable infrastructure principles.
- Ability to support different types of storage backend (needs to be software defined).
- Support for a global namespace across the different stages of sequencing.
- Multiprotocol support (NFS, SMB, HDFS, POSIX, Object) for data ingestion,
collaboration, and sequencing computation.
- Easy archiving and cloud integration.
Storage Solution: Take the “software defined” approach and choose a clustered
filesystem that meets the above requirements (e.g. IBM Spectrum Scale).

Choosing a High Performance Clustered Filesystem for Storage (e.g. IBM Spectrum Scale)

Genomics: Storage Building Block Interactions
Clustered Scale-Out Parallel File System (e.g. IBM Spectrum Scale)

Solution Architecture: Putting it all together

Accelerated Performance for Genomics Sequencing
GATK Workflow – Execution time on the profiling environment using the proposed
solution architecture, for a single sample (Solexa WGS Broad dataset with b37 reference)
Step                                                           Execution time
BWA-Mem                                                        303 min 47 sec
sam2bam (storage mode)                                         35 min 53 sec
GATK BaseRecalibrator (java setting -Xmn10g -Xms10g -Xmx10g)   87 min 21 sec
GATK PrintReads                                                97 min 1 sec
GATK HaplotypeCaller                                           261 min 37 sec
GATK mergeVCF                                                  0 min 51 sec
Note: Execution time was measured on the testbed configuration (detailed in
profiling environment). The actual Genomics application performance will
depend on testbed configuration, tunings, and other factors.
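For orientation, the per-step times reported on this slide can be added up to an end-to-end figure. This is a minimal sketch that assumes the six steps run one after another on the single node, so the total is a plain sum; the function names are illustrative.

```python
# Per-step execution times (minutes, seconds) as reported on the slide.
STEP_TIMES = {
    "BWA-Mem": (303, 47),
    "sam2bam": (35, 53),
    "GATK BaseRecalibrator": (87, 21),
    "GATK PrintReads": (97, 1),
    "GATK HaplotypeCaller": (261, 37),
    "GATK mergeVCF": (0, 51),
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    return minutes * 60 + seconds


def format_hms(total_seconds):
    """Render a duration in seconds as 'H h M min S s'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours} h {minutes} min {seconds} s"


step_seconds = {name: to_seconds(*t) for name, t in STEP_TIMES.items()}
total_seconds = sum(step_seconds.values())          # assumes sequential execution
slowest_step = max(step_seconds, key=step_seconds.get)

if __name__ == "__main__":
    print("Total:", format_hms(total_seconds))      # 13 h 6 min 30 s
    for name, secs in step_seconds.items():
        print(f"{name}: {100 * secs / total_seconds:.1f}% of total")
```

Under that assumption the workflow takes about 13 hours for one sample, with BWA-Mem and GATK HaplotypeCaller together accounting for roughly 72% of the time.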
Profiling environment:
• 1x Power8 node (IBM 8247-22L with SMT=8) with 256 GB memory to execute the whole
workflow
• 1x IBM ESS GS4 storage based on SSD (>= 23 GB/s write bandwidth and >= 30 GB/s
read bandwidth)
• Dual-rail FDR InfiniBand aggregating to ~13 GB/s
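The bandwidth figures in the profiling environment suggest that, for a single compute node, the network link rather than the storage system sets the upper bound on data movement. The sketch below illustrates that reasoning under a simplifying assumption: the achievable rate is the minimum of the storage and network figures, ignoring protocol overhead and contention. The 130 GB file size is a hypothetical value chosen only for the example.

```python
# Figures from the profiling environment (GB/s).
STORAGE_WRITE_GBPS = 23.0   # lower bound quoted for the ESS GS4
STORAGE_READ_GBPS = 30.0    # lower bound quoted for the ESS GS4
NETWORK_GBPS = 13.0         # approximate dual-rail FDR InfiniBand aggregate


def effective_bandwidth(storage_gbps, network_gbps):
    """Simplified model: the slower of the two components bounds the rate."""
    return min(storage_gbps, network_gbps)


def transfer_seconds(size_gb, bandwidth_gbps):
    """Ideal transfer time for size_gb at the given rate."""
    return size_gb / bandwidth_gbps


if __name__ == "__main__":
    read_rate = effective_bandwidth(STORAGE_READ_GBPS, NETWORK_GBPS)
    write_rate = effective_bandwidth(STORAGE_WRITE_GBPS, NETWORK_GBPS)
    print(f"Per-node read bound:  {read_rate} GB/s")
    print(f"Per-node write bound: {write_rate} GB/s")
    # Hypothetical 130 GB input file, read at the bounded rate.
    print(f"Ideal read time for 130 GB: {transfer_seconds(130, read_rate):.1f} s")
```

In this model both the read and the write bound come out at the network figure, so the storage system has headroom to serve additional compute nodes.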

References
• Genome Analysis Toolkit: Variant Discovery in High-Throughput Sequencing Data:
https://software.broadinstitute.org/gatk/
• IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads:
http://www.redbooks.ibm.com/abstracts/redp5479.html
• Performance optimization of Broad Institute GATK Best Practices on IBM reference
architecture for healthcare and life sciences:
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=TSW03540USEN
• IBM Reference Architecture for Genomics: Speed, Scale, Smarts:
http://www.redbooks.ibm.com/abstracts/redp5210.html?Open

Thank You!

Workload profile for each GATK processing step for one sample
Source: IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads:
http://www.redbooks.ibm.com/abstracts/redp5479.html
