2018 Storage Developer Conference India © All Rights Reserved.
Genomics Deployments: How To Get Right
With Software Defined Storage
Sandeep Patil
IBM
Acknowledgement: Ulf Troppens, Piyush Chaudhary, Kumaran Rajaram, Theodore Hoover
Jr, Kevin Gildea, Sasikanth Eda, Smita J Raut, Luis Bolinches, Monica Lemay, Carl Zetie,
Joanna Wong
Agenda
• Genomic Introduction
• What is Genomics
• Genomics – An Emerging Market
• Understanding Genomic Sequencing Workloads
• Requirements on Infrastructure
• Solution Approach
• Solution Architecture
• Performance for GATK based on proposed solution
Genomics - Introduction
• Genomics is a branch of biotechnology focusing on genomes.
• Genomics involves applying the techniques of genetics and molecular biology
to sequence, analyze or modify the DNA of an organism.
• It is used in a number of fields, such as diagnostics, personalized
healthcare, agricultural innovation and forensic science.
Frost & Sullivan: Global Precision Medicine Growth Opportunities, Forecast to 2025
Genomics – An Emerging Market
Why is genomics becoming more relevant?
• Feasibility: decreased cost of sequencing.
- The first sequencing of the whole human genome, in 2003, cost roughly
$2.7 billion.
- Today a genome sequence can cost around 1,000 to 1,500 USD.
- DNA sequencing vendors aim to bring it down to 100 USD.
• Value:
- Genomics will bring in an era of proactive and personalized
medicine (among other fields), with potential for disruption.
• Investment:
- The market for genomic products and services is growing at 10%
and is predicted to become a $20 billion opportunity by 2020.
- Growth in the genomics market is largely attributed to increasing
government initiatives and increasing research.
https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
MarketsandMarkets Report
https://pxhere.com/en/photo/1403209
Questions for Storage Community
Questions that come to mind:
• Customer: Is there a reference architecture or approach?
• Solution Architect: How do I design storage for
genomics? What are the workload requirements?
• Storage Developer: Does what I developed meet the genomics
requirements? What does the workload look like?
• Storage Tester: Did my testing cover the requirements of a
genomic workload? What is the workload, and what are the tools?
Questions for Storage Community (Cont.)
Answer to these questions:
We need to understand the genomics sequencing
workload from a storage perspective!
Genomic Sequencing Workload – High Level
Key Characteristics of Genomic Workload
• Genomics requires a significant focus on big data
management, because sequencing a genome
produces a large amount of data.
• Genomic data analysis requires 3 process steps:
1. Sequencers convert the physical sample to
raw data.
2. Raw data is put in a sequence
corresponding to the genome.
3. Analytics (example: matching mutations with
certain diseases) is then performed.
This requires easy-to-use and scalable
IT infrastructure for:
1) Owning, managing and
accessing PBs of file storage
2) High-throughput batch
processing to analyze data.
Genomic Sequencing Workload to Storage Requirement Mapping
(based on GATK3 pipeline reference)

Sequencing
• Need to ingest millions of files (small to medium size)
• Continuous guaranteed writes from multiple sequencers
• Ingest protocol used by sequencer systems: SMB / NFS

Processing Raw Data
• CPU intensive
• Pattern of writes followed by reads
• Predominantly sequential I/O
• Access to a few large files (GB file size)
• Access protocols: POSIX/NFS

Genome Alignment
• Memory intensive
• Write intensive; write I/O is predominantly sequential
• Read I/O is random access
• Few output files (MB to GB file size)
• Access protocols: POSIX/NFS

Variant Detection
• CPU intensive
• Memory intensive
• Mix of sequential and random file access
• Read and write I/O to many files with varying file sizes (KB – GB)
• Access protocols: POSIX/NFS

Collaboration
• Read intensive
• Multi-region / multi-site
• Authenticated and secure
• Metadata support
• Access protocols: SMB/NFS/Object
• Faster access across WAN
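The stage-to-requirement mapping above can be written down as data, which makes it easy to check whether a storage offering covers every stage. The following is a minimal sketch, not part of the original deck: the class and function names are hypothetical, the stage assignments follow a stage-by-stage reading of the slide, and a stage is assumed to be served if the storage offers at least one of its listed protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageProfile:
    """One pipeline stage as read from the mapping slide (hypothetical type)."""
    name: str
    bound_by: tuple       # dominant resources named on the slide, if any
    io_pattern: str       # short summary of the I/O behaviour
    protocols: frozenset  # access protocols listed for the stage


STAGES = [
    StageProfile("Sequencing", (),
                 "continuous writes of many small to medium files",
                 frozenset({"SMB", "NFS"})),
    StageProfile("Processing Raw Data", ("cpu",),
                 "writes followed by reads, predominantly sequential",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Genome Alignment", ("memory",),
                 "sequential writes, random reads",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Variant Detection", ("cpu", "memory"),
                 "mixed sequential and random access to many files",
                 frozenset({"POSIX", "NFS"})),
    StageProfile("Collaboration", (),
                 "read intensive, multi-site",
                 frozenset({"SMB", "NFS", "Object"})),
]


def required_protocols(stages):
    """Union of every protocol mentioned by any stage."""
    needed = set()
    for stage in stages:
        needed |= stage.protocols
    return needed


def stages_not_served(stages, offered):
    """Names of stages for which none of the listed protocols is offered."""
    offered = set(offered)
    return [s.name for s in stages if not (s.protocols & offered)]


if __name__ == "__main__":
    print(sorted(required_protocols(STAGES)))
    print(stages_not_served(STAGES, {"POSIX"}))
```

A storage service that only exposes POSIX would, under these assumptions, leave the Sequencing and Collaboration stages without an access path, which is the motivation for multiprotocol support.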
Need for Optimal Solution
• Need to think end to end, including Compute, Network and Storage
as the key building blocks.
• The infrastructure (Compute, Network & Storage) should be elastic, allowing
scale-in / scale-out of the building blocks, similar to “Lego” blocks.
• For the Storage building block: need for high-performance file storage with
multiple access interfaces/protocols. Not typical Network Attached
Storage (NAS), as a genomic sequencing workload is not a NAS workload
but a Technical Computing workload.
Need for Elasticity like ‘Lego’ Blocks
… Choosing the Composable Infrastructure Principle
• Composable solutions are built in a way that disaggregates the
underlying building blocks, viz. compute, storage, and network
services.
• These disaggregated services provide the required granularity,
allowing the infrastructure to be sliced, diced, expanded and
contracted at will, based on actual need.
• It eases deployment with well-defined configuration
and tuning templates per building block.
• Genomic workloads benefit from composable principles, as one
can grow and shrink the building blocks based on need.
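The composable idea above can be illustrated with a small model in which each building block is scaled independently of the others. This is only a sketch under stated assumptions: the `BuildingBlock` and `Infrastructure` classes, the unit counts and the template names are all hypothetical and do not correspond to any IBM tool or API.

```python
from dataclasses import dataclass, field


@dataclass
class BuildingBlock:
    """A disaggregated service: compute, storage or network (hypothetical model)."""
    kind: str
    units: int      # number of identical units currently deployed
    template: str   # name of the configuration/tuning template (assumed)

    def scale(self, delta):
        """Grow (delta > 0) or shrink (delta < 0) this block only."""
        new_units = self.units + delta
        if new_units < 1:
            raise ValueError(f"{self.kind} block needs at least one unit")
        self.units = new_units
        return self.units


@dataclass
class Infrastructure:
    """A set of building blocks that are scaled independently."""
    blocks: dict = field(default_factory=dict)

    def add(self, block):
        self.blocks[block.kind] = block

    def scale(self, kind, delta):
        return self.blocks[kind].scale(delta)


if __name__ == "__main__":
    infra = Infrastructure()
    infra.add(BuildingBlock("compute", units=4, template="compute-default"))
    infra.add(BuildingBlock("storage", units=2, template="storage-default"))
    infra.scale("compute", +4)   # more batch capacity, storage untouched
    print({kind: block.units for kind, block in infra.blocks.items()})
```

The design point is that scaling one block leaves the others unchanged, so compute can be added for a sequencing campaign without re-provisioning storage.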
IBM Storage & SDI
For Genomics – A Composable Building Block Approach
Shared Network
• High-speed NFS, SMB and Object data access,
connected to shared campus network.
• User login to submit and manage batch jobs and to
access interactive applications.
Compute Services
• Scalable compute cluster to analyze genomics
data.
Storage Services
• Scalable storage cluster to store, manage and
access genomic data.
Private Network Services
• High-speed Data Network,
not connected to data center network.
• Provisioning Network and Service Network
for administrative login and hardware services,
optionally connected to shared campus network.
Storage Service: Need for High Performance File Storage
aligning to Composable Infrastructure Principles
• Key requirements per genomic sequencing workload
• High performance and high throughput are key: a Technical Computing workload,
HPC-like, not a typical NAS workload.
• Should support scale-in and scale-out to adhere to composable infrastructure principles.
• Ability to support different types of storage backend (needs to be software defined)
• Support for a global namespace across the different stages of sequencing.
• Multiprotocol support (NFS, SMB, HDFS, POSIX, Object) for data ingestion,
collaboration and sequencing computation.
• Easy archiving and cloud integration.
Storage Solution: Taking the “Software Defined” approach and choosing a clustered
filesystem that meets the above requirements (e.g. IBM Spectrum Scale)
Choosing a High Performance Clustered Filesystem for Storage (e.g.
IBM Spectrum Scale)
Genomics: Storage Building Block Interactions
Clustered Scale Out Parallel File System (e.g. IBM Spectrum Scale)
Solution Architecture: Putting it all together
IBM Storage & SDI
Accelerated Performance for Genomics Sequencing
GATK Workflow – Execution Time on Profiling Environment using the Proposed
Solution Architecture for single sample
Dataset: Solexa WGS Broad dataset with b37 reference

Step                                                              Execution time
BWA-Mem                                                           303 min 47 sec
sam2bam (storage mode)                                            35 min 53 sec
GATK BaseRecalibrator (java setting -Xmn10g -Xms10g -Xmx10g)      87 min 21 sec
GATK PrintReads (java setting -Xmn10g -Xms10g -Xmx10g)            97 min 1 sec
GATK HaplotypeCaller (java setting -Xmn10g -Xms10g -Xmx10g)       261 min 37 sec
GATK mergeVCF (java setting -Xmn10g -Xms10g -Xmx10g)              0 min 51 sec

Note: Execution time was measured on the testbed configuration (detailed in
profiling environment). The actual Genomics application performance will
depend on testbed configuration, tunings, and other factors.
Profiling environment:
1x Power8 Node (IBM 8247-22L with SMT=8) with 256GB memory to execute whole
workflow.
1x IBM ESS GS4 storage based on SSD (>= 23 GB/s write bandwidth and >= 30 GB/s
read bandwidth)
Dual rail FDR InfiniBand aggregating to ~13 GB/s
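To see where the time goes, the per-step timings above can be summed and expressed as shares of the whole workflow. The sketch below is not from the original deck; it assumes the six steps run back to back on the single node, so that the end-to-end time is simply the sum, and the function names are illustrative.

```python
# Step timings (minutes, seconds) copied from the execution-time figures above.
STEP_TIMES = [
    ("BWA-Mem", 303, 47),
    ("sam2bam", 35, 53),
    ("GATK BaseRecalibrator", 87, 21),
    ("GATK PrintReads", 97, 1),
    ("GATK HaplotypeCaller", 261, 37),
    ("GATK mergeVCF", 0, 51),
]


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to seconds."""
    return minutes * 60 + seconds


def total_seconds(steps):
    """End-to-end time, assuming the steps run strictly one after another."""
    return sum(to_seconds(m, s) for _, m, s in steps)


def share(steps, name):
    """Fraction of the total time spent in the named step."""
    total = total_seconds(steps)
    for step, m, s in steps:
        if step == name:
            return to_seconds(m, s) / total
    raise KeyError(name)


def format_hms(seconds):
    """Render a duration in seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} h {minutes} min {secs} s"


if __name__ == "__main__":
    print("total:", format_hms(total_seconds(STEP_TIMES)))
    for step, _, _ in STEP_TIMES:
        print(f"{step}: {share(STEP_TIMES, step):.1%}")
```

Under that assumption the workflow takes about 13 hours for one sample, and BWA-Mem plus HaplotypeCaller together account for roughly 72% of it, so those two steps are the natural first targets for tuning.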
References
• Genome Analysis Toolkit: Variant Discovery in High-Throughput Sequencing Data.
https://software.broadinstitute.org/gatk/
• IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads:
http://www.redbooks.ibm.com/abstracts/redp5479.html
• Performance optimization of Broad Institute GATK Best Practices on IBM reference
architecture for healthcare and life sciences:
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=TSW03540USEN
• IBM Reference Architecture for Genomics: Speed, Scale, Smarts:
http://www.redbooks.ibm.com/abstracts/redp5210.html?Open
Thank You!
Workload profile for each GATK processing step for one sample
Source: IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads:
http://www.redbooks.ibm.com/abstracts/redp5479.html
Awake Craniotomy with endoscopic support, guided by intraoperative ultrasound...
Dr. Damian Lastra Copello
 
Evaluation of cosmetic products-Dhanashree Kolhekar-M.PHARM.pptx
Evaluation of cosmetic products-Dhanashree Kolhekar-M.PHARM.pptxEvaluation of cosmetic products-Dhanashree Kolhekar-M.PHARM.pptx
Evaluation of cosmetic products-Dhanashree Kolhekar-M.PHARM.pptx
DR. BABASAHEB BHIMRAO AMBEDKAR (A CENTRAL GOVT.) UNIVERSITY, LUCKNOW, UP.
 
Primary Care at the Center of RSV Prevention: Community-Focused Strategies to...
Primary Care at the Center of RSV Prevention: Community-Focused Strategies to...Primary Care at the Center of RSV Prevention: Community-Focused Strategies to...
Primary Care at the Center of RSV Prevention: Community-Focused Strategies to...
PVI, PeerView Institute for Medical Education
 
Methods of Cancer diagnosis in Context of Radiotherapy
Methods of Cancer diagnosis in Context  of RadiotherapyMethods of Cancer diagnosis in Context  of Radiotherapy
Methods of Cancer diagnosis in Context of Radiotherapy
Saikat Roy
 
Meeting dissolution requirements M.Pharmacy sem 2nd biopharmaceutics &pharmac...
Meeting dissolution requirements M.Pharmacy sem 2nd biopharmaceutics &pharmac...Meeting dissolution requirements M.Pharmacy sem 2nd biopharmaceutics &pharmac...
Meeting dissolution requirements M.Pharmacy sem 2nd biopharmaceutics &pharmac...
Swami ramanand teerth marathwada university
 
Physiology of Pain and thermal sensations
Physiology of Pain and thermal sensationsPhysiology of Pain and thermal sensations
Physiology of Pain and thermal sensations
MedicoseAcademics
 
Ophthalmological notes for dental students
Ophthalmological notes for dental studentsOphthalmological notes for dental students
Ophthalmological notes for dental students
KafrELShiekh University
 
various methods and techniques used and Pharmacovigilance Methods.pptx
various methods and techniques used and Pharmacovigilance Methods.pptxvarious methods and techniques used and Pharmacovigilance Methods.pptx
various methods and techniques used and Pharmacovigilance Methods.pptx
Dr. Koppala R.V.S. Chaitanya
 
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptxSPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SanskritiUpadhyay5
 
Terminologies of adverse medication related events , Regulatory terminologies.
Terminologies of adverse medication related events , Regulatory terminologies.Terminologies of adverse medication related events , Regulatory terminologies.
Terminologies of adverse medication related events , Regulatory terminologies.
Dr. Koppala R.V.S. Chaitanya
 
Role of Gene Therapy Neurological disorders
Role of Gene Therapy Neurological disordersRole of Gene Therapy Neurological disorders
Role of Gene Therapy Neurological disorders
riggdiana2
 
LDMMIA Reiki Yoga S4 Bonus S2 Clearing Chi
LDMMIA Reiki Yoga S4 Bonus S2 Clearing ChiLDMMIA Reiki Yoga S4 Bonus S2 Clearing Chi
LDMMIA Reiki Yoga S4 Bonus S2 Clearing Chi
LDM Mia eStudios
 
Esophageal Cancer: Artificial Intelligence, Synergetics, Complex System Analy...
Esophageal Cancer: Artificial Intelligence, Synergetics, Complex System Analy...Esophageal Cancer: Artificial Intelligence, Synergetics, Complex System Analy...
Esophageal Cancer: Artificial Intelligence, Synergetics, Complex System Analy...
Oleg Kshivets
 
2025-ADA-SOC-Slide-Deck-all-recommendations-FINAL-12-9-24.pptx
2025-ADA-SOC-Slide-Deck-all-recommendations-FINAL-12-9-24.pptx2025-ADA-SOC-Slide-Deck-all-recommendations-FINAL-12-9-24.pptx
2025-ADA-SOC-Slide-Deck-all-recommendations-FINAL-12-9-24.pptx
Tanja Milenković
 
Common Male Sexual Problems | Best Sexologist in Patna, Bihar | Dr. Sunil Dubey
Common Male Sexual Problems | Best Sexologist in Patna, Bihar | Dr. Sunil DubeyCommon Male Sexual Problems | Best Sexologist in Patna, Bihar | Dr. Sunil Dubey
Common Male Sexual Problems | Best Sexologist in Patna, Bihar | Dr. Sunil Dubey
Sexologist Dr. Sunil Dubey - Dubey Clinic
 
Subconjunctival Hemorrhage Secondary to Pertussis.pdf
Subconjunctival Hemorrhage Secondary to Pertussis.pdfSubconjunctival Hemorrhage Secondary to Pertussis.pdf
Subconjunctival Hemorrhage Secondary to Pertussis.pdf
wahbikhalidali
 
WOUND HEALING IN PERIODONTOLOGY - 3.pptx
WOUND HEALING IN PERIODONTOLOGY - 3.pptxWOUND HEALING IN PERIODONTOLOGY - 3.pptx
WOUND HEALING IN PERIODONTOLOGY - 3.pptx
tarunprakash1904
 
Introduction to adverse drug reactions & Management of ADR.pptx
Introduction to adverse drug reactions & Management of ADR.pptxIntroduction to adverse drug reactions & Management of ADR.pptx
Introduction to adverse drug reactions & Management of ADR.pptx
Dr. Koppala R.V.S. Chaitanya
 
Gastric Cancer: Artificial Intelligence, Synergetics, Complex System Analysis...
Gastric Cancer: Artificial Intelligence, Synergetics, Complex System Analysis...Gastric Cancer: Artificial Intelligence, Synergetics, Complex System Analysis...
Gastric Cancer: Artificial Intelligence, Synergetics, Complex System Analysis...
Oleg Kshivets
 
Awake Craniotomy with endoscopic support, guided by intraoperative ultrasound...
Awake Craniotomy with endoscopic support, guided by intraoperative ultrasound...Awake Craniotomy with endoscopic support, guided by intraoperative ultrasound...
Awake Craniotomy with endoscopic support, guided by intraoperative ultrasound...
Dr. Damian Lastra Copello
 
Methods of Cancer diagnosis in Context of Radiotherapy
Methods of Cancer diagnosis in Context  of RadiotherapyMethods of Cancer diagnosis in Context  of Radiotherapy
Methods of Cancer diagnosis in Context of Radiotherapy
Saikat Roy
 
Physiology of Pain and thermal sensations
Physiology of Pain and thermal sensationsPhysiology of Pain and thermal sensations
Physiology of Pain and thermal sensations
MedicoseAcademics
 
Ophthalmological notes for dental students
Ophthalmological notes for dental studentsOphthalmological notes for dental students
Ophthalmological notes for dental students
KafrELShiekh University
 
various methods and techniques used and Pharmacovigilance Methods.pptx
various methods and techniques used and Pharmacovigilance Methods.pptxvarious methods and techniques used and Pharmacovigilance Methods.pptx
various methods and techniques used and Pharmacovigilance Methods.pptx
Dr. Koppala R.V.S. Chaitanya
 
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptxSPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SPECIFIC ETHICAL ISSUES, FFP, AUTHORSHIP,CONFLICT OF INTEREST.pptx
SanskritiUpadhyay5
 
Terminologies of adverse medication related events , Regulatory terminologies.
Terminologies of adverse medication related events , Regulatory terminologies.Terminologies of adverse medication related events , Regulatory terminologies.
Terminologies of adverse medication related events , Regulatory terminologies.
Dr. Koppala R.V.S. Chaitanya
 

Genomics Deployments - How to Get Right with Software Defined Storage

  • 1. Genomics Deployments: How To Get Right With Software Defined Storage
      2018 Storage Developer Conference India © All Rights Reserved.
      Sandeep Patil, IBM
      Acknowledgement: Ulf Troppens, Piyush Chaudhary, Kumaran Rajaram, Theodore Hoover Jr, Kevin Gildea, Sasikanth Eda, Smita J Raut, Luis Bolinches, Monica Lemay, Carl Zetie, Joanna Wong
  • 2. Agenda
      - Genomic Introduction
      - What is Genomics
      - Genomics: An Emerging Market
      - Understanding Genomic Sequencing Workloads
      - Requirements on Infrastructure
      - Solution Approach
      - Solution Architecture
      - Performance for GATK based on proposed solution
  • 3. Genomics: Introduction
      - Genomics is a branch of biotechnology focusing on genomes.
      - Genomics applies the techniques of genetics and molecular biology to sequence, analyze, or modify the DNA of an organism.
      - It is used in a number of fields, such as diagnostics, personalized healthcare, agricultural innovation, and forensic science.
      Source: Frost & Sullivan, Global Precision Medicine Growth Opportunities, Forecast to 2025
  • 4. Genomics: An Emerging Market
      Why is genomics becoming more relevant?
      - Feasibility: decreased cost of sequencing.
          - The first sequencing of the whole human genome, in 2003, cost roughly $2.7 billion.
          - Today a genome sequence can cost around 1000 to 1500 USD.
          - DNA sequencing players aim to bring it down to 100 USD.
      - Value:
          - Genomics will bring in an era of proactive and personalized medicine (among other fields), with potential for disruption.
      - Investment:
          - The market for genomic products and services is growing at 10% and is predicted to become a $20 billion opportunity by 2020.
          - Growth in the genomics market is mainly attributed to increasing government initiatives and increasing research.
      Sources: https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/ ; MarketsandMarkets Report
  • 5. Questions for Storage Community
      Questions that come to mind:
      - Customer: Is there a reference architecture or approach?
      - Solution Architect: How do I design storage for genomics? What are the workload requirements?
      - Storage Developer: Does what I developed meet the genomics requirements? What does the workload look like?
      - Storage Tester: Did my testing cover the requirements of a genomic workload? What is the workload, and what are the tools?
      Image: https://pxhere.com/en/photo/1403209
  • 6. Questions for Storage Community (Cont.)
      Answer to these questions: we need to understand the genomic sequencing workload from a storage perspective.
  • 7. Genomic Sequencing Workload: High Level
      Key characteristics of a genomic workload:
      - Genomics requires a significant focus on big data management, because sequencing a genome produces a large amount of data.
      - Genomic data analysis requires 3 process steps:
          1. Sequencers convert the physical sample to raw data.
          2. Raw data is put in a sequence corresponding to the genome.
          3. Analytics (example: matching mutations with certain diseases) is then performed.
      This requires an easy-to-use and scalable IT infrastructure for:
          1) Owning, managing, and accessing PBs of file storage
          2) High-throughput batch processing to analyze data
  • 8. Genomic Sequencing Workload to Storage Requirement Mapping (based on GATK3 pipeline reference)
      - Sequencing
          - Need to ingest millions of files (small to medium size)
          - Continuous guaranteed writes from multiple sequencers
          - Ingest protocol used by sequencer systems: SMB/NFS
      - Processing Raw Data
          - CPU intensive
          - Pattern of writes followed by reads
          - Predominantly sequential I/O
          - Access to a few large files (GB file size)
          - Access protocols: POSIX/NFS
      - Genome Alignment
          - Memory intensive
          - Write intensive; write I/O is predominantly sequential
          - Read I/O is random access
          - Few output files, MB to GB in size
          - Access protocols: POSIX/NFS
      - Variant Detection
          - CPU intensive
          - Memory intensive
          - Mix of sequential and random file access
          - Read and write I/O to many files with varying file sizes (KB to GB)
          - Access protocols: POSIX/NFS
      - Collaboration
          - Read intensive
          - Multi-region/multi-site
          - Authenticated and secure
          - Metadata support
          - Access protocols: SMB/NFS/Object
          - Faster access across WAN
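The stage-to-requirement mapping on slide 8 can be captured as a small data structure, which makes it easy to ask which access protocols the storage layer must offer overall. The sketch below is illustrative only: the `Stage` class and the helper names are hypothetical, and it assumes the five stage labels on the slide correspond, in order, to the five groups of characteristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage from slide 8 (names paraphrased, grouping assumed)."""
    name: str
    io_pattern: str
    protocols: frozenset


# Assumed reading of the slide: five stages, each with its own access protocols.
PIPELINE = (
    Stage("Sequencing", "continuous ingest of many small-to-medium files",
          frozenset({"SMB", "NFS"})),
    Stage("Processing raw data", "writes then reads, mostly sequential, few GB-size files",
          frozenset({"POSIX", "NFS"})),
    Stage("Genome alignment", "sequential writes, random reads, few MB-to-GB outputs",
          frozenset({"POSIX", "NFS"})),
    Stage("Variant detection", "mixed sequential/random I/O over many KB-to-GB files",
          frozenset({"POSIX", "NFS"})),
    Stage("Collaboration", "read-intensive, multi-site access",
          frozenset({"SMB", "NFS", "Object"})),
)


def required_protocols(stages):
    """Union of protocols: everything the storage layer must expose."""
    needed = set()
    for stage in stages:
        needed |= stage.protocols
    return needed


def common_protocols(stages):
    """Intersection of protocols: interfaces usable by every stage."""
    return frozenset.intersection(*(stage.protocols for stage in stages))


if __name__ == "__main__":
    print("required:", sorted(required_protocols(PIPELINE)))
    print("common:  ", sorted(common_protocols(PIPELINE)))
```

Under this reading, NFS is the only interface shared by every stage, while the pipeline as a whole touches four interfaces, which is consistent with the deck's call for a multiprotocol file store over a single global namespace.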
  • 9. Need for Optimal Solution
      - Need to think end to end, which includes Compute, Network, and Storage as the key building blocks.
      - The infrastructure (Compute, Network & Storage) should allow the building blocks to scale in and scale out elastically, similar to "Lego" blocks.
      - For the Storage building block: need for high-performance file storage with multiple access interfaces/protocols. This is not typical Network Attached Storage (NAS), because a genomic sequencing workload is not a NAS workload but a Technical Computing workload.
  • 10. Need for Elasticity like "Lego" Blocks: Choosing the Composable Infrastructure Principle
      - Composable solutions are built in a way that disaggregates the underlying building blocks, namely compute, storage, and network services.
      - These disaggregated services provide the required granularity, so the infrastructure can be sliced, diced, expanded, and contracted at will, based on actual need.
      - This eases deployment, with well-defined configuration and tuning templates per building block.
      - Genomic workloads benefit from composable principles because the building blocks can grow and shrink as needs change.
  • 11. IBM Storage & SDI For Genomics: A Composable Building Block Approach
      - Shared Network
          - High-speed NFS, SMB, and Object data access, connected to the shared campus network.
          - User login to submit and manage batch jobs and to access interactive applications.
      - Compute Services
          - Scalable compute cluster to analyze genomics data.
      - Storage Services
          - Scalable storage cluster to store, manage, and access genomic data.
      - Private Network Services
          - High-speed data network, not connected to the data center network.
          - Provisioning network and service network for administrative login and hardware services, optionally connected to the shared campus network.
  • 12. Storage Service: Need for High Performance File Storage Aligned to Composable Infrastructure Principles
      Key requirements of the genomic sequencing workload:
      - High performance and high throughput are key: this is a Technical Computing, HPC-like workload, not a typical NAS workload.
      - Should support scale-in and scale-out to adhere to composable infrastructure principles.
      - Ability to support different types of storage backend (needs to be software defined).
      - Support for a global namespace across the different stages of sequencing.
      - Multiprotocol support (NFS, SMB, HDFS, POSIX, Object) for data ingestion, collaboration, and sequencing computation.
      - Easy archiving and cloud integration.
      Storage solution: take the "Software Defined" approach and choose a clustered filesystem that meets the above requirements (e.g. IBM Spectrum Scale).
  • 13. Choosing a High Performance Clustered Filesystem for Storage (e.g. IBM Spectrum Scale)
  • 14. Genomics: Storage Building Block Interactions. Clustered Scale Out Parallel File System (e.g. IBM Spectrum Scale)
  • 15. Solution Architecture: Putting it all together
  • 16. IBM Storage & SDI: Accelerated Performance for Genomics Sequencing
      GATK workflow execution time on the profiling environment, using the proposed solution architecture, for a single-sample Solexa WGS Broad dataset with the b37 reference:
      - BWA-Mem: 303 min 47 sec
      - sam2bam (storage mode): 35 min 53 sec
      - GATK BaseRecalibrator (java setting -Xmn10g -Xms10g -Xmx10g): 87 min 21 sec
      - GATK PrintReads (java setting -Xmn10g -Xms10g -Xmx10g): 97 min 1 sec
      - GATK HaplotypeCaller (java setting -Xmn10g -Xms10g -Xmx10g): 261 min 37 sec
      - GATK mergeVCF (java setting -Xmn10g -Xms10g -Xmx10g): 0 min 51 sec
      Note: Execution time was measured on the testbed configuration (detailed in the profiling environment below). Actual genomics application performance will depend on testbed configuration, tunings, and other factors.
      Profiling environment:
      - 1x Power8 node (IBM 8247-22L with SMT=8) with 256 GB memory to execute the whole workflow
      - 1x IBM ESS GS4 storage based on SSD (>= 23 GB/s write bandwidth and >= 30 GB/s read bandwidth)
      - Dual-rail FDR InfiniBand aggregating to ~13 GB/s
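To put the per-step timings on slide 16 in perspective, the short sketch below adds them up and reports each step's share of the total. It is a back-of-the-envelope calculation, not part of the original deck: it assumes the six steps run back to back on the single node with no overlap, and the helper names are hypothetical.

```python
# Step timings copied from slide 16 as (name, minutes, seconds).
STEPS = [
    ("BWA-Mem", 303, 47),
    ("sam2bam (storage mode)", 35, 53),
    ("GATK BaseRecalibrator", 87, 21),
    ("GATK PrintReads", 97, 1),
    ("GATK HaplotypeCaller", 261, 37),
    ("GATK mergeVCF", 0, 51),
]


def to_seconds(minutes, seconds):
    """Convert a 'min sec' pair into seconds."""
    return minutes * 60 + seconds


def total_seconds(steps):
    """Total runtime, assuming the steps execute sequentially (an assumption)."""
    return sum(to_seconds(m, s) for _, m, s in steps)


def format_hms(total):
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours} h {minutes} min {seconds} s"


def shares(steps):
    """Fraction of the total runtime spent in each step."""
    total = total_seconds(steps)
    return {name: to_seconds(m, s) / total for name, m, s in steps}


if __name__ == "__main__":
    print("total:", format_hms(total_seconds(STEPS)))
    for name, share in shares(STEPS).items():
        print(f"{name:28s} {share:6.1%}")
```

Under the sequential assumption the workflow takes about 13 hours, and BWA-Mem plus HaplotypeCaller together account for roughly 72% of it, so those two steps are the natural first targets for tuning.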
  • 17. References
      - Genome Analysis Toolkit: Variant Discovery in High-Throughput Sequencing Data. https://software.broadinstitute.org/gatk/
      - IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads. http://www.redbooks.ibm.com/abstracts/redp5479.html
      - Performance optimization of Broad Institute GATK Best Practices on IBM reference architecture for healthcare and life sciences. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=TSW03540USEN
      - IBM Reference Architecture for Genomics: Speed, Scale, Smarts. http://www.redbooks.ibm.com/abstracts/redp5210.html?Open
  • 18. Thank You!
  • 19. Workload profile for each GATK processing step for one sample
      Source: IBM Redpaper: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads. http://www.redbooks.ibm.com/abstracts/redp5479.html