SeqsLab: a high-performance genomics data analysis platform based on Apache Spark
Yun-Lung Li | Genomic Data Scientist | yunlung.li@atgenomix.com
Road to Precision Medicine
Confidential - Atgenomix internal use only. © 2020 Atgenomix, Inc. FOR RESEARCH USE ONLY. NOT FOR USE IN DIAGNOSTIC PROCEDURES.
Agenda
• NGS - From DNA to Text Data
• Dry Lab Challenges
• Atgenomix SeqsLab
• Case studies
From DNA to Insight
FASTQ → BAM → VCF
Whole Genome ~100GB
Variant Call Format (VCF)
• ~5M SNP/indel
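To make the SNP/indel distinction concrete, here is a minimal sketch that classifies the ALT alleles of one VCF data line by comparing REF and ALT lengths. The function name `classify_variant` and the sample records are illustrative assumptions, not part of the deck; a production pipeline would more likely use a dedicated VCF library.

```python
def classify_variant(vcf_line: str) -> list:
    """Classify each ALT allele of one VCF data line as 'SNP', 'indel', or 'other'."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    kinds = []
    for alt in alts:
        if alt.startswith("<") or alt in (".", "*"):
            kinds.append("other")   # symbolic, missing, or spanning-deletion allele
        elif len(ref) == 1 and len(alt) == 1:
            kinds.append("SNP")     # single-base substitution
        elif len(ref) != len(alt):
            kinds.append("indel")   # insertion or deletion
        else:
            kinds.append("other")   # same-length multi-base substitution
    return kinds


# Hypothetical records (tab-separated: CHROM POS ID REF ALT QUAL FILTER INFO).
records = [
    "chr1\t10177\t.\tA\tAC\t100\tPASS\t.",
    "chr1\t10352\t.\tT\tC\t100\tPASS\t.",
    "chr1\t10616\t.\tCCGCCGTTGCAAAGGCGCGCCG\tC\t100\tPASS\t.",
]
for rec in records:
    print(rec.split("\t")[1], classify_variant(rec))
```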
Sequencing Industry Stats
Timeline: 2015, 2020, 2025
• $600 - Whole genomes
• >15,000 - Illumina installed base
• $1,000 - Whole genomes
• ~150 Petabases - Sequence data
• 10M+ - Global samples sequenced (UKBB, NIH's All of Us, Human Cell Atlas)
• 60M+ - Patients sequenced in healthcare context
• 1M+ - Global samples sequenced
• 100GB per sample - Whole genomes
Sources:
https://s24.q4cdn.com/526396163/files/doc_presentations/2020/01/ILMN-at-JPM-13-Jan-2020-final.pdf
https://www.biorxiv.org/content/10.1101/203554v1
Agenda
• NGS - From DNA to Text Data
• Dry Lab Challenges
• Atgenomix SeqsLab
• Case studies
Dry Lab Challenges - Scalability
• Data amount
• Process complexity / time
GATK Best Practice Workflow
Dry Lab Challenges – Evolving Open-Source Community
https://assemblathon.org/post/48865310097/slides-from-an-assemblathon-2-talk
https://github.com/broadinstitute/gatk/
Dry Lab Challenges
• Data scalability
• Computation scalability
• Evolving open-source community
Possible Solutions - Reimplementation
• Edico Dragen (GATK in FPGA)
• Nvidia Clara Parabricks (GATK in GPU)
• Sentieon (GATK in C++)
Atgenomix – Scaling Out Existing Tools With Data Parallelization
• Scaling with data parallelization
• Run bioinformatics tools as is
• On-demand Apache Spark clusters
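The idea of scaling an unchanged tool through data parallelization can be sketched as a scatter/gather loop: split the input into shards, run the same tool on each shard concurrently, then concatenate the results. This is only an illustration with a thread pool standing in for a Spark cluster; `split_into_shards`, `unchanged_tool`, and `run_data_parallel` are hypothetical names, not SeqsLab APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(items, n_shards):
    """Scatter step: split a list into n_shards contiguous, near-equal chunks."""
    size, extra = divmod(len(items), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def unchanged_tool(reads):
    """Stand-in for an existing single-node tool; here it only upper-cases reads."""
    return [r.upper() for r in reads]


def run_data_parallel(items, n_shards, tool):
    """Run the tool on every shard concurrently and merge results in shard order."""
    shards = split_into_shards(items, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = list(pool.map(tool, shards))  # map() preserves shard order
    return [x for part in results for x in part]  # gather step


reads = ["acgt", "ttga", "ccat", "gga", "tac"]
print(run_data_parallel(reads, 3, unchanged_tool))
```

The tool itself is never modified; only the way its input is divided and its output is merged changes.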
Atgenomix SeqsLab - BIO-IT Platform in Cloud
SeqsLab - BIO-IT Platform in Cloud
• Containerized user runtime + SeqsLab runtime
• Data Parallelization - biological consideration
• PipeSeq - bridge HDFS/Spark and bioinformatics tools
Containerized SeqsLab Runtime
Azure Distributed Data Engineering Toolkit (AZTK)
User’s Workflows on On-demand Spark Clusters
SeqsLab - BIO-IT Platform in Cloud
• Containerized user runtime + SeqsLab runtime
• Data Parallelization - biological consideration
• PipeSeq - bridge HDFS/Spark and bioinformatics tools
BAM (Binary Alignment Map)
[Figure: FASTQ reads are mapped to the human genome (read mapping) to produce a BAM file]
Adaptive Data Parallelization – BAM Partitioning
• The human genome is organized into chromosomes, so we can partition DNA data by chromosome without losing information.
• Several partitioning strategies provide more partitions for better parallelization:
- Centromere
- Long ambiguous regions
- Any customized region
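The partitioning strategies above can be sketched as cutting a chromosome at regions where no reads are expected to align, such as centromeres or long runs of ambiguous bases. The function `split_at_gaps`, the toy chromosome, and its gap coordinates are assumptions made for illustration only.

```python
def split_at_gaps(chrom, length, gaps):
    """Return (chrom, start, end) partitions of [0, length) with gap regions removed.

    gaps: list of half-open (start, end) intervals, e.g. centromeres or long N runs.
    """
    partitions, cursor = [], 0
    for gap_start, gap_end in sorted(gaps):
        if gap_start > cursor:
            partitions.append((chrom, cursor, gap_start))
        cursor = max(cursor, gap_end)  # skip over the gap (handles overlapping gaps)
    if cursor < length:
        partitions.append((chrom, cursor, length))
    return partitions


# Hypothetical toy chromosome: 1,000 bp with two gap regions.
parts = split_at_gaps("chrToy", 1000, [(400, 450), (0, 10)])
print(parts)
```

Cutting inside a gap means partition boundaries fall where no alignment should span them, which is why each partition can be processed independently.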
BAM to SparkSQL – Avro, Parquet
• Columnar storage - lower data retrieval overhead
• Binary, encoded, compressed - size reduction
• Statistics metadata - predicate pushdown
• Hive-style partitioning

BAM sort                 Time      Hardware
samtools sort / index    326 min   D13_v2 x1
samtools view            206 min   D13_v2 x1

BAM partition (3101)     Time      Hardware
Atgenomix Transform      5 min     D13_v2 x40
Atgenomix BamSelect      8 min     D13_v2 x40
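Hive-style partitioning encodes column values in directory names (`key=value`), so a query for one genomic region only needs to open the matching directories. The sketch below builds such paths and prunes them for a region query; the column names `chrom` and `bin`, the 1 Mbp bin size, and both function names are assumptions for illustration, not the layout SeqsLab actually uses.

```python
def partition_path(root, chrom, pos, bin_size=1_000_000):
    """Build a Hive-style directory path (key=value segments) for a position."""
    return f"{root}/chrom={chrom}/bin={pos // bin_size}"


def prune(paths, chrom, start, end, bin_size=1_000_000):
    """Keep only partition paths that can contain positions in [start, end)."""
    first_bin, last_bin = start // bin_size, (end - 1) // bin_size
    wanted = {f"chrom={chrom}/bin={b}" for b in range(first_bin, last_bin + 1)}
    # Compare on the last two path segments so the root may contain slashes.
    return [p for p in paths if "/".join(p.split("/")[-2:]) in wanted]


paths = [partition_path("bam", "chr1", p) for p in (5, 1_500_000, 2_000_001)]
paths.append(partition_path("bam", "chr2", 5))
print(prune(paths, "chr1", 1_200_000, 2_100_000))
```

Only two of the four directories are selected for the query, so the others are never read.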
SeqsLab - BIO-IT Platform in Cloud
• Containerized user runtime + SeqsLab runtime
• Data Parallelization - biological consideration
• PipeSeq - bridge HDFS/Spark and bioinformatics tools
PipeSeq – Read Mapping
[Figure: read-mapping data flow. Labels: Azure Data Lake Storage, pipeTransform, Atgenomix Transform, Avro, Parquet, Azure Data Lake Storage]
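One way to picture a bridge between partitioned data and an unmodified command-line tool is to stream each partition through the tool's stdin and collect its stdout, similar in spirit to Spark's `RDD.pipe`. The sketch below is an assumption about the general pattern, not PipeSeq's implementation; `pipe_partition` and `fake_tool` are hypothetical names, and the "tool" is a tiny Python one-liner so the example runs anywhere.

```python
import subprocess
import sys


def pipe_partition(records, command):
    """Send one partition's records to an external command via stdin and
    return the lines it writes to stdout."""
    result = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external tool exits with an error
    )
    return result.stdout.splitlines()


# Stand-in for a bioinformatics tool: a filter that reverses each input line.
# A real deployment would invoke the tool's own executable instead.
fake_tool = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: print(line.rstrip()[::-1])",
]

partitions = [["ACGT", "TTGA"], ["CCAT"]]
outputs = [pipe_partition(p, fake_tool) for p in partitions]
print(outputs)
```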
PipeSeq – Variant Calling
[Figure: variant-calling data flow. Labels: Azure Data Lake Storage, Atgenomix BamSelect, PipeTransform, Avro, Parquet, Azure Data Lake Storage]
SeqsLab - BIO-IT Platform in Cloud
• Containerized user runtime + SeqsLab runtime
• Data Parallelization - biological consideration
• PipeSeq - bridge HDFS/Spark and bioinformatics tools
• Computation resource recommendation
Computing Unit Catalog
Smart, Scalable and Simplified Computing Platform on Azure
Agenda
• NGS - From DNA to Text Data
• Dry Lab Challenges
• Atgenomix SeqsLab
• Case studies
Case Study: GATK Best Practice on SeqsLab

Compute unit per stage, with total runtime and compute cost:

BWA-MEM       AtgxTransform          PosBinSelect             GATK4.1 HaplotypeCaller               Runtime (min)   Compute Cost (US$)
(alignment)   (flagstat | sorting)   (MarkDuplicate | BQSR)   (whatshap phasing | dbSNP annotate)
240d          240d                   240d                     160d                                  120             6.5
160d          160d                   80d                      80d                                   190             5.5

Sample: Illumina WGS 40X PE150bp
Reference: GRCh38 (hs38d1, primary assembly plus decoy contigs)
Parallelization Factor:
• 708 FASTQ shards (>500k read pairs per shard)
• 155 BAM shards (~20M base pairs per shard, partitioned by contiguous unmasked Ns)
Computing Environment: Azure West US 2 region (Azure Batch, Azure Data Lake Gen2)
SeqsLab Compute Unit:
• 80d (5 x D14v2 low-priority VMs)
• 160d (10 x D14v2 low-priority VMs)
• 240d (15 x D14v2 low-priority VMs)
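The FASTQ sharding used as the first parallelization factor can be sketched as cutting the file every N records. This minimal example assumes the common four-lines-per-record FASTQ layout; `shard_fastq` and the toy reads are hypothetical, and the shard size is kept tiny for readability.

```python
from itertools import islice


def shard_fastq(lines, records_per_shard):
    """Yield lists of FASTQ lines, each holding up to records_per_shard records.

    Assumes the 4-line-per-record layout: header, sequence, '+', qualities.
    """
    it = iter(lines)
    while True:
        shard = list(islice(it, 4 * records_per_shard))
        if not shard:
            return
        yield shard


# Hypothetical three-read FASTQ, split into shards of two records.
fastq = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "TTGA", "+", "IIII",
    "@r3", "CCAT", "+", "IIII",
]
shards = list(shard_fastq(fastq, 2))
print([len(s) // 4 for s in shards])
```

For paired-end data, sharding the R1 and R2 files with the same record count keeps mates in matching shards, so each shard can be aligned independently.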
Case Study: DeepVariant
https://www.nature.com/articles/nbt.4235.epdf
Google DeepVariant – CPU/GPU Resource Limitation Issue
https://github.com/google/deepvariant/issues/90
• All CPU resources are occupied
• Multiple threads are still present
https://github.com/google/deepvariant/blob/5e6fe205b984c6be116dcacafdfd83ce1df4d2e9/deepvariant/call_variants.py#L344
Google DeepVariant – Multiple GPU Support Issue
Issue: When more GPUs were added, performance did not improve.
Observations:
1. Low Volatile GPU-Util
2. Each process allocates memory from all GPU cards
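import os
import random
import time

# BIN, mode, model, cvo_tfrecord_path, cv_mim_mem_mb, pipe_call and
# SeqPiperProcessError come from the surrounding project and are not shown on the slide.

def execute(cls, examples_tfrecord, ref_version, is_gpu, partition_id, extra_params, logger):
    # 1. Collect the GPU indices ("0:", "1:", ...) and shuffle them.
    o, _ = pipe_call("nvidia-smi -L | awk '{print $2}'", True)
    gpu_list = o.decode('utf-8').strip(":\n").split(":\n")
    random.shuffle(gpu_list)
    # 2. Read the total GPU memory and turn the per-process need into a percentage.
    o, _ = pipe_call('nvidia-smi --query-gpu=memory.total --format=csv,nounits,noheader', True)
    gpu_mem_mb = max(int(x) for x in o.decode('utf-8').strip().split('\n'))
    gpu_mem_ratio = str(int(100 / (int(gpu_mem_mb / cv_mim_mem_mb) + 1)))
    # 3. Pass that percentage to call_variants.
    cmd = [BIN['deepvariant-call-variants'], '--execution_hardware', mode,
           '--examples', examples_tfrecord, '--checkpoint', model,
           '--outfile', cvo_tfrecord_path, '--percentage_gpu_memory', gpu_mem_ratio]
    eflag = False
    while True:
        # 4. Try each GPU in turn, pinning the process to a single card.
        for gpu_idx in gpu_list:
            env = os.environ.copy()
            env['CUDA_VISIBLE_DEVICES'] = gpu_idx
            try:
                pipe_call(cmd, False, env)
                eflag = True
                break  # success: do not rerun on the remaining GPUs
            except SeqPiperProcessError as e:
                if 'CUDA_ERROR_OUT_OF_MEMORY' in str(e) or 'OOM' in str(e):
                    continue  # this GPU is full, try the next one
                raise
        if eflag:
            break
        # 5. Every GPU was out of memory: wait 5 to 29 seconds, then retry.
        time.sleep(random.randrange(5, 30))
    return cvo_tfrecord_path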
1. Collect the GPU list and shuffle it
2. Get the GPU memory and dynamically convert it to a percentage
3. Specify the GPU memory when launching call_variants
4. Select a specific GPU in a round-robin assignment
5. Sleep for a random 5 to 29 seconds if every GPU failed
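The memory-percentage step can be checked with a small worked example that mirrors the formula in the code on this slide: the number of processes that fit on a card, plus one, is used as the divisor. The card size and per-process minimum below are hypothetical numbers, and `gpu_memory_percentage` is a name introduced only for this sketch.

```python
def gpu_memory_percentage(gpu_mem_mb, min_mem_mb):
    """Percentage of GPU memory to give each call_variants process."""
    slots = int(gpu_mem_mb / min_mem_mb) + 1  # processes that fit, plus one
    return int(100 / slots)


# Hypothetical numbers: a 16,000 MB card and a 3,000 MB per-process minimum.
print(gpu_memory_percentage(16000, 3000))  # 5 fit, 6 slots -> 16
```

Because of the extra slot, each process asks for slightly less than an even split of the card, which leaves headroom when several Spark tasks land on the same GPU.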
Google DeepVariant – Multiple GPU Support Issue
1. Each process allocates memory from a specific GPU card
2. High Volatile GPU-Util
DeepVariant-on-Spark
https://github.com/atgenomix/deepvariant-on-spark
https://www.hindawi.com/journals/cmmm/2020/7231205/
Prof. Po-Jung Huang (黃柏榕), Graduate Institute of Biomedical Sciences, Chang Gung University
Google DeepVariant on SeqsLab

Process                      1 WGS
BWA (708 shards)             23m
BAM Sorting                  7.6m
BAM Select                   8.4m
DeepVariant (3K parts)       23m
Execution Time Per Sample    1h2m

Clusters: DS14_v2 x 20, NC12 x 20
Case Study: Population Study
96 Hours
60 Hours
Illumina WGS 1475 Samples Joint Genotyping
Reference Genome: GRCh38
Data Amount: 20TB GVCF
Parallelization Factor: 3101 VCF shards (~1M base pairs per shard)
Pipeline Configuration: GATK4 GenomicsDBImport | GenotypeGVCFs
Computing Configuration: 30xD14v2 (480 cores)

Affymetrix TWB 69264 Sample Imputation
Reference Genome: GRCh38
Parallelization Factor: 1891 reference panel shards (~50K variants per shard, based on UK Biobank workflow)
Reference Panels: 1000Genomes (5096 haplotypes), TWBiobank (2902 haplotypes)
Pipeline Configuration: IMPUTE2 RefMerge | SHAPEIT4 Phasing | IMPUTE4
Computing Configuration: 2xE64v4 (128 cores) | 11xD16v4 (176 cores) | 25xE32v4 (800 cores)
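The joint-genotyping shard count is consistent with cutting a genome of roughly 3.1 billion base pairs into windows of about 1M base pairs. A minimal sketch of such fixed-size interval generation follows; `fixed_size_shards` and the contig lengths are hypothetical and are not real GRCh38 values.

```python
def fixed_size_shards(contig_lengths, shard_size=1_000_000):
    """Return (contig, start, end) half-open intervals of at most shard_size bp."""
    shards = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, shard_size):
            shards.append((contig, start, min(start + shard_size, length)))
    return shards


# Hypothetical contig lengths for illustration only.
toy = {"ctgA": 2_500_000, "ctgB": 900_000}
shards = fixed_size_shards(toy)
print(len(shards), shards[2], shards[-1])
```

Each interval can then be handed to a separate task, for example as the region argument of a per-interval genotyping step.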
Case Study: Assembly Algorithm Scaling
[Figure labels: Reference genome, FASTQ]
Join Us to improve human health faster and better
Software Engineer
Bioinformatician
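Atgenomix (亞大基因科技) was founded in 2015. It is a software R&D team with strengths in both bioinformatics and big-data analytics, dedicated to developing innovative data-mining technology and analysis software platforms for human genome sequence data, and thereby helping hospital physicians deliver better personalized medicine. Based in Taiwan with a global outlook, Atgenomix aims to become the largest genomic big-data analysis software company in Asia. We welcome like-minded people to join us!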
One platform for all your genomic workloads
Yun-Lung Li | Genomic Data Scientist | yunlung.li@atgenomix.com
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Salesforce Data Cloud- Hyperscale data platform, built for Salesforce.
Dele Amefo
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AIScaling GraphRAG:  Efficient Knowledge Retrieval for Enterprise AI
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AI
danshalev
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Ad

SeqsLab: a high performance genomics data analysis platform based on Apache Spark

  • 1. SeqsLab: a high-performance genomics data analysis platform based on Apache Spark Yun-Lung Li | Genomic Data Scientist | yunlung.li@atgenomix.com
  • 2. Road to Precision Medicine Confidential - Atgenomix internal use only. © 2020 Atgenomix, Inc. FOR RESEARCH USE ONLY. NOT FOR USE IN DIAGNOSTIC PROCEDURES.
  • 3. Agenda • NGS - From DNA to Text Data • Dry Lab Challenges • Atgenomix SeqsLab • Case studies
  • 4. Agenda • NGS - From DNA to Text Data • Dry Lab Challenges • Atgenomix SeqsLab • Case studies
  • 5. From DNA to Insight (diagram labels: FASTQ, BAM, VCF) • Whole Genome ~100GB
  • 6. Variant Call Format (VCF) • ~5M SNP/indel
  • 7. From DNA to Insight (diagram labels: FASTQ, BAM, VCF) • Whole Genome ~100GB
  • 8. Sequencing Industry Stats (timeline: 2015, 2020, 2025) • $600 whole genomes • >15,000 Illumina installed base • $1,000 whole genomes • ~150 petabases of sequence data • 10M+ global samples sequenced (UKBB, NIH’s All of Us, Human Cell Atlas) • 60M+ patients sequenced in healthcare context • 1M+ global samples sequenced • 100GB per sample (whole genomes) • Sources: https://s24.q4cdn.com/526396163/files/doc_presentations/2020/01/ILMN-at-JPM-13-Jan-2020-final.pdf https://www.biorxiv.org/content/10.1101/203554v1
  • 9. Agenda • NGS - From DNA to Text Data • Dry Lab Challenges • Atgenomix SeqsLab • Case studies
  • 10. Dry Lab Challenges - Scalability • Data amount • Process complexity / time (figure: GATK Best Practice Workflow)
  • 11. Dry Lab Challenges – Evolving Open-Source Community https://assemblathon.org/post/48865310097/slides-from-an-assemblathon-2-talk https://github.com/broadinstitute/gatk/
  • 12. Dry Lab Challenges • Data scalability • Computation scalability • Evolving open-source community
  • 13. Possible Solutions - Reimplementation • Edico Dragen (GATK in FPGA) • Nvidia Clara Parabricks (GATK in GPU) • Sentieon (GATK in C++)
  • 14. Possible Solutions - Reimplementation • Edico Dragen (GATK in FPGA) • Nvidia Clara Parabricks (GATK in GPU) • Sentieon (GATK in C++)
  • 15. Atgenomix – Scaling Out Existing Tools With Data Parallelization • Scaling with data parallelization • Run bioinformatics tools as is • On-demand Apache Spark clusters
  • 16. Atgenomix SeqsLab - BIO-IT Platform in Cloud
  • 17. SeqsLab - BIO-IT Platform in Cloud
  • 18. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools
  • 19. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools
  • 20. Containerized SeqsLab Runtime
  • 21. Azure Distributed Data Engineering Toolkit (AZTK)
  • 22. User’s Workflows on On-demand Spark Clusters
  • 23. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools
  • 24. BAM (Binary Alignment Map) • Read Mapping (diagram labels: FASTQ, Human genome, BAM)
  • 25. Adaptive Data Parallelization – BAM Partitioning • The human genome is organized into chromosomes, so DNA data can be partitioned by chromosome without losing information. • Several partitioning strategies yield more partitions for better parallelization: - Centromeres - Long ambiguous regions - Any customized region
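The partitioning idea on slide 25 can be illustrated with a minimal sketch. It assumes each contig comes with a list of gap regions (for example a centromere or a long run of N bases) where no reads align, and cuts the contig at those gaps. The function name `split_at_gaps`, the toy contig, and the gap coordinates are hypothetical and are not taken from SeqsLab.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 0-based, half-open


def split_at_gaps(contig: str, length: int,
                  gaps: List[Tuple[int, int]]) -> List[Interval]:
    """Split one contig into partitions separated by the given gap regions.

    `gaps` holds (start, end) half-open intervals, e.g. a centromere or a
    run of N bases. Sequence outside the gaps is kept in full.
    """
    partitions: List[Interval] = []
    cursor = 0
    for gap_start, gap_end in sorted(gaps):
        if gap_start > cursor:
            partitions.append((contig, cursor, gap_start))
        cursor = max(cursor, gap_end)
    if cursor < length:
        partitions.append((contig, cursor, length))
    return partitions


# Toy contig of 1,000 bases with two hypothetical gap regions.
parts = split_at_gaps("chrToy", 1000, [(400, 450), (700, 720)])
print(parts)
```

Each resulting interval could then be handed to a separate task. Cutting only at gaps is what keeps the split free of reads that straddle a partition boundary.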
  • 26. BAM to SparkSQL – Avro, Parquet
      • Columnar storage - reduced data retrieval overhead
      • Binary, encoded, compressed - size reduction
      • Statistics metadata - predicate pushdown
      • Hive-style partitioning
      Bam sort                Time      Hardware
      samtools sort / index   326mins   D13_v2 x1
      samtools view           206mins   D13_v2 x1
      Bam partition 3101      Time      Hardware
      Atgenomix Transform     5mins     D13_v2 x40
      Atgenomix BamSelect     8mins     D13_v2 x40
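A rough illustration of why statistics metadata and Hive-style partitioning speed up region queries: the sketch below models row groups in plain Python (not Parquet or Spark itself) and skips any group whose partition key or min/max position cannot match the query. The `RowGroup` class, `select_region` function, and sample reads are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, int, str]  # (contig, position, read name)


@dataclass
class RowGroup:
    contig: str    # partition key, as in a Hive-style "contig=chr1" directory
    min_pos: int   # statistics kept as metadata alongside the data
    max_pos: int
    rows: List[Row]


def make_group(contig: str, rows: List[Row]) -> RowGroup:
    positions = [r[1] for r in rows]
    return RowGroup(contig, min(positions), max(positions), rows)


def select_region(groups: List[RowGroup], contig: str, start: int, end: int):
    """Return rows in [start, end) plus the number of row groups actually read."""
    hits: List[Row] = []
    groups_read = 0
    for g in groups:
        # Pushdown: decide from metadata alone whether the group can match.
        if g.contig != contig or g.max_pos < start or g.min_pos >= end:
            continue
        groups_read += 1
        hits.extend(r for r in g.rows if start <= r[1] < end)
    return hits, groups_read


groups = [
    make_group("chr1", [("chr1", 100, "r1"), ("chr1", 900, "r2")]),
    make_group("chr1", [("chr1", 5000, "r3"), ("chr1", 7000, "r4")]),
    make_group("chr2", [("chr2", 150, "r5")]),
]
hits, groups_read = select_region(groups, "chr1", 4000, 6000)
print(hits, groups_read)
```

Only one of the three groups is opened for this query; the other two are rejected from their metadata without touching any rows.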
  • 27. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools
  • 28. PipeSeq – Read Mapping (diagram labels: Azure Data Lake Storage, pipeTransform, Atgenomix Transform, Avro, Parquet)
  • 29. PipeSeq – Variant Calling (diagram labels: Azure Data Lake Storage, Atgenomix BamSelect, PipeTransform, Avro, Parquet)
  • 30. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools
  • 31. SeqsLab - BIO-IT Platform in Cloud • Containerized user runtime + SeqsLab runtime • Data Parallelization - biological consideration • PipeSeq - bridge HDFS/Spark and bioinformatics tools • Computation resource recommendation
  • 32. Computing Unit Catalog
  • 33. Smart, Scalable and Simplified Computing Platform on Azure
  • 34. Agenda • NGS - From DNA to Text Data • Dry Lab Challenges • Atgenomix SeqsLab • Case studies
  • 35. Case Study: GATK Best Practice on SeqsLab
      Pipeline (as listed): BWA-MEM • AtgxTransform • alignment flagstat | sorting • PosBinSelect • GATK4.1 HaplotypeCaller • MarkDuplicate | BQSR | whatshap phasing | dbSNP annotate
      Compute units (as listed)   Runtime (min)   Compute Cost (US$)
      240d, 240d, 240d, 160d      120             6.5
      160d, 160d, 80d, 80d        190             5.5
      Data: Illumina WGS 40X PE150bp, GRCh38 (hs38d1, primary assembly plus decoy contigs)
      Parallelization Factor: 708 FASTQ shards (>500k read pairs per shard); 155 BAM shards (~20m base pairs per shard, partitioned by contiguous unmasked Ns)
      Computing Environment: Azure WEST US 2 region (Azure Batch, Azure Data Lake Gen2)
      SeqsLab Compute Unit: 80d (5 x D14v2 low-priority VMs); 160d (10 x D14v2 low-priority VMs); 240d (15 x D14v2 low-priority VMs)
  • 36. Case Study: DeepVariant https://www.nature.com/articles/nbt.4235.epdf
  • 37. Google DeepVariant – CPU/GPU Resource Limitation Issue • All CPU resources are occupied • Still have multiple threads https://github.com/google/deepvariant/issues/90
  • 38. Google DeepVariant – CPU/GPU Resource Limitation Issue https://github.com/google/deepvariant/blob/5e6fe205b984c6be116dcacafdfd83ce1df4d2e9/deepvariant/call_variants.py#L344
  • 39. Google DeepVariant – Multiple GPU Support Issue • Issue: adding more GPUs did not improve performance. • Observations: 1. Low Volatile GPU-Util 2. Each process allocates memory from all GPU cards
  • 40. Google DeepVariant – Multiple GPU Support Issue

      import os
      import random
      import time
      from random import shuffle

      # pipe_call, SeqPiperProcessError, BIN, cv_mim_mem_mb, mode, model and
      # cvo_tfrecord_path are not defined on the slide; they come from the
      # surrounding module.
      def execute(cls, examples_tfrecord, ref_version, is_gpu, partition_id, extra_params, logger):
          # Step 1: list GPU indices ("0:", "1:", ...) and shuffle them
          o, _ = pipe_call("nvidia-smi -L | awk '{print $2}'", True)
          gpu_list = o.decode('utf-8').strip(":\n").split(":\n")
          shuffle(gpu_list)
          # Step 2: read total GPU memory and turn it into a percentage
          o, _ = pipe_call('nvidia-smi --query-gpu=memory.total --format=csv,nounits,noheader', True)
          gpu_mem_mb = max([int(x) for x in o.decode('utf-8').strip().split('\n')])
          gpu_mem_ratio = str(int(100 / (int(gpu_mem_mb / cv_mim_mem_mb) + 1)))
          # Step 3: pass the memory percentage to call_variants
          cmd = [BIN['deepvariant-call-variants'], '--execution_hardware', mode,
                 '--examples', examples_tfrecord, '--checkpoint', model,
                 '--outfile', cvo_tfrecord_path, '--percentage_gpu_memory', gpu_mem_ratio]
          eflag = False
          while True:
              # Step 4: try one GPU at a time
              for gpu_idx in gpu_list:
                  env = os.environ.copy()
                  env['CUDA_VISIBLE_DEVICES'] = gpu_idx
                  try:
                      pipe_call(cmd, False, env)
                      eflag = True
                      break
                  except SeqPiperProcessError as e:
                      # Out of memory on this GPU: move on to the next one
                      if 'CUDA_ERROR_OUT_OF_MEMORY' in str(e) or 'OOM' in str(e):
                          continue
                      raise
              if eflag:
                  break
              # Step 5: every GPU was busy, wait before retrying
              time.sleep(random.randrange(5, 30))
          return cvo_tfrecord_path

      1. Collect the GPU list and shuffle it
      2. Get GPU memory and dynamically convert it to a percentage
      3. Specify GPU memory when launching call_variants
      4. Select a specific GPU in a round-robin assignment
      5. Sleep a few seconds if failed
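To make step 2 of slide 40 concrete, the sketch below isolates the percentage formula from the snippet and evaluates it for a few GPU sizes. The slide does not give the value of `cv_mim_mem_mb`, so the 4,000 MB minimum used here is purely an assumption, and the helper name `gpu_memory_percentage` is hypothetical.

```python
def gpu_memory_percentage(gpu_mem_mb: int, min_mem_mb: int) -> int:
    """Mirror the slide's formula: 100 / (floor(total / minimum) + 1)."""
    shares = int(gpu_mem_mb / min_mem_mb) + 1
    return int(100 / shares)


ASSUMED_MIN_MEM_MB = 4000  # assumption: the real cv_mim_mem_mb is not shown

for total_mb in (8000, 12000, 16000):
    pct = gpu_memory_percentage(total_mb, ASSUMED_MIN_MEM_MB)
    print(f"{total_mb} MB GPU -> --percentage_gpu_memory {pct}")
```

Under this assumption a larger card yields a smaller percentage per process, so more call_variants processes can share one GPU.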
  • 41. Google DeepVariant – Multiple GPU Support Issue 1. Each process allocates memory from a specific GPU card 2. High Volatile GPU-Util
  • 42. DeepVariant-on-Spark https://github.com/atgenomix/deepvariant-on-spark
  • 43. DeepVariant-on-Spark https://github.com/atgenomix/deepvariant-on-spark https://www.hindawi.com/journals/cmmm/2020/7231205/ Prof. Po-Jung Huang (黃柏榕), Graduate Institute of Biomedical Sciences, Chang Gung University
  • 44. Google DeepVariant on SeqsLab
      Process                         1 WGS    Cluster
      BWA (708 shards)                23m      DS14_v2 x 20
      BAM Sorting                     7.6m
      BAM Select                      8.4m
      DeepVariant (3K parts)          23m      NC12 x 20
      Execution Time Per Sample       1h2m
  • 45. Case Study: Population Study • 96 Hours • 60 Hours
      Illumina WGS 1475 Samples Joint Genotyping
        Reference Genome: GRCh38
        Data Amount: 20TB GVCF
        Parallelization Factor: 3101 VCF shards (~1M base pairs per shard)
        Pipeline Configuration: GATK4 GenomicsDBImport | GenotypeGVCFs
        Computing Configuration: 30xD14v2 (480 cores)
      Affymetrix TWB 69264 Sample Imputation
        Reference Genome: GRCh38
        Parallelization Factor: 1891 Reference panel shards (~50K variants per shard, based on UK Biobank workflow)
        Reference Panels: 1000Genomes (5096 haplotypes), TWBiobank (2902 haplotypes)
        Pipeline Configuration: IMPUTE2 RefMerge | SHAPEIT4 Phasing | IMPUTE4
        Computing Configuration: 2xE64v4 (128 cores) | 11xD16v4 (176 cores) | 25xE32v4 (800 cores)
  • 47. Case Study: Assembly Algorithm Scaling (diagram labels: Reference genome, FASTQ)
  • 48. Case Study: Assembly Algorithm Scaling
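  • 49. Join Us to improve human health faster and better • Software Engineer • Bioinformatician • Atgenomix was founded in 2015 as a software R&D team with dual strengths in bioinformatics and big-data analytics. It focuses on developing innovative data-mining technology and analysis software platforms for human genome sequence data, helping hospital physicians deliver better personalized medicine. Based in Taiwan with a global outlook, Atgenomix aims to become the largest genomic big-data analytics software company in Asia. Like-minded people are welcome to join!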
  • 50. One platform for all your genomic workloads Yun-Lung Li | Genomic Data Scientist | yunlung.li@atgenomix.com