Share and analyse genomic data
at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct • 29th • 2015
Outline
• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue
• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
Analyse Genomic At Scale
Spark, Adam, Spark Notebook
• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue
What is genomics data?
DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response,
disease mechanisms
On the production side
Fast biotech progress…
… can IT keep up?
On the production side
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
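A back-of-the-envelope sketch of what these numbers imply for a single genome. The figure of two bytes per base (one for the base, one for its quality score, uncompressed) is an illustrative assumption, not a number from the talk.

```python
# Rough size of the raw sequence data for one genome.
GENOME_BASES = 3_000_000_000  # 3 billion bases, as stated above


def raw_bases(coverage: int) -> int:
    """Total bases read when every position is sequenced `coverage` times."""
    return GENOME_BASES * coverage


def raw_gigabytes(coverage: int, bytes_per_base: int = 2) -> float:
    """Uncompressed size in GB; 2 bytes per base is an assumption (base + quality)."""
    return raw_bases(coverage) * bytes_per_base / 1e9


for coverage in (30, 60):
    print(f"{coverage}x: {raw_bases(coverage):.1e} bases, ~{raw_gigabytes(coverage):.0f} GB")
```

Even under these simple assumptions a single genome reaches hundreds of gigabytes before any analysis starts.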
Lots of data?
Tens of millions
Lots of data!
1,000s
1,000,000s
...
ADAM: Spark genomics library
http://www.bdgenomics.org
Avro schema
Parquet storage
Genomics API
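To make the schema-plus-columnar-storage idea concrete, here is a minimal sketch in plain Python. The `Read` record and its fields are hypothetical and much simpler than ADAM's real Avro schemas; the pivot into per-field lists only imitates the layout idea behind Parquet, it does not use Parquet.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified aligned read; not ADAM's actual schema."""
    contig: str
    start: int
    sequence: str
    mapq: int


def to_columns(reads):
    """Pivot row records into one list per field (the columnar idea)."""
    return {f.name: [getattr(r, f.name) for r in reads] for f in fields(Read)}


reads = [
    Read("chr1", 100, "ACGT", 60),
    Read("chr1", 104, "TTGA", 30),
    Read("chr2", 7, "GGCA", 60),
]
columns = to_columns(reads)

# A query touching only `mapq` scans a single column, not every full record.
high_quality = sum(1 for q in columns["mapq"] if q >= 60)
print(high_quality)
```

The point of the layout is that analyses which need only a few fields never have to read the sequences themselves.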
So what do we do with this?
Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised
learning)
… and share and replay!
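As an illustration of the kind of descriptive statistic meant here, the sketch below computes the alternate-allele frequency per population at one site. The samples, population labels and counts are made-up toy data, and the code is plain Python rather than Spark; on a cluster the same logic would be a grouped aggregation.

```python
from collections import defaultdict

# Hypothetical toy data: (sample, population, alternate-allele count at one site).
# A diploid genotype carries 0, 1 or 2 copies of the alternate allele.
genotypes = [
    ("s1", "POP_A", 0),
    ("s2", "POP_A", 1),
    ("s3", "POP_A", 2),
    ("s4", "POP_B", 0),
    ("s5", "POP_B", 0),
    ("s6", "POP_B", 1),
]


def allele_frequencies(records):
    """Alternate-allele frequency per population: alt copies / (2 * samples)."""
    alt = defaultdict(int)
    samples = defaultdict(int)
    for _sample, population, alt_count in records:
        alt[population] += alt_count
        samples[population] += 1
    return {pop: alt[pop] / (2 * samples[pop]) for pop in samples}


freqs = allele_frequencies(genotypes)
print(freqs)
```

Differences in such frequencies between populations are the raw signal that stratification and supervised models build on.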
The Spark Notebook
… comes to the rescue.
Spark: easy APIs
Self described and consistent
Easily shared (code)
http://www.spark-notebook.io
So what do we do with this?
… and share and replay!
Code can be shared easily but we want better...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3
• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency
Projects
Intrinsically involve many teams
geographically distributed across different
countries or laboratories
with different skills in
Biology, Genetics, IT, Medicine (legal, ...)
Projects
Require many types of data ranging from
bio samples
imagery
textual
archives/historical
Projects
Of course
Generally gather many people from several populations
Note: This is very expensive and burns time as hell!
Projects
1,000 genomes (2008-2012): 200 TB
100,000 genomes (2013-2017): 20 PB (probably more)
1,000,000 genomes (2016-2020): 0.2 EB (probably more)
eQTL: mixing many sources
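A quick arithmetic check of the totals above, taking them in decimal units (terabytes, petabytes, exabytes). The decimal interpretation is an assumption; the figures themselves come from the slide.

```python
# Decimal (SI) units: 1 TB = 1e12 bytes, 1 PB = 1e15 bytes, 1 EB = 1e18 bytes.
TB, PB, EB = 10**12, 10**15, 10**18

projects = {
    "1,000 genomes": (1_000, 200 * TB),
    "100,000 genomes": (100_000, 20 * PB),
    "1,000,000 genomes": (1_000_000, 2 * EB // 10),  # 0.2 EB
}

per_genome_gb = {
    name: total_bytes / genomes / 10**9
    for name, (genomes, total_bytes) in projects.items()
}
for name, gb in per_genome_gb.items():
    print(f"{name}: ~{gb:.0f} GB per genome")
```

All three projects work out to the same volume per genome, so the growth in total data comes from the number of genomes, not from a change in per-genome size.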
Projects
Need proper data management between entities, yet
coping with:
amount of data
heterogeneity of people
distance between actors
constraints related to data
location
Projects
Distributed friendly
SCHEMAS + BINARY
e.g. Avro
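A minimal sketch of why a schema plus a binary encoding is friendly to distributed teams: the payload stays compact because field names live in the shared schema, not in every record. The schema below is hypothetical, and the encoding uses Python's `struct` module purely for illustration; it is not Avro's wire format.

```python
import struct

# Hypothetical schema for a variant record: field name -> struct format code.
# It illustrates the schema-plus-binary idea only; it is NOT Avro's wire format.
SCHEMA = [("position", "I"), ("ref", "c"), ("alt", "c"), ("quality", "f")]
FORMAT = "<" + "".join(code for _, code in SCHEMA)  # little-endian, no padding


def encode(record: dict) -> bytes:
    """Pack the fields in schema order; field names are not stored in the payload."""
    return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))


def decode(payload: bytes) -> dict:
    """The reader needs the same schema to interpret the bytes."""
    values = struct.unpack(FORMAT, payload)
    return {name: value for (name, _), value in zip(SCHEMA, values)}


record = {"position": 123456, "ref": b"A", "alt": b"G", "quality": 37.5}
payload = encode(record)
print(len(payload), decode(payload) == record)
```

Any team holding the schema can read the bytes, whatever language or tool produced them, which is the property that matters when data crosses laboratories.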
Research
Research in medicine or health in general is
LOOOOOOO…OOOOONG
Research
Most reasons are quite obvious, but they must not be overlooked
Lots of measures and validation
Lots of control (including by Gov.)
Lots of actors
Research
As a matter of fact, research needs
to be conducted on data and
to produce results
But both are highly exposed to reuse, so what if we lose
either of them?
Research
However, we can get into trouble instantly without even
losing them.
What if we don’t track the processes to go from one to the
other?
In any scientific process: confrontation, replay and
enhancement are key to move forward
It is misleading to think that sharing the code is enough.
Remember: we look for data and results, not for code.
The process includes the code, the context, the sources
and so on, and all should be part of the data
discovery/validation task
Research
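One way to picture "the process as part of the data" is a small provenance record stored next to each result. The layout, field names, file names and version string below are all hypothetical; this is a sketch of the idea, not how Shar3 or the Spark Notebook implement it.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """Short SHA-256 digest used to identify code or data content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def provenance(code: str, sources: list, context: dict, result: str) -> dict:
    """Bundle what produced a result with the result itself (hypothetical layout)."""
    return {
        "code": fingerprint(code),
        "sources": sorted(sources),
        "context": context,
        "result": fingerprint(result),
    }


record = provenance(
    code="freq = alt_count / (2 * n_samples)",
    sources=["cohort_2015.adam", "cohort_1998.adam"],  # made-up file names
    context={"spark": "1.5", "run_by": "analyst-1"},   # made-up values
    result="0.25",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the same code, sources and context always yield the same record, a later replay on old or new data can be compared against it instead of being trusted blindly.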
Assess the risk factor associated with a disease given
mutations of a certain gene.
More than 50 years of data collecting and modelling.
Hundreds of researchers, each generation with new ideas.
Replaying old processes on new data,
new processes on old data
Research
Share share share
All these facts relate to our capacity to share our work and
to collaborate.
We need to share efficiently and accurately
• data
• process
• results
Share share share
The challenge resides in the workflow
Share share share
Workflow steps, each followed by the roles involved:
“Create” Cluster [ops]
Find sources (context, quality, semantics, …) [data]
Connect to sources (structure, schema/types, …) [ops, data]
Create distributed data pipeline/Model [sci]
Tune accuracy [sci, ops]
Tune performance [sci]
Write results to Sinks [ops, data]
Access Layer [web, ops, data]
User Access [web, ops, data, sci]
Share share share
Streamlining development lifecycle
for better Productivity
with Shar3
Share share share
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata
That’s all folks
Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Also check out @TypeSafe: http://t.co/o1Bt6dQtgH

Share and analyse genomic data at scale by Andy Petrella and Xavier Tordoir