Extending Machine
Learning Algorithms
with PySpark
Karen Feng, Kiavash Kianfar
Databricks
Agenda
● Discuss using PySpark
(especially Pandas UDFs) to
perform machine learning
at unprecedented scale
● Learn about an application
for a genomics use case
(GloWGR)
Design decisions
1. Problem: Genomic data are growing too quickly for
existing tools
Solution: Use big data tools (Spark)
2. Problem: Bioinformaticians are not familiar with
the native languages used by big data tools (Scala)
Solution: Provide clients for high-level languages
(Python)
3. Problem: Performant, maintainable machine
learning algorithms are difficult to write natively in
big data tools (Spark SQL expressions)
Solution: Write algorithms in high-level languages
and link them to big data tools (PySpark)
Genomic data are growing too fast for existing tools
Problem 1
Genomic data are growing at an exponential pace
● Biobank datasets are growing in scale
• Next-generation sequencing
• Genotyping arrays (1Mb)
• Whole exome sequence (39Mb)
• Whole genome sequence (3200Mb)
• 1,000s of samples → 100,000s
of samples
• 10s of traits → 1000s of traits
Use general-purpose big data tools - specifically, Spark
Solution 1
Differentiation from single-node libraries
▪ Flexible: Glow is built natively on Spark, a
general-purpose big data engine
▪ Enables aggregation and mining of genetic
variants on an industrial scale
▪ Low-overhead: Spark minimizes
serialization cost with libraries like Kryo and
Arrow
Single-node libraries
▪ Inflexible: Each tool requires custom
parallelization logic, per language and
algorithm
▪ High-overhead: Moving text between
arbitrary processes hurts performance
Bioinformaticians are not familiar with the native languages
used by big data tools, such as Scala
Problem 2
Spark is predominantly written in Scala
Data engineers and
scientists are
Python-oriented
● More than 60% of
notebook commands in
Databricks are written in
Python
● Fewer than 20% of
commands are written in
Scala
Bioinformaticians are even more Python-oriented
Provide clients for high-level languages, such as Python
Solution 2
Python improves the user experience
• Py4J: achieve
near-feature parity with
Scala APIs
• PySpark Project Zen
• PySpark type hints
Performant, maintainable machine learning algorithms are
difficult to write natively in big data tools
Problem 3
Spark SQL expressions
• Built to process data row-by-row
• Difficult to maintain state
• Minimal support for machine learning
• Overhead from converting rows to ML-compatible shapes (e.g., matrices)
• Few linear algebra libraries exist in Scala
• Limited functionality
Write algorithms in high-level languages and link them to big
data tools
Solution 3
Python improves the developer experience
• Pandas: user-defined
functions (UDFs)
• Apache Arrow: transfer
data between JVM and
Python processes
Feature in Spark 3.0: mapInPandas
● Local algorithm development in Pandas: f(X) → Y
● Plug-and-play with Spark with minimal overhead: Iter(X) → f(X) → Y → Iter(Y)
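The Iter(X) → Iter(Y) pattern can be sketched as follows. This is a minimal illustration, not code from the talk: the function `add_square` and the column names `x` and `y` are hypothetical. The function takes an iterator of Pandas DataFrames and yields Pandas DataFrames, which is the shape `mapInPandas` expects, so it can be developed and tested locally without a cluster.

```python
from typing import Iterator

import pandas as pd


def add_square(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """f(X) -> Y for each batch: append a column y = x squared."""
    for batch in batches:
        yield batch.assign(y=batch["x"] ** 2)


# Local development: feed the function an ordinary iterator of Pandas DataFrames.
local_batches = [
    pd.DataFrame({"x": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"x": [10.0, 20.0]}),
]
local_result = pd.concat(add_square(iter(local_batches)), ignore_index=True)

# On a cluster, the same function could be handed to Spark unchanged, e.g.
#   spark_df.mapInPandas(add_square, schema="x double, y double")
# where spark_df is a hypothetical Spark DataFrame with a double column "x".
```

Because the function is applied batch by batch, it should not depend on how rows are split across batches.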
Deep Dive: Genomics Use Case
Single nucleotide polymorphisms (SNP)
Genome Wide Association Studies (GWAS)
Detect associations between
genetic variations and traits of
interest across a population
• Common genetic
variations confer a small
amount of risk
• Rare genetic variations
confer a large amount of
risk
Whole Genome Regression (WGR)
Account for polygenic
effects, population
structure, and
relatedness
• Reduce false positives
• Reduce false
negatives
Mission: Industrialize genomics by integrating bioinformatics
into data science
Core principles:
• Build on Apache Spark
• Flexibly and natively support genomics tools and file
formats
• Provide single-line functions for common genomics
workloads
• Build an open-source community
Glow v1.0.0
● Datasources: Read/write common
genomic file formats (e.g., VCF, BGEN,
Plink, GFF3) into/from Spark
DataFrames
● SQL expressions: Simple variant
handling operations can be called
from Python, SQL, Scala, or R
● Transformers: Complex genomic
transformations can be called from
Python or Scala
● GloWGR: Novel WGR/GWAS algorithm
built with PySpark
https://projectglow.io/
GloWGR: WGR and GWAS
● Detect which genotypes are associated with each
phenotype using a Generalized Linear Model
● Glow parallelizes the REGENIE method via Spark as
GloWGR
● Built from the ground up using Pandas UDFs
GWAS Regression Tests
Millions of single-variate linear or logistic regressions
GloWGR: Learning at huge dimensions
WGR Reduction: ~5000 multi-variate linear ridge
regressions (one for each block and parameter)
500K x 100
500K x 50 500K x 1M
WGR Regression: ~ 5000 multi-variate linear or
logistic ridge regressions with cross validation
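A rough size estimate shows why the problem has to be split into blocks. The sketch below assumes that 500K x 1M means samples by variants and that the matrix is stored densely. The helper `dense_size_tb` is hypothetical.

```python
def dense_size_tb(rows: int, cols: int, bytes_per_entry: int) -> float:
    """Size of a dense rows x cols matrix in terabytes (1 TB = 1e12 bytes)."""
    return rows * cols * bytes_per_entry / 1e12


samples, variants = 500_000, 1_000_000
for label, width in [("int8", 1), ("float64", 8)]:
    print(f"{label}: {dense_size_tb(samples, variants, width):.1f} TB")
```

At either width the full matrix is far larger than the memory of a typical single node.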
Data preparation
Transformation and SQL functions
on Genomic Variant DataFrame
● split_multiallelics
● genotype_states
● mean_substitute
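The idea behind mean substitution can be sketched in NumPy. This is an illustration of the concept, not Glow's implementation. It assumes missing calls are encoded as -1, and the all-missing behavior (return zeros) is a hypothetical choice.

```python
import numpy as np


def mean_substitute_np(values, missing_value: float = -1.0) -> np.ndarray:
    """Replace missing entries with the mean of the non-missing entries."""
    values = np.asarray(values, dtype=float)
    missing = values == missing_value
    if missing.all():
        # Hypothetical choice for this sketch: no information, so return zeros.
        return np.zeros_like(values)
    return np.where(missing, values[~missing].mean(), values)


genotype_states = [0, 1, -1, 2]  # alternate-allele counts, one missing call
filled = mean_substitute_np(genotype_states)
```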
Stage 1: Genotype matrix blocking
Stage 2: Dimensionality reduction
RidgeReduction.fit
● Pandas UDF: Construct X and Y
matrices for each block and calculate
$X^T X$ and $X^T Y$
● Pandas UDF: Reduce with
element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices
$X^T X$ and $X^T Y$ for a particular sample
block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$
RidgeReduction.transform
● Pandas UDF: Calculates XB for each block
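The fit and transform steps above can be sketched on a single machine with NumPy. This is a simplified illustration under stated assumptions, not the GloWGR code. The data are random, the block count and alpha are arbitrary, and the sketch sums over all sample blocks rather than assembling a separate system per sample block.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_traits = 60, 5, 2
X = rng.normal(size=(n_samples, n_features))
Y = rng.normal(size=(n_samples, n_traits))
alpha = 1.0

# Step 1: each sample block contributes its own X^T X and X^T Y.
sample_blocks = np.array_split(np.arange(n_samples), 3)
partials = [(X[idx].T @ X[idx], X[idx].T @ Y[idx]) for idx in sample_blocks]

# Step 2: reduce with an element-wise sum over sample blocks.
XtX = sum(p[0] for p in partials)
XtY = sum(p[1] for p in partials)

# Step 3: solve the ridge system B = (X^T X + alpha I)^{-1} X^T Y.
B = np.linalg.solve(XtX + alpha * np.eye(n_features), XtY)

# Transform: the reduced representation is X B.
XB = X @ B
```

The partial products are small (features x features and features x traits), so only they need to move between workers, not the sample-level rows.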
Stage 3: Estimate
phenotypic predictors
RidgeRegression.fit
● Pandas UDF: Construct X and Y
matrices for each block and calculate
$X^T X$ and $X^T Y$
● Pandas UDF: Reduce with
element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices
$X^T X$ and $X^T Y$ for a particular sample block
and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$
● Perform cross validation. Pick model
with best ⍺
RidgeRegression.transform_loco
● Pandas UDF: Calculates XB for each
block in a leave-one-chromosome-out (LOCO) fashion
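Choosing ⍺ by cross validation and predicting in a LOCO fashion might look like the sketch below. Everything here is an assumption for illustration: the data are random, the ⍺ grid is arbitrary, the folds are the sample blocks, and the chromosome labels are hypothetical. It is not the GloWGR implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 90, 6
X = rng.normal(size=(n_samples, n_features))
true_beta = rng.normal(size=(n_features, 1))
Y = X @ true_beta + rng.normal(scale=0.5, size=(n_samples, 1))

alphas = [0.01, 1.0, 100.0]
blocks = np.array_split(np.arange(n_samples), 3)
XtX_blocks = [X[b].T @ X[b] for b in blocks]
XtY_blocks = [X[b].T @ Y[b] for b in blocks]
XtX_total, XtY_total = sum(XtX_blocks), sum(XtY_blocks)


def fit_ridge(XtX, XtY, alpha):
    """Solve (X^T X + alpha I) B = X^T Y for B."""
    return np.linalg.solve(XtX + alpha * np.eye(XtX.shape[0]), XtY)


cv_error = {}
for alpha in alphas:
    errors = []
    for b, XtX_b, XtY_b in zip(blocks, XtX_blocks, XtY_blocks):
        # Train on every block except the held-out one, then score on it.
        B = fit_ridge(XtX_total - XtX_b, XtY_total - XtY_b, alpha)
        errors.append(np.mean((Y[b] - X[b] @ B) ** 2))
    cv_error[alpha] = float(np.mean(errors))

best_alpha = min(cv_error, key=cv_error.get)
B_best = fit_ridge(XtX_total, XtY_total, best_alpha)

# LOCO: features 0-2 sit on hypothetical chromosome "1", features 3-5 on "2".
chromosome = np.array(["1", "1", "1", "2", "2", "2"])
loco_prediction = {
    str(chrom): X[:, chromosome != chrom] @ B_best[chromosome != chrom]
    for chrom in np.unique(chromosome)
}
```

Subtracting a held-out block's partial products from the totals avoids recomputing $X^T X$ for every fold.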
GWAS
$Y \sim G\beta_g + C\beta_c + \epsilon$
$Y - \hat{Y} \sim G\beta_g + C\beta_c + \epsilon$
Use the phenotype estimate Ŷ
output by WGR to account for
polygenic effects during
regression
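One way to run the adjusted regression for many variants at once is to project the covariates out of both the adjusted phenotype and the genotypes, then compute each variant's coefficient with a dot product. The sketch below uses simulated data and a linear model only. The variable names and effect sizes are hypothetical, and this is not necessarily how GloWGR organizes the computation.

```python
import numpy as np

rng = np.random.default_rng(2)
S, V, C = 200, 4, 3
G = rng.integers(0, 3, size=(S, V)).astype(float)  # 0, 1 or 2 alternate alleles
Cov = np.column_stack([np.ones(S), rng.normal(size=(S, C - 1))])  # intercept + covariates
y_hat = rng.normal(scale=0.1, size=S)  # stand-in for the WGR phenotype estimate
y = 0.5 * G[:, 0] + Cov @ np.array([1.0, -0.3, 0.2]) + y_hat + rng.normal(size=S)

# Orthonormal basis for the covariate space, used to project covariates out.
Q, _ = np.linalg.qr(Cov)


def residualize(M):
    """Remove the component of M that lies in the covariate space."""
    return M - Q @ (Q.T @ M)


y_res = residualize(y - y_hat)  # adjusted phenotype with covariates removed
G_res = residualize(G)          # every variant with covariates removed

# One single-variate regression per variant, all computed at once.
beta_g = (G_res.T @ y_res) / np.sum(G_res ** 2, axis=0)
```

Each entry of `beta_g` matches the genotype coefficient from regressing the adjusted phenotype on that variant plus the covariates, so the covariate fit is shared across all variants.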
GWAS with Spark SQL expressions
Data
S samples
C covariates
V variants
T traits
Fitted model
S samples
C covariates
1 variant
1 trait
Results
V variants
T traits
Null model
S samples
C covariates
1 trait
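[Diagram: the null model (Cβc) is fit T times, once per trait; the fitted model (Gβg) is fit V x T times, once per variant-trait pair]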
GWAS with Spark SQL expressions
Pros
• Portable to all Spark clients
Cons
• Requires writing your own Spark SQL
expressions
• User-unfriendly linear algebra libraries in Scala
(e.g., Breeze)
• Limited to 2 dimensions
• Unnatural expressions of mathematical operations
• Customized, expensive data transfers
• Spark DataFrames ↔ MLLib matrices ↔ Breeze
matrices
• Input and output must be Spark DataFrames
GWAS with PySpark
Phenotype
matrix
S samples
T traits
Covariate
matrix
S samples
C covariates
Null model
S samples
C covariates
1 trait
Genotype
matrix
S samples
V variants
Fitted model
S samples
C covariates
O(V) variants
O(T) traits
Results
V variants
T traits
[Diagram: the null model (Cβc) is fit T times; the fitted model (Gβg) is fit once per partition]
GWAS with PySpark
Pros
• User-friendly Python libraries (e.g., Pandas)
• Easy to express mathematical notation
• Unlimited dimensions
• Batched, optimized transfers between Pandas
and Spark DataFrames
• Input and output can be Pandas or Spark
DataFrames
Cons
• Accessible only from Python
GWAS
          | I/O formats                | Linalg libraries           | Accessible clients
Spark SQL | Spark DataFrames           | Spark ML/MLLib, Breeze     | Scala, Python, R
PySpark   | Spark or Pandas DataFrames | Pandas, Numpy, Einsum, ... | Python
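As an example of expressing mathematical notation directly and working beyond two dimensions, the sketch below uses `numpy.einsum` on samples x variants and samples x traits arrays. The weight matrix `W` and the quantity computed are hypothetical and chosen only to show the notation.

```python
import numpy as np

rng = np.random.default_rng(3)
S, V, T = 50, 4, 3
G = rng.normal(size=(S, V))   # samples x variants
W = rng.uniform(size=(S, T))  # hypothetical per-sample, per-trait weights

# For every (variant, trait) pair: sum over samples of g_sv * w_st * g_sv.
weighted_gtg = np.einsum("sv,st,sv->vt", G, W, G)

# A 3-dimensional intermediate (samples x variants x traits) is just as easy.
weighted_genotypes = np.einsum("sv,st->svt", G, W)

# The same (variant, trait) quantity with explicit loops, for comparison.
looped = np.empty((V, T))
for v in range(V):
    for t in range(T):
        looped[v, t] = np.sum(G[:, v] * W[:, t] * G[:, v])
```

The subscript string mirrors the index notation of the formula, and the same call generalizes to further axes without reshaping data into matrices.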
Differentiation from other parallelized libraries
▪ Lightweight: Glow is a thin layer built to be
compatible with the latest major Spark
releases, as well as other open-source
libraries (e.g., Delta)
▪ Flexible: Glow includes a set of core
algorithms, and is easily extended to ad-hoc
use cases using existing tools
Other parallelized libraries
▪ Heavyweight: Many libraries build on
custom logic that makes it difficult to update
to new technologies
▪ Inflexible: Many libraries expose custom
interfaces that make it difficult to extend
beyond the built-in algorithms
Future work: gene burden tests
Big takeaways
1. Listen to your
users
2. Use the latest
off-the-shelf
tools
3. If all else fails,
pivot early
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Extending Machine Learning Algorithms with PySpark

  • 1. Extending Machine Learning Algorithms with PySpark Karen Feng, Kiavash Kianfar Databricks
  • 2. Agenda ● Discuss using PySpark (especially Pandas UDFs) to perform machine learning at unprecedented scale ● Learn about an application for a genomics use case (GloWGR)
  • 3. Design decisions 1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark)
  • 4. Design decisions 1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark) 2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala) Solution: Provide clients for high-level languages (Python)
  • 5. Design decisions 1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark) 2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala) Solution: Provide clients for high-level languages (Python) 3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions) Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)
  • 6. Genomic data are growing too fast for existing tools Problem 1
  • 7. Genomic data are growing at an exponential pace
  • 8. Genomic data are growing at an exponential pace: biobank datasets are growing in scale • Next-generation sequencing • Genotyping arrays (1Mb) • Whole exome sequence (39Mb) • Whole genome sequence (3200Mb) • 1,000s of samples → 100,000s of samples • 10s of traits → 1,000s of traits
  • 9. Use general-purpose big data tools - specifically, Spark Solution 1
  • 10. Differentiation from single-node libraries. Glow: ▪ Flexible: Glow is built natively on Spark, a general-purpose big data engine ▪ Enables aggregation and mining of genetic variants on an industrial scale ▪ Low-overhead: Spark minimizes serialization cost with libraries like Kryo and Arrow. Single-node libraries: ▪ Inflexible: Each tool requires custom parallelization logic, per language and algorithm ▪ High-overhead: Moving text between arbitrary processes hurts performance
  • 11. Bioinformaticians are not familiar with the native languages used by big data tools, such as Scala Problem 2
  • 12. Spark is predominantly written in Scala
  • 13. Data engineers and scientists are Python-oriented ● More than 60% of notebook commands in Databricks are written in Python ● Fewer than 20% of commands are written in Scala
  • 14. Bioinformaticians are even more Python-oriented
  • 15. Provide clients for high-level languages, such as Python Solution 2
  • 16. Python improves the user experience • Py4J: achieve near feature parity with Scala APIs • PySpark Project Zen • PySpark type hints
  • 17. Performant, maintainable machine learning algorithms are difficult to write natively in big data tools Problem 3
  • 18. Spark SQL expressions • Built to process data row by row • Difficult to maintain state • Minimal support for machine learning • Overhead from converting rows to ML-compatible shapes (e.g. matrices) • Few linear algebra libraries exist in Scala • Limited functionality
  • 19. Write algorithms in high-level languages and link them to big data tools Solution 3
  • 20. Python improves the developer experience • Pandas: user-defined functions (UDFs) • Apache Arrow: transfer data between JVM and Python processes
  • 21. Feature in Spark 3.0: mapInPandas • Local algorithm development in Pandas • Plug-and-play with Spark with minimal overhead • [Figure: locally, f(X) → Y maps a Pandas DataFrame X to Y; in Spark, the same f is applied to an iterator of batches, Iter(X) → Iter(Y)]
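A minimal sketch of the develop-locally, plug-into-Spark pattern on this slide. The column names (`variant_id`, `call_1`, `call_2`, `mean_call`) and both function names are hypothetical and chosen only for illustration; `DataFrame.mapInPandas(func, schema)` is the Spark 3.0 API the slide refers to. The function is exercised on plain Pandas batches first, and the Spark hook is only defined, not run, so no cluster is needed to try it.

```python
from typing import Iterator

import pandas as pd


def add_mean_call(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """f(X) -> Y, applied to each Pandas batch (Iter(X) -> Iter(Y))."""
    for pdf in batches:
        out = pdf.copy()
        # Hypothetical per-row feature: the mean of two genotype calls.
        out["mean_call"] = (out["call_1"] + out["call_2"]) / 2.0
        yield out


def run_on_spark(spark_df):
    """Apply the same function on a Spark DataFrame (assumes pyspark >= 3.0)."""
    schema = "variant_id string, call_1 double, call_2 double, mean_call double"
    return spark_df.mapInPandas(add_mean_call, schema=schema)


# Local development: feed the function an iterator of Pandas DataFrames.
local_batches = [
    pd.DataFrame({"variant_id": ["v1", "v2"], "call_1": [0.0, 1.0], "call_2": [1.0, 1.0]}),
    pd.DataFrame({"variant_id": ["v3"], "call_1": [0.0], "call_2": [0.0]}),
]
local_result = pd.concat(add_mean_call(iter(local_batches)), ignore_index=True)
```

Because the function only sees an iterator of Pandas DataFrames, the same code path is used for local testing and for distributed execution.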
  • 24. Genome-Wide Association Studies (GWAS) Detect associations between genetic variations and traits of interest across a population • Common genetic variations confer a small amount of risk • Rare genetic variations confer a large amount of risk
  • 25. Whole Genome Regression (WGR) Account for polygenic effects, population structure, and relatedness • Reduce false positives • Reduce false negatives
  • 26. Mission: Industrialize genomics by integrating bioinformatics into data science Core principles: • Build on Apache Spark • Flexibly and natively support genomics tools and file formats • Provide single-line functions for common genomics workloads • Build an open-source community
  • 27. Glow v1.0.0 ● Datasources: Read/write common genomic file formats (e.g. VCF, BGEN, Plink, GFF3) into/from Spark DataFrames ● SQL expressions: Simple variant handling operations can be called from Python, SQL, Scala, or R ● Transformers: Complex genomic transformations can be called from Python or Scala ● GloWGR: Novel WGR/GWAS algorithm built with PySpark https://projectglow.io/
  • 28. GloWGR: WGR and GWAS ● Detect which genotypes are associated with each phenotype using a Generalized Linear Model ● Glow parallelizes the REGENIE method via Spark as GloWGR ● Built from the ground up using Pandas UDFs
  • 29. GloWGR: Learning at huge dimensions • WGR Reduction: ~5000 multivariate linear ridge regressions (one for each block and parameter) • WGR Regression: ~5000 multivariate linear or logistic ridge regressions with cross validation • GWAS Regression Tests: millions of single-variate linear or logistic regressions • [Figure: matrix dimensions 500K x 100, 500K x 50, 500K x 1M]
  • 30. Data preparation Transformation and SQL functions on Genomic Variant DataFrame ● split_multiallelics ● genotype_states ● mean_substitute
  • 31. Stage 1: Genotype matrix blocking
  • 32. Stage 2: Dimensionality reduction RidgeReduction.fit ● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$ ● Pandas UDF: Reduce with element-wise sum over sample blocks ● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$ RidgeReduction.transform ● Pandas UDF: Calculate $XB$ for each block
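The three fit steps and the transform step can be sketched in NumPy as follows. This is an illustration of the arithmetic only, not Glow's implementation: the data are random, the penalty `alpha` is an arbitrary value, and for simplicity the statistics are summed over all sample blocks rather than assembled separately for a particular sample block.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variants, n_traits = 12, 4, 2
X = rng.normal(size=(n_samples, n_variants))  # one variant block of genotypes
Y = rng.normal(size=(n_samples, n_traits))    # phenotypes
alpha = 0.5                                   # ridge penalty (illustrative value)

# Split rows into sample blocks, as if each lived in its own Spark partition.
sample_blocks = np.array_split(np.arange(n_samples), 3)

# Step 1: per-block statistics X^T X and X^T Y.
stats = [(X[idx].T @ X[idx], X[idx].T @ Y[idx]) for idx in sample_blocks]

# Step 2: reduce with an element-wise sum over sample blocks.
XtX = sum(s[0] for s in stats)
XtY = sum(s[1] for s in stats)

# Step 3: B = (X^T X + alpha I)^{-1} X^T Y, solved without forming the inverse.
B = np.linalg.solve(XtX + alpha * np.eye(n_variants), XtY)

# Transform: the reduced features XB for this block.
XB = X @ B
```

Only the small $X^T X$ and $X^T Y$ matrices need to be combined across blocks, which is what makes the reduction step cheap to distribute.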
  • 33. Stage 3: Estimate phenotypic predictors RidgeRegression.fit ● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$ ● Pandas UDF: Reduce with element-wise sum over sample blocks ● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$ ● Perform cross validation and pick the model with the best $\alpha$ RidgeRegression.transform_loco ● Pandas UDF: Calculate $XB$ for each block in a leave-one-chromosome-out (LOCO) fashion
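One way to picture the cross-validation step is the sketch below, which reuses per-block statistics so that each held-out sample block is scored by a model fit on the remaining blocks. It is a sketch under stated assumptions rather than GloWGR's code: the data are simulated, the `alphas` grid and the helper name `cv_error` are hypothetical, and mean squared error is assumed as the selection criterion.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 60, 5
X = rng.normal(size=(n_samples, n_features))
true_B = rng.normal(size=(n_features, 1))
Y = X @ true_B + 0.1 * rng.normal(size=(n_samples, 1))

alphas = [0.01, 1.0, 100.0, 10000.0]  # hypothetical penalty grid
blocks = np.array_split(np.arange(n_samples), 4)

# Per-block statistics and their totals over all sample blocks.
stats = [(X[i].T @ X[i], X[i].T @ Y[i]) for i in blocks]
XtX_all = sum(s[0] for s in stats)
XtY_all = sum(s[1] for s in stats)


def cv_error(alpha: float) -> float:
    """Mean squared error over held-out sample blocks for one penalty value."""
    sq_err = 0.0
    for idx, (XtX_b, XtY_b) in zip(blocks, stats):
        # Training statistics are the totals minus the held-out block's share.
        A = (XtX_all - XtX_b) + alpha * np.eye(n_features)
        B = np.linalg.solve(A, XtY_all - XtY_b)
        sq_err += float(np.sum((Y[idx] - X[idx] @ B) ** 2))
    return sq_err / n_samples


errors = {a: cv_error(a) for a in alphas}
best_alpha = min(errors, key=errors.get)
```

Subtracting a block's statistics from the totals gives the same coefficients as refitting on the other blocks' rows, so no block has to be revisited for each fold.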
  • 34. GWAS • $Y \sim G\beta_g + C\beta_c + \epsilon$ becomes $Y - \hat{Y} \sim G\beta_g + C\beta_c + \epsilon$ • Use the phenotype estimate $\hat{Y}$ output by WGR to account for polygenic effects during regression
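As a rough illustration of the linear case, the per-variant effect $\beta_g$ can be computed for every variant at once by first projecting the covariates out of both the adjusted phenotype and the genotypes (the Frisch-Waugh-Lovell approach, which is a choice made for this sketch and not necessarily how GloWGR does it). The data are simulated, and `Y_hat` is a zero placeholder standing in for the WGR output.

```python
import numpy as np

rng = np.random.default_rng(2)
S, V, n_cov = 200, 6, 2
G = rng.integers(0, 3, size=(S, V)).astype(float)               # genotypes coded 0/1/2
C = np.column_stack([np.ones(S), rng.normal(size=(S, n_cov))])  # intercept + covariates
Y = rng.normal(size=S) + 0.5 * G[:, 3]                          # variant 3 has a true effect
Y_hat = np.zeros(S)  # placeholder for the WGR phenotype estimate (assumption)

# Adjusted phenotype: Y - Y_hat.
y = Y - Y_hat

# Project the covariates out of both sides using an orthonormal basis of C.
Q, _ = np.linalg.qr(C)
y_res = y - Q @ (Q.T @ y)
G_res = G - Q @ (Q.T @ G)

# One single-variate regression per variant, computed for all variants at once.
beta_g = (G_res.T @ y_res) / np.einsum("sv,sv->v", G_res, G_res)
```

Each entry of `beta_g` matches the genotype coefficient from a full least-squares fit of the adjusted phenotype on that variant plus the covariates.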
  • 35. GWAS with Spark SQL expressions • Data: S samples, C covariates, V variants, T traits • Null model (Cβc): S samples, C covariates, 1 trait; fit T times • Fitted model (Gβg): S samples, C covariates, 1 variant, 1 trait; fit V x T times • Results: V variants, T traits
  • 36. GWAS with Spark SQL expressions Pros • Portable to all Spark clients
  • 38. GWAS with Spark SQL expressions Pros • Portable to all Spark clients Cons • Requires writing your own Spark SQL expressions • User-unfriendly linear algebra libraries in Scala (i.e. Breeze): limited to 2 dimensions, unnatural expressions of mathematical operations • Customized, expensive data transfers: Spark DataFrames ↔ MLlib matrices ↔ Breeze matrices • Input and output must be Spark DataFrames
  • 39. GWAS with PySpark • Phenotype matrix: S samples, T traits • Covariate matrix: S samples, C covariates • Genotype matrix: S samples, V variants • Null model (Cβc): S samples, C covariates, 1 trait; fit T times • Fitted model (Gβg): S samples, C covariates, O(V) variants, O(T) traits; fit once per partition • Results: V variants, T traits
  • 40. GWAS with PySpark Pros • User-friendly Python libraries (i.e. Pandas) • Easy to express mathematical notation • Unlimited dimensions • Batched, optimized transfers between Pandas and Spark DataFrames • Input and output can be Pandas or Spark DataFrames Cons • Accessible only from Python
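To make the points about mathematical notation and dimensionality concrete, here is a small NumPy sketch that computes effect estimates for every variant and trait pair in two `einsum` calls while skipping missing phenotype values. It assumes genotypes and phenotypes that have already been residualized and uses a no-intercept regression for brevity; the data, the missingness rate, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
S, V, T = 50, 4, 3
G = rng.normal(size=(S, V))           # residualized genotypes (assumed given)
Y = rng.normal(size=(S, T))           # residualized phenotypes (assumed given)
mask = rng.random(size=(S, T)) > 0.1  # True where a trait value is observed

# Zero out missing phenotype entries so they drop out of the sums.
Ym = np.where(mask, Y, 0.0)

# numerator[v, t] = sum over observed samples s of G[s, v] * Y[s, t]
num = np.einsum("sv,st->vt", G, Ym)

# denominator[v, t] = sum over observed samples s of G[s, v] ** 2
den = np.einsum("sv,sv,st->vt", G, G, mask.astype(float))

beta = num / den  # shape (V, T): one estimate per variant and trait
```

The index strings read almost like the summation formulas they implement, and the three-operand contraction has no direct counterpart in a library restricted to two-dimensional matrices.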
  • 42. GWAS comparison
      Approach  | I/O formats                | Linalg libraries           | Accessible clients
      Spark SQL | Spark DataFrames           | Spark ML/MLlib, Breeze     | Scala, Python, R
      PySpark   | Spark or Pandas DataFrames | Pandas, NumPy, einsum, ... | Python
  • 43. Differentiation from other parallelized libraries. Glow: ▪ Lightweight: Glow is a thin layer built to be compatible with the latest major Spark releases, as well as other open-source libraries (e.g. Delta) ▪ Flexible: Glow includes a set of core algorithms, and is easily extended to ad-hoc use cases using existing tools. Other parallelized libraries: ▪ Heavyweight: Many libraries build on custom logic that makes it difficult to update to new technologies ▪ Inflexible: Many libraries expose custom interfaces that make it difficult to extend beyond the built-in algorithms
  • 44. Future work: gene burden tests
  • 45. Big takeaways 1. Listen to your users 2. Use the latest off-the-shelf tools 3. If all else fails, pivot early
  • 46. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.