All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks - A Real World Case Study
Victoria Morris
Unicorn Health Bridge Consulting, working for Atrium Health
Agenda
Victoria Morris
▪ Overview: LInK
▪ Issues – why change?
▪ Next Moves
▪ Migration: Starting Small with the Pharmacogenomics Pipeline
▪ Clinical Trials Matching Pipeline
▪ The Great Migration: Hive -> Databricks
▪ Things We Learned
▪ Business Impact
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Overview: LInK
Original Problem Statement(s)
▪ Genomic reports are hard to find in the Electronic Medical Record (EMR)
▪ The reports are difficult to read (many pages), differ from lab to lab, may not have relevant recommendations, and require manual effort to summarize
▪ Presenting relevant clinical trials to providers when making treatment decisions will increase clinical trial participation
▪ As a Center of Excellence (COE) for the American Society of Clinical Oncology (ASCO)'s Targeted Agent and Profiling Utilization Registry (TAPUR) clinical trial, clinical outcomes and treatment data must be reported back to the COE for patients enrolled in the studies
▪ The current process is complicated, time-consuming, and manual
Overview
▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide interoperability of data between different LCI data sources
▪ Specifically, to address the multiple data silos that contain related data, which is a consistent challenge across the system
▪ Data meaning must be transferred, not just values
▪ Apple: Fruit vs. Computer
▪ Originally we had 4 people, and we all had day jobs
[Diagram: LInK at the center, integrating:
▪ Specialized external testing (external; sftp/Data Factory): testing results, PDFs, and raw sequence data in; PDF and clinical decision support out
▪ Specialized internal testing (internal; genomics lab): testing results and raw sequence data in, PDF out; raw reads converted from genotype to phenotype, with a report generated for the provider
▪ Clinical trials management software (on-premise, soon to be cloud)
▪ EMR clinical data (Cerner reporting database/EDW)
▪ LCI encounter data (EDW)
▪ Unstructured notes (e.g., Cerner reporting database)
▪ EAPathways database (on-premise), embedded in Cerner via SMART on FHIR
▪ Office 365 (external API)
Genomic results and PDF reports go to a Tier 1 SharePoint site for molecular tumor board review. POC outputs: clinical decision support, clinical trials matching, and pharmacogenomics.]
LInK Data connections – High Level
[Diagram: Azure cloud storage (Frd1Storage, Netezza, Azure Storage) fed by on-premise databases (EDW, EAPathways, Oncore), external labs (Caris, Inivata, FMI), EMR sources (Cerner, EPIC, CRSTAR), and the on-premise genomics lab. Downstream consumers: clinical trials management, clinical decision support, the enterprise data warehouse, ARIA radiation treatments, and CoPath pathology. Genomic pipelines are auto-generated by MS Web Apps and MS SharePoint Designer.]
LInK Data connections – High Level
[Diagram: as above, with Tempus added to the external labs (Tempus, Caris, Inivata, FMI), external vendors' containers feeding a second Azure Storage account, and a PharmacoGenomics pipeline added alongside the genomic pipelines; the MS Web Apps and SharePoint Designer generators are gone.]
Issues
Issues
▪ We run 365 days a year
▪ The data is used in real time by providers to make clinical decisions for patient cancer treatment; any breakdown in the pipeline is a Priority 1 issue that needs to be fixed as soon as possible
▪ We were early adopters of HDI – this server has been up since 2016; it is old technology, and HDI was not built for servers to live this long
Issues cont’d
▪ Randomly, the cluster would freeze and go into SAFE mode with no warning; this happened on a weekly basis, often several days in a row, during the overnight batch
▪ We were past the default allocation of 10,000 Tez counters and had to configure the runs to constantly request additional ones, back at around 3,000 lines of Hive code
▪ Although we tried using matrix manipulation in Hive, at some point you just need a loop
Issues cont’d
▪ The cost of keeping the HDI cluster up 24x365 was very high, so we scaled it up and down to help reduce costs
▪ The cluster was not stable because we were scaling up and down every day; at one point there were so many logs from the daily scaling that it took the entire HDI cluster down
Issues cont’d
▪ Twice the cluster went down so badly that MS Support's response was to destroy it and start again, which we did the first time…
▪ Our HDI server choice tied us to Hive v2 and forced us to disable vectorized execution; we had to constantly set hive.vectorized.execution.enabled=false; throughout the script because it would "forget" the setting, which was slowing down processing
Next moves
Search
▪ We wanted something that was cheaper
▪ We wanted to keep our old WASB storage, not have to migrate the data lake
▪ We wanted flexibility in language options for ongoing operations and continuity of care; we did not want to get boxed into just one
▪ We wanted something less agnostic, more fully integrated into the Microsoft ecosystem
Search cont’d
▪ We needed it to be HIPAA compliant because we were working with patient data
▪ We needed something self-sufficient in cluster management, so we could concentrate on the programming instead of the infrastructure
▪ We really liked the notebook concept, and had started experimenting with Jupyter notebooks inside HDI
Migration
Migration – starting small
▪ There is a large, steep learning curve to get into Databricks
▪ We had a new project, a second pipeline that had to be built, and it seemed easier to start with something smaller than the 8,000 lines of Hive code that transitioning the original pipeline would have required
Pharmacogenomics (in progress)
Pharmacogenomics
We receive raw Genomic test
results from our internal lab
Pharmacogenomics
Single Notebook
Overview: Genomic Clinical Trials Pipeline
Clinical Trial Match Criteria
▪ Age (today's)
▪ Gender
▪ First-line eligible (no previous anti-neoplastics ordered)
▪ Genomic results (over 1,290 genes)
▪ Diagnosis
▪ Tumor site
▪ Secondary gene results
▪ Must have/not have a specific protein change/mutation
▪ Previous lab results
▪ Previous medications
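The deck does not show the matching logic itself; as a rough flavor of how criteria like these can be evaluated in PySpark, here is a minimal sketch with hypothetical table and column names, covering three of the ten criteria:

from pyspark.sql import functions as F

# Hypothetical inputs: one row per patient profile, one row per trial's criteria.
patients = spark.table("patient_profiles")
trials = spark.table("trial_criteria")

# Pair every patient with every trial, then keep pairs where the patient
# satisfies the criteria carried on the trial row.
matches = (
    patients.crossJoin(trials)
    .where(F.col("patient_age").between(F.col("min_age"), F.col("max_age")))
    .where(F.col("patient_gender") == F.col("required_gender"))
    .where(F.arrays_overlap(F.col("patient_mutations"), F.col("required_mutations")))
)
matches.write.mode("overwrite").saveAsTable("trial_matches")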
Opening Screen
The Great Migration
Pipeline stages:
1. Preprocess each lab into a similar data format: process the Tempus, Caris, FMI, and Inivata files (see the sketch below)
2. Main Match: create the clinical matches
3. Create Summary: create the genomic summary, combine it with the matches, and save to the database
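As an illustration of step 1, a minimal PySpark sketch, assuming hypothetical source paths, vendor layouts, and a hypothetical common schema (patient_id, gene, variant, lab); spark is predefined in Databricks notebooks:

from pyspark.sql import functions as F

def preprocess_tempus(path):
    # Each lab's preprocessor maps that vendor's layout onto the shared columns.
    raw = spark.read.json(path)
    return raw.select(
        F.col("patient.id").alias("patient_id"),
        F.col("result.gene").alias("gene"),
        F.col("result.variant").alias("variant"),
        F.lit("Tempus").alias("lab"),
    )

# preprocess_caris / preprocess_fmi / preprocess_inivata are analogous.
tempus = preprocess_tempus("/mnt/labs/tempus/")  # hypothetical mount
tempus.write.mode("append").saveAsTable("lab_results_common")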
Hive Conversion
Initial Definitions
▪ [Hive vs. Databricks side-by-side code comparison; details were shown as screenshots]
Reading the file
▪ Hive: not a separate step, part of the next step
▪ Databricks: [code shown as a screenshot]
Creating a clean view of the data
▪ [Hive vs. Databricks side-by-side code comparison; details were shown as screenshots]
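The original bullets here were screenshots; as a rough illustration of the Databricks side, a minimal sketch assuming a hypothetical JSON drop location and hypothetical column names (spark is predefined in Databricks notebooks):

# Reading the file is an explicit step in Databricks, unlike Hive.
report_df = spark.read.option("multiLine", True).json("/mnt/labs/vendor/")
report_df.createOrReplaceTempView("raw_reports")

# A "clean view" of the data can then be layered on with plain SQL.
clean = spark.sql("""
    SELECT patient_id, gene, variant   -- hypothetical columns
    FROM raw_reports
    WHERE variant IS NOT NULL
""")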
Databricks by the numbers
▪ We work in a Premium workspace, using our internal IP addresses inside a secured subnet inside the Atrium Health Azure subscription
▪ Databricks is fully HIPAA compliant
▪ Clusters are created with predefined tags, and the costs associated with each tagged cluster's runs can be separated out
▪ Our data lake is ~110 terabytes
▪ We have 2.3+ million gene results x 240+ CTC to match against 10 criteria (over half a billion candidate comparisons)
▪ Yes, even during COVID-19 we are still seeing an average of 1 new report a day –
we still run 365 days a year
Things we learned
Azure Key Vaults and Back-up
▪ Azure Key Vaults are tricky to implement, and you only need to set up the connection on a new workspace – so save those instructions!
▪ But they are a very secure way to store all your connection info without having it in plain text in the notebook itself (see the sketch below)
▪ Do not forget to save a copy of everything offline periodically – if your workspace goes, you lose all the notebooks and any manually uploaded data tables…
▪ Yes, we have had to replace the workspace twice in this project
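For reference, a minimal sketch of reading from a Key Vault-backed secret scope; the scope name, secret names, and connection details are hypothetical (dbutils and spark are predefined in Databricks notebooks):

jdbc_user = dbutils.secrets.get(scope="link-kv", key="edw-user")
jdbc_pass = dbutils.secrets.get(scope="link-kv", key="edw-password")

# Secret values are redacted if printed, so nothing lands in plain text.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://edw.example.org;database=LInK")  # hypothetical
      .option("user", jdbc_user)
      .option("password", jdbc_pass)
      .option("dbtable", "dbo.encounters")                              # hypothetical
      .load())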
Working with complex nested JSON and XML sucks
▪ It sounds so simple and works great in the one-level examples; in the real world, when something is nested and duplicated, or missing entirely from a record several levels deep, and usually inside structs, it sucks
▪ Structs versus arrays: we ended up having to convert structs to arrays all the time
▪ Use the cardinality function a lot to determine whether there is anything in an array
▪ The concat_ws trick helps when you are not sure whether your SQL ended up with an array or a string in your data (see the sketch below)
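A minimal sketch of both tricks, on a hypothetical results table with an array column named genes:

from pyspark.sql import functions as F

df = spark.table("results")  # hypothetical table

# cardinality tells you whether the array actually has anything in it.
non_empty = df.where(F.expr("cardinality(genes) > 0"))

# concat_ws flattens either an array or a plain string to one string,
# so it is safe when you are not sure which one the parse produced.
flat = df.select(F.concat_ws(";", F.col("genes")).alias("genes_flat"))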
Tips and tricks?
▪ Databricks only reads blobs of type Block blob. Any other type means Databricks does not even see the directory – that took a fair bit to uncover when one of our vendors uploaded a new set of files in the wrong blob type without realizing it
▪ We ended up using Data Factory a lot less than we thought – ODBC connections worked well, except for Oracle, which we never could get to work; it is the only thing still Sqooped nightly
Code Snips I used all the time
▪ Sharing a table from Python to Scala:
▪ %python pythonDF.write.mode("overwrite").saveAsTable("pythonTable")
▪ %scala val ScalaDF = spark.table("pythonTable")
▪ If you need a table from a JDBC source to use in SQL:
▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties)
▪ JDBCTableDF.write.mode("overwrite").saveAsTable("JDBCTableNameTbl")
▪ If you suddenly cannot write out a table, remove the stale warehouse directory first:
▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true)
I am no expert – but I ended up using these all the time
Code Snips I used all the time
▪ Save tables between notebooks – use REFRESH TABLE at the start of the new notebook to grab the latest version
▪ The null problem – use the cast function to save yourself from Parquet, which cannot write a column whose type is unknown because every value is null (see the sketch below)
I am no expert – but I ended up using these all the time
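A minimal sketch of both tips, with hypothetical table and column names:

# At the top of the downstream notebook, pick up the latest version of a
# table written by another notebook:
spark.sql("REFRESH TABLE pythonTable")

# Parquet cannot infer a type for an all-null column; an explicit cast
# pins the type before writing:
from pyspark.sql import functions as F
fixed = (spark.table("pythonTable")
         .withColumn("optional_note", F.col("optional_note").cast("string")))
fixed.write.mode("overwrite").saveAsTable("pythonTableClean")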
Business Impact
▪ More stable infrastructure
▪ Lower costs
▪ Results come in faster
▪ Easier to add additional labs
▪ Easier to troubleshoot when there are issues
▪ Increase in volume handled easily
▪ Self-service for end-users means no IAS intervention
Thanks!
Dr Derek Ragavan,
Carol Farhangfar, Nury Steuerwald, Jai Patel
Chris Danzi, Lance Richey, Scott Blevins
Andrea Bouronich, Stephanie King, Melanie Bamberg,
Stacy Harris
Kelly Jones and his team
All the data and system owners who let us access their data
All the Microsoft support folks who helped us push to the edge
And of course Databricks
Questions?