SlideShare a Scribd company logo
Healthcare Claim Reimbursement using Apache Spark
Claim Reimbursement with Spark
Salim Sayed
Principal Architect, Optum
Agenda
▪
▪
▪
▪
▪
▪
▪
▪
Claim Reimbursement Overview
1 2
34
5
0
6
Why Spark
▪ Performant and Scalable
▪ Works in both high and low volume scenarios.
▪ Applicable for applications not having three Vs - Volume, Velocity and Variety.
▪ Stability
▪ Easy and mature API.
▪ Mature set of tools are available for development and support.
▪ Cost
▪ Free opensource version is production grade.
▪ Possibility of cost savings with dynamic scaling on public cloud.
Why Spark
▪ Easy to Adapt
▪ Few concepts to master – distributed nature of execution, lazy evaluation.
▪ The API is very fluent.
▪ Java and Python are common skill sets.
▪ Easy IDE based development – no VM needed as in Hadoop.
▪ Code can be unit tested.
▪ Easy to apply Object Oriented and functional programming.
▪ Code can be reused for Batch and Streaming mode.
▪ Compatibility
▪ It is compatible with almost all technologies.
▪ A great number of data drivers.
▪ Active Community Support
▪ There is a plenty of support on Stack overflow, Spark’s official documentation and Databricks blogs.
Healthcare Claim Reimbursement using Apache Spark
Claim ETL Rewrite
AfterBefore
Oracle-
PLSQL
Claims
Oracle-
tables
Parser
Claims
Parser
Oracle-
tables
Data lake
(Parquet)
Staging
tables
1
3
2
4
5
1
2
3
Claim ETL Rewrite
▪ Spark 2.4.2, opensource.
▪ Data lake in Parquet format.
▪ Plan to migrate to Delta Lake.
▪ Spark standalone cluster.
▪ On premise cluster.
▪ Plan to move to Cloud.
▪ Production support tools.
▪ Zeppelin notebook.
▪ Spark Shell.
Gains
Gains
▪ Deprecation of old codebase and technical debt.
▪ Cost Savings
▪ Realized in the Cloud with dynamic scaling.
▪ Data will be available in a compressed, splittable format for any other
processing.
▪ It is a great infrastructure for ad hoc data analysis and machine
learning etc.
Challenges
▪ Operationalization
▪ New toolset to interact with data.
▪ Data-lake can’t be updated easily.
▪ Debugging takes longer as accessing data from data-lake is slow.
▪ New skills to learn to support the new system.
Challenges
▪ Custom tools and scripts developed by support team need to be
refactored.
▪ Cost Savings
▪ Old infrastructure cost may not phase out immediately.
▪ Development
New skill sets
Adoption
▪ Possible data consistency issues if you choose dual write
architecture.
Healthcare Claim Reimbursement using Apache Spark
Claim Reimbursement Rewrite
AfterBefore
Oracle-
PLSQL
Oracle-
tables
Oracle –
Claim
tables
O360 Claim Reimbursement
Staging
tables
Claim (Delta
Lake)
Optum Intelligent EDI (
clearing house)
Optum Intelligent EDI (
clearing house)
1
2
5
4
3
6
1
2
3
Design Constraints
▪ REST API compatible.
▪ Streaming compatible.
▪ Batch compatible.
▪ Inner source, sharable library.
▪ Domain Driven Development to tackle complexity.
▪ Easy for BDD and TDD development processes.
Design
ClaimReimbursement.jarData-to-Object
Mapping
Data-to-Object
Mapping
ClaimReimbursement.jar
Data-to-Object
Mapping
ClaimReimbursement.jar
Sample Code
Implementation Highlights
▪ Same claim reimbursement library is used across streaming, batch
and REST API frameworks.
▪ The main business logic is implemented as a library written in Java.
▪ Java is chosen for its compatibility with both Scala and Java.
▪ Spark-SQL API is not used for writing business logic so that the core library is reusable across other integration frameworks.
▪ Conversion is needed between Java and Scala objects, either by coping values explicitly or using ‘import
collection.JavaConverters._’ to convert collection API from Scala to Java.
▪ Spark is used for its scalability
▪ Spark handles the consumption and processing of data.
▪ Dataset API is used for Domain Driver implementation.
▪ A data structure for the claim is chosen as an aggregate object so that join operation can be avoided. This facilitates reimbursement
calculation as a ‘map’ operation.
Results
▪ We have not seen this volume getting processed in our PLSQL based
system.
▪ The DB time is for ‘insert’ operation. Update and Delete operations are
far slower.
Process Volume Time Throughput
(Spark)
Throughput
(Baseline)
Claim Reimbursement (batch
mode, file system to file system) 80M 86 minutes 1M claims/minute
Pushing this result to a Oracle DB
80M *160 minutes 0.5M claims/minute
Total 80M 4hrs 333K claims/minute
(20 vCPU, 100GB memory)
400K claims/hr.
(20 vCPU, 50GB
memory, Oracle)
Cost
Technology Cost
Spark on Azure Databricks @ $4/hour
(DS15v2, 20 CPUs, 140 GB, Premium Tier, Data Engineering)
$20
+ Storage
Oracle on cloud
Dynamic scaling is not available
for non Exadata workload.
We have not see such a large volume go through, so
the time is not comparable.
PG on Azure (with support) @ $4/hour of G5 server
Given the complexity of the calculation process I am
not sure if the job time is comparable.
Delta lake adoption
▪ We are using open source delta lake, not Databricks’ managed delta
lake.
▪ We plan to use Databricks’ managed delta-lake when we migrate to Azure.
▪ Data is partitioned by one key and ordered by another key.
▪ Z-order doesn’t work outside Databricks’ environment. Ordering data by a single key is good enough for us.
▪ However the more keys you use in z-order the less effective it becomes.
▪ Optimize command doesn’t work outside Databricks’ environment. Delta lake needs to be rebuild to compensate for that.
▪ The partition and order keys are chosen based on the most frequent
access so that best performance gain can be achieved.
Delta lake adoption
▪ All queries, having filter condition on these keys, run extremely fast as
compared to parquet based data-lake.
▪ As you might know that delta-lake improves performance by skipping data.
▪ All queries, not having filter condition on these keys, run at same
speed.
Performance Comparison
Process
20 vCPUs, 100GB memory, 97 GB data, 87M records
Data lake (parquet) Data lake (delta)
FILTER operation on the keys on which the delta-lake has
been partitioned or/and ordered. SELECT all columns.
18m 20s
FILTER operation on the keys on which the delta-lake has
been partitioned or/and ordered. SELECT 1 column. 3s <1s
UPDATE using partitioned or/and ordered columns.
1h 10m 37s
Merge 6000 records
1h 10m 23m
Left-outer join between 87M and 6K records on
partitioned or/and ordered keys
36m 20s
FILTER operation on the keys on which the delta-lake has
not been partitioned or/and ordered. 18m 18m
Expanding the Horizon
▪ By rewriting a RDBMS based application using Spark, we are looking
forward to expanding our capabilities much further.
▪ High volume operation has become a matter of hardware scaling.
▪ Not depending on SQL optimization for complicated business logic.
▪ Integration with streaming workload has become a reality.
▪ Applying machine learning on our dataset starts to look easier.
▪ Saving cost on license and support looks closer.
Tips
▪ Running Spark on premises doesn’t save a ton of cost. Plan for cloud
migration as a core component of the rewrite project.
▪ Consider changing production support procedures, ad hoc tool sets,
inertia of people using RDBMS, debug time on data lake etc. as part of
the rewrite project
▪ Use delta-lake. There is hardly any reason to use direct parquet based
data lake
Tips
▪ An optimum schema design will provide the biggest performance gain
and implementation simplicity. This needs to be thought out properly.
▪ Dataset API makes development process much easier.
▪ Easy to implement complicated logic.
▪ A team new to Spark can use a language like Java/Scala to write
bulk of the logic.
▪ Business logic needn’t depend on Spark syntax which makes it
dependent on Spark to execute.
▪ Existing libraries can be used.
▪ It is slower than Dataframe.
Tips
▪ Be cognizant of all pieces of the data pipeline. Moving bottleneck from
one component may shift the bottleneck to the other components.
Ideally separate out the scalable component from non-scalable ones.
▪ Opensource Spark and delta-lake are pretty good and production
grade. It could be a cheaper option for you
▪ A Spark application is a practical solution even if you don’t have those
6 Vs calling for a bigdata application.
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Healthcare Claim Reimbursement using Apache Spark
Ad

More Related Content

What's hot (20)

Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
Ashish Thapliyal
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
Antonios Chatzipavlis
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
Denodo
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
Spark
SparkSpark
Spark
Heena Madan
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Data Sharing with Snowflake
Data Sharing with SnowflakeData Sharing with Snowflake
Data Sharing with Snowflake
Snowflake Computing
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
Knoldus Inc.
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
Databricks
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
elliando dias
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Designing a modern data warehouse in azure
Designing a modern data warehouse in azure   Designing a modern data warehouse in azure
Designing a modern data warehouse in azure
Antonios Chatzipavlis
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the CloudHow to Take Advantage of an Enterprise Data Warehouse in the Cloud
How to Take Advantage of an Enterprise Data Warehouse in the Cloud
Denodo
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
James Serra
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
Knoldus Inc.
 

Similar to Healthcare Claim Reimbursement using Apache Spark (20)

From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Databricks
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
Santanu Dey
 
Data streaming fundamentals
Data streaming fundamentalsData streaming fundamentals
Data streaming fundamentals
Mohammed Fazuluddin
 
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
Gary Jackson MBCS
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
OOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with ParallelOOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with Parallel
Kellyn Pot'Vin-Gorman
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Databricks
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
政宏 张
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Alluxio, Inc.
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-Features
Navneet Upneja
 
Intro to Azure SQL database
Intro to Azure SQL databaseIntro to Azure SQL database
Intro to Azure SQL database
Steve Knutson
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2
Ajay Kumar Uppal
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Databricks
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
Dr. Wilfred Lin (Ph.D.)
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Databricks
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
Santanu Dey
 
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
SAP HANA System Replication (HSR) versus SAP Replication Server (SRS)
Gary Jackson MBCS
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Databricks
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
政宏 张
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Alluxio, Inc.
 
ORACLE 12C-New-Features
ORACLE 12C-New-FeaturesORACLE 12C-New-Features
ORACLE 12C-New-Features
Navneet Upneja
 
Intro to Azure SQL database
Intro to Azure SQL databaseIntro to Azure SQL database
Intro to Azure SQL database
Steve Knutson
 
Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2Trusted advisory on technology comparison --exadata, hana, db2
Trusted advisory on technology comparison --exadata, hana, db2
Ajay Kumar Uppal
 
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum - IOUG Collaborate 2013 - An Insight into Space Realization on ODA...
Maaz Anjum
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Databricks
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
Dr. Wilfred Lin (Ph.D.)
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Ad

Recently uploaded (20)

DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
 
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story04302025_CCC TUG_DataVista: The Design Story
04302025_CCC TUG_DataVista: The Design Story
ccctableauusergroup
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 

Healthcare Claim Reimbursement using Apache Spark

  • 2. Claim Reimbursement with Spark Salim Sayed Principal Architect, Optum
  • 5. Why Spark ▪ Performant and Scalable ▪ Works in both high and low volume scenarios. ▪ Applicable for applications not having three Vs - Volume, Velocity and Variety. ▪ Stability ▪ Easy and mature API. ▪ Mature set of tools are available for development and support. ▪ Cost ▪ Free opensource version is production grade. ▪ Possibility of cost savings with dynamic scaling on public cloud.
  • 6. Why Spark ▪ Easy to Adapt ▪ Few concepts to master – distributed nature of execution, lazy evaluation. ▪ The API is very fluent. ▪ Java and Python are common skill sets. ▪ Easy IDE based development – no VM needed as in Hadoop. ▪ Code can be unit tested. ▪ Easy to apply Object Oriented and functional programming. ▪ Code can be reused for Batch and Streaming mode. ▪ Compatibility ▪ It is compatible with almost all technologies. ▪ A great number of data drivers. ▪ Active Community Support ▪ There is a plenty of support on Stack overflow, Spark’s official documentation and Databricks blogs.
  • 9. Claim ETL Rewrite ▪ Spark 2.4.2, opensource. ▪ Data lake in Parquet format. ▪ Plan to migrate to Delta Lake. ▪ Spark standalone cluster. ▪ On premise cluster. ▪ Plan to move to Cloud. ▪ Production support tools. ▪ Zeppelin notebook. ▪ Spark Shell.
  • 10. Gains
  • 11. Gains ▪ Deprecation of old codebase and technical debt. ▪ Cost Savings ▪ Realized in the Cloud with dynamic scaling. ▪ Data will be available in a compressed, splittable format for any other processing. ▪ It is a great infrastructure for ad hoc data analysis and machine learning etc.
  • 12. Challenges ▪ Operationalization ▪ New toolset to interact with data. ▪ Data-lake can’t be updated easily. ▪ Debugging takes longer as accessing data from data-lake is slow. ▪ New skills to learn to support the new system.
  • 13. Challenges ▪ Custom tools and scripts developed by support team need to be refactored. ▪ Cost Savings ▪ Old infrastructure cost may not phase out immediately. ▪ Development New skill sets Adoption ▪ Possible data consistency issues if you choose dual write architecture.
  • 15. Claim Reimbursement Rewrite AfterBefore Oracle- PLSQL Oracle- tables Oracle – Claim tables O360 Claim Reimbursement Staging tables Claim (Delta Lake) Optum Intelligent EDI ( clearing house) Optum Intelligent EDI ( clearing house) 1 2 5 4 3 6 1 2 3
  • 16. Design Constraints ▪ REST API compatible. ▪ Streaming compatible. ▪ Batch compatible. ▪ Inner source, sharable library. ▪ Domain Driven Development to tackle complexity. ▪ Easy for BDD and TDD development processes.
  • 19. Implementation Highlights ▪ Same claim reimbursement library is used across streaming, batch and REST API frameworks. ▪ The main business logic is implemented as a library written in Java. ▪ Java is chosen for its compatibility with both Scala and Java. ▪ Spark-SQL API is not used for writing business logic so that the core library is reusable across other integration frameworks. ▪ Conversion is needed between Java and Scala objects, either by coping values explicitly or using ‘import collection.JavaConverters._’ to convert collection API from Scala to Java. ▪ Spark is used for its scalability ▪ Spark handles the consumption and processing of data. ▪ Dataset API is used for Domain Driver implementation. ▪ A data structure for the claim is chosen as an aggregate object so that join operation can be avoided. This facilitates reimbursement calculation as a ‘map’ operation.
  • 20. Results ▪ We have not seen this volume getting processed in our PLSQL based system. ▪ The DB time is for ‘insert’ operation. Update and Delete operations are far slower. Process Volume Time Throughput (Spark) Throughput (Baseline) Claim Reimbursement (batch mode, file system to file system) 80M 86 minutes 1M claims/minute Pushing this result to a Oracle DB 80M *160 minutes 0.5M claims/minute Total 80M 4hrs 333K claims/minute (20 vCPU, 100GB memory) 400K claims/hr. (20 vCPU, 50GB memory, Oracle)
  • 21. Cost Technology Cost Spark on Azure Databricks @ $4/hour (DS15v2, 20 CPUs, 140 GB, Premium Tier, Data Engineering) $20 + Storage Oracle on cloud Dynamic scaling is not available for non Exadata workload. We have not see such a large volume go through, so the time is not comparable. PG on Azure (with support) @ $4/hour of G5 server Given the complexity of the calculation process I am not sure if the job time is comparable.
  • 22. Delta lake adoption ▪ We are using open source delta lake, not Databricks’ managed delta lake. ▪ We plan to use Databricks’ managed delta-lake when we migrate to Azure. ▪ Data is partitioned by one key and ordered by another key. ▪ Z-order doesn’t work outside Databricks’ environment. Ordering data by a single key is good enough for us. ▪ However the more keys you use in z-order the less effective it becomes. ▪ Optimize command doesn’t work outside Databricks’ environment. Delta lake needs to be rebuild to compensate for that. ▪ The partition and order keys are chosen based on the most frequent access so that best performance gain can be achieved.
  • 23. Delta lake adoption ▪ All queries, having filter condition on these keys, run extremely fast as compared to parquet based data-lake. ▪ As you might know that delta-lake improves performance by skipping data. ▪ All queries, not having filter condition on these keys, run at same speed.
  • 24. Performance Comparison Process 20 vCPUs, 100GB memory, 97 GB data, 87M records Data lake (parquet) Data lake (delta) FILTER operation on the keys on which the delta-lake has been partitioned or/and ordered. SELECT all columns. 18m 20s FILTER operation on the keys on which the delta-lake has been partitioned or/and ordered. SELECT 1 column. 3s <1s UPDATE using partitioned or/and ordered columns. 1h 10m 37s Merge 6000 records 1h 10m 23m Left-outer join between 87M and 6K records on partitioned or/and ordered keys 36m 20s FILTER operation on the keys on which the delta-lake has not been partitioned or/and ordered. 18m 18m
  • 25. Expanding the Horizon ▪ By rewriting a RDBMS based application using Spark, we are looking forward to expanding our capabilities much further. ▪ High volume operation has become a matter of hardware scaling. ▪ Not depending on SQL optimization for complicated business logic. ▪ Integration with streaming workload has become a reality. ▪ Applying machine learning on our dataset starts to look easier. ▪ Saving cost on license and support looks closer.
  • 26. Tips ▪ Running Spark on premises doesn’t save a ton of cost. Plan for cloud migration as a core component of the rewrite project. ▪ Consider changing production support procedures, ad hoc tool sets, inertia of people using RDBMS, debug time on data lake etc. as part of the rewrite project ▪ Use delta-lake. There is hardly any reason to use direct parquet based data lake
  • 27. Tips ▪ An optimum schema design will provide the biggest performance gain and implementation simplicity. This needs to be thought out properly. ▪ Dataset API makes development process much easier. ▪ Easy to implement complicated logic. ▪ A team new to Spark can use a language like Java/Scala to write bulk of the logic. ▪ Business logic needn’t depend on Spark syntax which makes it dependent on Spark to execute. ▪ Existing libraries can be used. ▪ It is slower than Dataframe.
  • 28. Tips ▪ Be cognizant of all pieces of the data pipeline. Moving bottleneck from one component may shift the bottleneck to the other components. Ideally separate out the scalable component from non-scalable ones. ▪ Opensource Spark and delta-lake are pretty good and production grade. It could be a cheaper option for you ▪ A Spark application is a practical solution even if you don’t have those 6 Vs calling for a bigdata application.
  • 29. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.