SlideShare a Scribd company logo
Behavior-Driven
Development (BDD)
Testing with Apache Spark
Aaron Colcord
Director of Engineering, Data and Analytics
Zachary Nanfelt
Software Engineer, Data and Analytics
Who is
FIS Global?
• We’re FIS Digital Finance, Mobile Data
and Analytics
• One of the largest global
FinTech companies
• Customers are banks and credit unions
• Ecosystem of products and services
built around core banking
•
2
Data Wrangling
ETL is still a thing
The way we do it varies quite a bit
When we do it also varies
The why we do it, that’s easy
1
2
3
4
3
Did we cover all scenarios?
Questions on what we are doing?
Can you prove that the data transformed correctly?
Is unit testing understandable?
Was acceptance criteria met?
Are we able to do all this testing complexity in a reasonable timeframe?
4
What is BDD?
• An extension of Test Driven Development
• Deliberate shared, ubiquitous language
• Automated acceptance tests written as Examples that anyone can read
• Living documentation – Documentation others can read about code is updated
• More agile to have good documentation as an output to development than
making documentation as input
• Enable more Team Members to Participate in the development process
5
Core Problem of Data Transformation
• It is really hard to prove data transformed correctly
in a normal pipeline aka Batch-Oriented
• The traditional way has been to push data through
the system and then query it out
• Apache Spark can accelerate not only the speed you transform,
but the speed in which you can validate transformations
– We can switch from Batch Oriented to Streaming
6
Spark is our favorite hammer
Beautiful Baby
=
7
Super Widget Scenario
• Our app servers log everything in Epoch Time (Unix)
from mobile app clients all over the world
• Users seem incapable of computing this mentally
and want it to appear in their own timezone
• Crazy, but some of these guys are remote
and in different timezones
8
Boilerplate Code
9
Extraction
Code
10
SQL Validation Test Code
SELECT TIMEZONE_OFFSET,
TIMESTAMP_GMT,
TIMESTAMP_LTZ
FROM
APPSTORE_REVIEWS
WHERE
TIMEZONE_OFFSET <> 0
LIMIT 10;
11
Cucumber Test
12
Cucumber Test
Version 2
13
Why this is so great
• Collaboration and Participation
• Thinking naturally begets
better scenarios
• We are able to unify
–Use Cases
•We are using ETL...
–All Projects
14
Given a BDD Presentation
When it is late in the day
And FIS is giving the talk
Then get Excited!
• We define Features and Scenarios
Expressive Scenarios
• Given/When/Then
Gherkin doesn’t care how you use them
They just help with readability.
15
Wait, there’s more!
Step Definitions
• Step Definitions tell the how to do
The Feature file said what to do
• This the boundary of the programmer’s
Domain and the business domain.
• It’s not all snake oil, really...
16
Cucumber Step Definition Code
17
Cucumber Step Definition Code
18
Cucumber Code
19
For these two guys, ETL wasn’t hell,
it was target practice.
2 Kinds of People in this world… About
20
Enterprise Stuff
• You will notice we are sticking to Eclipse/IntelliJ
• Enterprises usually need to prove Separation of Duties and Audit Trails
• Most Data processing tasks should have an established process to ensure
quality and correctness.
– All Business have their own Custom Approach to Business Rules
– Consolidating these transformations ensure quality. Allows QA Checking
• Notebooks:
– It’s really hard to enforce that consistency and correctness in notebooks, except by
Compiling libraries.
– Unifying Business Logic and Common Transformations removes the prep work.
21
Code
https://dbc-39f78c99-dfb2.cloud.databricks.com/#notebook/28139
22
Tips and Tricks (Pretty Report)
•plugin = {“pretty”, “html:target/cucumber”} isn’t
very pretty
•Use cucumber-reports
23
Verses
Tips and Tricks
(miscellaneous)
• Java cuke > Scala cuke
– intelliJ integration, speed, etc...
• .config(“spark.driver.host”, “127.0.0.1”)
– Saves overall test execution time
for each run
• Think hard about whether things should
get tested at unit or component level
– DAG takes longer to compute path
on more complex DAGs (e.g. longer tests),
but can provide more value
24
Tips and Tricks (Code Coverage for Scala
Cuke)
25
Resources, Resources, Resources
• cucumber.io
• cucumber-reporting
• Pragmatic Books
– Cucumber Book
– Cucumber Recipes
– Cucumber for Java
• Specification by Example
• Databricks Blog
26
Thank you
Ad

More Related Content

What's hot (20)

Business Intelligence (BI) For Manufacturing - A White Paper
Business Intelligence (BI) For Manufacturing - A White PaperBusiness Intelligence (BI) For Manufacturing - A White Paper
Business Intelligence (BI) For Manufacturing - A White Paper
Dhiren Gala
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Managing Data Integration Initiatives
Managing Data Integration InitiativesManaging Data Integration Initiatives
Managing Data Integration Initiatives
AllinConsulting
 
Business intelligence 101
Business intelligence   101Business intelligence   101
Business intelligence 101
Ashok Bhatla
 
The Complete Guide to Embedded Analytics
The Complete Guide to Embedded AnalyticsThe Complete Guide to Embedded Analytics
The Complete Guide to Embedded Analytics
Jessica Sprinkel
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Transforming Financial Services with Event Streaming Data
Transforming Financial Services with Event Streaming DataTransforming Financial Services with Event Streaming Data
Transforming Financial Services with Event Streaming Data
confluent
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business Intelligence
Ronan Soares
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
Tyler Wishnoff
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Delivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for CollibraDelivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for Collibra
Precisely
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra
 
The Importance of DataOps in a Multi-Cloud World
The Importance of DataOps in a Multi-Cloud WorldThe Importance of DataOps in a Multi-Cloud World
The Importance of DataOps in a Multi-Cloud World
DATAVERSITY
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
Ashish Thapliyal
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)
Bernardo Najlis
 
DMM9 - Data Migration Testing
DMM9 - Data Migration TestingDMM9 - Data Migration Testing
DMM9 - Data Migration Testing
Nick van Beest
 
Elastic Data Warehousing
Elastic Data WarehousingElastic Data Warehousing
Elastic Data Warehousing
Snowflake Computing
 
Data warehouse proposal
Data warehouse proposalData warehouse proposal
Data warehouse proposal
Peter Macdonald
 
Business Intelligence (BI) For Manufacturing - A White Paper
Business Intelligence (BI) For Manufacturing - A White PaperBusiness Intelligence (BI) For Manufacturing - A White Paper
Business Intelligence (BI) For Manufacturing - A White Paper
Dhiren Gala
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Managing Data Integration Initiatives
Managing Data Integration InitiativesManaging Data Integration Initiatives
Managing Data Integration Initiatives
AllinConsulting
 
Business intelligence 101
Business intelligence   101Business intelligence   101
Business intelligence 101
Ashok Bhatla
 
The Complete Guide to Embedded Analytics
The Complete Guide to Embedded AnalyticsThe Complete Guide to Embedded Analytics
The Complete Guide to Embedded Analytics
Jessica Sprinkel
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Transforming Financial Services with Event Streaming Data
Transforming Financial Services with Event Streaming DataTransforming Financial Services with Event Streaming Data
Transforming Financial Services with Event Streaming Data
confluent
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Introduction to Business Intelligence
Introduction to Business IntelligenceIntroduction to Business Intelligence
Introduction to Business Intelligence
Ronan Soares
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
Tyler Wishnoff
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Delivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for CollibraDelivering Trusted Insights with Integrated Data Quality for Collibra
Delivering Trusted Insights with Integrated Data Quality for Collibra
Precisely
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra
 
The Importance of DataOps in a Multi-Cloud World
The Importance of DataOps in a Multi-Cloud WorldThe Importance of DataOps in a Multi-Cloud World
The Importance of DataOps in a Multi-Cloud World
DATAVERSITY
 
Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)Business Intelligence Presentation (1/2)
Business Intelligence Presentation (1/2)
Bernardo Najlis
 
DMM9 - Data Migration Testing
DMM9 - Data Migration TestingDMM9 - Data Migration Testing
DMM9 - Data Migration Testing
Nick van Beest
 

Similar to Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcord and Zachary Nanfelt (20)

DevOps Days Ohio
DevOps Days OhioDevOps Days Ohio
DevOps Days Ohio
Kelly Looney
 
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps JourneyGartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Kelly Looney
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...
Marco Tusa
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps StoryDOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
Gene Kim
 
Kku2011
Kku2011Kku2011
Kku2011
ทวิร พานิชสมบัติ
 
Why retail companies can't afford database downtime
Why retail companies can't afford database downtimeWhy retail companies can't afford database downtime
Why retail companies can't afford database downtime
DBmaestro - Database DevOps
 
Performance Tuning in the Trenches
Performance Tuning in the TrenchesPerformance Tuning in the Trenches
Performance Tuning in the Trenches
Donald Belcham
 
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Simon Storm
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015
Mirco Hering
 
CQRS recipes or how to cook your architecture
CQRS recipes or how to cook your architectureCQRS recipes or how to cook your architecture
CQRS recipes or how to cook your architecture
Thomas Jaskula
 
Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas
Naresh Jain
 
DBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro's State of the Database Continuous Delivery Survey- Findings RevealedDBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro - Database DevOps
 
Building trust within the organization, first steps towards DevOps
Building trust within the organization, first steps towards DevOpsBuilding trust within the organization, first steps towards DevOps
Building trust within the organization, first steps towards DevOps
Guido Serra
 
The DevOps Journey at bwin.party
The DevOps Journey at bwin.partyThe DevOps Journey at bwin.party
The DevOps Journey at bwin.party
Kelly Looney
 
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps worldLucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
DevOps Enterprise Summit
 
Kku2011
Kku2011Kku2011
Kku2011
ทวิร พานิชสมบัติ
 
Behaviour Driven Development: Oltre i limiti del possibile
Behaviour Driven Development: Oltre i limiti del possibileBehaviour Driven Development: Oltre i limiti del possibile
Behaviour Driven Development: Oltre i limiti del possibile
Iosif Itkin
 
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
Nacho Cougil
 
Technical Without Code
Technical Without CodeTechnical Without Code
Technical Without Code
Caitlin Cassidy
 
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps JourneyGartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Gartner Infrastructure and Operations Summit Berlin 2015 - DevOps Journey
Kelly Looney
 
Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...Are we there Yet?? (The long journey of Migrating from close source to opens...
Are we there Yet?? (The long journey of Migrating from close source to opens...
Marco Tusa
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps StoryDOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
DOES16 London - Jonathan Fletcher - Re-imagining Hiscox IT: A DevOps Story
Gene Kim
 
Why retail companies can't afford database downtime
Why retail companies can't afford database downtimeWhy retail companies can't afford database downtime
Why retail companies can't afford database downtime
DBmaestro - Database DevOps
 
Performance Tuning in the Trenches
Performance Tuning in the TrenchesPerformance Tuning in the Trenches
Performance Tuning in the Trenches
Donald Belcham
 
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Agile and Continuous Delivery for Audits and Exams - DC Continuous Delivery M...
Simon Storm
 
Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015Dev Ops for systems of record - Talk at Agile Australia 2015
Dev Ops for systems of record - Talk at Agile Australia 2015
Mirco Hering
 
CQRS recipes or how to cook your architecture
CQRS recipes or how to cook your architectureCQRS recipes or how to cook your architecture
CQRS recipes or how to cook your architecture
Thomas Jaskula
 
Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas
Naresh Jain
 
DBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro's State of the Database Continuous Delivery Survey- Findings RevealedDBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro's State of the Database Continuous Delivery Survey- Findings Revealed
DBmaestro - Database DevOps
 
Building trust within the organization, first steps towards DevOps
Building trust within the organization, first steps towards DevOpsBuilding trust within the organization, first steps towards DevOps
Building trust within the organization, first steps towards DevOps
Guido Serra
 
The DevOps Journey at bwin.party
The DevOps Journey at bwin.partyThe DevOps Journey at bwin.party
The DevOps Journey at bwin.party
Kelly Looney
 
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps worldLucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
DevOps Enterprise Summit
 
Behaviour Driven Development: Oltre i limiti del possibile
Behaviour Driven Development: Oltre i limiti del possibileBehaviour Driven Development: Oltre i limiti del possibile
Behaviour Driven Development: Oltre i limiti del possibile
Iosif Itkin
 
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
All you need is fast feedback loop, fast feedback loop, fast feedback loop is...
Nacho Cougil
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Agricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptxAgricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptx
mostafaahammed38
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
Process Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBSProcess Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBS
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Agricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptxAgricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptx
mostafaahammed38
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Process Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBSProcess Mining and Official Statistics - CBS
Process Mining and Official Statistics - CBS
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 

Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcord and Zachary Nanfelt

  • 1. Behavior-Driven Development (BDD) Testing with Apache Spark Aaron Colcord Director of Engineering, Data and Analytics Zachary Nanfelt Software Engineer, Data and Analytics
  • 2. Who is FIS Global? • We’re FIS Digital Finance, Mobile Data and Analytics • One of the largest global FinTech companies • Customers are banks and credit unions • Ecosystem of products and services built around core banking • 2
  • 3. Data Wrangling ETL is still a thing The way we do it varies quite a bit When we do it also varies The why we do it, that’s easy 1 2 3 4 3
  • 4. Did we cover all scenarios? Questions on what we are doing? Can you prove that the data transformed correctly? Is unit testing understandable? Was acceptance criteria met? Are we able to do all this testing complexity in a reasonable timeframe? 4
  • 5. What is BDD? • An extension of Test Driven Development • Deliberate shared, ubiquitous language • Automated acceptance tests written as Examples that anyone can read • Living documentation – Documentation others can read about code is updated • More agile to have good documentation as an output to development than making documentation as input • Enable more Team Members to Participate in the development process 5
  • 6. Core Problem of Data Transformation • It is really hard to prove data transformed correctly in a normal pipeline aka Batch-Oriented • The traditional way has been to push data through the system and then query it out • Apache Spark can accelerate not only the speed you transform, but the speed in which you can validate transformations – We can switch from Batch Oriented to Streaming 6
  • 7. Spark is our favorite hammer Beautiful Baby = 7
  • 8. Super Widget Scenario • Our app servers log everything in Epoch Time (Unix) from mobile app clients all over the world • Users seem incapable of computing this mentally and want it to appear in their own timezone • Crazy, but some of these guys are remote and in different timezones 8
  • 11. SQL Validation Test Code SELECT TIMEZONE_OFFSET, TIMESTAMP_GMT, TIMESTAMP_LTZ FROM APPSTORE_REVIEWS WHERE TIMEZONE_OFFSET <> 0 LIMIT 10; 11
  • 14. Why this is so great • Collaboration and Participation • Thinking naturally begets better scenarios • We are able to unify –Use Cases •We are using ETL... –All Projects 14
  • 15. Given a BDD Presentation When it is late in the day And FIS is giving the talk Then get Excited! • We define Features and Scenarios Expressive Scenarios • Given/When/Then Gherkin doesn’t care how you use them They just help with readability. 15
  • 16. Wait, there’s more! Step Definitions • Step Definitions tell the how to do The Feature file said what to do • This the boundary of the programmer’s Domain and the business domain. • It’s not all snake oil, really... 16
  • 20. For these two guys, ETL wasn’t hell, it was target practice. 2 Kinds of People in this world… About 20
  • 21. Enterprise Stuff • You will notice we are sticking to Eclipse/IntelliJ • Enterprises usually need to prove Separation of Duties and Audit Trails • Most Data processing tasks should have an established process to ensure quality and correctness. – All Business have their own Custom Approach to Business Rules – Consolidating these transformations ensure quality. Allows QA Checking • Notebooks: – It’s really hard to enforce that consistency and correctness in notebooks, except by Compiling libraries. – Unifying Business Logic and Common Transformations removes the prep work. 21
  • 23. Tips and Tricks (Pretty Report) •plugin = {“pretty”, “html:target/cucumber”} isn’t very pretty •Use cucumber-reports 23 Verses
  • 24. Tips and Tricks (miscellaneous) • Java cuke > Scala cuke – intelliJ integration, speed, etc... • .config(“spark.driver.host”, “127.0.0.1”) – Saves overall test execution time for each run • Think hard about whether things should get tested at unit or component level – DAG takes longer to compute path on more complex DAGs (e.g. longer tests), but can provide more value 24
  • 25. Tips and Tricks (Code Coverage for Scala Cuke) 25
  • 26. Resources, Resources, Resources • cucumber.io • cucumber-reporting • Pragmatic Books – Cucumber Book – Cucumber Recipes – Cucumber for Java • Specification by Example • Databricks Blog 26