Welcome
Chicago Data Engineering Meetup
- Our First Event – November 2018
- Objectives
- Every 2 months
- Format
- sharing experiences (open for volunteers)
- new tools / demos
- Open for suggestions
Agenda
01 Who I am
02 QuantumBlack
03 Today’s topic: Spark UDF Performance
04 Background
05 Benchmarking – Live demo
06 Conclusion and Our Approach
07 Q&A
01 Who I am
All content copyright © 2017 QuantumBlack, a McKinsey company
Client case studies
Experience across several industry sectors,
including telecoms, retail, financial services and
pharmaceuticals.
Financial sector – Advanced Analytics
projects for Fraud detection in Internet Banking
and Credit Risk Modelling.
Telecommunications – Petabyte scale
environment, delivering several use cases,
including: real-time failure detection using CDR
data, customer profiling and marketing
campaigns.
Manufacturing – data wrangling in a failure
detection project for computer parts
manufacturing in Europe.
Pharmaceuticals – Site selection optimisation
for a top pharma player.
Telematics (Car insurance) – machine learning
model that estimates the probability of crashing
for each driver, based on data obtained from
on-board unit boxes installed in cars, containing
geo-location positions, speed and acceleration
of ~2 million drivers over a 2-year period.
Complex feature creation using terabyte-scale
data and external data sources such as weather,
street and traffic data.
Education
Guilherme has a BSc in Data Processing from
Mackenzie University and specialisations in
Machine Learning and Business Intelligence.
Role
Big Data technology expert based in Chicago.
Works with clients to translate business
hypotheses into data requirements and
technology solutions.
Expertise
Provides technical data engineering oversight
on projects and advises other data engineers
on architecture definition and performance
optimization for large-scale data wrangling.
Professional experience
Prior to joining QuantumBlack, Guilherme
specialised for over 18 years in Data
Warehouse and Business Intelligence projects
in large-scale environments. More recently, he
has gained 6 years of experience in Big Data
projects and architecture, many of them at
petabyte scale, as well as real-time projects.
Previously led big data projects at Hortonworks,
SAP and large financial institutions.
BIOGRAPHY
Guilherme Braccialli
Principal Data Engineer, QuantumBlack,
Chicago
02 QuantumBlack
QB exploit data, analytics and
design to help our clients be the
best they can be
We were born and proven in
Formula One, where the smallest
margins are the difference
between winning and losing and
data has emerged as a
fundamental element of
competitive advantage
QuantumBlack
In elite sport the
smallest edge makes
the difference,
and the best teams
exploit this to outlearn
their rivals
Since then, we have applied our proven
methodology across multiple sectors
• Advanced Industries
‒ Aerospace
‒ Automotive
‒ Semi-Conductors
‒ Urban Infrastructure
• Financial Services
‒ Asset Management
‒ Payment Networks
‒ Private Banking
‒ Retail Banking
• Health & Wellbeing
‒ Hospitals
‒ Medical Devices
‒ Pharmaceutical
• Natural Resources
‒ Oil & Gas
‒ Mining
‒ Renewable Energy
‒ Utilities
• Sports
‒ Basketball
‒ Baseball
‒ Formula One
‒ Soccer
03 Spark UDF Performance
- Share our learnings
- Running Spark at scale
- Practical Examples
- Live demo (code)
04 Background
Why we use Spark
BACKGROUND
• Open Source
‒ We are a consulting company; we enable our clients to use Advanced Analytics
‒ We don’t sell an out-of-the-box solution or licensing
‒ Clients can run it anywhere, because we use open source tools
• Scalable
‒ We deal with big data volumes
‒ Multiple TBs of data
‒ Spark has several options for running in distributed mode (Hadoop, Kubernetes, Standalone)
• Flexibility and Integration
‒ Supports multiple languages: Python, SQL, Scala, Java, R
‒ Batch, Streaming, Graph, Machine Learning
‒ Easy to integrate with Data Scientist code in a single data pipeline
Where we run
BACKGROUND
• In the Cloud
‒ AWS (EMR)
‒ Azure (HDInsight)
‒ Google Cloud (Dataproc)
‒ Databricks (AWS or Azure)
• On-premises
‒ Some clients have their own internal Hadoop cluster on premises
Why PySpark / Performance implications
BACKGROUND
• PySpark is the best choice for integrating Data Engineering and Data Science work in a single data pipeline
• Same performance for DataFrame operations (PySpark is a wrapper that runs native Scala code)
• Performance hit when we use Python UDFs (data is passed from the Scala executor to a Python worker and back: Scala → Python → Scala)
• Pandas UDFs (Vectorized UDFs) + Arrow
‒ Announced Oct/2017, shipped in Spark 2.3 (released Feb/2018)
https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
‒ but… where are the Scala numbers?
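
The two Python UDF styles named on this slide can be sketched as follows. This is a minimal illustration, not code from the demo notebook: the function names and the "add one" logic are hypothetical. The regular UDF is called once per row, while the Pandas UDF is called once per Arrow batch and receives a whole pandas.Series.

```python
# Sketch only: names and logic are hypothetical, not taken from the talk.

def plus_one_row(x):
    """Row-at-a-time logic: Spark calls this once per row with a Python value."""
    return None if x is None else x + 1.0


def plus_one_batch(values):
    """Vectorized logic: Spark calls this once per batch with a pandas.Series."""
    return values + 1.0


def build_udfs():
    """Wrap both functions as Spark UDFs (needs pyspark; the Pandas UDF also needs pyarrow)."""
    # Imported lazily so the plain functions above can be exercised without Spark.
    from pyspark.sql.functions import PandasUDFType, pandas_udf, udf
    from pyspark.sql.types import DoubleType

    row_udf = udf(plus_one_row, DoubleType())
    # Explicit SCALAR function type is the Spark 2.3-era form; Spark 3.x still
    # accepts it but prefers Python type hints.
    batch_udf = pandas_udf(plus_one_batch, DoubleType(), PandasUDFType.SCALAR)
    return row_udf, batch_udf
```

Both UDFs are then applied the same way, for example `df.select(row_udf("v"), batch_udf("v"))` on a DataFrame `df` with a numeric column `v` (both names assumed here).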
05 Benchmarking – Live Demo
LIVE DEMO
Databricks Notebook (try it on the Community Edition)
https://bit.ly/2E4ehIm
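
For readers who cannot open the notebook, a benchmark of this kind could be timed with a small helper like the one below. This is an assumed harness, not the notebook's code: `time_action` is a hypothetical name, and it simply reports the best wall-clock time over a few repeats.

```python
import time


def time_action(action, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of `action`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()  # must force the work to actually run
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Because Spark evaluates lazily, the callable passed in has to end in an action (for example a `count()` or `collect()` on the DataFrame that applies the UDF); otherwise only query planning is measured.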
06 Conclusion and Our Approach
Best of both worlds: PySpark with Scala performance
CONCLUSION AND OUR APPROACH
• Conclusion
‒ Pandas (vectorized) UDFs in PySpark can be faster than regular PySpark UDFs, but not ALWAYS
‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
• Our Approach
‒ We use PySpark UDFs when the data volume is small, or for quick insights on sample data
‒ We built an internal library of reusable Scala UDFs
‒ We created Python wrappers to call the Scala UDFs
‒ Demo
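
One way a Python wrapper around a Scala UDF might look is sketched below. The internal library itself is not shown in the slides, so everything here is an assumption: the helper names are hypothetical, and the approach uses Spark's public `spark.udf.registerJavaFunction` (available since Spark 2.3), which expects a JVM class implementing one of the `org.apache.spark.sql.api.java` UDF interfaces to be on the cluster classpath already.

```python
# Sketch only: helper names are hypothetical; this is not the internal library.

def scala_udf_sql(function_name, *column_names):
    """Build the SQL text that calls a registered UDF on the given columns."""
    quoted = ", ".join(
        "`{}`".format(name.replace("`", "``")) for name in column_names
    )
    return "{}({})".format(function_name, quoted)


def register_scala_udf(spark, function_name, java_class_name, return_type):
    """Expose a JVM UDF class to Spark SQL under `function_name`."""
    # `return_type` is a pyspark.sql.types.DataType such as DoubleType().
    spark.udf.registerJavaFunction(function_name, java_class_name, return_type)


def call_scala_udf(function_name, *column_names):
    """Return a Column that evaluates the registered UDF inside the JVM."""
    # Imported lazily so the string helper above works without Spark installed.
    from pyspark.sql import functions as F

    return F.expr(scala_udf_sql(function_name, *column_names))
```

With a hypothetical class `com.example.udfs.PlusOne`, usage would be `register_scala_udf(spark, "plus_one_scala", "com.example.udfs.PlusOne", DoubleType())` followed by `df.select(call_scala_udf("plus_one_scala", "v"))`. Rows never leave the JVM, which is the point of the approach on this slide.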
07 Q&A
Thank you!
- Would you like to share your
experiences at upcoming events?
and…
- We are hiring!!!
