Meetup Spark UDF performance

Welcome
Chicago Data Engineering Meetup
- Our First Event – November 2018
- Objectives
- Every 2 months
- Format
- sharing experiences (open for volunteers)
- new tools / demos
- Open for suggestions

Agenda
01 Who I am
02 QuantumBlack
03 Today’s topic: Spark UDF Performance
04 Background
05 Benchmarking – Live demo
06 Conclusion and Our Approach
07 Q&A

All content copyright © 2017 QuantumBlack, a McKinsey company
Client case studies
Experience across several industry sectors,
including telecoms, retail, financial services and
pharmaceuticals.
Financial sector – Advanced Analytics
projects for Fraud detection in Internet Banking
and Credit Risk Modelling.
Telecommunications – Petabyte scale
environment, delivering several use cases,
including: real-time failure detection using CDR
data, customer profiling and marketing
campaigns.
Manufacturing – data wrangling in a failure
detection project for computer parts
manufacturing in Europe.
Pharmaceuticals – Site selection optimisation
for a top pharma player.
Telematics (Car insurance) – machine learning
model that estimates the probability of crashing
for each driver, based on data obtained from
on-board units installed in cars, containing
geo-location positions, speed and acceleration
of ~2 million drivers over a 2-year period.
Complex feature creation using terabyte-scale
data and external data sources such as weather,
street and traffic data.
Education
Guilherme has a BSc in Data Processing from
Mackenzie University and specialisations in
Machine Learning and Business Intelligence.
Role
Big Data technology expert based in Chicago.
Works with clients to translate business
hypotheses into data requirements and
technology solutions.
Expertise
Provides technical data engineering oversight
on projects and advises other data engineers
on architecture definition and performance
optimization for large-scale data wrangling.
Professional experience
Prior to joining QuantumBlack, Guilherme
specialised for over 18 years in Data
Warehouse and Business Intelligence projects
in large-scale environments. More recently, 6
years’ experience in Big Data projects and
architecture, many of them at petabyte scale, as
well as real-time projects.
Previously led big data projects at Hortonworks,
SAP and large financial institutions.
BIOGRAPHY
Guilherme Braccialli
Principal Data Engineer, QuantumBlack,
Chicago

QB exploit data, analytics and
design to help our clients be the
best they can be
We were born and proven in
Formula One, where the smallest
margins are the difference
between winning and losing and
data has emerged as a
fundamental element of
competitive advantage
QuantumBlack

In elite sport the
smallest edge makes
the difference,
and the best teams
exploit this to outlearn
their rivals

Since then, we have applied our proven
methodology across multiple sectors
• Advanced Industries
‒ Aerospace
‒ Automotive
‒ Semi-Conductors
‒ Urban Infrastructure
• Financial Services
‒ Asset Management
‒ Payment Networks
‒ Private Banking
‒ Retail Banking
• Health & Wellbeing
‒ Hospitals
‒ Medical Devices
‒ Pharmaceutical
• Natural Resources
‒ Oil & Gas
‒ Mining
‒ Renewable Energy
‒ Utilities
• Sports
‒ Basketball
‒ Baseball
‒ Formula One
‒ Soccer

Spark UDF Performance
03
- Share our learnings
- Running Spark at scale
- Practical Examples
- Live demo (code)

Why we use Spark
BACKGROUND
• Open Source
‒ We are a consulting company; we enable our clients to use Advanced Analytics
‒ We don’t sell an out-of-the-box solution or licensing
‒ Clients can run it anywhere because we use open source tools
• Scalable
‒ We deal with big data volumes
‒ Multiple TBs of data
‒ Spark offers several options for running in distributed mode (Hadoop, Kubernetes, Standalone)
• Flexibility and Integration
‒ Supports multiple languages: Python, SQL, Scala, Java, R
‒ Batch, Streaming, Graph, Machine Learning
‒ Easy to integrate with Data Scientist code in a single data pipeline

Where we run
BACKGROUND
• In the Cloud
‒ AWS (EMR)
‒ Azure (HDInsight)
‒ Google Cloud (Dataproc)
‒ Databricks (AWS or Azure)
• On-premises
‒ Some clients have their own internal Hadoop cluster on premises

Why PySpark / Performance implications
BACKGROUND
• PySpark is the best choice for integrating Data Engineering and Data Science work in one data pipeline
• Same performance for DataFrame operations (PySpark is a wrapper that runs native Scala code)
• Performance hit when we use UDFs (execution makes a round trip: Scala → Python → Scala)
• Pandas UDFs (Vectorized UDFs) + Arrow
‒ Nov/2017 – Spark 2.3
https://www.twosigma.com/insights/article/introducing-vectorized-udfs-for-pyspark/
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
‒ But… where are the Scala numbers?
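To make the two PySpark UDF flavours above concrete, here is a minimal sketch, not the code from the talk. It assumes PySpark 2.3 or later with pandas and PyArrow installed; the function names (`plus_one`, `plus_one_vectorized`, `make_udfs`) are hypothetical and chosen only for illustration.

```python
def plus_one(x):
    # Row-at-a-time logic: as a regular PySpark UDF, the Python worker
    # calls this once per row.
    return None if x is None else x + 1


def plus_one_vectorized(series):
    # Batch logic: as a Pandas UDF, Spark passes a pandas.Series
    # (transferred via Arrow) and expects a Series of the same length.
    return series + 1


def make_udfs():
    """Wrap both functions as Spark UDFs.

    PySpark is imported lazily here (an illustrative choice) so the plain
    functions above can be exercised without a Spark installation.
    """
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import LongType

    row_udf = udf(plus_one, LongType())
    vec_udf = pandas_udf(plus_one_vectorized, LongType())
    return row_udf, vec_udf
```

With a running session, something like `row_udf, vec_udf = make_udfs()` followed by `spark.range(10).select(row_udf("id"), vec_udf("id"))` would apply both variants to the same column. In both cases the data still leaves the JVM for a Python worker and comes back, which is the round trip the slide refers to.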

Databricks Notebook – (try on Community Edition)
LIVE DEMO
https://bit.ly/2E4ehIm

Conclusion and Our Approach
06

Best of both worlds: PySpark with Scala performance
CONCLUSION AND OUR APPROACH
• Conclusion
‒ PySpark Pandas UDFs (Vectorized UDFs) can be faster than regular PySpark UDFs, but not ALWAYS
‒ PySpark UDFs (vectorized or not) are much slower than Scala UDFs
• Our Approach
‒ We use PySpark UDFs when the data volume is not big, or for quick insights on sample data
‒ We built an internal library of reusable Scala UDFs
‒ We created Python wrappers to call the Scala UDFs
‒ Demo
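One way such Python wrappers could look is sketched below. This is not the internal library itself: the registry `SCALA_UDFS`, the class name `com.example.udfs.PlusOne`, and both helper functions are hypothetical. The sketch assumes the Scala class implements one of Spark's `org.apache.spark.sql.api.java.UDF1` to `UDF22` interfaces and that its JAR is already on the cluster classpath. The only Spark call used is the public `spark.udf.registerJavaFunction` (available since Spark 2.3).

```python
# Hypothetical registry: SQL function name -> JVM class name.
# The class name is a placeholder, not a real library.
SCALA_UDFS = {
    "plus_one_scala": "com.example.udfs.PlusOne",
}


def register_scala_udfs(spark, udfs=SCALA_UDFS):
    """Register each JVM UDF under its SQL name and return the sorted names."""
    for name, java_class in udfs.items():
        # The return type is inferred by reflection when it is not passed.
        spark.udf.registerJavaFunction(name, java_class)
    return sorted(udfs)


def scala_udf_expr(name, *columns, alias=None):
    """Build a SQL expression string that calls a registered UDF."""
    expr = "{}({})".format(name, ", ".join(columns))
    return "{} AS {}".format(expr, alias) if alias else expr
```

After `register_scala_udfs(spark)`, a call such as `df.selectExpr("*", scala_udf_expr("plus_one_scala", "id", alias="id_plus_one"))` runs the function entirely inside the JVM, so rows are never shipped to a Python worker. Building SQL expression strings keeps the sketch on public APIs only.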

Thank you!
- Would you like to share your
experiences at upcoming events?
and…
- We are hiring!!!
