Apache Spark Usage in the
Open Source Ecosystem
Hossein Falaki
@mhfalaki
About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks
Stack Overflow 2016 trending tech
Apache Spark Philosophy
1. Unified engine
Support end-to-end applications
2. High-level APIs
Easy to use, rich optimizations
3. Integrate broadly
Storage systems, libraries, etc.
[Diagram: SQL, Streaming, ML, Graph, …]
Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users imported to complement Apache Spark in Scala, Python, and R
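The slides do not describe how imports were collected from notebooks. As a minimal sketch of one way to do it for Python cells, the standard-library `ast` module can list the top-level packages a piece of source imports. The function name `imported_packages`, the sample cell, and the choice to treat `pyspark` as "Spark itself" are assumptions made for illustration, not part of the original analysis.

```python
import ast


def imported_packages(source):
    """Return the set of top-level package names imported by a Python source string."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" counts as package "a"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x"
            if node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages


# Hypothetical notebook cell, used only to exercise the function.
cell = """
import re
import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark.sql import functions as F
"""

# Assumption: anything under "pyspark" is Spark itself, not an external library.
external = imported_packages(cell) - {"pyspark"}
print(sorted(external))  # ['numpy', 're', 'sklearn']
```

Cells that are not valid Python 3 (for example Python 2 code or cells starting with notebook magic commands) would raise `SyntaxError` in `ast.parse`, so a real pipeline would need to handle those separately.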
What % of users use other libraries
Language   % users importing external libs   Average # libs   Median # libs
Python     75%                               9                2
Scala      55%                               3                1
R          57%                               6                1
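The three columns above can be computed from a per-user set of imported libraries. The sketch below shows one possible calculation; the data is made up, the function name `library_usage_summary` is hypothetical, and it assumes the average and median are taken over users who import at least one external library, which the slides do not state.

```python
from statistics import mean, median

# Hypothetical data: user id -> external libraries imported across that user's notebooks.
libs_by_user = {
    "u1": {"pandas", "numpy", "matplotlib"},
    "u2": set(),
    "u3": {"re"},
    "u4": {"json", "csv", "re", "datetime", "math"},
}


def library_usage_summary(libs_by_user):
    """Summarize external-library usage for one language."""
    # Assumption: only users with at least one external library enter the averages.
    counts = [len(libs) for libs in libs_by_user.values() if libs]
    if not counts:
        return {"pct_users_importing": 0.0, "average_libs": 0, "median_libs": 0}
    return {
        "pct_users_importing": 100 * len(counts) / len(libs_by_user),
        "average_libs": mean(counts),
        "median_libs": median(counts),
    }


summary = library_usage_summary(libs_by_user)
print(summary)
```

An average well above the median, as in the Python row, is what one would expect when a small number of users import many libraries while most import only a few.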
Installing libraries is easy
Python Packages
Most popular Python packages
What is test_helper?
What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2
Visualization
• matplotlib
• ggplot
• seaborn
Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy
Other
• test_helper
• os
• md5
Python package categories
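The Python package categories listed above can be encoded as a lookup table and used to tally imports per category. This is only a sketch: the category assignments are copied from the slide, while the names `CATEGORIES`, `category_counts`, the "Uncategorized" bucket, and the sample list are assumptions for illustration.

```python
from collections import Counter

# Category -> packages, as listed on the slide.
CATEGORIES = {
    "ETL": ["re", "datetime", "pandas", "json", "csv", "string",
            "math", "operator", "urllib", "urllib2"],
    "Visualization": ["matplotlib", "ggplot", "seaborn"],
    "Advanced analytics": ["numpy", "sklearn", "graphframes", "tensorflow", "scipy"],
    "Other": ["test_helper", "os", "md5"],
}

# Invert the table so each package maps to its category.
PACKAGE_TO_CATEGORY = {pkg: cat for cat, pkgs in CATEGORIES.items() for pkg in pkgs}


def category_counts(imports):
    """Count imports per category; unknown packages go to 'Uncategorized'."""
    return Counter(PACKAGE_TO_CATEGORY.get(pkg, "Uncategorized") for pkg in imports)


# Hypothetical list of imports seen in one notebook.
sample = ["re", "pandas", "numpy", "matplotlib", "requests"]
counts = category_counts(sample)
print(counts["ETL"], counts["Visualization"], counts["Uncategorized"])  # 2 1 1
```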
What packages go together?
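The slide does not say how "going together" was measured. One simple possibility is to count how often each pair of packages is imported in the same notebook. The sketch below does that with `itertools.combinations`; the function name `pair_counts` and the sample notebooks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(notebooks):
    """Count how many notebooks import each unordered pair of packages."""
    counts = Counter()
    for packages in notebooks:
        # Sort so each pair has one canonical order, e.g. ('numpy', 'pandas').
        for pair in combinations(sorted(set(packages)), 2):
            counts[pair] += 1
    return counts


# Hypothetical notebooks, each given as the set of packages it imports.
notebooks = [
    {"numpy", "pandas", "matplotlib"},
    {"numpy", "pandas"},
    {"re", "json"},
]

print(pair_counts(notebooks).most_common(1))  # [(('numpy', 'pandas'), 2)]
```

Raw pair counts favor packages that are popular overall, so a real analysis might normalize by how often each package appears on its own.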
Scala Packages
Most popular Scala libraries
What are these?
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• o.a.commons
• kafka
• twitter4j
Visualization
• ?
Advanced analytics
• spark.ml
• graphframes
Other
• java.net
• scala.sys
Scala package categories
What libraries go together?
R Packages
Most popular R packages
What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table
Visualization
• ggplot2
• beanplot
• plotly
• ...
Advanced analytics
• sparkr
• h2o
• caret
• e1071
Other
• devtools
• magrittr
R package categories
Comparing Python, Scala & R
Languages have unique features
[Figure panels: Scala / Python / R; R / Python; Scala / Python / R]
• 25% of users use multiple languages
• 3% of notebooks mix different languages
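How a notebook was classified as mixing languages is not stated in the slides. A rough sketch, assuming that a cell can switch language by starting with a magic command such as `%python`, `%scala`, or `%r` and otherwise uses the notebook's default language, might look like this. The function name, the magic-command table, and the sample cells are all assumptions for illustration.

```python
# Assumed mapping from a leading magic command to a language name.
LANGUAGE_MAGICS = {"%python": "python", "%scala": "scala", "%r": "r"}


def notebook_languages(default_language, cells):
    """Return the set of languages used by the cells of one notebook."""
    languages = set()
    for cell in cells:
        tokens = cell.split(None, 1)
        if not tokens:
            continue  # skip empty cells
        first = tokens[0]
        if first in LANGUAGE_MAGICS:
            languages.add(LANGUAGE_MAGICS[first])
        elif not first.startswith("%"):
            # No magic command: the cell runs in the notebook's default language.
            languages.add(default_language)
        # Other magics (for example a markdown cell) are not counted as a language.
    return languages


# Hypothetical notebook with Python as its default language.
cells = ["import re", "%scala\nval x = 1", "%md Some notes"]
langs = notebook_languages("python", cells)
print(sorted(langs), len(langs) > 1)  # ['python', 'scala'] True
```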
Summary
• Spark users extensively mix it with other packages in different languages
– One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
– Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data.”
• Languages and their ecosystems have diverse capabilities. Users seem to be mixing languages to their advantage
– Scala is missing visualization libraries
Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you!
What packages are used together?
