Spark:
The Good, the Bad, and the Ugly
Sarah Guido
Data Day Texas 2016
About this talk
About me & Bitly
Spark journey to date
The future!
The good, the bad, and the ugly
About this talk
This talk is
• My own experience using Spark
• New toys come with caveats
This talk is not
• Ground truth
• An argument for or against using Spark
About me!
• Lead data scientist at Bitly (bitly.is/hiring)
• NYC Python meetup organizer
• Spark user
• @sarah_guido
SPARK JOURNEY
The stage
“Don’t you guys just, like, shorten links?”
• Need for big data analysis tools
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
1 HOUR OF DECODES
10 GB
1 DAY OF DECODES
240 GB
1 MONTH OF DECODES
~7 TB
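The three figures line up with each other, as a quick back-of-the-envelope check suggests. This is only a sketch: it assumes a steady 10 GB per hour, a 30-day month, and decimal units (1 TB = 1000 GB), none of which the slide states.

```python
# Rough sanity check of the decode volumes on the slide.
# Assumptions (not from the slide): constant hourly rate,
# 30-day month, 1 TB = 1000 GB.
GB_PER_HOUR = 10

gb_per_day = GB_PER_HOUR * 24
tb_per_month = gb_per_day * 30 / 1000

print(f"1 day:   {gb_per_day} GB")
print(f"1 month: {tb_per_month:.1f} TB")
```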
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10",
"c": "US",
"nk": 0,
"tz": "America/Los_Angeles",
"g": "1HfTjh8",
"h": "1HfTjh7",
"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share",
"t": 1427288425,
"cy": "Seattle"}
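To give a feel for what working with one of these records involves, here is a minimal sketch in plain Python (no Spark). The `summarize` helper is hypothetical, and the readings of the short keys ("c" as country, "cy" as city, "u" as the destination URL, "t" as a Unix timestamp) are guesses from the values shown, not something the talk documents. The user agent is shortened for space.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse

# One decode record, abbreviated from the slide (user agent shortened).
raw = '''{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2)",
 "c": "US", "nk": 0, "tz": "America/Los_Angeles",
 "g": "1HfTjh8", "h": "1HfTjh7",
 "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share",
 "t": 1427288425, "cy": "Seattle"}'''


def summarize(line):
    """Pull a few readable fields out of one JSON-encoded decode."""
    record = json.loads(line)
    return {
        # Field meanings are assumptions inferred from the sample values.
        "country": record.get("c"),
        "city": record.get("cy"),
        "domain": urlparse(record["u"]).netloc,
        # Treat "t" as seconds since the Unix epoch, in UTC.
        "clicked_at": datetime.fromtimestamp(record["t"], tz=timezone.utc),
    }


summary = summarize(raw)
print(summary["domain"], summary["clicked_at"].isoformat())
```

A function shaped like this is the kind of thing that would later be applied to every line of a much larger file.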
Spark: the Why
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology
Spark: the What
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python, Scala, Java, R APIs
Spark: the How
• Partitions your data to operate over in parallel
• Capability to add map/reduce features
• Lazy – transformations only run when an action is called (e.g. collect() or writing to a file)
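The three points above can be illustrated with a toy model in plain Python. To be clear, this is not Spark's API or implementation: `LazyDataset` and `partition` are made-up names, and everything runs in one process. It only sketches the idea that transformations are recorded, and that work happens per partition once an action such as `collect()` or `reduce()` is called.

```python
from functools import reduce


def partition(items, n):
    """Split a list into n chunks (stand-ins for Spark partitions)."""
    return [items[i::n] for i in range(n)]


class LazyDataset:
    """Toy dataset: records transformations, runs them only on an action."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions
        self.steps = steps  # pending (kind, function) pairs; nothing has run yet

    def map(self, fn):
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.partitions, self.steps + (("filter", fn),))

    def _run(self, part):
        # Apply every recorded step to a single partition.
        for kind, fn in self.steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # Action: process each partition independently, then concatenate.
        return [x for part in self.partitions for x in self._run(part)]

    def reduce(self, fn):
        # Action: reduce within each non-empty partition, then combine.
        results = [self._run(part) for part in self.partitions]
        partials = [reduce(fn, r) for r in results if r]
        return reduce(fn, partials)


calls = []


def square(x):
    calls.append(x)  # record that real work happened
    return x * x


data = LazyDataset(partition(list(range(10)), 3))
pending = data.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: only a plan exists so far

total = pending.reduce(lambda a, b: a + b)  # the action triggers the work
print(calls_before_action, len(calls), total)
```

In real Spark the partitions live on different machines and the plan is optimized before it runs, but the "nothing happens until an action" behaviour is the part that tends to surprise newcomers.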
SPARK 1.2: First exploration of Spark during hack week
SPARK 1.3: DataFrames, AWS config
SPARK 1.4: Decision to use Scala, official AWS support
SPARK 1.5: Prototype of trend detection
Spark environment
• From Hadoop to AWS
• Runs on AWS EMR clusters
• Reads data from S3
• Special config
THE GOOD
Speed
Both development time and data extraction
Hadoop vs. Spark
Easy to prototype
Building a working Spark program is relatively simple.*
* In simple circumstances.
Integration with Jupyter Notebook
THE BAD
Updates are major and frequent
AWS Spark version lag
Documentation
• Lack of project maturity
• Simple examples
• Books end up being obsolete
• Can’t just Stack Overflow it!
THE UGLY
DataFrames
DataFrames are appealing to data scientists familiar with Python/Pandas, but their lack of flexibility makes them difficult to use.
Python vs. Scala: the Saga
Python (I know Python!)
Scala (Python API lacks parity with Scala)
Python (But I KNOW Python and can code faster!)
Scala (Lack of parity, weird undebuggable Python errors?!)
THE FUTURE
My hopes
• Python API catches up with Scala – lowers the barrier to entry
• Community (and therefore knowledge base) grows
• More people talk about using Spark
• Best practices
• Used in projects/stack at Bitly
I want to try
• Layer between NSQ and Spark – Cassandra?
• Tool like H2O
• Zeppelin (again)
• Spark Slack channel?
Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users
THANK YOU
@sarah_guido
