Spark: The Good, the Bad, and the Ugly

Sarah Guido's talk at Data Day Texas 2016 shares her experiences and insights using Apache Spark, highlighting its advantages such as speed and ease of prototyping, as well as challenges like frequent updates and documentation issues. She discusses the importance of adopting Spark for big data analysis and the ongoing struggle between using Python and Scala due to discrepancies in features. Ultimately, Guido expresses hope for the future growth of the Spark community and improved resources for users.

Spark:
The Good, the Bad, and the Ugly
Sarah Guido
Data Day Texas 2016

• About this talk
• About me & Bitly
• Spark journey to date
• The good, the bad, and the ugly
• The future!

About this talk
This talk is
• My own experience using Spark
• New toys come with caveats
This talk is not
• Ground truth
• An argument for or against using Spark

About me!
• Lead data scientist at Bitly (bitly.is/hiring)
• NYC Python meetup organizer
• Spark user
• @sarah_guido

The stage
“Don’t you guys just, like, shorten links?”

The stage
• Need for big data analysis tools
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!

• 1 hour of decodes: 10 GB
• 1 day of decodes: 240 GB
• 1 month of decodes: ~7 TB
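The three figures above are consistent with each other, which a quick back-of-the-envelope check shows. This is only a sketch: the 30-day month and the decimal conversion (1 TB = 1000 GB) are assumptions, not something stated on the slide.

```python
# Scale the hourly decode volume from the slide up to a day and a month.
GB_PER_HOUR = 10          # from the slide
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30       # assumption: a 30-day month
GB_PER_TB = 1000          # assumption: decimal units

gb_per_day = GB_PER_HOUR * HOURS_PER_DAY        # matches the 240 GB figure
gb_per_month = gb_per_day * DAYS_PER_MONTH
tb_per_month = gb_per_month / GB_PER_TB         # close to the "~7 TB" figure

print(f"{gb_per_day} GB/day, {tb_per_month:.1f} TB/month")
```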

Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
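To make the shape of a decode record concrete, here is a minimal sketch that parses a trimmed copy of the sample with Python's standard library. The slide does not define the keys, so their meanings here are guesses: "c" as country, "cy" as city, "u" as the destination URL, and "t" as a Unix timestamp in seconds.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse

# Trimmed copy of the sample decode record (user agent and hash fields omitted).
raw = (
    '{"c": "US", "tz": "America/Los_Angeles", '
    '"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html?smid=tw-share", '
    '"t": 1427288425, "cy": "Seattle"}'
)

record = json.loads(raw)

# Assumption: "t" is seconds since the Unix epoch.
clicked_at = datetime.fromtimestamp(record["t"], tz=timezone.utc)

# Assumption: "u" is the long URL the short link points to.
domain = urlparse(record["u"]).netloc

print(record["cy"], record["c"], domain, clicked_at.isoformat())
```

Pulling out the domain and the click time like this is the kind of per-record step that an analysis of "how people use the Internet" would start from.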

Spark: the Why
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting edge technology

Spark: the What
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop
• Python, Scala, Java, R APIs

Spark: the How
• Partitions your data to operate over in parallel
• Capability to add map/reduce features
• Lazy – transformations only run when an action is called (e.g. collect() or writing to a file)
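The three ideas above (partitioning, map/reduce, laziness) can be imitated in plain Python. This is an analogy only, not Spark's API: the helper names partition and lazy_map are made up for illustration, generators stand in for lazy transformations, and everything runs in one process rather than in parallel.

```python
from functools import reduce

calls = []  # records every element the map function actually touches


def partition(records, n):
    """Split records into n roughly equal partitions (hypothetical helper)."""
    return [records[i::n] for i in range(n)]


def lazy_map(fn, part):
    """Generator-based map: no work happens until the result is consumed."""
    for item in part:
        calls.append(item)
        yield fn(item)


records = list(range(10))
parts = partition(records, 3)

# "Transformation": builds one pipeline per partition, but runs nothing yet.
pipelines = [lazy_map(lambda x: x * x, p) for p in parts]
work_before_action = len(calls)

# "Action": consuming the pipelines forces the map, then the partial
# results are reduced into a single value.
partial_sums = [sum(p) for p in pipelines]
total = reduce(lambda a, b: a + b, partial_sums)
work_after_action = len(calls)

print(work_before_action, work_after_action, total)
```

In Spark the partitions would live on different executors and the action would be something like collect() or a write, but the ordering is the same: defining the map costs nothing, and the work happens when a result is requested.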

Spark journey to date
• Spark 1.2: First exploration of Spark during hack week
• Spark 1.3: DataFrames, AWS config
• Spark 1.4: Decision to use Scala, official AWS support
• Spark 1.5: Prototype of trend detection

Spark environment
• From Hadoop to AWS
• Runs on AWS EMR clusters
• Reads data from S3
• Special config

Speed
Both development time and data extraction
Hadoop vs. Spark

Easy to prototype
Building a working Spark program is relatively simple.*
* In simple circumstances.

Documentation
• Lack of project maturity
• Simple examples
• Books end up being obsolete
• Can’t just Stackoverflow it!

DataFrames
DataFrames are appealing to data scientists familiar with Python/Pandas, but their lack of flexibility makes them difficult to use.

Python vs. Scala: the Saga
Python (I know Python!)
Scala (Python API lacks parity with Scala)
Python (But I KNOW Python and can code faster!)
Scala (Lack of parity, weird undebuggable Python errors?!)

My hopes
• Python catches up – barrier to entry
• Community (and therefore knowledge base) grows
• More people talk about using Spark
• Best practices
• Used in projects/stack at Bitly

I want to try
• Layer between NSQ and Spark – Cassandra?
• Tool like H2O
• Zeppelin (again)
• Spark slack channel?

Resources
• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users
