Putting Lipstick on Apache Pig
Big Data Gurus Meetup
August 14, 2013
Data should be accessible, easy to discover, and
easy to process for everyone.
Motivation
Big Data Users at Netflix
Analysts Engineers
Desires
Self Service
Easy
Rich Toolset Rich APIs
A Single Platform / Data Architecture that Serves Both Groups
Netflix Data Warehouse - Storage
S3 is the source of truth
Decouples storage from
processing.
Persistent data; multiple/
transient Hadoop clusters
Data sources
Event data from cloud
services via Ursula/Honu
Dimension data from
Cassandra via Aegisthus
~100 billion events processed
/ day
Petabytes of data persisted
and available to queries on
S3.
Netflix Data Platform - Processing
Long running clusters
SLA and ad hoc
Supplemental nightly
bonus clusters
For high priority ETL jobs
2,000+ instances in
aggregate across the
clusters
Netflix Hadoop Platform as a Service
S3
https://github.com/Netflix/genie
Netflix Data Platform – Primitive
Service Layer
Primitive, decoupled services
Building blocks for more
complicated
tools/services/apps
Serves 1000s of MapReduce
Jobs / day
100+ jobs concurrently
Netflix Data Platform – Tools
Sting (Adhoc Visualization)
Looper (Backloading)
Forklift (Data Movement)
Ignite (A/B Test Analytics)
Lipstick (Workflow Visualization)
Spock (Data Auditing)
Heavily utilize services in the
primitive layer.
Follow the same design
philosophy as primitive apps:
RESTful API
Decoupled javascript interfaces
Pig and Hive at Netflix
• Hive
– AdHoc queries
– Lightweight aggregation
• Pig
– Complex Dataflows / ETL
– Data movement “glue” between complex
operations
What is Pig?
• A data flow language
• Simple to learn
– Very few reserved words
– Comparable to a SQL logical query plan
• Easy to extend and optimize
• Extendable via UDFs written in multiple
languages
– Java, Python, Ruby, Groovy, Javascript
Sample Pig Script* (Word Count)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
* http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
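For readers who do not know Pig Latin, the same data flow can be sketched step by step in plain Python. This is only an illustration of what each statement does, not how Pig executes it: the function name `word_count` is made up here, and `str.split()` is a simplification of the delimiters that Pig's TOKENIZE uses.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig word-count script on an in-memory list of lines."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    # Assumption: whitespace splitting stands in for TOKENIZE.
    words = [word for line in lines for word in line.split()]
    # FILTER ... MATCHES '\\w+': MATCHES must match the whole string,
    # which is what re.fullmatch checks.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]
    # GROUP ... BY word, then COUNT per group.
    counts = Counter(filtered_words)
    # ORDER ... BY count DESC, emitting (count, word) like the script.
    return sorted(((n, w) for w, n in counts.items()), key=lambda row: -row[0])
```

Calling `word_count(["a b a", "b a !"])` returns `[(3, 'a'), (2, 'b')]`; the token `!` is dropped by the filter step.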
A Typical Pig Script
Pig…
• Data flows are easy & flexible to express in text
– Facilitates code reuse via UDFs and macros
– Allows logical grouping of operations vs grouping by order
of execution.
– But errors are easy to make and overlook.
• Scripts can quickly get complicated
• Visualization quickly draws attention to:
– Common errors
– Execution order / logical flow
– Optimization opportunities
Lipstick
• Generates graphical
representations of Pig data flows.
• Compatible with Apache Pig 0.11+
• Has been used to monitor more
than 25,000 Pig jobs at Netflix
Lipstick
[Screenshots: Lipstick job view, with callouts for overall job progress, the logical plan, logical operators (map side and reduce side), Map/Reduce job boundaries, intermediate row counts, records loaded, and Hadoop counters]
Lipstick for Fast Development
• During development:
– Keep track of data flow
– Spot common errors
• Omitted (hanging) operators
• Data type issues
– Easily estimate and optimize complexity
• Number of MR jobs generated
• Map only vs full Map/Reduce jobs
• Opportunities to rejigger logic to:
– Combine multiple jobs into a single job
– Manipulate execution order to achieve better parallelism (e.g.
less blocking)
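The "omitted (hanging) operator" error above is easy to state as a graph check: an alias is hanging if nothing that gets stored depends on it. Lipstick surfaces this visually; the sketch below is not its implementation, only a minimal illustration of the idea. The function name `dangling_aliases` and the dictionary representation of a script are assumptions made for this example.

```python
def dangling_aliases(inputs_of, stored):
    """Return aliases that never contribute to a STORE.

    inputs_of: dict mapping each alias to the list of aliases it reads from.
    stored:    iterable of aliases that are passed to STORE.
    (Both structures are hypothetical; they are not Pig or Lipstick types.)
    """
    reachable = set()
    stack = list(stored)
    # Walk backwards from every stored alias through its inputs.
    while stack:
        alias = stack.pop()
        if alias in reachable:
            continue
        reachable.add(alias)
        stack.extend(inputs_of.get(alias, []))
    # Anything defined but not reached is hanging.
    return sorted(set(inputs_of) - reachable)


flow = {
    "raw": [],
    "clean": ["raw"],
    "by_user": ["clean"],
    "debug_sample": ["clean"],  # defined, but never stored
}
print(dangling_aliases(flow, ["by_user"]))  # ['debug_sample']
```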
Lipstick for Job Monitoring
• During execution:
– Graphically monitor execution status from a single
console
– Spot optimization opportunities
• Map vs reduce side joins
• Data skew
• Better parallelism settings
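One simple way to think about spotting data skew from per-reducer record counts is to compare the busiest reducer with the typical one. The sketch below is an assumption-laden illustration, not a Lipstick feature: the function name `skew_ratio` and the idea of using max over median are choices made for this example.

```python
from statistics import median


def skew_ratio(records_per_reducer):
    """Ratio of the largest reducer input to the median reducer input.

    A value near 1.0 suggests an even spread; a large value suggests
    that one key (or a few keys) dominates. Expects a non-empty list.
    """
    typical = median(records_per_reducer)
    if typical == 0:
        return float("inf")
    return max(records_per_reducer) / typical
```

For example, `skew_ratio([100, 100, 100])` is `1.0`, while a list such as `[100, 110, 90, 1000]` gives a ratio above 9, which would be worth a closer look.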
Lipstick for Support
• Empowers users to support themselves
– Better operational visibility
• What is my script currently doing?
• Why is my script slow?
– Examine intermediate output of jobs
– All execution information in one place
• Facilitates communication between
infrastructure / support teams and end users
– Lipstick link contains all information needed to
provide support.
Lipstick Architecture
Pig 0.11+
lipstick-console.jar
Lipstick Server
(RESTful
Grails app)
Javascript Client
(Frontend GUI)
RDS
Persistence
Lipstick Architecture - Console
• Implements PigProgressNotificationListener interface
• Listens for:
1. New statements to be registered (unoptimized plan)
2. Script launched event (optimized, physical, M/R plan)
3. MR Job completion/failure event
4. Heartbeat progress (during execution)
• Pig Plans and Progress → Lipstick objects
• Communicates with Lipstick Server
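The event flow listed above can be sketched as a small listener that accumulates plans and status and hands them to the server. This is a hedged Python illustration of the flow only: the real console is a Java class implementing Pig's PigProgressNotificationListener, and every name below (`ConsoleSketch`, the `on_*` methods, the payload fields, the `send` callable) is hypothetical.

```python
import json


class ConsoleSketch:
    """Hypothetical stand-in for the console; not the Lipstick or Pig API."""

    def __init__(self, script_id, send):
        self.script_id = script_id
        self.send = send  # callable(verb, json_body); stands in for HTTP
        self.plans = {}
        self.status = {"jobs": {}, "progress": 0}

    def on_statements_registered(self, unoptimized_plan):
        # 1. New statements registered: keep the unoptimized plan.
        self.plans["unoptimized"] = unoptimized_plan

    def on_script_launched(self, optimized_plan, mr_plan):
        # 2. Script launched: the remaining plans are known, post them once.
        self.plans.update(optimized=optimized_plan, mapreduce=mr_plan)
        self._emit("POST", {"id": self.script_id, "plans": self.plans})

    def on_job_finished(self, job_id, succeeded):
        # 3. MR job completion or failure.
        self.status["jobs"][job_id] = "succeeded" if succeeded else "failed"
        self._emit("PUT", {"id": self.script_id, "status": self.status})

    def on_heartbeat(self, percent):
        # 4. Heartbeat progress during execution.
        self.status["progress"] = percent
        self._emit("PUT", {"id": self.script_id, "status": self.status})

    def _emit(self, verb, payload):
        # Convert the accumulated state into a serialized object for the server.
        self.send(verb, json.dumps(payload, sort_keys=True))
```

Passing `send` in as a callable keeps the sketch independent of any HTTP library and makes the event-to-message translation easy to inspect.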
Pig Compilation Plans
Pig Script
→ Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
→ Optimized Logical Plan
→ Physical Plan
→ MapReduce Plan (grouping of Physical Operators into map or reduce jobs)
Lipstick associates Logical Operators
with MapReduce jobs by inferring
relationships between Logical and
Physical Operations.
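To make the association concrete, here is a minimal sketch under strong assumptions: it presumes we already know which physical operators each logical operator compiled into, and which physical operators sit in the map or reduce phase of each job. How Lipstick actually infers these relationships is not shown in the slides; the function name `assign_operators` and both input shapes are invented for illustration.

```python
def assign_operators(logical_to_physical, mr_jobs):
    """Map each logical operator to the (job, phase) pairs it runs in.

    logical_to_physical: dict of logical op id -> list of physical op ids.
    mr_jobs: dict of job id -> {"map": [physical ids], "reduce": [physical ids]}.
    """
    # Index every physical operator by the job and phase that contains it.
    location = {}
    for job_id, phases in mr_jobs.items():
        for phase, physical_ops in phases.items():
            for op in physical_ops:
                location[op] = (job_id, phase)
    # A logical operator lives wherever its physical operators live.
    return {
        logical_op: sorted({location[op] for op in physical_ops if op in location})
        for logical_op, physical_ops in logical_to_physical.items()
    }


plan = {"LOAD": ["p1"], "GROUP": ["p2", "p3"], "STORE": ["p4"]}
jobs = {"job_1": {"map": ["p1", "p2"], "reduce": ["p3", "p4"]}}
print(assign_operators(plan, jobs)["GROUP"])
# [('job_1', 'map'), ('job_1', 'reduce')]
```

In this toy input the GROUP operator spans both phases of the same job, which is the kind of map-side versus reduce-side placement the graph view makes visible.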
Lipstick Architecture - Server
• Simple REST interface
• It’s a Grails app!
• Pig client posts plans and puts progress
• Javascript client
• gets plans and progress
• Searches jobs by job name and user name
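The server's role as described above (accept posted plans, accept progress updates, serve both back, and support search) can be sketched as an in-memory store. The real server is a Grails app backed by RDS; the class below is a hypothetical Python stand-in, and its method names and record fields are assumptions, not Lipstick's REST routes.

```python
class PlanStore:
    """In-memory sketch of the server's responsibilities (names are hypothetical)."""

    def __init__(self):
        self._jobs = {}

    def post_plan(self, job_id, job_name, user_name, plan):
        # The Pig client posts the plan once, when the script launches.
        self._jobs[job_id] = {
            "job_name": job_name,
            "user_name": user_name,
            "plan": plan,
            "progress": 0,
        }

    def put_progress(self, job_id, progress):
        # The Pig client puts progress repeatedly while the script runs.
        self._jobs[job_id]["progress"] = progress

    def get(self, job_id):
        # The browser client gets the plan and the latest progress.
        return self._jobs[job_id]

    def search(self, text):
        # Case-insensitive substring match on job name or user name.
        needle = text.lower()
        return sorted(
            job_id
            for job_id, job in self._jobs.items()
            if needle in job["job_name"].lower() or needle in job["user_name"].lower()
        )
```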
Lipstick Architecture – JS Client
• Displays and annotates graphs with status / progress
• Completely decoupled from Server
• Event based design
• Periodically polls Server for job progress
• Usability is a key focus
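The periodic polling described above follows a simple loop: fetch status, update the display, stop when the job is done. The actual client is JavaScript; the sketch below uses Python only to stay consistent with the other examples here. The function name `poll_until_done`, the `progress` field, and the 5-second default interval are all assumptions.

```python
import time


def poll_until_done(fetch_status, on_update, interval_s=5.0, sleep=time.sleep):
    """Poll a status source until it reports 100 percent progress.

    fetch_status: callable returning a dict with a "progress" key (0-100).
    on_update:    callable invoked with every status, e.g. to redraw a graph.
    sleep:        injectable so the loop can be exercised without waiting.
    """
    while True:
        status = fetch_status()
        on_update(status)
        if status["progress"] >= 100:
            return status
        sleep(interval_s)
```

Because the loop only depends on `fetch_status`, the display logic stays decoupled from however the server is reached, which matches the decoupled design the slide describes.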
Solving Problems with Lipstick
Common Problem #1: My job has stalled.
[Screenshots: Lipstick On Pig, showing the unoptimized/optimized logical plan toggle and a dangling operator]
Common Problem #2: I didn’t get the data I was expecting.
[Screenshots: Lipstick On Pig]
Common Problem #3: I don’t understand why my job failed.
[Screenshot: failed job (light red background) and successful job (light blue background)]
Future of Lipstick
• Annotate common errors and inefficiencies on the graph
– Skew / map side join opportunities / scalar issues
– E.g. Warnings / error dashboard
• Provide better details of runtime performance
– Timings annotated on graph
– Min / median / max mapper and reducer times
– Map / reduce completion over time
• Search through execution history
– Examine trends in runtime and data volumes
– History of failure / success
• Search jobs for commonalities
– Common datasets loaded / saved
– Better grasp data lineage
– Common uses of UDFs and macros
Lipstick on Hive
Honey?
A closer look…
Wrapping up
• Lipstick is part of Netflix OSS.
• Clone it on GitHub at
http://github.com/Netflix/Lipstick
• Check out the quickstart guide
– https://github.com/Netflix/Lipstick/wiki/Getting-Started#1-quick-start
– Get started playing with Lipstick in under 5 minutes!
• We happily welcome your feedback and
contributions!
Jeff Magnusson:
jmagnusson@netflix.com | http://www.linkedin.com/in/jmagnuss | @jeffmagnusson
Thank you!
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/

