Clouds, Hadoop and Cascading
Christopher Curtin
About Me
19+ years in technology. Background in factory automation, warehouse management and food safety system development before Silverpop. CTO of Silverpop. Silverpop is a leading marketing automation and email marketing company.
Cloud Computing
What exactly is ‘cloud computing’? Beats me. Most overused term since ‘dot com’. Ask 10 people, get 11 answers.
What is Map/Reduce?
Pioneered by Google. Parallel processing of large data sets across many computers. Highly fault tolerant. Splits work into two steps: Map and Reduce.
Map
Identifies what is in the input that you want to process. Can be simple: the occurrence of a word. Can be difficult: evaluate each row and toss those older than 90 days or from IP range 192.168.1.*. Output is a list of name/value pairs. Name and value do not have to be primitives.
Reduce
Takes the name/value pairs from the Map step and does something useful with them. The Map/Reduce framework determines which Reduce instance to call for which Map values, so a Reduce only ‘sees’ one set of ‘name’ values. Output is the ‘answer’ to the question. Example: bytes streamed by IP address from Apache logs.
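The Map, shuffle and Reduce steps above can be sketched in plain Java using the bytes-by-IP example. This is an in-memory illustration of the model only; the class and method names are made up, and it is not the Hadoop API, which distributes these steps across machines.

```java
import java.util.*;

// In-memory sketch of the Map/Reduce model (illustrative names, not the Hadoop API).
class MiniMapReduce {
    // Map step: parse one Apache-style log line ("ip bytes") into a name/value pair.
    static Map.Entry<String, Long> map(String logLine) {
        String[] parts = logLine.split(" ");
        return Map.entry(parts[0], Long.parseLong(parts[1]));
    }

    // Shuffle: the framework groups values by name, so each Reduce sees one key.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: the 'answer' -- total bytes streamed per IP address.
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> totals = new HashMap<>();
        grouped.forEach((ip, values) ->
                totals.put(ip, values.stream().mapToLong(Long::longValue).sum()));
        return totals;
    }
}
```

Mapping each log line, shuffling the resulting pairs, and reducing the groups yields one total per IP address.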
Hadoop
Apache’s Map/Reduce framework, released under the Apache License. Yahoo! uses a version and releases their enhancements back to the community.
HDFS
Distributed file system WITHOUT NFS etc. Hadoop knows which parts of which files are on which machine (say that 5 times fast!). “Move the processing to the data” if possible. Simple API to move files in and out of HDFS.
Runtime Distribution © Concurrent 2009
Getting Started with Map/Reduce
First challenge: real examples. Second challenge: when to map and when to reduce? Third challenge: what if I need more than one of each? How do I coordinate them? Fourth challenge: non-trivial business logic.
Cascading
Open source. Puts a wrapper on top of Hadoop. And so much more …
Main Concepts
Tuple, Tap, Operations, Pipes, Flows, Cascades
Tuple
A single ‘row’ of data being processed. Each column is named. Can access data by name or position.
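The by-name-or-position access can be sketched like this (a hypothetical NamedTuple class, not the actual Cascading Tuple API):

```java
import java.util.*;

// Minimal sketch of a named tuple: one 'row' whose columns can be read by
// name or by position (hypothetical class, not the Cascading Tuple API).
class NamedTuple {
    private final List<String> names;
    private final List<Object> values;

    NamedTuple(List<String> names, List<Object> values) {
        this.names = names;
        this.values = values;
    }

    // Access by position.
    Object get(int pos) {
        return values.get(pos);
    }

    // Access by column name.
    Object get(String name) {
        return values.get(names.indexOf(name));
    }
}
```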
Tap
Abstraction on top of Hadoop files. Allows you to define your own parser for files. Example:
Tap input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
Operations
Define what to do on the data.
Each – for each “tuple” in the data, do this to it.
Group – similar to a ‘group by’ in SQL.
CoGroup – joins tuple streams together.
Every – for every key in the Group or CoGroup, do this.
Operations - advanced
Each operations allow logic on the row, such as parsing dates, creating new attributes, etc. Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations. Both allow multiple operations in the same function, so no nested function calls!
Pipes
Pipes tie Operations together. Pipes can be thought of as ‘tuple streams’. Pipes can be split, allowing parallel execution of Operations.
Example Operation
RowAggregator aggr = new RowAggregator(row);
Fields groupBy = new Fields(MetricColumnDefinition.RECIPIENT_ID_NAME);
Pipe formatPipe = new Each("reformat_", new Fields("line"), a_sentFile);
formatPipe = new GroupBy(formatPipe, groupBy);
formatPipe = new Every(formatPipe, Fields.ALL, aggr);
Flows
Flows are reusable combinations of Taps, Pipes and Operations. They allow you to build a library of functions. Flows are where the Cascading scheduler comes into play.
Cascades
Cascades are groups of Flows that address a need. The Cascading scheduler looks for dependencies between Flows in a Cascade (and operations in a Flow), and determines which operations can run in parallel and which need to be serialized.
Cascading Scheduler
Once the Flows and Cascades are defined, the scheduler looks for dependencies. When executed, it tells Hadoop what Map, Reduce or Shuffle steps to take based on which Operations were used. It knows what can be executed in parallel, and when a step completes, which other steps can execute.
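The dependency analysis can be illustrated with a toy “wave” scheduler: given each flow’s prerequisites, it computes batches in which every flow already has its dependencies satisfied, so the flows in a batch could run in parallel. This is illustrative only; the real Cascading scheduler plans actual Map/Reduce steps, and these class and flow names are made up.

```java
import java.util.*;

// Toy dependency scheduler: flows in the same "wave" have all prerequisites
// satisfied and could run in parallel (illustrative, not the Cascading scheduler).
class WaveScheduler {
    static List<List<String>> waves(Map<String, Set<String>> prerequisites) {
        List<List<String>> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Map<String, Set<String>> pending = new HashMap<>(prerequisites);
        while (!pending.isEmpty()) {
            List<String> wave = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : pending.entrySet()) {
                // Runnable once every prerequisite has completed.
                if (done.containsAll(entry.getValue())) wave.add(entry.getKey());
            }
            if (wave.isEmpty()) throw new IllegalStateException("dependency cycle");
            for (String flow : wave) pending.remove(flow);
            done.addAll(wave);
            Collections.sort(wave); // deterministic ordering within a wave
            result.add(wave);
        }
        return result;
    }
}
```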
Dynamic Flow Creation
Flows can be created at run time based on inputs. 5 input files one week, 10 the next: the Java code creates 10 Flows instead of 5. Group and Every don’t care how many input Pipes there are.
Dynamic Tuple Definition
Each operations on input Taps can parse text lines into different Fields. So one source may have 5 inputs, another 10. Each operations can use metadata to know how to parse. You can write Each operations to output common Tuples. Every operations can output new Tuples as well.
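The metadata-driven parsing idea can be sketched like this: a per-source list of field names tells the parser how to turn a delimited line into named values, so one source can carry 5 columns and another 10. This is a hypothetical class, not Cascading’s Fields/TextLine machinery.

```java
import java.util.*;

// Sketch of metadata-driven parsing: the field-name list is the per-source
// metadata (hypothetical class, not the Cascading API).
class DynamicParser {
    static Map<String, String> parse(List<String> fieldNames, String line) {
        String[] parts = line.split(",");
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            row.put(fieldNames.get(i), parts[i]);
        }
        return row;
    }
}
```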
Example: Dynamic Fields
splice = new Each(splice,
    new ExpressionFunction(new Fields("offset"), "(open_time - sent_time)/(60*60*1000)", parameters, classes),
    new Fields("mailing_id", "offset"));
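The expression "(open_time - sent_time)/(60*60*1000)" converts the millisecond gap between send and open into whole hours. The same arithmetic in plain Java (hypothetical method name, times in milliseconds):

```java
// Hours between send and open, truncated by integer division
// (hypothetical helper, mirroring the ExpressionFunction above).
class OffsetExample {
    static long offsetHours(long openTimeMillis, long sentTimeMillis) {
        return (openTimeMillis - sentTimeMillis) / (60 * 60 * 1000);
    }
}
```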
Example: Custom Every Operation
RowAggregator.java
Mixing non-Hadoop code
Cascading allows you to mix regular Java between Flows in a Cascade. So you can call out to databases, write intermediates to a file, etc. We use it to load metadata about the columns in the source files.
Features I don’t use
Failure traps – allow you to write Tuples into ‘error’ files when something goes wrong.
Non-Java scripting.
Assert() statements.
Sampling – throw away a % of rows being imported.
Quick HDFS/Local Disk
Using a Path() object you can access an HDFS file directly in your Buffer/Aggregator-derived classes. So you can pass configuration information into these operations in bulk. You can also access the local disk, but make sure you have NFS or something similar to access the files later – you have no idea where your job will run!
Real Example
For the hundreds of mailings sent last year, to millions of recipients: show me who opened, and how often. Break it down by how long they have been a subscriber, by their gender, and by the number of times they clicked on the offer.
RDBMS Solution
Lots of million-plus-row joins. Lots of million-plus-row counts. Temporary tables, since we want multiple answers. Lots of memory. Lots of CPU and I/O. $$ becomes the bottleneck to adding more rows or more clients to the same logic.
Cascading Solution
Let Hadoop parse the input files. Let Hadoop group all inputs by the recipient’s email. Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ the data. Split the ‘flattened’ data into Pipes to process in parallel: time in list, gender, clicked on links. The bandwidth to export data from the RDBMS becomes the bottleneck.
Pros and Cons
Pros: Mix Java between map/reduce steps. Don’t have to worry about when to map and when to reduce. Don’t have to think about dependencies or how to process. The data definition can change on the fly.
Cons: A level above Hadoop – sometimes ‘black magic’. Data must (should) be outside of the database to get the most concurrency.
Other Solutions
Apache Pig: http://hadoop.apache.org/pig/ – more ‘SQL-like’. Not as easy to mix regular Java into processes. More ‘ad hoc’ than Cascading.
Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/ – runs on EC2. Provide your Map and Reduce functions. Can use Cascading. Pay as you go.
Resources
Me: ccurtin@silverpop.com, @ChrisCurtin
Chris Wensel: @cwensel
Web site: www.cascading.org (mailing list off the website)
AWSomeAtlanta Group: http://www.meetup.com/awsomeatlanta/
O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/

Hadoop and Cascading At AJUG July 2009
