Clouds, Hadoop and Cascading
Christopher Curtin
About Me
19+ years in technology. Background in factory automation, warehouse management and food safety system development before Silverpop. CTO of Silverpop. Silverpop is a leading marketing automation and email marketing company.
Cloud Computing
What exactly is ‘cloud computing’? Beats me. Most overused term since ‘dot com’. Ask 10 people, get 11 answers.
What is Map/Reduce?
Pioneered by Google. Parallel processing of large data sets across many computers. Highly fault tolerant. Splits work into two steps: Map and Reduce.
Map
Identifies what is in the input that you want to process. Can be simple: the occurrence of a word. Can be difficult: evaluate each row and toss those older than 90 days or from IP range 192.168.1.*. Output is a list of name/value pairs. Name and value do not have to be primitives.
Reduce
Takes the name/value pairs from the Map step and does something useful with them. The Map/Reduce framework determines which Reduce instance to call for which Map values, so a Reduce only ‘sees’ one set of ‘name’ values. Output is the ‘answer’ to the question. Example: bytes streamed by IP address from Apache logs.
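The Map, shuffle and Reduce steps above can be sketched in plain Java using the bytes-by-IP example. This is an in-memory illustration of the model only; the class and method names are made up, and it is not the Hadoop API, which distributes these steps across machines.

```java
import java.util.*;

// In-memory sketch of the Map/Reduce model (illustrative names, not the Hadoop API).
class MiniMapReduce {
    // Map step: parse one Apache-style log line ("ip bytes") into a name/value pair.
    static Map.Entry<String, Long> map(String logLine) {
        String[] parts = logLine.split(" ");
        return Map.entry(parts[0], Long.parseLong(parts[1]));
    }

    // Shuffle: the framework groups values by name, so each Reduce sees one key.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: the 'answer' -- total bytes streamed per IP address.
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> totals = new HashMap<>();
        grouped.forEach((ip, values) ->
                totals.put(ip, values.stream().mapToLong(Long::longValue).sum()));
        return totals;
    }
}
```

Mapping each log line, shuffling the resulting pairs, and reducing the groups yields one total per IP address.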
Hadoop
Apache’s Map/Reduce framework, released under the Apache License. Yahoo! uses a version and releases their enhancements back to the community.
HDFS
Distributed file system WITHOUT NFS etc. Hadoop knows which parts of which files are on which machine (say that 5 times fast!). “Move the processing to the data” if possible. Simple API to move files in and out of HDFS.
Runtime Distribution © Concurrent 2009
Getting Started with Map/Reduce
First challenge: real examples. Second challenge: when to map and when to reduce? Third challenge: what if I need more than one of each? How do I coordinate them? Fourth challenge: non-trivial business logic.
Cascading
Open source. Puts a wrapper on top of Hadoop. And so much more …
Main Concepts
Tuple, Tap, Operations, Pipes, Flows, Cascades
Tuple
A single ‘row’ of data being processed. Each column is named. Can access data by name or position.
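The by-name-or-position access can be sketched like this (a hypothetical NamedTuple class, not the actual Cascading Tuple API):

```java
import java.util.*;

// Minimal sketch of a named tuple: one 'row' whose columns can be read by
// name or by position (hypothetical class, not the Cascading Tuple API).
class NamedTuple {
    private final List<String> names;
    private final List<Object> values;

    NamedTuple(List<String> names, List<Object> values) {
        this.names = names;
        this.values = values;
    }

    // Access by position.
    Object get(int pos) {
        return values.get(pos);
    }

    // Access by column name.
    Object get(String name) {
        return values.get(names.indexOf(name));
    }
}
```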
Tap
Abstraction on top of Hadoop files. Allows you to define your own parser for files. Example:
Tap input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
Operations
Define what to do on the data.
Each – for each “tuple” in the data, do this to it.
Group – similar to a ‘group by’ in SQL.
CoGroup – joins tuple streams together.
Every – for every key in the Group or CoGroup, do this.
Operations - advanced
Each operations allow logic on the row, such as parsing dates, creating new attributes, etc. Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations. Both allow multiple operations in the same function, so no nested function calls!
Pipes
Pipes tie Operations together. Pipes can be thought of as ‘tuple streams’. Pipes can be split, allowing parallel execution of Operations.
Example Operation
RowAggregator aggr = new RowAggregator(row);
Fields groupBy = new Fields(MetricColumnDefinition.RECIPIENT_ID_NAME);
Pipe formatPipe = new Each("reformat_", new Fields("line"), a_sentFile);
formatPipe = new GroupBy(formatPipe, groupBy);
formatPipe = new Every(formatPipe, Fields.ALL, aggr);
Flows
Flows are reusable combinations of Taps, Pipes and Operations. They allow you to build a library of functions. Flows are where the Cascading scheduler comes into play.
Cascades
Cascades are groups of Flows that address a need. The Cascading scheduler looks for dependencies between Flows in a Cascade (and operations in a Flow), and determines which operations can run in parallel and which need to be serialized.
Cascading Scheduler
Once the Flows and Cascades are defined, the scheduler looks for dependencies. When executed, it tells Hadoop what Map, Reduce or Shuffle steps to take based on which Operations were used. It knows what can be executed in parallel, and when a step completes, which other steps can execute.
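The dependency analysis can be illustrated with a toy “wave” scheduler: given each flow’s prerequisites, it computes batches in which every flow already has its dependencies satisfied, so the flows in a batch could run in parallel. This is illustrative only; the real Cascading scheduler plans actual Map/Reduce steps, and these class and flow names are made up.

```java
import java.util.*;

// Toy dependency scheduler: flows in the same "wave" have all prerequisites
// satisfied and could run in parallel (illustrative, not the Cascading scheduler).
class WaveScheduler {
    static List<List<String>> waves(Map<String, Set<String>> prerequisites) {
        List<List<String>> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Map<String, Set<String>> pending = new HashMap<>(prerequisites);
        while (!pending.isEmpty()) {
            List<String> wave = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : pending.entrySet()) {
                // Runnable once every prerequisite has completed.
                if (done.containsAll(entry.getValue())) wave.add(entry.getKey());
            }
            if (wave.isEmpty()) throw new IllegalStateException("dependency cycle");
            for (String flow : wave) pending.remove(flow);
            done.addAll(wave);
            Collections.sort(wave); // deterministic ordering within a wave
            result.add(wave);
        }
        return result;
    }
}
```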
Dynamic Flow Creation
Flows can be created at run time based on inputs. 5 input files one week, 10 the next: the Java code creates 10 Flows instead of 5. Group and Every don’t care how many input Pipes there are.
Dynamic Tuple Definition
Each operations on input Taps can parse text lines into different Fields. So one source may have 5 inputs, another 10. Each operations can use metadata to know how to parse. You can write Each operations to output common Tuples. Every operations can output new Tuples as well.
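The metadata-driven parsing idea can be sketched like this: a per-source list of field names tells the parser how to turn a delimited line into named values, so one source can carry 5 columns and another 10. This is a hypothetical class, not Cascading’s Fields/TextLine machinery.

```java
import java.util.*;

// Sketch of metadata-driven parsing: the field-name list is the per-source
// metadata (hypothetical class, not the Cascading API).
class DynamicParser {
    static Map<String, String> parse(List<String> fieldNames, String line) {
        String[] parts = line.split(",");
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            row.put(fieldNames.get(i), parts[i]);
        }
        return row;
    }
}
```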
Example: Dynamic Fields
splice = new Each(splice,
    new ExpressionFunction(new Fields("offset"), "(open_time - sent_time)/(60*60*1000)", parameters, classes),
    new Fields("mailing_id", "offset"));
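The expression "(open_time - sent_time)/(60*60*1000)" converts the millisecond gap between send and open into whole hours. The same arithmetic in plain Java (hypothetical method name, times in milliseconds):

```java
// Hours between send and open, truncated by integer division
// (hypothetical helper, mirroring the ExpressionFunction above).
class OffsetExample {
    static long offsetHours(long openTimeMillis, long sentTimeMillis) {
        return (openTimeMillis - sentTimeMillis) / (60 * 60 * 1000);
    }
}
```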
Example: Custom Every Operation
RowAggregator.java
Mixing non-Hadoop code
Cascading allows you to mix regular Java between Flows in a Cascade. So you can call out to databases, write intermediates to a file, etc. We use it to load metadata about the columns in the source files.
Features I don’t use
Failure traps – allow you to write Tuples into ‘error’ files when something goes wrong.
Non-Java scripting.
Assert() statements.
Sampling – throw away a % of rows being imported.
Quick HDFS/Local Disk
Using a Path() object you can access an HDFS file directly in your Buffer/Aggregator-derived classes. So you can pass configuration information into these operations in bulk. You can also access the local disk, but make sure you have NFS or something similar to access the files later – you have no idea where your job will run!
Real Example
For the hundreds of mailings sent last year, to millions of recipients: show me who opened, and how often. Break it down by how long they have been a subscriber, by their gender, and by the number of times they clicked on the offer.
RDBMS Solution
Lots of million-plus-row joins. Lots of million-plus-row counts. Temporary tables, since we want multiple answers. Lots of memory. Lots of CPU and I/O. $$ becomes the bottleneck to adding more rows or more clients to the same logic.
Cascading Solution
Let Hadoop parse the input files. Let Hadoop group all inputs by the recipient’s email. Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ the data. Split the ‘flattened’ data into Pipes to process in parallel: time in list, gender, clicked on links. The bandwidth to export data from the RDBMS becomes the bottleneck.
Pros and Cons
Pros: Mix Java between map/reduce steps. Don’t have to worry about when to map and when to reduce. Don’t have to think about dependencies or how to process. The data definition can change on the fly.
Cons: A level above Hadoop – sometimes ‘black magic’. Data must (should) be outside of the database to get the most concurrency.
Other Solutions
Apache Pig: http://hadoop.apache.org/pig/ – more ‘SQL-like’. Not as easy to mix regular Java into processes. More ‘ad hoc’ than Cascading.
Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/ – runs on EC2. Provide your Map and Reduce functions. Can use Cascading. Pay as you go.
Resources
Me: ccurtin@silverpop.com, @ChrisCurtin
Chris Wensel: @cwensel
Web site: www.cascading.org (mailing list off the website)
AWSomeAtlanta Group: http://www.meetup.com/awsomeatlanta/
O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/

Hadoop and Cascading At AJUG July 2009
