Examples of Pig embedding in Python.
Presentation at the post-Hadoop Summit Pig user meetup
http://www.meetup.com/PigUser/events/21215831/
The document discusses new features in Pig including macros, debugging tools, UDF support for scripting languages, embedding Pig in other languages, and new operators like nested CROSS and FOREACH. Examples are provided for macros, debugging with Pig Illustrate, writing UDFs in Python and Ruby, running Pig from Python, and using nested operators. Future additions mentioned are RANK and CUBE operators.
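To make the "running Pig from Python" point concrete, here is a minimal sketch of the Pig embedding API (org.apache.pig.scripting), assuming Pig 0.9+ executed with the bundled Jython interpreter (e.g. `pig -x local embed_demo.py`); the file names and schema are hypothetical.

```python
#!/usr/bin/python
# Minimal sketch of embedding Pig in Python (Jython); paths and schema are illustrative.
from org.apache.pig.scripting import Pig

# Compile a parameterized Pig Latin pipeline ($in / $out are bound at run time).
P = Pig.compile("""
    visits = LOAD '$in' AS (user:chararray, url:chararray);
    grpd   = GROUP visits BY user;
    counts = FOREACH grpd GENERATE group AS user, COUNT(visits) AS hits;
    STORE counts INTO '$out';
""")

# Bind the parameters and launch a single run.
stats = P.bind({'in': 'visits.txt', 'out': 'visit_counts'}).runSingle()
if not stats.isSuccessful():
    raise RuntimeError('Pig job failed')
```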
Pig programming is more fun: New features in Pig - daijy
In the last year, we have added many new language features to Pig, and Pig programming is much easier than before. With Pig macros we can write reusable functions and modularize Pig programs. Pig embedding lets us embed Pig statements in Python and take advantage of Python's rich language features, such as loops and branches. Java is no longer the only choice for writing Pig UDFs: we can now write UDFs in Python, JavaScript, and Ruby. Nested FOREACH and CROSS give us ways to manipulate data that were not possible before. We have also added plenty of syntactic sugar to simplify Pig syntax, for example direct syntax support for maps, tuples, and bags, and project-range expressions in FOREACH. The ILLUSTRATE command has been revived to ease debugging. In this talk I give an overview of these features and show how to use them to program more efficiently in Pig, with concrete examples of how the Pig language has evolved over time through these improvements.
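As a hedged illustration of the "UDF in Python" point above, a Jython UDF file might look like the sketch below; the function names are made up for the example, and the @outputSchema decorator is supplied by Pig's Jython script engine once the file is registered from Pig Latin with `REGISTER 'udfs.py' USING jython AS myudfs;`.

```python
# udfs.py -- sketch of Pig UDFs written in Python (Jython).
# No import is needed for outputSchema: Pig's Jython engine injects the
# decorator when the file is registered with REGISTER ... USING jython.

@outputSchema("word:chararray")
def normalize(word):
    """Lower-case and strip a token before grouping."""
    if word is None:
        return None
    return word.strip().lower()

@outputSchema("len:int")
def length(s):
    """Length of a chararray, treating NULL as 0."""
    return len(s) if s is not None else 0
```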
Pig is a platform for analyzing large datasets that sits on top of Hadoop. It provides a simple language called Pig Latin for expressing data analysis processes. Pig Latin scripts are compiled into series of MapReduce jobs that process and analyze data in parallel across a Hadoop cluster. Pig aims to be easier to use than raw MapReduce programs by providing high-level operations like JOIN, FILTER, GROUP, and allowing analysis to be expressed without writing Java code. Common use cases for Pig include log and web data analysis, ETL processes, and quick prototyping of algorithms for large-scale data.
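A short sketch of those high-level operations (FILTER, JOIN, GROUP), written here as Pig Latin embedded in Python to match the theme of this page; the file names and schemas are assumptions.

```python
# Sketch of FILTER / JOIN / GROUP without writing any Java; paths are illustrative.
from org.apache.pig.scripting import Pig

pipeline = Pig.compile("""
    users  = LOAD 'users.txt'  AS (name:chararray, age:int);
    visits = LOAD 'visits.txt' AS (name:chararray, url:chararray);

    adults = FILTER users BY age >= 18;                          -- FILTER
    joined = JOIN adults BY name, visits BY name;                -- JOIN
    byuser = GROUP joined BY adults::name;                       -- GROUP
    counts = FOREACH byuser GENERATE group AS name, COUNT(joined) AS n_visits;
    STORE counts INTO 'adult_visit_counts';
""")
pipeline.bind().runSingle()
```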
The document summarizes a presentation on using R and Hadoop together. It includes:
1) An outline of topics to be covered including why use MapReduce and R, options for combining R and Hadoop, an overview of RHadoop, a step-by-step example, and advanced RHadoop features.
2) Code examples from Jonathan Seidman showing how to analyze airline on-time data using different R and Hadoop options - naked streaming, Hive, RHIPE, and RHadoop.
3) The analysis calculates average departure delays by year, month and airline using each method.
This document provides an overview of Apache Pig and Pig Latin for querying large datasets. It discusses why Pig was created due to limitations of SQL for big data, how Pig scripts are written in Pig Latin using a simple syntax, and how Pig Latin scripts are compiled into MapReduce jobs and executed on Hadoop clusters. Advanced topics covered include user-defined functions in Pig Latin for custom data processing and sharing functions through Piggy Bank.
Introduction to Pig | Pig Architecture | Pig Fundamentals - Skillspeed
This Hadoop Pig tutorial will unravel Pig Programming, Pig Commands, Pig Fundamentals, Grunt Mode, Script Mode & Embedded Mode.
At the end, you'll have a strong knowledge regarding Hadoop Pig Basics.
PPT Agenda:
✓ Introduction to BIG Data & Hadoop
✓ What is Pig?
✓ Pig Data Flows
✓ Pig Programming
----------
What is Pig?
Pig is an open-source data flow language that expresses data management operations as simple Pig Latin scripts. Pig works closely with MapReduce.
----------
Applications of Pig
1. Data Cleansing
2. Data Transfers via HDFS
3. Data Factory Operations
4. Predictive Modelling
5. Business Intelligence
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live, instructor-led training in Big Data & Hadoop featuring real-time projects, 24/7 lifetime support & 100% placement assistance.
Email: [email protected]
Website: https://www.skillspeed.com
High-level Programming Languages: Apache Pig and Pig Latin - Pietro Michiardi
This slide deck is used as an introduction to the Apache Pig system and the Pig Latin high-level programming language, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Python can be used for big data applications and processing on Hadoop. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the distributed processing of large datasets across clusters of computers using simple programming models. MapReduce is a programming model used in Hadoop for processing and generating large datasets in a distributed computing environment.
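As a minimal sketch of that MapReduce model in Python, the classic Hadoop Streaming word count below reads lines on stdin and emits tab-separated key/value pairs; the jar path and file names in the comment are assumptions.

```python
#!/usr/bin/env python
# wordcount.py -- Hadoop Streaming word count; the role is picked via argv.
# Example invocation (paths are assumptions):
#   hadoop jar hadoop-streaming.jar -input books/ -output counts/ \
#       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -file wordcount.py
import sys

def mapper():
    # Emit (word, 1) for every token on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%d' % (word.lower(), 1))

def reducer():
    # Streaming sorts by key, so counts for a word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word != current and current is not None:
            print('%s\t%d' % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print('%s\t%d' % (current, total))

if __name__ == '__main__':
    mapper() if len(sys.argv) > 1 and sys.argv[1] == 'map' else reducer()
```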
Pig is a platform for analyzing large datasets that sits between low-level MapReduce programming and high-level SQL queries. It provides a language called Pig Latin that allows users to specify data analysis programs without dealing with low-level details. Pig Latin scripts are compiled into sequences of MapReduce jobs for execution. HCatalog allows data to be shared between Pig, Hive, and other tools by reading metadata about schemas, locations, and formats.
Massively Parallel Processing with Procedural Python (PyData London 2014) - Ian Huston
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
Scalable Hadoop with succinct Python: the best of both worlds - DataWorks Summit
The document discusses using Python with Hadoop frameworks. It outlines some of the benefits of Hadoop like scalability and schema flexibility, and benefits of Python like succinct code and many data science libraries. It then reviews several projects that aim to bridge Python and Hadoop, including mrjob for MapReduce jobs, Pydoop for faster MapReduce, Pig for higher-level data flows, Snakebite for a Python HDFS client, and PySpark for working with Spark. However, it notes that Python support is often an afterthought or fringe project compared to the native Java support, and lacks commercial backing or cohesive APIs.
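For instance, mrjob (mentioned above) lets the same word count be written as a single Python class; this is a minimal sketch assuming mrjob is installed, and it runs locally by default or on a cluster with `-r hadoop`.

```python
# wordcount_mrjob.py -- minimal mrjob sketch; run: python wordcount_mrjob.py input.txt
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Input keys are ignored; emit (word, 1) pairs for each line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts is a generator of the 1s emitted for this word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
```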
Introduction of the Design of A High-level Language over MapReduce -- The Pig... - Yu Liu
Pig is a platform for analyzing large datasets that uses Pig Latin, a high-level language, to express data analysis programs. Pig Latin programs are compiled into MapReduce jobs and executed on Hadoop. Pig Latin provides data manipulation constructs like SQL as well as user-defined functions. The Pig system compiles programs through optimization, code generation, and execution on Hadoop. Future work focuses on additional optimizations, non-Java UDFs, and interfaces like SQL.
IPython Notebook as a Unified Data Science Interface for Hadoop - DataWorks Summit
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages - Ian Huston
The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. With Procedural Languages such as PL/Python and PL/R data parallel queries can be run across terabytes of data using not only pure SQL but also familiar Python and R packages. The Pivotal Data Science team have used this technique to create fraud behaviour models for each individual user in a large corporate network, to understand interception rates at customs checkpoints by accelerating natural language processing of package descriptions and to reduce customer churn by building a sentiment model using customer call centre records.
http://www.meetup.com/Data-Science-Amsterdam/events/178974942/
This document discusses Pig on Tez, which runs Pig jobs on the Tez execution engine rather than MapReduce. The team introduces Pig and Tez, describes the design of Pig on Tez including logical and physical plans, custom vertices and edges, and performance optimizations like broadcast edges and object caching. Performance results show speedups of 1.5x to 6x over MapReduce. Current status is 90% feature parity with Pig on MR and future work includes supporting Tez local mode and improving stability, usability, and performance further.
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be... - PyData
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka - Edureka!
This Edureka Pig Tutorial ( Pig Tutorial Blog Series: https://goo.gl/KPE94k ) will help you understand the concepts of Apache Pig in depth.
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
Below are the topics covered in this Pig Tutorial:
1) Entry of Apache Pig
2) Pig vs MapReduce
3) Twitter Case Study on Apache Pig
4) Apache Pig Architecture
5) Pig Components
6) Pig Data Model
7) Running Pig Commands and Pig Scripts (Log Analysis)
This document discusses Apache Pig and its role in data science. It begins with an introduction to Pig, describing it as a high-level scripting language for operating on large datasets in Hadoop. It transforms data operations into MapReduce/Tez jobs and optimizes the number of jobs required. The document then covers using Pig for understanding data through statistics and sampling, machine learning by sampling large datasets and applying models with UDFs, and natural language processing on large unstructured data.
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package) - Jeffrey Breen
The document describes a Big Data workshop held on March 10, 2012 at the Microsoft New England Research & Development Center in Cambridge, MA. The workshop focused on using R and Hadoop, with an emphasis on RHadoop's rmr package. The document provides an introduction to using R with Hadoop and discusses several R packages for working with Hadoop, including RHIPE, rmr, rhdfs, and rhbase. Code examples are presented demonstrating how to calculate average departure delays by airline and month from an airline on-time performance dataset using different approaches, including Hadoop streaming, hive, RHIPE and rmr.
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
Hadoop is an open source framework for distributed storage and processing of vast amounts of data across clusters of computers. It uses a master-slave architecture with a single JobTracker master and multiple TaskTracker slaves. The JobTracker schedules tasks like map and reduce jobs on TaskTrackers, which each run task instances in separate JVMs. It monitors task progress and reschedules failed tasks. Hadoop uses MapReduce programming model where the input is split and mapped in parallel, then outputs are shuffled, sorted, and reduced to form the final results.
This document summarizes Bukalapak's process for generating word embeddings from product names to find similar words. It involved preprocessing over 300 million product names, training a word2vec model on the text, exporting the model for serving, and deploying model server and client applications to retrieve similar words through an API. The preprocessing used tokenization and filtering to extract words from names, while the model was trained using skip-gram architecture on GPUs over 1.5 months to produce a 500MB embedding.
This document summarizes Facebook's use cases and architecture for integrating Apache Hive and HBase. It discusses loading data from Hive into HBase tables using INSERT statements, querying HBase tables from Hive using SELECT statements, and maintaining low latency access to dimension tables stored in HBase while performing analytics on fact data stored in Hive. The architecture involves writing a storage handler and SerDe to map between the two systems and executing Hive queries by generating MapReduce jobs that read from or write to HBase.
The document contains 31 questions and answers related to Hadoop concepts. It covers topics like common input formats in Hadoop, differences between TextInputFormat and KeyValueInputFormat, what are InputSplits and how they are created, how partitioning, shuffling and sorting occurs after the map phase, what is a combiner, functions of JobTracker and TaskTracker, how speculative execution works, using distributed cache and counters, setting number of mappers/reducers, writing custom partitioners, debugging Hadoop jobs, and failure handling processes for production Hadoop jobs.
Deep learning algorithms have benefited greatly from the recent performance gains of GPUs. However, it has been unclear whether GPUs can speed up machine learning algorithms such as generalized linear modeling, random forests, gradient boosting machines, and clustering. H2O.ai, the leading open source AI company, is bringing the best-of-breed data science and machine learning algorithms to GPUs.
We introduce H2O4GPU, a fully featured machine learning library that is optimized for GPUs, with a robust Python API that is a drop-in replacement for scikit-learn. We'll demonstrate benchmarks for the most common algorithms relevant to enterprise AI and showcase performance gains compared to running on CPUs.
Jon’s Bio:
https://umdphysics.umd.edu/people/faculty/current/item/337-jcm.html
Please view the video here:
With the surge in Big Data, organizations have begun to implement Big Data-related technologies as part of their systems. This has led to a huge need to update existing skill sets with Hadoop. Java professionals are one such group who need to add Hadoop skills.
BioPig for scalable analysis of big sequencing data - Zhong Wang
This document introduces BioPig, a Hadoop-based analytic toolkit for large-scale genomic sequence analysis. BioPig aims to provide a flexible, high-level, and scalable platform to enable domain experts to build custom analysis pipelines. It leverages Hadoop's data parallelism to speed up bioinformatics tasks like k-mer counting and assembly. The document demonstrates how BioPig can analyze over 1 terabase of metagenomic data using just 7 lines of code, much more simply than alternative MPI-based solutions. While challenges remain around optimization and integration, BioPig shows promise for scalable genomic analytics on very large datasets.
Poster Hadoop Summit 2011: Pig embedding in scripting languages - Julien Le Dem
Pig Scripting allows Pig to be used for iterative algorithms and user defined functions by embedding Pig within a scripting language. This allows transitive closure, an iterative process, to be computed with one file using Python functions as UDFs and Python variables in Pig scripts, rather than requiring seven files and Java UDFs as plain Pig would.
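The iterative pattern behind that transitive-closure example can be sketched as a Python loop that re-runs a parameterized Pig step until a fixpoint; the relation names, paths, iteration cap, and the record-count convergence test via PigStats are assumptions for illustration, not the poster's exact code.

```python
#!/usr/bin/python
# Sketch of driving an iterative Pig computation from Python (Jython).
from org.apache.pig.scripting import Pig

# One expansion step: follow every edge once more, then deduplicate.
step = Pig.compile("""
    e1 = LOAD '$input' AS (src:chararray, dst:chararray);
    e2 = LOAD '$input' AS (src:chararray, dst:chararray);
    hop       = JOIN e1 BY dst, e2 BY src;
    new_edges = FOREACH hop GENERATE e1::src AS src, e2::dst AS dst;
    unioned   = UNION e1, new_edges;
    closure   = DISTINCT unioned;
    STORE closure INTO '$output';
""")

input_path, prev_count = 'edges_0', -1
for i in range(10):                              # hard cap on iterations
    output_path = 'edges_%d' % (i + 1)
    stats = step.bind({'input': input_path, 'output': output_path}).runSingle()
    if not stats.isSuccessful():
        raise RuntimeError('iteration %d failed' % i)
    count = stats.getOutputStats()[0].getNumberRecords()
    if count == prev_count:                      # fixpoint reached: no new edges
        break
    input_path, prev_count = output_path, count
```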
Data Eng Conf NY Nov 2016 Parquet Arrow - Julien Le Dem
- Arrow and Parquet are open source projects focused on column-oriented data formats for efficient in-memory (Arrow) and on-disk (Parquet) analytics.
- They allow for interoperability across systems by eliminating the overhead of data serialization and enabling common data representations.
- Column-oriented formats improve performance by reducing storage needs, enabling projection of only needed columns, and better utilization of CPU/memory through cache locality and vectorized processing.
Reducing the dimensionality of data with neural networks - Hakky St
(1) The document describes using neural networks called autoencoders to perform dimensionality reduction on data in a nonlinear way. Autoencoders use an encoder network to transform high-dimensional data into a low-dimensional code, and a decoder network to recover the data from the code.
(2) The autoencoders are trained to minimize the discrepancy between the original and reconstructed data. Experiments on image and face datasets showed autoencoders outperforming principal components analysis at reconstructing the original data from the low-dimensional code.
(3) Pretraining the autoencoder layers using restricted Boltzmann machines helps optimize the many weights in deep autoencoders and scale the approach to large datasets.
Strata NY 2016: The future of column-oriented data processing with Arrow and ... - Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
K-Means clustering is an algorithm that partitions data points into k clusters based on their distances from initial cluster center points. It is commonly used for classification applications on large datasets and can be parallelized by duplicating cluster centers and processing each data point independently. Mahout provides implementations of K-Means clustering and other algorithms that can operate on distributed datasets stored in Hadoop SequenceFiles.
This document describes Drizzle, a low latency execution engine for Apache Spark. It addresses the high overheads of Spark's centralized scheduling model by decoupling execution from scheduling through batch scheduling and pre-scheduling of shuffles. Microbenchmarks show Drizzle achieves milliseconds latency for iterative workloads compared to hundreds of milliseconds for Spark. End-to-end experiments show Drizzle improves latency for streaming and machine learning workloads like logistic regression. The authors are working on automatic batch tuning and an open source release of Drizzle.
Parquet Strata/Hadoop World, New York 2013 - Julien Le Dem
Parquet is a columnar storage format for Hadoop data. It was developed collaboratively by Twitter and Cloudera to address the need for efficient analytics on large datasets. Parquet provides more efficient compression and I/O compared to row-based formats by only reading and decompressing the columns needed by a query. It has been adopted by many companies for analytics workloads involving terabytes to petabytes of data. Parquet is language-independent and supports integration with frameworks like Hive, Pig, and Impala. It provides significant performance improvements and storage savings compared to traditional row-based formats.
Efficient Data Storage for Analytics with Apache Parquet 2.0 - Cloudera, Inc.
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
If you have your own Columnar format, stop now and use Parquet 😛 - Julien Le Dem
Lightning talk presented at HPTS 2015: http://hpts.ws/
Apache Parquet is the de facto standard columnar storage for big data. Open source and proprietary SQL engines already integrate with it as their users don’t want to load and duplicate their data in every tool. Users want an open, interoperable, efficient format to experiment with the many options they have. The format is defined by the open source community integrating feedback from many teams working on query engines (including but not limited to Impala, Drill, Hawq, SparkSQL, Presto, Hive, etc) or on infrastructure at scale (Twitter, Netflix, Stripe, Criteo, ...). Building on its initial success, the Parquet community is defining new features for the next iteration of the format. For example: improved metadata layout, type system completude or mergeable statistics used for planning.
Choosing an HDFS data storage format - Avro vs. Parquet and more - StampedeCon... - StampedeCon
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
Shawn Hermans gave a presentation on using Pig and Python for big data analysis. He began with an introduction and background about himself. The presentation covered what big data is, challenges in working with large datasets, and how tools like MapReduce, Pig and Python can help address these challenges. As examples, he demonstrated using Pig to analyze US Census data and propagate satellite positions from Two Line Element sets. He also showed how to extend Pig with Python user defined functions.
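A sketch of that Pig-plus-Python combination: register a Jython UDF file and call it from an embedded script. The module name, function, and paths are hypothetical; udfs.py would hold functions decorated with @outputSchema as in the earlier sketch.

```python
# Sketch of calling a Python UDF from an embedded Pig script; names are illustrative.
from org.apache.pig.scripting import Pig

job = Pig.compile("""
    REGISTER 'udfs.py' USING jython AS myudfs;

    records = LOAD '$in' AS (name:chararray, value:double);
    cleaned = FOREACH records GENERATE myudfs.normalize(name) AS name, value;
    STORE cleaned INTO '$out';
""")
job.bind({'in': 'raw_records.txt', 'out': 'cleaned_records'}).runSingle()
```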
This document outlines an agenda for a CTO summit on machine learning and deep learning topics. It includes discussions on CNN and RNN architectures, word embeddings, entity embeddings, reinforcement learning, and tips for training deep neural networks. Specific applications mentioned include self-driving cars, image captioning, language modeling, and modeling store sales. It also includes summaries of papers and links to code examples.
Need to make a horizontal change across 100+ microservices? No worries, Sheph... - Aori Nevo, PhD
In this talk, you’ll be introduced to an innovative, powerful, flexible, and easy-to-use tool developed by the folks at NerdWallet that simplifies some of the complexities associated with making horizontal changes in service oriented systems via automation.
Incredible Machine with Pipelines and Generators - dantleech
The document discusses using generators and pipelines in PHP to build a performance testing tool called J-Meter. It begins by explaining generators in PHP and how they allow yielding control and passing values between functions. This enables building asynchronous pipelines where stages can be generators. Various PHP frameworks and patterns for asynchronous programming with generators are mentioned. The document concludes by outlining how generators and pipelines could be used to build the major components of a J-Meter-like performance testing tool in PHP.
Pig is a data flow language and execution environment for exploring very large datasets on Hadoop clusters. Pig scripts are written using a scripting language that allows for rapid prototyping without compilation. Pig provides high-level operations like FILTER, FOREACH, GROUP, and JOIN that allow users to analyze data without writing MapReduce programs directly. The core abstraction in Pig is the relation, which is analogous to a table. Relations can be nested using complex data types like tuples and bags. Users can extend Pig's functionality using User Defined Functions. Key operations in Pig like GROUP and JOIN introduce "map-reduce barriers" that require a MapReduce job to be executed.
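A sketch of the nesting described above: GROUP introduces the map-reduce barrier, and the block inside FOREACH operates on each group's bag with nested DISTINCT, ORDER, and LIMIT. Paths and the schema are assumptions, and the script is embedded in Python to stay consistent with the rest of this page.

```python
# Nested FOREACH over a grouped relation; paths and schema are illustrative.
from org.apache.pig.scripting import Pig

Pig.compile("""
    visits  = LOAD 'visits.txt' AS (user:chararray, url:chararray, ts:long);
    by_user = GROUP visits BY user;                  -- map-reduce barrier
    summary = FOREACH by_user {
        urls   = DISTINCT visits.url;                -- nested DISTINCT
        recent = ORDER visits BY ts DESC;            -- nested ORDER
        last3  = LIMIT recent 3;                     -- nested LIMIT
        GENERATE group AS user, COUNT(urls) AS distinct_urls, last3;
    };
    STORE summary INTO 'visit_summary';
""").bind().runSingle()
```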
This document discusses concurrency in R and the need for multithreading capabilities. It presents a motivating example of using shared memory via the bigmemoRy package to decouple data processing across multiple R sessions for real-time streaming data analysis. While R itself is single-threaded, packages like bigmemoRy allow sharing data between processes through inter-process communication to enable concurrent processing of streaming data in R.
Introduction to Google App Engine with Python - Brian Lyttle
Google App Engine is a cloud development platform that allows users to build and host web applications on Google's infrastructure. It provides automatic scaling for applications and manages all server maintenance. Development is done locally in Python and code is pushed to the cloud. The platform provides data storage, user authentication, URL fetching, task queues, and other services via APIs. While initially limited to Python and Java, it now supports other languages as well. Usage is free for small applications under a monthly quota, and priced based on usage for larger applications.
Concurrent Programming OpenMP @ Distributed System Discussion - CherryBerry2
This powerpoint presentation discusses OpenMP, a programming interface that allows for parallel programming on shared memory architectures. It covers the basic architecture of OpenMP, its core elements like directives and runtime routines, advantages like portability, and disadvantages like potential synchronization bugs. Examples are provided of using OpenMP directives to parallelize a simple "Hello World" program across multiple threads. Fine-grained and coarse-grained parallelism are also defined.
The document discusses tools for deploying and managing cloud applications including Terraform, Packer, and Jsonnet. Terraform allows declarative configuration of infrastructure resources, Packer builds machine images, and Jsonnet is a configuration language designed to generate JSON or YAML files from reusable templates. The document demonstrates how to use these tools together to deploy a sample application with load balancing and auto-scaling on Google Cloud Platform. It also proposes ways to further abstract configurations and synchronize application and infrastructure details for improved usability.
This document summarizes integrating the OpenNMS network monitoring platform with modern configuration management tools like Puppet. It discusses using Puppet to provision and automatically configure nodes in OpenNMS from Puppet's configuration data. The authors provide code for pulling node data from Puppet's REST API and generating an XML file for OpenNMS to import the nodes and their configuration. They also discuss opportunities to further improve the integration by developing a Java object model for Puppet's YAML output and filtering imports based on node attributes.
This document provides an overview of learning Python in three hours. It covers installing and running Python, basic data types like integers, floats and strings. It also discusses sequence types like lists, tuples and strings, including accessing elements, slicing, and using operators like + and *. The document explains basic syntax like comments, indentation and naming conventions. It provides examples of simple functions and scripts.
Harimurti Prasetio (CTO NoLimit) - High Performance Node.js
Dicoding Space - Bandung Developer Day 3
https://www.dicoding.com/events/76
2nd PUC Computer Science, Chapter 8: function overloading - types of function overloading, syntax of function overloading, examples of function overloading, inline functions, and friend functions.
This document provides an introduction to programming with Python for beginners. It covers basic Python concepts like variables, data types, operators, conditional statements, functions, loops, strings and lists. It also demonstrates how to build simple web applications using Google App Engine and Python, including templating with Jinja2, storing data in the Datastore and handling web forms. The goal is to teach the fundamentals of Python programming and get started with cloud development on Google Cloud Platform.
These are the slides for a seminar giving a basic overview of the Go language, by Alessandro Sanino.
They were used in a lesson at the University of Turin (Computer Science Department) on 11-06-2018.
The document provides an overview and quick start guide for learning Puppet. It discusses what Puppet is, how to install a Puppet master and agent, Puppet modules and templates, and looping elements in templates. The guide outlines three sessions: 1) configuring a master and agent, 2) using modules and templates, and 3) looping in templates. It provides configuration examples and explains how to generate files on the agent from templates using Puppet runs.
Data and AI summit: data pipelines observability with open lineage - Julien Le Dem
Presentation of data lineage and observability with OpenLineage at the "Data and AI summit" (formerly Spark summit), with a focus on the Apache Spark integration for OpenLineage.
Data pipelines observability: OpenLineage & Marquez - Julien Le Dem
This document discusses OpenLineage and Marquez, which aim to provide standardized metadata and data lineage collection for data pipelines. OpenLineage defines an open standard for collecting metadata as data moves through pipelines, similar to metadata collected by EXIF for images. Marquez is an open source implementation of this standard, which can collect metadata from various data tools and store it in a graph database for querying lineage and understanding dependencies. This collected metadata helps with tasks like troubleshooting, impact analysis, and understanding how data flows through complex pipelines over time.
Open core summit: Observability for data pipelines with OpenLineage - Julien Le Dem
This document discusses Open Lineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how Open Lineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on the data source, schema, owners, usage, and changes. The document outlines how Marquez implements the Open Lineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework to integrate various data platforms and extensions like Datakin's lineage analysis tools.
Data platform architecture principles - ieee infrastructure 2020 - Julien Le Dem
This document discusses principles for building a healthy data platform, including:
1. Establishing explicit contracts between teams to define dependencies and service level agreements.
2. Abstracting the data platform into services for ingesting, storing, and processing data in motion and at rest.
3. Enabling observability of data pipelines through metadata collection and integration with tools like Marquez to provide lineage, availability, and change management visibility.
Data lineage and observability with Marquez - subsurface 2020 - Julien Le Dem
This document discusses Marquez, an open source metadata management system. It provides an overview of Marquez and how it can be used to track metadata in data pipelines. Specifically:
- Marquez collects and stores metadata about data sources, datasets, jobs, and runs to provide data lineage and observability.
- It has a modular framework to support data governance, data lineage, and data discovery. Metadata can be collected via REST APIs or language SDKs.
- Marquez integrates with Apache Airflow to collect task-level metadata, dependencies between DAGs, and link tasks to code versions. This enables understanding of operational dependencies and troubleshooting.
- The Marquez community aims to build an open
Strata NY 2018: The deconstructed database - Julien Le Dem
The document discusses the evolution of big data systems from flat files in Hadoop to deconstructed databases. Originally, Hadoop used flat files with no schema and tied data storage to execution. This was flexible but inefficient. SQL databases provided schemas and separated storage from execution but were inflexible. Modern systems decompose databases into independent components like storage, processing, and machine learning that can be mixed and matched. Standards are emerging for columnar storage, querying, schemas, and data exchange. Future improvements include better interoperability, data abstractions, and governance.
From flat files to deconstructed database - Julien Le Dem
From flat files to deconstructed databases:
- Originally, Hadoop used flat files and MapReduce which was flexible but inefficient for queries.
- The database world used SQL and relational models with optimizations but were inflexible.
- Now components like storage, processing, and machine learning can be mixed and matched more efficiently with standards like Apache Calcite, Parquet, Avro and Arrow.
This document summarizes Apache Parquet and Apache Arrow, two open source projects for columnar data formats. It discusses how Parquet provides an on-disk columnar format for storage while Arrow provides an in-memory columnar format. The document outlines how Arrow builds on the success of Parquet by providing a common in-memory format that avoids serialization overhead and allows systems to share functionality. It provides examples of performance gains from the vertical integration of Parquet and Arrow.
The columnar roadmap: Apache Parquet and Apache Arrow - Julien Le Dem
This document discusses Apache Parquet and Apache Arrow, open source projects for columnar data formats. Parquet is an on-disk columnar format that optimizes I/O performance through compression and projection pushdown. Arrow is an in-memory columnar format that maximizes CPU efficiency through vectorized processing and SIMD. It aims to serve as a standard in-memory format between systems. The document outlines how Arrow builds on Parquet's success and provides benefits like reduced serialization overhead and ability to share functionality through its ecosystem. It also describes how Parquet and Arrow representations are integrated through techniques like vectorized reading and predicate pushdown.
Improving Python and Spark Performance and Interoperability with Apache Arrow - Julien Le Dem
This document discusses improving Python and Spark performance and interoperability with Apache Arrow. It begins with an overview of current limitations of PySpark UDFs, such as inefficient data movement and scalar computation. It then introduces Apache Arrow, an open source in-memory columnar data format, and how it can help by allowing more efficient data sharing and vectorized computation. The document shows how Arrow improved PySpark UDF performance by 53x through vectorization and reduced serialization. It outlines future plans to further optimize UDFs and integration with Spark and other projects.
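A minimal sketch of the vectorized, Arrow-backed UDF mechanism described above, assuming a recent PySpark with pandas and PyArrow installed; the column names and example data are made up.

```python
# Sketch of an Arrow-backed vectorized (pandas) UDF in PySpark.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .appName("arrow-vectorized-udf")
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-based conversion
         .getOrCreate())

@pandas_udf("double")
def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    # Runs on whole Arrow record batches instead of one Python call per row.
    return (temps - 32) * 5.0 / 9.0

df = spark.createDataFrame([(1, 98.6), (2, 212.0)], ["id", "temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```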
This document discusses the future of column-oriented data processing with Apache Arrow and Apache Parquet. Arrow provides an open standard for in-memory columnar data, while Parquet provides an open standard for on-disk columnar data storage. Together they provide interoperability across systems and high performance by avoiding data copying and format conversions. The document outlines the goals and benefits of Arrow and Parquet, how they improve CPU and I/O efficiency, and examples of performance gains from integrating systems like Spark and Pandas with Arrow.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
How to use Parquet as a basis for ETL and analytics - Julien Le Dem
Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 - Julien Le Dem
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through techniques like delta encoding, dictionary encoding, run-length encoding and binary packing designed for CPU and cache optimizations. Benchmark results show Parquet provides much better compression and faster query performance than other formats like text, Avro and RCFile. The project is developed as an open source community with contributions from many organizations.
Parquet is a columnar storage format for Hadoop data. It was developed by Twitter and Cloudera to optimize storage and querying of large datasets. Parquet provides more efficient compression and I/O compared to traditional row-based formats by storing data by column. Early results show a 28% reduction in storage size and up to a 114% improvement in query performance versus the original Thrift format. Parquet supports complex nested schemas and can be used with Hadoop tools like Hive, Pig, and Impala.
This document discusses Parquet, an open-source columnar file format for Hadoop. It was created at Twitter to optimize their analytics infrastructure, which includes several large Hadoop clusters processing data from their 200M+ users. Parquet aims to improve on existing storage formats by organizing data column-wise for better compression and scanning capabilities. It uses a row group structure and supports efficient reads of individual columns. Initial results at Twitter found a 28% space savings over existing formats and scan performance improvements. The project is open source and aims to continue optimizing column storage and enabling new execution engines for Hadoop.
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
Embedding Pig in scripting languages
1. Embedding Pig in scripting languages. What happens when you feed a Pig to a Python? Julien Le Dem – Principal Engineer, Content Platforms at Yahoo!, Pig committer. [email protected] @julienledem
2. Disclaimer: No animals were hurt in the making of this presentation. "I'm cute" / "I'm hungry". Picture credits: OZinOH: http://www.flickr.com/photos/75905404@N00/5421543577/ ; Stephen & Claire Farnsworth: http://www.flickr.com/photos/the_farnsworths/4720850597/
3. What for? Simplifying the implementation of iterative algorithms: loop and exit criteria, simpler User Defined Functions, easier parameter passing.
5. Pig Script(s)
-- one iteration: reads the '_0' inputs, writes the '_1' outputs
warshall_n_minus_1 = LOAD '$workDir/warshall_0' USING BinStorage
    AS (id1:chararray, id2:chararray, status:chararray);
to_join_n_minus_1 = LOAD '$workDir/to_join_0' USING BinStorage
    AS (id1:chararray, id2:chararray, status:chararray);
joined = COGROUP to_join_n_minus_1 BY id2, warshall_n_minus_1 BY id1;
followed = FOREACH joined
    GENERATE FLATTEN(followRel(to_join_n_minus_1, warshall_n_minus_1));
followed_byid = GROUP followed BY id1;
warshall_n = FOREACH followed_byid
    GENERATE group, FLATTEN(coalesceLine(followed.(id2, status)));
to_join_n = FILTER warshall_n BY $2 == 'notfollowed' AND $0 != $1;
STORE warshall_n INTO '$workDir/warshall_1' USING BinStorage;
STORE to_join_n INTO '$workDir/to_join_1' USING BinStorage;
6. External loop
#!/usr/bin/python
import os

# drive a static Pig script from an outer Python loop, shuffling files between runs
num_iter = int(10)
for i in range(num_iter):
    os.system('java -jar ./lib/pig.jar -x local plsi_singleiteration.pig')
    os.rename('output_results/p_z_u', 'output_results/p_z_u.' + str(i))
    os.system('cp output_results/p_z_u.nxt output_results/p_z_u')
    os.rename('output_results/p_z_u.nxt', 'output_results/p_z_u.' + str(i + 1))
    os.rename('output_results/p_s_z', 'output_results/p_s_z.' + str(i))
    os.system('cp output_results/p_s_z.nxt output_results/p_s_z')
    os.rename('output_results/p_s_z.nxt', 'output_results/p_s_z.' + str(i + 1))
9. So… what happens? Credits: Mango Atchar: http://www.flickr.com/photos/mangoatchar/362439607/
10. After: one script (to rule them all) containing the main program, the UDFs as script functions, and the embedded Pig statements. All of the algorithm is in one place.
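As a rough sketch of what that single script looks like with the Pig 0.9 embedding API (the aliases, paths, and parameter names below are placeholder assumptions, not the ones from the talk):

#!/usr/bin/python
from org.apache.pig.scripting import Pig

# compile the embedded Pig Latin once; '$input' and '$output' are bound at run time
P = Pig.compile("""
data     = LOAD '$input' AS (id:chararray, value:chararray);
filtered = FILTER data BY value IS NOT NULL;
STORE filtered INTO '$output';
""")

stats = P.bind({'input': 'in_dir', 'output': 'out_dir'}).runSingle()
if not stats.isSuccessful():
    raise RuntimeError('embedded Pig run failed')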
11. References. It uses JVM implementations of scripting languages (Jython, Rhino). This is a joint effort; see the following JIRAs: in Pig 0.8, PIG-928 (Python UDFs); in Pig 0.9, PIG-1479 (embedding) and PIG-1794 (JavaScript support). Doc: http://pig.apache.org/docs/
12. Examples: 1) a simple example with a fixed loop iteration; 2) adding a convergence criterion and accessing intermediary output; 3) a more advanced example with UDFs.
13. 1) A simple example. PageRank: a system of linear equations (as many as there are pages on the web, yeah, a lot). It can be approximated iteratively: compute the new PageRank based on the PageRanks of the previous iteration, starting with some value. Ref: http://en.wikipedia.org/wiki/PageRank
14. Or more visually: each page sends a fraction of its PageRank to the pages it links to, inversely proportional to its number of outgoing links.
16. Let's zoom in. Pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Iterate 10 times. Pass parameters as a dictionary. Just run P, which was declared above. The output becomes the new input.
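A sketch of the kind of program that slide describes, modeled on the PageRank example in the Pig documentation; the load schema, the damping-factor value, and the paths are assumptions rather than a transcription of the slide:

#!/usr/bin/python
from org.apache.pig.scripting import Pig

P = Pig.compile("""
previous_pagerank = LOAD '$docs_in'
    AS (url:chararray, pagerank:float, links:{link:(url:chararray)});
outbound_pagerank = FOREACH previous_pagerank
    GENERATE pagerank / COUNT(links) AS pagerank, FLATTEN(links) AS to_url;
new_pagerank = FOREACH (COGROUP outbound_pagerank BY to_url, previous_pagerank BY url INNER)
    GENERATE group AS url,
             (1 - $d) + $d * SUM(outbound_pagerank.pagerank) AS pagerank,
             FLATTEN(previous_pagerank.links) AS links;
STORE new_pagerank INTO '$docs_out';
""")

params = {'d': '0.85', 'docs_in': 'data/pagerank_in'}
for i in range(10):                      # iterate 10 times
    out = 'out/pagerank_' + str(i + 1)
    params['docs_out'] = out
    Pig.fs('rmr ' + out)                 # clear any previous output for this iteration
    stats = P.bind(params).runSingle()   # pass parameters as a dictionary
    if not stats.isSuccessful():
        raise RuntimeError('iteration ' + str(i) + ' failed')
    params['docs_in'] = out              # the output becomes the new input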
17. Practical result: applied to the English Wikipedia link graph (http://wiki.dbpedia.org/Downloads36#owikipediapagelinks), it turns out that the max PageRank is awarded to http://en.wikipedia.org/wiki/United_States. Thanks @ogrisel for the suggestion.
18. 2) The same example, one step further: now let's define a threshold as a convergence criterion instead of a fixed iteration count.
19. Same thing as previously, plus computation of the maximum difference from the previous iteration… continued on the next slide.
20. The main program: the parameter-less bind() uses the variables in the current scope. We can easily read the output of Pig from the grid. Stop if we reach the threshold.
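Continuing the earlier sketch, and assuming the compiled script P additionally computes the per-page difference and STOREs its maximum into '$max_diff' under an alias named max_diff, the convergence loop might look like this:

d = 0.85
docs_in = 'data/pagerank_in'
threshold = 0.01

for i in range(10):
    docs_out = 'out/pagerank_' + str(i + 1)
    max_diff = 'out/max_diff_' + str(i + 1)
    Pig.fs('rmr ' + docs_out)
    Pig.fs('rmr ' + max_diff)
    # parameter-less bind(): picks up d, docs_in, docs_out, max_diff from this scope
    stats = P.bind().runSingle()
    # read the stored aggregate back from the grid via the run's statistics
    diff = float(str(stats.result('max_diff').iterator().next().get(0)))
    if diff < threshold:                 # stop once we reach the threshold
        break
    docs_in = docs_out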
21. 3) Now something more complex: compute a transitive closure, i.e. find the connected components of a graph. Useful if you're doing de-duplication; requires iterations and UDFs.
23. Convergence: it converges in log2(max(diameter of a component)) iterations, where the diameter is "the longest shortest path". Bottom line: the number of iterations will be reasonable (for example, 10 iterations already cover components with a diameter of up to 2^10 = 1024).
24. UDFs are in the same script as the main program. Zoom on the next slide. Page 1/3.
25. Zoom on UDFs: the output schema of the UDF is defined using a decorator. The native structures of the language can be used directly.
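For illustration, a small Jython UDF in the shape the slide describes; the @outputSchema decorator is provided by Pig's Jython support when the script is run by Pig (no import needed), and the function name and logic below are placeholders, not the talk's followRel/coalesceLine:

# output schema declared via decorator: a tuple (min_id:chararray, size:int)
@outputSchema("component:(min_id:chararray, size:int)")
def summarize(members):
    # 'members' arrives as a Pig bag, iterable as tuples; plain Python values go back out
    ids = [t[0] for t in members]
    return (min(ids), len(ids))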
27. Zoom on the Pig script… UDFs are directly available; no extra declaration is needed.
28. Zoom on the loop: iterate a maximum of 10 times (2^10 being the maximum diameter of a component). Convergence criterion: all links have been followed.
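One way the loop can test that criterion, sketched under the assumption that the compiled script P STOREs the still-unfollowed links under an alias named to_join_n (as in the Pig script shown earlier) and that the bound variables are in scope:

for i in range(10):                       # at most 10 iterations
    stats = P.bind().runSingle()
    remaining = stats.result('to_join_n').iterator()
    if not remaining.hasNext():           # no 'notfollowed' links left: converged
        break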
29. Final part: formatting. Turning the graph representation into a component-list representation. This is necessary when we have UDFs, so that the script can be evaluated again on the slaves without running main(). Page 3/3.
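The usual way to get that behaviour in Python is the standard entry-point guard; a minimal sketch, assuming this is what the slide's callout points at:

def main():
    # compile, bind, and run the embedded Pig statements here
    pass

if __name__ == '__main__':
    # only the front end runs main(); when the same file is re-evaluated on the
    # slaves to define the UDFs, __name__ is not '__main__' and main() is skipped
    main()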
30. One more thing… I presented Python, but JavaScript is available as well (experimental). The framework is extensible: any JVM implementation of a language could be integrated (contribute!). The examples can be found at https://github.com/julienledem/Pig-scripting-examples