“The Workflow Abstraction”

                     Strata SC
                     2013-02-28

                     Paco Nathan
                     Concurrent, Inc.
                     San Francisco, CA
                     @pacoid




                   Copyright ©2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           1
Background: a dual background in quantitative methods and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
The Workflow Abstraction

[diagram: Cascading word-count flow — Document Collection → Tokenize → Scrub token (M) → HashJoin Left against a Regex-tokenized Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines


This talk is about the workflow abstraction:
 * the business process of structuring data
 * the practices of building robust apps at scale
 * the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, and trendlines --
what are the drivers that brought us here, and where is this work heading?

Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
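The word-count flow in the outline diagram -- tokenize, scrub each token, filter against a stop-word list, group by token, count -- can be sketched in plain Python. (The actual example in the talk uses Cascading's HashJoin and GroupBy pipes; this sketch only mirrors the logic, not the API.)

```python
import re
from collections import Counter

def word_count(docs, stop_words):
    """Tokenize -> scrub token -> filter stop words -> group/count."""
    counts = Counter()
    for doc in docs:
        for token in re.split(r"\s+", doc):
            token = re.sub(r"[^\w]", "", token).lower()  # scrub punctuation
            if token and token not in stop_words:
                counts[token] += 1
    return counts

counts = word_count(["A rain shadow is a dry area", "rain rain go away"],
                    stop_words={"a", "is", "go"})
# counts["rain"] == 3
```

Where Cascading expresses the stop-word filter as a left HashJoin, the sketch uses a simple set-membership test; at scale the join formulation is what lets the same logic run in parallel.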
Marketing Funnel – overview

             In reference to Making Data Work…

             Almost every business uses a model
             similar to this – give or take a few steps.

             Customer leads go in at the top,
             those get refined through several stages,
             then results flow out the bottom.

[diagram: funnel stages — Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]

Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
Marketing Funnel – clickstream

              Different funnel stages get represented
              in ecommerce by events captured in
              log files, as a class of machine data
              called clickstream:

                •   ad impressions
                •   URL clicks
                •   landing page views
                •   new user registrations
                •   session cookies
                •   online purchases
                •   social network activity
                •   etc.

[diagram: funnel stages annotated with clickstream events — Impression → Awareness, Click → Interest, Sign Up → Evaluation, Purchase → Conversion, "Like" → Referral]

Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
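A first structuring step over that unstructured clickstream is rolling raw events up into funnel stages. A minimal sketch, assuming a hypothetical event-name mapping (real log schemas vary by ad network):

```python
# Hypothetical event names -- real clickstream schemas differ per vendor.
EVENT_TO_STAGE = {
    "impression": "Awareness",
    "click": "Interest",
    "signup": "Evaluation",
    "purchase": "Conversion",
    "like": "Referral",
}

def stage_counts(events):
    """Roll raw clickstream events up into per-stage counts."""
    counts = {}
    for e in events:
        stage = EVENT_TO_STAGE.get(e)
        if stage:
            counts[stage] = counts.get(stage, 0) + 1
    return counts

stage_counts(["impression", "impression", "click"])
# {"Awareness": 2, "Interest": 1}
```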
Marketing Funnel – metrics

             A variety of clickstream metrics can
             be used as performance indicators
             at different stages of the funnel:

              •    CPM: cost per thousand impressions
              •    CTR: click-through rate
              •    CPA: cost per action
              •    etc.

[diagram: funnel stages paired with metrics — Awareness: CPM, Interest: CTR, Evaluation: behaviors, Conversion: CPA, Referral: NPS, social graph, etc., Repeat: loyalty, win back, etc.]

The many different highly-nuanced metrics which apply are mind-boggling :)
Marketing Funnel – example calculations

[diagram: funnel stages — Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]

                  metric    cost      events    formula                   rate
                  CPM       $4,000    10^6      $4,000 ÷ (10^6 ÷ 10^3)    $4.00
                  CTR       -         3∙10^3    3∙10^3 ÷ 10^6             0.3%
                  CPA       -         20        $4,000 ÷ 20               $200

Here are examples of the kinds of calculations performed...
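The table's arithmetic is simple enough to state directly; a quick Python sketch reproducing the slide's numbers:

```python
def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost / (impressions / 1_000)

def ctr(clicks, impressions):
    """Click-through rate."""
    return clicks / impressions

def cpa(cost, actions):
    """Cost per action."""
    return cost / actions

# Reproducing the table:
assert cpm(4_000, 10**6) == 4.00        # $4.00 CPM
assert ctr(3 * 10**3, 10**6) == 0.003   # 0.3% CTR
assert cpa(4_000, 20) == 200            # $200 CPA
```

The hard part at scale is not these formulas but producing the inputs: aligning events from many vendors and log formats before the division.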
Marketing Funnel – predictive model

             Given these metrics, we can go further
             to estimate cost per paying user (CPP),
             customer lifetime value (LTV), etc.

             Then we can build a predictive model for
             return on investment (ROI) per customer,
             summarizing the funnel performance:

                     ROI = (LTV − CPP) ∕ CPP

             As an example, after crunching lots of logs,
             suppose that…

                     CPP = $200
                     LTV = $2000

                     ROI = ($2000 − $200) ∕ $200

             for a 9x multiple

[diagram: funnel stages — Customers → Campaigns → Awareness → Interest → Evaluation → Conversion → Referral → Repeat]
For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
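The ROI formula above, checked against the slide's worked example:

```python
def roi(ltv, cpp):
    """Return on investment per customer: (LTV - CPP) / CPP."""
    return (ltv - cpp) / cpp

assert roi(ltv=2000, cpp=200) == 9.0  # the slide's 9x multiple
```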
Marketing Funnel – example architecture

             Let’s consider an example architecture
             for calculating, reporting, and taking action
             on funnel metrics, based on large-scale
             clickstream data…

[diagram: a Web App serving Customers emits logs; the logs feed a Data Workflow on a Hadoop Cluster through a source tap (with a trap tap for bad records); the workflow writes through sink taps to a cache and to customer profile/prefs DBs, and feeds Support, Modeling (PMML), Analytics Cubes, and Reporting]

Here’s an example architecture of using clickstream metrics within an online business.
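The source/sink/trap vocabulary in the diagram comes from Cascading: a workflow reads from source taps, writes to sink taps, and diverts records that fail processing to a trap. A minimal Python sketch of that pattern only, not the Cascading API:

```python
def run_workflow(source, sink, trap, transform):
    """Source tap -> transform -> sink tap, with a trap for bad records."""
    for record in source:
        try:
            sink.append(transform(record))
        except Exception:
            trap.append(record)  # diverted, not dropped -- inspect later

sink, trap = [], []
run_workflow(["1", "x", "2"], sink, trap, int)
# sink == [1, 2]; trap == ["x"]
```

Traps matter in clickstream work because log data is dirty at scale: a workflow that dies on the first malformed record never finishes its batch window.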
Marketing Funnel – complexities

             Multiple ad partners, different contract
             terms, reporting different metrics at
             different times, click scrubs, etc.

             Campaigns target specific geo/demo,
             test alternate landing pages, probably
             need to segment the customer base…

             These issues make clickstream data
             large and yet sparse.

             Other issues:

             • seasonal variation
             • fluctuating currency exchange rates
             • distortions due to credit card fraud
             • diminishing returns
             • forecasting requirements

[diagram: funnel stages and metrics as before, annotated with × marks]
However, real life intercedes. In many businesses, this is a complicated model to calculate correctly:

 * click scrubs
 * many vendors, data sources, different metrics to be aligned
 * lots of roll-ups
 * Bayesian point estimates
 * forecasts and dashboards

The social dimension makes this convoluted -- it is not simple.
Marketing Funnel – very large scale

             Even a small start-up may need to
             make decisions about billions of
             events, many millions of users, and
             millions of dollars in annual ad spend.

             Ad networks attempt to simplify and
             optimize parts of the funnel process
             as a value-add.

             The need for these insights has been a
             driver for Hadoop-related technologies.

[diagram: funnel stages paired with metrics, as before]

The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
Marketing Funnel – very large scale

            funnel modeling and optimization
            requires complex data workflows
            to obtain the required insights

These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
The Workflow Abstraction

[diagram: Cascading word-count flow, as before]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines

A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
Circa 2008 – Hadoop at scale

             Scenario: Analytics team at a large ad network…

             Company had invested $MM capex in a
             large data warehouse across LOBs

             Mission-critical app had been written as
             a large SQL workflow in the DW

             Marketing funnel metrics were estimated
             for many advertisers, many campaigns,
             many publishers, many customers –
             billions of calculations daily

             Predictive models matched publisher ~ advertiser
             and campaign ~ user, to optimize marketing
             funnel performance

[diagram: funnel stages alongside the app's workflow — query/load clickstream into an RDBMS, roll-ups, collaborative filtering, per-user recommendations]

Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.

Most of the revenue depended on one app, written in the DW -- monolithic SQL which nobody at the company understood.
Circa 2008 – Hadoop at scale

             Issues:

              • critical app had hit hard limits for scalability
              • several Tb data, 100’s of servers
              • batch window length vs. failure rate vs. SLA

                in the context of business growth posed
                an existential risk

             We built out a team to address these issues
             as rapidly as possible…
             Needed to re-create the data workflows
             based on Enterprise requirements.

             [diagram: funnel stages Customers, Campaigns, Awareness,
              Interest, Evaluation, Conversion, Referral, Repeat;
              workflow clickstream query/load → RDBMS → roll-ups,
              collab filter, per-user recommends]

Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
Circa 2008 – Hadoop at scale

             Approach:

              • reverse-engineered business process from
                ~1500 lines of undocumented SQL
              • created a large, multi-step Apache Hadoop
                app on AWS
              • leveraged cloud strategy to trade $MM
                capex for lower, scalable opex
              • Amazon identified our app as one of the
                largest Hadoop deployments on EC2
              • our app became a case study for AWS
                prior to Elastic MapReduce launch

             [diagram: clickstream query/load → msg queue → HDFS →
              roll-ups, collab filter, per-user recommends; RDBMS]

Our solution involved dependencies among more than a dozen Hadoop job steps.
Circa 2008 – Hadoop at scale

             Unresolved:

              • ETL was still a separate app
              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow
              • data scientists wore beepers since Ops
                lacked visibility into business process
              • coding directly in MapReduce created
                a staffing bottleneck

             [diagram: same workflow as before, with failure
              points (×) marked across the msg queue and HDFS stages]

This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:
 * staffing bottleneck unless there’s a good abstraction layer
 * operational complexity, mostly due to lack of transparency
 * system integration problems *are* the main problem to solve
Circa 2008 – Hadoop at scale

             Unresolved:

              • ETL was still a separate app
              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow
              • data scientists wore beepers since Ops
                lacked visibility into the app’s business logic
              • coding directly in MapReduce created
                a staffing bottleneck

                        a good solution for a large, commercial
                        Apache Hadoop deployment, but
                        workflow management lacked crucial
                        features…

                        which led to a search for a better
                        workflow abstraction

             [diagram: clickstream query/load → msg queue → HDFS →
              roll-ups, collab filter, per-user recommends; RDBMS]

While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
The Workflow Abstraction

             [diagram: Word Count flow – Document Collection →
              Tokenize → Scrub token (M) → HashJoin Left with
              Stop Word List (RHS) → Regex token → GroupBy token (R)
              → Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines

Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.




Cascading initially grew from interaction with the Nutch project, before Hadoop had a name

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context, without an abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – back to LISP in the 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows:

               • leverages JVM and Java-based tools without any need
                    to create an entirely new language
               •    allows many programmers who have J2EE expertise
                    to build apps that leverage the economics of Hadoop
                    clusters
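
As an illustration of this functional style (a minimal, hypothetical sketch using plain JDK streams rather than the Cascading API; the class name is invented), the same tokenize → group → count pipeline shape can be expressed as a composition of pure transformations:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FunctionalWordCount {
    // tokenize -> group by token -> count, expressed as a functional pipeline
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("[ \\[\\](),.]+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount("rain rain go away");
        System.out.println(counts.get("rain")); // 2
    }
}
```

The point of the analogy: each stage is a side-effect-free transformation over a stream of tuples, which is what makes such pipelines straightforward to parallelize across a cluster.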




Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
quotes…

                       “Cascading gives Java developers the ability to build
                        Big Data applications on Hadoop using their existing
                        skillset … Management can really go out and build a
                        team around folks that are already very experienced
                        with Java. Switching over to this is really a very short
                        exercise.”
                            Thor Olavsrud, CIO.com
                            2012-06-06
                            cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

                       “Masks the complexity of MapReduce, simplifies the
                        programming, and speeds you on your journey toward
                        actionable analytics … A vast improvement over native
                        MapReduce functions or Pig UDFs.”
                            2012 BOSSIE Awards, James Borck
                            2012-09-18
                            infoworld.com/slideshow/65089




Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”

The issues:
 * staffing bottleneck
 * operational complexity
 * system integration
Cascading – deployments

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera
              • 5+ year history of Enterprise production deployments,
                   ASL 2 license, GitHub src, http://conjars.org
              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.




Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
examples…

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                           deployments
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


                     github.com/nathanmarz/cascalog/wiki
                     github.com/twitter/scalding/wiki




Many case studies, many Enterprise production deployments now for 5+ years.
examples…

                        • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                            in functional programming open source projects atop
                            Cascading – used for their large-scale production
                            deployments
                        •   new case studies for Cascading apps are mostly
                            based on domain-specific languages (DSLs) in JVM
                            languages which emphasize functional programming:

                            Cascalog in Clojure (2010)
                            Scalding in Scala (2012)

                                          Cascading as the basis for workflow
                                          abstractions atop Hadoop and more,
                                          with a 5+ year history of production
                                          deployments across multiple verticals

                       github.com/nathanmarz/cascalog/wiki
                       github.com/twitter/scalding/wiki

Cascading as a basis for workflow abstraction, for Enterprise data workflows
The Workflow Abstraction

             [diagram: Word Count flow – Document Collection →
              Tokenize → Scrub token (M) → HashJoin Left with
              Stop Word List (RHS) → Regex token → GroupBy token (R)
              → Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines

Code samples in Cascading / Cascalog / Scalding, based on Word Count
The Ubiquitous Word Count

             [diagram: Document Collection → Tokenize (M) →
              GroupBy token → Count (R) → Word Count]

             Definition:

               count how often each word appears
               in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it illustrates:

              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.

                void map (String doc_id, String text):
                  for each word w in segment(text):
                    emit(w, "1");

                void reduce (String word, Iterator group):
                  int count = 0;
                  for each pc in group:
                    count += Int(pc);
                  emit(word, String(count));

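
The map/reduce pseudocode for Word Count can be simulated on a single machine in plain Java (a hypothetical single-process sketch, not Hadoop code; class and method names are invented): the map phase emits (word, 1) pairs, the shuffle groups them by key, and the reduce phase sums each group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    // map phase: emit a (word, 1) pair for each token in the text
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) emitted.add(Map.entry(w, 1));
        }
        return emitted;
    }

    // shuffle + reduce phase: group pairs by word, sum the partial counts
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> wc = reduce(map("a b a c a b"));
        System.out.println(wc); // counts: a -> 3, b -> 2, c -> 1
    }
}
```

In a real Hadoop deployment the emitted pairs are partitioned across many reducers by key, but the dataflow shape is exactly this.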
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
word count – conceptual flow diagram

             [diagram: Document Collection → Tokenize (M) →
              GroupBy token → Count (R) → Word Count]

                1 map
                1 reduce
               18 lines code

                cascading.org/category/impatient
                gist.github.com/3900702

Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java

             [diagram: Document Collection → Tokenize (M) →
              GroupBy token → Count (R) → Word Count]

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];
             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into a token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );
             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();

Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
word count – generated flow diagram

             [diagram: Document Collection → Tokenize (M) →
              GroupBy token → Count (R) → Word Count]

             [head]
               |
             Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
               |  [{2}:'doc_id', 'text']
               |  [{2}:'doc_id', 'text']
               |                                              (map)
             Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
               |  [{1}:'token']
               |  [{1}:'token']
               |
             GroupBy('wc')[by:['token']]
               |  wc[{1}:'token']
               |  [{1}:'token']
               |                                              (reduce)
             Every('wc')[Count[decl:'count']]
               |  [{2}:'token', 'count']
               |  [{1}:'token']
               |
             Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
               |  [{2}:'token', 'count']
               |  [{2}:'token', 'count']
               |
             [tail]

Friday, 01 March 13                                                                                                                                                                                     29
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
word count – Cascalog / Clojure
                      [flow diagram: Document Collection → M: Tokenize → GroupBy token → R: Count → Word Count]

             (ns impatient.core
               (:use [cascalog.api]
                     [cascalog.more-taps :only (hfs-delimited)])
               (:require [clojure.string :as s]
                         [cascalog.ops :as c])
               (:gen-class))

             (defmapcatop split [line]
               "reads in a line of string and splits it by regex"
               (s/split line #"[\[\]\(\),.\s]+"))

             (defn -main [in out & args]
               (?<- (hfs-delimited out)
                    [?word ?count]
                    ((hfs-delimited in :skip-header? true) _ ?line)
                    (split ?line :> ?word)
                    (c/count ?count)))

             ; Paul Lam
             ; github.com/Quantisan/Impatient
Friday, 01 March 13                                                                                                     30
Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure
             github.com/nathanmarz/cascalog/wiki

                      [flow diagram: Document Collection → M: Tokenize → GroupBy token → R: Count → Word Count]

               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn




Friday, 01 March 13                                                                                                                                                   31
From what we see of language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala
                      [flow diagram: Document Collection → M: Tokenize → GroupBy token → R: Count → Word Count]

           import com.twitter.scalding._

           class WordCount(args : Args) extends Job(args) {
             Tsv(args("doc"),
                  ('doc_id, 'text),
                  skipHeader = true)
               .read
               .flatMap('text -> 'token) {
                  text : String => text.split("[ \\[\\](),.]")
                }
               .groupBy('token) { _.size('count) }
               .write(Tsv(args("wc"), writeHeader = true))
           }




Friday, 01 March 13                                                                                                                32
Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
word count – Scalding / Scala
             github.com/twitter/scalding/wiki

                      [flow diagram: Document Collection → M: Tokenize → GroupBy token → R: Count → Word Count]

                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language




Friday, 01 March 13                                                                                                                                                                                                                 33
If you wanted to see what a data services architecture for machine learning at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
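Since Scalding “extends the Scala collections API so that distributed lists become pipes,” the WordCount pipeline above (flatMap a tokenizer over 'text, group by 'token, count) maps almost 1:1 onto ordinary collection operations. As a rough analogy using only the JDK — plain Java streams standing in for Scalding pipes; the class name and sample input here are illustrative, not from the talk:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountAnalogy {
  // mirrors: .flatMap('text -> 'token) { _.split(...) }
  //          .groupBy('token) { _.size('count) }
  static Map<String, Long> wordCount(List<String> lines) {
    return lines.stream()
      .flatMap(line -> Arrays.stream(line.split("[ \\[\\](),.]")))
      .filter(tok -> !tok.isEmpty())   // adjacent delimiters yield empty splits
      .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("a b a", "b (c)");
    System.out.println(wordCount(docs));  // counts: a=2, b=2, c=1
  }
}
```

The point of the analogy: in Scalding the same fluent chain describes a distributed computation, with Cascading planning it onto a cluster rather than evaluating it in local memory.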
word count – Scalding / Scala
             github.com/twitter/scalding/wiki

                      [flow diagram: Document Collection → M: Tokenize → GroupBy token → R: Count → Word Count]

               • extends the Scala collections API so that distributed lists
                 become “pipes” backed by Cascading
               • code is compact, easy to understand
               • nearly 1:1 between elements of conceptual flow diagram
                 and function calls
               • extensive libraries are available for linear algebra, abstract
                 algebra, machine learning – e.g., Matrix API, Algebird, etc.
               • significant investments by Twitter, Etsy, eBay, etc.
               • great for data services at scale
                 (imagine SOA infra @ Google as an open source project)
               • less learning curve than Cascalog,
                 not as much of a high-level language

                      [callout overlay: “Cascalog and Scalding DSLs
                       leverage the functional aspects of MapReduce,
                       helping to limit complexity in process”]



Friday, 01 March 13                                                                                                                                                                                   34
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
                      [flow diagram: Document Collection → M: Tokenize → Scrub token →
                       HashJoin Left + Stop Word List (RHS) → Regex token → GroupBy token → R: Count → Word Count]

                     1. Funnel
                     2. Circa 2008
                     3. Cascading
                     4. Sample Code
                     5. Workflows
                     6. Abstraction
                     7. Trendlines


Friday, 01 March 13                                                                                                                                              35
Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
Enterprise Data Workflows
             Back to our marketing funnel, let’s consider
             an example app… at the front end

             LOB use cases drive demand for apps

                      [architecture diagram: Customers → Web App → logs → Cache;
                       Support, Modeling (PMML), Analytics Cubes, and Reporting
                       around a Data Workflow on the Hadoop Cluster,
                       wired via source/sink/trap taps; customer profile DBs / Customer Prefs]
Friday, 01 March 13                                                                               36
LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
              An example… in the back office

              Organizations have substantial investments
              in people, infrastructure, process

                      [architecture diagram: Customers → Web App → logs → Cache;
                       Support, Modeling (PMML), Analytics Cubes, and Reporting
                       around a Data Workflow on the Hadoop Cluster,
                       wired via source/sink/trap taps; customer profile DBs / Customer Prefs]
Friday, 01 March 13                                                                                                            37
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
               An example… for the heavy lifting!

               “Main Street” firms are migrating
               workflows to Hadoop, for cost
               savings and scale-out

                      [architecture diagram: Customers → Web App → logs → Cache;
                       Support, Modeling (PMML), Analytics Cubes, and Reporting
                       around a Data Workflow on the Hadoop Cluster,
                       wired via source/sink/trap taps; customer profile DBs / Customer Prefs]
Friday, 01 March 13                                                                                                     38
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Cascading workflows – taps

               •   taps integrate other data frameworks, as tuple streams
               •   these are “plumbing” endpoints in the pattern language
               •   sources (inputs), sinks (outputs), traps (exceptions)
               •   text delimited, JDBC, Memcached,
                   HBase, Cassandra, MongoDB, etc.
               •   data serialization: Avro, Thrift,
                   Kryo, JSON, etc.
               •   extend a new kind of tap in just
                   a few lines of Java

             schema and provenance get
             derived from analysis of the taps

                      [architecture diagram: Customers → Web App → logs → Cache;
                       Support, Modeling (PMML), Analytics Cubes, and Reporting
                       around a Data Workflow on the Hadoop Cluster,
                       wired via source/sink/trap taps; customer profile DBs / Customer Prefs]
Friday, 01 March 13                                                                                                       39
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
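To make the “plumbing endpoint” idea concrete: a text-delimited source tap essentially turns lines of TSV into a stream of tuples keyed by a header row. A minimal plain-Java sketch of that behavior — the class and method names here are hypothetical illustrations, not the Cascading API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TsvSourceSketch {
  // parse TSV text with a header row into a list of field->value tuples,
  // roughly the tuple stream a text-delimited source tap would emit
  static List<Map<String, String>> readTuples(String tsv) {
    String[] lines = tsv.split("\n");
    String[] fields = lines[0].split("\t");   // header declares the field names
    List<Map<String, String>> tuples = new ArrayList<>();
    for (int i = 1; i < lines.length; i++) {
      String[] values = lines[i].split("\t");
      Map<String, String> tuple = new LinkedHashMap<>();
      for (int j = 0; j < fields.length; j++)
        tuple.put(fields[j], values[j]);
      tuples.add(tuple);
    }
    return tuples;
  }

  public static void main(String[] args) {
    String tsv = "doc_id\ttext\ndoc01\tA rain shadow is a dry area";
    System.out.println(readTuples(tsv));
  }
}
```

An actual tap additionally abstracts over the storage system (HDFS, JDBC, HBase, etc.), which is why swapping endpoints does not change the workflow logic.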
Cascading workflows – taps

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );

            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();

                                                          [callout: source and sink taps
                                                           for TSV data in HDFS]


Friday, 01 March 13                                                                                                        40
Here are the taps in the WordCount source
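The splitter’s character class treats spaces, brackets, parens, commas, and periods as token boundaries. A quick plain-Java check of how that pattern behaves — the demo class and sample sentence are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RegexSplitDemo {
  // the same character class the RegexSplitGenerator uses on the slide
  static final String PATTERN = "[ \\[\\](),.]";

  static List<String> tokens(String line) {
    return Arrays.stream(line.split(PATTERN))
      .filter(t -> !t.isEmpty())   // adjacent delimiters produce empty splits
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(tokens("A rain shadow (orographic) is dry."));
    // → [A, rain, shadow, orographic, is, dry]
  }
}
```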
Cascading workflows – topologies

               • topologies execute workflows on clusters
               • flow planner is like a compiler for queries
                 - Hadoop (MapReduce jobs)
                 - local mode (dev/test or special config)
                 - in-memory data grids (real-time)
               • flow planner can be extended
                 to support other topologies

             blend flows in different topologies
             into the same app – for example,
             batch (Hadoop) + transactions (IMDG)

                      [architecture diagram: Customers → Web App → logs → Cache;
                       Support, Modeling (PMML), Analytics Cubes, and Reporting
                       around a Data Workflow on the Hadoop Cluster,
                       wired via source/sink/trap taps; customer profile DBs / Customer Prefs]
Friday, 01 March 13                                                                                                                                         41
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
Cascading workflows – topologies

(flow planner for the Apache Hadoop topology)

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();

Here is the flow planner for Hadoop in the WordCount source example.

topologies…
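The RegexSplitGenerator above splits each text line on a regex of delimiter characters, emitting one "token" per segment. As a rough sketch of what that operator produces, here is the same split in plain Java (java.util.regex only, no Cascading dependency); the class and method names are ours, not part of the Cascading API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenizeSketch {
  // same delimiter regex as the RegexSplitGenerator in the WordCount flow:
  // split on spaces, square brackets, parens, commas, and periods
  static final String DELIM = "[ \\[\\](),.]";

  // emit one token per segment, dropping the empty strings that appear
  // between adjacent delimiters
  public static List<String> tokenize( String line ) {
    return Arrays.stream( line.split( DELIM ) )
      .filter( s -> !s.isEmpty() )
      .collect( Collectors.toList() );
  }

  public static void main( String[] args ) {
    // → [rain, shadow, rain, shadow]
    System.out.println( tokenize( "rain shadow (rain, shadow)" ) );
  }
}
```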




Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base
               • ANSI SQL parser/optimizer atop the Cascading flow planner
               • JDBC driver to integrate into existing tools and app servers
               • relational catalog over a collection of unstructured data
               • SQL shell prompt to run queries
               • enable analysts without retraining on Hadoop, etc.
               • transparency for Support, Ops, Finance, et al.
               • a language for queries – not a database, but ANSI SQL as a DSL for workflows

[diagram: Enterprise data workflow – Customers, Web App, Logs, and Cache feeding source/sink/trap taps into a Data Workflow on a Hadoop Cluster, with Modeling (PMML), Analytics Cubes, Reporting, and Customer profile DBs / Prefs]
ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.

BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
ANSI SQL – CSV data in local file system




               cascading.org/lingual


The test database for MySQL is available for download from https://launchpad.net/test-db/

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
ANSI SQL – shell prompt, catalog




                cascading.org/lingual


Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries




              cascading.org/lingual


Here’s an example SQL query on that “employee” test database from MySQL.
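The query shown on the slide image is not preserved in this transcript. As a stand-in, here is a representative ANSI SQL query of the kind Lingual can run against that "employees" test database – the schema prefix and the exact query shape are our assumptions, not the slide's:

```sql
SELECT e.emp_no, e.last_name, t.title
FROM employees.employees AS e
JOIN employees.titles AS t ON t.emp_no = e.emp_no
WHERE t.title = 'Senior Engineer';
```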
Cascading workflows – machine learning

              • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
              • Cascading creates parallelized models to run at scale on Hadoop clusters
              • Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.
              • integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)
              • 2 lines of code or pre-built JAR

             Run multiple variants of models as customer experiments

[diagram: same Enterprise data workflow – source/sink/trap taps, Hadoop Cluster, Modeling (PMML), Analytics Cubes, Reporting, Customer profile DBs / Prefs]
PMML has been around for a while, and export is supported by nearly every commercial analytics platform,
covering a wide variety of predictive modeling algorithms.

Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.

Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)

Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
model creation in R

                       ## load the required libraries
                       library(randomForest)
                       library(pmml)

                       ## train a RandomForest model

                       f <- as.formula("as.factor(label) ~ .")
                       fit <- randomForest(f, data_train, ntree=50)

                       ## test the model on the holdout test set

                       print(fit$importance)
                       print(fit)

                       predicted <- predict(fit, data)
                       data$predicted <- predicted
                       confuse <- table(pred = predicted, true = data[,1])
                       print(confuse)

                       ## export predicted labels to TSV

                       write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                         quote=FALSE, sep="\t", row.names=FALSE)

                       ## export RF model to PMML

                       saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))


               cascading.org/pattern
Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
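The `table(pred = predicted, true = ...)` call above tallies a confusion matrix of predicted versus actual labels. For readers unfamiliar with R, here is the same tally sketched in plain Java; this is illustrative only, and the class and method names are ours:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfusionSketch {
  // count (predicted, actual) label pairs, as R's table(pred, true) does
  public static Map<String, Integer> confusion( List<String> predicted, List<String> actual ) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for ( int i = 0; i < predicted.size(); i++ ) {
      String key = "pred=" + predicted.get( i ) + ",true=" + actual.get( i );
      counts.merge( key, 1, Integer::sum );
    }
    return counts;
  }

  public static void main( String[] args ) {
    List<String> pred = List.of( "1", "0", "1", "1" );
    List<String> truth = List.of( "1", "0", "0", "1" );
    // two true positives, one true negative, one false positive
    System.out.println( confusion( pred, truth ) );
  }
}
```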
model run at scale as a Cascading app

[conceptual flow diagram: Customer Orders (source) and a PMML Model feed a Classify step (map side), producing Scored Orders; an Assert step diverts failures to Failure Traps; GroupBy token and Count (reduce side) tally a Confusion Matrix (sink)]

               cascading.org/pattern
Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a “confusion matrix” for quantifying the model performance.
model run at scale as a Cascading app

                      public class Main {
                        public static void main( String[] args ) {
                          String pmmlPath = args[ 0 ];
                          String ordersPath = args[ 1 ];
                          String classifyPath = args[ 2 ];
                          String trapPath = args[ 3 ];

                          Properties properties = new Properties();
                          AppProps.setApplicationJarClass( properties, Main.class );
                          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

                          // create source and sink taps
                          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
                          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
                          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

                          // define a "Classifier" model from PMML to evaluate the orders
                          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
                          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

                          // connect the taps, pipes, etc., into a flow
                          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                           .addSource( classifyPipe, ordersTap )
                           .addTrap( classifyPipe, trapTap )
                           .addSink( classifyPipe, classifyTap );

                          // write a DOT file and run the flow
                          Flow classifyFlow = flowConnector.connect( flowDef );
                          classifyFlow.writeDOT( "dot/classify.dot" );
                          classifyFlow.complete();
                        }
                      }
Source code for a simple Cascading app that runs PMML models in general.
PMML support…




Popular tools which can create predictive models for export as PMML
Cascading workflows – test-driven development

               •   assert patterns (regex) on the tuple streams
               •   adjust assert levels, like log4j levels
               •   trap edge cases as “data exceptions”
               •   TDD at scale:
                   1. start from raw inputs in the flow graph
                   2. define stream assertions for each stage of transforms
                   3. verify exceptions, code to remove them
                   4. when impl is complete, app has full test coverage
               •   TDD follows from Cascalog’s composable subqueries
               •   redirect traps in production to Ops, QA, Support, Audit, etc.

[diagram: same Enterprise data workflow – source/sink/trap taps, Hadoop Cluster, Modeling (PMML), Analytics Cubes, Reporting, Customer profile DBs / Prefs]
TDD is not usually high on the list when people start discussing Big Data apps.

The notion of a “data exception” was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application.

Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates to composing those predicates into large-scale apps.
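A “data exception” can be pictured as a tuple that fails a stream assertion and gets diverted to a trap instead of killing the job. Here is a minimal sketch of that behavior in plain Java – a regex assertion and a trap list standing in for Cascading’s stream assertions and trap taps; all names here are ours, not Cascading’s:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StreamAssertSketch {
  // tuples matching the pattern flow downstream; the rest are trapped
  // as "data exceptions" rather than failing the whole flow
  public static List<String> assertStream( List<String> tuples, String regex, List<String> trap ) {
    Pattern p = Pattern.compile( regex );
    List<String> passed = new ArrayList<>();
    for ( String t : tuples ) {
      if ( p.matcher( t ).matches() )
        passed.add( t );
      else
        trap.add( t );  // redirect to Ops, QA, Support, Audit, etc.
    }
    return passed;
  }

  public static void main( String[] args ) {
    List<String> trap = new ArrayList<>();
    List<String> ok = assertStream( List.of( "42", "17", "oops", "99" ), "\\d+", trap );
    // → [42, 17, 99] trapped: [oops]
    System.out.println( ok + " trapped: " + trap );
  }
}
```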
Cascading workflows – TDD meets API principles

               • specify what is required, not how it must be achieved
               • plan far ahead, before consuming cluster resources – fail fast prior to submit
               • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
               • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies

[diagram: same Enterprise data workflow – source/sink/trap taps, Hadoop Cluster, Modeling (PMML), Analytics Cubes, Reporting, Customer profile DBs / Prefs]
Some of the design principles for the pattern language
Two Avenues…

             Enterprise: must contend with complexity at scale everyday…
             incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

             Start-ups: crave complexity and scale to become viable…
             new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

[chart axes: complexity ➞ vs. scale ➞]
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

             Enterprise: must contend with complexity at scale everyday…
             incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

             Start-ups: crave complexity and scale to become viable…
             new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

             callout: Hadoop almost never gets used in isolation; data workflows define the “glue” required for system integration of Enterprise apps

[chart axes: complexity ➞ vs. scale ➞]
Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple different ways to arrive at the party.
The Workflow Abstraction

[diagram: WordCount flow – Document Collection → Tokenize → Scrub token (map side), HashJoin Left with Stop Word List (RHS), Regex token, GroupBy token (reduce side), Count → Word Count]

                       1. Funnel
                       2. Circa 2008
                       3. Cascading
                       4. Sample Code
                       5. Workflows
                       6. Abstraction
                       7. Trendlines
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin Left, with Stop Word List as RHS → Regex token → GroupBy token (R) → Count → Word Count]

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
             In formal terms, this provides a pattern language.
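By way of illustration only (this is plain Java streams, not the Cascading API), the word-count flow from the diagram can be sketched as functional operations over a tuple flow; the stop-word filter stands in, by analogy, for the HashJoin against the Stop Word List:

```java
import java.util.*;
import java.util.stream.*;

public class TupleFlowSketch {
    // Tokenize -> Scrub token -> filter against Stop Word List -> GroupBy token -> Count,
    // mirroring the elements of the word-count flow diagram.
    public static Map<String, Long> wordCount(List<String> docs, Set<String> stopWords) {
        return docs.stream()
            .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))) // Tokenize
            .filter(t -> !t.isEmpty())                                      // Scrub token
            .filter(t -> !stopWords.contains(t))                            // Stop Word List (join analogy)
            .collect(Collectors.groupingBy(t -> t, Collectors.counting())); // GroupBy + Count
    }

    public static void main(String[] args) {
        Map<String, Long> wc = wordCount(
            List.of("rain shadow of a rain cloud"),
            Set.of("a", "of"));
        System.out.println(wc.get("rain")); // prints 2
    }
}
```

Each step corresponds to one element in the flow diagram, which is the property the deck emphasizes: the code reads as a description of the workflow.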


Friday, 01 March 13                                                                                                                      58
A pattern language, based on the metaphor of “plumbing”
references…

                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices.

                      amazon.com/dp/0195019199



                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by “Gang of Four”.
                      amazon.com/dp/0201633612




Friday, 01 March 13                                                                                                 59
Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin Left, with Stop Word List as RHS → Regex token → GroupBy token (R) → Count → Word Count]

                   design principles of the pattern language ensure
                   best practices for robust, parallel data workflows
                   at scale

             Data is represented as flows of tuples. Operations within
             the tuple flows bring functional programming aspects into
             Java apps.
              In formal terms, this provides a pattern language.


Friday, 01 March 13                                                                                                                       60
The pattern language provides a structured method for solving large,
complex design problems where the syntax of the language promotes
use of best practices – which also addresses staffing issues
Cascading workflows – literate programming

             Cascading workflows generate their own visual
             documentation: flow diagrams

[flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin Left, with Stop Word List as RHS → Regex token → GroupBy token (R) → Count → Word Count]

              In formal terms, flow diagrams leverage a methodology
              called literate programming.
              Provides intuitive, visual representations for apps, great
              for cross-team collaboration.
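Cascading flows can write their plan out as a Graphviz DOT file (the deck later notes experts asking for a flow diagram "generated as a DOT file"). As a stdlib-only sketch of what such a generator does (the edge map and step names here are illustrative, not Cascading's internal representation):

```java
import java.util.*;

public class DotWriter {
    // Render a flow's steps and edges as Graphviz DOT text, so the
    // workflow produces its own visual documentation.
    public static String toDot(Map<String, List<String>> edges) {
        StringBuilder sb = new StringBuilder("digraph flow {\n");
        edges.forEach((src, dsts) -> {
            for (String dst : dsts)
                sb.append("  \"").append(src).append("\" -> \"")
                  .append(dst).append("\";\n");
        });
        return sb.append("}\n").toString();
    }
}
```

Feeding the resulting text to Graphviz yields exactly the kind of flow diagram shown on these slides.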


Friday, 01 March 13                                                                                                                                                                                                                          61
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
references…

                       by Don Knuth
                       Literate Programming
                       Univ of Chicago Press, 1992
                       literateprogramming.com/

                       “Instead of imagining that our main task is
                        to instruct a computer what to do, let us
                        concentrate rather on explaining to human
                        beings what we want a computer to do.”




Friday, 01 March 13                                                                                       62
Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
examples…

                      • Scalding apps have nearly 1:1 correspondence
                        between function calls and the elements in their
                        flow diagrams – excellent elision and literate
                        representation
                      • noticed on cascading-users email list:
                        when troubleshooting issues, Cascading experts ask
                        novices to provide an app’s flow diagram (generated
                        as a DOT file), sometimes in lieu of showing code

[flow diagram: the generated DOT plan for a word-count flow:
  [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
  → (map) Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
  → GroupBy('wc')[by:['token']]
  → (reduce) Every('wc')[Count[decl:'count']]
  → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
  → [tail]]

                      In formal terms, a flow diagram is a directed, acyclic
                      graph (DAG) on which lots of interesting math applies
                      for query optimization, predictive models about app
                      execution, parallel efficiency metrics, etc.




Friday, 01 March 13                                                                                                                                                                                 63
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
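Because a flow diagram is a DAG, a planner can order its steps before scheduling them onto jobs. A minimal, stdlib-only sketch of that ordering step (the edge map is a made-up example, not Cascading's planner):

```java
import java.util.*;

public class FlowPlanner {
    // Topologically order the steps of a flow DAG (Kahn's algorithm),
    // the kind of ordering a planner performs before mapping steps to jobs.
    public static List<String> topoSort(Map<String, List<String>> edges) {
        Map<String, Integer> inDegree = new HashMap<>();
        edges.forEach((src, dsts) -> {
            inDegree.putIfAbsent(src, 0);
            for (String d : dsts) inDegree.merge(d, 1, Integer::sum);
        });
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((node, deg) -> { if (deg == 0) ready.add(node); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String d : edges.getOrDefault(node, List.of()))
                if (inDegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (order.size() != inDegree.size())
            throw new IllegalStateException("cycle detected: not a DAG");
        return order;
    }
}
```

The same DAG structure is what makes the "interesting math" possible: query optimization and efficiency metrics operate on this graph, not on the source code.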
Cascading workflows – business process

            Following the essence of literate programming, Cascading
            workflows provide statements of business process
            This recalls a sense of business process management
            for Enterprise apps (think BPM/BPEL for Big Data)
            As a separation of concerns between business process
            and implementation details (Hadoop, etc.)
            This is especially apparent in large-scale Cascalog apps:
                “Specify what you require, not how to achieve it.”
            By virtue of the pattern language, the flow planner used
            in a Cascading app determines how to translate business
            process into efficient, parallel jobs at scale.




Friday, 01 March 13                                                      64
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
references…

                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      dl.acm.org/citation.cfm?id=362685
                      Rather than arguing between SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on:
                            the process of structuring data
                      That’s what apps do – Making Data Work




Friday, 01 March 13                                                                                                                                             65
Focus on *the process of structuring data*
which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                    Moseley & Marks, 2006
                    “Out of the Tar Pit”
                    goo.gl/SKspn




Friday, 01 March 13                                                             66
A more contemporary statement along similar lines...
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                       Moseley & Marks, 2006
                       “Out of the Tar Pit”
                       goo.gl/SKspn

                       several theoretical aspects converge into software
                       engineering practices which mitigate the complexity
                       of building and maintaining Enterprise data workflows



Friday, 01 March 13                                                                67
The Workflow Abstraction
[flow diagram: Document Collection → Tokenize → Scrub token (M) → HashJoin Left, with Stop Word List as RHS → Regex token → GroupBy token (R) → Count → Word Count]

                      1. Funnel
                      2. Circa 2008
                      3. Cascading
                      4. Sample Code
                      5. Workflows
                      6. Abstraction
                      7. Trendlines


Friday, 01 March 13                                                                                                                                                                                                          68
Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data.
Where did Big Data come from, and where is this kind of work headed?
Q3 1997: inflection point

             Four independent teams were working toward horizontal
             scale-out of workflows based on commodity hardware.
             This effort prepared the way for huge Internet successes
             in the 1997 holiday season… AMZN, EBAY, Inktomi
             (YHOO Search), then GOOG

             MapReduce and the Apache Hadoop open source stack
             emerged from this.




Friday, 01 March 13                                                                                                        69
Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
Circa 1996: pre-inflection point

[architecture diagram: BI Analysts deliver Excel pivot tables and PowerPoint slide decks to Stakeholders (strategy) and requirements to Product; Engineering ships optimized code to the Web App, which serves Customers; transactions land in an RDBMS, which returns SQL Query result sets to the BI Analysts]
Friday, 01 March 13                                                                                                           70
Ah, teh olde days - Perl and C++ for CGI :)

Feedback loops shown in red represent data innovations at the time…

Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
Circa 2001: post big ecommerce successes

[architecture diagram: Algorithmic Modeling builds models (recommenders + classifiers) for Engineering, which deploys servlets to Web Apps serving Customers (Product, UX); Middleware records event history into Logs; customer transactions land in an RDBMS; ETL loads the Logs and RDBMS into a DW, which returns SQL Query result sets and aggregations to Algorithmic Modeling; Stakeholders get dashboards]
Friday, 01 March 13                                                                                                                                                                                                                    71
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the
marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
Circa 2013: clusters everywhere

[architecture diagram: an interdisciplinary team (Domain Expert: business process; Data Scientist: discovery + modeling; App Dev: s/w dev; Ops) works through a Workflow abstraction that drives Data Products for Customers via Web Apps, Mobile, etc., capturing social interactions, transactions, and content; app History feeds a Planner, which taps optimized capacity from a Cluster Scheduler spanning use cases across topologies: Hadoop etc. (batch), Log Events, In-Memory Data Grid (near time), DW, and RDBMS; dashboard metrics inform stakeholders; introduced capability meets the existing SDLC on the Eng/Ops side]

Friday, 01 March 13                                                                                                                                                                             72
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Mesos, etc.
Asymptotically…

[sidebar diagram: DSL → Planner/Optimizer → Workflow → Cluster → Cluster Scheduler, with App History collected alongside the Workflow and Cluster]

              • long-term trends toward more instrumentation
                of Enterprise data workflows:
                - workflow abstraction enables business cases
                - more machine data collected about apps
                - flow diagram (DAG) as unit of work
                  (abstract type for machine data)
                - evolving feedback loops convert machine data
                  into actionable insights and optimizations
              • industry moves beyond common needs of ad-hoc
                queries on logs and basic reporting, as a new class
                of complex data workflows emerges to provide
                the insights required by Enterprise
              • end game is less about “bigness” of data, more about
                managing complexity in the process of structuring data
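One of the "parallel efficiency metrics" such a feedback loop might compute from app history, sketched with made-up numbers (the formula is the standard parallel-efficiency ratio, not a Cascading API):

```java
public class EfficiencyMetric {
    // Parallel efficiency from app history: total serial work time divided by
    // (slots used * observed wall-clock time); 1.0 means perfect scaling.
    public static double parallelEfficiency(double serialSeconds, int slots,
                                            double wallSeconds) {
        return serialSeconds / (slots * wallSeconds);
    }

    public static void main(String[] args) {
        // e.g., 400s of work spread over 10 slots finishing in 50s of wall clock
        System.out.println(parallelEfficiency(400.0, 10, 50.0)); // prints 0.8
    }
}
```

A planner watching this metric drop across runs has an actionable signal to resize or reshape the job, which is the feedback loop the bullet describes.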

Friday, 01 March 13                                                                              73
In summary…
references…

                       by Leo Breiman
                       Statistical Modeling: The Two Cultures
                       Statistical Science, 2001
                       bit.ly/eUTh9L




Leo Breiman wrote an excellent paper in 2001, “Statistical Modeling: The Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
references…

                      Amazon
                      “Early Amazon: Splitting the website” – Greg Linden
                      glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

                      eBay
                      “The eBay Architecture” – Randy Shoup, Dan Pritchett
                      addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
                      addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

                      Inktomi (YHOO Search)
                      “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
                      youtube.com/watch?v=E91oEn1bnXM

                      Google
                      “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
                      youtube.com/watch?v=qsan-GQaeyk
                      perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
                      “The Birth of Google” – John Battelle
                      wired.com/wired/archive/13.08/battelle.html




In their own words…
references…


                        by Paco Nathan
                        Enterprise Data Workflows
                        with Cascading
                        O’Reilly, 2013
                        amazon.com/dp/1449358721




Some of this material comes from an upcoming O’Reilly book:
“Enterprise Data Workflows with Cascading”

This should be in Rough Cuts soon, and is
scheduled to be out in print this June.

Many thanks to my wonderful editor, Courtney Nash.
drill-down…


                      blog, dev community, code/wiki/gists, maven repo,
                      commercial products, career opportunities:
                            cascading.org
                            zest.to/group11
                            github.com/Cascading
                            conjars.org
                            goo.gl/KQtUL
                            concurrentinc.com

                      join us for very interesting work!




Links to our open source projects, developer community, etc…

contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)
The Workflow Abstraction

  • 1. “The Workflow Abstraction” Strata SC 2013-02-28 Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Background: a dual background in quantitative methods and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps.
  • 2. The Workflow Abstraction [diagram: word-count pipeline -- Document Collection -> Tokenize -> Scrub token (M) -> HashJoin Left with Regex token, joined against a Stop Word List (RHS) -> GroupBy token (R) -> Count -> Word Count] Outline: 1. Funnel 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines -- This talk is about the workflow abstraction: the business process of structuring data, the practices of building robust apps at scale, and the open source projects for Enterprise Data Workflows. We’ll consider some theory, examples, best practices, and trendlines: what are the drivers that brought us here, and where is this work heading? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
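The word-count pipeline named in this slide (Tokenize, Scrub, a join against a stop-word list, GroupBy, Count) can be sketched in plain Python to show the logical flow. This is only an illustration of the shape of the workflow -- the deck’s real implementations use Cascading on Hadoop, where the map (M) and reduce (R) phases run distributed:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "and", "of"}  # stands in for the Stop Word List (RHS)

def tokenize(line):
    # Regex token: split on runs of non-word characters
    return re.split(r"\W+", line)

def scrub(token):
    # Scrub token: normalize case, trim whitespace
    return token.lower().strip()

def word_count(document_collection):
    counts = Counter()
    for line in document_collection:                   # map side (M)
        for raw in tokenize(line):
            token = scrub(raw)
            if token and token not in STOP_WORDS:      # the join/filter against the stop list
                counts[token] += 1                     # GroupBy token + Count (R)
    return dict(counts)

print(word_count(["The quick fox", "the lazy dog and the fox"]))
# {'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The same DAG of operations appears throughout the talk; only the execution fabric changes.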
  • 3. Marketing Funnel – overview [funnel diagram: Customers / Campaigns -> Awareness -> Interest -> Evaluation -> Conversion -> Referral -> Repeat] In reference to Making Data Work… Almost every business uses a model similar to this, give or take a few steps. Customer leads go in at the top, get refined through several stages, then results flow out the bottom. -- Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years: analytics for the marketing funnel.
  • 4. Marketing Funnel – clickstream [funnel diagram annotated with events: Awareness/Impression, Interest/Click, Evaluation/Sign Up, Conversion/Purchase, Referral/"Like"] Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream: ad impressions, URL clicks, landing page views, new user registrations, session cookies, online purchases, social network activity, etc. -- Online advertising involves what we call “clickstream” data: lots of events in log files, i.e., lots of unstructured data.
  • 5. Marketing Funnel – metrics [funnel diagram annotated with metrics: Impression/CPM, Click/CTR, behaviors, Purchase/CPA, plus NPS, social graph, loyalty, win back] A variety of clickstream metrics can be used as performance indicators at different stages of the funnel: CPM (cost per thousand), CTR (click-through rate), CPA (cost per action), etc. -- The many different highly-nuanced metrics which apply are mind-boggling :)
  • 6. Marketing Funnel – example calculations

        metric   cost     events    formula                    rate
        CPM      $4,000   10^6      $4,000 ÷ (10^6 ÷ 10^3)     $4.00
        CTR      –        3∙10^3    3∙10^3 ÷ 10^6              0.3%
        CPA      $4,000   20        $4,000 ÷ 20                $200

    Here are examples of the kinds of calculations performed...
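The example calculations on this slide translate directly into code. A minimal Python sketch, using the same numbers (function names are illustrative, not from any library):

```python
def cpm(cost, impressions):
    # cost per thousand impressions
    return cost / (impressions / 1000)

def ctr(clicks, impressions):
    # click-through rate, as a fraction
    return clicks / impressions

def cpa(cost, actions):
    # cost per action (e.g., per conversion)
    return cost / actions

print(cpm(4000, 10**6))   # 4.0   -> $4.00 CPM
print(ctr(3000, 10**6))   # 0.003 -> 0.3% CTR
print(cpa(4000, 20))      # 200.0 -> $200 CPA
```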
  • 7. Marketing Funnel – predictive model Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc. Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP. As an example, after crunching lots of logs, suppose that CPP = $200 and LTV = $2000; then ROI = ($2000 − $200) ∕ $200, for a 9x multiple. -- For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
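The ROI formula on this slide, as a one-line sketch checked against the slide’s own numbers:

```python
def roi(ltv, cpp):
    # return on investment per customer: ROI = (LTV - CPP) / CPP
    return (ltv - cpp) / cpp

print(roi(ltv=2000, cpp=200))  # 9.0 -> the 9x multiple from the slide
```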
  • 8. Marketing Funnel – example architecture [architecture diagram: Customers -> Web App -> Logs (source and trap taps) -> Workflow with Data Modeling/PMML, running on a Hadoop Cluster (source and sink taps) -> Analytics Cubes, Customer Prefs, customer profile DBs -> Reporting; plus Cache and Support systems] Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data… -- Here’s an example architecture for using clickstream metrics within an online business.
  • 9. Marketing Funnel – complexities [funnel diagram with × marks at the trouble spots] Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc. Campaigns target specific geo/demo segments, test alternate landing pages, and probably need to segment the customer base… These issues make clickstream data large and yet sparse. Other issues: seasonal variation, fluctuating currency exchange rates, distortions due to credit card fraud, diminishing returns, forecasting requirements. -- However, real life intercedes. In many businesses, this is a complicated model to calculate correctly: scrubs; many vendors, data sources, and different metrics to be aligned; lots of roll-ups; Bayesian point estimates; forecasts and dashboards. The social dimension makes this convoluted, not simple.
• 10. Marketing Funnel – very large scale
Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend. Ad networks attempt to simplify and optimize parts of the funnel process as a value-add. The need for these insights has been a driver for Hadoop-related technologies.
[funnel diagram with stage metrics: Awareness – Impression, CPM; Interest – Click, CTR; Evaluation – Sign Up, behaviors; Conversion – Purchase, CPA; Referral – "Like", NPS, social graph, etc.; Repeat – loyalty, win back, etc.]
Notes: The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related "Big Data" technologies.
• 11. Marketing Funnel – very large scale
Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend. Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.
Funnel modeling and optimization requires complex data workflows to obtain the required insights.
Notes: These needs imply complex data workflows. It's not about doing a BI query or a pivot table; that's how retailers were thinking when Amazon came along.
• 12. The Workflow Abstraction (agenda)
[flow diagram: Document Collection → Tokenize / Scrub token → HashJoin Left / Regex token → GroupBy token / Stop Word List RHS → Count → Word Count]
1. Funnel   2. Circa 2008   3. Cascading   4. Sample Code   5. Workflows   6. Abstraction   7. Trendlines
Notes: A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
• 13. Circa 2008 – Hadoop at scale
Scenario: Analytics team at a large ad network…
The company had invested $MM capex in a large data warehouse across LOBs. A mission-critical app had been written as a large SQL workflow in the DW. Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily. Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance.
[diagram: clickstream → query/load → RDBMS → roll-ups, collab filter, per-user recommends]
Notes: Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network. Most of the revenue depended on one app, written in a DW – monolithic SQL which nobody at the company understood.
• 14. Circa 2008 – Hadoop at scale
Issues:
• the critical app had hit hard limits for scalability
• several Tb of data, 100's of servers
• batch window length vs. failure rate vs. SLA, in the context of business growth, posed an existential risk
We built out a team to address these issues as rapidly as possible… We needed to re-create that data workflow based on Enterprise requirements.
Notes: Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale-out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
• 15. Circa 2008 – Hadoop at scale
Approach:
• reverse-engineered the business process from ~1500 lines of undocumented SQL
• created a large, multi-step Apache Hadoop app on AWS
• leveraged a cloud strategy to trade $MM capex for lower, scalable opex
• Amazon identified our app as one of the largest Hadoop deployments on EC2
• our app became a case study for AWS prior to the Elastic MapReduce launch
Notes: Our solution involved dependencies among more than a dozen Hadoop job steps.
• 16. Circa 2008 – Hadoop at scale
Unresolved:
• ETL was still a separate app
• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
• data scientists wore beepers since Ops lacked visibility into the business process
• coding directly in MapReduce created a staffing bottleneck
Notes: This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM – for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows: staffing bottleneck unless there's a good abstraction layer; operational complexity, mostly due to lack of transparency; system integration problems *are* the main problem to solve.
• 17. Circa 2008 – Hadoop at scale
Unresolved:
• ETL was still a separate app
• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
• data scientists wore beepers since Ops lacked visibility into business logic
• coding directly in MapReduce created a staffing bottleneck
A good solution for a large, commercial Apache Hadoop deployment – but the app's workflow management lacked crucial features… which led to a search for a better workflow abstraction.
Notes: While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
• 18. The Workflow Abstraction (agenda)
[flow diagram: Document Collection → Tokenize / Scrub token → HashJoin Left / Regex token → GroupBy token / Stop Word List RHS → Count → Word Count]
1. Funnel   2. Circa 2008   3. Cascading   4. Sample Code   5. Workflows   6. Abstraction   7. Trendlines
Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
• 19. Cascading – origins
API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products. Wensel was following the Nutch open source project – before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.
Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
• 20. Cascading – functional programming
Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as "Main Street" Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
• leverages the JVM and Java-based tools without any need to create an entirely new language
• allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters
Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
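The data-pipelines-are-functional insight can be seen in miniature with plain Java 8 streams: a pipeline of pure transformations over a collection, with no mutable shared state. The event layout and names below are mine, for illustration only; this is not Cascading API code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Pipeline {
    // A tiny data pipeline as composed pure functions: filter, then
    // group-and-count -- the same functional shape as a MapReduce job.
    // Each event is {user, eventType}; names are hypothetical.
    public static Map<String, Long> clicksPerUser(List<String[]> events) {
        return events.stream()
            .filter(e -> e[1].equals("click"))   // keep only click events
            .collect(Collectors.groupingBy(      // group by user, count
                e -> e[0], Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[]{"alice", "click"},
            new String[]{"alice", "impression"},
            new String[]{"bob", "click"});
        System.out.println(clicksPerUser(events)); // {alice=1, bob=1}
    }
}
```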
• 21. quotes…
"Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise."
– CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
"Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs."
– 2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
Notes: Industry analysts are picking up on the staffing costs related to Hadoop – "no free lunch". The issues: staffing bottleneck; operational complexity; system integration.
• 22. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
Notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 23. examples… • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Friday, 01 March 13 23 Many case studies, many Enterprise production deployments now for 5+ years.
• 24. examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), Scalding in Scala (2012)
Cascading serves as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals.
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Notes: Cascading as a basis for workflow abstraction, for Enterprise data workflows.
• 25. The Workflow Abstraction (agenda)
[flow diagram: Document Collection → Tokenize / Scrub token → HashJoin Left / Regex token → GroupBy token / Stop Word List RHS → Count → Word Count]
1. Funnel   2. Circa 2008   3. Cascading   4. Sample Code   5. Workflows   6. Abstraction   7. Trendlines
Notes: Code samples in Cascading / Cascalog / Scalding, based on Word Count.
• 26. The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents.
This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a "Hello World" for Hadoop apps
Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

Notes: Taking a wild guess, most people who've written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I'm obligated to speak about WordCount at this point.
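The map/reduce pseudocode above can be collapsed into a single-process Java sketch, handy for checking expected output on small inputs before running at scale. The class name and tokenizing regex are mine, chosen to mirror the deck's examples.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // Count how often each token appears across a collection of documents:
    // a single-process stand-in for the map/reduce pseudocode on the slide.
    public static Map<String, Integer> count(List<String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {                        // "map": tokenize
            for (String w : doc.split("[ \\[\\](),.]+")) {
                if (!w.isEmpty())
                    counts.merge(w, 1, Integer::sum);    // "reduce": sum counts
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("rain rain go away", "go again")));
        // {again=1, away=1, go=2, rain=2}
    }
}
```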
  • 27. word count – conceptual flow diagram Document Collection Tokenize GroupBy M token Count R Word Count 1 map cascading.org/category/impatient 1 reduce 18 lines code gist.github.com/3900702 Friday, 01 March 13 27 Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
• 28. word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

Notes: Based on a Cascading implementation of Word Count, here is sample code – approx 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd-to-last line generates a DOT file for the flow diagram.
  • 29. word count – generated flow diagram Document Collection Tokenize [head] M GroupBy token Count R Word Count Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] map Each('token')[RegexSplitGenerator[decl:'token'][args:1]] [{1}:'token'] [{1}:'token'] GroupBy('wc')[by:['token']] wc[{1}:'token'] [{1}:'token'] reduce Every('wc')[Count[decl:'count']] [{2}:'token', 'count'] [{1}:'token'] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] [{2}:'token', 'count'] [{2}:'token', 'count'] [tail] Friday, 01 March 13 29 As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
• 30. word count – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

Notes: Here is the same Word Count app written in Clojure, using Cascalog.
  • 31. word count – Cascalog / Clojure Document Collection github.com/nathanmarz/cascalog/wiki Tokenize GroupBy M token Count R Word Count • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn Friday, 01 March 13 31 From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
• 32. word count – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
  • 33. word count – Scalding / Scala Document Collection github.com/twitter/scalding/wiki Tokenize GroupBy M token Count R Word Count • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog, not as much of a high-level language Friday, 01 March 13 33 If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
• 34. word count – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists become "pipes" backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of the conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less learning curve than Cascalog, not as much of a high-level language
The Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process.
Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
• 35. The Workflow Abstraction (agenda)
[flow diagram: Document Collection → Tokenize / Scrub token → HashJoin Left / Regex token → GroupBy token / Stop Word List RHS → Count → Word Count]
1. Funnel   2. Circa 2008   3. Cascading   4. Sample Code   5. Workflows   6. Abstraction   7. Trendlines
Notes: Tracking back to the Marketing Funnel as an example workflow… Let's consider how Cascading apps incorporate other components beyond Hadoop.
• 36. Enterprise Data Workflows
Back to our marketing funnel – let's consider an example app… at the front end. LOB use cases drive demand for apps.
[architecture diagram: Web App → logs → Cache → Logs; source, sink, and trap taps feeding a Workflow on a Hadoop Cluster; outputs to Modeling (PMML), Analytics Cubes, Customer Prefs, customer profile DBs, Reporting, Support]
Notes: LOB use cases drive the demand for Big Data apps.
• 37. Enterprise Data Workflows
An example… in the back office. Organizations have substantial investments in people, infrastructure, and process.
Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
• 38. Enterprise Data Workflows
An example… for the heavy lifting! "Main Street" firms are migrating workflows to Hadoop, for cost savings and scale-out.
Notes: "Main Street" firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
• 39. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are "plumbing" endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
• schema and provenance get derived from analysis of the taps
Notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
• 40. Cascading workflows – taps
[same Word Count source as slide 28, with the source and sink taps for TSV data in HDFS highlighted:]

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

Notes: Here are the taps in the WordCount source.
• 41. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries:
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
• blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
Notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
• 42. Cascading workflows – topologies
[same Word Count source as slide 28, with the flow planner for the Apache Hadoop topology highlighted:]

HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

Notes: Here is the flow planner for Hadoop in the WordCount source.
  • 43. example topologies… Friday, 01 March 13 43 Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
• 44. Cascading workflows – ANSI SQL
• collab with Optiq – industry-proven code base
• ANSI SQL parser/optimizer atop the Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• relational catalog over a collection of unstructured data
• SQL shell prompt to run queries
• enable analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
• a language for queries – not a database, but ANSI SQL as a DSL for workflows
Notes: ANSI SQL as "machine code" – the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer. BTW, most of the SQL in the world is written by machines. This is not a database; this is about making machine-to-machine communications simpler and more robust at scale.
  • 45. ANSI SQL – CSV data in local file system cascading.org/lingual Friday, 01 March 13 45 The test database for MySQL is available for download from https://launchpad.net/test-db/ Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 46. ANSI SQL – shell prompt, catalog cascading.org/lingual Friday, 01 March 13 46 Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
  • 47. ANSI SQL – queries cascading.org/lingual Friday, 01 March 13 47 Here’s an example SQL query on that “employee” test database from MySQL.
• 48. Cascading workflows – machine learning
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• Cascading creates parallelized models to run at scale on Hadoop clusters
• Random Forest, Logistic Regression, GLM, Decision Trees, K-Means, Hierarchical Clustering, etc.
• integrate with other libraries (Matrix API, etc.) and great open source tools (R, Weka, KNIME, RapidMiner, etc.)
• 2 lines of code or pre-built JAR
• run multiple variants of models as customer experiments
Notes: PMML has been around for a while, and export is supported by nearly every commercial analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Several companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
• 49. model creation in R

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

cascading.org/pattern
Notes: Sample code in R for generating a predictive model for anti-fraud, based on a machine learning algorithm called Random Forest.
• 50. model run at scale as a Cascading app
[flow diagram: Customer Orders → GroupBy token → Classify (PMML Model) → Assert → Scored Orders, Count; Failure Traps; Confusion Matrix]
cascading.org/pattern
Notes: Conceptual flow diagram for a Cascading app which runs a PMML model at scale, while trapping data exceptions (e.g., regression tests) and tallying a "confusion matrix" for quantifying the model performance.
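The "confusion matrix" tallied in the flow above is just a 2x2 count of predicted vs. actual labels. A stand-alone sketch of that tally; the class and method names are mine, not part of the cascading.pattern API.

```java
public class ConfusionMatrix {
    // Tally a 2x2 confusion matrix, m[actual][predicted],
    // with 0 = negative and 1 = positive labels.
    public static int[][] tally(int[] actual, int[] predicted) {
        int[][] m = new int[2][2];
        for (int i = 0; i < actual.length; i++)
            m[actual[i]][predicted[i]]++;
        return m;
    }

    public static void main(String[] args) {
        int[][] m = tally(new int[]{1, 1, 0, 0, 1},
                          new int[]{1, 0, 0, 0, 1});
        // true positives, false negatives, false positives, true negatives
        System.out.println("TP=" + m[1][1] + " FN=" + m[1][0]
            + " FP=" + m[0][1] + " TN=" + m[0][0]); // TP=2 FN=1 FP=0 TN=2
    }
}
```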
• 51. model run at scale as a Cascading app

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}

Notes: Source code for a simple Cascading app that runs PMML models in general.
  • 52. PMML support… Friday, 01 March 13 52 Popular tools which can create predictive models for export as PMML
• 53. Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as "data exceptions"
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
• TDD follows from Cascalog's composable subqueries
• redirect traps in production to Ops, QA, Support, Audit, etc.
Notes: TDD is not usually high on the list when people start discussing Big Data apps. The notion of a "data exception" was introduced into Cascading, based on setting stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
• 54. Cascading workflows – TDD meets API principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources – fail fast prior to submit
• fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile to change data taps or cluster topologies

[same architecture diagram as the previous slide: Web App logs feeding a Data Workflow with source, sink, and trap taps, then Analytics Cubes and Reporting on a Hadoop Cluster]

Some of the design principles for the pattern language.
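The "fail fast prior to submit" principle can be illustrated with a minimal client-side validation sketch in plain Java. This is an assumption-laden illustration, not the actual Cascading planner API: the rule checked here (every pipe bound to both a source and a sink) and the names are invented for the example.

```java
import java.util.*;

// Sketch: validate a flow definition on the client, before any cluster
// resources are consumed. An unbound tap is reported immediately,
// deterministically, at plan time -- not hours into a Hadoop job.
public class FlowCheckSketch {
    // every named pipe must be bound to both a source and a sink
    static List<String> validate(Set<String> pipes, Set<String> sources, Set<String> sinks) {
        List<String> errors = new ArrayList<>();
        for (String p : pipes) {
            if (!sources.contains(p)) errors.add("no source bound for pipe: " + p);
            if (!sinks.contains(p))   errors.add("no sink bound for pipe: " + p);
        }
        return errors;   // empty list: safe to submit the flow
    }

    public static void main(String[] args) {
        // the "classify" pipe has a source but no sink bound yet
        List<String> errors = validate(Set.of("classify"), Set.of("classify"), Set.of());
        System.out.println(errors);   // reported before any job is submitted
    }
}
```

Because the check is pure and deterministic, it "fails the same way twice", which is what makes planner errors cheap to debug at scale.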
• 55. Two Avenues…

Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.

Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.

[chart axes: complexity ➞ vertical, scale ➞ horizontal]

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
• 56. Two Avenues…

Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff.

Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.

Hadoop almost never gets used in isolation; data workflows define the "glue" required for system integration of Enterprise apps.

[chart axes: complexity ➞ vertical, scale ➞ horizontal]

Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple different ways to arrive at the party.
• 57. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

[word-count flow diagram: Document Collection → Tokenize → Scrub token (map) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token → Count (reduce) → Word Count]

Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
• 58. Cascading workflows – pattern language

Cascading uses a "plumbing" metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.

[word-count flow diagram: Document Collection → Tokenize → Scrub token (map) → HashJoin Left with Stop Word List (RHS) → Regex token → GroupBy token → Count (reduce) → Word Count]

A pattern language, based on the metaphor of "plumbing".
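The shape of the word-count flow diagram on the slide can be sketched in plain Java streams (not the Cascading API) to show how "flows of tuples" bring functional programming into Java: tokenize, filter against a stop-word list (a HashJoin in the real flow), then group and count. All names here are illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of the plumbing metaphor: documents flow through
// tokenize -> stop-word filter -> group-by -> count, mirroring the
// word-count flow diagram.
public class TupleFlowSketch {
    // tokenize: split each line into lowercase tokens, scrubbing punctuation
    static Stream<String> tokenize(Stream<String> docs) {
        return docs.flatMap(line -> Arrays.stream(line.toLowerCase().split("[^a-z]+")))
                   .filter(t -> !t.isEmpty());
    }

    // group-by token, then count -- analogous to GroupBy + Every(Count)
    static Map<String, Long> wordCount(List<String> docs, Set<String> stopWords) {
        return tokenize(docs.stream())
                .filter(t -> !stopWords.contains(t))   // stop-word "join"
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> wc = wordCount(
            Arrays.asList("Rain rain go away", "rain again"),
            new HashSet<>(Arrays.asList("go", "away")));
        System.out.println(wc);
    }
}
```

The difference in Cascading is that the same pipeline declaration is planned onto a cluster as parallel map and reduce steps, rather than executed in-process.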
• 59. references…

pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. amazon.com/dp/0195019199

design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the "Gang of Four". amazon.com/dp/0201633612

Chris Alexander originated the use of pattern language in a project called "The Oregon Experiment", in the 1970s.
• 60. Cascading workflows – pattern language

Cascading uses a "plumbing" metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

design principles of the pattern language ensure best practices for robust, parallel data workflows at scale

Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.

[same word-count flow diagram as the previous slide]

The pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which also addresses staffing issues.
• 61. Cascading workflows – literate programming

Cascading workflows generate their own visual documentation: flow diagrams.

In formal terms, flow diagrams leverage a methodology called literate programming.

Provides intuitive, visual representations for apps, great for cross-team collaboration.

[word-count flow diagram, as on the previous slides]

Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the "cascading-users" email list is most telling – expert developers generally ask a novice to provide a flow diagram first.
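Generating that visual documentation is mechanically simple. Here is a hedged plain-Java sketch of emitting a Graphviz DOT description from a list of flow edges, in the spirit of the `Flow.writeDOT()` call shown in the slide 51 code; the `DotSketch`/`toDot` names are invented for this example.

```java
import java.util.*;

// Sketch: render a flow's edges as Graphviz DOT text, the raw material
// for the auto-generated flow diagrams used in cross-team collaboration.
public class DotSketch {
    static String toDot(String name, List<String[]> edges) {
        StringBuilder sb = new StringBuilder("digraph \"" + name + "\" {\n");
        for (String[] e : edges)
            sb.append("  \"").append(e[0]).append("\" -> \"").append(e[1]).append("\";\n");
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        String dot = toDot("wordcount", Arrays.asList(
            new String[]{"Tokenize", "GroupBy"},
            new String[]{"GroupBy", "Count"}));
        System.out.print(dot);   // feed to `dot -Tpng` to render the diagram
    }
}
```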
• 62. references…

Literate Programming, by Don Knuth. Univ of Chicago Press, 1992. literateprogramming.com/

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

Don Knuth originated the notion of literate programming, or code as "literature" which explains itself.
• 63. examples…
• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation
• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app's flow diagram (generated as a DOT file), sometimes in lieu of showing code

In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.

[generated DOT plan for the word-count app: [head] → Hfs source ('doc_id', 'text') → map: Each('token')[RegexSplitGenerator] → GroupBy('wc')[by: 'token'] → reduce: Every('wc')[Count] → Hfs sink ('token', 'count') → [tail]]

Literate programming examples observed on the email list are some of the best illustrations of this methodology.
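To make the DAG point concrete, here is a small plain-Java sketch (not the Cascading planner) that represents a flow as adjacency lists and computes a topological order with Kahn's algorithm: the kind of graph math a flow planner applies before scheduling steps, and the check that rejects a cyclic, hence invalid, flow. Node names follow the word-count flow; the class name is invented.

```java
import java.util.*;

// Sketch: a flow diagram as a DAG, plus a topological sort that a
// planner could use to order steps (and to detect cycles up front).
public class FlowDagSketch {
    static List<String> topoSort(Map<String, List<String>> edges) {
        Map<String, Integer> inDeg = new HashMap<>();
        edges.forEach((src, dsts) -> {
            inDeg.putIfAbsent(src, 0);
            for (String d : dsts) inDeg.merge(d, 1, Integer::sum);
        });
        Deque<String> ready = new ArrayDeque<>();
        inDeg.forEach((n, d) -> { if (d == 0) ready.add(n); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String d : edges.getOrDefault(n, Collections.emptyList()))
                if (inDeg.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (order.size() != inDeg.size())
            throw new IllegalStateException("cycle: not a valid flow");
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flow = new LinkedHashMap<>();
        flow.put("source", List.of("tokenize"));
        flow.put("tokenize", List.of("groupby"));
        flow.put("groupby", List.of("count"));
        flow.put("count", List.of("sink"));
        System.out.println(topoSort(flow));
    }
}
```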
• 64. Cascading workflows – business process

Following the essence of literate programming, Cascading workflows provide statements of business process.

This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data): a separation of concerns between business process and implementation details (Hadoop, etc.).

This is especially apparent in large-scale Cascalog apps: "Specify what you require, not how to achieve it."

By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.

Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
• 65. references…

"A relational model of data for large shared data banks", by Edgar Codd. Communications of the ACM, 1970. dl.acm.org/citation.cfm?id=362685

Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on: the process of structuring data. That's what apps do – Making Data Work.

Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a "structured" store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O'Reilly "animal" for the Cascading book is an Atlantic Cod? (pun intended)
• 66. Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a "data sublanguage" and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, "Out of the Tar Pit" goo.gl/SKspn

A more contemporary statement along similar lines...
• 67. Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a "data sublanguage" and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, "Out of the Tar Pit" goo.gl/SKspn

several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows
• 68. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines

[word-count flow diagram, as before]

Let's consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
• 69. Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.

Q3 1997: Greg Linden, et al., @ Amazon; Randy Shoup, et al., @ eBay – independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
• 70. Circa 1996: pre-inflection point

[architecture diagram: Stakeholder ↔ Customers via Excel pivot tables and PowerPoint slide decks; BI and Product Analysts pass strategy and requirements; Engineering runs SQL Query and optimized code; Web App transactions into an RDBMS, returning result sets]

Ah, teh olde days – Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; "throw it over the wall"… this thinking led to impossible silos.
• 71. Circa 2001: post-big ecommerce successes

[architecture diagram: Stakeholder, Product, Customers served by dashboards; UX and Engineering build models and servlets for Web Apps + Middleware; Algorithmic Modeling produces recommenders and classifiers; event history flows from Logs through ETL into a DW; SQL Query result sets and customer transactions against an RDBMS]

Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models – e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
• 72. Circa 2013: clusters everywhere

[architecture diagram: Customers and Data Products; an inter-disciplinary team (Domain Expert, Data Scientist, App Dev, Ops) around a Workflow, with business process, dashboards, and metrics; Web Apps, Mobile, etc. feeding social interactions + transactions and content; use cases across topologies – Hadoop etc. for batch, In-Memory Data Grid for near time; Log Events and app History feeding a Planner / Cluster Scheduler (introduced capability); DW and RDBMS as existing capability under a traditional SDLC]

Here's what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Mesos, etc.
• 73. Asymptotically…
• long-term trends toward more instrumentation of Enterprise data workflows:
  - workflow abstraction enables business cases
  - more machine data collected about apps
  - flow diagram (DAG) as unit of work (abstract type for machine data)
  - evolving feedback loops convert machine data into actionable insights and optimizations
• industry moves beyond common needs of ad-hoc queries on logs and basic reporting, as a new class of complex data workflows emerges to provide the insights required by Enterprise
• end game is less about "bigness" of data, more about managing complexity in the process of structuring data

[diagram: DSL → Planner/Optimizer → Workflow App → Cluster, with History feeding back into a Cluster Scheduler]

In summary…
• 74. references…

Statistical Modeling: The Two Cultures, by Leo Breiman. Statistical Science, 2001. bit.ly/eUTh9L

Leo Breiman wrote an excellent paper in 2001, "Two Cultures", chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
• 75. references…

Amazon: "Early Amazon: Splitting the website" – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay: "The eBay Architecture" – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search): "Inktomi's Wild Ride" – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google: "Underneath the Covers at Google" – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
"The Birth of Google" – John Battelle
wired.com/wired/archive/13.08/battelle.html

In their own words…
• 76. references…

Enterprise Data Workflows with Cascading, by Paco Nathan. O'Reilly, 2013. amazon.com/dp/1449358721

Some of this material comes from an upcoming O'Reilly book: "Enterprise Data Workflows with Cascading". This should be in Rough Cuts soon – scheduled to be out in print this June. Many thanks to my wonderful editor, Courtney Nash.
• 77. drill-down…

blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com

join us for very interesting work!

Copyright @2013, Concurrent, Inc.

Links to our open source projects, developer community, etc… contact me @pacoid http://concurrentinc.com/ (we're hiring too!)