SlideShare a Scribd company logo
Why                     Scalding




            Needs Scala
         A Look at Scoobi and Scalding
            Scala DSLs for Hadoop



Scoobi

                                         @agemooij
Obligatory “About Me” Slide
Rocks!
But programming



    kinda
Sucks!
Hello World Word Count
        using
 Hadoop MapReduce
Split lines into words

Turn each word into a Pair(word, 1)


                        Group by word (?)



    For each word, sum the 1s to get the total
Lots of small unintuitive
                   Mapper and Reducer
                          Classes




          Lots of Hadoop intrusiveness
       (Context, Writables, Exceptions, etc.)




Low level glue code




Actually runs the code on the cluster
This does not make me a
           happy Hadoop developer!
Especially for things that are a little bit more complicated than counting words




 • Unintuitive, invasive programming model
 • Hard to compose/chain jobs into real, more
     complicated programs
 •   Lots of low-level boilerplate code
 •   Branching, Joins, CoGroups, etc. hard to
     implement
What Are the Alternatives?
Counting Words using Apache Pig




Nice!
Already a lot better, but anything more complex gets
hard pretty fast.
Pig is hard to customize/extend
Handy for quick exploration of data!
           And the same goes for Hive
package cascadingtutorial.wordcount;

/**
                                                                                 Very powerful!
 * Wordcount example in Cascading
 */                                                                              Record Model
public class Main
  {
                                                                                 Pipes & Filters
  public static void main( String[] args )
    {
      String inputPath = args[0];
                                                                                 Joins & CoGroups
      String outputPath = args[1];

         Scheme inputScheme = new TextLine(new Fields("offset", "line"));
         Scheme outputScheme = new TextLine();

         Tap sourceTap = inputPath.matches( "^[^:]+://.*") ?
           new Hfs(inputScheme, inputPath)    :                          Not very intuitive
           new Lfs(inputScheme, inputPath);
         Tap sinkTap   = outputPath.matches("^[^:]+://.*") ?
           new Hfs(outputScheme, outputPath) :                           Strange new abstraction
           new Lfs(outputScheme, outputPath);

         Pipe wcPipe = new Each("wordcount",
                                                                         Lots of boilerplate code
             new Fields("line"),
             new RegexSplitGenerator(new Fields("word"), "s+"),
             new Fields("word"));

         wcPipe = new GroupBy(wcPipe, new Fields("word"));
         wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

         Properties properties = new Properties();
         FlowConnector.setApplicationJarClass(properties, Main.class);

         Flow parsedLogFlow = new FlowConnector(properties)
           .connect(sourceTap, sinkTap, wcPipe);
         parsedLogFlow.start();
         parsedLogFlow.complete();
     }
 }
Meh...  I’m lazy
I want more power with less work!
How would we
count words in
 plain Scala?
  (My current language of choice)
Nice!
Familiar, intuitive
What if...?
But that code doesn’t
 scale to my cluster!
                 Or does it?




Meanwhile at Google...
Introducing
         Scoobi & Scalding
         Scala DSLs for Hadoop MapReduce


NOTE:
My relative familiarity
with either platform:
                          Scalding
                            5%




                             Scoobi
                              95%
http://github.com/nicta/scoobi


       A Scala library that
    implements a higher level
     programming model for
       Hadoop MapReduce
Counting Words using Scoobi




                                            Split lines into words
                                            Turn each word into a Pair(word, 1)
                                            Group by word
                                            For each word, sum the 1s to get the total




        Actually runs the code on the cluster
Scoobi is...
•   A distributed collections abstraction:
    •   Distributed collection objects abstract data in HDFS
    •   Methods on these objects abstract map/reduce
        operations
    •   Programs manipulate distributed collections objects
    •   Scoobi turns these manipulations into MapReduce jobs
    •   Based on Google’s FlumeJava / Cascades
•   A source code generator (it generates Java code!)
•   A job plan optimizer
•   Open sourced by NICTA
•   Written in Scala (W00t!)
DList[T]
•   Abstracts storage of data and files on HDFS
•   Calling methods on DList objects to transform and
    manipulate them abstracts the mapper, combiner,
    sort-and-shuffle, and reducer phases of MapReduce
•   Persisting a DList triggers compilation of the graph
    into one or more MR jobs and their execution
•   Very familiar: like standard Scala Lists
•   Strongly typed
•   Parameterized with rich types and Tuples
•   Easy list manipulation using typical higher order
    functions like map, flatMap, filter, etc.
DList[T]
IO
    •   Can read/write text files, Sequence files and Avro files
    •   Can influence sorting (raw, secondary)


                   Serialization
•   Serialization of custom types through Scala type
    classes and WireFormat[T]
•   Scoobi implements WireFormat[T] for primitive types,
    strings, tuples, Option[T], either[T], Iterable[T], etc.
•   Out of the box support for serialization of Scala case
    classes
IO/Serialization I
IO/Serialization II




      For normal (i.e. non-case) classes
Further Info
Version 0.4 released today (!)
• Avro, Sequence Files
• Materialized DObjects
• DList reduction methods (product, min,
    etc.)
•   Vastly improved testing support
•   Less overhead
•   Much more


http://nicta.github.com/scoobi/

scoobi-dev@googlegroups.com
scoobi-users@googlegroups.com
Scalding!



http://github.com/twitter/scalding


      A Scala library that
   implements a higher level
    programming model for
     Hadoop MapReduce
           Cascading
Counting Words using Scalding
Scalding is...
•   A distributed collections abstraction

•   A wrapper around Cascading (i.e. no source code
    generation)

•   Based on the same record model (i.e. named fields)

•   Less strongly typed

•   Uses Kryo Serialization

•   Used by Twitter in production

•   Written in Scala (W00t!)
Further Info
Current version: 0.5.4



http://github.com/twitter/scalding
https://github.com/twitter/scalding/wiki

@scalding

cascading-user@googlegroups.com

http://blog.echen.me/2012/02/09/movie-recommendations-and-more-
via-mapreduce-and-scalding/
How do they compare?
                              Small feature
Different approaches,    differences, which will
     similar power         even out over time

  Scoobi gets a little    Twitter is definitely a
  closer to idiomatic       bigger fish than
        Scala             NICTA, so Scalding
                          gets all the attention
  Both open sourced
      (last year)        Scoobi has better docs!
Which one should I use?
Ehm...

    ...I’m extremely prejudiced!
Questions?
Ad

More Related Content

What's hot (20)

3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
Hadoop User Group
 
R meetup talk
R meetup talkR meetup talk
R meetup talk
Joseph Adler
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
Dean Wampler
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 
Ruby1_full
Ruby1_fullRuby1_full
Ruby1_full
tutorialsruby
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Li Ming Tsai
 
Java Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and SolutionsJava Memory Analysis: Problems and Solutions
Java Memory Analysis: Problems and Solutions
"Mikhail "Misha"" Dmitriev
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
Duyhai Doan
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and Friends
Rob Vesse
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Python redis talk
Python redis talkPython redis talk
Python redis talk
Josiah Carlson
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
Julien Le Dem
 
Scala Introduction
Scala IntroductionScala Introduction
Scala Introduction
Adrian Spender
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
Dean Wampler
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Li Ming Tsai
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
YahooTechConference
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
Duyhai Doan
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Apache Jena Elephas and Friends
Apache Jena Elephas and FriendsApache Jena Elephas and Friends
Apache Jena Elephas and Friends
Rob Vesse
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
Giivee The
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 

Similar to Why hadoop map reduce needs scala, an introduction to scoobi and scalding (20)

Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
johnynek
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
DataWorks Summit
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
Bouquet
 
Sugar Presentation - YULHackers March 2009
Sugar Presentation - YULHackers March 2009Sugar Presentation - YULHackers March 2009
Sugar Presentation - YULHackers March 2009
spierre
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Sparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With SparkSparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With Spark
Ian Pointer
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Marius Soutier
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinay
Viplav Jain
 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
Andy Grove
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
Gert Drapers
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
Scala Days San Francisco
Scala Days San FranciscoScala Days San Francisco
Scala Days San Francisco
Martin Odersky
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
Dmitry Makarchuk
 
AestasIT - Internal DSLs in Scala
AestasIT - Internal DSLs in ScalaAestasIT - Internal DSLs in Scala
AestasIT - Internal DSLs in Scala
Dmitry Buzdin
 
Scalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/CascadingScalding: Twitter's Scala DSL for Hadoop/Cascading
Scalding: Twitter's Scala DSL for Hadoop/Cascading
johnynek
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
DataWorks Summit
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
Amanda Casari
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
Bouquet
 
Sugar Presentation - YULHackers March 2009
Sugar Presentation - YULHackers March 2009Sugar Presentation - YULHackers March 2009
Sugar Presentation - YULHackers March 2009
spierre
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Sparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With SparkSparklife - Life In The Trenches With Spark
Sparklife - Life In The Trenches With Spark
Ian Pointer
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Scala final ppt vinay
Scala final ppt vinayScala final ppt vinay
Scala final ppt vinay
Viplav Jain
 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
Andy Grove
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
Gert Drapers
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
Scala Days San Francisco
Scala Days San FranciscoScala Days San Francisco
Scala Days San Francisco
Martin Odersky
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
UserReport
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
Dmitry Makarchuk
 
AestasIT - Internal DSLs in Scala
AestasIT - Internal DSLs in ScalaAestasIT - Internal DSLs in Scala
AestasIT - Internal DSLs in Scala
Dmitry Buzdin
 
Ad

More from Xebia Nederland BV (20)

The 10 tip recipe for business model innovation
The 10 tip recipe for business model innovationThe 10 tip recipe for business model innovation
The 10 tip recipe for business model innovation
Xebia Nederland BV
 
Scan je teams!
Scan je teams!Scan je teams!
Scan je teams!
Xebia Nederland BV
 
Holacracy: een nieuwe bodem voor de Scrum taart
Holacracy: een nieuwe bodem voor de Scrum taartHolacracy: een nieuwe bodem voor de Scrum taart
Holacracy: een nieuwe bodem voor de Scrum taart
Xebia Nederland BV
 
3* Scrum Master
3* Scrum Master3* Scrum Master
3* Scrum Master
Xebia Nederland BV
 
Judo Strategy
Judo StrategyJudo Strategy
Judo Strategy
Xebia Nederland BV
 
Agile en Scrum buiten IT
Agile en Scrum buiten ITAgile en Scrum buiten IT
Agile en Scrum buiten IT
Xebia Nederland BV
 
Scrumban
ScrumbanScrumban
Scrumban
Xebia Nederland BV
 
Creating the right products
Creating the right productsCreating the right products
Creating the right products
Xebia Nederland BV
 
Videoscribe je agile transitie
Videoscribe je agile transitieVideoscribe je agile transitie
Videoscribe je agile transitie
Xebia Nederland BV
 
Sketchnote je Product Backlog Items & Sprint Retrospectives
Sketchnote je Product Backlog Items & Sprint RetrospectivesSketchnote je Product Backlog Items & Sprint Retrospectives
Sketchnote je Product Backlog Items & Sprint Retrospectives
Xebia Nederland BV
 
Why we need test automation, but it’s not the right question
Why we need test automation, but it’s not the right questionWhy we need test automation, but it’s not the right question
Why we need test automation, but it’s not the right question
Xebia Nederland BV
 
Testen in de transitie naar continuous delivery
Testen in de transitie naar continuous deliveryTesten in de transitie naar continuous delivery
Testen in de transitie naar continuous delivery
Xebia Nederland BV
 
Becoming an agile enterprise, focus on the test ingredient
Becoming an agile enterprise, focus on the test ingredientBecoming an agile enterprise, focus on the test ingredient
Becoming an agile enterprise, focus on the test ingredient
Xebia Nederland BV
 
How DUO started with Continuous Delivery and changed their way of Testing
How DUO started with Continuous Delivery and changed their way of TestingHow DUO started with Continuous Delivery and changed their way of Testing
How DUO started with Continuous Delivery and changed their way of Testing
Xebia Nederland BV
 
Become a digital company - Case KPN / Xebia
Become a digital company - Case KPN / XebiaBecome a digital company - Case KPN / Xebia
Become a digital company - Case KPN / Xebia
Xebia Nederland BV
 
Building a Docker powered feature driven delivery pipeline at hoyhoy.nl
Building a Docker powered feature driven delivery pipeline at hoyhoy.nlBuilding a Docker powered feature driven delivery pipeline at hoyhoy.nl
Building a Docker powered feature driven delivery pipeline at hoyhoy.nl
Xebia Nederland BV
 
Webinar Xebia & bol.com
Webinar Xebia & bol.comWebinar Xebia & bol.com
Webinar Xebia & bol.com
Xebia Nederland BV
 
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
Xebia Nederland BV
 
TestWorks Conf Serenity BDD in action - John Ferguson Smart
TestWorks Conf Serenity BDD in action - John Ferguson SmartTestWorks Conf Serenity BDD in action - John Ferguson Smart
TestWorks Conf Serenity BDD in action - John Ferguson Smart
Xebia Nederland BV
 
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé MochtarTestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
Xebia Nederland BV
 
The 10 tip recipe for business model innovation
The 10 tip recipe for business model innovationThe 10 tip recipe for business model innovation
The 10 tip recipe for business model innovation
Xebia Nederland BV
 
Holacracy: een nieuwe bodem voor de Scrum taart
Holacracy: een nieuwe bodem voor de Scrum taartHolacracy: een nieuwe bodem voor de Scrum taart
Holacracy: een nieuwe bodem voor de Scrum taart
Xebia Nederland BV
 
Videoscribe je agile transitie
Videoscribe je agile transitieVideoscribe je agile transitie
Videoscribe je agile transitie
Xebia Nederland BV
 
Sketchnote je Product Backlog Items & Sprint Retrospectives
Sketchnote je Product Backlog Items & Sprint RetrospectivesSketchnote je Product Backlog Items & Sprint Retrospectives
Sketchnote je Product Backlog Items & Sprint Retrospectives
Xebia Nederland BV
 
Why we need test automation, but it’s not the right question
Why we need test automation, but it’s not the right questionWhy we need test automation, but it’s not the right question
Why we need test automation, but it’s not the right question
Xebia Nederland BV
 
Testen in de transitie naar continuous delivery
Testen in de transitie naar continuous deliveryTesten in de transitie naar continuous delivery
Testen in de transitie naar continuous delivery
Xebia Nederland BV
 
Becoming an agile enterprise, focus on the test ingredient
Becoming an agile enterprise, focus on the test ingredientBecoming an agile enterprise, focus on the test ingredient
Becoming an agile enterprise, focus on the test ingredient
Xebia Nederland BV
 
How DUO started with Continuous Delivery and changed their way of Testing
How DUO started with Continuous Delivery and changed their way of TestingHow DUO started with Continuous Delivery and changed their way of Testing
How DUO started with Continuous Delivery and changed their way of Testing
Xebia Nederland BV
 
Become a digital company - Case KPN / Xebia
Become a digital company - Case KPN / XebiaBecome a digital company - Case KPN / Xebia
Become a digital company - Case KPN / Xebia
Xebia Nederland BV
 
Building a Docker powered feature driven delivery pipeline at hoyhoy.nl
Building a Docker powered feature driven delivery pipeline at hoyhoy.nlBuilding a Docker powered feature driven delivery pipeline at hoyhoy.nl
Building a Docker powered feature driven delivery pipeline at hoyhoy.nl
Xebia Nederland BV
 
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
TestWorks Conf The magic of models for 1000% test automation - Machiel van de...
Xebia Nederland BV
 
TestWorks Conf Serenity BDD in action - John Ferguson Smart
TestWorks Conf Serenity BDD in action - John Ferguson SmartTestWorks Conf Serenity BDD in action - John Ferguson Smart
TestWorks Conf Serenity BDD in action - John Ferguson Smart
Xebia Nederland BV
 
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé MochtarTestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
TestWorks Conf Scalable QA with docker - Maarten van den Ende and Adé Mochtar
Xebia Nederland BV
 
Ad

Recently uploaded (20)

Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 

Why hadoop map reduce needs scala, an introduction to scoobi and scalding

  • 1. Why Scalding Needs Scala A Look at Scoobi and Scalding Scala DSLs for Hadoop Scoobi @agemooij
  • 4. But programming kinda Sucks!
  • 5. Hello World Word Count using Hadoop MapReduce
  • 6. Split lines into words Turn each word into a Pair(word, 1) Group by word (?) For each word, sum the 1s to get the total
  • 7. Lots of small unintuitive Mapper and Reducer Classes Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.) Low level glue code Actually runs the code on the cluster
  • 8. This does not make me a happy Hadoop developer! Especially for things that are a little bit more complicated than counting words • Unintuitive, invasive programming model • Hard to compose/chain jobs into real, more complicated programs • Lots of low-level boilerplate code • Branching, Joins, CoGroups, etc. hard to implement
  • 9. What Are the Alternatives?
  • 10. Counting Words using Apache Pig Nice! Already a lot better, but anything more complex gets hard pretty fast. Pig is hard to customize/extend Handy for quick exploration of data! And the same goes for Hive
  • 11. package cascadingtutorial.wordcount; /** Very powerful! * Wordcount example in Cascading */ Record Model public class Main { Pipes & Filters public static void main( String[] args ) { String inputPath = args[0]; Joins & CoGroups String outputPath = args[1]; Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : Not very intuitive new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : Strange new abstraction new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", Lots of boilerplate code new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  • 12. Meh... I’m lazy I want more power with less work!
  • 13. How would we count words in plain Scala? (My current language of choice)
  • 15. But that code doesn’t scale to my cluster! Or does it? Meanwhile at Google...
  • 16. Introducing Scoobi & Scalding Scala DSLs for Hadoop MapReduce NOTE: My relative familiarity with either platform: Scalding 5% Scoobi 95%
  • 17. http://github.com/nicta/scoobi A Scala library that implements a higher level programming model for Hadoop MapReduce
  • 18. Counting Words using Scoobi Split lines into words Turn each word into a Pair(word, 1) Group by word For each word, sum the 1s to get the total Actually runs the code on the cluster
  • 19. Scoobi is... • A distributed collections abstraction: • Distributed collection objects abstract data in HDFS • Methods on these objects abstract map/reduce operations • Programs manipulate distributed collections objects • Scoobi turns these manipulations into MapReduce jobs • Based on Google’s FlumeJava / Cascades • A source code generator (it generates Java code!) • A job plan optimizer • Open sourced by NICTA • Written in Scala (W00t!)
  • 20. DList[T] • Abstracts storage of data and files on HDFS • Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce • Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution • Very familiar: like standard Scala Lists • Strongly typed • Parameterized with rich types and Tuples • Easy list manipulation using typical higher order functions like map, flatMap, filter, etc.
  • 22. IO • Can read/write text files, Sequence files and Avro files • Can influence sorting (raw, secondary) Serialization • Serialization of custom types through Scala type classes and WireFormat[T] • Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], either[T], Iterable[T], etc. • Out of the box support for serialization of Scala case classes
  • 24. IO/Serialization II For normal (i.e. non-case) classes
  • 25. Further Info Version 0.4 released today (!) • Avro, Sequence Files • Materialized DObjects • DList reduction methods (product, min, etc.) • Vastly improved testing support • Less overhead • Much more http://nicta.github.com/scoobi/ [email protected] [email protected]
  • 26. Scalding! http://github.com/twitter/scalding A Scala library that implements a higher level programming model for Hadoop MapReduce Cascading
  • 28. Scalding is... • A distributed collections abstraction • A wrapper around Cascading (i.e. no source code generation) • Based on the same record model (i.e. named fields) • Less strongly typed • Uses Kryo Serialization • Used by Twitter in production • Written in Scala (W00t!)
  • 29. Further Info Current version: 0.5.4 http://github.com/twitter/scalding https://github.com/twitter/scalding/wiki @scalding [email protected] http://blog.echen.me/2012/02/09/movie-recommendations-and-more- via-mapreduce-and-scalding/
  • 30. How do they compare? Small feature Different approaches, differences, which will similar power even out over time Scoobi gets a little Twitter is definitely a closer to idiomatic bigger fish than Scala NICTA, so Scalding gets all the attention Both open sourced (last year) Scoobi has better docs!
  • 31. Which one should I use? Ehm... ...I’m extremely prejudiced!