Composing re-useable ETL on Hadoop

Paul Lam (@Quantisan)

Big Data London, 1 October 2012
Data science lifecycle

[Diagram: Acquire, Analyse, Action]
Data science lifecycle

[Diagram: Acquire, Analyse, Action, annotated "80% of work" beside Acquire and "80% of result" below Action and Analyse]
From these

✤   {:status 200,
    :scheme http,
    :pipe .,
    :request-uri /broadband/?gclid=CPnYgdqj0bECFa4mtAodVEsAYA,
    :http-x-forwarded-for 92.9.200.50,
    :msec 1344196910.137,
    :sent-http-set-cookie -,
    :body-bytes-sent 18836,
    :query-string gclid=CPnYgdj0bECa4mtAdVEsAYA,
    :request-content-type -,
    :cookie-urefs -,
    :request GET /broadband/?gclid=CPnYgdj0bECa4mtAdVEsAYA HTTP/1.1,
    :upstream-response-time 0.164,
    :sent-http-content-type text/html,
    :hostname nginx-lb-20120229-1942-24.uswitchinternal.com,
    :sent-http-location -,
    :time-local 05/Aug/2012:20:01:50 +0000,
    :http-referer http://www.google.co.uk/aclk?sa=l&ai=D1556&rct=j&q=best%20value%20internet%20uk,
    :http-user-agent Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1,
    :request-time 0.164,
    :request-body -,
    :http-host www.uswitch.com,
    :upstream-addr 178.32.60.100:80,
    :sent-http-server -,
    :upstream-status 200,
    :uscc <ANON>}
To these

✤   Removing bots and crawlers

✤   Picking out relevant events

✤   Grouping events by users

✤   Sequencing the events

✤   Structuring as a matrix

✤   Graphing
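
The first four steps above can be sketched in plain Java, without Hadoop. This is a minimal illustration, not the pipeline used in the talk: the `Event` class, the User-Agent substring check, and the `/broadband/` prefix are all assumptions made for the example. The matrix and graphing steps are left out.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical, simplified view of one parsed log record. */
final class Event {
    final String user;       // assumed user identifier (e.g. a cookie value)
    final String uri;        // request URI
    final String userAgent;  // HTTP User-Agent header
    final double msec;       // request time, seconds since the epoch

    Event(String user, String uri, String userAgent, double msec) {
        this.user = user;
        this.uri = uri;
        this.userAgent = userAgent;
        this.msec = msec;
    }
}

final class SessionEtl {
    // Step 1: remove bots and crawlers (naive User-Agent substring check).
    static boolean isHuman(Event e) {
        String ua = e.userAgent.toLowerCase(Locale.ROOT);
        return !(ua.contains("bot") || ua.contains("crawler") || ua.contains("spider"));
    }

    // Step 2: pick out relevant events (here: an assumed URI prefix).
    static boolean isRelevant(Event e) {
        return e.uri.startsWith("/broadband/");
    }

    // Steps 3 and 4: group events by user, each group ordered by time.
    static Map<String, List<String>> sequences(List<Event> events) {
        return events.stream()
                .filter(SessionEtl::isHuman)
                .filter(SessionEtl::isRelevant)
                .sorted(Comparator.comparingDouble((Event e) -> e.msec))
                .collect(Collectors.groupingBy(
                        e -> e.user,
                        LinkedHashMap::new,
                        Collectors.mapping(e -> e.uri, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("u1", "/broadband/compare", "Mozilla/5.0", 1344196920.0),
                new Event("bot", "/broadband/", "Googlebot/2.1", 1344196915.0),
                new Event("u1", "/broadband/", "Mozilla/5.0", 1344196910.137));
        // prints {u1=[/broadband/, /broadband/compare]}
        System.out.println(sequences(events));
    }
}
```

Each step is a small function of its own, so a step can be swapped or reordered without touching the others.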
Using this

✤   an abstraction framework for building MapReduce jobs

✤   linearly scalable data processing

✤   www.cascading.org
Word Count - MapReduce

✤   public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable>

    ✤   public void map(LongWritable key, Text value, OutputCollector<Text,
        IntWritable> output, Reporter reporter) throws IOException

    ✤   “take this line and split it into word tokens”

✤   public static class Reduce extends MapReduceBase implements Reducer<Text,
    IntWritable, Text, IntWritable>

    ✤   public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,
        IntWritable> output, Reporter reporter) throws IOException

    ✤   “take each word token and increment a counter”
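
To make the two phases concrete, here is a plain-Java sketch that mimics them without any Hadoop dependency. It is only a simulation under stated assumptions: the `shuffle` method stands in for the grouping that the framework performs between map and reduce, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountPhases {
    // Map phase: take a line, split it into word tokens, emit (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(Map.entry(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle (done by the framework in real MapReduce): group values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the emitted counts for one word.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, values) -> counts.put(word, reduce(word, values)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

Even in this stripped-down form, most of the code is plumbing around two one-line ideas.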
Word Count - Cascalog
Benefits of a Domain Specific Language

✤   At the same level of abstraction as the problem

    ✤   split words, then do a count on them

✤   Less custom code, less prone to implementation bugs

✤   More readable

✤   More productive
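
As a rough analogy only (this is Java Streams, not Cascalog, and it runs on a single machine), the "split words, then do a count on them" phrasing maps almost one-to-one onto a declarative pipeline. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DeclarativeWordCount {
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // split words
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // then do a count on them
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

The code reads in the same order as the problem statement, which is the point the slide makes about working at the problem's level of abstraction.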
TF-IDF

✤   Extended from word count example

✤   Single-purpose methods

✤   Composition of functions

✤   github.com/Quantisan/Impatient

✤   github.com/Cascading/Impatient
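
A minimal single-machine sketch of how TF-IDF can be composed from single-purpose methods, with word count reused as the term-frequency step. It is not the code from either repository above. The scoring formula, tf(t, d) × ln(N / df(t)), is one common variant chosen here as an assumption, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class TfIdfSketch {
    // Single-purpose step: split a document into lower-cased word tokens.
    static List<String> tokenize(String doc) {
        return Arrays.stream(doc.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Term frequency: the word count, applied to one document.
    static Map<String, Long> termFrequency(List<String> tokens) {
        return tokens.stream().collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Document frequency: the number of documents each term appears in.
    static Map<String, Long> documentFrequency(List<List<String>> docs) {
        return docs.stream()
                .flatMap(tokens -> tokens.stream().distinct())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Composition of the steps above. `doc` must be one of `docs`,
    // so that every term has a document frequency of at least 1.
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> docs) {
        Map<String, Long> df = documentFrequency(docs);
        Map<String, Double> scores = new TreeMap<>();
        termFrequency(doc).forEach((term, tf) ->
                scores.put(term, tf * Math.log((double) docs.size() / df.get(term))));
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                tokenize("fast broadband deals"),
                tokenize("cheap broadband"));
        System.out.println(tfIdf(docs.get(0), docs));
    }
}
```

A term that occurs in every document, such as "broadband" in the usage example, scores zero because ln(N / N) = 0.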
Our data processing methodology

✤   Apply single-purpose functions to immutable data

✤   Only build what we need as we go

✤   Composability, extensibility, maintainability

✤   Use the right tool for the right task
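
The first and third points can be illustrated with a small Java sketch, assuming nothing beyond the JDK: each step is a single-purpose function that returns a new unmodifiable list, and the pipeline is just their composition. The step names are invented for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ComposeSketch {
    // Each step does one thing and returns a new unmodifiable list.
    static final Function<List<String>, List<String>> TRIM =
            xs -> xs.stream().map(String::trim)
                    .collect(Collectors.toUnmodifiableList());

    static final Function<List<String>, List<String>> DROP_EMPTY =
            xs -> xs.stream().filter(x -> !x.isEmpty())
                    .collect(Collectors.toUnmodifiableList());

    static final Function<List<String>, List<String>> LOWER_CASE =
            xs -> xs.stream().map(x -> x.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toUnmodifiableList());

    // Composability: a pipeline is a chain of existing steps.
    static final Function<List<String>, List<String>> CLEAN =
            TRIM.andThen(DROP_EMPTY).andThen(LOWER_CASE);

    public static void main(String[] args) {
        List<String> raw = List.of(" Broadband ", "  ", "Energy");
        // prints [broadband, energy]; `raw` itself is left unchanged
        System.out.println(CLEAN.apply(raw));
    }
}
```

Extending the pipeline means appending another `andThen` step rather than editing an existing one.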
Contact



✤   Paul Lam, data scientist at uSwitch

✤   @Quantisan

✤   paul.lam@forward.co.uk
