1
New tools for building applications
on Apache Hadoop
Eli Collins
Software Engineer, Cloudera
@elicollins
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-frameworks-api
Presented at QCon New York
www.qconnewyork.com
Agenda
• Context – building better products w/ data
  • Analytics-driven development
  • Diverse data sources & formats
• Tools that make it easier to build apps on Hadoop
  • Apache Avro
  • Apache Crunch
  • Cloudera ML
  • Cloudera CDK
2
Serialization & formats w/ Apache Avro
• Expressive
  • Records, arrays, unions, enums
• Efficient
  • Compact binary, compressed, splittable
• Interoperable
  • Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
  • Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
• Dynamic
  • Can read & write w/o generating code first
• Evolvable
3
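To make the "compact binary" point concrete: the Avro specification encodes int and long values with variable-length zig-zag coding, so small magnitudes take a single byte. The class below is a minimal sketch of that encoding in plain Java; the class and method names are illustrative and are not Avro's own encoder API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: this is not Avro's encoder class.
class ZigZagVarint {
    // Zig-zag maps signed values to unsigned ones so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Emit 7 bits per byte, low-order group first; a set high bit means "more bytes follow".
    static byte[] encode(long value) {
        long z = zigZag(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, -1, 1, 64, 1_000_000}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

A fixed-width long always costs 8 bytes; with this scheme the cost grows with the magnitude of the value instead.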
Complex pipelines w/ Apache Crunch
• Not all data formats are a natural fit for Pig & Hive
  • Workaround – large, custom UDFs (or MR)
• Crunch
  • API for MapReduce in Java (& Scala)
  • Based on Google’s FlumeJava paper
  • Combine a small number of primitives & lightweight UDFs
4
Crunch – advantages
• It’s just Java
  • Full programming language
  • No need to learn or switch between languages
• Natural type system
  • Hadoop writables & Avro native support
• Modular library for reuse
  • Create glue code for data transformation that can be combined with a ML algorithm into a single MR job
5
Crunch – core concepts
PCollection: distributed, unordered collection of
elements w/ parallelDo operator.
PTable: sub-interface of PCollection. Distributed, unordered
multimap. Also has groupByKey operator to aggregate values by key.
Pipeline: coordinates the building and execution of
underlying MapReduce jobs.
6
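One way to get a feel for these concepts without a cluster is an in-memory analogue. The sketch below uses plain Java lists and maps as stand-ins; `MiniPipeline`, its `Emitter` and `DoFn` interfaces, and its `count` helper are hypothetical names that only mimic the shape of the Crunch API, and nothing here runs on MapReduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-ins for illustration; these are NOT the Crunch interfaces.
class MiniPipeline {
    interface Emitter<T> { void emit(T value); }
    interface DoFn<S, T> { void process(S input, Emitter<T> emitter); }

    // parallelDo: apply fn to every element; each call may emit zero or more outputs.
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    // count: group identical elements and tally them, yielding a key -> count table.
    static <T extends Comparable<T>> Map<T, Long> count(List<T> input) {
        Map<T, Long> counts = new TreeMap<>();
        for (T element : input) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    // Word count expressed with the two primitives above.
    static Map<String, Long> wordCount(List<String> lines) {
        List<String> words = parallelDo(lines, (String line, Emitter<String> emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        });
        return count(words);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```

The point of the analogy is that the user supplies only small functions (the splitter), while the framework owns iteration, grouping, and execution.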
Crunch – word count
7
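import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    // args[0] is the input path, args[1] the output path.
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Split each line on whitespace and emit one element per word.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    // Count the occurrences of each distinct word.
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}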
Scrunch – Scala wrapper
8
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]
  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
Based on Google’s Cascade project
Cloudera ML
• Open source libraries and tools to help data scientists perform common tasks
  • Data preparation
  • Model evaluation
• Built-in commands
  • summarize, sample, normalize, pivot, etc.
• K-means clustering on Hadoop
  • Scalable k-means++ by Bahmani et al.
  • Other implementations as well
9
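For readers unfamiliar with k-means++, the sketch below shows the classic sequential seeding step in plain Java: the first center is picked uniformly at random, and each later center is picked with probability proportional to its squared distance from the nearest center chosen so far. This is not the scalable variant by Bahmani et al. and not Cloudera ML's code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sequential k-means++ seeding, for illustration only.
class KMeansPlusPlusSeeding {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static List<double[]> chooseCenters(List<double[]> points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points.get(rng.nextInt(points.size())));
        // nearest[i] = squared distance from point i to its closest chosen center.
        double[] nearest = new double[points.size()];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);
        while (centers.size() < k) {
            double[] latest = centers.get(centers.size() - 1);
            double total = 0;
            for (int i = 0; i < points.size(); i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points.get(i), latest));
                total += nearest[i];
            }
            if (total == 0) {
                break; // every point coincides with an already chosen center
            }
            // Sample an index with probability nearest[i] / total.
            double target = rng.nextDouble() * total;
            int chosen = points.size() - 1; // fallback guards against rounding
            double running = 0;
            for (int i = 0; i < points.size(); i++) {
                running += nearest[i];
                if (running > target) {
                    chosen = i;
                    break;
                }
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0, 0}, new double[] {0, 1},
                new double[] {10, 10}, new double[] {10, 11});
        for (double[] c : chooseCenters(points, 2, new Random(42))) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

Weighting by squared distance makes far-apart points likely seeds, which is why the initial centers tend to land in different clusters.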
Cloudera ML (cont)
• Built using Crunch
• Vector format – leverages Mahout’s Vector interface
& classes
• Record format – thin wrapper on Avro’s
GenericRecord/Schema and HCatRecord/Schema
interfaces
• More at github.com/cloudera/ml
10
Cloudera Development Kit (CDK)
• Open source libraries, tools & docs that make
building systems on Hadoop easier
• Provides higher-level APIs atop existing CDH components
• Codify patterns for common use cases
• Doing the right thing should be easy & obvious
11
CDK – loosely coupled modules
CDK is prescriptive, but:
• Modules can be used independently, as needed
• Doesn’t force you into a programming paradigm
• Doesn’t make you adopt a ton of dependencies
12
CDK – data module
• Easier to work with data sets on Hadoop file systems
• Automatic serialization/de-serialization of Java POJOs
and Avro records
• Automatic compression, file & directory layout
• Automatic partitioning
• Metadata plugin provider (Hive/HCatalog)
13
CDK – example data module usage
// Open a dataset repository rooted at /data on the default Hadoop file system.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data")).get();
// Describe the dataset: an Avro schema plus hash partitioning on userId into 53 buckets.
DatasetDescriptor desc = new DatasetDescriptor.Builder()
    .schema(new File("event.avsc"))
    .partitionStrategy(
        new PartitionStrategy.Builder().hash("userId", 53)).get();
Dataset events = repo.create("events", desc);
// Write a single record, then close the writer.
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(desc.getSchema())
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis()).build());
writer.close();
// Remove the dataset again.
repo.drop("events");
14
CDK – example directory contents
/data                           <- the dataset repository
  /events                       <- a dataset
    /.metadata                  <- per-dataset metadata provider
      /schema.avsc
      /descriptor.properties
    /userId-0                   <- partitioned dataset "entities":
      /xxxx.avro                   Snappy-compressed Avro data files
      /xxxx.avro                   containing individual records
    /userId-1
      /xxxx.avro
    /userId-2
15
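The userId-N directories come from the hash partition strategy in the usage example. The slides do not say how CDK computes the bucket, so the sketch below simply assumes a non-negative hash of the field value taken modulo the bucket count; `HashPartitionSketch` and its methods are hypothetical helpers, not CDK API.

```java
// Hypothetical helper for illustration; not part of the CDK API.
class HashPartitionSketch {
    // Assumption: bucket = non-negative hash of the field value, modulo the bucket count.
    static int bucket(Object fieldValue, int buckets) {
        return Math.floorMod(fieldValue.hashCode(), buckets);
    }

    // Builds a directory name in the "userId-N" style of the listing.
    static String partitionDir(String fieldName, Object fieldValue, int buckets) {
        return fieldName + "-" + bucket(fieldValue, buckets);
    }

    public static void main(String[] args) {
        for (int userId : new int[] {1, 54, 107}) {
            System.out.println(userId + " -> " + partitionDir("userId", userId, 53));
        }
    }
}
```

Under this assumption, every record with the same userId lands in the same directory, and the number of directories is bounded by the bucket count (53 in the example).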
CDK – what’s new & coming
• Log application events to a dataset w/ the log4j API &
Flume as the transport
• Datasets exposed as Crunch sources and targets
• Date partitioning (year/month/day/hour/min)
• More examples
• Morphlines (library for record-transformation)
• More dataset repositories & languages
16
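Date partitioning can be pictured as deriving one directory level per time unit from a record's timestamp. The sketch below is an assumption about how such a layout might look: the `year=/month=/...` naming, the UTC time zone, and the `DatePartitionSketch` class are all illustrative and not taken from CDK.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical layout for illustration; not the CDK implementation.
class DatePartitionSketch {
    // Assumption: one directory level per unit, computed in UTC and zero-padded.
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format("year=%04d/month=%02d/day=%02d/hour=%02d/minute=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour(), t.getMinute());
    }

    public static void main(String[] args) {
        System.out.println(partitionPath(System.currentTimeMillis()));
    }
}
```

With a layout like this, a query over a time range only needs to read the directories that fall inside the range.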
CDK – more info
• github.com/cloudera/cdk
• github.com/cloudera/cdk-examples
• Binary artifacts in Cloudera’s maven repo
• Mailing list: cdk-dev@cloudera.org
17
A guide to Python frameworks for Hadoop
• Uri Laserson, data scientist @ Cloudera
• Streaming, mrjob, dumbo, hadoopy, pydoop & more
• Thursday June 13th 7pm @ Foursquare (NYC HUG)
Interested in more topics like this?
follow @ClouderaEng
18
19
Thank You!
Eli Collins
@elicollins
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Build 3D Animated Safety Induction - Tech EHS
Build 3D Animated Safety Induction - Tech EHSBuild 3D Animated Safety Induction - Tech EHS
Build 3D Animated Safety Induction - Tech EHS
TECH EHS Solution
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf
Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdfAre Cloud PBX Providers in India Reliable for Small Businesses (1).pdf
Are Cloud PBX Providers in India Reliable for Small Businesses (1).pdf
Telecoms Supermarket
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Unlocking the Power of IVR: A Comprehensive Guide
Unlocking the Power of IVR: A Comprehensive GuideUnlocking the Power of IVR: A Comprehensive Guide
Unlocking the Power of IVR: A Comprehensive Guide
vikasascentbpo
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 

Building Applications using Apache Hadoop

  • 1. New tools for building applications on Apache Hadoop. Eli Collins, Software Engineer, Cloudera (@elicollins)
  • 2. InfoQ.com: News & Community Site
      • 750,000 unique visitors/month
      • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
      • Post content from our QCon conferences
      • News 15-20 / week
      • Articles 3-4 / week
      • Presentations (videos) 12-15 / week
      • Interviews 2-3 / week
      • Books 1 / month
      Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/hadoop-frameworks-api
  • 3. Presented at QCon New York (www.qconnewyork.com)
      Purpose of QCon
      - to empower software development by facilitating the spread of knowledge and innovation
      Strategy
      - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
      - speakers and topics driving the evolution and innovation
      - connecting and catalyzing the influencers and innovators
      Highlights
      - attended by more than 12,000 delegates since 2007
      - held in 9 cities worldwide
  • 4. Agenda
      • Context – building better products w/ data
          • Analytics-driven development
          • Diverse data sources & formats
      • Tools that make it easier to build apps on Hadoop
          • Apache Avro
          • Apache Crunch
          • Cloudera ML
          • Cloudera CDK
  • 5. Serialization & formats w/ Apache Avro
      • Expressive
          • Records, arrays, unions, enums
      • Efficient
          • Compact binary, compressed, splittable
      • Interoperable
          • Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
          • Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
      • Dynamic
          • Can read & write w/o generating code first
      • Evolvable
  • 6. Complex pipelines w/ Apache Crunch
      • Not all data formats are a natural fit for Pig & Hive
          • Workaround – large, custom UDFs (or MR)
      • Crunch
          • API for MapReduce in Java (& Scala)
          • Based on Google’s FlumeJava paper
          • Combine a small number of primitives & light-weight UDFs
  • 7. Crunch – advantages
      • It’s just Java
          • Full programming language
          • No need to learn or switch between languages
      • Natural type system
          • Hadoop writables & Avro native support
      • Modular library for reuse
          • Create glue code for data transformation that can be combined with a ML algorithm into a single MR job
  • 8. Crunch – core concepts
      • PCollection: distributed, unordered collection of elements w/ a parallelDo operator.
      • PTable: sub-interface of PCollection. Distributed, sorted map. Also has a groupBy operator to aggregate values by key.
      • Pipeline: coordinates the building and execution of the underlying MapReduce jobs.
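To make the roles of the two operators above more concrete, here is a small in-memory analogy in plain Java. It is only a sketch and does not use the Crunch API: the class and method names (`InMemoryWordCount`, `splitWords`, `countWords`) are made up for illustration, `flatMap` stands in for a parallelDo that may emit several outputs per input, and a `TreeMap` built with `groupingBy` stands in for a sorted, keyed table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory analogy only: none of these names come from Crunch.
class InMemoryWordCount {

    // Plays the role of parallelDo: each input line may emit zero or more words.
    static List<String> splitWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Plays the role of grouping by key and aggregating: a sorted map of word -> count.
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(splitWords(lines)));
    }
}
```

In a real pipeline the same two steps would run as distributed MapReduce stages rather than over a local list.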
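  • 9. Crunch – word count

      import org.apache.crunch.DoFn;
      import org.apache.crunch.Emitter;
      import org.apache.crunch.PCollection;
      import org.apache.crunch.PTable;
      import org.apache.crunch.Pipeline;
      import org.apache.crunch.impl.mr.MRPipeline;
      import org.apache.crunch.lib.Aggregate;
      import org.apache.crunch.types.writable.Writables;

      public class WordCount {
        public static void main(String[] args) throws Exception {
          Pipeline pipeline = new MRPipeline(WordCount.class);
          // args[0] is the input path; each element is one line of text.
          PCollection<String> lines = pipeline.readTextFile(args[0]);
          // Split every line on whitespace and emit one element per word.
          PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
            public void process(String line, Emitter<String> emitter) {
              for (String word : line.split("\\s+")) {
                emitter.emit(word);
              }
            }
          }, Writables.strings());
          // Count the occurrences of each distinct word.
          PTable<String, Long> counts = Aggregate.count(words);
          // args[1] is the output path.
          pipeline.writeTextFile(counts, args[1]);
          pipeline.run();
        }
      }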
  • 10. Scrunch – Scala wrapper

      class WordCountExample {
        val pipeline = new Pipeline[WordCountExample]

        def wordCount(fileName: String) = {
          pipeline.read(from.textFile(fileName))
            .flatMap(_.toLowerCase.split("\\W+"))  // split on runs of non-word characters
            .filter(!_.isEmpty())
            .count
        }
      }

      Based on Google’s Cascade project
  • 11. Cloudera ML
      • Open source libraries and tools to help data scientists perform common tasks
          • Data preparation
          • Model evaluation
      • Built-in commands
          • summarize, sample, normalize, pivot, etc.
      • K-means clustering on Hadoop
          • Scalable k-means++ by Bahmani et al.
          • Other implementations as well
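The slide above names the scalable k-means++ variant by Bahmani et al. As background, the sketch below shows the simpler sequential k-means++ seeding step that the scalable variant builds on. It is not Cloudera ML's implementation, and the class and method names (`KMeansPlusPlusSeeding`, `chooseCenters`) are hypothetical. The first center is picked uniformly at random; each later center is picked with probability proportional to its squared distance from the nearest center chosen so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sequential k-means++ seeding; illustrative only, not taken from Cloudera ML.
class KMeansPlusPlusSeeding {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static List<double[]> chooseCenters(List<double[]> points, int k, Random rng) {
        if (k < 1 || k > points.size()) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        List<double[]> centers = new ArrayList<>();
        centers.add(points.get(rng.nextInt(points.size())));

        // nearest[i] = squared distance from point i to its closest chosen center.
        double[] nearest = new double[points.size()];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] latest = centers.get(centers.size() - 1);
            double total = 0.0;
            for (int i = 0; i < points.size(); i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points.get(i), latest));
                total += nearest[i];
            }
            if (total == 0.0) {
                // Every point coincides with a chosen center; fall back to a uniform pick.
                centers.add(points.get(rng.nextInt(points.size())));
                continue;
            }
            // Sample an index with probability nearest[i] / total.
            double target = rng.nextDouble() * total;
            double running = 0.0;
            int chosen = -1;
            for (int i = 0; i < points.size(); i++) {
                if (nearest[i] == 0.0) {
                    continue; // already a center (or a duplicate of one)
                }
                running += nearest[i];
                chosen = i; // keeps the last valid candidate in case of rounding error
                if (running > target) {
                    break;
                }
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0, 0}, new double[] {0, 1},
                new double[] {10, 10}, new double[] {10, 11});
        for (double[] center : chooseCenters(points, 2, new Random(42))) {
            System.out.println(Arrays.toString(center));
        }
    }
}
```

Because a point that is already a center has weight zero, distinct input points are never chosen twice, which is what spreads the initial centers apart.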
  • 12. Cloudera ML (cont)
      • Built using Crunch
      • Vector format – leverages Mahout’s Vector interface & classes
      • Record format – thin wrapper on Avro’s GenericRecord/Schema and HCatRecord/Schema interfaces
      • More at github.com/cloudera/ml
  • 13. Cloudera Development Kit (CDK)
      • Open source libraries, tools & docs that make building systems on Hadoop easier
      • Provides higher-level APIs atop existing CDH components
      • Codify patterns for common use cases
      • Doing the right thing should be easy & obvious
  • 14. CDK – loosely coupled modules
      CDK is prescriptive, but...
      • Modules can be used independently, as needed
      • Doesn’t force you into a programming paradigm
      • Doesn’t make you adopt a ton of dependencies
  • 15. CDK – data module
      • Easier to work with data sets on Hadoop file systems
      • Automatic serialization/de-serialization of Java POJOs and Avro records
      • Automatic compression, file & directory layout
      • Automatic partitioning
      • Metadata plugin provider (Hive/HCatalog)
  • 16. CDK – example data module usage

      // Open a dataset repository rooted at /data on the default file system.
      DatasetRepository repo = new FileSystemDatasetRepository.Builder()
          .fileSystem(FileSystem.get(new Configuration()))
          .directory(new Path("/data")).get();

      // Describe the dataset: an Avro schema, hash-partitioned on userId into 53 buckets.
      DatasetDescriptor desc = new DatasetDescriptor.Builder()
          .schema(new File("event.avsc"))
          .partitionStrategy(
              new PartitionStrategy.Builder().hash("userId", 53)).get();

      Dataset events = repo.create("events", desc);

      // Write a single record.
      DatasetWriter<GenericRecord> writer = events.getWriter();
      writer.open();
      writer.write(
          new GenericRecordBuilder(desc.getSchema())
              .set("userId", 1)
              .set("timeStamp", System.currentTimeMillis()).build());
      writer.close();

      // Remove the dataset again.
      repo.drop("events");
  • 17. CDK – example directory contents

      /data                          The dataset repository
        /events                      A dataset
          /.metadata                 Per-dataset metadata provider
            /schema.avsc
            /descriptor.properties
          /userId-0                  Partitioned dataset “entities”:
            /xxxx.avro               Snappy-compressed Avro data files
            /xxxx.avro               containing individual records
          /userId-1
            /xxxx.avro
          /userId-2
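The link between the `hash("userId", 53)` partition strategy and directory names such as `/userId-0` can be illustrated with a short sketch. This is an assumption-laden illustration and not CDK code: the class and method names are hypothetical, and the real hash function used by CDK may well differ. Here a key is simply mapped to a bucket with a non-negative modulo of its `hashCode`, and the bucket number becomes the directory suffix.

```java
// Hypothetical sketch of hash partitioning; not the CDK implementation.
class HashPartitionSketch {

    // Assumption: bucket = hashCode modulo the bucket count, kept non-negative.
    static int bucketFor(Object key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    // Builds a partition directory name in the "<field>-<bucket>" style shown on the slide.
    static String partitionDirectory(String field, Object key, int buckets) {
        return field + "-" + bucketFor(key, buckets);
    }

    public static void main(String[] args) {
        int buckets = 53;
        for (int userId : new int[] {1, 53, 54, 2013}) {
            System.out.println(userId + " -> /data/events/"
                    + partitionDirectory("userId", userId, buckets));
        }
    }
}
```

With a fixed bucket count, records for the same user always land in the same directory, and the number of partition directories stays bounded no matter how many users there are.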
  • 18. CDK – what’s new & coming
      • Log application events to a dataset w/ the log4j API & Flume as the transport
      • Datasets exposed as Crunch sources and targets
      • Date partitioning (year/month/day/hour/min)
      • More examples
      • Morphlines (library for record transformation)
      • More dataset repositories & languages
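Date partitioning by year/month/day/hour/min can be pictured as deriving a directory path from each record's timestamp. The sketch below is only an illustration under stated assumptions: the class name, the UTC time zone, and the zero-padded path layout are all hypothetical choices, not the layout CDK actually uses.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical sketch of date partitioning; the path layout is an assumption.
class DatePartitionSketch {

    // Maps a timestamp in epoch milliseconds to a year/month/day/hour/minute path (UTC).
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "%04d/%02d/%02d/%02d/%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
                t.getHour(), t.getMinute());
    }

    public static void main(String[] args) {
        System.out.println(partitionPath(System.currentTimeMillis()));
    }
}
```

Laying records out this way lets a job that only needs one hour or one day of events read just the matching directories.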
  • 19. CDK – more info
      • github.com/cloudera/cdk
      • github.com/cloudera/cdk-examples
      • Binary artifacts in Cloudera’s maven repo
      • Mailing list: [email protected]
  • 20. A guide to Python frameworks for Hadoop
      • Uri Laserson, data scientist @ Cloudera
      • Streaming, mrjob, dumbo, hadoopy, pydoop & more
      • Thursday June 13th 7pm @ Foursquare (NYC HUG)
      Interested in more topics like this? Follow @ClouderaEng