FLINK - A CONVENIENT ABSTRACTION LAYER FOR YARN?
VYACHESLAV ZHOLUDEV
INTRODUCTION
• YARN opened Hadoop to many more developers
  • API to integrate into a Hadoop cluster
  • Flexibility
  • Applications: MR, Tez, Flink, Spark, …
• Flink has made great use of this opportunity
  • Flexible program execution graph
  • Operators other than Map and Reduce
  • Clean and convenient API
  • Efficient with I/O
EXPECTATIONS FROM YARN
• New programming models in addition to MapReduce
• More alternatives for cases where the MapReduce paradigm is not a good fit
• Flexibility in expressing operations on data
• Elasticity of the cluster
• Ability to write your own applications to distribute computations across the cluster
DISTRIBUTING COMPUTATIONAL TASKS
• Writing your own YARN application
  • Complicated
  • Tedious
  • Error-prone
• Somebody must have done something simpler
• Apache Twill
  • Still not simple enough
• Execute CLI tools remotely (if everything else fails)
• Flink?
FLINK AT RESEARCHGATE
Lots of benefits:
• Made MapReduce jobs more readable
  • More compact
  • Less boilerplate code
  • Easier to understand and maintain
• Got rid of ugly Hive queries and optimised runtime
• Better and cleaner orchestration of workflow subtasks (previously we had to glue multiple MR jobs together)
• Iterative machine learning algorithms
• Distributing computational tasks across a cluster
REAL USE CASE: MONGODB TO AVRO BRIDGE
A MongoDB to Avro bridge (aka Avrongo), used to dump live DB data to HDFS for further batch processing and analytics
• In essence:
  • Reads MongoDB documents
  • Converts them to Avro records (based on a provided Avro schema)
  • Persists them on HDFS
• Avrongo evolution:
  • Single-threaded program
  • Multi-threaded program talking to different shards in parallel
  • Distributed across the cluster
• Reasons for distributing:
  • We were CPU-bound
  • HDFS load distribution
HOW DOES AVRONGO WORK?
Basic Version
• One thread
• Using one MongoDB cursor to iterate the whole collection
• Suitable for smaller collections
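The read-and-convert loop of the basic version can be sketched as follows. This is only an illustration under stated assumptions: it uses neither a MongoDB driver nor an Avro library, documents are plain dicts, the schema imitates Avro's JSON form, and the names `SCHEMA`, `CASTS`, `to_record` and `convert_all` are hypothetical. The step that persists records on HDFS is left out.

```python
# Hypothetical, simplified stand-in for the Avrongo converter.
# No MongoDB or Avro library is used: documents are dicts and the
# schema only imitates the shape of an Avro record schema.
SCHEMA = {
    "type": "record",
    "name": "Publication",  # hypothetical record name
    "fields": [
        {"name": "_id", "type": "string"},
        {"name": "title", "type": "string"},
        {"name": "citations", "type": "long", "default": 0},
    ],
}

# Avro primitive type names mapped to Python casts (assumption: primitives only).
CASTS = {"string": str, "long": int, "int": int, "double": float, "boolean": bool}


def to_record(document, schema):
    """Project one document onto the fields declared in the schema."""
    record = {}
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name in document:
            record[name] = CASTS[avro_type](document[name])
        elif "default" in field:
            record[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return record


def convert_all(cursor, schema):
    """Walk a cursor-like iterable once, in a single thread."""
    return [to_record(doc, schema) for doc in cursor]
```

Fields that are not declared in the schema are dropped, and a missing field falls back to its declared default if there is one.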
MONGODB SHARDS AND CHUNKS
Utilizing MongoDB chunks
• Controlling load on the MongoDB cluster
• Deterministic way of splitting a collection for input
AVRONGO - SHARDED VERSION
• Collecting chunk information (chunks are sets of documents living on a particular shard)
• Processing chunks of each shard in a separate group of threads
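A minimal sketch of the sharded idea, assuming each chunk is described by a dict with `shard`, `min` and `max` keys (integer bounds are used here purely for illustration) and that `process_chunk` stands in for the real read-and-convert work. All function names are hypothetical. Each shard gets its own thread pool, so the number of concurrent readers per shard is capped.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_chunks_by_shard(chunks):
    """Collect chunk descriptors into one list per shard."""
    by_shard = defaultdict(list)
    for chunk in chunks:
        by_shard[chunk["shard"]].append(chunk)
    return dict(by_shard)


def process_chunk(chunk):
    """Placeholder for reading and converting the documents of one chunk.

    Here it only returns the size of the (integer) key range.
    """
    return chunk["max"] - chunk["min"]


def run_sharded(chunks, threads_per_shard=2):
    """Process the chunks of every shard in a separate group of threads."""
    by_shard = group_chunks_by_shard(chunks)
    pools = {
        shard: ThreadPoolExecutor(max_workers=threads_per_shard)
        for shard in by_shard
    }
    try:
        futures = {
            shard: [pools[shard].submit(process_chunk, c) for c in shard_chunks]
            for shard, shard_chunks in by_shard.items()
        }
        # Sum the per-chunk results for each shard.
        return {shard: sum(f.result() for f in fs) for shard, fs in futures.items()}
    finally:
        for pool in pools.values():
            pool.shutdown()
```

Keeping one pool per shard means `threads_per_shard` directly bounds the load placed on any single shard.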
AVRONGO - FLINK VERSION
• Custom InputFormat that distributes MongoDB chunks uniformly
• FlatMap operator
• Number of task nodes = (number of shards) × (parallelism per shard)
• Custom Generic AvroOutputFormat
• Slower shards receive a bit more attention
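The task-count formula and the uniform distribution of chunks can be illustrated with a small planning function. This is a sketch that does not use the Flink API; `plan_tasks` and the shape of its input are assumptions, and the extra attention given to slower shards is not modelled.

```python
def plan_tasks(chunks_by_shard, parallelism_per_shard):
    """Build one task per (shard, slot) pair and spread each shard's
    chunks round-robin over that shard's slots."""
    shards = sorted(chunks_by_shard)
    expected_tasks = len(shards) * parallelism_per_shard
    tasks = []
    for shard in shards:
        for slot in range(parallelism_per_shard):
            # Every parallelism_per_shard-th chunk, starting at this slot.
            assigned = chunks_by_shard[shard][slot::parallelism_per_shard]
            tasks.append({"shard": shard, "slot": slot, "chunks": assigned})
    # Number of tasks = (number of shards) x (parallelism per shard).
    assert len(tasks) == expected_tasks
    return tasks
```

With two shards and a parallelism of 2 per shard, the plan contains four tasks, and each chunk is assigned to exactly one of them.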
FLINK APPROACH
Outcome
• No longer bound by CPU
• Imports to HDFS are faster
  • Some collections: from 6h to 2.5h, or from 3.5h to 2h
Benefits
• Very few lines of code
• Same command line interface (no effort needed to migrate to the Flink-based version)
• Reusing the same converter as in the standalone versions
• All orchestration and parallelisation work is done automatically by Flink
ANOTHER USE CASE: DISTRIBUTED FILE COPYING
HADOOP DISTCP
• Generates a MapReduce job that copies large amounts of data
• List of files as input to a map task
• Two types of input formats:
  • UniformSizeInputFormat
  • DynamicInputFormat
    • gives more load to faster mappers
    • complicated code
    • utilizes the file system to feed the mappers
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
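The difference between the two strategies can be seen in a small simulation. It is a sketch under simplifying assumptions, not the Hadoop implementation: files are only sizes, mappers are only speeds (bytes per time unit), work is handed out in memory rather than through the file system, and `uniform_run` and `dynamic_run` are hypothetical names.

```python
import heapq


def uniform_run(file_sizes, speeds):
    """Fix the assignment up front so every mapper gets about the same
    number of bytes, regardless of how fast it is."""
    loads = [0] * len(speeds)
    for size in file_sizes:
        worker = loads.index(min(loads))  # least-loaded mapper so far
        loads[worker] += size
    finish = max(load / speed for load, speed in zip(loads, speeds))
    return loads, finish


def dynamic_run(file_sizes, speeds):
    """Let whichever mapper becomes idle first take the next file,
    so faster mappers end up copying more."""
    idle_at = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(idle_at)
    loads = [0] * len(speeds)
    for size in file_sizes:
        free_time, worker = heapq.heappop(idle_at)
        loads[worker] += size
        heapq.heappush(idle_at, (free_time + size / speeds[worker], worker))
    finish = max(time for time, _ in idle_at)
    return loads, finish
```

For six files of size 10 and two mappers with speeds 1.0 and 2.0, the uniform split gives each mapper 30 bytes and finishes at time 30.0, while the dynamic variant gives the faster mapper 40 bytes and finishes at time 20.0.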
FLINK DISTCP
• Implements the same logic as the DynamicInputFormat of Hadoop’s distcp
• Far fewer lines of code
• Same runtime as Hadoop distcp
• Available in the Flink Java examples
• Not fault-tolerant (yet)
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
CONCLUSIONS
• Flink - a thin layer for implementing your YARN application for parallelising independent tasks on the cluster
  • Thanks to custom input formats that are easy to implement
  • No boilerplate code
Would be nice to have:
• Elasticity
• Better progress tracking
• Fault tolerance
Custom input format + a Flink operator with business logic = Happiness
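The pattern in the last line can be imitated in a few lines of plain Python. This sketch does not use Flink: `ListInputFormat`, `create_splits` and `run_job` are hypothetical names, and a local thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class ListInputFormat:
    """Hypothetical input format: cuts a list of independent work items
    into splits that can be processed in parallel."""

    def __init__(self, items):
        self.items = list(items)

    def create_splits(self, num_splits):
        # Round-robin distribution of items over the splits.
        return [self.items[i::num_splits] for i in range(num_splits)]


def run_job(input_format, flat_map, parallelism):
    """Apply the business-logic operator to every item of every split."""
    splits = input_format.create_splits(parallelism)

    def run_split(split):
        out = []
        for item in split:
            out.extend(flat_map(item))  # an operator may emit 0..n results
        return out

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        parts = list(pool.map(run_split, splits))
    return [result for part in parts for result in part]
```

The only job-specific pieces are the way the input is cut into splits and the function passed as `flat_map`; the scheduling and parallelisation belong to the framework.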
QUESTIONS?
https://www.researchgate.net/careers
