Apache Spark vs Apache Flink
Two of the most contemporary general-purpose data processing platforms.
AKASH SIHAG
Ph. No.: +91-7737111579
asihag70@gmail.com, akash.sihag@infoobjects.com
Introduction
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
Apache Flink
Apache Flink is an open source platform for distributed stream and batch data processing.
The Inception
Apache Spark
● 2009 : Started at UC Berkeley's AMPLab by Matei Zaharia.
● 2010 : Open-sourced under a BSD license.
● 2013 : Donated to the Apache Software Foundation; license switched to Apache 2.0.
● 2014 : Became a top-level Apache project.
Apache Flink
● 2010 : Started as a collaboration of Technical University Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
● 2014 : Entered the Apache Incubator.
● 2014 (Dec) : Became a top-level Apache project.
Overview Spark
Components:
● Spark SQL : For SQL and structured data processing.
● Spark Streaming : For processing live streaming data.
● MLlib : Machine learning algorithms.
● GraphX : Graph-based processing.
● Spark Core : The base processing engine in Spark. It works on the concept of RDDs, and all the other APIs reside on top of it.
Deploy:
● Standalone : Included with Spark.
● Apache Mesos
● Apache YARN : Hadoop 2 resource manager.
Overview Flink
Components:
● DataStream API : For unbounded streams.
● DataSet API : For batch processing.
● Table API : For SQL-like operations.
● CEP : Complex event processing API.
● ML library : For machine learning algorithms.
● Gelly : Graph processing API.
Deploy:
● Standalone : Included with Flink.
● Local : Single JVM.
● Apache YARN : Hadoop 2 resource manager.
Deep Dive
Computing Paradigm
Apache Spark
● Works on the abstraction of RDDs, i.e. resilient distributed datasets.
● Supports in-memory computation.
● Lazy evaluation (transformation-action).
● A DAG is generated for every Spark job.
● Streams are processed as a series of small batches.
Apache Flink
● Works on the abstraction of cyclic data flows.
● Supports in-memory computation.
● Lazy evaluation (iterative-transformation).
● Job graphs are generated.
● Batches are processed as streams.
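The transformation-action style of lazy evaluation listed above can be illustrated with a small toy model. This is a minimal sketch in plain Python, not the Spark or Flink API: the class `LazyDataset`, its methods, and the helper `square` are hypothetical names chosen for illustration. Transformations only record what should happen; the work runs when an action asks for a result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark RDD)."""

    def __init__(self, source, ops=()):
        self._source = source      # any iterable of input records
        self._ops = tuple(ops)     # recorded chain of (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the recorded chain.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: replays the recorded chain over the source and returns a list.
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []

def square(x):
    calls.append(x)   # side effect lets us observe when work really happens
    return x * x

pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
work_before_action = len(calls)   # still 0: no element has been squared yet
result = pipeline.collect()       # the action triggers the whole chain
```

In this sketch the recorded chain plays the role that the DAG or job graph plays in the real engines: it describes the job before any data is touched.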
Similarities
● Both are data processing platforms.
● Both offer similar collection-style APIs.
● Both leverage frameworks such as Akka and YARN.
● Since the APIs are similar, porting code takes less effort.
● Both provide stream and batch processing.
● Both are fault-tolerant.
● Both have APIs in Java and Scala.
Apache Spark VS Apache Flink
Apache Spark
● Near-real-time stream processing.
● Batch and streaming transformations can be combined.
● Limited window-based operations.
● Catalyst optimizer for SQL operations.
● Stateful operations up to v1.5 are not very efficient.
Note: In Spark 1.6, stateful operations are drastically improved.
● Structured data source support is mature.
Ex: HiveContext can be created directly via Spark SQL.
● More committers and third-party APIs.
● Spark uses Java heap memory allocation for cached data.
Note: From Spark 1.5, Spark started implementing off-heap memory allocation (Tungsten).
● ML algorithms are implemented via DAGs.
Apache Flink
● Real-time stream processing.
● Combining batch and stream operations is not possible, so operating on historical data together with live streams is not well supported.
● Various flavours of window-based operations, based on triggers, record counts and events.
● Optimizer for streams as well as batches.
● Efficient stateful stream operations.
● Structured data support is not yet mature; only the Hadoop InputFormat API is available.
● Relatively new ecosystem.
● Flink has implemented custom memory allocation since its inception.
● ML algorithms are implemented in a native style.
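The difference between near-real-time (micro-batch) processing and record-at-a-time processing with count-based windows can be sketched with a toy model. The code below is plain Python and uses neither engine's API; the function names `micro_batches` and `count_windows`, the batch interval, and the sample events are assumptions made for illustration only.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive time-sliced batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    # Each batch is processed as a unit, so a result appears once per time slice.
    return [sum(batches[k]) for k in sorted(batches)]


def count_windows(values, size):
    """Handle records one at a time; emit a sum whenever `size` records have arrived."""
    window = []
    for value in values:
        window.append(value)
        if len(window) == size:   # record-count trigger fires
            yield sum(window)
            window = []


events = [(0, 1), (1, 2), (2, 3), (5, 4), (6, 5), (11, 6)]
batch_sums = micro_batches(events, interval=5)                      # [6, 9, 6]
window_sums = list(count_windows([v for _, v in events], size=2))   # [3, 7, 11]
```

In the micro-batch version a result can only appear at a batch boundary, whereas the count window fires as soon as its trigger condition is met by an incoming record. This is one way to read the "near real time" versus "real time" distinction in the comparison above.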
Conclusion:
Past
● Spark came first as a unified platform and led the Big Data world.
● Flink took some time to come into existence.
Present
● Thanks to its lead, Spark is now more mature and has a big community and API support.
● Flink improved on the unified platform idea and can also solve Spark’s limitations to some extent. It claims to be faster in stream as well as batch processing.
Future
● As Spark has a very fast development cycle, it is expected to improve over time.
● Flink has proved itself better than Spark as far as abstraction is concerned, but it is still a newcomer.
THANK YOU

