Juggling with Bits and Bytes
How Apache Flink operates on binary data

Fabian Hueske
fhueske@apache.org    @fhueske
  
Big Data frameworks on JVMs
•  Many (open source) Big Data frameworks run on JVMs
  –  Hadoop, Drill, Spark, Hive, Pig, and ...
  –  Flink as well
•  Common challenge: How to organize data in-memory?
  –  In-memory processing (sorting, joining, aggregating)
  –  In-memory caching of intermediate results
•  Memory management of a system influences
  –  Reliability
  –  Resource efficiency, performance & performance predictability
  –  Ease of configuration
  
The straightforward approach
Store and process data as objects on the heap
•  Put objects in an array and sort it

A few notable drawbacks
•  Predicting memory consumption is hard
  –  If you fail, an OutOfMemoryError will kill you!
•  High garbage collection overhead
  –  Easily 50% of time spent on GC
•  Objects have considerable space overhead
  –  At least 8 bytes for each (nested) object! (Depends on arch)
  
FLINK’S APPROACH
  
Flink adopts DBMS technology
•  Allocates fixed number of memory segments upfront
•  Data objects are serialized into memory segments
•  DBMS-style algorithms work on binary representation
  
Why is that good?
•  Memory-safe execution
  –  Used and available memory segments are easy to count
  –  No parameter tuning for reliable operations!
•  Efficient out-of-core algorithms
  –  Memory segments can be efficiently written to disk
•  Reduced GC pressure
  –  Memory segments are off-heap or never deallocated
  –  Data objects are short-lived or reused
•  Space-efficient data representation
•  Efficient operations on binary data
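
The "easy to count" point can be illustrated with a minimal sketch of a fixed pool of memory segments. The class name SegmentPool and its methods are hypothetical and only illustrate the idea; this is not Flink's actual memory manager.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical fixed-size pool of memory segments (illustration only). */
final class SegmentPool {
    private final Deque<byte[]> free = new ArrayDeque<>();
    private final int total;

    SegmentPool(int numSegments, int segmentSize) {
        this.total = numSegments;
        // All memory is allocated once, upfront; nothing is allocated later.
        for (int i = 0; i < numSegments; i++) {
            free.push(new byte[segmentSize]);
        }
    }

    /** Returns a free segment, or null when the budget is exhausted. */
    byte[] request() {
        return free.poll();
    }

    /** Hands a segment back to the pool so it can be reused. */
    void release(byte[] segment) {
        free.push(segment);
    }

    int available() {
        return free.size();
    }

    int used() {
        return total - free.size();
    }

    public static void main(String[] args) {
        SegmentPool pool = new SegmentPool(4, 32 * 1024);
        byte[] a = pool.request();
        byte[] b = pool.request();
        System.out.println("used=" + pool.used() + ", available=" + pool.available());
        pool.release(a);
        pool.release(b);
        System.out.println("used=" + pool.used() + ", available=" + pool.available());
    }
}
```

Because request() returns null instead of allocating more memory, a caller can react to an exhausted budget (for example by writing data to disk) rather than running into an OutOfMemoryError.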
  
What does it cost?
•  Significant implementation investment
  –  Using java.util.HashMap
     vs.
  –  Implementing a spillable hash table backed by byte arrays
     and a custom serialization stack
•  Other systems use similar techniques
  –  Apache Drill, Apache AsterixDB (incubating)
•  Apache Spark is evolving in a similar direction
  
MEMORY ALLOCATION
  
Memory segments
•  Unit of memory distribution in Flink
  –  Fixed number allocated when worker starts
•  Backed by a regular byte array (default 32KB)
•  On-heap or off-heap allocation
•  R/W access through Java’s efficient unsafe methods
•  Multiple memory segments can be logically concatenated to a larger chunk of memory
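
A rough sketch of the last bullet: several fixed-size segments addressed as one larger chunk of memory. The class SegmentedMemory is hypothetical, and it uses java.nio.ByteBuffer in place of the unsafe methods named on the slide to keep the example portable.

```java
import java.nio.ByteBuffer;

/** Hypothetical view that treats several fixed-size segments as one address space. */
final class SegmentedMemory {
    private final ByteBuffer[] segments;
    private final int segmentSize;

    SegmentedMemory(int numSegments, int segmentSize) {
        this.segmentSize = segmentSize;
        this.segments = new ByteBuffer[numSegments];
        for (int i = 0; i < numSegments; i++) {
            // Each segment is backed by a regular on-heap byte array.
            segments[i] = ByteBuffer.wrap(new byte[segmentSize]);
        }
    }

    /** Maps a logical position to (segment, offset) and writes one byte. */
    void put(long position, byte value) {
        segments[(int) (position / segmentSize)].put((int) (position % segmentSize), value);
    }

    byte get(long position) {
        return segments[(int) (position / segmentSize)].get((int) (position % segmentSize));
    }

    /** Writes a big-endian int byte by byte, so it may straddle a segment boundary. */
    void putInt(long position, int value) {
        for (int i = 0; i < 4; i++) {
            put(position + i, (byte) (value >>> (24 - 8 * i)));
        }
    }

    int getInt(long position) {
        int value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (get(position + i) & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        SegmentedMemory memory = new SegmentedMemory(2, 32 * 1024);
        long position = 32 * 1024 - 2; // the int spans the end of segment 0 and the start of segment 1
        memory.putInt(position, 123456789);
        System.out.println(memory.getInt(position));
    }
}
```

The byte-wise loop is the simplest way to cross a boundary; a tuned implementation would presumably write whole values when they fit into one segment.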
  
On-heap memory allocation

Off-heap memory allocation
  
On-heap vs. Off-heap
•  No significant performance difference in micro-benchmarks
•  Garbage Collection
  –  Smaller heap -> faster GC
•  Faster start-up time
  –  A multi-GB JVM heap takes time to allocate
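
In plain Java, the two allocation modes can be tried out with java.nio.ByteBuffer, as in the short sketch below. Using ByteBuffer here is an assumption for illustration; the slides do not state which API Flink uses for allocation, and the helper name allocate is hypothetical.

```java
import java.nio.ByteBuffer;

final class HeapVsOffHeap {

    /** Allocates a segment either on the JVM heap or outside of it (direct memory). */
    static ByteBuffer allocate(int size, boolean offHeap) {
        return offHeap ? ByteBuffer.allocateDirect(size) : ByteBuffer.allocate(size);
    }

    public static void main(String[] args) {
        ByteBuffer onHeap = allocate(32 * 1024, false);
        ByteBuffer offHeap = allocate(32 * 1024, true);

        // The read/write interface is identical for both kinds of memory.
        onHeap.putLong(0, 42L);
        offHeap.putLong(0, 42L);

        System.out.println("on-heap direct?  " + onHeap.isDirect());
        System.out.println("off-heap direct? " + offHeap.isDirect());
        System.out.println("same value: " + (onHeap.getLong(0) == offHeap.getLong(0)));
    }
}
```

Direct buffers live outside the garbage-collected heap, which is what keeps the heap small in the off-heap setup.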
  
DATA SERIALIZATION
  
Custom de/serialization stack
•  Many alternatives for Java object serialization
  –  Dynamic: Kryo
  –  Schema-dependent: Apache Avro, Apache Thrift, Protobufs
•  But Flink has its own serialization stack
  –  Operating on serialized data requires knowledge of layout
  –  Control over layout can improve efficiency of operations
  –  Data types are known before execution
  
Rich & extensible type system
•  Serialization framework requires knowledge of types
•  Flink analyzes return types of functions
  –  Java: Reflection based type analyzer
  –  Scala: Compiler information + CodeGen via Macros
•  Rich type system
  –  Atomics: Primitives, Writables, Generic types, …
  –  Composites: Tuples, Pojos, CaseClasses
  –  Extensible by custom types
  
Serializing a Tuple3<Integer, Double, Person>
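
A minimal sketch of what serializing such a tuple field by field could look like. The Person fields (an int id and a String name), the length-prefixed string encoding, and all method names are assumptions for illustration; this is not Flink's actual binary layout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class TupleSerializationSketch {

    /** Hypothetical Person type; the fields are assumed for this example. */
    static final class Person {
        final int id;
        final String name;

        Person(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Writes the three tuple fields back to back: int, double, then the nested Person. */
    static byte[] serialize(int f0, double f1, Person f2) {
        byte[] name = f2.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer out = ByteBuffer.allocate(4 + 8 + 4 + 4 + name.length);
        out.putInt(f0);          // 4 bytes
        out.putDouble(f1);       // 8 bytes
        out.putInt(f2.id);       // 4 bytes
        out.putInt(name.length); // 4 bytes length prefix
        out.put(name);           // variable-length payload
        return out.array();
    }

    /** Reads only the Person id by jumping over the fixed-length fields in front of it. */
    static int readPersonId(byte[] data) {
        return ByteBuffer.wrap(data).getInt(4 + 8);
    }

    public static void main(String[] args) {
        byte[] data = serialize(42, 3.14, new Person(7, "Bob"));
        System.out.println(data.length + " bytes, person id = " + readPersonId(data));
    }
}
```

Because the layout is known, readPersonId can pick one field out of the bytes without deserializing the whole tuple, which is the kind of operation a known layout makes possible.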
  
OPERATING ON BINARY DATA
  
Data processing algorithms
•  Flink’s algorithms are based on RDBMS technology
  –  External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
•  Algorithms receive a budget of memory segments
  –  Automatic decision about budget size
  –  No fine-tuning of operator memory!
•  Operate in-memory as long as data fits into budget
  –  And gracefully spill to disk if data exceeds memory
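
The "budget, then spill" behaviour can be sketched as a small external merge sort over ints. This is a simplified illustration under stated assumptions: the budget is counted in ints rather than memory segments, runs go to temporary files, and the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class SpillingSortSketch {

    /** Sorts input with a buffer of at most `budget` ints (budget >= 1); overflow is spilled. */
    static int[] sort(int[] input, int budget) {
        try {
            List<Path> runs = new ArrayList<>();
            int[] buffer = new int[budget];
            int filled = 0;
            for (int value : input) {
                if (filled == budget) {      // budget exhausted: write a sorted run to disk
                    runs.add(spill(buffer, filled));
                    filled = 0;
                }
                buffer[filled++] = value;
            }
            if (runs.isEmpty()) {            // everything fit into the budget
                Arrays.sort(buffer, 0, filled);
                return Arrays.copyOf(buffer, filled);
            }
            runs.add(spill(buffer, filled));
            return merge(runs, input.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts the filled part of the buffer and writes it to a temporary file. */
    private static Path spill(int[] buffer, int length) throws IOException {
        Arrays.sort(buffer, 0, length);
        Path run = Files.createTempFile("sort-run", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (int i = 0; i < length; i++) {
                out.writeInt(buffer[i]);
            }
        }
        return run;
    }

    /** K-way merge of the sorted runs; heap entries are {value, run index}. */
    private static int[] merge(List<Path> runs, int total) throws IOException {
        DataInputStream[] in = new DataInputStream[runs.size()];
        int[] remaining = new int[runs.size()];
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < in.length; r++) {
            in[r] = new DataInputStream(new BufferedInputStream(Files.newInputStream(runs.get(r))));
            remaining[r] = (int) (Files.size(runs.get(r)) / 4);
            if (remaining[r]-- > 0) {
                heap.add(new int[] {in[r].readInt(), r});
            }
        }
        int[] out = new int[total];
        for (int i = 0; i < total; i++) {
            int[] head = heap.poll();
            out[i] = head[0];
            int r = head[1];
            if (remaining[r]-- > 0) {
                heap.add(new int[] {in[r].readInt(), r});
            }
        }
        for (int r = 0; r < in.length; r++) {
            in[r].close();
            Files.delete(runs.get(r));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[] {5, 3, 9, 1, 7, 2, 8}, 3)));
    }
}
```

With a budget of 3 the seven input values produce three runs on disk; with a budget of 7 or more the same call stays entirely in memory.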
  
In-memory sort – Fill the sort buffer

In-memory sort – Sort the buffer

In-memory sort – Read sorted buffer
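
The three steps (fill, sort, read) can be sketched for records with an Integer key as follows. The layout is an assumption pieced together from the pointer and sorting-key sizes mentioned elsewhere in this deck: one region holds the serialized records, a second holds fixed-length entries of a 4-byte sorting key plus a 4-byte pointer. The class SortBufferSketch is hypothetical and not Flink's sorter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortBufferSketch {
    private final ByteBuffer records; // serialized records: int key, int length, UTF-8 bytes
    private final long[] index;       // fixed-length entries: (sorting key << 32) | pointer
    private int count;

    SortBufferSketch(int capacityBytes, int maxRecords) {
        this.records = ByteBuffer.allocate(capacityBytes);
        this.index = new long[maxRecords];
    }

    /** Fill: appends the serialized record plus an index entry; false means the buffer is full. */
    boolean write(int key, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (count == index.length || records.remaining() < 8 + utf8.length) {
            return false;
        }
        int pointer = records.position();
        records.putInt(key).putInt(utf8.length).put(utf8);
        // Key in the upper 32 bits, pointer in the lower 32 bits: comparing the
        // longs orders entries by key without touching the record bytes.
        index[count++] = ((long) key << 32) | (pointer & 0xFFFFFFFFL);
        return true;
    }

    /** Sort: only the small index entries are moved, the records stay in place. */
    void sort() {
        Arrays.sort(index, 0, count);
    }

    /** Read: follows the pointers in sorted order and deserializes each record. */
    List<String> readSorted() {
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int pointer = (int) index[i]; // low 32 bits
            int key = records.getInt(pointer);
            int length = records.getInt(pointer + 4);
            byte[] utf8 = new byte[length];
            for (int j = 0; j < length; j++) {
                utf8[j] = records.get(pointer + 8 + j);
            }
            out.add(key + ":" + new String(utf8, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        SortBufferSketch buffer = new SortBufferSketch(1024, 16);
        buffer.write(3, "c");
        buffer.write(-1, "a");
        buffer.write(2, "b");
        buffer.sort();
        System.out.println(buffer.readSorted()); // prints [-1:a, 2:b, 3:c]
    }
}
```

Since the full Integer key fits into the index entry, the sort never has to deserialize a record; a key that only fits as a prefix would need a fallback comparison on ties.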
  
SHOW ME NUMBERS!
  
Sort benchmark
•  Task: Sort 10 million Tuple2<Integer, String> records
  –  String length 12 chars
    •  Tuple has 16 Bytes of raw data
    •  ~152 MB raw data
  –  Integers uniformly, Strings long-tail distributed
  –  Sort on Integer field and on String field
•  Generated input provided as mutable object iterator
•  Use JVM with 900 MB heap size
  –  Minimum size to reliably run the benchmark
  
Sorting methods
1.  Objects-on-Heap:
  –  Put cloned data objects in ArrayList and use Java’s Collection sort.
  –  ArrayList is initialized with right size.
2.  Flink-serialized (on-heap):
  –  Using Flink’s custom serializers.
  –  Integer with full binary sorting key, String with 8 byte prefix key.
3.  Kryo-serialized (on-heap):
  –  Serialize fields with Kryo.
  –  No binary sorting keys, objects are deserialized for comparison.
•  All implementations use a single thread
•  Average execution time of 10 runs reported
•  GC triggered between runs (does not go into reported time)
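
A sketch of how an 8-byte prefix key for Strings might work: the first 8 bytes are packed into a long and compared as an unsigned number, and only ties fall back to comparing the full strings. The encoding (UTF-8, zero padding) and the names are assumptions for illustration, and the ordering is only claimed for ASCII strings without NUL characters.

```java
import java.nio.charset.StandardCharsets;

final class PrefixKeySketch {

    /** Packs the first 8 bytes of the UTF-8 encoding into a long, zero-padded on the right. */
    static long prefixKey(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        long key = 0;
        for (int i = 0; i < 8; i++) {
            key <<= 8;
            if (i < utf8.length) {
                key |= (utf8[i] & 0xFF);
            }
        }
        return key;
    }

    /** Compares by prefix key first; equal prefixes need the full (deserialized) strings. */
    static int compare(String a, String b) {
        int byPrefix = Long.compareUnsigned(prefixKey(a), prefixKey(b));
        return byPrefix != 0 ? byPrefix : a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(prefixKey("a")));
        System.out.println(compare("apple", "apricot") < 0);        // decided by the prefix
        System.out.println(compare("abcdefgh1", "abcdefgh2") < 0);  // prefix tie, full comparison
    }
}
```

For the 12-character strings of this benchmark, an 8-byte prefix settles a comparison whenever two strings differ within their first 8 characters.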
  
Execution time

Garbage collection and heap usage
(figure panels: Objects-on-heap, Flink-serialized)
  
Memory usage
•  Breakdown: Flink-serialized, Sort Integer
  –  4 bytes Integer
  –  12 bytes String
  –  4 bytes String length
  –  4 bytes pointer
  –  4 bytes Integer sorting key
  –  28 bytes * 10M records = 267 MB

                Object-on-heap    Flink-serialized    Kryo-serialized
Sort Integer    Approx. 700 MB    277 MB              266 MB
Sort String     Approx. 700 MB    315 MB              266 MB
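
The two size figures quoted in this deck (~152 MB of raw data and 267 MB for the serialized breakdown) can be checked with a few lines of arithmetic. The sketch assumes that MB means 2^20 bytes, which is what makes both numbers come out; the class name is hypothetical.

```java
final class MemoryArithmetic {

    /** Converts a byte count to megabytes of 2^20 bytes each. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long records = 10_000_000L;

        // Raw data: 4 bytes Integer + 12 bytes String per record.
        long rawBytes = records * (4 + 12);

        // Flink-serialized, Sort Integer: Integer + String + String length + pointer + sorting key.
        long serializedBytes = records * (4 + 12 + 4 + 4 + 4);

        System.out.println("raw data:   " + (long) toMiB(rawBytes) + " MB");
        System.out.println("serialized: " + (long) toMiB(serializedBytes) + " MB");
    }
}
```

The per-record cost grows from 16 to 28 bytes, so the serialized form carries 12 bytes of bookkeeping per record for the length, the pointer, and the sorting key.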
  
Going out-of-core
•  Single thread HashJoin with 4GB memory budget
•  Build side varies, Probe side 64GB
  
WHAT’S NEXT?
  
We’re not done yet!
•  Serialization layouts tailored towards operations
  –  More efficient operations on binary data
•  Table API provides full semantics for execution
  –  Use code generation to operate fully on binary data
•  …
  
Summary
•  Active memory management avoids OOMErrors
•  Highly efficient data serialization stack
  –  Facilitates operations on binary data
  –  Makes more data fit into memory
•  DBMS-style operators operate on binary data
  –  High performance in-memory processing
  –  Graceful destaging to disk if necessary
•  Read Flink’s blog:
  –  http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
  –  http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
  –  http://flink.apache.org/news/2015/09/16/off-heap-memory.html
  
http://flink.apache.org    @ApacheFlink
Apache Flink
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
Ad

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 

Recently uploaded (20)

ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 

Fabian Hueske – Juggling with Bits and Bytes

  • 1. Juggling with Bits and Bytes: How Apache Flink operates on binary data
      Fabian Hueske, fhueske@apache.org, @fhueske
  • 2. Big Data frameworks on JVMs
      • Many (open source) Big Data frameworks run on JVMs
          – Hadoop, Drill, Spark, Hive, Pig, and ...
          – Flink as well
      • Common challenge: How to organize data in-memory?
          – In-memory processing (sorting, joining, aggregating)
          – In-memory caching of intermediate results
      • Memory management of a system influences
          – Reliability
          – Resource efficiency, performance & performance predictability
          – Ease of configuration
  • 3. The straightforward approach
      Store and process data as objects on the heap
      • Put objects in an array and sort it
      A few notable drawbacks
      • Predicting memory consumption is hard
          – If you fail, an OutOfMemoryError will kill you!
      • High garbage collection overhead
          – Easily 50% of time spent on GC
      • Objects have considerable space overhead
          – At least 8 bytes for each (nested) object! (Depends on arch)
  • 5. Flink adopts DBMS technology
      • Allocates fixed number of memory segments upfront
      • Data objects are serialized into memory segments
      • DBMS-style algorithms work on binary representation
  • 6. Why is that good?
      • Memory-safe execution
          – Used and available memory segments are easy to count
          – No parameter tuning for reliable operations!
      • Efficient out-of-core algorithms
          – Memory segments can be efficiently written to disk
      • Reduced GC pressure
          – Memory segments are off-heap or never deallocated
          – Data objects are short-lived or reused
      • Space-efficient data representation
      • Efficient operations on binary data
  • 7. What does it cost?
      • Significant implementation investment
          – Using java.util.HashMap
            vs.
          – Implementing a spillable hash table backed by byte arrays and a custom serialization stack
      • Other systems use similar techniques
          – Apache Drill, Apache AsterixDB (incubating)
      • Apache Spark is evolving in a similar direction
  • 9. Memory segments
      • Unit of memory distribution in Flink
          – Fixed number allocated when worker starts
      • Backed by a regular byte array (default 32KB)
      • On-heap or off-heap allocation
      • R/W access through Java's efficient unsafe methods
      • Multiple memory segments can be logically concatenated to a larger chunk of memory
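The memory-segment idea from slide 9 can be illustrated with a minimal sketch. This is not Flink's actual class: `SegmentSketch` and `Pool` are hypothetical names, and `java.nio.ByteBuffer` is swapped in for the unsafe-based access the slide mentions. It only shows fixed-size segments that are allocated upfront, on-heap or off-heap, and handed out from a countable pool.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical, simplified model of a memory segment; not Flink's actual class. */
public class SegmentSketch {
    public static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, the default named on the slide

    private final ByteBuffer buffer;

    public SegmentSketch(boolean offHeap) {
        // allocate() is backed by a byte[] on the heap; allocateDirect() lives outside the heap.
        this.buffer = offHeap
                ? ByteBuffer.allocateDirect(SEGMENT_SIZE)
                : ByteBuffer.allocate(SEGMENT_SIZE);
    }

    // Absolute reads and writes at a byte offset inside the segment.
    public void putInt(int offset, int value) { buffer.putInt(offset, value); }
    public int getInt(int offset) { return buffer.getInt(offset); }
    public void putDouble(int offset, double value) { buffer.putDouble(offset, value); }
    public double getDouble(int offset) { return buffer.getDouble(offset); }

    /** Fixed pool: every segment is allocated upfront, so free memory is simply a count. */
    public static class Pool {
        private final ArrayDeque<SegmentSketch> free = new ArrayDeque<>();

        public Pool(int numSegments, boolean offHeap) {
            for (int i = 0; i < numSegments; i++) {
                free.add(new SegmentSketch(offHeap));
            }
        }

        /** Returns a free segment, or null when the budget is exhausted. */
        public SegmentSketch request() { return free.poll(); }

        public void release(SegmentSketch segment) { free.add(segment); }

        public int available() { return free.size(); }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(4, false);
        SegmentSketch segment = pool.request();
        segment.putInt(0, 42);
        segment.putDouble(4, 3.5);
        System.out.println(segment.getInt(0) + " " + segment.getDouble(4)
                + " free=" + pool.available());
        pool.release(segment);
    }
}
```

Because an exhausted pool answers with null rather than an OutOfMemoryError, a caller can decide to spill instead of failing, which is the memory-safety argument made on slide 6.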
  • 12. On-heap vs. Off-heap
      • No significant performance difference in micro-benchmarks
      • Garbage Collection
          – Smaller heap -> faster GC
      • Faster start-up time
          – A multi-GB JVM heap takes time to allocate
  • 14. Custom de/serialization stack
      • Many alternatives for Java object serialization
          – Dynamic: Kryo
          – Schema-dependent: Apache Avro, Apache Thrift, Protobufs
      • But Flink has its own serialization stack
          – Operating on serialized data requires knowledge of layout
          – Control over layout can improve efficiency of operations
          – Data types are known before execution
  • 15. Rich & extensible type system
      • Serialization framework requires knowledge of types
      • Flink analyzes return types of functions
          – Java: Reflection based type analyzer
          – Scala: Compiler information + CodeGen via Macros
      • Rich type system
          – Atomics: Primitives, Writables, Generic types, …
          – Composites: Tuples, Pojos, CaseClasses
          – Extensible by custom types
  • 16. Serializing a Tuple3<Integer, Double, Person>
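Only the title of slide 16 survives in this transcript, so the following is a sketch under stated assumptions rather than Flink's real layout. `Person` is assumed to be a POJO with an `int id` and a `String name`, and the fields are simply written back to back. It shows why knowing the layout matters: a field at a fixed offset can be read without deserializing the whole record.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical field-by-field layout for a Tuple3<Integer, Double, Person>. */
public class TupleLayoutSketch {

    /** Assumed POJO; the transcript does not show Person's fields. */
    public static class Person {
        public final int id;
        public final String name;

        public Person(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Assumed layout: [int f0][double f1][int person.id][int name length][name bytes]
    public static final int F0_OFFSET = 0;
    public static final int F1_OFFSET = 4;
    public static final int PERSON_ID_OFFSET = 12;
    public static final int NAME_LENGTH_OFFSET = 16;
    public static final int NAME_OFFSET = 20;

    public static byte[] serialize(int f0, double f1, Person f2) {
        byte[] name = f2.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer out = ByteBuffer.allocate(NAME_OFFSET + name.length);
        out.putInt(f0);          // 4 bytes
        out.putDouble(f1);       // 8 bytes
        out.putInt(f2.id);       // 4 bytes
        out.putInt(name.length); // 4 bytes
        out.put(name);           // variable length
        return out.array();
    }

    /** Reads only the Double field, skipping everything else in the record. */
    public static double readDoubleField(byte[] record) {
        return ByteBuffer.wrap(record).getDouble(F1_OFFSET);
    }

    /** Reads only the Person's name. */
    public static String readPersonName(byte[] record) {
        int length = ByteBuffer.wrap(record).getInt(NAME_LENGTH_OFFSET);
        return new String(record, NAME_OFFSET, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] record = serialize(7, 1.5, new Person(1, "Ann"));
        System.out.println(record.length + " bytes, f1=" + readDoubleField(record)
                + ", name=" + readPersonName(record));
    }
}
```

The fixed-width fields come first in this sketch so that their offsets are constants; the variable-length name goes last.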
  • 17. OPERATING ON BINARY DATA
  • 18. Data processing algorithms
      • Flink's algorithms are based on RDBMS technology
          – External Merge Sort, Hybrid Hash Join, Sort Merge Join, …
      • Algorithms receive a budget of memory segments
          – Automatic decision about budget size
          – No fine-tuning of operator memory!
      • Operate in-memory as long as data fits into budget
          – And gracefully spill to disk if data exceeds memory
  • 19. In-memory sort – Fill the sort buffer
  • 20. In-memory sort – Sort the buffer
  • 21. In-memory sort – Read sorted buffer
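Slides 19 to 21 keep only their titles here, so the three steps (fill, sort, read) are sketched below under an assumption: the buffer is split into a region of serialized records and a region of fixed-length (key, pointer) entries, which matches the 4-byte pointer and 4-byte sorting key listed on slide 27. `SortBufferSketch` is a hypothetical name and a plain `long[]` stands in for the index region.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sort buffer: serialized records plus fixed-length (key, pointer) entries. */
public class SortBufferSketch {
    private final ByteBuffer records; // region 1: serialized records, append-only
    private final long[] index;       // region 2: key in the upper 32 bits, pointer in the lower 32
    private int count = 0;

    public SortBufferSketch(int recordBytes, int maxRecords) {
        this.records = ByteBuffer.allocate(recordBytes);
        this.index = new long[maxRecords];
    }

    /** Step 1, fill: serializes one record; returns false when the buffer is full. */
    public boolean write(int key, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        if (count == index.length || records.remaining() < 8 + payload.length) {
            return false; // a real operator would sort and spill at this point
        }
        int pointer = records.position();
        records.putInt(key).putInt(payload.length).put(payload);
        index[count++] = ((long) key << 32) | pointer; // pointer is never negative
        return true;
    }

    /** Step 2, sort: only the fixed-length entries move, the records stay in place. */
    public void sort() {
        Arrays.sort(index, 0, count);
    }

    /** Step 3, read: follows the pointers in sorted order and deserializes each record. */
    public List<String> readSorted() {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int pointer = (int) index[i]; // the lower 32 bits hold the record offset
            int key = records.getInt(pointer);
            byte[] payload = new byte[records.getInt(pointer + 4)];
            ByteBuffer view = records.duplicate();
            view.position(pointer + 8);
            view.get(payload);
            result.add(key + ":" + new String(payload, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        SortBufferSketch buffer = new SortBufferSketch(1024, 16);
        buffer.write(3, "c");
        buffer.write(-1, "a");
        buffer.write(2, "b");
        buffer.sort();
        System.out.println(buffer.readSorted());
    }
}
```

Packing the signed key into the upper half of a `long` means that ordinary `long` ordering sorts by key first and by pointer second, so no record has to be deserialized during the sort.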
  • 23. Sort benchmark
      • Task: Sort 10 million Tuple2<Integer, String> records
          – String length 12 chars
              • Tuple has 16 bytes of raw data
              • ~152 MB raw data
          – Integers uniformly, Strings long-tail distributed
          – Sort on Integer field and on String field
      • Generated input provided as mutable object iterator
      • Use JVM with 900 MB heap size
          – Minimum size to reliably run the benchmark
  • 24. Sorting methods
      1. Objects-on-Heap:
          – Put cloned data objects in ArrayList and use Java's Collection sort.
          – ArrayList is initialized with right size.
      2. Flink-serialized (on-heap):
          – Using Flink's custom serializers.
          – Integer with full binary sorting key, String with 8 byte prefix key.
      3. Kryo-serialized (on-heap):
          – Serialize fields with Kryo.
          – No binary sorting keys, objects are deserialized for comparison.
      • All implementations use a single thread
      • Average execution time of 10 runs reported
      • GC triggered between runs (does not go into reported time)
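The binary sorting keys named on slide 24 can be sketched as follows. This is an illustration, not Flink's implementation: `SortKeySketch` and its methods are hypothetical names. An `int` becomes a 4-byte key whose unsigned byte order matches numeric order, and a `String` becomes an 8-byte prefix key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical binary sorting keys that can be compared byte by byte. */
public class SortKeySketch {

    /** Full key for an int: flip the sign bit, then store big-endian. */
    public static byte[] intKey(int value) {
        int flipped = value ^ Integer.MIN_VALUE; // negative values now sort before positive ones
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16), (byte) (flipped >>> 8), (byte) flipped
        };
    }

    /** Prefix key for a String: the first 8 encoded bytes, zero-padded if shorter. */
    public static byte[] stringPrefixKey(String value) {
        return Arrays.copyOf(value.getBytes(StandardCharsets.UTF_8), 8);
    }

    /** Compares two keys of equal length as unsigned bytes. */
    public static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return 0; // for a prefix key, equality only means the first 8 bytes match
    }

    public static void main(String[] args) {
        System.out.println(compareKeys(intKey(-5), intKey(3)));
        System.out.println(compareKeys(stringPrefixKey("apple"), stringPrefixKey("banana")));
    }
}
```

The integer key decides every comparison on its own. The string prefix key only decides comparisons where the first 8 bytes differ, so a tie would still require looking at the full strings.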
  • 26. Garbage collection and heap usage
      (figure with two panels: Objects-on-heap, Flink-serialized)
  • 27. Memory usage
      • Breakdown: Flink-serialized, Sort Integer
          – 4 bytes Integer
          – 12 bytes String
          – 4 bytes String length
          – 4 bytes pointer
          – 4 bytes Integer sorting key
          – 28 bytes * 10M records = 267 MB

      Sort field    Objects-on-heap    Flink-serialized    Kryo-serialized
      Integer       Approx. 700 MB     277 MB              266 MB
      String        Approx. 700 MB     315 MB              266 MB
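The sizes quoted on slides 23 and 27 can be reproduced with a short calculation, assuming that "MB" on the slides means 2^20 bytes and "10M" means 10^7 records:

```latex
\begin{align*}
\text{raw data:} \quad & 16\,\text{B} \times 10^{7} = 1.6 \times 10^{8}\,\text{B}, &
  \frac{1.6 \times 10^{8}}{2^{20}} &\approx 152.6\,\text{MB} \\
\text{Flink-serialized:} \quad & (4 + 12 + 4 + 4 + 4)\,\text{B} \times 10^{7} = 2.8 \times 10^{8}\,\text{B}, &
  \frac{2.8 \times 10^{8}}{2^{20}} &\approx 267.0\,\text{MB}
\end{align*}
```

Under that assumption the pointer and the sorting key account for 8 of the 28 bytes per record.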
  • 28. Going out-of-core
      • Single thread HashJoin with 4GB memory budget
      • Build side varies, Probe side 64GB
  • 30. We're not done yet!
      • Serialization layouts tailored towards operations
          – More efficient operations on binary data
      • Table API provides full semantics for execution
          – Use code generation to operate fully on binary data
      • …
  • 31. Summary
      • Active memory management avoids OOMErrors
      • Highly efficient data serialization stack
          – Facilitates operations on binary data
          – Makes more data fit into memory
      • DBMS-style operators operate on binary data
          – High performance in-memory processing
          – Graceful destaging to disk if necessary
      • Read Flink's blog:
          – http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
          – http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
          – http://flink.apache.org/news/2015/09/16/off-heap-memory.html
  • 32. http://flink.apache.org, @ApacheFlink, Apache Flink