SlideShare a Scribd company logo
(18)
GOBBLIN: UNIFYING DATA
INGESTION FOR HADOOP
Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha
Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil
Surlaker, Shirshanka Das, Chavdar Botev
DataAnalytics Infrastructure @ LinkedIn
(18)
Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Details
• Gobblin in Production @ LinkedIn
• FutureWork
• Q&A
1
(18)
Data Ingestion Challenges @ LinkedIn
2
BIG engineering and operational COST!
Data Sources DataTypes Operational Pain
(18)
Pre-Gobblin Era
3
OLTP
Tracking
Snapshot
and delta
file dumps
Kafka
Databus
Changes
Pipeline #1
External
Partner Data
Pipeline #2
REST
JDBC
SOAP
...
Pipeline #3
Pipeline #4
Pipeline #5
Pipeline #n
Databases
(Oracle/Espresso)
(18)
The Gobblin Era
4
OLTP
Tracking
Snapshot
and delta
file dumps
Kafka
Databus
Changes
External
Partner Data
REST
JDBC
SOAP
...
Databases
(Oracle/Espresso)
(18)
Requirements
5
Multi-platform and
Scalability Support
Rich Source
Integration
Centralized State
Management
OperabilityExtensibility Self Service
(18)
Architecture Overview
6
Constructs for Building Ingestion Flows
WorkUnit /Task
Execution Runtime
Deployment Mode
state store
compaction
retention
mgmt.
monitoring
Standalone Hadoop MR Yarn
Source Extractor Converter
Qlty. Chker. Writer Publisher
Task Executor Task StateTracker
Job Launcher Job Scheduler
(18)
Case Study: Kafka Ingestion
7
KafkaAvroSource
KafkaAvroExtracto
r
WorkUnit 1
(Topic 1, Partition 1)
KafkaConverter
TimePartitioned
AvroWriter
Avro
/kafka/topic/hourly/yyyy/mm/dd/hh/*.avro
Compaction
/kafka/topic/daily/yyyy/mm/dd/*.avro
AuditCount
QualityChecker
KafkaAvroExtracto
r
WorkUnit 2
(Topic 1, Partition 2)
KafkaConverter
TimePartitioned
AvroWriter
Avro
AuditCount
QualityChecker
KafkaAvroExtracto
r
WorkUnit 3
(Topic 1, Partitions 1 & 2)
KafkaConverter
TimePartitioned
AvroWriter
Avro
AuditCount
QualityChecker
TimePartitioned
DataPublisher
(18)
Case Study: Database Ingestion
8
JdbcSource
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
/database/table/incremental/snapshot-ts/*.avro
Compaction
/database/table/full/snapshot-ts/*.avro
SchemaCompatibiliy
& Count Qlty. Chker
SnapshotDataPublisher
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
SchemaCompatibiliy
& Count Qlty. Chker
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
SchemaCompatibiliy
& Count Qlty. Chker
(18)
Case Study – Filtering Sensitive Data
9
Has Sensitive
Data?
no
Source
Extractor
WorkUnit
Converter and
Quality Checker
Fork and Branching
Writer
DataPublisher
Writer
Sensitive Data
Filtering Converter
yes
(18)
Data Quality Checking
10
Record-level
Policies
Writer
Task-level
Policies
Publisher
Quarantine
FailTask
Quality Checkers
- Per record or per task.
- Policy driven
- Composable
~ Schema compatibility
~ Audit check
~ Sensitive fields
~ Required fields
~ Unique key
(18)
State and Metadata Mgmt.
11
State Store
- Stores runtime metadata, e.g., checkpoints
(a.k.a. watermarks)
~ Carried over between job runs
- Default impl: serializes job/task states into
files, one per run.
- Allows other implementations that conform
to the interface to be plugged in.
State Store
job run #2
job run #3job run #1
SEP
2
SEP
3
SEP
2 SEP
3
EXAMPLE
(18)
Metrics / Events and Alerting
12
Kafka
MetricContext
Topic 1
MetricContext
Topic 2
MetricContext
Partition 1
MetricContext
Partition 2
MetricContext
20
12 8
6 6
Metric
Reporter
Event
ReporterMetrics / Events
Collection and Reporting
- Metrics for ingestion progress
~ supports tagging
~ real-time aggregation
- Events for major milestones
~ “fire-and-forget”
- Various built-in metric / event
reporters
(18)
Running Modes
13
Standalone
Runs in a single
JVM; tasks run in a
thread pool.
Scale-out with
MapReduce
Each job run launches
a MR job, using
mappers as containers
to run tasks.
Scale-out with
General
Distributed
Resource Manager
Supports long-running
continuous ingestion,
with better resource
utilization and SLA
guarantees.
YARN
*in progress
(18)
Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
– Internal sources  HDFS
• Kafka, MySQL, Dropbox, etc.
– External sources  HDFS
• Salesforce, GoogleAnalytics, S3, etc.
– HDFS  HDFS
• Closed member data purging
– Egress from HDFS (future work)
• Data volume
– Over a dozen data sources,
– thousands of datasets,
– tens ofTBs,
… daily.
14
(18)
FutureWork
• Gobblin onYarn (alpha-release)
• Real-time elastic ingestion
• Integration with
– Apache Sqoop: using Sqoop connectors
– Logstash: log ingestion
– Morphlines: using Morphline transformation
– Apache Spark
15
(18)
Conclusions
16
Pain of
maintaining
multiple
ingestion
pipelines
Gobblin to the
rescue!
Data quality
assurance and
centralized state
management
Gobblin in
production for a
wide range of
data sources
Continuous real-
time ingestion
(18)17
ACKNOWLEDGEMENT
Pradhan Cadabam
Shrikanth Shankar
Suvodeep Pyne
Ray Ortigas
Henry Cai
Kenneth Goodhope
Erik Krogen
(18)
Thanks.
18
Github https://github.com/linkedin/gobblin
Documentation https://github.com/linkedin/gobblin/wiki
User Group https://groups.google.com/forum/#!forum/gobblin-users

More Related Content

What's hot (20)

Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
DataWorks Summit
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
Sangjin Lee
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Flink Forward
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
confluent
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
DataWorks Summit
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
Carl Steinbach
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
HostedbyConfluent
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
DataWorks Summit
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
Toby Matejovsky
 
Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
confluent
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
HostedbyConfluent
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
DataStax Academy
 
Embracing Database Diversity with Kafka and Debezium
Embracing Database Diversity with Kafka and DebeziumEmbracing Database Diversity with Kafka and Debezium
Embracing Database Diversity with Kafka and Debezium
Frank Lyaruu
 
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
confluent
 
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos...
DataWorks Summit
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
Sangjin Lee
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Flink Forward
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
confluent
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
DataWorks Summit
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
Carl Steinbach
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
HostedbyConfluent
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
DataWorks Summit
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward
 
Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
confluent
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Flink Forward
 
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,...
HostedbyConfluent
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
DataStax Academy
 
Embracing Database Diversity with Kafka and Debezium
Embracing Database Diversity with Kafka and DebeziumEmbracing Database Diversity with Kafka and Debezium
Embracing Database Diversity with Kafka and Debezium
Frank Lyaruu
 
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...
confluent
 

Viewers also liked (20)

Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
skaluska
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
Vinod Nayal
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
Treasure Data, Inc.
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
Hari Shreedharan
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
Gobblin for Data Analytics
Gobblin for Data AnalyticsGobblin for Data Analytics
Gobblin for Data Analytics
Intel IT Center
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Meson: Heterogeneous Workflows with Spark at Netflix
Meson: Heterogeneous Workflows with Spark at NetflixMeson: Heterogeneous Workflows with Spark at Netflix
Meson: Heterogeneous Workflows with Spark at Netflix
Antony Arokiasamy
 
Internet of things ecosystem: The quest for value
Internet of things ecosystem: The quest for valueInternet of things ecosystem: The quest for value
Internet of things ecosystem: The quest for value
Deloitte United States
 
RabbitMQ Data Ingestion at Craft Conf
RabbitMQ Data Ingestion at Craft ConfRabbitMQ Data Ingestion at Craft Conf
RabbitMQ Data Ingestion at Craft Conf
Alvaro Videla
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
Cloudera, Inc.
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
Helena Edelson
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
David Martínez Rego
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Chris Fregly
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
Prakash Chockalingam
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Alex Zeltov
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
Jeff Patti
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Scala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentationScala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentation
Martin Odersky
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
skaluska
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
Vinod Nayal
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
Gobblin for Data Analytics
Gobblin for Data AnalyticsGobblin for Data Analytics
Gobblin for Data Analytics
Intel IT Center
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Meson: Heterogeneous Workflows with Spark at Netflix
Meson: Heterogeneous Workflows with Spark at NetflixMeson: Heterogeneous Workflows with Spark at Netflix
Meson: Heterogeneous Workflows with Spark at Netflix
Antony Arokiasamy
 
Internet of things ecosystem: The quest for value
Internet of things ecosystem: The quest for valueInternet of things ecosystem: The quest for value
Internet of things ecosystem: The quest for value
Deloitte United States
 
RabbitMQ Data Ingestion at Craft Conf
RabbitMQ Data Ingestion at Craft ConfRabbitMQ Data Ingestion at Craft Conf
RabbitMQ Data Ingestion at Craft Conf
Alvaro Videla
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
Cloudera, Inc.
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
Helena Edelson
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Chris Fregly
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Alex Zeltov
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
Jeff Patti
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Scala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentationScala - The Simple Parts, SFScala presentation
Scala - The Simple Parts, SFScala presentation
Martin Odersky
 
Ad

Similar to Gobblin: Unifying Data Ingestion for Hadoop (20)

How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
SingleStore
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
Li Gao
 
YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018
AlanCaldera
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
VMware Tanzu
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
Keith Kraus
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
Márton Kodok
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
Geode Meetup Apachecon
Geode Meetup ApacheconGeode Meetup Apachecon
Geode Meetup Apachecon
upthewaterspout
 
Introduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet UpIntroduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet Up
Stefan Papp
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 
Avast Premium Security 24.12.9725 + License Key Till 2050
Avast Premium Security 24.12.9725 + License Key Till 2050Avast Premium Security 24.12.9725 + License Key Till 2050
Avast Premium Security 24.12.9725 + License Key Till 2050
asfadnew
 
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
hyby22543
 
FastStone Capture 10.4 Crack + Serial Key [Latest]
FastStone Capture 10.4 Crack + Serial Key [Latest]FastStone Capture 10.4 Crack + Serial Key [Latest]
FastStone Capture 10.4 Crack + Serial Key [Latest]
hyby22543
 
EASEUS Partition Master 18.8 Crack + License Code [2025]
EASEUS Partition Master 18.8 Crack + License Code [2025]EASEUS Partition Master 18.8 Crack + License Code [2025]
EASEUS Partition Master 18.8 Crack + License Code [2025]
drewgye
 
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
MiniTool Partition Wizard Crack 12.8 + Serial Key DownloadMiniTool Partition Wizard Crack 12.8 + Serial Key Download
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
drewgye
 
4K Video Downloader Crack (2025) + License Key Free
4K Video Downloader Crack (2025) + License Key Free4K Video Downloader Crack (2025) + License Key Free
4K Video Downloader Crack (2025) + License Key Free
boyjake527
 
Capcut Pro Crack For PC Latest 2025 Full
Capcut Pro Crack For PC Latest 2025 FullCapcut Pro Crack For PC Latest 2025 Full
Capcut Pro Crack For PC Latest 2025 Full
mushtaqcheema932
 
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
mushtaqcheema932
 
minitool partition wizard crack 12.8 latest
minitool partition wizard crack 12.8 latestminitool partition wizard crack 12.8 latest
minitool partition wizard crack 12.8 latest
qaha7432
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
SingleStore
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
Li Gao
 
YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018
AlanCaldera
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
VMware Tanzu
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
Keith Kraus
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
Márton Kodok
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
Chester Chen
 
Introduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet UpIntroduction to Apache Flink at Vienna Meet Up
Introduction to Apache Flink at Vienna Meet Up
Stefan Papp
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 
Avast Premium Security 24.12.9725 + License Key Till 2050
Avast Premium Security 24.12.9725 + License Key Till 2050Avast Premium Security 24.12.9725 + License Key Till 2050
Avast Premium Security 24.12.9725 + License Key Till 2050
asfadnew
 
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
Serif Affinity Photo Crack 2.3.1.2217 + Serial Key [Latest]
hyby22543
 
FastStone Capture 10.4 Crack + Serial Key [Latest]
FastStone Capture 10.4 Crack + Serial Key [Latest]FastStone Capture 10.4 Crack + Serial Key [Latest]
FastStone Capture 10.4 Crack + Serial Key [Latest]
hyby22543
 
EASEUS Partition Master 18.8 Crack + License Code [2025]
EASEUS Partition Master 18.8 Crack + License Code [2025]EASEUS Partition Master 18.8 Crack + License Code [2025]
EASEUS Partition Master 18.8 Crack + License Code [2025]
drewgye
 
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
MiniTool Partition Wizard Crack 12.8 + Serial Key DownloadMiniTool Partition Wizard Crack 12.8 + Serial Key Download
MiniTool Partition Wizard Crack 12.8 + Serial Key Download
drewgye
 
4K Video Downloader Crack (2025) + License Key Free
4K Video Downloader Crack (2025) + License Key Free4K Video Downloader Crack (2025) + License Key Free
4K Video Downloader Crack (2025) + License Key Free
boyjake527
 
Capcut Pro Crack For PC Latest 2025 Full
Capcut Pro Crack For PC Latest 2025 FullCapcut Pro Crack For PC Latest 2025 Full
Capcut Pro Crack For PC Latest 2025 Full
mushtaqcheema932
 
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
Adobe Photoshop CC 26.3 Crack + Serial Key [Latest 2025]
mushtaqcheema932
 
minitool partition wizard crack 12.8 latest
minitool partition wizard crack 12.8 latestminitool partition wizard crack 12.8 latest
minitool partition wizard crack 12.8 latest
qaha7432
 
Ad

Recently uploaded (20)

Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud ManManaged Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptxSAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
Mining Presentation Online Courses for Student
Mining Presentation Online Courses for StudentMining Presentation Online Courses for Student
Mining Presentation Online Courses for Student
Rizki229625
 
Retort Instrumentation laboratory practi
Retort Instrumentation laboratory practiRetort Instrumentation laboratory practi
Retort Instrumentation laboratory practi
ADINDADYAHMUKHLASIN
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdfHypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdfAG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
Anass Nabil
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays
 
Philippine-Constitution-and-Law in hospitality
Philippine-Constitution-and-Law in hospitalityPhilippine-Constitution-and-Law in hospitality
Philippine-Constitution-and-Law in hospitality
kikomendoza006
 
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
Eddie Lee
 
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays
 
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Pause Travail 22 Hostiou Girard 12 juin 2025.pdfPause Travail 22 Hostiou Girard 12 juin 2025.pdf
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Institut de l'Elevage - Idele
 
1-2. Lab Introduction to Linux environment.ppt
1-2. Lab Introduction to Linux environment.ppt1-2. Lab Introduction to Linux environment.ppt
1-2. Lab Introduction to Linux environment.ppt
Wahajch
 
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays
 
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays
 
Managed Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud ManManaged Cloud services - Opsio Cloud Man
Managed Cloud services - Opsio Cloud Man
Opsio Cloud
 
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)
apidays
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptxSAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
Mining Presentation Online Courses for Student
Mining Presentation Online Courses for StudentMining Presentation Online Courses for Student
Mining Presentation Online Courses for Student
Rizki229625
 
Retort Instrumentation laboratory practi
Retort Instrumentation laboratory practiRetort Instrumentation laboratory practi
Retort Instrumentation laboratory practi
ADINDADYAHMUKHLASIN
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdfHypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...
apidays
 
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
2.5-DESPATCH-ORDINARY MAILS.pptxlminub7b7t6f7h7t6f6g7g6fg
mk1227103
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdfAG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdf
Anass Nabil
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...
apidays
 
Philippine-Constitution-and-Law in hospitality
Philippine-Constitution-and-Law in hospitalityPhilippine-Constitution-and-Law in hospitality
Philippine-Constitution-and-Law in hospitality
kikomendoza006
 
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
[Eddie Lee] Capstone Project - AI PM Bootcamp - DataFox.pdf
Eddie Lee
 
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays
 
1-2. Lab Introduction to Linux environment.ppt
1-2. Lab Introduction to Linux environment.ppt1-2. Lab Introduction to Linux environment.ppt
1-2. Lab Introduction to Linux environment.ppt
Wahajch
 
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays
 
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays
 

Gobblin: Unifying Data Ingestion for Hadoop

  • 1. (18) GOBBLIN: UNIFYING DATA INGESTION FOR HADOOP Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker, Shirshanka Das, Chavdar Botev DataAnalytics Infrastructure @ LinkedIn
  • 2. (18) Agenda • Why Gobblin? • Gobblin Overview • Case Studies • Gobblin in Details • Gobblin in Production @ LinkedIn • FutureWork • Q&A 1
  • 3. (18) Data Ingestion Challenges @ LinkedIn 2 BIG engineering and operational COST! Data Sources DataTypes Operational Pain
  • 4. (18) Pre-Gobblin Era 3 OLTP Tracking Snapshot and delta file dumps Kafka Databus Changes Pipeline #1 External Partner Data Pipeline #2 REST JDBC SOAP ... Pipeline #3 Pipeline #4 Pipeline #5 Pipeline #n Databases (Oracle/Espresso)
  • 5. (18) The Gobblin Era 4 OLTP Tracking Snapshot and delta file dumps Kafka Databus Changes External Partner Data REST JDBC SOAP ... Databases (Oracle/Espresso)
  • 6. (18) Requirements 5 Multi-platform and Scalability Support Rich Source Integration Centralized State Management OperabilityExtensibility Self Service
  • 7. (18) Architecture Overview 6 Constructs for Building Ingestion Flows WorkUnit /Task Execution Runtime Deployment Mode state store compaction retention mgmt. monitoring Standalone Hadoop MR Yarn Source Extractor Converter Qlty. Chker. Writer Publisher Task Executor Task StateTracker Job Launcher Job Scheduler
  • 8. (18) Case Study: Kafka Ingestion 7 KafkaAvroSource KafkaAvroExtracto r WorkUnit 1 (Topic 1, Partition 1) KafkaConverter TimePartitioned AvroWriter Avro /kafka/topic/hourly/yyyy/mm/dd/hh/*.avro Compaction /kafka/topic/daily/yyyy/mm/dd/*.avro AuditCount QualityChecker KafkaAvroExtracto r WorkUnit 2 (Topic 1, Partition 2) KafkaConverter TimePartitioned AvroWriter Avro AuditCount QualityChecker KafkaAvroExtracto r WorkUnit 3 (Topic 1, Partitions 1 & 2) KafkaConverter TimePartitioned AvroWriter Avro AuditCount QualityChecker TimePartitioned DataPublisher
  • 9. (18) Case Study: Database Ingestion 8 JdbcSource JdbcExtractor WorkUnit 1 [2015090512, 2015090514) ToAvroConverter SnapshotAvroWriter Row /database/table/incremental/snapshot-ts/*.avro Compaction /database/table/full/snapshot-ts/*.avro SchemaCompatibiliy & Count Qlty. Chker SnapshotDataPublisher JdbcExtractor WorkUnit 1 [2015090512, 2015090514) ToAvroConverter SnapshotAvroWriter Row SchemaCompatibiliy & Count Qlty. Chker JdbcExtractor WorkUnit 1 [2015090512, 2015090514) ToAvroConverter SnapshotAvroWriter Row SchemaCompatibiliy & Count Qlty. Chker
  • 10. (18) Case Study – Filtering Sensitive Data 9 Has Sensitive Data? no Source Extractor WorkUnit Converter and Quality Checker Fork and Branching Writer DataPublisher Writer Sensitive Data Filtering Converter yes
  • 11. (18) Data Quality Checking 10 Record-level Policies Writer Task-level Policies Publisher Quarantine FailTask Quality Checkers - Per record or per task. - Policy driven - Composable ~ Schema compatibility ~ Audit check ~ Sensitive fields ~ Required fields ~ Unique key
  • 12. (18) State and Metadata Mgmt. 11 State Store - Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks) ~ Carried over between job runs - Default impl: serializes job/task states into files, one per run. - Allows other implementations that conform to the interface to be plugged in. State Store job run #2 job run #3job run #1 SEP 2 SEP 3 SEP 2 SEP 3 EXAMPLE
  • 13. (18) Metrics / Events and Alerting 12 Kafka MetricContext Topic 1 MetricContext Topic 2 MetricContext Partition 1 MetricContext Partition 2 MetricContext 20 12 8 6 6 Metric Reporter Event ReporterMetrics / Events Collection and Reporting - Metrics for ingestion progress ~ supports tagging ~ real-time aggregation - Events for major milestones ~ “fire-and-forget” - Various built-in metric / event reporters
  • 14. (18) Running Modes 13 Standalone Runs in a single JVM; tasks run in a thread pool. Scale-out with MapReduce Each job run launches a MR job, using mappers as containers to run tasks. Scale-out with General Distributed Resource Manager Supports long-running continuous ingestion, with better resource utilization and SLA guarantees. YARN *in progress
  • 15. (18) Gobblin in Production @ LinkedIn • In production since 2014 • Usages – Internal sources  HDFS • Kafka, MySQL, Dropbox, etc. – External sources  HDFS • Salesforce, GoogleAnalytics, S3, etc. – HDFS  HDFS • Closed member data purging – Egress from HDFS (future work) • Data volume – Over a dozen data sources, – thousands of datasets, – tens ofTBs, … daily. 14
  • 16. (18) FutureWork • Gobblin onYarn (alpha-release) • Real-time elastic ingestion • Integration with – Apache Sqoop: using Sqoop connectors – Logstash: log ingestion – Morphlines: using Morphline transformation – Apache Spark 15
  • 17. (18) Conclusions 16 Pain of maintaining multiple ingestion pipelines Gobblin to the rescue! Data quality assurance and centralized state management Gobblin in production for a wide range of data sources Continuous real- time ingestion
  • 18. (18)17 ACKNOWLEDGEMENT Pradhan Cadabam Shrikanth Shankar Suvodeep Pyne Ray Ortigas Henry Cai Kenneth Goodhope Erik Krogen