Batch Message Listener capabilities of the Apache Kafka Connector - NeerajKumar1965
The document discusses using Apache Kafka's batch message listener capabilities with MuleSoft to load data into Teradata Vantage in parallel. It covers configuring the Kafka consumer, extracting the payload data, inserting it into the database table in parallel batches while handling errors, and logging problematic messages. A live demo compares the performance of single-message versus batch loading and shows how irreproducible database errors are handled.
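The demo itself is built with the Mule 4 Apache Kafka Connector, but the core loop is easy to picture outside MuleSoft. Below is a minimal, hedged sketch of the same idea in plain Java: poll a batch of records, insert them into a staging table with a JDBC batch, log and skip problematic messages, and commit offsets only after a successful insert. The topic name, consumer group, JDBC URL, and table are hypothetical placeholders, not taken from the presentation; parallelism would come from running several such consumers in the same group.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "teradata-loader");            // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "500");                 // cap the batch size per poll
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:teradata://vantage-host/DATABASE=demo", "user", "password")) { // hypothetical URL
            consumer.subscribe(Collections.singletonList("orders"));                      // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(5));
                if (batch.isEmpty()) continue;
                try (PreparedStatement ps = db.prepareStatement(
                        "INSERT INTO demo.orders_stage (payload) VALUES (?)")) {          // hypothetical table
                    for (ConsumerRecord<String, String> rec : batch) {
                        try {
                            ps.setString(1, rec.value());
                            ps.addBatch();                     // accumulate rows for one round trip
                        } catch (Exception bad) {
                            // log and skip the problematic message instead of failing the whole batch
                            System.err.printf("Skipping offset %d: %s%n", rec.offset(), bad.getMessage());
                        }
                    }
                    ps.executeBatch();                         // one batched insert per poll
                }
                consumer.commitSync();                         // commit offsets only after a successful insert
            }
        }
    }
}
```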
https://github.com/AndersonChoi/tacademy-kafka
Apache Kafka is a distributed messaging system specialized for high-volume, real-time log processing, built to handle large-scale streaming message data quickly. This lecture introduces basic concepts such as Kafka producers and consumers, and walks through installation and a basic application development exercise to show the core role Kafka plays in a big data pipeline.
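For readers new to the producer/consumer model mentioned above, here is a minimal, hedged producer sketch in Java. The broker address and topic name (test-topic) are placeholders; the lecture's own sample code lives in the linked GitHub repository.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send a single record; the broker appends it to one partition of the topic.
            producer.send(new ProducerRecord<>("test-topic", "key-1", "hello kafka"));
        }
    }
}
```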
This document provides an introduction to Node.js and Mongoose. It explains that Node.js is a JavaScript runtime built on Chrome's V8 engine for building fast and scalable network applications. It then summarizes key aspects of Node.js such as its architecture, core modules, use of packages, and creating simple modules. It also introduces Express as a web framework and Mongoose as an ORM for MongoDB, summarizing their basic usage and schemas.
Building an Event Streaming Architecture with Apache Pulsar - ScyllaDB
What is Apache Pulsar? How does it differ from other event streaming technologies available? StreamNative Developer Advocate Tim Spann will walk you through the features and architecture of this increasingly popular event streaming system, along with best practices for streaming and storing your data.
The document discusses synchronous and asynchronous threads and blocking and nonblocking I/O. Synchronous threads pause until child threads complete, while asynchronous threads allow parent and child threads to run simultaneously. Blocking I/O pauses process execution until the system call completes, while nonblocking I/O allows the process to continue running.
Linux High Availability provides concise summaries of key concepts:
- High availability (HA) clustering allows services to take over work from others that go down, through IP and service takeover. It is designed for uptime, not performance or load balancing.
- Downtime is expensive for businesses due to lost revenue and customer dissatisfaction. Even a 99.9% uptime rate still leaves several hours of downtime per year.
- To achieve high availability, systems must be designed with simplicity, failure preparation, and reliability testing in mind. Complexity often undermines reliability.
- Myths exist around technologies like virtualization and live migration providing complete high availability solutions. True HA requires eliminating all single points of failure.
This document discusses several myths about AWS RDS for MySQL databases. It summarizes key features of RDS including ease of deployment and maintenance, high availability, auto-tuning, and security. It then addresses common myths around cost-effectiveness, zero downtime failovers, auto-tuning capabilities, performance claims of being 5x faster, and security responsibilities when using RDS.
Both Apache Pulsar and Apache Flink share a similar view of how the data and computation layers of an application can be "streaming-first", with batch as a special case of streaming. With Apache Pulsar's segmented-stream storage and Apache Flink's steps toward unifying batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community gives an overview of Apache Pulsar and how it provides a unified data view that fully leverages Apache Flink's unified computation runtime for elastic data processing. He shares the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
Application Continuity with Oracle DB 12c Léopold Gault
Application Continuity is a feature of Oracle Database 12c, used through the JDBC replay driver by Java applications. You can benefit from this feature when using RAC or Data Guard. These are my personal notes on the subject. Views expressed here are my own and do not necessarily reflect the views of Oracle.
This document provides an overview of ASP.NET SignalR, a library for building real-time web functionality. It discusses traditional web application approaches using request-response, defines what "real-time" means in terms of pushing data from server to client. It introduces SignalR as a library that uses push technology to provide persistent connections and real-time functionality. It also covers SignalR's transport techniques including websockets, server-sent events, forever frames, and long polling, as well as the types of connections in SignalR including persistent connections and hubs.
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Apache Web Server Architecture - Chaitanya Kulkarni (webhostingguy)
Apache Web Server is an open-source web server software widely used on the internet. It has a modular architecture with core components that handle basic functions and additional modules that extend functionality. Apache supports concurrency through persistent processes that handle requests independently in separate address spaces to improve performance on busy websites. The Apache license allows derived open-source and closed-source software.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
- REST (Representational State Transfer) uses HTTP requests to transfer representations of resources between clients and servers. The format of the representation is determined by the content-type header and the interaction with the resource is determined by the HTTP verb used.
- The four main HTTP verbs are GET, PUT, DELETE, and POST. GET retrieves a representation of the resource and is safe, while PUT, DELETE, and POST can modify the resource's state in atomic operations.
- Resources are abstract concepts acted upon by HTTP requests, while representations are the actual data transmitted in responses. The representation may or may not accurately reflect the resource's current state.
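To make the verb semantics above concrete, here is a small, hedged sketch using Java's built-in HttpClient (Java 11+). The https://api.example.com/widgets endpoint is purely hypothetical; the point is only which verb performs which interaction with a resource.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestVerbs {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "https://api.example.com/widgets";   // hypothetical resource collection

        // GET: safe; retrieves a representation of widget 42
        HttpRequest get = HttpRequest.newBuilder(URI.create(base + "/42")).GET().build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());

        // PUT: replaces the state of widget 42 with the supplied representation
        HttpRequest put = HttpRequest.newBuilder(URI.create(base + "/42"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"name\":\"bolt\"}"))
                .build();
        client.send(put, HttpResponse.BodyHandlers.discarding());

        // POST: asks the server to create a new resource under the collection
        HttpRequest post = HttpRequest.newBuilder(URI.create(base))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"nut\"}"))
                .build();
        client.send(post, HttpResponse.BodyHandlers.discarding());

        // DELETE: removes the resource identified by the URI
        HttpRequest del = HttpRequest.newBuilder(URI.create(base + "/42")).DELETE().build();
        client.send(del, HttpResponse.BodyHandlers.discarding());
    }
}
```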
In this session we will go through a basic understanding of Grafana and its dashboards. We will learn about Grafana's major features and use cases, and will also compare Grafana with other such tools on the market.
WebHDFS and HttpFS are a common source of confusion. This slide set highlights the differences and similarities between these two web interfaces for accessing an HDFS cluster.
A reverse proxy sits in front of web servers and forwards client requests to those servers. Unlike a forward proxy, it hides the existence and characteristics of the origin servers, balances load across them to prevent overload, and speeds up responses through content compression and caching. This improves security, performance, and reliability, for example by protecting against attacks and hiding origin server IP addresses.
How to build a streaming Lakehouse with Flink, Kafka, and Hudi - Flink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by Ethan Guo & Kyle Weller
Distributed Counters in Cassandra (Cassandra Summit 2010) - kakugawa
The document discusses the design and implementation of distributed counters in Cassandra. It aims to provide low latency and high availability counters. The design relaxes consistency constraints by using commutative operations, partitioning work across replicas, and allowing idempotent repairs. Counters are implemented using a context format to track counts across replicas, with writes incrementing local counts and repairs retaining the highest counts seen. Eventual consistency is achieved through read repair and anti-entropy repairs that propagate counter states.
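The 2010 design described above eventually evolved into Cassandra's CQL counter columns. As a rough illustration of the commutative-increment idea, here is a hedged sketch using the DataStax Java driver (3.x); it assumes a keyspace analytics with a counter table page_views(url text PRIMARY KEY, hits counter), none of which comes from the talk itself.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PageViewCounter {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("analytics")) {   // hypothetical keyspace
            // Counter columns only support commutative increments/decrements,
            // which lets replicas apply updates in any order and still converge.
            session.execute(
                "UPDATE page_views SET hits = hits + 1 WHERE url = 'http://example.com'");
            Row row = session.execute(
                "SELECT hits FROM page_views WHERE url = 'http://example.com'").one();
            System.out.println("hits = " + row.getLong("hits"));
        }
    }
}
```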
Spring Boot + Kafka: the New Enterprise Platform - VMware Tanzu
This document discusses how Spring Boot and Kafka can form the basis of a new enterprise application platform focused on continuous delivery, event-driven architectures, and streaming data. It provides examples of companies that have successfully adopted this approach, such as Netflix transitioning to Spring Boot and a banking brand building a new core banking system using Spring Streams and Kafka. The document advocates an "event-first" and microservices-oriented mindset enabled by a streaming data platform and suggests that Spring Boot, Kafka, and related technologies provide a turnkey solution for implementing this new application development approach at large enterprises.
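As a flavor of what the "Spring Boot + Kafka" pairing looks like in code, here is a minimal, hedged sketch using Spring for Apache Kafka's @KafkaListener. The topic name, group id, and application class are hypothetical; broker settings would come from spring.kafka.* properties.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class PaymentsApplication {
    public static void main(String[] args) {
        SpringApplication.run(PaymentsApplication.class, args);
    }
}

@Component
class PaymentEventListener {
    // Spring Boot auto-configures the consumer from spring.kafka.* properties;
    // each message published to the topic is delivered to this method.
    @KafkaListener(topics = "payment-events", groupId = "core-banking")   // hypothetical topic/group
    public void onPayment(String payload) {
        System.out.println("received payment event: " + payload);
    }
}
```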
Apache Kafka Fundamentals for Architects, Admins and Developers - confluent
This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.
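The rewind-and-replay ability mentioned above comes from the fact that consumers control their own offsets against the commit log. Here is a hedged sketch of replaying one partition from the beginning with the plain Java consumer; the topic name and partition are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-demo");                     // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("clicks", 0);  // hypothetical topic/partition
            consumer.assign(List.of(p0));
            consumer.seekToBeginning(Collections.singleton(p0));  // rewind to the oldest retained offset
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", rec.offset(), rec.value());
            }
        }
    }
}
```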
SplunkLive! Getting Started with Splunk Enterprise - Splunk
The document provides an agenda and overview for a Splunk getting started user training workshop. The summary covers the key topics:
- Getting started with Splunk including downloading, installing, and starting Splunk
- Core Splunk functions like searching, field extraction, saved searches, alerts, reporting, dashboards
- Deployment options including universal forwarders, distributed search, and high availability
- Integrations with other systems for data input, user authentication, and data output
- Support resources like the Splunk community, documentation, and technical support
This document provides an overview and best practices for operating HBase clusters. It discusses HBase and Hadoop architecture, how to set up an HBase cluster including Zookeeper and region servers, high availability considerations, scaling the cluster, backup and restore processes, and operational best practices around hardware, disks, OS, automation, load balancing, upgrades, monitoring and alerting. It also includes a case study of a 110 node HBase cluster.
Beautiful Monitoring With Grafana and InfluxDB - leesjensen
Query your data streams with the time series database InfluxDB and then visualize the results with stunning Grafana dashboards. Quick and easy to set up. Fully scalable to millions of metrics per second.
How to use Impala query plan and profile to fix performance issues - Cloudera, Inc.
Apache Impala is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries, explains how to identify performance bottlenecks through the query plan and profile, and shows how to drive Impala to its full potential.
Architecture of a Kafka Camus infrastructure - mattlieber
This document summarizes the results of a performance evaluation of Kafka and Camus to ingest streaming data into Hadoop. It finds that Kafka can ingest data at rates from 15,000-50,000 messages per second depending on data format (Avro is fastest). Camus can move the data to HDFS at rates from 54,000-662,000 records per second. Once in HDFS, queries on Avro-formatted data are fastest, with count and max aggregation queries completing in under 100 seconds for 20 million records. The customer's goal of 5000 events per second can be easily achieved with this architecture.
This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
LinkedIn Segmentation & Targeting Platform: A Big Data Application - Amy W. Tang
This talk was given by Hien Luu (Senior Software Engineer at LinkedIn) and Siddharth Anand (Senior Staff Software Engineer at LinkedIn) at the Hadoop Summit (June 2013).
Espresso: LinkedIn's Distributed Data Serving Platform (Talk) - Amy W. Tang
This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn - Amy W. Tang
This talk was given by Bhaskar Ghosh (Senior Director of Engineering, LinkedIn Data Infrastructure), at the Yale Oct 2012 Symposium on Big Data, in honor of Martin Schultz.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
The document discusses LinkedIn's communication architecture and network updates service. It describes how LinkedIn built scalable communication platforms to support its large professional network. The system evolved from handling 0 to 22 million members. It uses Java, databases like Oracle and MySQL, application servers like Tomcat and Jetty, and technologies like ActiveMQ, Lucene and Spring. The communication service handles messages and email delivery while the network updates service distributes short-lived notifications across LinkedIn's various clients and services.
This document describes Databus, a system used at LinkedIn for distributed data replication and change data capture. Some key points:
- Databus provides timeline consistency across distributed data systems by applying a logical clock to data changes and using a pull-based model for replication.
- It addresses the challenges of specialization in distributed data systems through standardization, isolation of consumers from sources, and handling slow consumers without impacting fast ones.
- The architecture includes fetchers that extract changes from databases, a relay for buffering changes, log and snapshot stores, and client libraries that allow applications to consume changes.
- Performance is optimized through partitioning, filtering, and scaling of consumers independently of sources.
Building a Data Pipeline from Scratch - Joe Crobak (Hakka Labs)
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
What is a distributed data science pipeline? How, with Apache Spark and friends - Andy Petrella
What was a data product before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies like Spark Notebook, Spark, Scala, Mesos, and so on in an accompanying framework
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Apache Kafka is a distributed streaming platform that allows building event-driven architectures. It provides high throughput and low latency for processing streaming data. Key features include event logging, publish-subscribe messaging, and stream processing capabilities. Advantages include eventual consistency, scalability, fault tolerance, and being easier to maintain than traditional databases. It requires ZooKeeper, and the Java client API has undergone changes. Performance can be very high, with examples of LinkedIn processing 1.1 trillion messages per day and 2 million writes per second on modest hardware.
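As an illustration of the stream processing capability mentioned above, here is a hedged Kafka Streams word-count sketch; the application id and topic names (text-input, word-counts) are placeholders, not taken from the original talk.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");       // hypothetical input topic
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)                               // re-key by word
                .count();                                                   // continuously updated counts
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```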
Realtime streaming architecture in INFINARIO - Jozo Kovac
About our experience with real-time analyses on a never-ending stream of user events. We discuss the Lambda architecture, Kappa, Apache Kafka, and our own approach.
Non-interactive big-data analysis prohibits experimentation and can interrupt the analyst's train of thought, but analyzing and drawing insights in real time is no easy task, with jobs often taking minutes or hours to complete. What if you want to put an interactive interface in front of that data that allows iterative insights? What if you need that interactive experience to be sub-second?
Traditional SQL and most MPP/NoSQL databases cannot run complex calculations over large data in a performant manner. Popular distributed systems such as Hadoop or Spark can execute such jobs, but their job overhead prohibits sub-second response times. Learn how an in-memory computing framework enabled us to perform complex analysis jobs on massive data points with sub-second response times, allowing us to plug it into a simple, drag-and-drop web 2.0 interface.
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
All data accessible to all my organization - Presentation at OW2con'19, June... - OW2
This document discusses how Dremio provides a unified access point for data across an entire organization. It summarizes how Dremio allows various users, including data engineers, scientists, analysts and business users, to access all kinds of data sources through SQL or REST APIs. Dremio also enables features like data catalogs, collaborative workspaces, and workload monitoring that help organizations better manage and govern their data.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... - Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... - Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh - IanFurlong4
For organisations to successfully adopt data mesh, setting up and maintaining infrastructure needs to be easy.
We believe the best way to achieve this is to leverage the learnings from building a 'central nervous system', commonly used in modern data-streaming ecosystems. This approach formalises and automates the manual parts of building a data mesh.
This presentation introduces SpecMesh, a methodology and supporting developer toolkit that enables businesses to build the foundations of their data mesh.
Sparkling Water Webinar October 29th, 2014 - Sri Ambati
Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience.
H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O's NanoFast™ Scoring Engine. Learn more by going to http://www.h2o.ai and contact us for more information.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Advanced Analytics and Machine Learning with Data Virtualization - Denodo
Watch: https://bit.ly/2DYsUhD
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- How Prologis accelerated their use of Machine Learning with data virtualization
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
Overview of LinkedIn's data-driven products and infrastructure, presented on 26 Oct 2012 at the big-data symposium held in honor of the retirement of my PhD advisor, Dr. Martin H. Schultz.
- OData (Open Data Protocol) is a standard protocol for building and consuming RESTful APIs in a simple and standard way.
- It allows data from various sources to be unified and consumed using common HTTP methods like GET, PUT, POST, DELETE and common formats like JSON and AtomPub.
- Major companies like Microsoft use OData to expose data sources like reporting services through standardized APIs that can then be consumed by various clients and applications in a consistent manner.
The LOD Gateway: Open Source Infrastructure for Linked Data - David Newbury
Presented at the CIDOC conference in Mexico City, 2023, this talk provides a walkthrough of the digital infrastructure behind the LOD Gateway, a critical part of Getty's digital API infrastructure.
It discusses the differences between graphs and documents, and how both are important for different use cases.
The OECD Delta project – providing easier access to data through APIs - Jonathan Challener
The document discusses the OECD's efforts to make its data more open, accessible and reusable through various standards and formats over time. It details the OECD's experience with SDMX from 2007-2012, its work to simplify SDMX and develop JSON-based formats from 2012 onward, and the convergence of proposed formats into a single Simplified SDMX JSON format. The OECD launched an API in early 2013 to provide data in the new proposed JSON standards ahead of a workshop on statistical data dissemination.
Microsoft Graph is the rich, robust API for an increasing number of products across Microsoft. Microsoft Graph has a large footprint of tools, SDKs, and API capabilities you can incorporate in your projects. Come see what's new across products and available for developers -- you'll take away code and tools you'll undoubtedly use as you build apps and services.
Interactive Analytics at Scale in Apache Hive Using Druid - DataWorks Summit
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with an experimental evaluation highlighting the performant and powerful integration of these projects.
Speaker
Jesus Camacho Rodriguez, Hortonworks
Advanced Analytics and Machine Learning with Data Virtualization - Denodo
Watch full webinar here: https://bit.ly/32c6TnG
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
The document introduces OpenSocial, a set of common APIs for building social applications across different social networks. It discusses what OpenSocial is, why it is important, its technical details including JavaScript APIs and the Shindig container software. It provides an overview of OpenSocial and highlights some key partners working on it.
Better integrations through open interfaces - Steve Speicher
Steve Speicher presented on better integrations through open interfaces. He discussed how using open standards like OSLC and linked data allows tools to integrate using open protocols instead of tight coupling. This provides benefits like increased adoption rates, a focus on important aspects rather than integration details, and opportunities for innovation. Speicher also mentioned additional resources on the OSLC website and related projects.
Test trend analysis: Towards robust, reliable and timely tests - Hugh McCamphill
This document discusses test trend analysis and making tests more robust, reliable, and timely. It proposes collecting test results data and storing it in Elasticsearch. Visualizations would then be created using Kibana to analyze test failures, slow tests, error messages, and step times. This would provide insights and help identify issues to make tests less flaky.
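As a sketch of the "store test results in Elasticsearch" step, the snippet below posts one result document to an index over Elasticsearch's REST API using Java's built-in HttpClient. The index name test-results and the document fields are hypothetical choices; Kibana visualizations for failures, slow tests, and error messages would then be built on that index.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PublishTestResult {
    public static void main(String[] args) throws Exception {
        // One JSON document per test execution; fields chosen for later aggregation in Kibana.
        String doc = """
                {"suite":"checkout","test":"pays_with_card","status":"failed",
                 "durationMs":8423,"error":"TimeoutException","timestamp":"2016-06-01T10:15:00Z"}""";

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/test-results/_doc"))  // hypothetical index name
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(doc))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```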
Espresso: LinkedIn's Distributed Data Serving Platform (Paper) - Amy W. Tang
This paper, written by the LinkedIn Espresso Team, appeared at the ACM SIGMOD/PODS Conference (June 2013). To see the talk given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn), go here:
http://www.slideshare.net/amywtang/li-espresso-sigmodtalk
This document provides an overview of LinkedIn's data infrastructure. It discusses LinkedIn's large user base and data needs for products like profiles, communications, and recommendations. It describes LinkedIn's data ecosystem with three paradigms for online, nearline and offline data. It then summarizes key parts of LinkedIn's data infrastructure, including Databus for change data capture, Voldemort for distributed key-value storage, Kafka for messaging, and Espresso for distributed data storage. Overall, the document outlines how LinkedIn builds scalable data solutions to power its products and services for its large user base.
Slide 24 (Hadoop Summit 2013) - Hadoop data load (Camus):
- Open sourced: https://github.com/linkedin/camus
- One job loads all events
- ~10 minute ETA on average from producer to HDFS
- Hive registration done automatically
- Schema evolution handled transparently