Jeremy Custenborder from Confluent talked about how Kafka brings an event-centric approach to building streaming applications, and how to use Kafka Connect and Kafka Streams to build them.
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch... (confluent)
Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka’s ability to handle high throughput real-time data makes it a perfect fit for solving the data integration problem, acting as the common buffer for all your data and bridging the gap between streaming and batch systems.
However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems. One tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and has high operational overhead because all these different tools require ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration tools ecosystem has led to creative but misguided solutions, such as misusing stream processing frameworks for data integration purposes.
We describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Kafka Connect allows data ingestion into Kafka from external systems by using connectors. It provides scalability, fault tolerance, and strong delivery guarantees (at-least-once by default, with exactly-once achievable for some connectors). A connector's work is divided into tasks that run inside workers, which can be deployed in either standalone or distributed mode. The Schema Registry works with Kafka Connect to handle schema validation and evolution.
It covers a brief introduction to Apache Kafka Connect, giving insight into its benefits, use cases, and the motivation behind building Kafka Connect, along with a short discussion of its architecture.
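As a rough sketch of how a connector is typically deployed to a running Connect worker, the Java snippet below POSTs a connector configuration to the Connect REST API. The worker address uses the default port 8083, the connector name, file path, and topic are illustrative placeholders, and the FileStreamSource connector class ships with Apache Kafka; the text block requires Java 15+.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Illustrative connector config; file path and topic name are placeholders.
        String config = """
            {
              "name": "demo-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/demo.txt",
                "topic": "demo-topic"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // default Connect REST port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In standalone mode, the same key/value settings would instead go into a properties file passed to the connect-standalone script rather than through the REST API.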
Kafka Streams: What it is, and how to use it? (confluent)
Kafka Streams is a client library for building distributed applications that process streaming data stored in Apache Kafka. It provides a high-level streams DSL that allows developers to express streaming applications as a set of processing steps. Alternatively, developers can use the lower-level processor API to implement custom business logic. Kafka Streams handles concerns like fault tolerance, scalability, and state management. It represents data as streams for unbounded sequences of events and as tables for evolving state. Common operations include transformations, aggregations, joins, and table operations.
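For readers who have not seen the DSL, here is a minimal, self-contained sketch of what such a set of processing steps can look like in Java; the topic names, broker address, and serdes are assumptions for illustration, not taken from the talk.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("page-views");      // unbounded input stream
        events.filter((key, value) -> value != null && !value.isEmpty())    // stateless transformation
              .mapValues(value -> value.toUpperCase())
              .to("page-views-clean");                                      // write results back to Kafka

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```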
What's new in Confluent 3.2 and Apache Kafka 0.10.2 (confluent)
With the introduction of the Connect and Streams APIs in 2016, Apache Kafka is becoming the de facto solution for anyone looking to build a streaming platform. The community continues to add capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we'll cover what's new in Confluent Enterprise 3.2 that makes it possible to run Kafka at scale.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Guozhang Wang)
Spark Streaming makes it easy to build scalable, robust stream processing applications, but only once you've made your data accessible to the framework. Spark Streaming solves the real-time data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL (confluent)
Speaker: Robin Moffatt, Developer Advocate, Confluent
In this talk, we'll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API and KSQL. We'll stream data in from MySQL, transform it with KSQL and stream it out to Elasticsearch. Options for integrating databases with Kafka using CDC and Kafka Connect will be covered as well.
This is part 2 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/4cVXUQ2jCLgJNmg4kjCRqo?.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
Introduction to Apache Kafka and Confluent... and why they matter (confluent)
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company founded by the creators of Kafka), and explains why Kafka is an excellent and simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
This document discusses Apache Kafka and Confluent's Kafka Connect tool for large-scale streaming data integration. Kafka Connect allows importing and exporting data from Kafka to other systems like HDFS, databases, search indexes, and more using reusable connectors. Connectors use converters to handle serialization between data formats. The document outlines some existing connectors and upcoming improvements to Kafka Connect.
Andrew Stevenson from DataMountaineer presented on Kafka Connect. Kafka Connect is a common framework that facilitates data streams between Kafka and other systems. It handles delivery semantics, offset management, serialization/deserialization and other complex tasks, allowing users to focus on domain logic. Connectors can load and unload data from various systems like Cassandra, Elasticsearch, and MongoDB. Configuration files are used to deploy connectors with no code required.
Data integration patterns using Apache Kafka & Kafka Connect (or rather, an ETL implementation) (Keigo Suda)
This document discusses Apache Kafka and Kafka Connect. It provides an overview of Kafka Connect and how it can be used for ETL processes. Kafka Connect allows data to be exported from or imported to Kafka and integrated with other systems through customizable connectors. The document describes how to run Kafka Connect in standalone and distributed modes and highlights some popular connectors available for integrating Kafka with other data sources and sinks.
Utilizing Kafka Connect to Integrate Classic Monoliths into Modern Microservi... (HostedbyConfluent)
Having started with classic monolith applications in the late 90s and adopting a new microservice architecture in 2015, our organization needed a convenient, reliable, and low-cost way to push changes back and forth between them. One that preferably utilized technology already on hand and could exchange information between multiple data stores.
In this session we will explore how Kafka Connect and its various connectors satisfied this need. We will review the two disparate tech stacks we needed to integrate, and the strategies and connectors we used to exchange information. Finally, we will cover some enhancements we made to our own processes including integrating Kafka Connect and its connectors into our CI/CD pipeline and writing tools to monitor connectors in our production environment.
The Many Faces of Apache Kafka: Leveraging real-time data at scale (Neha Narkhede)
Since it was open sourced, Apache Kafka has been adopted very widely, from web companies like Uber, Netflix, and LinkedIn to more traditional enterprises like Cerner, Goldman Sachs, and Cisco. At these companies, Kafka is used in a variety of ways: as a pipeline for collecting high-volume log data for load into Hadoop, a means of collecting operational metrics to feed monitoring and alerting applications, for low-latency messaging use cases, and to power near-real-time stream processing.
Developing a custom Kafka connector? Make it shine! | Igor Buzatović, Porsche... (HostedbyConfluent)
Some people see their cars just as a means to get them from point A to point B without breaking down halfway, but most of us want it also to be comfortable, performant, easy to drive, and of course - to look good.
We can think of Kafka Connect connectors in a similar way. While the main focus is on getting data from or writing data to the external target system, it also matters how easy the connector is to configure, whether it scales well, whether it provides the best possible data consistency, whether it is resilient to both external-system and Kafka cluster failures, and so on. This talk focuses on the aspects of connector plugin development that are important for achieving these goals. More specifically, we'll cover configuration definition and validation, handling external source partitions and offsets, achieving the desired delivery semantics, and more.
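To make the configuration-definition point concrete, below is a hedged sketch of how a connector plugin might declare and validate its settings with Kafka Connect's ConfigDef API. The connector and its setting names are hypothetical, invented only for illustration, not taken from the talk.

```java
import java.util.Map;
import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;

public class MySourceConnectorConfig extends AbstractConfig {

    // Hypothetical settings for an imaginary connector; names are illustrative only.
    public static final String ENDPOINT_CONFIG = "source.endpoint";
    public static final String POLL_INTERVAL_MS_CONFIG = "poll.interval.ms";

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define(ENDPOINT_CONFIG, Type.STRING, Importance.HIGH,
                    "URL of the external system to read from.")
            .define(POLL_INTERVAL_MS_CONFIG, Type.LONG, 5000L,
                    ConfigDef.Range.atLeast(100L), Importance.MEDIUM,
                    "How often to poll the external system, in milliseconds.");

    public MySourceConnectorConfig(Map<String, String> originals) {
        super(CONFIG_DEF, originals);   // validates types, defaults, and ranges at startup
    }

    public String endpoint() {
        return getString(ENDPOINT_CONFIG);
    }

    public long pollIntervalMs() {
        return getLong(POLL_INTERVAL_MS_CONFIG);
    }
}
```

Declaring settings this way lets the Connect framework surface validation errors and documentation for the connector before any tasks are started.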
Monitoring Apache Kafka with Confluent Control Center (confluent)
Presentation by Nick Dearden, Director, Product and Engineering, Confluent
It’s 3 am. Do you know how your Kafka cluster is doing?
With over 150 metrics to think about, operating a Kafka cluster can be daunting, particularly as a deployment grows. Confluent Control Center is the only complete monitoring and administration product for Apache Kafka and is designed specifically to make the Kafka operator's life easier.
Join Confluent as we cover how Control Center is used to simplify deployment and operability and to ensure message delivery.
Watch the recording: https://www.confluent.io/online-talk/monitoring-and-alerting-apache-kafka-with-confluent-control-center/
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
The document discusses deploying Apache Kafka on DC/OS. It provides an overview of Kafka and why it is useful to deploy it on DC/OS. It outlines important considerations for deploying Kafka brokers and Zookeeper as stateful services on DC/OS, including using dedicated disks, placement constraints, and service discovery. The document warns of potential gotchas like broker restarts impacting catch up time and Kafka Streams fault tolerance.
In this presentation we describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017 (Michael Noll)
The document summarizes a presentation on Apache Kafka's Streams API given in Munich, Germany on January 25, 2017. The presentation introduced the Streams API, which allows users to build stream processing applications that run on client machines and integrate natively with Apache Kafka. Key features highlighted included the API's ability to perform both stateful and stateless computations, support for interactive queries, and guarantees of at-least-once processing. The roadmap for future Streams API development was also briefly outlined.
Introduction to Apache Kafka and why it matters - Madrid (Paolo Castagna)
This document provides an introduction to Apache Kafka and discusses why it is an important distributed streaming platform. It outlines how Kafka can be used to handle streaming data flows in a reliable and scalable way. It also describes the various Apache Kafka APIs including Kafka Connect, Streams API, and KSQL that allow organizations to integrate Kafka with other systems and build stream processing applications.
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen first hand the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
Power of the Log: LSM & Append Only Data Structures (confluent)
LSM trees provide an efficient way to structure databases by organizing data sequentially in logs. They optimize for write performance by batching writes together sequentially on disk. To optimize reads, data is organized into levels and bloom filters and caching are used to avoid searching every file. This log-structured approach works well for many systems by aligning with how hardware is optimized for sequential access. The immutability of appended data also simplifies concurrency. This log-centric approach can be applied beyond databases to distributed systems as well.
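As a toy illustration of the write path described above (not tied to any particular database), the sketch below buffers writes in a sorted in-memory table and "flushes" it as an immutable sorted segment; reads consult the memtable first and then segments from newest to oldest, where a real LSM store would also use Bloom filters and compaction to keep reads cheap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.TreeMap;

public class TinyLsm {
    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>();                  // sorted, in-memory writes
    private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>();  // immutable segments, newest first

    public TinyLsm(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    public void put(String key, String value) {
        memtable.put(key, value);                 // writes are buffered, never updated in place on "disk"
        if (memtable.size() >= memtableLimit) {
            segments.addFirst(memtable);          // "flush" as an immutable sorted segment
            memtable = new TreeMap<>();
        }
    }

    public Optional<String> get(String key) {
        String v = memtable.get(key);             // newest data first
        if (v != null) return Optional.of(v);
        for (TreeMap<String, String> segment : segments) {
            v = segment.get(key);                 // a real LSM would consult Bloom filters here
            if (v != null) return Optional.of(v);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TinyLsm store = new TinyLsm(2);
        store.put("a", "1");
        store.put("b", "2");   // triggers a flush
        store.put("a", "3");   // newer value shadows the flushed one
        System.out.println(store.get("a").orElse("missing"));  // prints 3
    }
}
```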
The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing and fault tolerance via state stores in Kafka.
KSQL is an open source streaming SQL engine for Apache Kafka. Come hear how KSQL makes it easy to get started with a wide-range of stream processing applications such as real-time ETL, sessionization, monitoring and alerting, or fraud detection. We'll cover both how to get started with KSQL and some under-the-hood details of how it all works.
PostgreSQL + Kafka: The Delight of Change Data Capture (Jeff Klukas)
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
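Downstream of a CDC pipeline like the one demoed, the change events land in ordinary Kafka topics, so any Kafka consumer can process them. Here is a minimal sketch; the topic name, group id, and broker address are assumptions, and the actual payload format depends on the CDC tool in use.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CdcEventReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cdc-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("postgres.public.users"));   // hypothetical CDC topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record represents one INSERT/UPDATE/DELETE, keyed by the row's primary key.
                    System.out.printf("key=%s change=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```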
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St... (Michael Noll)
My talk at Strata Data Conference, London, May 2017.
https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57619
Abstract:
Modern businesses have data at their core, but this data is changing continuously. How can you harness this torrent of information in real time? The answer: stream processing.
The core platform for streaming data is Apache Kafka, and thousands of companies are using Kafka to transform and reshape their industries, including Netflix, Uber, PayPal, Airbnb, Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: to succeed, many technologies need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we engineers would like to work and how we actually end up working in practice.
Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies, like high scalability, distributed computing, and fault tolerance. Michael also covers Kafka’s Streams API, its abstractions for streams and tables, and its recently introduced interactive queries functionality. Along the way, Michael shares common use cases that demonstrate that stream processing in practice often requires database-like functionality and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (for example, in the form of event-driven, containerized microservices). As you’ll see, Kafka makes such architectures equally viable for small-, medium-, and large-scale use cases.
Apache Kafka - Scalable Message-Processing and more! (Guido Schmutz)
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present the role of Apache Kafka in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Reducing Microservice Complexity with Kafka and Reactive Streams (jimriecken)
My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls add latency and increase the risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers (a minimal sketch of this idea follows the list below)
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
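Purely as an illustration of the back-pressure idea referenced in the list above (not the specific stack used in the talk), here is a small sketch using the JDK's built-in java.util.concurrent.Flow API, where the subscriber controls the pace by requesting one item at a time and a fast publisher blocks when the bounded buffer fills up.

```java
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        // SubmissionPublisher implements Flow.Publisher with a bounded buffer,
        // so a fast producer is slowed down when the subscriber lags behind.
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription subscription) {
                    this.subscription = subscription;
                    subscription.request(1);                 // pull one item at a time
                }
                @Override public void onNext(String item) {
                    System.out.println("processed " + item); // pretend this is slow work
                    subscription.request(1);                 // ask for the next item only when ready
                }
                @Override public void onError(Throwable t) { t.printStackTrace(); }
                @Override public void onComplete() { System.out.println("done"); }
            });

            for (int i = 0; i < 5; i++) {
                publisher.submit("event-" + i);              // blocks if the subscriber's buffer is full
            }
        }
        TimeUnit.SECONDS.sleep(1);                           // give the async subscriber time to drain
    }
}
```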
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban (Data Con LA)
Jay Kreps, open source visionary and co-founder of Confluent and several open source projects, will be visiting LA. I have asked him to come present at our group. He will present his vision and answer questions regarding Kafka and other projects.
Bio:
Jay is the co-founder and CEO of Confluent, a company built around real-time data streams and the open source messaging system Apache Kafka. He is the original author of several open source projects including Apache Kafka, Apache Samza, Voldemort, and Azkaban.
Keystone processes over 1 trillion events per day with at-least once processing semantics in the cloud. We will explore in detail how we have modified and leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data" ecosystems, the Confluent Data Platform, and Kafka in architecture.
1. Kafka is described as a "WAL (write-ahead logging) system" and "the global commit log thingy" that was used as part of LinkedIn's data pipeline architecture.
2. LinkedIn had an ad hoc approach to data pipelines between systems that became more complex over time, so they built pipelines using Kafka.
3. The Kafka ecosystem includes storage using Kafka brokers, publishing and subscribing using producers and consumers, and stream processing using tools like Kafka Streams and KSQL.
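To ground the publishing point from the summary above, here is a minimal Java producer; the topic name, key, and broker address are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always land in the same partition of the topic.
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:home"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("wrote to %s-%d@%d%n",
                                    metadata.topic(), metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any outstanding records
    }
}
```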
0-60: Tesla's Streaming Data Platform (Jesse Yates, Tesla), Kafka Summit SF 2019 (confluent)
Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.
This document summarizes an event-driven architecture presentation using Java. It discusses using Apache Kafka/Amazon Kinesis for messaging, Docker for containerization, Vert.x for reactive applications, Apache Camel/AWS Lambda for integration, and Google Protocol Buffers for data serialization. It covers infrastructure components, software frameworks, local and AWS deployment, and integration testing between Kinesis and Kafka. The presentation provides resources for code samples and Docker images discussed.
Building a company-wide data pipeline on Apache Kafka - engineering for 150 b... (LINE Corporation)
This document discusses LINE's use of Apache Kafka to build a company-wide data pipeline to handle 150 billion messages per day. LINE uses Kafka as a distributed streaming platform and message queue to reliably transmit events between internal systems. The author discusses LINE's architecture, metrics like 40PB of accumulated data, and engineering challenges like optimizing Kafka's performance through contributions to reduce latency. Building systems at this massive scale requires a focus on scalability, reliability, and leveraging open source technologies like Kafka while continuously improving performance.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka this talk will provide an update on the growth and status of the Kafka project community. Rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming apps. It was developed by LinkedIn in 2011 to solve problems with data integration and processing. Kafka uses a publish-subscribe messaging model and is designed to be fast, scalable, and durable. It allows both streaming and storage of data and acts as a central data backbone for large organizations.
Meetup #7 | Session 2 | 21/03/2018 | Taboola
In this talk, we will present our multi-DC Kafka architecture, and discuss how we tackle sending and handling 10B+ messages per day, with maximum availability and no tolerance for data loss.
Our architecture includes technologies such as Cassandra, Spark, HDFS, and Vertica - with Kafka as the backbone that feeds them all.
Apache Kafka is a high-throughput distributed messaging system that can be used for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging model and is designed as a distributed commit log. Kafka allows for both push and pull models where producers push data and consumers pull data from topics which are divided into partitions to allow for parallelism.
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.
Web Analytics using Kafka - August talk w/ Women Who Code (Purnima Kamath)
Purnima Kamath's presentation discusses using Apache Kafka for web analytics. It introduces Kafka as a distributed commit log that can throttle high volumes of event data from web servers to prevent request drop-offs. The presentation covers Kafka's publish-subscribe model using topics and partitions, how it guarantees ordering and allows for replays. It also demonstrates how Kafka Streams enables real-time extract, transform, load operations on streaming data and maintains application state in local stores. The demo shows a sample web analytics pipeline using Kafka to capture device, gender, browser and preference change events.
This document discusses streaming data architectures and technologies. It begins with defining streaming processing as processing data continuously as it arrives, rather than in batches. It then covers streaming architectures, scalable data ingestion technologies like Kafka and Flume, and real-time streaming processing systems like Storm, Samza and Spark Streaming. The document aims to provide an overview of building distributed streaming systems for processing high volumes of real-time data.
Sadayuki Furuhashi created Kumofs, a distributed key-value store, and MessagePack, a cross-language object serialization library. Kumofs is optimized for low latency with zero-hop reads and no single points of failure. It scales out linearly as servers are added without impacting applications. MessagePack is a compact binary format like JSON used for cross-language communication. MessagePack-RPC is a cross-language messaging library that uses an asynchronous, pipelined protocol over an event-driven I/O model.
- Sadayuki Furuhashi is a computer science student at the University of Tsukuba who created Kumofs, a distributed key-value store, and MessagePack, a cross-language communication library.
- Kumofs is optimized for low-latency and scalability through a consistent hashing algorithm and dynamic rebalancing without impacting applications. It has no single point of failure.
- MessagePack is a compact binary serialization format like JSON used for fast cross-language communication. MessagePack-RPC builds on this to enable asynchronous, pipelined RPC between languages.
How to purchase, license and subscribe to Microsoft Azure_PDF.pdfvictordsane
Microsoft Azure is a cloud platform that empowers businesses with scalable computing, data analytics, artificial intelligence, and cybersecurity capabilities.
Arguably the biggest hurdle for most organizations is understanding how to get started.
Microsoft Azure is a consumption-based cloud service. This means you pay for what you use. Unlike traditional software, Azure resources (e.g., VMs, databases, storage) are billed based on usage time, storage size, data transfer, or resource configurations.
There are three primary Azure purchasing models:
• Pay-As-You-Go (PAYG): Ideal for flexibility. Billed monthly based on actual usage.
• Azure Reserved Instances (RI): Commit to 1- or 3-year terms for predictable workloads. This model offers up to 72% cost savings.
• Enterprise Agreements (EA): Best suited for large organizations needing comprehensive Azure solutions and custom pricing.
Licensing Azure: What You Need to Know
Azure doesn’t follow the traditional “per seat” licensing model. Instead, you pay for:
• Compute Hours (e.g., Virtual Machines)
• Storage Used (e.g., Blob, File, Disk)
• Database Transactions
• Data Transfer (Outbound)
Purchasing and subscribing to Microsoft Azure is more than a transactional step, it’s a strategic move.
Get in touch with our team of licensing experts via [email protected] to further understand the purchasing paths, licensing options, and cost management tools, to optimize your investment.
The Future of Open Source Reporting Best Alternatives to Jaspersoft.pdfVarsha Nayak
In recent years, organizations have increasingly sought robust open source alternative to Jasper Reports as the landscape of open-source reporting tools rapidly evolves. While Jaspersoft has been a longstanding choice for generating complex business intelligence and analytics reports, factors such as licensing changes and growing demands for flexibility have prompted many businesses to explore other options. Among the most notable alternatives to Jaspersoft, Helical Insight stands out for its powerful open-source architecture, intuitive analytics, and dynamic dashboard capabilities. Designed to be both flexible and budget-friendly, Helical Insight empowers users with advanced features—such as in-memory reporting, extensive data source integration, and customizable visualizations—making it an ideal solution for organizations seeking a modern, scalable reporting platform. This article explores the future of open-source reporting and highlights why Helical Insight and other emerging tools are redefining the standards for business intelligence solutions.
The rise of e-commerce has redefined how retailers operate—and reconciliation...Prachi Desai
As payment flows grow more fragmented, the complexity of reconciliation and revenue recognition increases. The result? Mounting operational costs, silent revenue leakages, and avoidable financial risk.
Spot the inefficiencies. Automate what’s slowing you down.
https://www.taxilla.com/ecommerce-reconciliation
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps CyclesMarjukka Niinioja
Teams delivering API are challenges with:
- Connecting APIs to business strategy
- Measuring API success (audit & lifecycle metrics)
- Partner/Ecosystem onboarding
- Consistent documentation, security, and publishing
🧠 The big takeaway?
Many teams can build APIs. But few connect them to value, visibility, and long-term improvement.
That’s why the APIOps Cycles method helps teams:
📍 Start where the pain is (one “metro station” at a time)
📈 Scale success across strategy, platform, and operations
🛠 Use collaborative canvases to get buy-in and visibility
Want to try it and learn more?
- Follow APIOps Cycles in LinkedIn
- Visit the www.apiopscycles.com site
- Subscribe to email list
-
Generative Artificial Intelligence and its ApplicationsSandeepKS52
The exploration of generative AI begins with an overview of its fundamental concepts, highlighting how these technologies create new content and ideas by learning from existing data. Following this, the focus shifts to the processes involved in training and fine-tuning models, which are essential for enhancing their performance and ensuring they meet specific needs. Finally, the importance of responsible AI practices is emphasized, addressing ethical considerations and the impact of AI on society, which are crucial for developing systems that are not only effective but also beneficial and fair.
Eliminate the complexities of Event-Driven Architecture with Domain-Driven De...SheenBrisals
The distributed nature of modern applications and their architectures brings a great level of complexity to engineering teams. Though API contracts, asynchronous communication patterns, and event-driven architecture offer assistance, not all enterprise teams fully utilize them. While adopting cloud and modern technologies, teams are often hurried to produce outcomes without spending time in upfront thinking. This leads to building tangled applications and distributed monoliths. For those organizations, it is hard to recover from such costly mistakes.
In this talk, Sheen will explain how enterprises should decompose by starting at the organizational level, applying Domain-Driven Design, and distilling to a level where teams can operate within a boundary, ownership, and autonomy. He will provide organizational, team, and design patterns and practices to make the best use of event-driven architecture by understanding the types of events, event structure, and design choices to keep the domain model pure by guarding against corruption and complexity.
How Insurance Policy Administration Streamlines Policy Lifecycle for Agile Op...Insurance Tech Services
A modern Policy Administration System streamlines workflows and integrates with core systems to boost speed, accuracy, and customer satisfaction across the policy lifecycle. Visit https://www.damcogroup.com/insurance/policy-administration-systems for more details!
In a tight labor market and tighter economy, PMOs and resource managers must ensure that every team member is focused on the highest-value work. This session explores how AI reshapes resource planning and empowers organizations to forecast capacity, prevent burnout, and balance workloads more effectively, even with shrinking teams.
Bonk coin airdrop_ Everything You Need to Know.pdfHerond Labs
The Bonk airdrop, one of the largest in Solana’s history, distributed 50% of its total supply to community members, significantly boosting its popularity and Solana’s network activity. Below is everything you need to know about the Bonk coin airdrop, including its history, eligibility, how to claim tokens, risks, and current status.
https://blog.herond.org/bonk-coin-airdrop/
Agentic Techniques in Retrieval-Augmented Generation with Azure AI SearchMaxim Salnikov
Discover how Agentic Retrieval in Azure AI Search takes Retrieval-Augmented Generation (RAG) to the next level by intelligently breaking down complex queries, leveraging full conversation history, and executing parallel searches through a new LLM-powered query planner. This session introduces a cutting-edge approach that delivers significantly more accurate, relevant, and grounded answers—unlocking new capabilities for building smarter, more responsive generative AI applications.
Traditional Retrieval-Augmented Generation (RAG) pipelines work well for simple queries—but when users ask complex, multi-part questions or refer to previous conversation history, they often fall short. That’s where Agentic Retrieval comes in: a game-changing advancement in Azure AI Search that brings LLM-powered reasoning directly into the retrieval layer.
This session unveils how agentic techniques elevate your RAG-based applications by introducing intelligent query planning, subquery decomposition, parallel execution, and result merging—all orchestrated by a new Knowledge Agent. You’ll learn how this approach significantly boosts relevance, groundedness, and answer quality, especially for sophisticated enterprise use cases.
Key takeaways:
- Understand the evolution from keyword and vector search to agentic query orchestration
- See how full conversation context improves retrieval accuracy
- Explore measurable improvements in answer relevance and completeness (up to 40% gains!)
- Get hands-on guidance on integrating Agentic Retrieval with Azure AI Foundry and SDKs
- Discover how to build scalable, AI-first applications powered by this new paradigm
Whether you're building intelligent copilots, enterprise Q&A bots, or AI-driven search solutions, this session will equip you with the tools and patterns to push beyond traditional RAG.
Top 5 Task Management Software to Boost Productivity in 2025Orangescrum
In this blog, you’ll find a curated list of five powerful task management tools to watch in 2025. Each one is designed to help teams stay organized, improve collaboration, and consistently hit deadlines. We’ve included real-world use cases, key features, and data-driven insights to help you choose what fits your team best.
AI and Deep Learning with NVIDIA TechnologiesSandeepKS52
Artificial intelligence and deep learning are transforming various fields by enabling machines to learn from data and make decisions. Understanding how to prepare data effectively is crucial, as it lays the foundation for training models that can recognize patterns and improve over time. Once models are trained, the focus shifts to deployment, where these intelligent systems are integrated into real-world applications, allowing them to perform tasks and provide insights based on new information. This exploration of AI encompasses the entire process from initial concepts to practical implementation, highlighting the importance of each stage in creating effective and reliable AI solutions.
Integrating Survey123 and R&H Data Using FMESafe Software
West Virginia Department of Transportation (WVDOT) actively engages in several field data collection initiatives using Collector and Survey 123. A critical component for effective asset management and enhanced analytical capabilities is the integration of Geographic Information System (GIS) data with Linear Referencing System (LRS) data. Currently, RouteID and Measures are not captured in Survey 123. However, we can bridge this gap through FME Flow automation. When a survey is submitted through Survey 123 for ArcGIS Portal (10.8.1), it triggers FME Flow automation. This process uses a customized workbench that interacts with a modified version of Esri's Geometry to Measure API. The result is a JSON response that includes RouteID and Measures, which are then applied to the feature service record.
Build enterprise-ready applications using skills you already have!PhilMeredith3
Process Tempo is a rapid application development (RAD) environment that empowers data teams to create enterprise-ready applications using skills they already have.
With Process Tempo, data teams can craft beautiful, pixel-perfect applications the business will love.
Process Tempo combines features found in business intelligence tools, graphic design tools and workflow solutions - all in a single platform.
Process Tempo works with all major databases such as Databricks, Snowflake, Postgres and MySQL. It also works with leading graph database technologies such as Neo4j, Puppy Graph and Memgraph.
It is the perfect platform to accelerate the delivery of data-driven solutions.
For more information, you can find us at www.processtempo.com
Marketo & Dynamics can be Most Excellent to Each Other – The SequelBradBedford3
So you’ve built trust in your Marketo Engage-Dynamics integration—excellent. But now what?
This sequel picks up where our last adventure left off, offering a step-by-step guide to move from stable sync to strategic power moves. We’ll share real-world project examples that empower sales and marketing to work smarter and stay aligned.
If you’re ready to go beyond the basics and do truly most excellent stuff, this session is your guide.
Marketo & Dynamics can be Most Excellent to Each Other – The SequelBradBedford3
Confluent: Building a real-time streaming platform using Kafka Streams and Kafka Connect (20-minute version)
1. Building a real-time streaming platform using Kafka Connect + Kafka Streams
Jeremy Custenborder, Systems Engineer, Confluent
26. • Everything in the company is a real-time stream
• > 1.2 trillion messages written per day
• > 3.4 trillion messages read per day
• ~ 1 PB of stream data
• Thousands of engineers
• Tens of thousands of producer processes
#2: Hi, I’m Neha Narkhede…
There is a big paradigm shift happening around the world where companies are moving rapidly towards leveraging data in real-time and fundamentally moving away from batch-oriented computing. But how do you do that? Well that is what today’s talk is about. I’m going to summarize 6 years of work in 15 mins, so let’s get started.
#4: Unordered, unbounded, and large-scale datasets are increasingly common in day-to-day business. Stream data means different things for different businesses: for retail it might mean streams of orders and shipments, for finance streams of stock ticker data, and for web companies streams of user activity data. Stream data is everywhere. At the same time, there is a huge push towards getting faster results: doing instant credit card fraud detection, processing credit card payments instantly instead of only 5 times a day, and being able to detect and alert on a problem that causes retail sales to dip within seconds instead of a day later (you can only imagine what that would do to retail companies over Black Friday).
#5: So the takeaway is that businesses operate in real time, not in batch: if you go to a store to buy something, you don’t wait there for several hours to get it. So the data processing required to make key business decisions and to operate a business effectively should also happen in real time.
Here are some examples to support that claim…
#6: Event = something that happened. Different for different businesses.
#8: Log files are also event streams. For instance, every line in a log file is an event that in this case tells you how the service is being used.
#9: There is an inherent duality between tables and streams: traditional databases are all about tables full of state, but they are not designed to respond to the streams of events that modify those tables.
#10: Tables have rows that store the latest value for a unique key. But there is no notion of time.
#11: If you look at how a table gets constructed over time, you will notice that…
#12: The operations are actually a stream of events, where each event is just the operation that modifies the table.
Every database does this internally, and it is called a changelog.
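A minimal pure-Java sketch of that duality (keys and values are illustrative): replaying a changelog of upsert events materializes the table, and any further table mutation is itself just one more changelog event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a table is the latest value per key, and the ordered
// sequence of operations that produced it is the changelog.
public class ChangelogSketch {

    record Upsert(String key, String value) {}

    public static void main(String[] args) {
        List<Upsert> changelog = List.of(
                new Upsert("user-42", "viewed product A"),
                new Upsert("user-7",  "viewed product B"),
                new Upsert("user-42", "added product A to cart")); // overwrites earlier state

        // Replaying the stream of events materializes the table...
        Map<String, String> table = new HashMap<>();
        for (Upsert event : changelog) {
            table.put(event.key(), event.value());
        }

        // ...and any further table mutation is itself another changelog event.
        List<Upsert> newEvents = new ArrayList<>();
        Upsert checkout = new Upsert("user-42", "checked out");
        table.put(checkout.key(), checkout.value());
        newEvents.add(checkout);

        System.out.println(table);     // latest value per key, no notion of time
        System.out.println(newEvents); // the operations, in order
    }
}
```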
#13: So events are everywhere, what next? We need to fundamentally move to event-centric thinking. For a retail website, there are possibly various avenues that generate the “product view” event. A standard thing to do is to ensure that all product view data ends up in Hadoop so you can run analytics on user interest to power various business functions from marketing to product positioning and so on.
#15: The reality is about 100x more complex. In some corner, you are using some messaging system for app-to-app communication. You might have a custom way of loading data from various databases into Hadoop. But then more destinations appear over time, and now you have to feed the same data to a search system, various caches, etc.
This is a common reality and a simplified version.
300 services
~100 databases
Multi-datacenter
Trolling: load into Oracle, search, etc
#16: The core insight is that a data pipeline is also an event stream.
#17: What you need instead of that scary picture is a central streaming platform at the heart of a datacenter. A central nervous system that collects data from various sources and feeds all other systems and apps that need to consume and process data in real-time.
Why does this make sense?
#18: Why is a streaming platform needed? Because data sources and destinations add up over time. Initially you might have just the web app that produces the product view event and maybe you’ve only thought about analyzing it in Hadoop.
#19: But over time, the mobile app shows up and also produces the same data, and several more applications appear as destinations for search, recommendations, security, etc.
Event centric thinking involves building a forward-compatible architecture. You will never be able to foresee what future apps might show up that will need the same data. So capture it in a central, scalable streaming platform that asynchronously feeds downstream systems.
#20: So how do you build such a streaming platform?
#22: At a high level, Kafka is a pub-sub messaging system: producers capture events, events are sent to and stored on a central cluster of brokers, and consumers subscribe to topics, which are named categories of data. End to end, the producer-to-consumer data flow is real-time.
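A minimal sketch of that flow using the standard Kafka Java clients (the broker address, topic name, and event payload are placeholders): a producer appends an event to a topic, and a consumer subscribed to that topic reads it.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PubSubSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Producer: capture an event and append it to a topic on the broker cluster.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed product A"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "analytics");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Consumer: subscribe to the topic and read events as they arrive.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-views"));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```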
#23: Magic of Kafka is in the implementation. It is not just a pub-sub messaging system, it is a modern distributed platform…
How so?
#27: All that means, you can throw lots of data at Kafka and have it be made available throughout the company within milliseconds. At LinkedIn and several other companies, Kafka is deployed at a large scale…
#28: In the last 5 years since it was open-sourced, it has been widely adopted by 1000s of companies worldwide.
#29: So Kafka is the foundation of the central streaming platform.
#31: Infrastructure is really only as useful as the data it has. The next step in moving to a streaming-platform-based data architecture is solving the ETL problem.
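One concrete way to wire up that ETL step is Kafka Connect's REST API (distributed mode, default port 8083). The sketch below registers the FileStreamSource connector that ships with Kafka so that each new line of a log file becomes an event on a topic; the host, file path, topic, and connector name are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector config for the FileStreamSource connector bundled with Kafka:
        // it tails a file and writes each line to a topic as an event.
        String body = """
                {
                  "name": "demo-file-source",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/var/log/app/access.log",
                    "topic": "app-logs"
                  }
                }
                """;

        // POST the config to a Connect worker's REST API; the worker schedules
        // the connector's tasks across the cluster.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```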
#38: Doesn’t mean you drop everything on the floor if anything slows down
Streaming algorithms—online space
Can compute median
#39: About how inputs are translated into outputs (very fundamental)
#40: HTTP/REST
All databases
Run all the time
Each request totally independent—No real ordering
Can fail individual requests if you want
Very simple!
About the future!
#41: “Ed, the MapReduce job never finishes if you watch it like that”
Job kicks off at a certain time
Cron!
Processes all the input, produces all the output
Data is usually static
Hadoop!
DWH, JCL
Archaic but powerful. Can do analytics! Complex algorithms!
Also can be really efficient!
Inherently high latency
#42: Generalizes request/response and batch.
Program takes some inputs and produces some outputs
Could be all inputs
Could be one at a time
Runs continuously forever!
#43: For some time, stream processing was thought of as a faster map-reduce layer useful for faster analytics, requiring deployment of a central cluster much like Hadoop. But in my experience, I’ve learnt that the most compelling applications that do stream processing look much more like an event-driven microservice and less like a Hive query or Spark job.
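A minimal sketch of that event-driven-microservice style using the Kafka Streams DSL (topic names, the filtering logic, and the broker address are placeholders): the service is just an ordinary Java program that reads one topic, transforms it, and writes another, with no separate processing cluster to deploy.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PageViewService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // An event-driven microservice: read page-view events, keep only the
        // product views, normalize them, and write them to a downstream topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");
        views.filter((userId, page) -> page.startsWith("product/"))
             .mapValues(page -> page.toUpperCase())
             .to("product-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```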
#44: Companies == streams
What does a retail store do?
Streams
Retail
- Sales
- Shipments and logistics
- Pricing
- Re-ordering
- Analytics
- Fraud and theft
#50: Let’s dive into the real-time analytics and apps area
#53: There is only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and do it.
Mission: Build a Streaming Platform
Product: Confluent Platform