This presentation goes into detail on how and why Eventador created SQLStreamBuilder for easy streaming SQL—and the lessons learned along the way.
This presentation was given by Eventador CEO and Co-founder Kenny Gorman at Flink Forward Europe 2019.
Uber Business Metrics Generation and Management Through Apache Flink (Wenrui Meng)
Uber uses Apache Flink to generate and manage business metrics in real-time from raw streaming data sources. The system defines metrics using a domain-specific language and optimizes an execution plan to generate the metrics directly rather than first generating raw datasets. This avoids inefficiencies, inconsistencies, and wasted resources. The system provides a unified way to define metrics from multiple data sources and store results in various databases and warehouses.
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re... (Flink Forward)
The Apache Beam programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in the Google Cloud Dataflow runner and which path other Apache Beam runners like Apache Flink can follow to benefit from it.
Willump: Optimizing Feature Computation in ML Inference (Databricks)
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc... (Khai Tran)
Covers computation convergence problems, auto-generating Beam API code from Pig scripts, and convergence at LinkedIn under the AORA (Author Once Run Anywhere) principle.
Blog post:
https://engineering.linkedin.com/blog/2019/01/bridging-offline-and-nearline-computations-with-apache-calcite
Code example:
Pig script: https://gist.github.com/khaitranq/1d06c27832f15fa52a4a7e2fa7bec340
Beam autogen code: https://gist.github.com/khaitranq/785dbb8495cd382788f3ca8200231d8
The magic behind your Lyft ride prices: A case study on machine learning and ... (Karthik Murugesan)
Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.
Streaming Data from Cassandra into Kafka (Abrar Sheikh)
Yelp has built a robust stream processing ecosystem called Data Pipeline. As part of this system we created a Cassandra Source Connector, which streams data updates made to Cassandra into Kafka in real time. We use Cassandra CDC and leverage the stateful stream processing of Apache Flink to produce a Kafka stream containing the full content of each modified row, as well as its previous value.
https://www.datastax.com/accelerate/agenda?session=Streaming-Cassandra-into-Kafka
This document provides an overview of Airflow, an open-source workflow management platform for authoring, scheduling and monitoring data pipelines. It describes Airflow's key components including the web server, scheduler, workers and metadata database. It explains how Airflow works by parsing DAGs, instantiating tasks and changing their state as they are scheduled, queued, run and monitored. The document also covers concepts like DAGs, operators, dependencies, concurrency vs parallelism and advanced topics such as subDAGs, hooks, XCOM and branching workflows.
Flink Forward SF 2017: Chinmay Soman - Real Time Analytics in the real World ... (Flink Forward)
Advancements in stream processing and OLAP (Online Analytical Processing) technologies have enabled faster insights into the data coming in, thus powering near real time decisions. This talk focuses on how Uber uses real time analytics for solving complex problems such as Fraud detection, Operational intelligence, Intelligent Incentive spend and showcases the corresponding infrastructure that makes this possible. I will go over the key challenges involved in data ingestion, correctness and backfill. We will also go over enabling SQL and Flink to support real-time decision making for data science and analysts.
Building an analytics workflow using Apache Airflow (Yohei Onishi)
This document discusses using Apache Airflow to build an analytics workflow. It begins with an overview of Airflow and how it can be used to author workflows through Python code. Examples are shown of using Airflow to copy files between S3 buckets. The document then covers setting up a highly available Airflow cluster, implementing continuous integration/deployment, and monitoring workflows. It emphasizes that Google Cloud Composer can simplify deploying and managing Airflow clusters on Google Kubernetes Engine and integrating with other Google Cloud services.
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us... (Flink Forward)
Over 137 million members worldwide are enjoying TV series and feature films across a wide variety of genres and languages on Netflix, generating petabytes of user behavior data. At Netflix, our client logging platform collects and processes this data to power recommendations, personalization and many other services that enhance user experience. Built with Apache Flink, this platform processes hundreds of billions of events and a petabyte of data per day, at 2.5 million events/sec, with sub-millisecond latency. The processing involves a series of data transformations such as decryption and data enrichment of customer, geo and device information using microservices-based lookups.
The transformed and enriched data is further used by multiple data consumers for a variety of applications such as improving user-experience with A/B tests, tracking application performance metrics, tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config driven, centralized, managed platform, on top of Apache Flink, that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs and reduced operational overhead.
Stream processing at scale while ensuring that the production systems are scalable and cost-efficient brings interesting challenges. In this talk, we will share how we leverage Apache Flink to achieve this, the challenges we faced, and our learnings while running one of the largest Flink applications at Netflix.
This document summarizes new enhancements to the Java Streams API in Java 9, including the addition of takeWhile, dropWhile, ofNullable methods as well as performance improvements to the iterate method. It provides examples of how each new method works and why they are useful, such as takeWhile and dropWhile being potentially more efficient than filter in some cases. It also shows performance test results indicating that streams in Java 9 are faster than in Java 8. In addition, background information is given on streams, monads, and existing stream methods from Java 8 like filter, map, and collect.
In this talk, we describe the design and implementation of the Python Streaming API support that has been submitted for inclusion in mainline Flink. Python is one of the most popular programming languages for data analysis. Its readability emphasizes development productivity and, as a scripting language, it does not require compilation or a complex development environment setup. Flink already has support for Python APIs for batch programming; unfortunately, the mechanism used to support batch programs (i.e., the DataSet API) does not work for the Streaming API. We describe the limitations of the batch implementation and provide insights into how we solved this using Jython. We will walk through some example programs using the new Python API and compare programmability and performance with the Java and Scala streaming APIs.
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl... (Flink Forward)
In 2016, we introduced Alibaba's compute engine Blink, which was based on our private branch of Flink. It enabled many large scale applications in Alibaba's core business, such as search, recommendation and ads. With the deep and close collaboration with the Flink community, we are finally close to contributing our improvements back to the Flink community. In this talk, we will present our recent key contributions to the Flink runtime, such as the new YARN cluster mode for Flip-6, fine-grained failover for Flip-1, async i/o for Flip-12, incremental checkpoints, and Alibaba's plans for further improvements in the near future. Moreover, we will show some production use cases to illustrate how Flink works in Alibaba's large scale online applications, including real-time ETL as well as online machine learning. This talk is presented by Alibaba.
Unify Enterprise Data Processing System Platform Level Integration of Flink a... (Flink Forward)
In this talk, I will present how Flink enables enterprise customers to unify their data processing systems by using Flink to query Hive data.
Unification of streaming and batch is a main theme for Flink. Since 1.9.0, we have integrated Flink with Hive at the platform level. I will talk about:
- what features we have released so far, and what they enable our customers to do
- best practices to use Flink with Hive
- what is the latest development status of Flink-Hive integration at the time of Flink Forward Berlin (Oct 2019), and what to look for in the next release (probably 1.11)
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F... (Flink Forward)
This document summarizes recent improvements to Flink SQL and Table API by Blink, Alibaba's distribution of Flink. Key improvements include support for stream-stream joins, user-defined functions, table functions and aggregate functions, retractable streams, and over/group aggregates. Blink aims to make Flink work well at large scale for Alibaba's search and recommendation systems. Many of the improvements will be included in upcoming Flink releases.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
Apache Airflow is an open-source workflow management platform developed by Airbnb and now an Apache Software Foundation project. It allows users to define and manage data pipelines as directed acyclic graphs (DAGs) of tasks. The tasks can be operators to perform actions, move data between systems, and use sensors to monitor external systems. Airflow provides a rich web UI, CLI and integrations with databases, Hadoop, AWS and others. It is scalable, supports dynamic task generation and templates, alerting, retries, and distributed execution across clusters.
Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p... (Flink Forward)
Many stream processing applications can benefit from or need to rely on the prediction made with machine learning (ML) methods. In this presentation, new features of Apache Samoa are presented with a real data processing scenario. These features make Apache SAMOA fully accessible for Apache Flink users: (1) the data stream processed within Apache Flink is forwarded to Apache Samoa stream mining engine to perform predictions with stream-oriented ML models, (2) ML models evolve after every labelled instance and, at the same time, new predictions are sent back to Apache Flink. In both cases, Apache Kafka is used for data exchange. Hence, Apache Samoa is used as stream mining engine, provided with input data from, and sending predictions to Apache Flink. During the presentation, real life aspects are illustrated with code examples, such as input and prediction stream integration and monitoring latency of data processing and stream mining.
Flink Forward SF 2017: Joe Olson - Using Flink and Queryable State to Buffer ... (Flink Forward)
Flink's streaming API can be used to construct a scalable, fault tolerant framework for buffering high frequency time series data, with the goal being to output larger, immutable blocks of data. As the data is being buffered into larger blocks, Flink's queryable state feature can be used to service requests for data still in the "buffering" state. The high frequency time series data set in this example is electrocardiogram (EKG) data, buffered from a millisecond sample rate into multi-minute blocks.
Streaming your Lyft Ride Prices - Flink Forward SF 2019 (Thomas Weise)
At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real world changes and be fair to drivers (by, say, raising rates when there's a lot of demand) and fair to passengers (by, say, offering to return 10 minutes later for a cheaper rate). The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
This document provides an overview of Apache Airflow, an open-source workflow management system. It describes Airflow's key features like workflow definition using directed acyclic graphs (DAGs), rich UI, scheduler, operators for tasks like databases and web services, and use of Jinja templating. The document also discusses Airflow's architecture with parallel execution, UI, command line operations like backfilling, and security features. Airflow is used by over 200 companies for workflows like ETL, analytics, and machine learning pipelines.
This document discusses recommendations and machine learning at Netflix. It provides an overview of:
- How Netflix provides personalized recommendations on member homepages to help them find content to watch.
- Netflix's experimentation cycle of designing experiments, collecting data, generating features, training models, and doing A/B testing.
- How Netflix handles "facts" or input data for recommendations, including how facts change over time and how they are logged and stored at scale.
- The challenges of logging and accessing facts at Netflix's scale, and how they are addressing issues like deduplication, performance, and supporting different access patterns.
Real-Time Stream Processing with KSQL and Apache Kafka (confluent)
Real Time Stream Processing with KSQL and Kafka
David Peterson, Confluent APAC
APIdays Melbourne 2018
Unordered, unbounded and massive datasets are increasingly common in day-to-day business. Using this to your advantage is incredibly difficult with current system designs. We are stuck in a model where we can only take advantage of this *after* it has happened. Many times, this is too late to be useful in the enterprise.
KSQL is a streaming SQL engine for Apache Kafka. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. KSQL (like Kafka) is open-source, distributed, scalable, and reliable.
A real time Kafka platform moves your data up the stack, closer to the heart of your business, allowing you to build scalable, mission-critical services by quickly deploying SQL-like queries in a serverless pattern.
This talk will highlight key use cases for real time data and stream processing with KSQL: real time analytics, security and anomaly detection, real time ETL / data integration, Internet of Things, application development, and deploying Machine Learning models with KSQL.
Real time data and stream processing means that Kafka is just as important to the disrupted as it is to the disruptors.
Advanced Data Science with Apache Spark (Reza Zadeh, Stanford) (Spark Summit)
The document discusses graph computations and Pregel, an API for graph processing. It introduces Pregel's vertex-centric programming model where computation is organized into supersteps and depends only on neighboring vertices. Examples like PageRank are shown implemented in Pregel. GraphX is also introduced as a library providing Pregel-like abstractions on Spark. The document then discusses distributing matrix computations, covering partitioning schemes for matrices and how to distribute operations like multiplication and singular value decomposition (SVD) across a cluster.
Scaling up Uber's real time data analytics (Xiang Fu)
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
In the session, we discussed the end-to-end working of Apache Airflow, focusing on the "why, what and how" factors. It covers DAG creation/implementation, architecture, and pros & cons. It also covers how a DAG is created to schedule a job and the steps required to create a DAG using a Python script, finishing with a working demo.
Introduction to SQLStreamBuilder: Rich Streaming SQL Interface for Creating a... (Eventador)
Discover how SQLStreamBuilder enables you to run streaming SQL against unbounded streams of data and create new, persistent streaming jobs.
https://eventador.io/sql-streambuilder/
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019 (Thomas Weise)
Apache Beam is a unified programming model for batch and streaming data processing that provides portability across distributed processing backends. It aims to support multiple languages like Java, Python and Go. The Beam Python SDK allows writing pipelines in Python that can run on distributed backends like Apache Flink. Lyft developed a Python SDK runner for Flink that translates Python pipelines to native Flink APIs using the Beam Fn API for communication between the SDK and runner. Future work includes improving performance of Python pipelines on JVM runners and supporting multiple languages in a single pipeline.
Flink Forward San Francisco 2018: Jinkui Shi and Radu Tudoran "Flink real-t... (Flink Forward)
CloudStream is a fully managed service in Huawei Cloud. It supports several features, such as on-demand billing, easy-to-use Stream SQL in an online SQL editor, testing Stream SQL in real time, multi-tenancy, security isolation and so on. We chose Apache Flink as the streaming compute platform. Inside a CloudStream cluster, Flink jobs can run on Yarn, Mesos or Kubernetes. We have also extended Apache Flink to meet IoT scenario needs. There has been specialized testing of Flink reliability in cooperation with colleges. Finally, we continuously improve the infrastructure around CloudStream, including open source projects and cloud services. CloudStream is different from any other real-time analysis cloud service, and its development process can also be shared at the level of architecture and principles.
LivePerson uses CouchBase for real-time analytics of visitor data to provide visibility to customers on their online visitors. Previously, visitor state was stored in memory on stateful web servers, limiting scalability. CouchBase was chosen for its performance, resilience, linear scalability, schema flexibility, and ability to handle LivePerson's high throughput of over 1 million concurrent visitors and 100k operations per second. It is used to store visitor documents containing events and is queried to return relevant visitors to agents. Cross data center replication is also used to improve resilience. LivePerson has found CouchBase easy to develop on and has expanded its use to additional cases like session state and caching.
XStream: stream processing platform at Facebook (Aniket Mokashi)
XStream is Facebook's unified stream processing platform that provides a fully managed stream processing service. It was built using the Stylus C++ stream processing framework and uses a common SQL dialect called CoreSQL. XStream employs an interpretive execution model using the new Velox vectorized SQL evaluation engine for high performance. This provides a consistent and high efficiency stream processing platform to support diverse real-time use cases at planetary scale for Facebook.
Fast federated SQL with Apache Calcite (Chris Baynes)
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
OpenLineage for Stream Processing | Kafka Summit London (HostedbyConfluent)
"OpenLineage is an open platform for the collection and analysis of data lineage, which includes an open standard for lineage data collection, integration libraries for the most common tools, and a metadata repository/reference implementation (Marquez).
In recent months, stream processing, which is an important use case for Apache Kafka, has gained the particular focus of the OpenLineage community with many useful features completed or begun, including:
* A seamless OpenLineage & Apache Flink integration,
* Support for streaming jobs in Marquez,
* Progress on a built-in lineage API within the Flink codebase.
Cross-platform lineage allows for a holistic overview of data flow and its dependencies within organizations, including stream processing.
This talk will provide an overview of the most recent developments in the OpenLineage Flink integration and share what’s in store for this important collaboration.
This talk is a must-attend for those wishing to stay up-to-date on lineage developments in the stream processing world."
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D... (HostedbyConfluent)
"Kafka Connect, the framework for building scalable and reliable data pipelines, has gained immense popularity in the data engineering landscape. This session will provide a comprehensive guide to creating Kafka connectors using Kotlin, a language known for its conciseness and expressiveness.
In this session, we will explore a step-by-step approach to crafting Kafka connectors with Kotlin, from inception to deployment, using a simple use case. The process includes the following key aspects:
Understanding Kafka Connect: We'll start with an overview of Kafka Connect and its architecture, emphasizing its importance in real-time data integration and streaming.
Connector Design: Delve into the design principles that govern connector creation. Learn how to choose between source and sink connectors and identify the data format that suits your use case.
Building a Source Connector: We'll start with building a Kafka source connector, exploring key considerations, such as data transformations, serialization, deserialization, error handling and delivery guarantees. You will see how Kotlin's concise syntax and type safety can simplify the implementation.
Testing: Learn how to rigorously test your connector to ensure its reliability and robustness, utilizing best practices for testing in Kotlin.
Connector Deployment: Go through the process of deploying your connector in a Kafka Connect cluster, and discuss strategies for monitoring and scaling.
Real-World Use Cases: Explore real-world examples of Kafka connectors built with Kotlin.
By the end of this session, you will have a solid foundation for creating and deploying Kafka connectors using Kotlin, equipped with practical knowledge and insights to make your data integration processes more efficient and reliable. Whether you are a seasoned developer or new to Kafka Connect, this guide will help you harness the power of Kafka and Kotlin for seamless data flow in your applications."
The document discusses the requirements and architecture of an SDN controller. It states that an SDN controller should be a flexible platform that can accommodate diverse applications through common APIs and extensibility. It should also scale to support independent development and integration of applications. The OpenDaylight controller satisfies these requirements through its use of YANG modeling and the Model-Driven Service Abstraction Layer (MD-SAL). MD-SAL generates Java classes from YANG models and provides messaging between controller components.
Apache Samza is a stream processing framework that provides high-level APIs and powerful stream processing capabilities. It is used by many large companies for real-time stream processing. The document discusses Samza's stream processing architecture at LinkedIn, how it scales to process billions of messages per day across thousands of machines, and new features around faster onboarding, powerful APIs including Apache Beam support, easier development through high-level APIs and tables, and better operability in YARN and standalone clusters.
Near real-time anomaly detection at Lyft (markgrover)
Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
Understanding Framework Architecture using Eclipse (anshunjain)
Talk on framework architectures given at SAP Labs India for Eclipse Day India 2011. Code attached here: https://sites.google.com/site/anshunjain/eclipse-presentations
Experiences with Evangelizing Java Within the Database (Marcelo Ochoa)
The document discusses experiences with evangelizing the use of Java within Oracle databases. It provides a timeline of Java support in Oracle databases from 8i to 12c. It describes developing, testing, and deploying database-resident Java applications. Examples discussed include a content management system and RESTful web services implemented as stored procedures, as well as the Scotas OLS product for embedded Solr search. The conclusion covers challenges with open source projects, impedance mismatch between databases and Java, and lack of overlap between skillsets.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... (Guido Schmutz)
Spark Streaming and Kafka Streams are two popular stream processing platforms. Spark Streaming uses micro-batching and allows for code reuse between batch and streaming jobs. Kafka Streams is embedded directly into Apache Kafka and leverages Kafka as its internal messaging layer. Both platforms support stateful stream processing operations like windowing, aggregations, and joins through distributed state stores. A demo application is shown that detects dangerous driving by joining truck position data with driver data using different streaming techniques.
The document discusses the history of build tools and outlines a roadmap for sbt 1.0 focused on stability and modularization. It introduces sbt-server as a way to centralize build tasks and allow multiple clients to interact with the build. Key points of the sbt-server design include running tasks in a centralized queue, handling reconnects, using a versioned protocol for communication, and supporting background jobs and input. The document also discusses bringing existing sbt plugins onto sbt-server without breaking functionality.
Real-time Streaming Pipelines with FLaNK (Data Con LA)
Introducing the FLaNK stack which combines Apache Flink, Apache NiFi and Apache Kafka to build fast applications for IoT, AI, rapid ingest and deploy them anywhere. I will walk through live demos and show how to do this yourself.
FLaNK provides a quick set of tools to build applications at any scale for any streaming and IoT use cases.
We will discuss a use case - Smart Stocks with FLaNK (NiFi, Kafka, Flink SQL)
Bio -
Tim Spann is an avid blogger and the Big Data Zone Leader for DZone (https://dzone.com/users/297029/bunkertor.html). He runs the successful Future of Data Princeton meetup with over 1200 members at http://www.meetup.com/futureofdata-princeton/. He is currently a Senior Solutions Engineer at Cloudera in the Princeton, New Jersey area. You can find all the source and material behind his talks at his GitHub and community blog:
https://github.com/tspannhw/ApacheDeepLearning201
https://community.hortonworks.com/users/9304/tspann.html
Automate the operation of your Oracle Cloud infrastructure v2.0 (Nelson Calero)
Presentation delivered at the Collaborate 19 conference in April 2019 in San Antonio.
Abstract: The Oracle Cloud provides APIs and command line utilities to handle your infrastructure in the cloud without using the web console. In addition, there are orchestration tools such as Terraform to build, change and version your infrastructure, allowing automation and configuration management.
This session introduces to OCI services and APIs through examples from a DBA perspective, looking to minimize manual interventions when creating instances and containers, deploying a cluster using the project terraform-kubernetes-installer, and backing up your databases.
This is an updated version of a similar session I did last year, now focused on OCI's new-generation services and tools.
Best Practices for Middleware and Integration Architecture Modernization with... (Claus Ibsen)
This document discusses best practices for middleware and integration architecture modernization using Apache Camel. It provides an overview of Apache Camel, including what it is, how it works through routes, and the different Camel projects. It then covers trends in integration architecture like microservices, cloud native, and serverless. Key aspects of Camel K and Camel Quarkus are summarized. The document concludes with a brief discussion of the Camel Kafka Connector and pointers to additional resources.
Building a fully managed stream processing platform on Flink at scale for Lin... (Flink Forward)
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
1. Writing an interactive interface for SQL on Flink
How and why we created SQLStreamBuilder—and the lessons learned along the way
Kenny Gorman
Co-Founder and CEO
www.eventador.io
2019 Flink Forward Berlin
2. Background and motivations
● Eventador.io has offered a managed Flink runtime for a few years now. We
started to see some customer patterns emerge.
● The state of the art today is to write Flink jobs in Java or Scala using the
DataStream/DataSet APIs and/or the Table API.
● While powerful, the time and expertise needed aren't trivial, so adoption and
time to market lag.
● Teams are busy writing code. Completely swamped to be precise.
3. Why SQL anyway?
● SQL is > 30 years old. It’s massively useful for inspecting and reasoning about
data. Everyone knows SQL.
● It’s declarative, just ask for what you want to see.
● It’s been extended to accommodate streaming constructs like windows
(Flink/Calcite) - see the example below.
● Streaming SQL never completes, it’s a query on boundless data.
● It’s an amazing way to interact with streaming data.
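For instance, a minimal windowed query in the Flink/Calcite dialect (an illustrative example, not from the talk; it assumes a stream named sensors with a rowtime attribute eventTimestamp):

SELECT sensorid,
       TUMBLE_END(eventTimestamp, INTERVAL '1' MINUTE) AS window_end,
       MAX(temp) AS max_temp
FROM sensors
GROUP BY sensorid, TUMBLE(eventTimestamp, INTERVAL '1' MINUTE);

The query never returns a final result-set; it emits one row per sensor per minute for as long as the stream flows.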
4. 80% of workloads could be represented with SQL, and we plan to grow that.
20% require more complex logic best represented in Java/Scala.
5. What if we could go beyond simply building processors in SQL - do it
interactively, manage schemas, and make it all easy?
Could building logic on streams be as productive
and intuitive as using a database yet as scalable
and powerful as Flink?
6. Eventador SQLStreamBuilder
● Interactive SQL editor - create and submit any Flink compatible SQL
● Virtual Table Registry - source/sink + schema definition
● Query Parser - Gives instant feedback
● Job payload management - Builds job payloads
● Flink runner - Takes the payload and runs the job
● Delivered as a cloud service - in your AWS account
7. (Annotated screenshot of the SQL console: the SQL statement, where to run the job, where to send results, feedback on SQL execution, and a sample of results in the browser; sampling rather than a result-set.)
8. Schema management - Virtual Table Registry
● SQL requires a schema of typed columns - streams don't have to have this
(see the sketch after this slide).
● It's common to use AVRO (easy to solve for) but also free-form JSON.
● Free form means - a total F**ing mess.
● Sources - Kafka/Kinesis (soon)
● Sinks - Kafka, S3, JDBC, ELK (soon)
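For a sense of what the Virtual Table Registry captures, here is a hypothetical sketch: SSB itself defines this through its UI, and the property names below are illustrative rather than SSB syntax; the information is roughly what Flink's later CREATE TABLE DDL expresses:

CREATE TABLE sensors (
  sensorid BIGINT,
  temp DOUBLE,
  eventTimestamp TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensors',
  'format' = 'json'
);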
9. SQLStreamBuilder Components
● Interactive SQL interface
○ Handles query creation and submission.
○ Handles feedback from SQLIO
○ Interface to build queries, sources and sinks
○ Python + Vue.js
○ Results are sampled back to interface
● SQL engine (SQLIO)
○ Parse incoming statements
○ Map data sources/sinks
○ Parse schema (Schema Registry+AVRO / JSON)
○ Build POJOs
○ Submit payload to runner (Flink)
○ Java
● Virtual Table Registry
○ Creation of schema for streams
○ AVRO + JSON
○ Python
10. SQLStreamBuilder (cont'd)
● Job Management Interface
○ Stop/Start/Edit/etc
○ Python + Vue.js
○ Uses Flink APIs
● Builder
○ Handles creation of assets via K8s
○ Python
○ PostgreSQL backend
○ Kubernetes orchestration
● Flink runner
○ Run jobs on Flink 1.8.2
○ Kubernetes orchestration
○ Any Flink compatible SQL statement
12. Query Lifecycle - Execute
(Diagram: the SQL console submits the statement to SQLIO over Apache Kafka / Socket.io; sampled rows of columns and values flow back to the console.)
- If the function class exists: class, method, params
- Builds the Table API job:
.connect(
new Kafka()
.version("0.11")
.topic("...")
.sinkPartitionerXX
result.writeToSink(..);
env.execute(..);
- Enhanced schema typing
- Enhanced feedback/logging
- Sends base64 encoded payload to the Flink job
- Samples the data back to the user
13. Use cases:
● SQL join streams from multiple clusters/types (see the sketch below)
● Write to multiple types of sinks, building complex processing pipelines
● Aggregate data before pushing to expensive/slow database endpoints
● Conditionally write to multiple S3 buckets
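A sketch of the first pattern (illustrative, not from the talk; stream and column names are hypothetical): join a sensor stream against an account stream registered from another cluster and keep only alerts, with the result routed to an alerting sink:

SELECT s.sensorid, s.message, a.owner_phone
FROM sensors s
JOIN account_info a ON s.sensorid = a.sensorid
WHERE s.is_alert = 't';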
14. Building Processing Environments
(Diagram: several SQL jobs feeding different destinations - S3 buckets (s3://xxx/yyy), SMS, Data Science / ML team(s), and SnowFlakeDB or other Data Warehouse.)

SELECT * FROM sensors
JOIN account_info ON ...

SELECT sensorid, max(temp)
FROM stream
GROUP BY sensorid, tumble(..)

SELECT sensorid, region
FROM stream
WHERE region IN [...]

SELECT * FROM table
WHERE user_selected_thing = ‘foo’;

SELECT sensorid, message
FROM stream
WHERE is_alert = ‘t’
15. Javascript User Functions - Introduced Today
function ICAO_lookup(icao) {
  try {
    // Call an external HTTP service to look up the ICAO aircraft code
    var c = new java.net.URL('http://tornado.beebe.cc/' + icao).openConnection();
    c.requestMethod = 'GET';
    // Read the first line of the response body as the lookup result
    var reader = new java.io.BufferedReader(new java.io.InputStreamReader(c.inputStream));
    return reader.readLine();
  } catch (err) {
    // Never fail the job on a lookup error; return a marker value instead
    return "Unknown: " + err;
  }
}
ICAO_lookup($p0);
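Once defined, such a function can be referenced from SQL like any scalar function. An illustrative call (not from the talk; the stream and column names are hypothetical):

SELECT icao, ICAO_lookup(icao) AS aircraft_info
FROM adsb_stream;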