The slides I prepared for the Paris Apache Kafka Meetup (https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/268164461/) about Apache Kafka integration in Apache Spark Structured Streaming.
7. Streaming query execution - micro-batch
(diagram) For micro-batch t1: load the state for the query, load the offsets to process and write them to the offset log, process the data, confirm the processed offsets and the next watermark in the commit log, then commit the state; t2 repeats the cycle. The partition-based state store, the offset log and the commit log all live under the checkpoint location.
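To make the micro-batch cycle above concrete, here is a minimal sketch of a query reading from Kafka and writing to the console; the broker address, topic name and checkpoint path are assumptions, not values from the talk. The checkpointLocation is where the offset log, commit log and state store shown in the diagram are persisted.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("micro-batch-sketch").getOrCreate()

// source: a Kafka topic (assumed broker and topic names)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "visits")                       // assumed topic
  .load()

// sink: console, with the checkpoint location holding the offset log,
// commit log and state store used at every micro-batch (t1, t2, ...)
val query = events.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/visits") // assumed path
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()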
9. Streaming query execution - continuous (experimental)
(diagram) Long-running, per-partition tasks (task 1, task 2, task 3) process data continuously and report their processed offsets to the epoch coordinator, which orders the offsets logging and persists the offsets to the offset log and commit log in the checkpoint location.
10. Streaming query execution - continuous (experimental)
(diagram) Same flow as the previous slide, with one addition: the epoch coordinator persists the offsets only if all tasks processed their offsets within the epoch.
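As a rough illustration of the continuous mode described above, the sketch below switches the trigger to Trigger.Continuous, which makes the long-running per-partition tasks report their offsets to the epoch coordinator once per epoch. It reuses the spark session from the previous sketch; the topic names and checkpoint path are assumptions. Continuous mode is experimental and only supports map-like operations and a limited set of sources and sinks (Kafka among them).

import org.apache.spark.sql.streaming.Trigger

// continuous Kafka-to-Kafka copy; an epoch is committed every second
val continuousQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
  .option("subscribe", "visits")                        // assumed topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "visits-copy")                       // assumed output topic
  .option("checkpointLocation", "/tmp/checkpoints/visits-continuous") // assumed path
  .trigger(Trigger.Continuous("1 second"))
  .start()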
11. Popular data transformations
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
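A short usage sketch chaining a few of these transformations on a typed Dataset; the Visit case class and the input path are hypothetical, introduced only for illustration.

import org.apache.spark.sql.{Dataset, SparkSession}

case class Visit(userId: String, page: String, durationSeconds: Long) // hypothetical schema

val sparkSession = SparkSession.builder().appName("transformations-sketch").getOrCreate()
import sparkSession.implicits._

val visits: Dataset[Visit] = sparkSession.read.json("/tmp/visits.json").as[Visit] // assumed input path

// filter + map + groupByKey, matching the signatures listed above
val longVisitsPerUser = visits
  .filter($"durationSeconds" > 30)
  .map(v => (v.userId, v.durationSeconds))
  .groupByKey(_._1)
  .count()

longVisitsPerUser.show()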
16. Kafka data source configuration
⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
⇢ What? startingOffsets, endingOffsets - topic/partition or global
17. Kafka data source configuration
⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
⇢ What? startingOffsets, endingOffsets - topic/partition or global
⇢ How? data loss failure (streaming), max reading rate control, Spark partitions number
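The options above map onto the reader roughly as in the sketch below, which reuses the spark session from the first sketch; the broker, topic pattern and values are assumptions. startingOffsets also accepts a per-partition JSON form (with -2 standing for earliest and -1 for latest), as shown in the speaker notes at the end.

// where? what? how? - a sketch with assumed broker, topics and values
val input = spark.readStream
  .format("kafka")
  // where? - broker + topic selection
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribePattern", "visits-.*")
  // what? - offsets, global here, or per topic/partition as a JSON string
  .option("startingOffsets", "earliest")
  // how? - reading rate control and Spark partitions number
  .option("maxOffsetsPerTrigger", "10000")
  .option("minPartitions", "20")
  .load()

endingOffsets is not shown because, as the speaker notes point out, it only applies to batch reads; the data-loss failure knob (failOnDataLoss) is sketched after the data loss protection slides below.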
20. From the fetch to the reading - micro-batch
(diagram) On the driver: initialize the offsets to process by asking the Apache Kafka broker for the next offsets to process and the max offsets in each partition (when no maxOffsetsPerTrigger is set), run the data loss checks and the skewness optimization, and, if new data is available and no fatal failure occurred, distribute the offsets to the executors with data locality in mind. On each executor: create a data consumer if needed and poll data as long as the read offset < max offset for the topic/partition; the processed offsets are then checkpointed.
22. Data loss protection - conditions
● deleted partitions
● expired records (metadata consumer)
23. Data loss protection - conditions
● deleted partitions
● expired records (metadata consumer)
● new partitions with missing offsets
24. Data loss protection - conditions
● deleted partitions
● expired records (metadata consumer)
● new partitions with missing offsets
● expired records (data consumer)
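A small sketch of the knob controlling how these conditions are handled: with the default failOnDataLoss=true the streaming query fails when one of them is detected, while false only logs a warning and keeps going. Broker and topic names are assumptions.

// same reader as before, but tolerant to the data loss conditions listed above
val tolerantInput = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "visits")                       // assumed topic
  .option("failOnDataLoss", "false")                   // default is "true"
  .load()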
31. 1 or multiple outputs - how?
// excerpt from KafkaRowWriter
private def createProjection = {
  val topicExpression = topic.map(Literal(_)).orElse {
    inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME)
  }.getOrElse {
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
  // ...
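The projection above resolves the output topic either from the topic option or from a topic attribute present in the written rows. A sketch of both variants follows, reusing the events stream from the first sketch; topic names and checkpoint paths are assumptions.

// 1) single output topic, fixed through the option
events.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "sessions")                             // assumed topic
  .option("checkpointLocation", "/tmp/checkpoints/single") // assumed path
  .start()

// 2) multiple output topics, routed per row through a "topic" attribute
events.selectExpr(
    "CASE WHEN CAST(key AS STRING) = 'premium' THEN 'premium-visits' ELSE 'standard-visits' END AS topic",
    "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/checkpoints/multi")  // assumed path
  .start()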
32. Summary
● micro-batch oriented
● low latency (continuous mode) still a work in progress
● fault-tolerance with the checkpoint mechanism
● batch and streaming supported
● an alternative to other streaming approaches
33. Resources
● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
● Structured streaming support for consuming from Kafka: https://issues.apache.org/jira/browse/SPARK-15406
● Github data generator: https://github.com/bartosz25/data-generator
● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo
● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming
#8: ask if everybody is aware of the watermark
explain the idea of the state store + where it can be stored
explain the checkpoint location + where it can be stored (HDFS-compatible fs)
#12: limit is useless since it will stop returning data as soon as it's reached
#13: limit is useless since it will stop returning data as soon as it's reached
#14: is the code used in the transformation distributed only once, for the first query, or is it compiled & distributed for every query?
#16:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but it only applies to batch processing!
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
optional:
> failOnDataLoss
> maxOffsetsPerTrigger
> minPartitions
#17:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but it only applies to batch processing!
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
optional:
> failOnDataLoss
> maxOffsetsPerTrigger
> minPartitions
#18:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but it only applies to batch processing!
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
optional:
> failOnDataLoss
> maxOffsetsPerTrigger
> minPartitions
#20: TODO: extract_json
no schema registry support, even though there was a blog post by Xebia about integrating it
#21: explain that it can be different → V1 vs V2 data source
say that it doesn't happen for the next query because data is stored in memory, unless the data loss check runs
poll data = seek + poll
poll data ⇒ explain data loss checks
consumer lifecycle on the executor ⇒ Is it closed after the batch read? In fact, it depends on whether there are new topics/partitions. If not, it's reused; if yes, a new one is created.
an exception ⇒ continuous streaming mode always recreates a new consumer! EXPLAIN the diff between micro-batch and continuous reader
#27: explain why not transactions (see comment from wfc)
#28: say that KafkaRowWriter is shared by V1 and V2 data sinks