These slides introduce Apache Spark, to help you form an idea of Spark's architecture, data flow, job scheduling, and programming model. Not all technical details are included.
18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM
UNIT IV
Apache Spark Streaming Introduction - Spark’s Memory Usage - Understanding Resilience and Fault Tolerance in a Distributed System - Spark’s cluster manager - Data Delivery Semantics in Spark - Data Delivery Semantics in Spark Applications - Microbatching - Dynamic Batch Interval - Structured Stream processing model - Spark Streaming Resilience Model - Data Structures in Spark – RDDs and DStreams - Spark Fault Tolerance Guarantees - First Steps in Structured Streaming - Streaming Analytics Phases - Acquiring streaming data - Transforming streaming data - Output the resulting data - Demo – Stream Processing with Spark Streaming
Apache Spark Streaming Introduction
Spark offers two different stream-processing APIs:
• Spark Streaming
• Structured Streaming
Spark Streaming: This is an API and a set of connectors in which a Spark program is served small batches of data collected from a stream, in the form of microbatches spaced at fixed time intervals; it performs a given computation and returns a result at every interval.
Structured Streaming: This is an API and a set of connectors, built on the substrate of a SQL
query optimizer, Catalyst. It offers an API based on DataFrames and the notion of continuous
queries over an unbounded table that is constantly updated with fresh records from the stream.
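To make the difference concrete, here is a minimal, hedged sketch of a word count written against both APIs. The socket host/port values and the 2-second batch interval are arbitrary placeholders, and in practice an application would normally use only one of the two APIs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoStreamingApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("two-apis").master("local[*]").getOrCreate()

    // Spark Streaming (DStream API): microbatches collected every 2 seconds.
    val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
    ssc.socketTextStream("localhost", 9999)        // placeholder source
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()                                     // one result per batch interval

    // Structured Streaming: a continuous query over an unbounded table of rows.
    val counts = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9998).load()
      .selectExpr("explode(split(value, ' ')) AS word")
      .groupBy("word").count()
    val query = counts.writeStream.outputMode("complete").format("console").start()

    ssc.start()
    query.awaitTermination()
  }
}
```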
Spark’s Memory Usage
Spark offers in-memory storage of slices of a dataset, which must be initially loaded from
a data source. The data source can be a distributed filesystem or another storage medium. Spark’s
form of in-memory storage is analogous to the operation of caching data.
Hence, a value in Spark’s in-memory storage has a base, which is its initial data source, and
layers of successive operations applied to it.
Failure Recovery
What happens in case of a failure? Because Spark knows exactly which data source was used to ingest the data in the first place, and because it also knows all the operations that were performed on it thus far, it can reconstitute the segment of lost data that was on a crashed executor, from scratch. Obviously, this goes faster if that reconstitution (recovery, in Spark’s parlance) does not need to be totally from scratch. So, Spark offers a replication mechanism, quite in a similar way to distributed filesystems. However, because memory is such a valuable yet limited commodity, Spark makes (by default) the cache short-lived.
Lazy Evaluation
A good part of the operations that can be defined on values in Spark’s storage have a lazy execution, and it is the execution of a final, eager output operation that will trigger the actual execution of computation in a Spark cluster. It’s worth noting that if a program consists of a series of linear operations, with the previous one feeding into the next, the intermediate results disappear right after said next step has consumed its input.
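As a small sketch of this laziness (the file path is a hypothetical placeholder): the filter and map below only record a lineage of operations; nothing executes until the eager count action is called.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder.appName("lazy-eval").master("local[*]")
      .getOrCreate().sparkContext

    // Transformations are lazy: these lines only build up a lineage, no cluster work yet.
    val lines   = sc.textFile("hdfs:///data/events.log")   // hypothetical input
    val errors  = lines.filter(_.contains("ERROR"))
    val lengths = errors.map(_.length)

    // The eager action triggers the actual computation; intermediate results are
    // consumed by it and then discarded.
    println(s"error lines: ${lengths.count()}")
  }
}
```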
Cache Hints
On the other hand, what happens if we have several operations to do on a single intermediate result? Should we have to compute it several times? Thankfully, Spark lets users specify that an intermediate value is important and how its contents should be safeguarded for later. The figure below presents the data flow of such an operation.
Figure: Operations on cached values
Finally, Spark offers the opportunity to spill the cache to secondary storage in case it runs out of memory on the cluster, extending the in-memory operation to secondary (and significantly slower) storage to preserve the functional aspects of a data process when faced with temporary peak loads.
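Continuing the hypothetical example above, the sketch below marks an intermediate result as worth keeping: cache() keeps it in memory only, while persist(StorageLevel.MEMORY_AND_DISK) allows the spill-to-secondary-storage behavior just described when memory runs short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheHintSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder.appName("cache-hints").master("local[*]")
      .getOrCreate().sparkContext

    val errors = sc.textFile("hdfs:///data/events.log")     // hypothetical input
      .filter(_.contains("ERROR"))

    // Cache hint: this intermediate value is reused by two separate actions below.
    // MEMORY_AND_DISK lets Spark spill partitions to disk instead of dropping them.
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    val total  = errors.count()            // first action: computes and fills the cache
    val sample = errors.take(5)            // second action: served from the cache

    println(s"total errors: $total, sample: ${sample.mkString(" | ")}")
    errors.unpersist()
  }
}
```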
Now that we have an idea of the main characteristics of Apache Spark, let’s spend some time
focusing on one design choice internal to Spark, namely, the latency versus throughput trade-off.
Understanding Resilience and Fault Tolerance in a Distributed System
Resilience and fault tolerance are absolutely essential for a distributed application: they are the
condition by which we will be able to perform the user’s computation to completion. Nowadays,
clusters are made of commodity machines that are ideally operated near peak capacity over their
lifetime.
To put it mildly, hardware breaks quite often. A resilient application can make progress with its
process despite latencies and noncritical faults in its distributed environment. A fault-tolerant
application is able to succeed and complete its process despite the unplanned termination of one
or several of its nodes.
This sort of resiliency is especially relevant in stream processing given that the applications we’re
scheduling are supposed to live for an undetermined amount of time. That undetermined amount
of time is often correlated with the life cycle of the data source. For example, if we are running a
retail website and we are analyzing transactions and website interactions as they come into the
system against the actions and clicks and navigation of users visiting the site, we potentially have a
data source that will be available for the entire duration of the lifetime of our business, which we
hope to be very long, if our business is going to be successful.
As a consequence, a system that will process our data in a streaming fashion should run
uninterrupted for long periods of time.
This “show must go on” approach of streaming computation makes the resiliency and fault-tolerance characteristics of our applications more important. For a batch job, we could launch it, hope it would succeed, and relaunch if we needed to change it or in case of failure. For an online streaming Spark pipeline, this is not a reasonable assumption.
Fault Recovery
In the context of fault tolerance, we are also interested in understanding how long it takes to
recover from failure of one particular node. Indeed, stream processing has a particular aspect: data
continues being generated by the data source in real time. To deal with a batch computing failure,
we always have the opportunity to restart from scratch and accept that obtaining the results of
computation will take longer. Thus, a very primitive form of fault tolerance is detecting the failure
of a particular node of our deployment, stopping the computation, and restarting from scratch.
That process can take more than twice the original duration that we had budgeted for that
computation, but if we are not in a hurry, this still acceptable.
For stream processing, we need to keep receiving data and thus potentially storing it, if the
recovering cluster is not ready to assume any processing yet. This can pose a problem at a high
throughput: if we try restarting from scratch, we will need not only to reprocess all of the data that
we have observed since the beginning of the application—which in and of itself can be a
challenge—but during that reprocessing of historical data, we will need it to continue receiving
and thus potentially storing new data that was generated while we were trying to catch up. This
pattern of restarting from scratch is something so intractable for streaming that we will pay special
attention to Spark’s ability to restart only minimal amounts of computation in the case that a node
becomes unavailable or nonfunctional.
Cluster Manager Support for Fault Tolerance
We want to highlight why it is still important to understand Spark’s fault tolerance guarantees, even
if there are similar features present in the cluster managers of YARN, Mesos, or Kubernetes. To
understand this, we can consider that cluster managers help with fault tolerance when they work
hand in hand with a framework that is able to report failures and request new resources to cope
with those exceptions. Spark possesses such capabilities.
For example, production cluster managers such as YARN, Mesos, or Kubernetes have the ability
to detect a node’s failure by inspecting endpoints on the node and asking the node to report on its
own readiness and liveness state. If these cluster managers detect a failure and they have spare
capacity, they will replace that node with another, made available to Spark. That particular action
implies that the Spark executor code will start anew in another node, and then attempt to join the
existing Spark cluster.
The cluster manager, by definition, does not have introspection capabilities into the applications
being run on the nodes that it reserves. Its responsibility is limited to the container that runs the
user’s code.
That responsibility boundary is where the Spark resilience features start. To recover from a failed
node, Spark needs to do the following:
• Determine whether that node contains some state that should be reproduced in the form
of checkpointed files
• Understand at which stage of the job a node should rejoin the computation
The goal here is to show that if a node is being replaced by the cluster manager, Spark has capabilities that allow it to take advantage of this new node and to distribute computation onto it.
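To illustrate the checkpointed-state part of that list, here is a minimal Spark Streaming recovery sketch (the checkpoint directory and socket source are hypothetical placeholders): StreamingContext.getOrCreate either reconstitutes the context and its state from the checkpoint left by a previous run, or builds it fresh, which is what allows a replacement node or a restarted driver to rejoin the computation instead of starting over.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  val checkpointDir = "hdfs:///checkpoints/streaming-app"    // hypothetical location

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-recovery").setMaster("local[*]")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)                            // where state and metadata are written

    // Stateful computation whose state must survive a node or driver failure.
    val counts = ssc.socketTextStream("localhost", 9999)
      .map(line => (line, 1L))
      .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + values.sum))
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, the context (and its state) is reconstituted from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```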
Within this context, our focus is on Spark’s responsibilities as an application, and we underline the capabilities of a cluster manager only when necessary: for instance, a node could be replaced because of a hardware failure or because its work was simply preempted by a higher-priority job. Apache Spark is blissfully unaware of the why, and focuses on the how.
Spark’s cluster manager
Spark has two internal cluster managers:
The local cluster manager
This emulates the function of a cluster manager (or resource manager) for testing purposes.
It reproduces the presence of a cluster of distributed machines using a threading model that relies
on your local machine having only a few available cores. This mode is usually not very confusing
because it executes only on the user’s laptop.
The standalone cluster manager
A relatively simple, Spark-only cluster manager that is rather limited in its ability to slice and dice resource allocation. The standalone cluster manager holds and makes available the
entire worker node on which a Spark executor is deployed and started. It also expects the executor
to have been predeployed there, and the actual shipping of that .jar to a new machine is not within
its scope. It has the ability to take over a specific number of executors, which are part of its
deployment of worker nodes, and execute a task on it. This cluster manager is extremely useful for
the Spark developers to provide a bare-bones resource management solution that allows you to
focus on improving Spark in an environment without any bells and whistles. The standalone cluster
manager is not recommended for production deployments.
As a summary, Apache Spark is a task scheduler in that what it schedules are tasks, units of distribution of computation that have been extracted from the user program. Spark also communicates and is deployed through cluster managers, including Apache Mesos, YARN, and Kubernetes, or, in some cases, its own standalone cluster manager. The purpose of that communication is to reserve a number of executors, which are the units in which Spark understands equal-sized amounts of computation resources, a virtual “node” of sorts. The reserved resources in question could be provided by the cluster manager as the following:
• Limited processes (e.g., in some basic use cases of YARN), in which processes have their
resource consumption metered but are not prevented from accessing each other’s resource
by default.
• Containers (e.g., in the case of Mesos or Kubernetes), in which containers are a relatively lightweight resource reservation technology that is born out of the cgroups and namespaces of the Linux kernel and have known their most popular iteration with the Docker project.
• They could also be one of the above deployed on virtual machines (VMs), themselves coming with specific cores and memory reservation.
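As a concrete illustration of how an application ends up on one of these cluster managers, the sketch below selects the manager through the master URL and reserves executors of a fixed size; the addresses and sizes are placeholders, and in production the master is usually supplied by spark-submit rather than hardcoded.

```scala
import org.apache.spark.sql.SparkSession

object ClusterManagerSketch {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager is used (addresses are placeholders):
    //   "local[4]"                  -> local cluster manager: 4 threads on this machine (testing only)
    //   "spark://master-host:7077"  -> Spark standalone cluster manager
    //   "yarn", "mesos://host:5050", "k8s://https://api-server:6443"
    //                               -> external cluster managers, normally passed via spark-submit --master
    val spark = SparkSession.builder
      .appName("cluster-manager-demo")
      .master(sys.env.getOrElse("SPARK_MASTER", "local[4]"))
      .config("spark.executor.instances", "4")   // how many executors to reserve
      .config("spark.executor.memory", "2g")     // equal-sized slice of memory per executor
      .config("spark.executor.cores", "2")       // equal-sized slice of cores per executor
      .getOrCreate()

    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```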
Data Delivery Semantics in Spark
As you have seen in the streaming model, the fact that streaming jobs act on the basis of data that
is generated in real time means that intermediate results need to be provided to the consumer of
that streaming pipeline on a regular basis.
Those results are being produced by some part of our cluster. Ideally, we would like those
observable results to be coherent, in line, and in real time with respect to the arrival of data. This
means that we want results that are exact, and we want them as soon as possible. However,
distributed computation has its own challenges in that it sometimes includes not only individual
nodes failing, as we have mentioned, but it also encounters situations like network partitions, in
which some parts of our cluster are not able to communicate with other parts of that cluster, as
illustrated in the figure below.
Figure: A network partition
Spark has been designed using a driver/executor architecture. A specific machine, the driver, is
tasked with keeping track of the job progression along with the job submissions of a user, and the
computation of that program occurs as the data arrives. However, if the network partitions
separate some part of the cluster, the driver might be able to keep track of only the part of the
executors that form the initial cluster. In the other section of our partition, we will find nodes that
are entirely able to function, but will simply be unable to account for the proceedings of their
computation to the driver.
This creates an interesting case in which those “zombie” nodes do not receive new tasks, but might
well be in the process of completing some fragment of computation that they were previously
given. Being unaware of the partition, they will report their results as any executor would. And
because this reporting of results sometimes does not go through the driver (for fear of making the
driver a bottleneck), the reporting of these zombie results could succeed.
Because the driver, a single point of bookkeeping, does not know that those zombie executors are
still functioning and reporting results, it will reschedule the same tasks that the lost executors had
to accomplish on new nodes. This creates a double answering problem in which the zombie
machines lost through partitioning and the machines bearing the rescheduled tasks both report the
same results. This bears real consequences: one example of stream computation that we previously
mentioned is routing tasks for financial transactions. A double withdrawal, in that context, or
double stock purchase orders, could have tremendous consequences.
It is not only the aforementioned problem that causes different processing semantics. Another
important reason is that when output from a stream-processing application and state
checkpointing cannot be completed in one atomic operation, it will cause data corruption if failure
happens between checkpointing and outputting.
These challenges have therefore led to a distinction between at least once processing and at most
once processing:
• At least once: This processing ensures that every element of a stream has been processed
once or more.
• At most once: This processing ensures that every element of the stream is processed once
or less.
• Exactly once: This is the combination of “at least once” and “at most once.”
At-least-once processing is the notion that we want to make sure that every chunk of initial data
has been dealt with—it deals with the node failure we were talking about earlier. As we’ve
mentioned, when a streaming process suffers a partial failure in which some nodes need to be
replaced or some data needs to be recomputed, we need to reprocess the lost units of computation
while keeping the ingestion of data going. That requirement means that if you do not respect at-least-once processing, there is a chance for you, under certain conditions, to lose data.
The antisymmetric notion is called at-most-once processing. At-most-once processing systems
guarantee that the zombie nodes repeating the same results as a rescheduled node are treated in a
coherent manner, in which we keep track of only one set of results. By keeping track of what data
their results were about, we’re able to make sure we can discard repeated results, yielding at-most-once processing guarantees. The way in which we achieve this relies on the notion of idempotence
applied to the “last mile” of result reception. Idempotence qualifies a function such that if we
apply it twice (or more) to any data, we will get the same result as the first time. This can be
achieved by keeping track of the data that we are reporting a result for, and having a bookkeeping
system at the output of our streaming process.
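One common way to get that last-mile idempotence in Spark is Structured Streaming’s foreachBatch sink, sketched (in a deliberately simplified form) below: the output is keyed on the batch identifier, so if the same batch is reported twice, by a zombie executor, a rescheduled task, or a restart, it overwrites the same location rather than producing a duplicate. The source, paths, and column names are placeholders, not part of the text above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object IdempotentSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("idempotent-sink").master("local[*]").getOrCreate()

    val totals = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()   // built-in test source
      .selectExpr("value % 10 AS userId", "value AS amount")
      .groupBy("userId").sum("amount")

    // Bookkeeping at the output: write each microbatch to a location derived from its
    // batch id, so a replayed or duplicated batch overwrites the same data instead of
    // being counted twice (paths are hypothetical).
    val writeIdempotently: (DataFrame, Long) => Unit = (batch, batchId) =>
      batch.withColumn("batch_id", lit(batchId))
        .write.mode("overwrite")
        .parquet(s"hdfs:///output/user-totals/batch_id=$batchId")

    val query = totals.writeStream
      .outputMode("update")
      .foreachBatch(writeIdempotently)
      .option("checkpointLocation", "hdfs:///checkpoints/user-totals")
      .start()

    query.awaitTermination()
  }
}
```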
Microbatching
Two important approaches to stream processing:
• bulk-synchronous processing, and
• one-at-a-time record processing.
The objective of this is to connect those two ideas to the two APIs that Spark possesses for stream
processing: Spark Streaming and Structured Streaming.
Microbatching: An Application of Bulk-Synchronous Processing
Spark Streaming, the more mature model of stream processing in Spark, is roughly approximated
by what’s called a Bulk Synchronous Parallelism (BSP) system.
The gist of BSP is that it includes two things:
• A split distribution of asynchronous work
• A synchronous barrier, coming in at fixed intervals
The split is the idea that each of the successive steps of work to be done in streaming is separated
in a number of parallel chunks that are roughly proportional to the number of executors available
to perform this task. Each executor receives its own chunk (or chunks) of work and works
separately until the second element comes in. A particular resource is tasked with keeping track of
the progress of computation. With Spark Streaming, this is a synchronization point at the “driver”
that allows the work to progress to the next step. Between those scheduled steps, all of the
executors on the cluster are doing the same thing.
Note that what is being passed around in this scheduling process are the functions that describe
the processing that the user wants to execute on the data. The data is already on the various
executors, most often being delivered directly to these resources over the lifetime of the cluster.
This was coined “function-passing style” by Heather Miller in 2016 (and formalized in
[Miller2016]): asynchronously pass safe functions to distributed, stationary, immutable data in a
stateless container, and use lazy combinators to eliminate intermediate data structures.
The frequency at which further rounds of data processing are scheduled is dictated by a time interval. This time interval is an arbitrary duration that is measured in batch processing time; that is, what you would expect to see as a “wall clock” time observation in your cluster. For stream processing, we choose to implement barriers at small, fixed intervals that better approximate the real-time notion of data processing.
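A minimal sketch of choosing that barrier with the DStream API follows; the 2-second batch interval and the 30-second/10-second window are arbitrary example values (window and slide durations must be multiples of the batch interval), and the socket source is a placeholder.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-interval").setMaster("local[*]")

    // The batch interval is the fixed synchronization barrier: every 2 seconds of
    // wall-clock time, the data received so far becomes a microbatch and is scheduled.
    val ssc = new StreamingContext(conf, Seconds(2))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Windowed operations are expressed in multiples of that interval:
    // count word occurrences over the last 30 seconds, recomputed every 10 seconds.
    words.map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```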
One-Record-at-a-Time Processing
By contrast, one-record-at-a-time processing functions by pipelining: it analyzes the whole
computation as described by user-specified functions and deploys it as pipelines using the
resources of the cluster. Then, the only remaining matter is to flow data through the various
resources, following the prescribed pipeline. Note that in this latter case, each step of the
computation is materialized at some place in the cluster at any given point. Systems that function
mostly according to this paradigm include Apache Flink, Naiad, Storm, and IBM Streams. This
does not necessarily mean that those systems are incapable of microbatching, but rather
characterizes their major or most native mode of operation and makes a statement on their
dependency on the process of pipelining, often at the heart of their processing.
The minimum latency, or time needed for the system to react to the arrival of one particular event, is very different between those two: the minimum latency of the microbatching system is therefore the time needed to complete the reception of the current microbatch (the batch interval) plus the time needed to start a task at the executor where this data falls (also called scheduling time). On the other hand, a system processing records one by one can react as soon as it meets the event of interest.
Microbatching Versus One-at-a-Time: The Trade-Offs
Despite their higher latency, microbatching systems offer significant advantages:
• They are able to adapt at the synchronization barrier boundaries. That adaptation might
mean recovering from a failure, if some executors have become deficient or have lost data.
The periodic synchronization also gives us an opportunity to add or remove executor
nodes, letting us grow or shrink our resources depending on the cluster load, observed
through the throughput on the data source.
• Our BSP systems can sometimes have an easier time providing strong consistency because
their batch determinations—that indicate the beginning and the end of a particular batch
of data—are deterministic and recorded. Thus, any kind of computation can be redone
and produce the same results the second time.
• Having data available as a set that we can probe or inspect at the beginning of the
microbatch allows us to perform efficient optimizations that can provide ideas on the way
to compute on the data. Exploiting that on each microbatch, we can consider the specific
case rather than the general processing, which is used for all possible input. For example,
we could take a sample or compute a statistical measure before deciding to process or drop
each microbatch.
More importantly, the simple presence of the microbatch as a well-identified element also allows
an efficient way of specifying programming for both batch processing (where the data is at rest
and has been saved somewhere) and streaming (where the data is in flight). The microbatch, even
for mere instants, looks like data at rest.
Dynamic Batch Interval
What is this notion of dynamic batch interval? The dynamic batch interval is the notion that the
recomputation of data in a streaming DataFrame or Dataset consists of an update of existing data
with the new elements seen over the wire. This update occurs based on a trigger, and the usual
basis for that trigger is a time duration. That time duration is still determined based on a fixed world
clock signal that we expect to be synchronized within our entire cluster and that represents a single
synchronous source of time that is shared among every executor.
However, this trigger can also be the statement of “as often as possible.” That statement is simply
the idea that a new batch should be started as soon as the previous one has been processed, given
a reasonable initial duration for the first batch. This means that the system will launch batches as
often as possible. In this situation, the latency that can be observed is closer to that of one-element-
at-a-time processing. The idea here is that the microbatches produced by this system will converge
to the smallest manageable size, making our stream flow faster through the executor computations
that are necessary to produce a result. As soon as that result is produced, a new query will be
started and scheduled by the Spark driver.
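A minimal sketch of how these two behaviors are expressed in Structured Streaming, assuming a streaming DataFrame named events and a console sink purely for illustration: leaving the trigger unspecified gives the as-often-as-possible behavior, whereas Trigger.ProcessingTime pins batches to a fixed interval.
import org.apache.spark.sql.streaming.Trigger

// Default behavior: a new microbatch starts as soon as the previous one has finished.
val asOftenAsPossible = events.writeStream
  .format("console")
  .start()

// Fixed interval: a new microbatch is triggered every 30 seconds.
val fixedInterval = events.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()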
Structured Stream processing model
The main steps in Structured Streaming processing are as follows:
1. When the Spark driver triggers a new batch, processing starts with updating the accounting
of data read from a data source, in particular, getting the data offsets for the beginning and
the end of the latest batch.
2. This is followed by logical planning, the construction of successive steps to be executed
on data, followed by query planning (intrastep optimization).
3. And then the launch and scheduling of the actual computation by adding a new batch of
data to update the continuous query that we’re trying to refresh.
Hence, from the point of view of the computation model, we will see that the API is significantly
different from Spark Streaming.
The Disappearance of the Batch Interval
We now briefly explain what Structured Streaming batches mean and their impact with respect to
operations. In Structured Streaming, the batch interval that we are using is no longer a computation
budget. With Spark Streaming, the idea was that if we produce data every two minutes and flow
data into Spark’s memory every two minutes, we should produce the results of computation on
that batch of data in at least two minutes, to clear the memory from our cluster for the next
microbatch. Ideally, as much data flows out as flows in, and the usage of the collective memory of
our cluster remains stable.
With Structured Streaming, without this fixed time synchronization, our ability to see performance
issues in our cluster is more complex: a cluster that is unstable, that is, unable to “clear out” data
by finishing its computation as fast as new data flows in, will see ever-growing batch processing
times with accelerating growth. We can expect that keeping an eye on this batch processing
time will be pivotal.
However, if we have a cluster that is correctly sized with respect to the throughput of our data,
there are a lot of advantages to having an as-often-as-possible update. In particular, we should expect
to see very frequent results from our Structured Streaming cluster, with a higher granularity than
we were used to with a conservative batch interval.
Spark Streaming Resilience Model
In most cases, a streaming job is a long-running job. By definition, streams of data observed and
processed over time lead to jobs that run continuously. As they process data, they might
accumulate intermediary results that are difficult to reproduce after the data has left the processing
system. Therefore, the cost of failure is considerable and, in some cases, complete recovery is
intractable.
In distributed systems, especially those relying on commodity hardware, failure is a function of
size: the larger the system, the higher the probability that some component fails at any time.
Distributed stream processors need to factor this chance of failure in their operational model.
We look at the resilience that the Apache Spark platform provides us: how it’s able to recover
partial failure and what kinds of guarantees we are given for the data passing through the system
when a failure occurs. We begin by getting an overview of the different internal components of
Spark and their relation to the core data structure. With this knowledge, you can proceed to
understand the impact of failure at the different levels and the measures that Spark offers to
recover from such failure.
RDDs and DStreams
Spark builds its data representations on Resilient Distributed Datasets (RDDs). Introduced in 2011
by the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing” [Zaharia2011], RDDs are the foundational data structure in Spark. It is at this ground
level that the strong fault tolerance guarantees of Spark start.
RDDs are composed of partitions, which are segments of data stored on individual nodes and
tracked by the Spark driver; the RDD itself is presented to the user as a single, location-transparent data structure.
We illustrate these components in below Figure in which the classic word count application is
broken down into the different elements that comprise an RDD.
Figure: An RDD operation represented in a distributed system
The colored blocks are data elements, originally stored in a distributed filesystem, represented on
the far left of the figure. The data is stored as partitions, illustrated as columns of colored blocks
inside the file. Each partition is read into an executor, which we see as the horizontal blocks. The
actual data processing happens within the executor. There, the data is transformed following the
transformations described at the RDD level:
• .flatMap(l => l.split(" ")) separates each line into words, splitting on spaces.
• .map(w => (w,1)) transforms each word into a tuple of the form (word, 1), in this way
preparing the words for counting.
• .reduceByKey(_ + _) computes the count, using the word as a key and applying a sum
operation to the attached number.
• The final result is attained by bringing the partial results together using the same reduce
operation (a complete, runnable sketch of this word count follows below).
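To make the chain concrete, here is a minimal, self-contained sketch of the same word count expressed against the RDD API. The session setup and the input path are illustrative assumptions, not part of the original figure.
import org.apache.spark.sql.SparkSession

// Illustrative local session; on a real cluster the master would come from spark-submit.
val spark = SparkSession.builder()
  .appName("rdd-word-count-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical input file; any text file in a (distributed) filesystem would do.
val lines = sc.textFile("/tmp/sample-text.txt")

val counts = lines
  .flatMap(l => l.split(" "))  // split each line into words
  .map(w => (w, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)          // sum the counts per word across partitions

counts.take(10).foreach(println)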
RDDs constitute the programmatic core of Spark. All other abstractions, batch and streaming
alike, including DataFrames, DataSets, and DStreams are built using the facilities created by RDDs,
and, more important, they inherit the same fault tolerance capabilities.
Another important characteristic of RDDs is that Spark will try to keep their data preferably in
memory for as long as it is required, provided there is enough capacity in the system. This behavior is
configurable through storage levels and can be explicitly controlled by calling caching operations.
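As a brief sketch of what that control looks like in practice (the RDD name someRdd is an illustrative assumption):
import org.apache.spark.storage.StorageLevel

// Keep the data in memory and replicate each partition to a second node,
// so a restarted task can reuse its input without recomputation.
val cached = someRdd.persist(StorageLevel.MEMORY_ONLY_2)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
val cachedDefault = someRdd.cache()

// Release the storage when it is no longer needed.
cached.unpersist()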
We mention those structures here to present the idea that Spark tracks the progress of the user’s
computation through modifications of the data. Indeed, knowing how far along we are in what
the user wants to do through inspecting the control flow of their program (including loops and
potential recursive calls) can be a daunting and error-prone task. It is much more reliable to define
types of distributed data collections, and let the user create one from another, or from other data
sources.
In below figure, we show the same word count program, now in the form of the user provided
code (left) and the resulting internal RDD chain of operations. This dependency chain forms a
particular kind of graph, a Directed Acyclic Graph (DAG). The DAG informs the scheduler,
appropriately called DAG Scheduler, on how to distribute the computation and is also the
foundation of the failure-recovery functionality, because it represents the internal data and their
dependencies.
Figure: RDD lineage
As the system tracks the ordered creation of these distributed data collections, it tracks the work
done, and what’s left to accomplish.
Data Structures in Spark
To understand at what level fault tolerance operates in Spark, it’s useful to go through an overview
of the nomenclature of some core concepts. We begin by assuming that the user provides a
program that ends up being divided into chunks and executed on various machines, as we saw in
the previous section, and as depicted in below Figure.
Figure: Spark nomenclature
Let’s run down those steps, which define the vocabulary of the Spark runtime:
User Program The user application in Spark Streaming is composed of user-specified function
calls operating on a resilient data structure (RDD, DStream, streaming DataSet, and so on),
categorized as actions and transformations.
Transformed User Program The user program may undergo adjustments that modify some of
the specified calls to make them simpler, the most approachable and understandable of which is
map-fusion. Query plan is a similar but more advanced concept in Spark SQL.
RDD A logical representation of a distributed, resilient, dataset. In the illustration, we see that the
initial RDD comprises three parts, called partitions.
Partition A partition is a physical segment of a dataset that can be loaded independently.
Stages The user’s operations are then grouped into stages, whose boundary separates user
operations into steps that must be executed separately. For example, operations that require a
shuffle of data across multiple nodes, such as a join between the results of two distinct upstream
operations, mark a distinct stage. Stages in Apache Spark are the unit of sequencing: they are
executed one after the other. At most one of any interdependent stages can be running at any given
time.
Jobs After these stages are defined, what internal actions Spark should take is clear. Indeed, at this
stage, a set of interdependent jobs is defined. And jobs, precisely, are the vocabulary for a unit of
scheduling. They describe the work at hand from the point of view of an entire Spark cluster,
whether it’s waiting in a queue or currently being run across many machines.
Tasks Depending on where their source data is on the cluster, jobs can then be cut into tasks,
crossing the conceptual boundary between distributed and single-machine computing: a task is a
unit of local computation, the name for the local, executor-bound part of a job.
Spark aims to make sure that all of these steps are safe from harm and to recover quickly in the
case of any incident occurring in any stage of this process. This concern is reflected in fault-
tolerance facilities that are structured by the aforementioned notions: restart and checkpointing
operations that occur at the task, job, stage, or program level.
Spark Fault Tolerance Guarantees
Now that we have seen the “pieces” that constitute the internal machinery in Spark, we are ready
to understand that failure can happen at many different levels. In this section, we see Spark fault-
tolerance guarantees organized by “increasing blast radius,” from the more modest to the larger
failure. We are going to investigate the following:
• How Spark mitigates Task failure through restarts
• How Spark mitigates Stage failure through the shuffle service
• How Spark mitigates the disappearance of the orchestrator of the user program, through
driver restarts
Task Failure Recovery: Tasks can fail when the infrastructure on which they are running has a
failure, or when logical conditions in the program lead to a sporadic failure, such as an OutOfMemory error,
network or storage errors, or problems bound to the quality of the data being processed.
If the input data of the task was stored, through a call to cache() or persist() and if the chosen
storage level implies a replication of data, the task does not need to have its input recomputed,
because a copy of it exists in complete form on another machine of the cluster. We can then use
this input to restart the task. The storage levels configurable in Spark differ in their
characteristics in terms of memory usage and replication factor.
If, however, there was no persistence or if the storage level does not guarantee the existence of a
copy of the task’s input data, the Spark driver will need to consult the DAG that stores the user-
specified computation to determine which segments of the job need to be recomputed.
Consequently, without enough precautions to save either on the caching or on the storage level,
the failure of a task can trigger the recomputation of several others, up to a stage boundary.
Stage boundaries imply a shuffle, and a shuffle implies that intermediate data will somehow be
materialized: as we discussed, the shuffle transforms executors into data servers that can provide
the data to any other executor serving as a destination.
As a consequence, executors that participated in a shuffle keep a copy of the map output that led
up to it. That is a lifesaver if a downstream executor dies, because the surviving tasks can rely on
the upstream servers of the shuffle (which serve the output of the map-like operation). But what
if it's the contrary, and you need to face the crash of one of the upstream executors?
Stage Failure Recovery We’ve seen that task failure (possibly due to executor crash) was the
most frequent incident happening on a cluster and hence the most important event to mitigate.
Recurrent task failures will lead to the failure of the stage that contains that task. This brings us to
the second facility that allows Spark to resist arbitrary stage failures: the shuffle service.
When this failure occurs, it always means some rollback of the data, but a shuffle operation, by
definition, depends on all of the prior executors involved in the step that precedes it.
As a consequence, since Spark 1.3 we have the shuffle service, which lets you work on map data
that is saved and distributed through the cluster with a good locality, but, more important, through
a server that is not a Spark task. It’s an external file exchange service written in Java that has no
dependency on Spark and is made to be a much longer-running service than a Spark executor. This
additional service attaches as a separate process in all cluster modes of Spark and simply offers a
data file exchange for executors to transmit data reliably, right before a shuffle. It is highly
optimized through the use of a Netty backend, to allow a very low overhead in transmitting data.
This way, an executor can shut down after the execution of its map task, as soon as the shuffle
service has a copy of its data. And because data transfers are faster, this transfer time is also highly
reduced, reducing the vulnerable time in which any executor could face an issue.
Driver Failure Recovery Having seen how Spark recovers from the failure of a particular task
and stage, we can now look at the facilities Spark offers to recover from the failure of the driver
program. The driver in Spark has an essential role: it is the depository of the block manager, which
knows where each block of data resides in the cluster. It is also the place where the DAG lives.
Finally, it is where the scheduling state of the job, its metadata, and logs reside. Hence, if the
driver is lost, a Spark cluster as a whole might well have lost which stage it has reached in
computation, what the computation actually consists of, and where the data that serves it can be
found, in one fell swoop.
Cluster-mode deployment Spark has implemented what’s called the cluster deployment mode,
which allows the driver program to be hosted on the cluster, as opposed to the user’s computer.
The deployment mode is one of two options: in client mode, the driver is launched in the same
process as the client that submits the application. In cluster mode, however, the driver is launched
from one of the worker processes inside the cluster, and the client process exits as soon as it fulfills
its responsibility of submitting the application without waiting for the application to finish.
This, in sum, allows Spark to operate an automatic driver restart, so that the user can start a job in
a “fire and forget fashion,” starting the job and then closing their laptop to catch the next train.
Every cluster mode of Spark offers a web UI that will let the user access the log of their application.
Another advantage is that driver failure does not mark the end of the job, because the driver
process will be relaunched by the cluster manager. But this only allows recovery from scratch,
given that the temporary state of the computation, previously stored in the driver machine,
might have been lost.
Checkpointing To avoid losing intermediate state in case of a driver crash, Spark offers the option
of checkpointing; that is, periodically recording a snapshot of the application's state to disk. The
directory set through sparkContext.setCheckpointDir() should point to reliable storage (e.g.,
Hadoop Distributed File System [HDFS]) because having the driver try to reconstruct the state of
intermediate RDDs from its local filesystem makes no sense: those intermediate RDDs are being
created on the executors of the cluster and should as such not require any interaction with the
driver for backing them up.
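A minimal sketch of wiring checkpointing to reliable storage, assuming spark is the SparkSession and ssc a StreamingContext; the HDFS path is an illustrative assumption.
// Point the checkpoint directory at reliable, shared storage (here a hypothetical HDFS path),
// never at the driver's local filesystem.
spark.sparkContext.setCheckpointDir("hdfs:///checkpoints/my-streaming-app")

// For Spark Streaming, the StreamingContext exposes an equivalent setting.
ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")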
First Steps in Structured Streaming
In the previous section, we learned about the high-level concepts that constitute Structured
Streaming, such as sources, sinks, and queries. We are now going to explore Structured Streaming
from a practical perspective, using a simplified web log analytics use case as an example.
Before we begin delving into our first streaming application, we are going to see how classical
batch analysis in Apache Spark can be applied to the same use case.
This exercise has two main goals:
• First, most, if not all, streaming data analytics start by studying a static data sample. It is
far easier to start a study with a file of data, gain intuition on how the data looks, what kind
of patterns it shows, and define the process that we require to extract the intended
knowledge from that data. Typically, it’s only after we have defined and tested our data
analytics job, that we proceed to transform it into a streaming process that can apply our
analytic logic to data on the move.
• Second, from a practical perspective, we can appreciate how Apache Spark simplifies many
aspects of transitioning from a batch exploration to a streaming application through the
use of uniform APIs for both batch and streaming analytics. This exploration will allow
us to compare and contrast the batch and streaming APIs in Spark and show us the
necessary steps to move from one to the other.
Batch Analytics
Given that we are working with archive log files, we have access to all of the data at once. Before
we begin building our streaming application, let's take a brief intermezzo to have a look at what a
classical batch analytics job would look like.
First, we load the log files, encoded as JSON, from the directory where we unpacked them:
// This is the location of the unpackaged files. Update accordingly
val logsDirectory = ???
val rawLogs = sparkSession.read.json(logsDirectory)
Next, we declare the schema of the data as a case class to use the typed Dataset API. Following
the formal description of the dataset (at NASA-HTTP ), the log is structured as follows:
The logs are an ASCII file with one line per request, with the following columns:
• Host making the request. A hostname when possible, otherwise the Internet address if the
name could not be looked up.
• Timestamp in the format “DAY MON DD HH:MM:SS YYYY,” where DAY is the day
of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is
the time of day using a 24-hour clock, and YYYY is the year. The timezone is –0400.
• Request given in quotes.
• HTTP reply code.
• Bytes in the reply.
Translating that schema to Scala, we have the following case class definition:
import java.sql.Timestamp
case class WebLog(host: String,
timestamp: Timestamp,
request: String,
http_reply: Int,
bytes: Long )
We convert the original JSON to a typed data structure using the previous schema definition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// we need to narrow the `Integer` type because
// the JSON representation is interpreted as `BigInteger`
val preparedLogs = rawLogs.withColumn("http_reply", $"http_reply".cast(IntegerType))
val weblogs = preparedLogs.as[WebLog]
Now that we have the data in a structured format, we can begin asking the questions that interest
us. As a first step, we would like to know how many records are con‐ tained in our dataset:
val recordCount = weblogs.count >recordCount: Long = 1871988
A common question would be: “what was the most popular URL per day?” To answer that, we
first reduce the timestamp to the day of the month. We then group by this new dayOfMonth
column and the request URL and we count over this aggregate. We finally order using descending
order to get the top URLs first:
val topDailyURLs = weblogs.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topDailyURLs.show()
+----------+----------------------------------------+-----+
|dayOfMonth| request|count|
+----------+----------------------------------------+-----+
| 13|GET /images/NASA-logosmall.gif HTTP/1.0 |12476|
| 13|GET /htbin/cdt_main.pl HTTP/1.0 | 7471|
| 12|GET /images/NASA-logosmall.gif HTTP/1.0 | 7143|
| 13|GET /htbin/cdt_clock.pl HTTP/1.0 | 6237|
| 6|GET /images/NASA-logosmall.gif HTTP/1.0 | 6112|
| 5|GET /images/NASA-logosmall.gif HTTP/1.0 | 5865| ...
Top hits are all images. What now? It’s not unusual to see that the top URLs are images commonly
used across a site. Our true interest lies in the content pages generating the most traffic. To find
those, we first filter on html content and then proceed to apply the top aggregation we just learned.
As we can see, the request field is a quoted sequence of [HTTP_VERB] URL [HTTP_VERSION].
We will extract the URL and preserve only those ending in .html, .htm, or no extension
(directories). This is a simplification for the purpose of this example:
val urlExtractor = """^GET (.+) HTTP/d.d""".r
val allowedExtensions = Set(".html",".htm", "")
val contentPageLogs = weblogs.filter
{log => log.request match
{ case urlExtractor(url) => val ext = url.takeRight(5).dropWhile(c => c != '.')
allowedExtensions.contains(ext) case _ => false
}
}
With this new dataset that contains only .html, .htm, and directories, we proceed to apply the same
top-k function as earlier:
val topContentPages = contentPageLogs
.withColumn("dayOfMonth", dayofmonth($"timestamp"))
.select($"request", $"dayOfMonth")
.groupBy($"dayOfMonth", $"request")
.agg(count($"request").alias("count"))
.orderBy(desc("count"))
topContentPages.show()
+----------+------------------------------------------------+-----+
|dayOfMonth| request|count|
+----------+------------------------------------------------+-----+
| 13| GET /shuttle/countdown/liftoff.html HTTP/1.0" | 4992|
| 5| GET /shuttle/countdown/ HTTP/1.0" | 3412|
| 6| GET /shuttle/countdown/ HTTP/1.0" | 3393|
| 3| GET /shuttle/countdown/ HTTP/1.0" | 3378|
| 13| GET /shuttle/countdown/ HTTP/1.0" | 3086|
| 7| GET /shuttle/countdown/ HTTP/1.0" | 2935|
| 4| GET /shuttle/countdown/ HTTP/1.0" | 2832|
| 2| GET /shuttle/countdown/ HTTP/1.0" | 2330| ...
We can see that the most popular page that month was liftoff.html, corresponding to the coverage
of the launch of the Discovery shuttle, as documented on the NASA archives. It's closely followed
by countdown/, the days prior to the launch.
Streaming Analytics Phases
In the previous section, we explored historical NASA web log records. We found trending events
in those records, but much later than when the actual events happened.
One key driver for streaming analytics comes from the increasing demand of organizations to have
timely information that can help them make decisions at many different levels.
We can use the lessons that we have learned while exploring the archived records using a batch-
oriented approach and create a streaming job that will provide us with trending information as it
happens.
The first difference that we observe with the batch analytics is the source of the data. For our
streaming exercise, we will use a TCP server to simulate a web system that delivers its logs in real
time. The simulator will use the same dataset but will feed it through a TCP socket connection
that will embody the stream that we will be analyzing.
Connecting to a Stream
If you recall from the introduction of this chapter, Structured Streaming defines the concepts of
sources and sinks as the key abstractions to consume a stream and produce a result. We are going
to use the TextSocketSource implementation to connect to the server through a TCP socket.
Socket connections are defined by the host of the server and the port where it is listening for
connections. These two configuration elements are required to create the socket source:
val stream = sparkSession.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
Note how the creation of a stream is quite similar to the declaration of a static data source in the
batch case. Instead of using the read builder, we use the readStream construct and we pass to it
the parameters required by the streaming source. As you will see during the course of this exercise
and later on as we go into the details of Structured Streaming, the API is basically the same
DataFrame and Dataset API for static data but with some modifications and limitations that you
will learn in detail.
Preparing the Data in the Stream
The socket source produces a streaming DataFrame with one column, value, which contains the
data received from the stream.
In the batch analytics case, we could load the data directly as JSON records. In the case of the
Socket source, that data is plain text. To transform our raw data to WebLog records, we first
require a schema. The schema provides the necessary information to parse the text into a JSON
object. It's the “structure” we refer to when we talk about Structured Streaming.
After defining a schema for our data, we proceed to create a Dataset, following these steps:
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.functions.from_json

case class WebLog(host: String,
                  timestamp: Timestamp,
                  request: String,
                  http_reply: Int,
                  bytes: Long)

val webLogSchema = Encoders.product[WebLog].schema
val jsonStream = stream.select(from_json($"value", webLogSchema) as "record")
val webLogStream: Dataset[WebLog] = jsonStream.select("record.*").as[WebLog]
1. Obtain a schema from the case class definition
2. Transform the text value to JSON using the JSON support built into Spark SQL
3. Use the Dataset API to transform the JSON records to WebLog objects
As a result of this process, we obtain a Streaming Dataset of WebLog records.
Operations on Streaming Dataset
The webLogStream we just obtained is of type Dataset[WebLog] like we had in the batch analytics
job. The difference between this instance and the batch version is that webLogStream is a
streaming Dataset.
We can observe this by querying the object:
webLogStream.isStreaming
> res: Boolean = true
At this point in the batch job, we were creating the first query on our data: How many records are
contained in our dataset? This is a question that we can easily answer when we have access to all
of the data. However, how do we count records that are constantly arriving? The answer is that
some operations that we consider usual on a static Dataset, like counting all records, do not have
a defined meaning on a Streaming Dataset.
As we can observe, attempting to execute the count query in the following code snippet will result
in an AnalysisException:
val count = webLogStream.count()
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();;
This means that the direct queries we used on a static Dataset or DataFrame now need two levels
of interaction. First, we need to declare the transformations of our stream, and then we need to
start the stream process.
Creating a Query
What are popular URLs? In what time frame? Now that we have immediate analytic access to the
stream of web logs, we don’t need to wait for a day or a month to have a rank of the popular
URLs. We can have that information as trends unfold in much shorter windows of time.
First, to define the period of time of our interest, we create a window over some timestamp. An
interesting feature of Structured Streaming is that we can define that time interval on the timestamp
when the data was produced, also known as event time, as opposed to the time when the data is
being processed.
Our window definition will be of five minutes of event data. Given that our timeline is simulated,
the five minutes might happen much faster or slower than the clock time. In this way, we can
clearly appreciate how Structured Streaming uses the timestamp information in the events to
keep track of the event timeline.
As we learned from the batch analytics, we should extract the URLs and select only content pages,
like .html, .htm, or directories. Let’s apply that acquired knowledge first before proceeding to
define our windowed query:
// A regex expression to extract the accessed URL from weblog.request
val urlExtractor = """^GET (.+) HTTP/\d.\d""".r
val allowedExtensions = Set(".html", ".htm", "")
val contentPageLogs: String => Boolean = url => {
val ext = url.takeRight(5).dropWhile(c => c != '.')
allowedExtensions.contains(ext)
}
val urlWebLogStream = webLogStream.flatMap { weblog =>
weblog.request match {
case urlExtractor(url) if (contentPageLogs(url)) =>
Some(weblog.copy(request = url))
case _ => None
}
}
We have converted the request to contain only the visited URL and filtered out all noncontent
pages. Now, we define the windowed query to compute the top trending URLs:
val rankingURLStream = urlWebLogStream
.groupBy($"request", window($"timestamp", "5 minutes", "1 minute"))
.count()
Start the Stream Processing
All of the steps that we have followed so far have been to define the process that the stream will
undergo. But no data has been processed yet.
To start a Structured Streaming job, we need to specify a sink and an output mode. These are two
new concepts introduced by Structured Streaming:
• A sink defines where we want to materialize the resulting data; for example, to a file in a
filesystem, to an in-memory table, or to another streaming system such as Kafka.
• The output mode defines how we want the results to be delivered: do we want to see all
data every time, only updates, or just the new records?
These options are given to a writeStream operation. It creates the streaming query that starts the
stream consumption, materializes the computations declared on the query, and produces the result
to the output sink. For now, let’s use them empirically and observe the results.
For our query, shown in the Example below, we use the memory sink and the complete output
mode, so that we keep a fully updated table of the URL ranking each time new records are added
to the result.
Example. Writing a stream to a sink
val query = rankingURLStream.writeStream
.queryName("urlranks")
.outputMode("complete")
.format("memory")
.start()
The memory sink outputs the data to a temporary table of the same name given in the queryName
option. We can observe this by querying the tables registered on Spark SQL:
scala> spark.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | urlranks| true|
+--------+---------+-----------+
In the expression in Example, query is of type StreamingQuery and it’s a handler to control the
query life cycle.
Exploring the Data
Given that we are accelerating the log timeline on the producer side, after a few seconds, we can
execute the next command to see the result of the first windows, as illustrated in Figure. Note how
the processing time (a few seconds) is decoupled from the event time (hundreds of minutes of
logs):
val urlRanks = spark.sql("select * from urlranks")
urlRanks.select($"request", $"window", $"count").orderBy(desc("count"))
Figure: URL ranking: query results by window
Acquiring streaming data
In Structured Streaming, a source is an abstraction that lets us consume data from a streaming data
producer. Sources are not directly created. Instead, the sparkSession provides a builder method,
readStream, that exposes the API to specify a streaming source, called a format, and provide its
configuration.
For example, the code in Example creates a File streaming source. We specify the type of source
using the format method. The method schema lets us provide a schema for the data stream, which
is mandatory for certain source types, such as this File source.
Example. File streaming source
val fileStream = spark.readStream
.format("json")
.schema(schema)
.option("mode","DROPMALFORMED")
.load("/tmp/datasrc")
>fileStream:
org.apache.spark.sql.DataFrame = [id: string, timestamp: timestamp ... ]
Each source implementation has different options, and some have tunable parameters. In
Example, we are setting the option mode to DROPMALFORMED. This option instructs the
JSON stream processor to drop any line that neither complies with the JSON format nor matches
the provided schema.
Behind the scenes, the call to spark.readStream creates a DataStreamBuilder instance. This instance
is in charge of managing the different options provided through the builder method calls. Calling
load(...) on this DataStreamBuilder instance validates the options provided to the builder and, if
everything checks out, it returns a streaming DataFrame.
In our example, this streaming DataFrame represents the stream of data that will result from
monitoring the provided path and processing each new file in that path as JSON-encoded data,
parsed using the schema provided. All malformed records will be dropped from this data stream.
Loading a streaming source is lazy. What we get is a representation of the stream, embodied in the
streaming DataFrame instance, that we can use to express the series of transformations that we
want to apply to it in order to implement our specific business logic. Creating a streaming
DataFrame does not result in any data actually being consumed or processed until the stream is
materialized. This requires a query, as you will see further on.
Available Sources
As of Spark v2.4.0, the following streaming sources are supported:
• json, orc, parquet, csv, text, textFile: These are all file-based streaming sources. The
base functionality is to monitor a path (folder) in a filesystem and consume files atomically
placed in it. The files found will then be parsed by the formatter specified. For example, if
json is provided, the Spark json reader will be used to process the files, using the schema
information provided.
• socket Establishes a client connection to a TCP server that is assumed to provide text data
through a socket connection.
• kafka Creates a Kafka consumer able to retrieve data from Kafka.
• rate Generates a stream of rows at the rate given by the rowsPerSecond option. It’s mainly
intended as a testing source.
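As a quick illustration of the last item in the list, this is a minimal sketch of the rate source; the chosen rate is an arbitrary assumption.
// Generates rows with two columns, `timestamp` and `value`, at 10 rows per second;
// useful for exercising a pipeline without a real data producer.
val testStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()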
Transforming streaming data
As we saw in the previous section, the result of calling load is a streaming DataFrame. After we
have created our streaming DataFrame using a source, we can use the Dataset or DataFrame API
to express the logic that we want to apply to the data in the stream in order to implement our
specific use case.
Assuming that we are using data from a sensor network, in the Example below we are selecting the fields
deviceId, timestamp, sensorType, and value from a sensorStream, and filtering to only those
records where the sensor is of type temperature and its value is higher than the given threshold.
Example: Filter and projection
val highTempSensors = sensorStream
.select($"deviceId", $"timestamp", $"sensorType", $"value")
.where($"sensorType" === "temperature" && $"value" > threshold)
Likewise, we can aggregate our data and apply operations to the groups over time. Example shows
that we can use timestamp information from the event itself to define a time window of five
minutes that will slide every minute.
What is important to grasp here is that the Structured Streaming API is practically the same as the
Dataset API for batch analytics, with some additional provisions specific to stream processing.
Example: Average by sensor type over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType", $"value")
.groupBy(window($"timestamp", "5 minutes", "1 minute"), $"sensorType")
.agg(avg($"value"))
If you are not familiar with the structured APIs of Spark, we suggest that you familiarize yourself
with it. Covering this API in detail is beyond the scope of this book.
Streaming API Restrictions on the DataFrame API
As we hinted in the previous chapter, some operations that are offered by the standard DataFrame
and Dataset API do not make sense in a streaming context. We gave the example of stream.count,
which does not make sense to use on a stream. In general, operations that require immediate
materialization of the underlying dataset are not allowed. These are the API operations not directly
supported on streams:
• count
• show
• describe
• limit
• take(n)
• distinct
• foreach
• sort
• multiple stacked aggregations
Next to these operations, stream-stream and static-stream joins are partially supported.
Understanding the limitations Although some operations, like count or limit, do not make sense
on a stream, some other stream operations are computationally difficult. For example, distinct is
one of them. To filter duplicates in an arbitrary stream, it would require that you remember all of
the data seen so far and compare each new record with all records already seen. The first condition
would require infinite memory and the second has a computational complexity of O(n²), which
becomes prohibitive as the number of elements (n) increases.
Operations on aggregated streams Some of the unsupported operations become defined after
we apply an aggregation function to the stream. Although we can’t count the stream, we could
count messages received per minute or count the number of devices of a certain type.
In Example, we define a count of events per sensorType per minute.
Example: Count of sensor types over time
val avgBySensorTypeOverTime = sensorStream
.select($"timestamp", $"sensorType")
.groupBy(window($"timestamp", "1 minutes", "1 minute"), $"sensorType")
.count()
Likewise, it’s also possible to define a sort on aggregated data, although it’s further restricted to
queries with output mode complete.
Stream deduplication We discussed that distinct on an arbitrary stream is computationally
difficult to implement. But if we can define a key that informs us when an element in the stream
has already been seen, we can use it to remove duplicates by calling stream.dropDuplicates with that key column.
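A minimal sketch, assuming the stream carries a hypothetical eventId column that uniquely identifies each record and a timestamp column we can bound with a watermark, so that Spark does not have to remember keys forever:
// Without a watermark, Spark would keep every key seen so far in state.
val deduplicated = stream
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("eventId", "timestamp")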
Workarounds Although some operations are not supported in the exact same way as in the batch
model, there are alternative ways to achieve the same functionality:
• foreach Although foreach cannot be directly used on a stream, there’s a foreach sink that
provides the same functionality. Sinks are specified in the output definition of a stream.
• show Although show requires an immediate materialization of the query, and hence it’s
not possible on a streaming Dataset, we can use the console sink to output data to the
screen.
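A brief sketch of the console-sink workaround for show, assuming a streaming DataFrame named stream as in the surrounding examples:
// Instead of stream.show(), print each microbatch to standard output.
val debugQuery = stream.writeStream
  .format("console")
  .option("numRows", "20")      // how many rows to display per batch
  .option("truncate", "false")  // do not truncate long values
  .outputMode("append")
  .start()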
Output the resulting data
All operations that we have done so far—such as creating a stream and applying transformations
on it—have been declarative. They define from where to consume the data and what operations
we want to apply to it. But up to this point, there is still no data flowing through the system.
Before we can initiate our stream, we need to first define where and how we want the output data
to go:
• Where relates to the streaming sink: the receiving side of our streaming data.
• How refers to the output mode: how to treat the resulting records in our stream.
From the API perspective, we materialize a stream by calling writeStream on a streaming
DataFrame or Dataset.
Calling writeStream on a streaming Dataset creates a DataStreamWriter. This is a builder instance
that provides methods to configure the output behavior of our streaming process.
Example. File streaming sink
val query = stream.writeStream
.format("json")
.queryName("json-writer")
.outputMode("append")
.option("path", "/target/dir")
.option("checkpointLocation", "/checkpoint/dir")
.trigger(ProcessingTime("5 seconds"))
.start()
> query: org.apache.spark.sql.streaming.StreamingQuery = ...
format
The format method lets us specify the output sink by providing the name of a built-in sink or the
fully qualified name of a custom sink.
As of Spark v2.4.0, the following streaming sinks are available:
• console sink A sink that prints to the standard output. It shows a number of rows
configurable with the option numRows.
• file sink File-based and format-specific sink that writes the results to a filesystem. The
format is specified by providing the format name: csv, hive, json, orc, parquet, avro, or
text.
• kafka sink A Kafka-specific producer sink that is able to write to one or more Kafka
topics.
• memory sink Creates an in-memory table using the provided query name as table name.
This table receives continuous updates with the results of the stream.
• foreach sink Provides a programmatic interface to access the stream contents, one
element at the time.
• foreachBatch sink foreachBatch is a programmatic sink interface that provides access to
the complete DataFrame that corresponds to each underlying microbatch of the
Structured Streaming execution.
outputMode
The outputMode specifies the semantics of how records are added to the output of the streaming
query. The supported modes are append, update, and complete:
• append (default mode) Adds only final records to the output stream. A record is
considered final when no new records of the incoming stream can modify its value. This
is always the case with linear transformations like those resulting from applying projection,
filtering, and mapping. This mode guarantees that each resulting record will be output only
once.
• update Adds new and updated records since the last trigger to the output stream. update
is meaningful only in the context of an aggregation, where aggregated values change as
new records arrive. If more than one incoming record changes a single result, all changes
between trigger intervals are collated into one output record.
• complete complete mode outputs the complete internal representation of the stream. This
mode also relates to aggregations, because for nonaggregated streams, we would need to
remember all records seen so far, which is unrealistic. From a practical perspective,
complete mode is recommended only when you are aggregating values over low-cardinality
criteria, like count of visitors by country, for which we know that the number of countries
is bounded.
Understanding the append semantic
When the streaming query contains aggregations, the definition of final becomes nontrivial. In
an aggregated computation, new incoming records might change an existing aggregated value
when they comply with the aggregation criteria used. Following our definition, we cannot
output a record using append until we know that its value is final. Therefore, the use of the
append output mode in combination with aggregate queries is restricted to queries for which
the aggregation is expressed using event-time and it defines a watermark. In that case, append
will output an event as soon as the watermark has expired and hence it’s considered that no
new records can alter the aggregated value. As a consequence, output events in append mode
will be delayed by the aggregation time window plus the watermark offset.
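A hedged sketch of an event-time aggregation that is valid in append mode because it declares a watermark; the column names, output path, and checkpoint location are illustrative assumptions.
import org.apache.spark.sql.functions.window

// Events more than 10 minutes behind the maximum observed event time are considered late.
// Once a 5-minute window falls behind the watermark, its final count is emitted in append mode.
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"deviceId")
  .count()

val appendQuery = windowedCounts.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/windowed-counts")
  .option("checkpointLocation", "/tmp/windowed-counts-checkpoint")
  .start()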
queryName With queryName, we can provide a name for the query that is used by some
sinks and also presented in the job description in the Spark Console, as depicted in Figure.
Figure: Completed Jobs in the Spark UI showing the query name in the job description
option With the option method, we can provide specific key–value pairs of configuration to the
stream, akin to the configuration of the source. Each sink can have specific configuration we can
customize using this method. We can add as many .option(...) calls as necessary to configure the
sink.
options options is an alternative to option that takes a Map[String, String] containing all the key–
value configuration parameters that we want to set. This alternative is more friendly to an
externalized configuration model, where we don't know a priori the settings to be passed to the
sink's configuration.
trigger The optional trigger option lets us specify the frequency at which we want the results to
be produced. By default, Structured Streaming will process the input and produce a result as soon
as possible. When a trigger is specified, output will be produced at each trigger interval.
org.apache.spark.sql.streaming.Trigger provides the following supported triggers:
• ProcessingTime() Lets us specify a time interval that will dictate the frequency of the
query results.
• Once() A particular Trigger that lets us execute a streaming job once. It is useful for testing
and also to apply a defined streaming job as a single-shot batch operation.
• Continuous() This trigger switches the execution engine to the experimental continuous
engine for low-latency processing. The checkpoint-interval parameter indicates the
frequency of the asynchronous checkpointing for data resilience. It should not be confused
with the batch interval of the ProcessingTime trigger.
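A short sketch of the three trigger variants on a writeStream, assuming a streaming DataFrame named stream; the intervals are illustrative, and none of these queries run until start() is called, which the next section covers.
import org.apache.spark.sql.streaming.Trigger

// Produce results every 30 seconds.
val everyThirtySeconds = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))

// Process all available data once, then stop; handy for testing or single-shot batch runs.
val justOnce = stream.writeStream
  .format("console")
  .trigger(Trigger.Once())

// Experimental continuous engine with asynchronous checkpoints every second.
val continuous = stream.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))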
start() To materialize the streaming computation, we need to start the streaming process.
Finally, start() materializes the complete job description into a streaming computation and
initiates the internal scheduling process that results in data being consumed from the source,
processed, and produced to the sink. start() returns a StreamingQuery object, which is a handle
to manage the individual life cycle of each query. This means that we can simultaneously start
and stop multiple queries independently of one another within the same sparkSession.
Demo
The first part of our program deals with the creation of the streaming Dataset:
val rawData = sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServer)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
> rawData: org.apache.spark.sql.DataFrame
The entry point of Structured Streaming is an existing Spark Session (sparkSession). As you can
appreciate on the first line, the creation of a streaming Dataset is almost identical to the creation
of a static Dataset that would use a read operation instead. sparkSession.readStream returns a
DataStreamReader, a class that implements the builder pattern to collect the information needed
to construct the streaming source using a fluid API. In that API, we find the format option that
lets us specify our source provider, which, in our case, is kafka. The options that follow it are
specific to the source:
• kafka.bootstrap.servers
o Indicates the set of bootstrap servers to contact as a comma-separated list of
host:port addresses
• subscribe
o Specifies the topic or topics to subscribe to
• startingOffsets
o The offset reset policy to apply when this application starts out fresh.
The load() method evaluates the DataStreamReader builder and creates a DataFrame as a result,
as we can see in the returned value:
> rawData: org.apache.spark.sql.DataFrame
A DataFrame is an alias for Dataset[Row] with a known schema. After creation, you can use
streaming Datasets just like regular Datasets. This makes it possible to use the full-fledged Dataset
API with Structured Streaming, albeit with some exceptions, because not all operations, such as
show() or count(), make sense in a streaming context.
To programmatically differentiate a streaming Dataset from a static one, we can ask a Dataset
whether it is of the streaming kind:
rawData.isStreaming
res7: Boolean = true
And we can also explore the schema attached to it, using the existing Dataset API, as demonstrated
in Example.
Example. The Kafka schema
rawData.printSchema()
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
In general, Structured Streaming requires the explicit declaration of a schema for the consumed
stream. In the specific case of kafka, the schema for the resulting Dataset is fixed and is
independent of the contents of the stream. It consists of a set of fields specific to the Kafka source:
key, value, topic, partition, offset, timestamp, and timestampType, as we can see in the Example above.
In most cases, applications will be mostly interested in the contents of the value field, where the
actual payload of the stream resides.
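Since the Kafka value field arrives as binary, a common first step, shown here as a brief sketch and assuming spark.implicits._ is in scope as in the surrounding examples, is to cast it to text before parsing:
// Extract the payload as a string; from here it can be parsed as CSV, JSON, etc.
val payload = rawData.selectExpr("CAST(value AS STRING)").as[String]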
Application Logic
Recall that the intention of our job is to correlate the incoming IoT sensor data with a reference
file that contains all known sensors with their configuration. That way, we would enrich each
incoming record with specific sensor parameters that would allow us to interpret the reported data.
We would then save all correctly processed records to a Parquet file. The data coming from
unknown sensors would be saved to a separate file for later analysis.
Using Structured Streaming, our job can be implemented in terms of Dataset operations:
import scala.util.Try

val iotData = rawData.select($"value").as[String].flatMap { record =>
  val fields = record.split(",")
  Try {
    SensorData(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
  }.toOption
}

val sensorRef = sparkSession.read.parquet(s"$workDir/$referenceFile")
sensorRef.cache()

val sensorWithInfo = sensorRef.join(iotData, Seq("sensorId"), "inner")

val knownSensors = sensorWithInfo
.withColumn("dnvalue", $"value"*($"maxRange"-$"minRange")+$"minRange")
.drop("value", "maxRange", "minRange")
In the first step, we transform our CSV-formatted records back into SensorData entries. We apply
Scala functional operations on the typed Dataset[String] that we obtained from extracting the value
field as a String.
Then, we use a streaming Dataset to static Dataset inner join to correlate the sensor data with the
corresponding reference using the sensorId as key.
To complete our application, we compute the real values of the sensor reading using the
minimum-maximum ranges in the reference data.
Writing to a Streaming Sink
The final step of our streaming application is to write the enriched IoT data to a Parquet-formatted
file. In Structured Streaming, the write operation is crucial: it marks the completion of the declared
transformations on the stream, defines a write mode, and upon calling start(), the processing of
the continuous query will begin.
In Structured Streaming, all operations are lazy declarations of what we want to do with the
streaming data. Only when we call start() will the actual consumption of the stream begin and the
query operations on the data materialize into actual results:
val knownSensorsQuery = knownSensors.writeStream
.outputMode("append")
.format("parquet")
.option("path", targetPath)
.option("checkpointLocation", "/tmp/checkpoint")
.start()
Let’s break this operation down:
• writeStream creates a builder object where we can configure the options for the desired
write operation, using a fluent interface.
• With format, we specify the sink that will materialize the result downstream. In our case,
we use the built-in FileStreamSink with Parquet format.
• The output mode is a new concept in Structured Streaming: given that we, theoretically, have access
to all the data seen in the stream so far, we also have the option to produce different views
of that data.
• The append mode, used here, implies that the new records affected by our streaming
computation are produced to the output.
The result of the start call is a StreamingQuery instance. This object provides methods to control
the execution of the query and request information about the status of our running streaming
query, as shown in Example.
Example. Query progress
knownSensorsQuery.recentProgress
res37: Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array({
"id" : "6b9fe3eb-7749-4294-b3e7-2561f1e840b6",
"runId" : "0d8d5605-bf78-4169-8cfe-98311fc8365c",
"name" : null,
"timestamp" : "2017-08-10T16:20:00.065Z",
"numInputRows" : 4348,
"inputRowsPerSecond" : 395272.7272727273,
"processedRowsPerSecond" : 28986.666666666668,
"durationMs" : {
"addBatch" : 127,
"getBatch" : 3,
"getOffset" : 1,
"queryPlanning" : 7,
"triggerExecution" : 150,
"walCommit" : 11
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[iot-data]]",
"startOffset" : {
"iot-data" : { "0" : 19048348 } },
"endOffset" : {
"iot-data" : { "0" : 19052696 } },
"numInputRow...
In the Example, we can see the StreamingQueryProgress as a result of calling
knownSensorsQuery.recentProgress. If we see nonzero values for numInputRows, we can be
certain that our job is consuming data. We now have a Structured Streaming job running
properly.
Stream Processing with Spark Streaming
Spark Streaming was the first stream-processing framework built on top of the distributed
processing capabilities of Spark. Nowadays, it offers a mature API that’s widely adopted in the
industry to process large-scale data streams.
Spark is, by design, a system that is really good at processing data distributed over a cluster of
machines. Spark’s core abstraction, the Resilient Distributed Dataset (RDD), and its fluent
functional API permit the creation of programs that treat distributed data as a collection. That
abstraction lets us reason about data-processing logic in the form of transformations of the
distributed dataset. By doing so, it reduces the cognitive load previously required to create and
execute scalable and distributed data-processing programs.
Spark Streaming was created upon a simple yet powerful premise: apply Spark’s distributed
computing capabilities to stream processing by transforming a continuous stream of data into
discrete data collections on which Spark can operate.
As we can see in Figure, the main task of Spark Streaming is to take data from the stream, package
it into small batches, and provide them to Spark for further processing. The output is then
produced to some downstream system.
Figure. Spark and Spark Streaming in action
The DStream Abstraction
Whereas Structured Streaming, which you learned in Part II, builds its streaming capabilities on
top of the Spark SQL abstractions of DataFrame and Dataset, Spark Streaming relies on the much
more fundamental Spark abstraction of RDD. At the same time, Spark Streaming introduces a
new concept: the Discretized Stream or DStream. A DStream represents a stream in terms of
discrete blocks of data that in turn are represented as RDDs over time, as we can see in Figure.
Figure. DStreams and RDDs in Spark Streaming
The DStream abstraction is primarily an execution model that, when combined with a functional
programming model, provides us with a complete framework to develop and execute streaming
applications.
DStreams as a Programming Model
The code representation of DStreams gives us a functional programming API consistent with the
RDD API and augmented with stream-specific functions to deal with aggregations, time-based
operations, and stateful computations. In Spark Streaming, we consume a stream by creating a
DStream from one of the native implementations, such as a SocketInputStream, or by using one of
the many connectors that provide a DStream implementation specific to a stream provider (as is
the case with the Kafka, Twitter, or Kinesis connectors for Spark Streaming, to name a few):
// creates a DStream using a client socket connected to the given host and port
val textDStream = ssc.socketTextStream("localhost", 9876)
After we have obtained a DStream reference, we can implement our application logic using the
functions provided by the DStream API. For example, if the textDStream in the preceding code
is connected to a log server, we could count the number of error occurrences:
// we break down the stream of logs into error or info (not error)
// and create pairs of `(x, y)`.
// (1, 1) represents an error, and
// (0, 1) a non-error occurrence.
val errorLabelStream = textDStream.map{line =>
if (line.contains("ERROR")) (1, 1) else (0, 1)
}
We can then count the totals and compute the error rate by using an aggregation function called
reduce:
31. 31
18CSE489T/STREAMING ANALYTICS Dr. A. Manju/AP/SRMIST, RAMAPURAM
// reduce combines all the (error, count) pairs within each batch
// by applying the provided function pairwise.
val errorCountStream = errorLabelStream.reduce {
  case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
}
To obtain our error rate, we perform a safe division:
// compute the error rate and create a string message with the value
val errorRateStream = errorCountStream.map {case (errors, total) =>
val errorRate = if (total > 0 ) errors.toDouble/total else 0.0
"Error Rate:" + errorRate
}
It’s important to note that, up until now, we have only been declaring transformations on the
DStream; no data processing has happened yet. All transformations on DStreams are lazy. This
process of defining the logic of a stream-processing application is better seen as the set of
transformations that will be applied to the data after the stream processing is started. As such, it’s
a plan of action that Spark Streaming will recurrently execute on the data consumed from the
source DStream. DStreams are immutable. It’s only through a chain of transformations that we can
process and obtain a result from our data.
Finally, the DStream programming model requires that the chain of transformations ends with an
output operation. This particular operation specifies how the DStream is materialized. In our case,
we are interested in printing the results of this stream computation to the console:
// print the results to the console
errorRateStream.print()
In summary, the DStream programming model consists of the functional composition of
transformations over the stream payload, materialized by one or more output operations and
recurrently executed by the Spark Streaming engine.
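Note that, much like start() in Structured Streaming, nothing runs until the StreamingContext itself is started. A minimal sketch of that final step, assuming ssc is the StreamingContext used to create textDStream above, could look like this:
// Start consuming data and executing the declared transformations
ssc.start()
// Block the driver until the streaming job is stopped or fails
ssc.awaitTermination()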
DStreams as an Execution Model
In the preceding introduction to the Spark Streaming programming model, we could see how data
is transformed from its original form into our intended result as a series of lazy functional
transformations. The Spark Streaming engine is responsible for taking that chain of functional
transformations and turning it into an actual execution plan. That happens by receiving data from
the input stream(s), collecting that data into batches, and feeding it to Spark in a timely manner.
The measure of time to wait for data is known as the batch interval. It is usually a short amount
of time, ranging from approximately two hundred milliseconds to minutes depending on the
application requirements for latency. The batch interval is the central unit of time in Spark
Streaming. At each batch interval, the data corresponding to the previous interval is sent to Spark
for processing while new data is received. This process repeats as long as the Spark Streaming job
is active and healthy. A natural consequence of this recurring microbatch operation is that the
computation on the batch’s data has to complete within the duration of the batch interval so that
computing resources are available when the new microbatch arrives. As you will learn in this part
of the book, the batch interval dictates the time for most other functions in Spark Streaming.
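The batch interval is fixed when the StreamingContext is created and applies to every DStream derived from it. A minimal sketch, using a local SparkConf and application name purely for illustration, could look like the following:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// The batch interval (here, 2 seconds) is chosen at StreamingContext creation time.
val conf = new SparkConf()
  .setAppName("error-rate-monitor") // hypothetical application name
  .setMaster("local[2]")            // local mode for illustration only
val ssc = new StreamingContext(conf, Seconds(2))
// DStreams created from this context, such as ssc.socketTextStream(...),
// will be packaged into 2-second microbatches.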