Spark & Spark Streaming Internals - Nov 15 (1)

Dec 6, 2014 · 4 likes · 853 views

This document summarizes Spark and Spark Streaming internals. It discusses the Resilient Distributed Dataset (RDD) model in Spark, which allows for fault tolerance through lineage-based recomputation. It provides an example of log mining using RDD transformations and actions. It then discusses Spark Streaming, which provides a simple API for stream processing by treating streams as series of small batch jobs on RDDs. Key concepts discussed include Discretized Stream (DStream), transformations, and output operations. An example Twitter hashtag extraction job is outlined.

Spark & Spark Streaming Internals
Akhil Das
akhil@sigmoidanalytics.com

Resilient Distributed Dataset (RDD)
Restricted form of distributed shared memory
➔ Immutable
➔ Can only be built through deterministic transformations (textFile, map,
filter, join, …)
Efficient fault recovery using lineage
➔ Recompute lost partitions on failure
➔ No cost if nothing fails
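
The idea of lineage-based recovery can be illustrated with a toy model in plain Scala. This is only a sketch under simplifying assumptions, not Spark's implementation: the names `Lineage`, `Source`, `Mapped`, `Filtered` and `LineageDemo` are hypothetical, and "partitions" are just in-memory vectors. Each derived dataset remembers its parent and the deterministic function applied, so a lost partition can be rebuilt on demand without touching the others.

```scala
// Toy model of lineage-based recovery; all names here are hypothetical.
sealed trait Lineage[A] {
  def numPartitions: Int
  // Recompute a single partition from its lineage.
  def compute(partition: Int): Vector[A]
}

// Base dataset: partitions can always be re-read from stable storage.
final case class Source[A](partitions: Vector[Vector[A]]) extends Lineage[A] {
  def numPartitions: Int = partitions.length
  def compute(partition: Int): Vector[A] = partitions(partition)
}

// Dataset derived from a parent by a deterministic per-element function.
final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
}

// Dataset derived from a parent by a deterministic predicate.
final case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
}

object LineageDemo {
  val source: Lineage[Int] = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
  val derived: Lineage[Int] =
    Filtered(Mapped(source, (x: Int) => x * 10), (x: Int) => x > 20)

  // Materialize every partition, as an in-memory cache on the workers would.
  def materialize(): Map[Int, Vector[Int]] =
    (0 until derived.numPartitions).map(i => i -> derived.compute(i)).toMap

  // Rebuild only the lost partition from lineage; surviving partitions are reused.
  def recover(cache: Map[Int, Vector[Int]], lost: Int): Map[Int, Vector[Int]] =
    (cache - lost) + (lost -> derived.compute(lost))
}
```

Because every step is deterministic, recomputing a partition yields the same data that was lost, and nothing extra is done while no failure occurs.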

RDD Operations
Transformations
➔ map/flatMap
➔ filter
➔ union/join/groupBy
➔ cache
…….
Actions
➔ collect/count
➔ save
➔ take
…….

Log Mining Example
Load error messages from a log into memory, then interactively search
for various patterns
val lines = spark.textFile("hdfs://...")              // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))      // Transformed RDD
val messages = errors.map(_.split('\t')(2))           // third tab-separated field
messages.persist()                                    // keep in memory after first use
messages.filter(_.contains("foo")).count              // Action
messages.filter(_.contains("bar")).count              // Action, served from the persisted RDD
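
To see what each step of the log mining example yields, the same chain can be run on an ordinary Scala collection. This is a minimal sketch that uses no Spark at all; the sample log lines and the object name `LogMiningSketch` are made up for illustration, and the lines are assumed to be tab-separated as level, timestamp, message.

```scala
object LogMiningSketch {
  // Made-up, tab-separated log lines: level, timestamp, message.
  val lines: Seq[String] = Seq(
    "ERROR\t10:00\tfoo failed to start",
    "INFO\t10:01\tbar is healthy",
    "ERROR\t10:02\tbar timed out",
    "ERROR\t10:03\tfoo and bar disagree"
  )

  // Keep only the error lines.
  val errors: Seq[String] = lines.filter(_.startsWith("ERROR"))

  // Extract the third tab-separated field, i.e. the message text.
  val messages: Seq[String] = errors.map(_.split('\t')(2))

  // Count the messages matching each pattern.
  val fooCount: Int = messages.count(_.contains("foo"))
  val barCount: Int = messages.count(_.contains("bar"))
}
```

With an RDD the filter and map steps would only be recorded until an action such as count runs; on a local collection they execute immediately, which is the main difference to keep in mind.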

What is Spark Streaming?
Framework for large scale stream processing
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.

Overview
Run a streaming computation as a series of very small, deterministic batch jobs
- Spark Streaming chops up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
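
The micro-batching idea described above can be sketched in plain Scala without Spark. Everything here is an assumption made for illustration: the `Event` type, the `discretize` and `wordCount` helpers, and the sample events are hypothetical. The sketch groups timestamped events into fixed intervals and then applies one ordinary, deterministic batch function to each group.

```scala
object MicroBatchSketch {
  // A timestamped record arriving on the live stream (hypothetical type).
  final case class Event(timestampMs: Long, payload: String)

  // Chop the stream into consecutive batches of `intervalMs` milliseconds.
  // Intervals with no events produce no batch in this simplified sketch.
  def discretize(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  // The "batch job": a plain function over the contents of one batch.
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.payload.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  val events: Seq[Event] = Seq(
    Event(100L, "a b"),
    Event(900L, "a"),
    Event(1200L, "b b"),
    Event(2500L, "c")
  )

  // One result per batch, in batch order.
  val results: Seq[Map[String, Int]] = discretize(events, 1000L).map(wordCount)
}
```

The point of the design is that the per-batch function knows nothing about streaming, so the same batch code and the same recovery mechanism can be reused for every interval.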

Key Concepts
➔ DStream – sequence of RDDs representing a stream of data
- Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
➔ Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
➔ Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
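
A windowed, stateful operation combines several consecutive batches before computing a result. The following plain-Scala sketch models that with a sequence of batches and a sliding window measured in number of batches. It is a simplification: the helper name `countByValueOverWindows` and the sample data are hypothetical, and in Spark Streaming itself window length and slide are given as time durations rather than batch counts.

```scala
object WindowSketch {
  // Each inner Seq stands for one batch of the stream.
  val batches: Seq[Seq[String]] = Seq(
    Seq("#a", "#b"),
    Seq("#a"),
    Seq("#c", "#a"),
    Seq("#b")
  )

  // Count each distinct value over windows of `windowLen` batches,
  // moving forward `slide` batches at a time.
  def countByValueOverWindows(
      bs: Seq[Seq[String]],
      windowLen: Int,
      slide: Int
  ): Seq[Map[String, Int]] =
    bs.sliding(windowLen, slide)
      .map { window =>
        window.flatten
          .groupBy(identity)
          .map { case (value, occurrences) => value -> occurrences.size }
      }
      .toVector
}
```

Calling `WindowSketch.countByValueOverWindows(WindowSketch.batches, 2, 1)` produces one count map per window position, each covering two adjacent batches.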

Eg: Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))   // Transformation
hashTags.saveAsHadoopFiles("hdfs://...")   // Output operation
Example hashtags: #Ebola, #India, #Mars ...
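
The body of the flatMap in the example above is ordinary Scala and can be tried on plain strings. The sketch below assumes nothing about the Twitter API: the helper `hashTags` and the sample tweet texts are made up, and a `Seq[String]` stands in for one batch of tweet texts.

```scala
object HashtagSketch {
  // Split each text on spaces and keep only the tokens that start with '#'.
  def hashTags(texts: Seq[String]): Seq[String] =
    texts.flatMap(_.split(" ").filter(_.startsWith("#")))

  // Made-up tweet texts standing in for one batch of the stream.
  val batch: Seq[String] = Seq(
    "Watching the #Mars landing",
    "no tags here",
    "#India wins #cricket"
  )
}
```

Because flatMap flattens the per-tweet arrays, a tweet with no hashtags contributes nothing and a tweet with several contributes one element per hashtag.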
