
SENg5302 - Chapter VI: Big Data Analytics with Apache Spark
What is Spark
Apache Spark is an open-source data processing framework for performing big data
analytics on a distributed computing cluster.
It supports a wide variety of operations beyond the Map and Reduce functions.
It provides concise and consistent APIs in Scala, Java and Python.
Spark is written in the Scala programming language and runs on the JVM.

It currently supports the following programming languages for developing Spark applications:
Scala (default)
Java
Python
R
Evolution of Apache Spark
• Spark began as one of Hadoop's sub-projects, developed in 2009 in UC
Berkeley's AMPLab by Matei Zaharia. It was open-sourced in
2010 under a BSD license and donated to the Apache
Software Foundation in 2013; Apache Spark has been a
top-level Apache project since February 2014. Today, most
organizations across the world have incorporated Apache
Spark to empower their Big Data applications.
Spark Overview
Spark was introduced by the Apache Software Foundation to
speed up the Hadoop computational software process.

Contrary to common belief, Spark is not a modified
version of Hadoop and is not really dependent on
Hadoop, because it has its own cluster management.
Hadoop is just one of the ways to implement Spark.

Spark uses Hadoop in two ways – one is storage and the
second is processing. Since Spark has its own cluster
management computation, it uses Hadoop for
storage purposes only.
Spark Overview contd…
Apache Spark has become a key cluster computing
framework that has set the world of big data on fire.
It is a more accessible, powerful, and capable data
tool for dealing with a variety of big data challenges.
Apache Spark is a framework that is supported
in Scala, Python, R Programming, and Java.
Below are different implementations of Spark:
Spark – default interface for Scala and Java
PySpark – Python interface for Spark
SparklyR – R interface for Spark.
Spark Overview contd…
It is a general engine for Big Data analysis,
processing, and computations. It provides several
advantages over MapReduce: it is faster, easier to
use, and runs virtually everywhere.

Its built-in tools for SQL, Machine Learning,
and streaming make it very popular and
one of the most in-demand tools in the IT industry. Spark is
written in Scala.
Apache Spark overview contd..
Apache Spark is a lightning-fast cluster computing
technology, designed for fast computation. It is based on
Hadoop MapReduce and extends the MapReduce model
to use it efficiently for more types of computations,
including interactive queries and stream processing. The
main feature of Spark is its in-memory cluster
computing, which increases the processing speed of an
application.

Spark is designed to cover a wide range of workloads
such as batch applications, iterative algorithms,
interactive queries and streaming. Apart from
supporting all these workloads in a single
system, it reduces the management burden of
maintaining separate tools.
Is Spark available as a managed service from cloud providers?
Spark on Azure
As part of its big data capabilities and Databricks partnership, Azure provides Apache Spark. HDInsight, Azure
Databricks, and Azure Synapse Analytics are among the Azure tools and services that integrate with Apache Spark.
Spark on AWS
AWS (Amazon Web Services) includes Apache Spark in its big data services. Amazon EMR (Elastic MapReduce), AWS Glue,
and Amazon SageMaker are all services and tools that connect with Apache Spark.
• Amazon EMR is a fully managed big data platform that includes Apache Spark as
well as the Hadoop, Hive, and Presto big data processing engines.
Spark on GCP
Google Cloud Platform (GCP) includes Apache Spark in its big data offerings Cloud Dataproc and Cloud Dataflow.
Cloud Dataproc is a fully managed big data platform which includes Apache Spark and other big data processing
engines such as Hadoop and Hive.
What is In-Memory Processing?

 In-memory processing is the practice of taking action on
data entirely in computer memory (e.g., in RAM).
 This is in contrast to other techniques of processing data
which rely on reading and writing data to and from slower
media such as disk drives (HDDs) or SSDs.
 In-memory processing typically implies large-scale
environments where multiple computers are pooled
together so their collective RAM can be used as a large and
fast storage medium.
 Since the storage appears as one big, single allocation of
RAM, large data sets can be processed all at once, versus
processing only the data sets that fit into the RAM of a single
computer. This is often done using a technology known as in-
memory data grids (IMDG).
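As a minimal Scala sketch of this idea in Spark (the input path is an assumption), caching keeps a dataset in the cluster's collective RAM so that repeated computations avoid re-reading from disk:

val logs = sc.textFile("/path/to/large_logs.txt")         // hypothetical input path
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()                                            // ask Spark to keep this RDD in executor memory
errors.count()                                            // first action reads from disk and fills the cache
errors.filter(line => line.contains("timeout")).count()   // later actions reuse the in-memory data

Caching like this is what lets iterative and interactive workloads run much faster than re-reading the data from HDFS on every pass.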
Limitations of MapReduce in Hadoop
Why choose Apache Spark over Hadoop?
Speed – Apache Spark: 100 times faster for in-memory computations and about ten times faster on disk; Hadoop: better than traditional systems.
Easy to manage – Apache Spark: everything runs in the same cluster; Hadoop: different engines are required for different tasks.
Real-time analysis – Apache Spark: supports live data streaming; Hadoop: only efficient for batch processing.
Features of Apache Spark
Speed − Spark helps run an application in a Hadoop
cluster up to 100 times faster in memory, and 10 times
faster when running on disk. This is possible by
reducing the number of read/write operations to disk;
it stores the intermediate processing data in memory.

Supports multiple languages − Spark provides built-in
APIs in Java, Scala, and Python, so you can write
applications in different languages. Spark comes with
80 high-level operators for interactive querying.

Advanced Analytics − Spark not only supports 'Map'
and 'Reduce'. It also supports SQL queries, streaming
data, machine learning (ML), and graph algorithms.
Features of Apache Spark
Limitations of Apache Spark
How does Apache Spark fit in the Hadoop
Ecosystem?
• Apache Spark can be used together with Hadoop or
Hadoop YARN. It can be deployed on
Hadoop in three ways: Standalone, YARN, and
SIMR.
Spark Built on Hadoop
Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly.
Here, Spark and MapReduce run side by side to cover all Spark jobs on the
cluster.
Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN
without any pre-installation or root access required. It helps to integrate Spark into
the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.
Apache Spark Advantages
•Spark is a general-purpose, in-memory, fault-
tolerant, distributed processing engine that allows you to
process data efficiently in a distributed fashion.
•Applications running on Spark can be up to 100x faster
than traditional systems.
•You will get great benefits from using Spark for data
ingestion pipelines.
•Using Spark we can process data from Hadoop
HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and
many other file systems.
•Spark is also used to process real-time data
using Spark Streaming and Kafka.
•Using Spark Streaming you can also stream files from
the file system as well as from a socket.
• Spark natively has machine learning and graph libraries.
Why should we consider using Hadoop and Spark
together?
1. Efficient storage and cluster management
2. Easy resource management and task scheduling across a cluster
3. Disaster recovery capabilities
4. Better data security
5. A fast computation engine like Spark on top of Hadoop's storage
6. Runs virtually everywhere
Industries using Apache Spark(Use Cases)
Apache Spark Components
Components of Spark cont..
i. Spark Core
It is the kernel of Spark, which provides an execution
platform for all the Spark applications. It is a
generalized platform to support a wide array of
applications.
ii. Spark SQL
It enables users to run SQL/HQL queries on top of
Spark. Using Apache Spark SQL, we can process
structured as well as semi-structured data. It also
provides an engine for Hive to run unmodified queries
up to 100 times faster on existing deployments.
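As a minimal Scala sketch of Spark SQL (the JSON path, view name, and column names are assumptions), a DataFrame can be registered as a temporary view and queried with ordinary SQL:

val people = spark.read.json("/path/to/people.json")   // hypothetical structured/semi-structured source
people.createOrReplaceTempView("people")               // expose the DataFrame to the SQL engine
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()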
Components of Spark cont..
iii. Spark Streaming
Apache Spark Streaming enables powerful interactive and analytical
applications across live streaming data. The live streams are converted into
micro-batches which are executed on top of Spark Core (a minimal code sketch
follows this list).
iv. Spark MLlib
It is the scalable machine learning library which delivers both efficiency and
high-quality algorithms. Apache Spark MLlib is one of the hottest
choices for Data Scientists due to its capability of in-memory data processing,
which improves the performance of iterative algorithms drastically.
v. Spark GraphX
Apache Spark GraphX is the graph computation engine built on top of Spark
that enables processing graph data at scale.

vi. SparkR
It is an R package that gives a lightweight frontend to use Apache Spark from R.
It allows data scientists to analyze large datasets and interactively run jobs on
them from the R shell. The main idea behind SparkR was to explore different
techniques to integrate the usability of R with the scalability of Spark.
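Referring back to the Spark Streaming component above, here is a minimal Scala sketch of the micro-batch model (the socket host, port, and batch interval are assumptions):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))          // every 5 seconds of data becomes one micro-batch
val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical live source
val errorCounts = lines.filter(_.contains("ERROR")).count()
errorCounts.print()                                     // runs on top of Spark Core for each micro-batch

ssc.start()                                             // start receiving and processing the stream
ssc.awaitTermination()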
Apache Spark Architecture
Working Spark Architecture
The Apache Spark framework uses a master-slave architecture that consists of a driver, which
runs on a master node, and many executors that run across the worker nodes in the cluster.
Apache Spark can be used for batch processing as well as real-time processing.

The Driver Program in the Apache Spark architecture calls the main program of an application
and creates the SparkContext. A SparkContext contains all the basic functionalities.

The Spark Driver contains various other components such as the DAG Scheduler, Task Scheduler,
Backend Scheduler, and Block Manager, which are responsible for translating the user-written
code into jobs that are actually executed on the cluster.
Working Spark Architecture cont..
The Spark Driver and SparkContext collectively watch over the job
execution within the cluster. The Spark Driver works with the Cluster
Manager to manage various other jobs. The Cluster Manager does the
resource allocation work. The job is then split into multiple
smaller tasks which are further distributed to worker nodes.

Whenever an RDD is created in the SparkContext, it can be
distributed across many worker nodes and can also be cached
there.
Worker nodes execute the tasks assigned by the Cluster Manager
and return the results to the SparkContext.

An executor is responsible for the execution of these tasks. The
lifetime of executors is the same as that of the Spark application. If
we want to increase the performance of the system, we can increase
the number of workers so that the jobs can be divided into more
logical portions.
Spark Architecture Overview cont..
Apache Spark has a well-defined layered
architecture where all the Spark components and
layers are loosely coupled. This architecture is
further integrated with various extensions and
libraries. Apache Spark Architecture is based on
two main abstractions:

• Resilient Distributed Dataset (RDD)

• Directed Acyclic Graph (DAG)
Spark Eco-Systems
As you can see, Spark comes packed with high-level libraries, including
support for R, SQL, Python, Scala, Java, etc. These standard libraries
enable seamless integration in complex workflows. On top of this, it also
allows various sets of services to integrate with it, such as MLlib,
GraphX, SQL + DataFrames, Streaming services, etc.
Resilient Distributed Dataset(RDD)
•Resilient: Fault-tolerant and capable of rebuilding
data on failure
•Distributed: Data is distributed among the multiple
nodes in a cluster
• Dataset: Collection of partitioned data with values
An RDD is a read-only, partitioned collection of records (like a DFS)
but with a record of how the dataset was created as a
combination of transformations from other dataset(s).
RDD contd..
An RDD (Resilient Distributed Dataset) is an immutable
distributed collection of objects. An RDD is a logical
reference to a dataset which is partitioned across many
server machines in the cluster. RDDs are immutable
and are self-recovering in case of failure. An RDD can
come from any data source, e.g. text files, a database via
JDBC, etc.
Creating an RDD

val rdd = sc.textFile("/some_file", 3)   // the argument 3 specifies the number of partitions
val lines = sc.parallelize(List("this is", "an example"))
Partitions

An RDD is a collection of data; if the data cannot fit
into a single node, it is partitioned across
various nodes. This means that the more partitions
there are, the more parallelism is possible. The partitions of
an RDD are distributed across all the nodes in the
network.
RDDs Operations(Transformations and Actions)

• There are two types of operations that you can perform on
an RDD: Transformations and Actions.
• A Transformation applies some function to an RDD and creates
a new RDD; it does not modify the RDD that you apply the
function to (remember that RDDs are immutable). Also, the
new RDD keeps a pointer to its parent RDD.
• Transformations are lazy operations on an RDD that create
one or many new RDDs,
e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
• At a high level, there are two kinds of transformations that can be
applied to RDDs, namely narrow transformations and
wide transformations. Wide transformations basically result
in stage boundaries, as illustrated in the sketch below.
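Here is the sketch referred to above: a minimal Scala example (the sample data is an assumption) contrasting a narrow and a wide transformation:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)   // two partitions
val doubled = pairs.mapValues(v => v * 2)    // narrow: each partition is processed independently, no shuffle
val totals  = pairs.reduceByKey(_ + _)       // wide: values with the same key must be shuffled together,
                                             // which introduces a stage boundary
totals.collect().foreach(println)            // an action; only now are the transformations executed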
RDD-Transformations
• Narrow transformation — doesn't require the data to
be shuffled across the partitions; for example, map,
filter, etc.
• Wide transformation — requires the data to be
shuffled; for example, reduceByKey, etc.
• By applying transformations you incrementally build
an RDD lineage with all the parent RDDs of the final
RDD(s). Transformations are lazy, i.e. they are not executed
immediately. Transformations are only executed after an
action is called.

val rdd = sc.textFile("spam.txt")
val filtered = rdd.filter(line => line.contains("money"))
filtered.count()
Transformations contd..
sc.textFile() and rdd.filter() do not get executed
immediately; they only get executed once you call an
action on the RDD — here filtered.count(). An
action is used either to save a result to some location or
to display it. You can also print the RDD lineage
information by using
filtered.toDebugString (filtered is the RDD here).
RDDs can also be thought of as a set of instructions
that has to be executed, the first instruction being the load
instruction.
RDD
What is DAG in Apache Spark?
A DAG is a finite directed graph with no directed cycles. There are finitely
many vertices and edges, where each edge is directed from one vertex to
another. It contains a sequence of vertices such that every edge is
directed from earlier to later in the sequence. It is a strict generalization
of the MapReduce model. DAG operations can do better global
optimization than other systems like MapReduce. The picture of the DAG
becomes clearer in more complex jobs. Apache Spark's DAG view allows the
user to dive into a stage and expand the detail of any stage. In the
stage view, the details of all RDDs belonging to that stage are
expanded. The scheduler splits the Spark RDD operations into stages
based on the various transformations applied.
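As a minimal Scala sketch (the input and output paths are assumptions), the classic word count shows how the scheduler derives stages: the flatMap and map steps are pipelined into one stage, and reduceByKey introduces a shuffle and therefore a stage boundary.

val counts = sc.textFile("/path/to/input.txt")   // hypothetical input path
  .flatMap(line => line.split("\\s+"))           // pipelined with the next operator in the same stage
  .map(word => (word, 1))
  .reduceByKey(_ + _)                            // shuffle: the DAG scheduler starts a new stage here
counts.saveAsTextFile("/path/to/wordcounts")     // hypothetical output path; this action triggers execution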
Need of Directed Acyclic Graph in Spark
• The limitations of Hadoop MapReduce became a key reason to introduce the
DAG in Spark. The computation through MapReduce proceeds in three steps:
• The data is read from HDFS.
• Then Map and Reduce operations are applied.
• The computed result is written back to HDFS.
• Each MapReduce operation is independent of the others, and Hadoop has
no idea which MapReduce job will come next. Sometimes, for some
iterations, it is irrelevant to read and write back the intermediate results between
two MapReduce jobs. In such cases, the memory in stable storage (HDFS) or
disk gets wasted.
• In multi-step pipelines, every job is blocked from the beginning until the
completion of the previous job. As a result, complex computation can require a long
time even with small data volumes.
• In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive
computation stages is formed. In this way, the execution plan is optimized,
e.g. to minimize shuffling data around. In MapReduce, by contrast, this is done
manually by tuning each MapReduce step.
How DAG works in Spark?
 The interpreter is the first layer: using a Scala interpreter,
Spark interprets the code with some modifications.
 Spark creates an operator graph when you enter your code in the
Spark console.
 When we call an action on a Spark RDD, at a high level Spark
submits the operator graph to the DAG Scheduler.
 The DAG Scheduler divides the operators into stages of tasks. A stage
contains tasks based on the partitions of the input data. The DAG
scheduler pipelines operators together; for example, map operators are
scheduled in a single stage.
 The stages are passed on to the Task Scheduler, which launches the tasks
through the cluster manager. The dependencies between stages are
unknown to the Task Scheduler.
 The workers execute the tasks on the slave nodes.
 The image below briefly describes the steps of how the DAG
works in Spark job execution.
INTERNALS OF JOB EXECUTION IN
SPARK
Advantages of DAG in Spark
There are multiple advantages of the Spark DAG:
 A lost RDD can be recovered using the Directed Acyclic Graph.
 MapReduce has just two operations, map and reduce, but in a DAG we have
multiple levels. So for executing SQL queries, the DAG is more flexible.
 The DAG helps to achieve fault tolerance; thus we can recover lost data.
 It can do better global optimization than a system like Hadoop MapReduce.
Working of the DAG Optimizer in Spark

Apache Spark optimizes the DAG by rearranging and combining operators
wherever possible. For example, if we submit a Spark job which has a map()
operation followed by a filter() operation, the DAG Optimizer will rearrange the
order of these operators, since filtering reduces the number of records that
have to undergo the map operation.

The DAG in Apache Spark is an alternative to MapReduce. It is a programming
style used in distributed systems. In MapReduce, we just have two functions (map and
reduce), while a DAG has multiple levels that form a tree structure. Hence, DAG
execution is faster than MapReduce because intermediate results are not written to
disk.
Apache Spark is a framework
• Apache Spark is a framework that is supported in
Scala, Python, R Programming, and Java. Below are
different implementations of Spark.
• Spark – Default interface for Scala and Java
• PySpark – Python interface for Spark
• SparklyR – R interface for Spark
Spark Shell:
Apache Spark provides an interactive spark-shell. It
helps Spark applications to run easily from the command
line of the system. Using the Spark shell we can
run/test our application code interactively. Spark can
read from many types of data sources, so that it can
access and process large amounts of data.
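For example, after launching the shell with ./bin/spark-shell from the Spark installation directory, the pre-created sc and spark objects can be used interactively (the file path below is an assumption):

val readme = sc.textFile("/path/to/README.md")   // hypothetical local file
readme.filter(line => line.contains("Spark")).count()
readme.first()                                   // quick look at the first line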
Cluster Manager Types
As of this writing, Spark supports the cluster managers below:

Standalone – a simple cluster manager included with Spark that
makes it easy to set up a cluster.
Apache Mesos – Mesos is a cluster manager that can also run
Hadoop MapReduce and Spark applications.
Hadoop YARN – the resource manager in Hadoop 2. This is the most
commonly used cluster manager.
Kubernetes – an open-source system for automating deployment,
scaling, and management of containerized applications.
local – not really a cluster manager, but still worth mentioning, as we
use "local" for master() in order to run Spark on
your laptop/computer.
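A minimal Scala sketch (the application name is an assumption) of how the master setting selects a cluster manager when building a SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")   // hypothetical application name
  .master("local[*]")              // run locally on all cores; alternatives include "yarn",
                                   // "mesos://host:5050", "spark://host:7077", or "k8s://https://host:443"
  .getOrCreate()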
DataFrame Spark with Basic Examples

A DataFrame is a distributed collection of data
organized into named columns. It is conceptually
equivalent to a table in a relational database or a data
frame in R/Python, but with richer optimizations
under the hood. DataFrames can be constructed from
a wide array of sources such as structured data files,
tables in Hive, external databases, or existing RDDs.
DataFrame creation
The simplest way to create a DataFrame is from a Seq
collection. A DataFrame can also be created from an
RDD and by reading files from several sources.
You can create a DataFrame by using the createDataFrame()
function of the SparkSession.
Using createDataFrame()

// sample data (illustrative values)
val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
               ("Anna", "Rose", "", "2000-05-01", "F", 4100))
val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(columns: _*)

Since DataFrames are a structured format containing column names,
df.show() displays the first 20 rows of the DataFrame.
Using toDF() function

•Once we have an RDD, let's use toDF() to create a DataFrame in Spark.
By default, it creates column names "_1" and "_2", as we have
two columns in each row.
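A minimal Scala sketch (the sample data is an assumption) of toDF() with and without explicit column names:

import spark.implicits._   // brings the toDF() conversion into scope

val rdd = spark.sparkContext.parallelize(Seq(("Java", 20000), ("Python", 100000)))
val dfDefault = rdd.toDF()                      // columns are named _1 and _2 by default
val dfNamed   = rdd.toDF("language", "users")   // or supply explicit column names
dfNamed.show()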
Create Spark DataFrame from CSV
• In all the above examples, you have learned how Spark creates a DataFrame from RDDs
and data collection objects. In real applications these are less used; in this and the
following sections, you will learn how to create a DataFrame from data
sources like CSV, text, JSON, Avro, etc.

• Spark provides an API to read delimited files, such as comma-, pipe-, or tab-
separated files, and it also provides several options for handling headers,
double quotes, data types, etc. A minimal CSV example follows.
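Here is a minimal Scala sketch of reading a CSV file (the path and options are assumptions):

val csvDf = spark.read
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // let Spark infer the column data types
  .option("delimiter", ",")        // use "|" or "\t" for pipe- or tab-separated files
  .csv("/path/to/people.csv")      // hypothetical path

csvDf.printSchema()
csvDf.show()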

 Ex: JavaScript Object Notation (JSON format)
{
"name": "Mohamed",
"age": 35,
"isProfessor": true,
"subjects": ["AI", "ML", "NLP"]
}
Creating from TXT,JSON
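A minimal Scala sketch (the file paths are assumptions) of reading text and JSON sources into DataFrames:

// Plain text: each line becomes one row in a single column named "value".
val txtDf = spark.read.text("/path/to/notes.txt")     // hypothetical path

// JSON: by default Spark expects one JSON object per line; use multiLine for pretty-printed files.
val jsonDf = spark.read
  .option("multiLine", "true")
  .json("/path/to/professor.json")                    // hypothetical path, e.g. the record shown above
jsonDf.printSchema()
jsonDf.show()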
Creating from an XML file
Creating from Hive
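A minimal Scala sketch (the database and table names are assumptions) of creating a DataFrame from a Hive table; Hive support must be enabled on the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveExample")    // hypothetical application name
  .enableHiveSupport()       // lets Spark SQL read tables registered in the Hive metastore
  .getOrCreate()

val empDf = spark.sql("SELECT * FROM default.employees")   // hypothetical database.table
empDf.show()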
Create DataFrame from HBase table
UNIT-VI Conclusion –Apache Spark
As a result, we have seen every aspect of Apache Spark: what
Apache Spark programming is and the definition of Spark, the history of
Spark, why Spark is needed, the components of Apache Spark,
Spark RDDs, features of Spark RDDs, Spark Streaming, features
of Apache Spark, limitations of Apache Spark, and Apache Spark
use cases.
It provides a collection of technologies that increase the value of
big data and permit new Spark use cases. It gives us a unified
framework for creating, managing and implementing Spark big
data processing requirements.
In addition to MapReduce-style operations, one can also
implement SQL queries and process streaming data through
Spark, which were drawbacks of Hadoop 1. With Spark,
developers can use Spark features either on a stand-
alone basis or combine them with MapReduce programming
techniques.
