06 Big Data
Apache Spark
What is Spark?
Apache Spark is an open-source data processing framework for performing Big Data analytics on a distributed computing cluster.
It supports a wide variety of operations beyond the plain Map and Reduce functions.
It provides concise and consistent APIs in Scala, Java, and Python.
Spark is written in the Scala programming language and runs on the JVM.
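As a quick illustration of how concise the API is, here is a minimal word-count sketch in Scala; the file path, application name, and local master setting are assumptions for illustration:

import org.apache.spark.sql.SparkSession

// Local SparkSession for illustration; on a real cluster the master comes from the deployment.
val spark = SparkSession.builder().appName("word-count").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Word count in a handful of lines: flatMap, map, and reduceByKey go beyond
// the plain Map and Reduce functions of classic MapReduce.
val counts = sc.textFile("input.txt")        // "input.txt" is a placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)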
• Amazon EMR is a fully managed big data platform that includes Apache Spark as well as the Hadoop, Hive, and Presto big data processing engines.
Spark on GCP
• Cloud Dataproc is a fully managed big data platform that includes Apache Spark and other big data processing engines such as Hadoop and Hive.
What is In-Memory Processing?
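In-memory processing in Spark means that intermediate results are kept in the executors' memory (for example via cache() or persist()) rather than being written to disk between steps, which is what speeds up iterative and interactive workloads. A minimal caching sketch, assuming a SparkContext named sc is in scope (as in spark-shell) and using a placeholder file path:

// Keep this RDD in executor memory so repeated actions reuse it
// instead of re-reading the file from disk each time.
val logs = sc.textFile("events.log")      // "events.log" is a placeholder path
  .filter(_.contains("ERROR"))
  .cache()                                // mark for in-memory storage

println(logs.count())   // first action: reads from disk and fills the cache
println(logs.count())   // second action: answered from memory, no re-read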
vi. SparkR
SparkR is an R package that provides a light-weight frontend for using Apache Spark from R. It allows data scientists to analyze large datasets and interactively run jobs on them from the R shell. The main idea behind SparkR was to explore different techniques for integrating the usability of R with the scalability of Spark.
Apache Spark Architecture
Working Spark Architecture
The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs on the master node, and many executors that run on the worker nodes across the cluster. Apache Spark can be used for batch processing as well as real-time processing.
The Driver Program in the Apache Spark architecture calls the main program of an application and creates the SparkContext, which provides all the basic functionality.
The Spark Driver contains several other components, such as the DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, which are responsible for translating user-written code into jobs that are actually executed on the cluster.
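A minimal sketch of the driver side, in the same script style as the later snippets; the application name and the local master URL are illustrative assumptions:

import org.apache.spark.sql.SparkSession

// The driver program creates the SparkSession, which wraps the SparkContext
// holding the basic functionality (job scheduling, broadcast variables, etc.).
val spark = SparkSession.builder()
  .appName("architecture-demo")   // assumed application name
  .master("local[*]")             // assumed master; on a real cluster this comes from the cluster manager
  .getOrCreate()

val sc = spark.sparkContext       // the SparkContext created by the driver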
Working Spark Architecture cont..
The Spark Driver and SparkContext collectively watch over the job execution within the cluster. The Spark Driver works with the Cluster Manager to manage the various jobs, and the Cluster Manager does the resource allocation work. Each job is then split into multiple smaller tasks, which are distributed to the worker nodes.
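A small sketch of how a job is split into tasks, reusing the sc from the sketch above; the partition count of 4 is an illustrative choice:

// One action produces one job; each stage of that job is split into one task per
// partition, and the tasks are distributed to the worker nodes by the scheduler.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)
println(s"partitions (tasks per stage): ${numbers.getNumPartitions}")
println(s"count = ${numbers.count()}")   // running this action triggers the job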
Using toDF() function
// Column names for the DataFrame
val columns = Seq("firstname","middlename","lastname","dob","gender","salary")
// 'data' is assumed to be a Seq of tuples matching these columns
val df = spark.createDataFrame(data).toDF(columns:_*)
Since DataFrames are a structured format that carries column names and data types, the contents are easy to inspect: df.show() displays the first 20 rows of the DataFrame.
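To make the snippet above fully runnable, a usage sketch with a hypothetical data value (all rows invented for illustration):

// Hypothetical sample rows matching the column names above.
val data = Seq(
  ("James", "",     "Smith",    "1991-04-01", "M", 3000),
  ("Anna",  "Rose", "Williams", "1993-08-17", "F", 4100)
)
val df = spark.createDataFrame(data).toDF(columns:_*)

df.printSchema()   // column names and inferred types
df.show()          // up to the first 20 rows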
• Spark by default provides an API to read delimited files, such as comma-, pipe-, and tab-separated files, and it also provides several options for handling a header row (present or absent), double quotes, data types, etc., as in the sketch below.
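A sketch of reading a delimited file with a few of these options; the pipe delimiter and file path are illustrative assumptions:

// Read a pipe-delimited file with a header row, letting Spark infer column types.
val csvDf = spark.read
  .option("header", "true")        // first line contains column names
  .option("delimiter", "|")        // default is ","; here a pipe-separated file
  .option("inferSchema", "true")   // infer data types instead of all-string columns
  .csv("people.txt")               // placeholder path

csvDf.printSchema()
csvDf.show()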