Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark
Fourth Early Release Edition (2021-09-10)
by Mahmoud Parsian
Copyright © 2021 Mahmoud Parsian. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].
NOTE
Complete programs for this chapter are presented in the book's GitHub repository, under Chapter 1.
Spark's Ecosystem
Spark's ecosystem, presented in Figure 1-4, has three main components:
Environments:
Spark can run almost anywhere and integrates well with other environments.
Applications:
Spark integrates well with a variety of big data platforms and applications.
Data Sources:
Spark can read data from, and write data to, many data sources.
Spark Architecture
Spark is an open-source cluster-computing engine for data-intensive workloads: it manages and coordinates the execution of tasks on data across a cluster of computers. When your data is small, a single computer can analyze it in a reasonable amount of time. When your data is huge, a single computer may take far too long to store, analyze, and process it, or may not be able to handle it at all. For big data of that scale, we can use Spark.
A high-level Spark architecture is presented in Figure 1-4. Informally, a Spark cluster comprises a master node (labeled "cluster manager"), which is responsible for managing your application, and a set of "worker" nodes, which are responsible for executing the tasks submitted by your Spark application (the application you want to run on the Spark cluster).
Figure 1-4. Spark Architecture
Figure 1-8 shows the real power of Spark: you can use several languages (Python, Scala, Java, and R) to write your Spark applications, you can use its rich libraries (such as Spark SQL, Spark Streaming, MLlib for machine learning, GraphX, GraphFrames, …) to solve big data problems, and you can read and write data from and to many data sources.
In this book, you'll interface Spark with Python through PySpark, the Spark Python API that exposes the Spark programming model to Python. PySpark has a comprehensive API (comprised of packages, modules, classes, and methods) for accessing the Spark API. It is important to note that, for this book, all Spark APIs, packages, modules, classes, and methods are PySpark specific. For example, when I refer to the SparkContext class, I am referring to pyspark.SparkContext (a Python class defined in the pyspark package), and when I refer to SparkSession, I am referring to pyspark.sql.SparkSession (a Python class defined in the pyspark.sql module).
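As a quick reference, here are the import paths for those two classes (a minimal illustration of the package locations named above):

from pyspark import SparkContext        # pyspark.SparkContext
from pyspark.sql import SparkSession    # pyspark.sql.SparkSession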
To understand PySpark better, we need to understand Spark, since PySpark is a Python API for Spark. Spark is based on a master/worker architecture: a Spark cluster typically has one "master" (controller and manager) node and a set of "worker" (executor) nodes. The number of worker nodes can range from one to hundreds or even thousands, depending on your business and project requirements. You can also run Spark on a standalone server (such as a MacBook, Linux, or Windows machine), although in production environments Spark typically runs on a cluster of Linux servers. To run a Spark program, you need access to a Spark cluster (comprised of one or many nodes) and a "driver program," which talks to a single coordinator called the master (also called the "cluster manager") that manages the workers in which executors run. For this book, all driver programs will be written in PySpark.
To understand the Spark architecture, you'll need to understand the following two classes: SparkSession and SparkContext. To do anything useful with a Spark cluster, the first thing we have to do is create an instance of SparkSession; from the SparkSession we can then access the SparkContext object as an attribute. In other words, once you create an instance of SparkSession, a SparkContext becomes available inside it as SparkSession.sparkContext.
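As a minimal sketch of that relationship (the application name and master URL below are illustrative values, not taken from the text):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; appName and master are illustrative values.
spark = SparkSession.builder \
    .appName("demo-app") \
    .master("local[*]") \
    .getOrCreate()

# The SparkContext is available as an attribute of the SparkSession.
sc = spark.sparkContext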
PySpark defines SparkSession in the pyspark.sql module. Once the SparkSession (here named spark) and its SparkContext (here named sc) have been created, you can inspect them:
# to debug SparkSession
print(spark)
# to debug SparkContext
print(sc)
Spark Usage
Is Spark used by notable companies? Here are some examples of how big companies use Spark.
Facebook processes 60 TB of data on a daily basis; Spark and MapReduce are at the heart of the algorithms that process its production data.
Viacom, with its 170 cable, broadcast, and online networks in around 160 countries, is transforming itself into a data-driven enterprise, collecting and analyzing petabytes of network data to increase viewer loyalty and revenue.
Illumina ingests thousands of genomes (big data, by which we mean data that cannot fit on, or be processed by, a single server) using Spark, PySpark, MapReduce, and distributed algorithms.
IBM and many social network and search engine companies (such as Google, Twitter, and Facebook) use Spark, MapReduce, and distributed algorithms on a daily basis to scale out their computations and operations.
Spark in a Nutshell
There are four main reasons (as noted on Spark's website) to choose Spark:
1. Speed
Spark programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing (using memory is much faster than using disk for computation-intensive and I/O-intensive work). A DAG (directed acyclic graph) in Spark is a set of nodes and edges, where the nodes represent RDDs and the edges represent the operations (transformations or actions) to be applied to those RDDs. When an action is called, the resulting DAG is submitted to the DAG scheduler, which splits the graph into stages of tasks. Spark's DAG visualization is presented in Figure 1-8.
Figure 1-8. Spark DAG Visualization
2. Ease of Use
You can write Spark applications quickly in Java, Scala, Python, R, and SQL.
3. Generality
Spark is a general-purpose compute engine that can be used to solve a wide range of problems. It combines SQL, streaming, and complex analytics, and it powers a set of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine all of these libraries seamlessly in the same application (see the sketch after this list). There are also many external libraries for Spark, such as GraphFrames, that make data and graph processing a breeze.
4. Runs Everywhere
Spark runs on Hadoop, on Mesos, standalone, or in the cloud (for example, on AWS Glue or Google Cloud). The Spark stack is presented in Figure 1-9. Spark can access diverse data sources, including text files, relational databases, HDFS, Cassandra, HBase, and S3.
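To illustrate the generality point, here is a minimal sketch that mixes the DataFrame API and Spark SQL in a single application (the file path and column names are hypothetical, not taken from the text):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generality-demo").getOrCreate()

# Read a CSV file into a DataFrame (path and schema are illustrative).
df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)

# Use the DataFrame API to filter rows...
adults = df.filter(df.age > 21)

# ...and use Spark SQL against the same data in the same application.
df.createOrReplaceTempView("people")
top_earners = spark.sql("SELECT name, salary FROM people ORDER BY salary DESC")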
What is PySpark?
In a nutshell, PySpark is the combination of Apache Spark and the Python programming language.
Figure 1-10. What is PySpark
Power of PySpark
I'll show the amazing power of PySpark with a simple example. Let's say that we have lots of data records measuring URL visits by users (collected by a search engine from many web servers) in the following format:
<URL-address><,><frequency>
http://mapreduce4hackers.com,19779
http://mapreduce4hackers.com,31230
http://mapreduce4hackers.com,15708
...
https://www.illumina.com,87000
https://www.illumina.com,58086
...
Let's say we want to find the average, the median, and the standard deviation of the visit counts per key (URL-address). Another requirement is to drop any record whose length is 5 characters or less (such a record is probably a malformed URL). It is easy to express an elegant solution for this in PySpark. This workflow is illustrated in Figure 1-12.
Figure 1-12. Simple Workflow to Compute Mean, Median, Standard Deviation
First, let's create some basic Python functions that will help us solve this simple problem. Since we are going to work with (key, value) pairs, our first function, create_pair(), accepts a single record of the form <URL-address><,><frequency> and returns a (key, value) pair (which will enable us to "group by" the key field later on), where the key is a URL-address and the value is the associated frequency.
A minimal create_pair() consistent with this description is sketched below (the complete program is in the book's GitHub repository):
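def create_pair(record):
    # record has the form "<URL-address>,<frequency>"
    tokens = record.split(",")
    url_address = tokens[0]
    frequency = int(tokens[1])
    return (url_address, frequency)

The workflow also needs a compute_stats() helper, which receives all of the frequencies for one URL-address and returns their mean, median, and standard deviation. One possible sketch, using Python's statistics module (an implementation choice, not dictated by the text):

import statistics

def compute_stats(frequencies):
    # frequencies is an iterable of integer visit counts for one URL-address
    values = list(frequencies)
    return (statistics.mean(values),
            statistics.median(values),
            statistics.pstdev(values))

With these two helpers in place, the whole workflow can be expressed as: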
# input_path = "s3://<bucket>/key"
input_path = "/tmp/myinput.txt"
results = spark \
    .sparkContext \
    .textFile(input_path) \
    .filter(lambda record: len(record) > 5) \
    .map(create_pair) \
    .groupByKey() \
    .mapValues(compute_stats)
The filter() step drops records whose length is 5 characters or less (that is, it keeps only records whose length is greater than 5).
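Note that every step above is a transformation, so nothing has executed yet; an action such as collect() triggers the computation. A minimal usage sketch:

# Trigger the computation and print the per-URL statistics.
for url, stats in results.collect():
    print(url, stats)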
For example, if you have a big cluster, then all 20,000 of your chunks (each chunk is called a partition) might be processed in one shot. But if you have a smaller cluster, then perhaps only 100 chunks can be processed independently and in parallel at a time; this process continues until all 20,000 partitions have been processed. Spark is an analytics and compute engine for parallel processing of data on a cluster. Parallelism in Spark allows developers to perform tasks on hundreds of servers in a cluster, in parallel and independently. Spark's data abstractions (the RDD and the DataFrame) operate on partitioned data; therefore, we can say that the partition is the main unit of parallelism in Spark.
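As a minimal sketch of how partitions surface in the PySpark API (the input path and partition counts are illustrative values):

# Ask Spark to create the RDD with at least 8 partitions.
rdd = spark.sparkContext.textFile("/tmp/myinput.txt", minPartitions=8)

# Inspect how many partitions Spark actually created.
print(rdd.getNumPartitions())

# Redistribute the data into 100 partitions (this triggers a shuffle).
rdd_100 = rdd.repartition(100)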
PySpark Architecture
The PySpark API is defined in the PySpark Documentation. Spark has APIs for Java, Python (PySpark), Scala, and R. Spark is built for cluster computing: a "master" server and a set of "worker" servers work together to enable big data computing. Typically, a Spark-based distributed algorithm runs on a cluster of connected computers. Spark, the MapReduce paradigm, and distributed computing enable high-performance computing using a set of commodity servers.
PySpark is built on top of Spark's Java API. Data is processed in Python and cached/shuffled in the Java Virtual Machine (JVM). (I will cover the shuffling concept in Chapter 2.) PySpark's high-level architecture is presented in Figure 1-13.
Figure 1-13. PySpark High-Level Architecture
PySpark's high-level architecture and data flow are illustrated in Figure 1.12.
DataFrame Example
Similar to an RDD, Spark's DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Spark's DataFrame is designed to make processing large data sets even easier by using named columns: it allows programmers to impose a structure onto a distributed collection of data, enabling higher-level abstractions. For example, a DataFrame makes it much easier to process CSV and JSON files than an RDD does.
A DataFrame example is provided below; this DataFrame has three columns:
DataFrame[name, age, salary]
name: String
age: Integer
salary: Integer
+-----+----+---------+
| name| age| salary|
+-----+----+---------+
| bob| 33| 45000|
| jeff| 44| 78000|
| mary| 40| 67000|
| ...| ...| ...|
+-----+----+---------+
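As a minimal sketch of how such a DataFrame could be built (the rows are the sample values from the table above, and the SparkSession is assumed to already exist as spark):

# Build a small DataFrame with the columns name, age, and salary.
rows = [("bob", 33, 45000), ("jeff", 44, 78000), ("mary", 40, 67000)]
df = spark.createDataFrame(rows, ["name", "age", "salary"])
df.show()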
To see the rows with the most confirmed cases at the top (a descending sort), we use the sort() function:
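The sort call itself is not shown in this excerpt; a sketch consistent with the output below, assuming the DataFrame is named cases, is:

from pyspark.sql.functions import desc

# Sort by the "confirmed" column in descending order and display the result.
cases.sort(desc("confirmed")).show()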
+-------+-------+-----------+--------------+---------+
|case_id|country| city|infection_case|confirmed|
+-------+-------+-----------+--------------+---------+
| C0001| USA| New York| contact| 175|
+-------+-------+-----------+--------------+---------+
...