Learn PySpark
Build Python-based Machine Learning and Deep
Learning Models
Pramod Singh
Bangalore, Karnataka, India
Apress Standard
Trademarked names, logos, and images may appear in this book. Rather
than use a trademark symbol with every occurrence of a trademarked
name, logo, or image, we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark. The use in this publication
of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true
and accurate at the date of publication, neither the author nor the
editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no
warranty, express or implied, with respect to the material contained
herein.
1. Introduction to Spark
Pramod Singh
As this book is about Spark, it makes perfect sense to start the first
chapter by looking into some of Spark’s history and its different
components. This introductory chapter is divided into three sections. In
the first, I go over the evolution of data and how it has grown to its
current size. I'll touch on three key aspects of data. In the second
section, I delve into the internals of Spark and go over the details of its
different components, including its architecture and modus operandi.
The third and final section of this chapter focuses on how to use Spark
in a cloud environment.
History
The birth of the Spark project occurred at the Algorithms, Machines, and
People (AMP) Lab at the University of California, Berkeley. The project
was initiated to address the potential issues in the Hadoop MapReduce
framework. Although Hadoop MapReduce was a groundbreaking
framework to handle big data processing, in reality, it still had a lot of
limitations in terms of speed. Spark was new and capable of doing in-
memory computations, which made it up to 100 times faster than Hadoop
MapReduce for certain workloads. Since then, there has been a
continuous increase in adoption of Spark across the globe for big data
applications. But before jumping into the specifics of Spark, let’s
consider a few aspects of data itself.
Data can be viewed from three different angles: the way it is
collected, stored, and processed, as shown in Figure 1-1.
Data Collection
A huge shift in the manner in which data is collected has occurred over
the last few years. From buying an apple at a grocery store to deleting
an app on your mobile phone, every data point is now captured in the
back end and collected through various built-in applications. Different
Internet of things (IoT) devices capture a wide range of visual and
sensory signals every millisecond. It has become relatively convenient
for businesses to collect that data from various sources and use it later
for improved decision making.
Data Storage
In previous years, no one ever imagined that data would reside at some
remote location, or that the cost to store data would be as cheap as it is.
Businesses have embraced cloud storage and started to see its benefits
over on-premise approaches. However, some businesses still opt for on-
premise storage, for various reasons. It’s known that data storage
began by making use of magnetic tapes. Then the breakthrough
introduction of floppy disks made it possible to move data from one
place to another. However, the size of the data was still a huge
limitation. Flash drives and hard disks made it even easier to store and
transfer large amounts of data at a reduced cost (see Figure 1-2). The
latest advancements in storage devices have resulted in flash
drives capable of storing up to 2 TB of data, at a throwaway price.
Data Processing
The final aspect of data is using stored data and processing it for some
analysis or to run an application. We have witnessed how efficient
computers have become in the last 20 years. What used to take five
minutes to execute probably takes less than a second using today’s
machines with advanced processing units. Hence, it goes without saying
that machines can process data much faster and more easily. Nonetheless,
there is still a limit to the amount of data a single machine can process,
regardless of its processing power. So, the underlying idea behind Spark
is to use a collection (cluster) of machines and a unified processing
engine (Spark) to process and handle huge amounts of data, without
compromising on speed and security. This was the ultimate goal that
resulted in the birth of Spark.
Spark Architecture
There are five core components that make Spark so powerful and easy
to use. The core architecture of Spark consists of the following layers, as
shown in Figure 1-3:
Storage
Resource management
Engine
Ecosystem
APIs
Storage
Before using Spark, data must be made available in order to process it.
This data can reside in any kind of database. Spark supports multiple
categories of data sources, so that data can be processed on a large
scale. It allows you to use traditional relational databases as well as
NoSQL databases, such as Cassandra and MongoDB.
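To give a sense of how this works in practice, the following is a minimal sketch of reading data from a few common source types with PySpark. The file paths, database host, table names, and credentials are placeholders, and the Cassandra example assumes the corresponding Spark connector package is available on the cluster:

from pyspark.sql import SparkSession

# Local session used only for illustration
spark = SparkSession.builder.appName("storage_sources").getOrCreate()

# Flat files (CSV, Parquet) are read directly from a path
csv_df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
parquet_df = spark.read.parquet("data/sales.parquet")

# Relational databases are accessed through the JDBC data source
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/sales")
           .option("dbtable", "public.orders")
           .option("user", "analyst")
           .option("password", "secret")
           .load())

# NoSQL stores such as Cassandra need their Spark connector package
cassandra_df = (spark.read.format("org.apache.spark.sql.cassandra")
                .options(table="orders", keyspace="sales")
                .load())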
Resource Management
The next layer consists of a resource manager. As Spark works on a set
of machines (it can also run on a single machine with multiple cores),
this set of machines is known as a Spark cluster. Typically, there is a
resource manager in any cluster that efficiently distributes the
workload across these resources. The two most widely used resource
managers are YARN and Mesos. The resource manager has two main
components internally:
1. Cluster manager
2. Worker
The cluster manager keeps track of the resources available in the
cluster and allocates them, while the actual processing happens on the
worker nodes. A task is the data processing logic written in PySpark or
SparkR code. It can be as simple as a total frequency count of words or
as complex as a set of instructions on an unstructured dataset. The
Spark driver is the main controller of a Spark application; it
consistently interacts with the cluster manager to find out which worker
nodes can be used to execute the request. The role of the Spark driver
is to request that the cluster manager initiate a Spark executor for
every worker node.
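To make the driver's role concrete, here is a minimal, illustrative sketch of starting a Spark application from Python. The master setting and the resource values are assumptions for a local run; on a real cluster, the master would typically point to YARN or Mesos:

from pyspark.sql import SparkSession

# The master setting tells the driver which cluster manager to talk to.
# "local[4]" runs everything in a single JVM with 4 worker threads.
spark = (SparkSession.builder
         .appName("driver_demo")
         .master("local[4]")
         .config("spark.executor.memory", "2g")  # resources requested per executor
         .getOrCreate())

# Each action below is split into tasks that the driver schedules on executors
df = spark.range(0, 1000000)
print(df.rdd.getNumPartitions())  # number of partitions = parallel tasks per stage
print(df.count())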
Spark SQL
Because SQL is used by most ETL developers across the globe, it was a
logical choice to make it part of the Spark offerings. It allows Spark
users to perform structured data processing by running SQL queries.
Under the hood, Spark SQL leverages the Catalyst optimizer to optimize
SQL queries during execution.
Another advantage of using Spark SQL is that it can easily deal with
multiple file formats and storage systems, such as relational databases,
NoSQL stores, Parquet files, etc.
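As a quick, hypothetical example (the sample data and view name below are made up purely for illustration), a DataFrame can be registered as a temporary view and queried with plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_sql_demo").getOrCreate()

# Sample data registered as a temporary view
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"])
df.createOrReplaceTempView("people")

# Plain SQL executed by Spark SQL; the Catalyst optimizer plans the query
result = spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age")
result.show()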
MLlib
Training machine learning models on big datasets was becoming a huge
challenge, until Spark's MLlib (machine learning library) came into
existence. MLlib gives you the ability to train machine learning models
on huge datasets, using Spark clusters. It allows you to build
supervised, unsupervised, and recommender models, as well as NLP-based
and deep learning applications, within the Spark ML library.
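The following is a minimal, illustrative sketch of an MLlib pipeline; the tiny in-memory dataset and column names are invented purely to show the shape of the API:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_demo").getOrCreate()

# Toy training data: two numeric features and a binary label
train = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.5, 0.0), (5.0, 7.0, 1.0), (6.0, 8.0, 1.0)],
    ["f1", "f2", "label"])

# MLlib estimators expect the features packed into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("f1", "f2", "prediction").show()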
Structured Streaming
The Spark Streaming library provides the functionality to read and
process real-time streaming data. The incoming data can be batch data
or near real-time data from different sources. Structured Streaming is
capable of ingesting real-time data from such sources as Flume, Kafka,
Twitter, etc. There is a dedicated chapter on this component later in this
book (see Chapter 3).
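As a small sketch of the idea (a socket source is used here purely because it is easy to run locally; a Kafka source would use .format("kafka") with the Kafka connector package), the classic streaming word count looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming_demo").getOrCreate()

# Unbounded stream of text lines arriving on a local socket (e.g., nc -lk 9999)
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Incremental word count maintained over the stream
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()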
GraphX
This is a library that sits on top of the Spark core and allows users to
process a specific type of data (graph dataframes), which consists of
nodes and edges. A typical graph is used to model the relationships
between the different objects involved. The nodes represent the objects,
and the edges between the nodes represent the relationships between
them. Graph dataframes are mainly used in network analysis, and GraphX
makes it possible to process such graph dataframes in a distributed
manner.
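For illustration, the following sketch uses the separate graphframes package (not bundled with Spark itself) to build a tiny graph and run two common network-analysis queries; the vertices and edges are made-up sample data:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # external package: graphframes

spark = SparkSession.builder.appName("graph_demo").getOrCreate()

# Vertices must have an "id" column; edges must have "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

graph = GraphFrame(vertices, edges)
graph.inDegrees.show()  # how many incoming edges each node has
graph.pageRank(resetProbability=0.15, maxIter=5).vertices.show()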
Local Setup
It is relatively easy to install and use Spark on a local system, but it
defeats the core purpose of Spark itself if it's not used on a cluster.
Spark's core offering is distributed data processing, which will always
be limited by a single machine's capacity when run locally, whereas a
group of machines offers far greater benefit. However, it is always good
practice to have Spark locally, as well as to test code on sample data.
So, follow these steps to do so:
1. Ensure that Java is installed; otherwise, install Java.
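Once Java and PySpark are available (for example, via pip install pyspark, which is a common route but not the only one), a minimal local sanity check might look like the following:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local_check")
         .master("local[*]")  # use all local cores
         .getOrCreate())

print("Spark version:", spark.version)
spark.range(5).show()  # tiny job to confirm the setup works
spark.stop()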
Docker
Another way of using Spark locally is through containerization with
Docker. This allows users to wrap all the dependencies and Spark files
into a single image, which can be run on any system. We can kill the
container after the task is finished and rerun it, if required. To run
Spark with Docker, we must first install Docker on the system and then
simply run the following command:
[In]: docker run -it -p 8888:8888 jupyter/pyspark-notebook
Cloud Environments
As discussed earlier in this chapter, for various reasons, local setups
are not of much help when it comes to big data, and that's where cloud-
based environments make it possible to ingest and process huge datasets
in a short period. The real power of Spark is most easily seen when
dealing with large datasets (in excess of 100 TB). Most cloud-based
infrastructure providers allow you to install Spark, which sometimes
comes preconfigured as well. One can easily spin up clusters with the
required specifications, according to need. One of these cloud-based
environments is Databricks.
Databricks
Databricks is a company founded by the creators of Spark, in order to
provide the enterprise version of Spark to businesses, in addition to
full-fledged support. To increase Spark’s adoption among the
community and other users, Databricks also provides a free community
edition of Spark, with a 6GB cluster (single node). You can increase the
size of the cluster by signing up for an enterprise account with
Databricks, using the following steps:
1. Search for the Databricks web site and select Databricks
Community Edition, as shown in Figure 1-6.
3. Once you are on the home page, you can choose to either load a
new data source or create a notebook from scratch, as shown in
Figure 1-8. In the latter case, you must have the cluster up and
running, to be able to use the notebook. Therefore, you must click
New Cluster to spin up a cluster. (The Community Edition provides a
6GB single-node cluster hosted on AWS.)
Figure 1-8 Creating a Databricks notebook
4. To set up the cluster, you must give the cluster a name and select
the Spark version, along with the Python version to configure it with,
as shown in Figure 1-9. Once all the details are filled in, you must
click Create Cluster and wait a couple of minutes for it to spin up.
5. You can also view the status of the cluster by going to the Clusters
option in the left-side menu, as shown in Figure 1-10. It gives all
the information associated with the particular cluster and its
current status.
6. The final step is to open a notebook and attach it to the cluster you
just created (Figure 1-11). Once attached, you can start running PySpark
code.
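As a purely illustrative first cell (Databricks notebooks expose a preconfigured SparkSession named spark, so there is no need to create one yourself):

# A tiny job to confirm the notebook is attached to a running cluster
df = spark.range(0, 10).withColumnRenamed("id", "number")
df.show()
print("Running Spark", spark.version, "on the attached cluster")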
Conclusion
This chapter provided a brief history of Spark, its core components, and
the process of accessing it in a cloud environment. In upcoming
chapters, I will delve deeper into the various aspects of Spark and how
to build different applications with it.
© Pramod Singh 2019
P. Singh, Learn PySpark
https://doi.org/10.1007/978-1-4842-4961-1_2
2. Data Processing
Pramod Singh