Pramod Singh

Learn PySpark
Build Python-based Machine Learning and Deep
Learning Models
Pramod Singh
Bangalore, Karnataka, India

Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub via the book's
product page, located at www.apress.com/978-1-4842-4960-4. For
more detailed information, please visit www.apress.com/source-code.

ISBN 978-1-4842-4960-4 e-ISBN 978-1-4842-4961-1


https://doi.org/10.1007/978-1-4842-4961-1

© Pramod Singh 2019

Apress Standard

Trademarked names, logos, and images may appear in this book. Rather
than use a trademark symbol with every occurrence of a trademarked
name, logo, or image, we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark. The use in this publication
of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true
and accurate at the date of publication, neither the author nor the
editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no
warranty, express or implied, with respect to the material contained
herein.

Distributed to the book trade worldwide by Springer Science+Business
Media New York, 233 Spring Street, 6th Floor, New York, NY 10013.
Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-
[email protected], or visit www.springeronline.com. Apress Media,
LLC is a California LLC and the sole member (owner) is Springer
Science+Business Media Finance Inc (SSBM Finance Inc). SSBM Finance
Inc is a Delaware corporation.
I dedicate this book to my wife, Neha, my son, Ziaan, and my parents.
Without you, this book wouldn’t have been possible. You complete my
world and are the source of my strength.
Introduction
The idea of writing this book had already been seeded while I was
working on my first book, and there was a strong reason for that. The
earlier book was more focused on machine learning using big data and
essentially did not deep-dive sufficiently into supporting aspects, but
this book goes a little deeper into the internals of Spark’s machine
learning library, as well as the analysis of streaming data. It is a good
reference point for someone who wants to learn more about how to
automate different workflows and build pipelines to handle real-time
data.
This book is divided into three main sections. The first provides an
introduction to Spark and data analysis on big data; the middle section
discusses using Airflow for executing different jobs, in addition to data
analysis on streaming data, using the structured streaming features of
Spark. The final section covers translation of a business problem into
machine learning and solving it, using Spark’s machine learning library,
with a deep dive into deep learning as well.
This book might also be useful to data analysts and data engineers,
as it covers the steps of big data processing using PySpark. Readers
who want to make a transition to the data science and machine learning
fields will also find this book a good starting point and can gradually
tackle more complicated areas later. The case studies and examples
given in the book make it really easy to follow and understand the
related fundamental concepts. Moreover, there are very few books
available on PySpark, and this book certainly adds value to readers’
knowledge. The strength of this book lies in its simplicity and in its
application of machine learning to meaningful datasets.
I have tried my best to put all my experience and knowledge into
this book, and I feel it is particularly relevant to what businesses are
seeking in order to solve real challenges. I hope that it will provide you
with some useful takeaways.
Acknowledgments
This is my second book on Spark, and along the way, I have come to
realize my love for handling big data and performing machine learning
as well. Going forward, I intend to write many more books, but first, let
me thank a few people who have helped me along this journey. First, I
must thank the most important person in my life, my beloved wife,
Neha, who selflessly supported me throughout and sacrificed so much
to ensure that I completed this book.
I must thank Celestin Suresh John, who believed in me and extended
the opportunity to write this book. Aditee Mirashi is one of the best
editors in India. This is my second book with her, and it was even more
exciting to work with her this time. As usual, she was extremely
supportive and always there to accommodate my requests. I especially
would like to thank Jim Markham, who dedicated his time to reading
every single chapter and offered so many useful suggestions. Thanks,
Jim, I really appreciate your input. I also want to thank Manoj Patil, who
had the patience to review every line of code and check the
appropriateness of each example. Thank you for your feedback and
encouragement. It really made a difference to me and the book.
I also want to thank the many mentors who have constantly forced
me to pursue my dreams. Thank you Sebastian Keupers, Dr. Vijay
Agneeswaran, Sreenivas Venkatraman, Shoaib Ahmed, and Abhishek
Kumar, for your time. Finally, I am infinitely grateful to my son, Ziaan,
and my parents, for their endless love and support, irrespective of
circumstances. You all make my world beautiful.
Table of Contents
Chapter 1: Introduction to Spark
History
Data Collection
Data Storage
Data Processing
Spark Architecture
Storage
Resource Management
Engine and Ecosystem
Programming Language APIs
Setting Up Your Environment
Local Setup
Dockers
Cloud Environments
Conclusion
Chapter 2: Data Processing
Creating a SparkSession Object
Creating Dataframes
Null Values
Subset of a Dataframe
Select
Filter
Where
Aggregations
Collect
User-Defined Functions (UDFs)
Pandas UDF
Joins
Pivoting
Window Functions or Windowed Aggregates
Conclusion
Chapter 3: Spark Structured Streaming
Batch vs. Stream
Batch Data
Stream Processing
Spark Streaming
Structured Streaming
Data Input
Data Processing
Final Output
Building a Structured App
Operations
Joins
Structured Streaming Alternatives
Conclusion
Chapter 4: Airflow
Workflows
Graph Overview
Undirected Graphs
Directed Graphs
DAG Overview
Operators
Installing Airflow
Airflow Using Docker
Creating Your First DAG
Step 1: Importing the Required Libraries
Step 2: Defining the Default Arguments
Step 3: Creating a DAG
Step 4: Declaring Tasks
Step 5: Mentioning Dependencies
Conclusion
Chapter 5: MLlib: Machine Learning Library
Calculating Correlations
Chi-Square Test
Transformations
Binarizer
Principal Component Analysis
Normalizer
Standard Scaling
Min-Max Scaling
MaxAbsScaler
Binning
Building a Classification Model
Step 1: Load the Dataset
Step 2: Explore the Dataframe
Step 3: Data Transformation
Step 4: Splitting into Train and Test Data
Step 5: Model Training
Step 6: Hyperparameter Tuning
Step 7: Best Model
Conclusion
Chapter 6: Supervised Machine Learning
Supervised Machine Learning Primer
Binary Classification
Multi-class Classification
Building a Linear Regression Model
Reviewing the Data Information
Generalized Linear Model Regression
Decision Tree Regression
Random Forest Regressors
Gradient-Boosted Tree Regressor
Step 1: Build and Train a GBT Regressor Model
Step 2: Evaluate the Model Performance on Test Data
Building Multiple Models for Binary Classification Tasks
Logistic Regression
Decision Tree Classifier
Support Vector Machines Classifiers
Naive Bayes Classifier
Gradient Boosted Tree Classifier
Random Forest Classifier
Hyperparameter Tuning and Cross-Validation
Conclusion
Chapter 7: Unsupervised Machine Learning
Unsupervised Machine Learning Primer
Reviewing the Dataset
Importing SparkSession and Creating an Object
Reshaping a Dataframe for Clustering
Building Clusters with K-Means
Conclusion
Chapter 8: Deep Learning Using PySpark
Deep Learning Fundamentals
Human Brain Neuron vs. Artificial Neuron
Activation Functions
Neuron Computation
Training Process: Neural Network
Building a Multilayer Perceptron Model
Conclusion
Index
About the Author and About the Technical
Reviewer

About the Author


Pramod Singh
has more than 11 years of hands-on
experience in data engineering and
sciences and is currently a manager
(data science) at Publicis Sapient in
India, where he drives strategic
initiatives that deal with machine
learning and artificial intelligence (AI).
Pramod has worked with multiple
clients, in areas such as retail, telecom,
and automobile and consumer goods,
and is the author of Machine Learning
with PySpark. He also speaks at major
forums, such as Strata Data, and at AI
conferences.
Pramod received a bachelor’s degree in electrical and electronics
engineering from Mumbai University and an MBA (operations and
finance) from Symbiosis International University, in addition to data
analytics certification from IIM–Calcutta.
Pramod lives in Bangalore with his wife and three-year-old son. In
his spare time, he enjoys playing guitar, coding, reading, and watching
soccer.

About the Technical Reviewer


Manoj Patil
has worked in the software industry for 19 years. He received an
engineering degree from COEP, Pune (India), and has been enjoying his
exciting IT journey ever since.
As a principal architect at TatvaSoft,
Manoj has taken many initiatives in the
organization, ranging from training and
mentoring teams, leading data science
and ML practice, to successfully
designing client solutions from different
functional domains.
He began his career as a Java
programmer but is fortunate to have
worked on multiple frameworks with
multiple languages and can claim to be a
full stack developer. In the last five years,
Manoj has worked extensively in the
field of BI, big data, and machine
learning, using such technologies as
Hitachi Vantara (Pentaho), the Hadoop ecosystem, TensorFlow, Python-
based libraries, and more.
He is passionate about learning new technologies, trends, and
reviewing books. When he’s not working, he’s either exercising or
reading/listening to infinitheism literature.

1. Introduction to Spark
Pramod Singh
Bangalore, Karnataka, India

As this book is about Spark, it makes perfect sense to start the first
chapter by looking into some of Spark’s history and its different
components. This introductory chapter is divided into three sections. In
the first, I go over the evolution of data and how it got as far as it has, in
terms of size. I’ll touch on three key aspects of data. In the second
section, I delve into the internals of Spark and go over the details of its
different components, including its architecture and modus operandi.
The third and final section of this chapter focuses on how to use Spark
in a cloud environment.

History
The birth of the Spark project occurred at the Algorithms, Machines, and
People (AMP) Lab at the University of California, Berkeley. The project
was initiated to address the potential issues in the Hadoop MapReduce
framework. Although Hadoop MapReduce was a groundbreaking
framework to handle big data processing, in reality, it still had a lot of
limitations in terms of speed. Spark was new and capable of doing in-
memory computations, which made it up to 100 times faster than
Hadoop MapReduce for certain workloads. Since then, there has been a
continuous increase in adoption of Spark across the globe for big data
applications. But before jumping into the specifics of Spark, let’s
consider a few aspects of data itself.
Data can be viewed from three different angles: the way it is
collected, stored, and processed, as shown in Figure 1-1.

Figure 1-1 Three aspects of data

Data Collection
A huge shift in the manner in which data is collected has occurred over
the last few years. From buying an apple at a grocery store to deleting
an app on your mobile phone, every data point is now captured in the
back end and collected through various built-in applications. Different
Internet of things (IoT) devices capture a wide range of visual and
sensory signals every millisecond. It has become relatively convenient
for businesses to collect that data from various sources and use it later
for improved decision making.

Data Storage
In previous years, no one ever imagined that data would reside at some
remote location, or that the cost to store data would be as cheap as it is.
Businesses have embraced cloud storage and started to see its benefits
over on-premise approaches. However, some businesses still opt for on-
premise storage, for various reasons. It’s known that data storage
began by making use of magnetic tapes. Then the breakthrough
introduction of floppy discs made it possible to move data from one
place to another. However, the size of the data was still a huge
limitation. Flash drives and hard discs made it even easier to store and
transfer large amounts of data at a reduced cost. (See Figure 1-2.) The
latest trend in the advancement of storage devices has resulted in flash
drives capable of storing up to 2TB of data at a throwaway price.

Figure 1-2 Evolution of data storage


This trend clearly indicates that the cost to store data has been
reduced significantly over the years and continues to decline. As a
result, businesses don’t shy away from storing huge amounts of data,
irrespective of its kind. From logs to financial and operational
transactions to simple employee feedback, everything gets stored.

Data Processing
The final aspect of data is using stored data and processing it for some
analysis or to run an application. We have witnessed how efficient
computers have become in the last 20 years. What used to take five
minutes to execute probably takes less than a second using today’s
machines with advanced processing units. Hence, it goes without saying
that machines can process data much faster and more easily. Nonetheless,
there is still a limit to the amount of data a single machine can process,
regardless of its processing power. So, the underlying idea behind Spark
is to use a collection (cluster) of machines and a unified processing
engine (Spark) to process and handle huge amounts of data, without
compromising on speed and security. This was the ultimate goal that
resulted in the birth of Spark.

Spark Architecture
There are five core components that make Spark so powerful and easy
to use. The core architecture of Spark consists of the following layers, as
shown in Figure 1-3:
Storage
Resource management
Engine
Ecosystem
APIs

Figure 1-3 Core components of Spark

Storage
Before using Spark, data must be made available in order to process it.
This data can reside in any kind of database. Spark offers multiple
options to use different categories of data sources, to be able to process
it on a large scale. Spark allows you to use traditional relational
databases as well as NoSQL, such as Cassandra and MongoDB.
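
For instance, a minimal sketch of reading data from two common
sources might look like the following (the file paths here are
hypothetical, not from the book):

[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName('storage_example').getOrCreate()
# read a CSV file with a header row, letting Spark infer column types
[In]: csv_df = spark.read.csv('/data/sales.csv', header=True, inferSchema=True)
# read the same data stored as Parquet
[In]: parquet_df = spark.read.parquet('/data/sales.parquet')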

Resource Management
The next layer consists of a resource manager. As Spark works on a set
of machines (it also can work on a single machine with multiple cores),
it is known as a Spark cluster. Typically, there is a resource manager in
any cluster that efficiently handles the workload between these
resources. The two most widely used resource managers are YARN and
Mesos. The resource manager has two main components internally:
1. Cluster manager

2. Worker

It's kind of like a master-slave architecture, in which the cluster
manager acts as a master node, and the worker acts as a slave node in
the cluster. The cluster manager keeps track of all information
pertaining to the worker nodes and their current status. Cluster
managers always maintain the following information:
Status of worker node (busy/available)
Location of worker node
Memory of worker node
Total CPU cores of worker node
The main role of the cluster manager is to manage the worker nodes
and assign them tasks, based on the availability and capacity of the
worker node. On the other hand, a worker node is only responsible for
executing the task it’s given by the cluster manager, as shown in Figure
1-4.
Figure 1-4 Resource management
The tasks that are given to the worker nodes are generally the
individual pieces of the overall Spark application. The Spark application
contains two parts:
1. Task

2. Spark driver

The task is the data processing logic that has been written in either
PySpark or Spark R code. It can range from something as simple as a
total frequency count of words to a very complex set of instructions on
an unstructured dataset. The second component is the Spark driver, the main controller of a
Spark application, which consistently interacts with a cluster manager
to find out which worker nodes can be used to execute the request. The
role of the Spark driver is to request the cluster manager to initiate the
Spark executor for every worker node.
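
From the PySpark side, the choice of resource manager surfaces as the
master setting when the SparkSession is built. The following is a
minimal sketch, not the book's code, and the YARN cluster it points to
is assumed to already exist:

# 'local[*]' uses all cores of a single machine; 'yarn' delegates
# scheduling to a YARN cluster manager (assumes a configured cluster)
[In]: from pyspark.sql import SparkSession
[In]: spark = (SparkSession.builder
          .appName('resource_example')
          .master('local[*]')
          .getOrCreate())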

Engine and Ecosystem


The base of the Spark architecture is its core, which is built on top of
RDDs (Resilient Distributed Datasets) and offers multiple APIs that
Spark contributors use to build other libraries and ecosystems. It
contains two parts: the distributed computing infrastructure and the
RDD programming abstraction. The default libraries in the Spark toolkit
come as four different offerings.

Spark SQL
Because SQL is used by most ETL operators across the globe, it is a
logical choice to be part of Spark's offerings. It allows Spark users to
perform structured data processing by running SQL queries. Under the
hood, Spark SQL leverages the Catalyst optimizer to perform
optimizations during the execution of SQL queries.
Another advantage of using Spark SQL is that it can easily deal with
multiple database files and storage systems such as SQL, NoSQL,
Parquet, etc.
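
As an illustration, here is a hedged sketch of running a SQL query over
a dataframe; the file path and the region/amount columns are
hypothetical:

[In]: df = spark.read.csv('/data/sales.csv', header=True, inferSchema=True)
# expose the dataframe to the SQL engine under the name 'sales'
[In]: df.createOrReplaceTempView('sales')
[In]: result = spark.sql("select region, sum(amount) as total from sales group by region")
[In]: result.show()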

MLlib
Training machine learning models on big datasets was starting to
become a huge challenge, until Spark’s MLlib (Machine Learning
library) came into existence. MLlib gives you the ability to train
machine learning models on huge datasets, using Spark clusters. It
allows you to build supervised, unsupervised, and recommender
systems, as well as NLP-based and deep learning models, within the
Spark ML library.
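
By way of illustration only, training a model with MLlib typically
means assembling the feature columns into a single vector and fitting
an estimator. The dataset and column names below (age, income, label)
are hypothetical:

[In]: from pyspark.ml.feature import VectorAssembler
[In]: from pyspark.ml.classification import LogisticRegression
[In]: df = spark.read.csv('/data/customers.csv', header=True, inferSchema=True)
# combine the input columns into a single 'features' vector column
[In]: assembler = VectorAssembler(inputCols=['age', 'income'], outputCol='features')
[In]: train_df = assembler.transform(df)
# fit a logistic regression model against the binary 'label' column
[In]: lr = LogisticRegression(featuresCol='features', labelCol='label')
[In]: model = lr.fit(train_df)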

Structured Streaming
The Spark Streaming library provides the functionality to read and
process real-time streaming data. The incoming data can be batch data
or near real-time data from different sources. Structured Streaming is
capable of ingesting real-time data from such sources as Flume, Kafka,
Twitter, etc. There is a dedicated chapter on this component later in this
book (see Chapter 3).
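
A minimal sketch of the idea, assuming new CSV files keep landing in a
monitored directory (the schema, path, and columns are hypothetical):

[In]: from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# streaming file sources require an explicit schema
[In]: schema = StructType([StructField('region', StringType(), True),
                           StructField('amount', IntegerType(), True)])
[In]: stream_df = spark.readStream.schema(schema).csv('/data/incoming/')
# maintain a running count per region and print it to the console
[In]: query = (stream_df.groupBy('region').count()
          .writeStream.outputMode('complete')
          .format('console').start())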

GraphX
This is a library that sits on top of the Spark core and allows users to
process specific types of data (graph dataframes), which consists of
nodes and edges. A typical graph is used to model the relationship
between the different objects involved. The nodes represent the object,
and the edge between the nodes represents the relationship between
them. Graph dataframes are mainly used in network analysis, and
GraphX makes it possible to have distributed processing of such graph
dataframes.
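
GraphX itself exposes Scala and Java APIs; from PySpark, graph
dataframes are usually handled through the separate GraphFrames
package. A minimal sketch, assuming graphframes is installed alongside
Spark:

[In]: from graphframes import GraphFrame
# vertices need an 'id' column; edges need 'src' and 'dst' columns
[In]: vertices = spark.createDataFrame([('a', 'Alice'), ('b', 'Bob')], ['id', 'name'])
[In]: edges = spark.createDataFrame([('a', 'b', 'follows')], ['src', 'dst', 'relationship'])
[In]: g = GraphFrame(vertices, edges)
# count incoming edges per vertex
[In]: g.inDegrees.show()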

Programming Language APIs


Spark is available in four languages. Because Spark is built using Scala,
that becomes the native language. Apart from Scala, we can also use
Python, Java, and R, as shown in Figure 1-5.

Figure 1-5 Language APIs

Setting Up Your Environment


In this final section of this chapter, I will go over how to set up the
Spark environment in the cloud. There are multiple ways in which we
can use Spark:
Local setup
Dockers
Cloud environment (GCP, AWS, Azure)
Databricks

Local Setup
It is relatively easy to install and use Spark on a local system, but doing
so defeats the core purpose of Spark itself, if it's not used on a cluster.
Spark's core offering is distributed data processing, which will always
be limited by a single machine's capacity when run locally,
whereas one can benefit far more by using Spark on a group of machines
instead. However, it is always good practice to have Spark locally, as
well as to test code on sample data. So, follow these steps to do so:
1. Ensure that Java is installed; otherwise install Java.

2. Download the latest version of Apache Spark from
https://spark.apache.org/downloads.html.

3. Extract the files from the zipped folder.

4. Copy all the Spark-related files to their respective directory.

5. Configure the environment variables to be able to run Spark.

6. Verify the installation and run Spark.
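
For step 6, a quick way to verify the installation (assuming the
environment variables from step 5 are in place) is to start a
SparkSession and run a trivial job:

[In]: from pyspark.sql import SparkSession
[In]: spark = SparkSession.builder.appName('verify_install').getOrCreate()
[In]: print(spark.version)             # prints the installed Spark version
[In]: print(spark.range(5).count())    # trivial job; should print 5
[In]: spark.stop()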

Dockers
Another way of using Spark locally is through the containerization
technique of Docker containers. This allows users to wrap all the
dependencies and Spark files into a single image, which can be run on
any system. We can kill the container after the task is finished and
rerun it, if required. To use Docker for running Spark, we must install
Docker on the system first and then simply run the following command:

[In]: docker run -it -p 8888:8888 jupyter/pyspark-notebook
Cloud Environments
As discussed earlier in this chapter, for various reasons, local setups are
not of much help when it comes to big data, and that’s where cloud-
based environments make it possible to ingest and process huge
datasets in a short period. The real power of Spark can be seen easily
while dealing with large datasets (in excess of 100TB). Most of the
cloud-based infrastructure providers allow you to install Spark, which
sometimes comes preconfigured as well. One can easily spin up the
clusters with required specifications, according to need. One of the
cloud-based environments is Databricks.

Databricks
Databricks is a company founded by the creators of Spark, in order to
provide the enterprise version of Spark to businesses, in addition to
full-fledged support. To increase Spark’s adoption among the
community and other users, Databricks also provides a free community
edition of Spark, with a 6GB cluster (single node). You can increase the
size of the cluster by signing up for an enterprise account with
Databricks, using the following steps:
1. Search for the Databricks web site and select Databricks
Community Edition, as shown in Figure 1-6.

Figure 1-6 Databricks web page


2. If you have a user account with Databricks, you can simply log in. If
you don’t have an account, you must create one, in order to use
Databricks, as shown in Figure 1-7.

Figure 1-7 Databricks login

3. Once you are on the home page, you can choose to either load a
new data source or create a notebook from scratch, as shown in
Figure 1-8. In the latter case, you must have the cluster up and
running, to be able to use the notebook. Therefore, you must click
New Cluster, to spin up the cluster. (The Community Edition provides a
6GB cluster hosted on AWS.)
Figure 1-8 Creating a Databricks notebook
4. To set up the cluster, you must give a name to the cluster and select
the version of Spark to configure, along with the Python version,
as shown in Figure 1-9. Once all the details are filled in, you must
click Create Cluster and wait a couple of minutes, until it spins up.

Figure 1-9 Creating a Databricks cluster

5. You can also view the status of the cluster by going into the Clusters
option on the left side widget, as shown in Figure 1-10. It gives all
the information associated with the particular cluster and its
current status.

Figure 1-10 Databricks cluster list

6. The final step is to open a notebook and attach it to the cluster you
just created (Figure 1-11). Once attached, you can start running
PySpark code.

Figure 1-11 Databricks notebook

Overall, since 2010, when Spark became an open source platform,
its users have risen in number consistently, and the community
continues to grow every day. It’s no surprise that the number of
contributors to Spark has outpaced that of Hadoop. Some of the reasons
for Spark’s popularity were noted in a survey, the results of which are
shown in Figure 1-12.
Figure 1-12 Results of Spark adoption survey

Conclusion
This chapter provided a brief history of Spark, its core components, and
the process of accessing it in a cloud environment. In upcoming
chapters, I will delve deeper into the various aspects of Spark and how
to build different applications with it.

2. Data Processing
Pramod Singh
Bangalore, Karnataka, India

This chapter covers different steps to preprocess and handle data in
PySpark. Preprocessing techniques can certainly vary from case to case,
and many different methods can be used to massage the data into
desired form. The idea of this chapter is to expose some of the common
techniques for dealing with big data in Spark. In this chapter, we are
going to go over different steps involved in preprocessing data, such as
handling missing values, merging datasets, applying functions,
aggregations, and sorting. One major part of data preprocessing is the
transformation of numerical columns into categorical ones and vice
versa, which we are going to look at over the next few chapters, as it
relates to machine learning.
of in this chapter is inspired by a primary research dataset and contains
a few attributes from the original dataset, with additional columns
containing fabricated data points.

Note All the following steps are written in Jupyter Notebook,
running Spark on a Docker image (mentioned in Chapter 1). All the
subsequent code can also be run in Databricks.

Creating a SparkSession Object

The first step is to create a SparkSession object, in order to use
Spark. We also import all the required functions and datatypes from
pyspark.sql.functions and pyspark.sql.types.
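
A minimal sketch of that first step might look like the following; the
application name is arbitrary:

[In]: from pyspark.sql import SparkSession
[In]: from pyspark.sql.functions import *   # dataframe functions
[In]: from pyspark.sql.types import *       # datatypes for defining schemas
[In]: spark = SparkSession.builder.appName('data_processing').getOrCreate()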