Data Science on AWS
Implementing End-to-End, Continuous AI and Machine Learning Pipelines

With Early Release ebooks, you get books in their earliest form—
the author’s raw and unedited content as they write—so you can
take advantage of these technologies long before the official
release of these titles.

Chris Fregly and Antje Barth


Data Science on AWS
by Chris Fregly and Antje Barth
Copyright © 2021 Antje Barth and Flux Capacitor, LLC. All rights
reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales
promotional use. Online editions are also available for most titles
(http://oreilly.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or
[email protected].

Acquisitions Editor: Jessica Haberman

Development Editor: Gary O’Brien

Production Editor: Katherine Tozer

Interior Designer: David Futato

Cover Designer: Karen Montgomery

July 2021: First Edition


Revision History for the Early Release
2020-05-05: First Release
2020-06-10: Second Release
2020-07-17: Third Release
2020-08-03: Fourth Release
2020-08-26: Fifth Release
2020-10-02: Sixth Release
2020-11-30: Seventh Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492079392 for
release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Data Science on AWS, the cover image, and related trade dress are
trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do
not represent the publisher’s views. While the publisher and the
authors have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher
and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from
the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is
subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof
complies with such licenses and/or rights.
978-1-492-07939-2
Chapter 1. Automated Machine Learning

A NOTE FOR EARLY RELEASE READERS


With Early Release ebooks, you get books in their earliest form—
the author’s raw and unedited content as they write—so you can
take advantage of these technologies long before the official
release of these titles.
This will be the 3rd chapter of the final book. Please note that
the GitHub repo will be made active later on.
If you have comments about how we might improve the content
and/or examples in this book, or if you notice missing material
within this chapter, please reach out to the author at
[email protected].

In this chapter, we will show how to use the fully managed Amazon
AI and machine learning services to offload the undifferentiated
heavy lifting of building AI pipelines. We dive deep into two Amazon
services for automated machine learning, Amazon SageMaker
Autopilot and Amazon Comprehend, designed for users who want to
build powerful predictive models from their datasets with just a few
clicks.

What is Automated Machine Learning?

Automated machine learning (AutoML) commonly refers to the effort
of automating the typical steps of a machine learning pipeline, as
shown in Figure 1-1.

Figure 1-1. Typical machine learning pipeline.

Machine learning practitioners spend a lot of time building and
managing such pipelines. They need to prepare the data and decide
on the framework and algorithm to use. Seasoned data scientists
use years of experience and intuition to choose the best algorithm
for a given dataset. In an iterative process, ML practitioners try to
find the best-performing model configuration, called
"hyperparameters". Unfortunately, there is no cheat sheet for
choosing these parameters. We still need experience, intuition, and
patience to run many experiments and find the best
hyperparameters for our algorithm and dataset.
What if we could just use a service that automatically finds and
trains the best model for our dataset and deploys the model to
production with a single click? Amazon SageMaker Autopilot offers
exactly this functionality. Autopilot simplifies the model training
and tuning process by handling many aspects of the model
development life cycle (MDLC), including feature transformation,
algorithm selection, model training, tuning, and deployment.
Simply point Autopilot at your dataset, and out comes a set of fully
trained and optimized predictive models. Autopilot explores many
algorithms and configurations based on Amazon's many years of AI and
machine learning experience. The model candidates are
summarized by Autopilot through a set of generated Jupyter
notebooks and Python scripts. You have full control over these
generated notebooks and scripts: you can modify them, automate
them, and share them with colleagues. You can select the top model
candidate based on your desired balance of model accuracy, model
size, and prediction latency. Let's dive deeper into the process of
automated machine learning with Autopilot.

Automated Machine Learning with Autopilot

Autopilot is the name of Amazon SageMaker's AutoML service. You
simply provide your raw data in an S3 bucket, for example in the form
of a tabular CSV file, and tell Autopilot which column you want to
predict. As the name implies, Autopilot then does the rest
automatically.

NOTE
S3 is Amazon’s Simple Storage Service. S3 provides a simple web
service interface that you can use to store and retrieve any amount of
data. We will discuss this service in more detail in the next chapter.
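Getting data into S3 is a one-liner with boto3. The bucket and key names below are placeholders, and the upload itself requires AWS credentials with write access to the bucket; this is a minimal sketch, not part of Autopilot itself:

```python
def s3_uri(bucket, key):
    """Build the s3:// URI that Autopilot expects as its input location."""
    return f"s3://{bucket}/{key}"


def upload_training_data(local_path, bucket, key):
    # Requires AWS credentials with s3:PutObject permission on the bucket.
    # boto3 is imported here so the sketch can be read without it installed.
    import boto3
    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Example (placeholder names):
# upload_training_data("reviews.csv", "my-dsoaws-bucket", "data/reviews.csv")
```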

Autopilot uses automated machine learning to analyze the data and
identify the best algorithm and model configuration for your data,
as shown in Figure 1-2.

Figure 1-2. Amazon SageMaker Autopilot.

You can tell Autopilot how many model candidates to explore. In the
process of building those candidates, Autopilot tries different
algorithms and algorithm settings. Autopilot also applies all needed
data transformations to your data to optimize the input for each
algorithm. The algorithm, configuration, and data transformation
code are then combined into a single ML pipeline definition. The
most promising pipelines are selected by Autopilot and used to find
the best performing model. Lastly, Autopilot shares the results in a
model leaderboard. You can use the best performing model as a
baseline and optimize the model even further. A second option is to
simply deploy the model and start predicting.
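The leaderboard can also be read programmatically once a job has run. A sketch, assuming boto3 and AWS credentials; `list_candidates_for_auto_ml_job` and its response fields come from the SageMaker API, while the ranking helper is our own and assumes a higher-is-better objective metric:

```python
def rank_candidates(candidates):
    """Order candidate summaries by final objective metric, best first.

    Assumes a higher-is-better metric such as accuracy; for loss-type
    metrics the sort order would flip.
    """
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=True,
    )


def leaderboard(job_name):
    # Requires AWS credentials and a completed (or running) Autopilot job.
    import boto3
    sm = boto3.client("sagemaker")
    response = sm.list_candidates_for_auto_ml_job(AutoMLJobName=job_name)
    return rank_candidates(response["Candidates"])
```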
Another highlight of Autopilot is that it provides full visibility
into each of those steps and shares all the code needed to reproduce
the results. AWS calls this a "white-box" approach, and it is unique
among AutoML solutions. Let's explore the white-box vs. black-box
approaches to AutoML a bit further.

Understand Autopilot's White-Box Approach to AutoML

In a black-box approach, as shown in Figure 1-3, you don't have
control over or visibility into the chosen algorithms, applied data
transformations, or hyperparameter choices. You point the AutoML
service at your data and receive a trained model.

Figure 1-3. Other AutoML products use a black-box approach.

This makes it hard to understand and explain the model, and to
manually reproduce it. Most AutoML solutions implement
this kind of black-box approach. In contrast, Autopilot documents
and shares its findings throughout the data analysis, feature
engineering, and model tuning steps, as shown in Figure 1-4.

Figure 1-4. Autopilot's white-box approach to AutoML.

Autopilot not only shares the models; it also logs all observed
metrics and generates Jupyter notebooks that contain the code to
reproduce the model pipelines. The Data Exploration notebook
contains the data analysis results and identifies potential data quality
issues, such as missing values, that might impact model performance
if not addressed. The Candidate Definition notebook highlights the
best algorithms for the given dataset, as well as the code and
configuration needed to use your dataset with each algorithm.

NOTE
The Jupyter notebooks are available after the first data analysis step.
You can configure Autopilot to stop after this step if you want to iterate
quickly over your data before starting the actual model tuning step.
Use SageMaker Experiments in Autopilot
Autopilot uses SageMaker Experiments to keep track of all data
analysis, feature engineering, and model training/tuning jobs. This
feature of the broader Amazon SageMaker family of ML services
helps you organize, track, compare, and evaluate machine learning
experiments. SageMaker Experiments enables model versioning and
lineage tracking across all phases of the ML lifecycle.
An experiment consists of trials and training jobs as shown in
Figure 1-5. A trial is a collection of training steps and metadata for
those steps. The training steps typically include data preprocessing,
model training, and model tuning. Metadata includes dataset
locations, algorithm hyper-parameters, output artifacts, and
performance metrics.
Figure 1-5. SageMaker Experiments.

You can explore and manage Autopilot experiments and trials either
through the UI or using the AWS Python SDK, boto3
[https://github.com/boto/boto3]. Let's have a look at both options
and start an Autopilot experiment to build a custom classifier model.
As input data, we leverage samples from the Amazon Customer
Reviews Dataset [https://s3.amazonaws.com/amazon-reviews-
pds/readme.html]. This dataset is a collection of over 150 million
product reviews on Amazon.com from 1995 to 2015. Product
reviews and star ratings are a popular customer feature of
Amazon.com; a star rating of 5 is the best, and 1 is the worst. We will
describe and explore this dataset in much more detail in the next
chapters.
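As a sketch of the boto3 route, the trials belonging to an experiment can be listed with the SageMaker client's `list_trials` API. AWS credentials are required at call time; the response-parsing helper is kept separate so it can be tried without them:

```python
def trial_names(response):
    """Extract the trial names from one list_trials response page."""
    return [t["TrialName"] for t in response["TrialSummaries"]]


def list_experiment_trials(experiment_name):
    # Requires AWS credentials; pages through all trials in the experiment.
    import boto3
    sm = boto3.client("sagemaker")
    names, kwargs = [], {"ExperimentName": experiment_name}
    while True:
        response = sm.list_trials(**kwargs)
        names.extend(trial_names(response))
        if "NextToken" not in response:
            return names
        kwargs["NextToken"] = response["NextToken"]
```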

Train and Deploy with the Autopilot UI

The Autopilot UI is integrated into SageMaker Studio, an IDE that
provides a single, web-based visual interface where you can perform
your machine learning development. Simply navigate to Amazon
SageMaker in your AWS Console and click on SageMaker Studio, as
shown in Figure 1-6.

Figure 1-6. AWS Console > Amazon SageMaker > Amazon SageMaker Studio

Then, follow the instructions to set up SageMaker Studio and click
on Open Studio once completed, as shown in Figure 1-7.

Figure 1-7. Set up SageMaker Studio

This will take you to the SageMaker Studio IDE, as shown in
Figure 1-8.

Figure 1-8. SageMaker Studio IDE

You will find the Autopilot UI when you click the laboratory flask icon
in the left-side menu, as shown in Figure 1-9. Once there, click on
Create experiment to create and configure your first Autopilot
experiment.

Figure 1-9. Create an Autopilot experiment.
In preparation for our Autopilot experiment, we use a subset of the
Amazon Customer Reviews Dataset to train our model. We want to
train a classifier to predict the star rating for a given review, so we
only use the star_rating and review_body columns:
star_rating,review_body
5,"GOOD, GREAT, WONDERFUL, MORE THAN ADEQUATE AND EXACTLY
WHAT I NEED FOR MY COMPUTER. NONE BETTER OR GOOD TO USE."
2,"Even though it does the same job as TurboTax, it isn't as
user friendly as TurboTax. I guess you get what you paid
for."
4,"Pretty easy to use. No issues."

NOTE
In other scenarios, you will likely want to use more columns from your
dataset and let Autopilot choose the most important ones. In our
example, however, we want to keep things simple and just use the
star_rating and review_body columns to focus on the technical aspects
of Autopilot.
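As a sketch of that preparation step, the two-column subset can be produced with pandas. The file name is a placeholder, and the inline sample stands in for the full dataset file (verified_purchase represents the dataset's additional columns that we drop):

```python
import io

import pandas as pd

# Inline stand-in for the full dataset file; in practice, read the
# downloaded Amazon Customer Reviews file instead.
raw = io.StringIO(
    "star_rating,review_body,verified_purchase\n"
    '4,"Pretty easy to use. No issues.",Y\n'
    '2,"Even though it does the same job as TurboTax, it is not as user friendly.",N\n'
)

df = pd.read_csv(raw)

# Keep only the target column and the text feature for Autopilot.
subset = df[["star_rating", "review_body"]]
subset.to_csv("amazon_reviews_subset.csv", index=False)  # placeholder file name
```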

Next, we configure the Autopilot experiment, as shown in
Figure 1-10.

Figure 1-10. Configure the Autopilot experiment.

We just need to provide a few input parameters:

Experiment name: A name to identify the experiment, e.g.
amazon-customer-reviews

Input data location: The S3 path to our training data, e.g.
s3://dsoaws-amazon-reviews/data/amazon_reviews_us_Digital_Software_v1_00_header.csv

Target attribute name: The attribute (column) we want to
predict, i.e. star_rating

Output data location: The S3 path for storing the generated
output, such as models and other artifacts, e.g.
s3://dsoaws-amazon-reviews/autopilot/output

Problem type: The machine learning problem type, such as
Binary classification, Regression, or Multiclass classification.
The default, "Auto", allows Autopilot to choose for itself
based on the given input data.

Run complete experiment: We can choose to run a complete
experiment or just generate the Data Exploration and
Candidate Definition notebooks as part of the Data Analysis
phase.
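The same inputs map onto the SageMaker API if you prefer to start the job with boto3 instead of the UI. A hedged sketch: the role ARN is a placeholder, the call requires AWS credentials, and omitting ProblemType corresponds to choosing "Auto":

```python
def build_automl_request(job_name, input_s3_uri, target_column,
                         output_s3_uri, role_arn):
    """Assemble a create_auto_ml_job request mirroring the UI fields above."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
            }},
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # No "ProblemType" key: Autopilot infers it, like "Auto" in the UI.
        "RoleArn": role_arn,
    }


def start_autopilot_job(request):
    # Requires AWS credentials and an IAM role that SageMaker can assume.
    import boto3
    boto3.client("sagemaker").create_auto_ml_job(**request)


# Example (placeholder role ARN):
# request = build_automl_request(
#     "amazon-customer-reviews",
#     "s3://dsoaws-amazon-reviews/data/",
#     "star_rating",
#     "s3://dsoaws-amazon-reviews/autopilot/output",
#     "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
# )
# start_autopilot_job(request)
```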
Let’s click Create Experiment and start our first Autopilot job.
You can observe the progress of the job in the UI as shown in
Figure 1-11.
Figure 1-11. Progress of a running Autopilot job.

The UI shows the three stages of the Autopilot job: Analyzing Data,
Feature Engineering, and Model Tuning. Once Autopilot completes
the Analyzing Data stage, you can see the links to the two generated
notebooks appearing in the UI, as shown in Figure 1-12.

Figure 1-12. Autopilot > Analyzing Data > Generated notebooks

If you have a look at the S3 output bucket, you can find the
generated notebooks, code, and transformed data in the following
structure:

amazon-customer-reviews/
    data-processor-models/
        amazon-cus-dpp0-1-xxx/
            output/model.tar.gz
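To inspect these artifacts without the console, you can list the output prefix with boto3. A sketch under the usual assumptions (credentials, placeholder bucket name); the grouping helper is separate so it can be tried without AWS access:

```python
def group_by_stage(keys, prefix):
    """Group object keys by their first path component under the prefix."""
    stages = {}
    for key in keys:
        relative = key[len(prefix):].lstrip("/")
        stage = relative.split("/", 1)[0]
        stages.setdefault(stage, []).append(key)
    return stages


def list_output_keys(bucket, prefix):
    # Requires AWS credentials with s3:ListBucket on the bucket.
    import boto3
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for page in pages for obj in page.get("Contents", [])]
```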