T-GCPBDML-B - M3 - Big Data With BigQuery - ILT Slides
Module 3
Google Cloud Big Data and Machine Learning Fundamentals
Introduction
01
In the previous module of this course, you learned how to build a streaming data
pipeline using Pub/Sub and Dataflow, and how to visualize the results with Looker and
Looker Studio. Now let’s switch our focus to a popular data warehouse product on
Google Cloud: BigQuery.
Agenda
BigQuery
BigQuery demo
BigQuery ML
Hands-on lab
You’ll begin by exploring BigQuery’s two main services, storage and analytics, and
then get a demonstration of BigQuery in use.
After that, you’ll see how BigQuery ML provides a data-to-AI lifecycle all within one
place. You’ll also learn about BigQuery ML project phases, as well as key commands.
Finally, you’ll get hands-on practice using BigQuery ML to build a custom ML model.
Being fully managed means that BigQuery takes care of the underlying infrastructure,
so you can focus on using SQL queries to answer business questions, without
worrying about deployment, scalability, and security.
At this point, it’s useful to consider what the main difference is between a data
warehouse and a data lake. A data lake is just a pool of raw, unorganized, and
unclassified data, which has no specified purpose. A data warehouse, on the other
hand, contains structured and organized data, which can be used for advanced
querying.
BigQuery features
01 Storage plus analytics
A place to store petabytes of data.
(1 petabyte = 11,000 4K movies)
02 Serverless
Free from provisioning resources or managing servers, so you can focus on SQL queries.
03 Flexible pay-as-you-go pricing model
Pay for what your query processes, or use a flat-rate option.
04 Data encryption at rest by default
Data stored on disk is encrypted.
05 Built-in machine learning
Write ML models directly in BigQuery using SQL.
● BigQuery provides two services in one: storage plus analytics. It’s a place to
store petabytes of data. For reference, 1 petabyte is equivalent to 11,000
movies at 4k quality. BigQuery is also a place to analyze data, with built-in
features like machine learning, geospatial analysis, and business intelligence,
which we’ll explore a bit later on.
● BigQuery is a fully managed, serverless solution, meaning that you don’t need
to worry about provisioning resources or managing servers in the backend, and
can instead focus on using SQL queries to answer your organization’s questions in
the frontend. If you’ve never written SQL before, don’t worry. This course
provides resources and labs to help.
● BigQuery has a flexible pay-as-you-go pricing model where you pay for the
number of bytes of data your query processes and for any permanent table
storage. If you prefer to have a fixed bill every month, you can also subscribe
to flat-rate pricing where you have a reserved amount of resources for use.
● BigQuery has built-in machine learning features so you can write ML models
directly in BigQuery using SQL. Also, if you decide to use other professional
tools—such as Vertex AI from Google Cloud—to train your ML models, you can
export datasets from BigQuery directly into Vertex AI for a seamless
integration across the data-to-AI lifecycle.
[Diagram] Real-time (streaming) data is ingested through Pub/Sub and batch data
through Cloud Storage; both feed into Dataflow for processing and then into BigQuery.
BigQuery’s outputs flow to BI tools (Looker, Looker Studio, Tableau, Google Sheets)
and to AI/ML tools (AutoML, Vertex AI Workbench).
The input data can be either real-time or batch data. If you think back to the last
module of the course, you’ll recall that there are four challenges of big data in modern
organizations. They are that data can be any format (variety), any size (volume), any
speed (velocity), and possibly inaccurate (veracity).
If it’s streaming data, which can be either structured or unstructured, high speed, and
large volume, Pub/Sub is needed to ingest the data. If it’s batch data, it can be directly
uploaded to Cloud Storage.
After that, both pipelines lead to Dataflow to process the data. Dataflow is where we
ETL – extract, transform, and load – the data if needed.
BigQuery sits in the middle to link data processes using Dataflow and data access
through analytics, AI, and ML tools.
The job of the analytics engine of BigQuery at the end of a data pipeline is to ingest all
the processed data after ETL, store and analyze it, and possibly output it for further
use such as data visualization and machine learning.
BigQuery outputs usually feed into two buckets: business intelligence tools and AI/ML
tools.
If you’re a business analyst or data analyst, you can connect to visualization tools like
Looker, Looker Studio, Tableau, and other BI tools.
If you prefer to work in spreadsheets, you can query both small and large BigQuery
datasets directly from Google Sheets and even perform common operations like pivot
tables.
Alternatively, if you’re a data scientist or machine learning engineer, you can directly
call the data from BigQuery through AutoML or Workbench. These AI/ML tools are
part of Vertex AI, Google's unified ML platform.
BigQuery is like a common staging area for data analytics workloads. When your data
is there, business analysts, BI developers, data scientists, and machine learning
engineers can be granted access to your data for their own insights.
Storage and analytics
02
Fast SQL-based
analytical engine
BigQuery provides two services in one. It’s both a fully managed storage facility to
load and store datasets, and also a fast SQL-based analytical engine.
The two services are connected by Google's high-speed internal network. It’s this
super-fast network that allows BigQuery to scale both storage and compute
independently, based on demand.
Let’s look at how BigQuery manages the storage and metadata for datasets.
External data:
○ A Google Cloud storage service (e.g., Cloud Storage)
○ A Google Cloud database (e.g., Spanner or Cloud SQL)
Public datasets
After the data is stored in BigQuery, it’s fully managed and is automatically replicated,
backed up, and set up to autoscale.
[Diagram] Three patterns for loading data into BigQuery: batch load, streaming, and
generated data.
● The first is a batch load, where source data is loaded into a BigQuery table in a
single batch operation. This can be a one-time operation or it can be automated
to occur on a schedule. A batch load operation can create a new table or append
data into an existing table.
● The second is streaming, where smaller batches of data are streamed
continuously so that the data is available for querying in near-real time.
● And the third is generated data, where SQL statements are used to insert rows
into an existing table or to write the results of a query to a table.
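To make the third pattern concrete, here is a minimal sketch of generating data with SQL; the dataset and table names (mydataset.trip_summary, mydataset.raw_trips) are hypothetical:

# Write query results to a new table (the "generated data" pattern);
# the table names here are hypothetical.
CREATE TABLE mydataset.trip_summary AS
SELECT
  start_station_name,
  COUNT(*) AS num_trips
FROM mydataset.raw_trips
GROUP BY start_station_name;

# Or insert rows into an existing table with a DML statement.
INSERT INTO mydataset.trip_summary (start_station_name, num_trips)
VALUES ('Hypothetical Station', 0);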
Of course, the purpose of BigQuery is not to just save data; it’s for analyzing data and
helping to make business decisions.
BigQuery is optimized for running analytic queries over large datasets. It can perform
queries on terabytes of data in seconds and petabytes in minutes. This performance
lets you analyze large datasets efficiently and get insights in near real time.
BigQuery supports:
● Ad hoc analysis
● Geospatial analytics
● Building BI dashboards
By default, BigQuery runs interactive queries, which means that the queries are
executed as needed.
BigQuery also offers batch queries, where each query is queued on your behalf and
the query starts when idle resources are available, usually within a few minutes.
BigQuery demo
03
As any data analyst will tell you, exploring a dataset with SQL is often one of the first
steps to uncover hidden insights.
In this section:
● We’ll show you how to use BigQuery to uncover insights from a public dataset.
● The goal is to find the most popular stations across San Francisco to pick up
bikes.
● You can find this public dataset in BigQuery by following these steps:
○ Navigate to the Google Cloud console > BigQuery > Add data > Public
dataset > Search San Francisco bike share > Under
san_francisco_bikeshare, choose bikeshare_trips
○ Click the three dots beside the dataset to start a query; for more
details, please refer to the video.
● From the schema, you can see that the dataset includes information about the:
○ TripID
○ Trip duration
○ Start station, date
○ End station, date
○ Bike number
○ Subscriber information, etc.
● Take a moment to consider this question: How can you find the most popular
station using SQL?
BigQuery demo
The goal of this code is to find the ten most popular stations, based on the number of
trips that start from that station.
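The query itself isn’t captured in these notes, but a minimal sketch that accomplishes the stated goal against the public dataset might look like this:

SELECT
  start_station_name,
  COUNT(*) AS num_trips
FROM
  `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY
  start_station_name
ORDER BY
  num_trips DESC
LIMIT 10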
Introduction to BigQuery ML
04
Although BigQuery started out solely as a data warehouse, over time it has evolved to
provide features that support the data-to-AI lifecycle.
BigQuery ML
ML project phases
Key ML commands in SQL
In this section of the course, we’ll explore BigQuery’s capabilities for building machine
learning models and the ML project phases, and walk you through the key ML
commands in SQL.
Build a model
01 Export data from your datastore into an
integrated development environment (IDE)
If you’ve worked with ML models before, you know that building and training them can
be very time-intensive.
● You must first export data from your datastore into an IDE (integrated
development environment) such as Jupyter Notebook or Google Colab and
then transform the data and perform your feature engineering steps before
you can feed it into a training model.
● Then finally, you need to build the model in TensorFlow, or a similar library, and
train it locally on a computer or on a virtual machine.
To improve the model performance, you also need to go back and forth to get more
data and create new features. This process will need to be repeated, but it’s so
time-intensive that you’ll probably stop after a few iterations. Also, I just mentioned
TensorFlow and feature engineering; in the past, if you weren’t familiar with these
technologies, ML was left to the data scientists on your team and was not available to
you.
Step 1: Create a model with a SQL statement. Here we can use the bikeshare dataset
as an example.
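The slide’s exact statement isn’t captured in these notes; a minimal sketch consistent with the `numbikes.model` name used in step 2 might look like the following (the label and feature choices are assumptions):

# Step 1: create the model with a single SQL statement.
CREATE OR REPLACE MODEL `numbikes.model`
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['num_trips']) AS
SELECT
  COUNT(*) AS num_trips,                                              # label
  CAST(EXTRACT(DAYOFWEEK FROM start_date) AS STRING) AS day_of_week   # feature (assumed)
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
GROUP BY day_of_week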
# Step 2: the slide crops this query, so everything after COUNT(*) is reconstructed.
SELECT
  predicted_num_trips, num_trips, trip_date
FROM
  ml.PREDICT(MODEL `numbikes.model`, (WITH bike_data AS
  (
    SELECT
      COUNT(*) AS num_trips,
      CAST(EXTRACT(DAYOFWEEK FROM start_date) AS STRING) AS day_of_week,
      EXTRACT(DATE FROM start_date) AS trip_date
    FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
    GROUP BY day_of_week, trip_date
  )
  SELECT * FROM bike_data))
And that’s it! You now have a model and can view the results.
Additional steps might include activities like evaluating the model, but if you know
basic SQL, you can now implement ML; pretty cool!
Hyperparameters
Build a model
Train a model
BigQuery ML was designed to be simple; you can build a model in just two steps. That
simplicity extends to defining the machine learning hyperparameters, which let you
tune the model to achieve the best training result.
Hyperparameters are the settings applied to a model before the training starts, like
the learning rate.
With BigQuery ML, you can either manually control the hyperparameters or let
BigQuery handle them, starting with default hyperparameter settings and then tuning
them automatically.
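As an illustration, here is a hedged sketch of automatic tuning using the num_trials and hparam_range options; the model name is hypothetical, and the training data comes from a public dataset:

# Let BigQuery ML search hyperparameter values automatically across 10 trials.
CREATE OR REPLACE MODEL `mydataset.tuned_model`    # hypothetical name
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['body_mass_g'],
         num_trials = 10,                          # enables automatic tuning
         l1_reg = hparam_range(0, 10)) AS          # search a range, not a fixed value
SELECT
  culmen_length_mm,
  culmen_depth_mm,
  body_mass_g
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE body_mass_g IS NOT NULL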
Classify data: Is an email spam?
Predict a number: Shoe sales for the next three months
Identify patterns and clusters: Grouping photos
When using a structured dataset in BigQuery ML, you need to choose the appropriate
model type. Choosing which type of ML model depends on your business goal and the
datasets.
BigQuery supports supervised models and unsupervised models.
[Slide: model options include Regression (linear regression, DNN regressor
(TensorFlow), AutoML Tables, Wide and Deep NNs), other models such as Anomaly
Detection, and ML Ops options such as importing TensorFlow models for batch
prediction.]
Once you have your problem outlined, it’s time to decide on the best model.
Categories include classification and regression models. There are also other model
options to choose from, along with ML ops.
ML Ops
Options include:
● Importing TensorFlow models for batch prediction
● Exporting models from BigQuery ML for online prediction
● And hyperparameter tuning using Vertex AI Vizier
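As one example of these options, here is a minimal sketch of exporting a trained model to Cloud Storage for online prediction; the model and bucket names are hypothetical:

EXPORT MODEL `mydataset.mymodel`
OPTIONS (URI = 'gs://my-bucket/exported_model')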
05 Using BigQuery ML to
predict customer
lifetime value
Now that you’re familiar with the types of ML models available to choose from, the
next step is to feed high-quality data into those models so they can learn. The best
way to learn the key concepts of machine learning on structured datasets is through
an example.
Lifetime value, or LTV, is a common metric in marketing used to estimate how much
revenue or profit you can expect from a customer given their history and customers
with similar patterns.
[Slide: a breakdown of the hits_product_V2ProductName field, which contains 177
categories.]
We’ll use a Google Analytics ecommerce dataset from Google’s own merchandise
store that sells branded items like t-shirts and jackets.
The goal is to identify high-value customers and bring them to our store with special
promotions and incentives.
Fields
● Customer lifetime pageviews
● Total visits
● Average time spent on the site
● Total revenue
● Ecommerce transactions
Having explored the available fields, you may find some useful in determining whether
a customer is high value based on their behavior on our website.
Remember that in machine learning, you feed in columns of data and let the model
figure out the relationship to best predict the label. It may even turn out that some of
the columns weren’t useful at all to the model in predicting the outcome; you’ll see
later how to determine this.
Labels
Now that we have some data, we can prepare to feed it into the model. Incidentally, to
keep this example simple, we’re only using seven records, but we’d need tens of
thousands of records to train a model effectively.
Before we feed the data into the model, we first need to define our data and columns
in the language that data scientists and other ML professionals use.
Using the Google Merchandise Store example, a record or row in the dataset is called
an example, an observation, or an instance.
A label is a correct answer, and you know it’s correct because it comes from historical
data. This is what you need to train the model on in order to predict future data.
Depending on what you want to predict, a label can be either a numeric variable, which
requires a linear regression model, or a categorical variable, which requires a logistic
regression model.
For example, if we know that a customer who has made transactions in the past and
spends a lot of time on our website often turns out to have high lifetime revenue, we
could use revenue as the label and predict the same for newer customers with that
same spending trajectory.
Knowing what you’re trying to predict, such as a class or a number, will greatly
influence the type of model you’ll use.
Features
But what do we call all the other data columns in the data table?
Those columns are called features, or potential features. Each column of data is like a
cooking ingredient you can use from the kitchen pantry. Too many ingredients,
however, can ruin a dish!
The process of sifting through data can be time consuming. Understanding the quality
of the data in each column and working with teams to get more features or more
history is often the hardest part of any ML project.
Feature engineering
Calculated fields
You can even combine or transform feature columns in a process called feature
engineering.
If you’ve ever created calculated fields in SQL, you’ve already executed the basics of
feature engineering.
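For instance, a minimal sketch of an engineered feature built as a calculated field; the table name is hypothetical, and the columns echo the LTV example:

SELECT
  fullVisitorID,
  ltv_pageviews,
  ltv_visits,
  ltv_pageviews / NULLIF(ltv_visits, 0) AS pageviews_per_visit  # engineered feature
FROM `mydataset.customer_history`  # hypothetical table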
BigQuery ML
BigQuery ML does much of the hard work for you, like automatically one-hot encoding
categorical values. One-hot encoding is a method of converting categorical data to
numeric data to prepare it for model training. From there, BigQuery ML automatically
splits the dataset into training data and evaluation data.
Let’s say new data comes in that you don’t have a label for, so you don’t know whether
it is for a high-value customer. You do, however, have a rich history of labeled
examples for you to train a model on.
Predicting results!
Historical Training Data (Known LTV)
[Table columns: fullVisitorID, distinct_days_visited, ltv_pageviews, ltv_visits,
ltv_avg_time_on_site_s, ltv_revenue, ltv_transactions, label]
So if we train a model on the known historical data, and are happy with the
performance, then we can use it to predict on future datasets!
BigQuery ML project phases
06
01 Extract, transform, and load data into BigQuery
● Easy connectors to Google products.
● Use SQL joins.
In phase 1, you extract, transform, and load data into BigQuery, if it isn’t there
already. If you’re already using other Google products, like YouTube for example, look
out for easy connectors to get that data into BigQuery before you build your own
pipeline. You can enrich your existing data warehouse with other data sources by
using SQL joins.
In phase 2, you select and preprocess features. You can use SQL to create the
training dataset for the model to learn from. You’ll recall that BigQuery ML does some
of the preprocessing for you, like one-hot encoding of your categorical variables.
One-hot encoding converts your categorical data into numeric data that is required by
a training model.
01 Extract, transform, and load data into BigQuery
02 Select and preprocess features
03 Create the model inside BigQuery
04 Evaluate the performance of the trained model

Use the “CREATE MODEL” command:

#standardSQL
CREATE MODEL `ecommerce.classification`
OPTIONS
(
  model_type = 'logistic_reg',
  input_label_cols = ['will_buy_later']
) AS
# training query (SELECT ...) follows
In phase 3, you create the model inside BigQuery. This is done by using the “CREATE
MODEL” command. Give it a name, specify the model type, and pass in a SQL query
with your training dataset. From there, you can run the query.
05 Use the model to make predictions
# SQL query with evaluation data
In phase 4, after your model is trained, you can execute an ML.EVALUATE query to
evaluate the performance of the trained model on your evaluation dataset. It’s here
that you can analyze loss metrics like root mean squared error for forecasting
models, and area under the curve, accuracy, precision, and recall for classification
models. We’ll explore these metrics later in the course.
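A minimal sketch of that query, reusing the ecommerce.classification model created in phase 3 (omitting the optional evaluation-data argument, so BigQuery ML uses its automatic evaluation split):

SELECT
  *
FROM
  ML.EVALUATE(MODEL `ecommerce.classification`)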
05 Use the model to make predictions
# SQL query with test data
In phase 5, the final phase, when you’re happy with your model performance, you can
then use it to make predictions. To do so, invoke the ml.PREDICT command on your
newly trained model to return predictions along with the model’s confidence in those
predictions. In the results, your label field will have “predicted_” added to the front of
the field name. This is your model’s prediction for that label.
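A minimal sketch of such a prediction query, again using the ecommerce.classification model; the table of unlabeled visitors is hypothetical:

SELECT
  *  # the output includes a predicted_will_buy_later column
FROM
  ml.PREDICT(MODEL `ecommerce.classification`,
    (SELECT * FROM `ecommerce.new_visitors`))  # hypothetical unlabeled data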
BigQuery ML
key commands
07
Now that you’re familiar with the key phases of an ML project, let’s explore some of
the key commands of BigQuery ML.
You’ll remember from earlier that you can create a model with just the CREATE
MODEL command.
If you want to overwrite an existing model, use the CREATE OR REPLACE MODEL
command.
Models have OPTIONS, which you can specify. The most important, and the only one
required, is the model type.
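Putting those pieces together, a minimal sketch with hypothetical dataset, table, and column names:

CREATE OR REPLACE MODEL `mydataset.mymodel`    # overwrites the model if it already exists
OPTIONS (model_type = 'logistic_reg',          # model_type is the only required option
         input_label_cols = ['label']) AS
SELECT
  feature_1,
  feature_2,
  label
FROM `mydataset.training_data`  # hypothetical training table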
SELECT
  category,
  weight
FROM
  UNNEST((
    SELECT
      category_weights
    FROM
      ML.WEIGHTS(MODEL `bracketology.ncaa_model`)
    WHERE
      processed_input = 'seed'))  # try other features like 'school_ncaa'

Output: each feature has a weight from -1 to 1. The closer to 0, the less important the
feature is to the prediction; the closer to -1 or 1, the more important it is.
You can inspect what a model learned with the ML.WEIGHTS command and filtering
on an input column.
The output of ML.WEIGHTS is a numerical value, and each feature has a weight from
-1 to 1. That value indicates how important the feature is for predicting the result, or
label. If the number is closer to 0, the feature isn't important for the prediction.
However, if the number is closer to -1 or 1, then the feature is more important for
predicting the result.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bracketology.ncaa_model`)

You can evaluate a trained model’s performance with the ML.EVALUATE command, as
shown here. And if you want to make batch predictions, you can use the ML.PREDICT
command on a trained model, passing in the dataset you want to make predictions on.
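A hedged sketch of that batch prediction, reusing the bracketology.ncaa_model from above; the input table is hypothetical:

SELECT
  *
FROM
  ML.PREDICT(MODEL `bracketology.ncaa_model`,
    (SELECT * FROM `bracketology.new_matchups`))  # hypothetical table to predict on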
Features: the data columns that are part of your SELECT statement, after your CREATE MODEL statement.

SELECT * FROM ML.FEATURE_INFO(MODEL `mydataset.mymodel`)

Model object: an object created in BigQuery that resides in your BigQuery dataset.
Lab: Predicting purchases with BigQuery ML
08
Hands-on lab
[Diagram] Visitor data and order data from shop.googlemerchandisestore.com are
loaded into BigQuery ML.
Now it’s time to get some hands-on practice building a machine learning model in
BigQuery.
In the next lab, you’ll use ecommerce data from the Google Merchandise Store
website: https://shop.googlemerchandisestore.com/
The site’s visitor and order data have been loaded into BigQuery, and you’ll build a
machine learning model to predict whether a visitor will return for more purchases
later.
09 Summary
Well done on completing another lab! Hopefully you now feel more comfortable
building custom machine learning models with BigQuery ML!
BigQuery
● Fully managed storage facility for datasets
● Fast SQL-based analytical engine
Our focus was on BigQuery, the data warehouse that provides two services in one. It’s
a fully-managed storage facility for datasets, and a fast SQL-based analytical engine.
[Diagram] Real-time (streaming) data is ingested through Pub/Sub and batch data
through Cloud Storage; both feed into Dataflow for processing and then into BigQuery.
BigQuery’s outputs flow to BI tools (Looker, Looker Studio, Tableau, Google Sheets)
and to AI/ML tools (AutoML, Vertex AI Workbench).
BigQuery sits between data processes and data uses, like a common staging area. It
gets data from ingestion and processing and outputs data to BI tools such as Looker
and Looker Studio and ML tools such as Vertex AI.
After the data is in BigQuery, business analysts, BI developers, data scientists, and
machine learning engineers can be granted access to the data for their own insights.
BigQuery ML
[Slide: the five BigQuery ML project phases]
In phase 1, you extract, transform, and load data into BigQuery, if it isn’t there
already.
In phase 2, you select and preprocess features. You can use SQL to create the
training dataset for the model to learn from.
In phase 3, you create the model inside BigQuery with the CREATE MODEL command.
In phase 4, after your model is trained, you can execute an ML.EVALUATE query to
evaluate the performance of the trained model on your evaluation dataset.
And in phase 5, the final phase, when you’re happy with your model performance, you
can then use it to make predictions.
Quiz
Question #1
Question
BigQuery is a fully managed data warehouse. What does “fully managed” refer to?
A. BigQuery manages the data quality for you.
B. BigQuery manages the underlying structure for you.
C. BigQuery manages the cost for you.
D. BigQuery manages the data source for you.
Question #1
Answer
BigQuery is a fully managed data warehouse. What does “fully managed” refer to?
A. BigQuery manages the data quality for you.
B. BigQuery manages the underlying structure for you. (correct answer)
C. BigQuery manages the cost for you.
D. BigQuery manages the data source for you.
Question #2
Question
Question #2
Answer
Question #3
Question
Which pattern describes source data that is moved into a BigQuery table in a single
operation?
A. Batch load
B. Streaming
C. Spot load
D. Generated data
Question #3
Answer
Which pattern describes source data that is moved into a BigQuery table in a single
operation?
A. Batch load (correct answer)
B. Streaming
C. Spot load
D. Generated data
Question #4
Question
Which BigQuery feature leverages geography data types and standard SQL geography
functions to analyze a data set?
A. Ad hoc analysis
B. Geospatial analysis
C. Building machine learning models
D. Building business intelligence dashboards
Question #4
Answer
Which BigQuery feature leverages geography data types and standard SQL geography
functions to analyze a data set?
A. Ad hoc analysis
B. Geospatial analysis (correct answer)
C. Building machine learning models
D. Building business intelligence dashboards
Question #5
Question
Data is loaded into BigQuery, and the features have been selected and preprocessed.
What’s next when using BigQuery ML?
A. Use the ML model to make predictions.
B. Create the ML model inside BigQuery.
C. Evaluate the performance of the trained ML model.
D. Classify labels to train on historical data.
Question #5
Answer
Data is loaded into BigQuery, and the features have been selected and preprocessed.
What’s next when using BigQuery ML?
A. Use the ML model to make predictions.
B. Create the ML model inside BigQuery. (correct answer)
C. Evaluate the performance of the trained ML model.
D. Classify labels to train on historical data.
Question #6
Question
In a supervised machine learning model, what provides historical data that can be used
to predict future data?
A. Labels
B. Examples
C. Data points
D. Features
Question #6
Answer
In a supervised machine learning model, what provides historical data that can be used
to predict future data?
A. Labels (correct answer)
B. Examples
C. Data points
D. Features
Question #7
Question
You want to use machine learning to group random photos into similar groups. Which
should you use?
A. Unsupervised learning, cluster analysis
B. Supervised learning, linear regression
C. Supervised learning, logistic regression
D. Unsupervised learning, dimensionality reduction
Question #7
Answer
You want to use machine learning to group random photos into similar groups. Which
should you use?
A. Unsupervised learning, cluster analysis (correct answer)
B. Supervised learning, linear regression
C. Supervised learning, logistic regression
D. Unsupervised learning, dimensionality reduction
Question #8
Question
You want to use machine learning to identify whether an email is spam. Which should
you use?
A. Unsupervised learning, cluster analysis
B. Supervised learning, linear regression
C. Supervised learning, logistic regression
D. Unsupervised learning, dimensionality reduction
Question #8
Answer
You want to use machine learning to identify whether an email is spam. Which should
you use?
A. Unsupervised learning, cluster analysis
B. Supervised learning, linear regression
C. Supervised learning, logistic regression (correct answer)
D. Unsupervised learning, dimensionality reduction