GCP Data Engineer [1]
v20190116

[1] by [email protected], modified based on 'James' GCP Dotpoints' with various online sources, which are acknowledged at the end; please share under the Creative Commons Attribution 3.0 Australia License. Let me know if there is any mistake.

1. Exam overview

The exam consists of 50 questions that must be answered in 2 hours.

The content includes:

• Storage (20% of questions),
• Big Data Processing (35%),
• Machine Learning (18%),
• case studies (15%) and
• others (Hadoop and security, about 12%).

All the questions are scenario simulations where you have to choose which option would be the best way to deal with the situation.

2. Big Data Ecosystem [2]

particularly focusing on Apache Pig, Hive, Spark, and Beam

[2] refer to https://hadoopecosystemtable.github.io

- Hadoop
  - open-source MapReduce framework
  - the underlying technology for Dataproc
- HDFS
  - Hadoop File System
- Pig
  - scripting language that compiles into MapReduce jobs
  - Procedural Data Flow Language: PigLatin
  - less development effort and better code efficiency than raw MapReduce
  - does not have any notion of partitions
  - supports Avro, whereas Hive does not
  - does not provide ACID or relational data properties
- Hive
  - data warehousing system and query language
  - SQL-like
- Spark
  - fast, interactive, general-purpose framework for SQL, streaming, machine learning, etc.
  - solves similar problems to Hadoop MapReduce but with a fast in-memory approach
- Sqoop
  - transfers data between Hadoop and structured (relational) datastores
  - Sqoop imports data from a relational database system or a mainframe into HDFS
  - Running Sqoop on a Dataproc Hadoop cluster gives you access to the built-in Google Cloud Storage connector
  - The two previous points mean you can use Sqoop to import data directly into Cloud Storage
- Oozie
  - workflow scheduler system to manage Apache Hadoop jobs
  - Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions
- Cassandra
  - wide-column store based on ideas from BigTable and DynamoDB (Datastore)
  - a solution for problems where one of your requirements is a very heavy write system and you want a quite responsive reporting system on top of that stored data
  - an available, partition-tolerant system that supports eventual consistency
- MongoDB
  - fits use cases where your system demands a schema-less document store
- HBase
  - wide-column store
  - might fit search engines, analyzing log data, or any place where scanning huge, two-dimensional join-less tables is a requirement
- Redis
  - built to provide in-memory search for varieties of data structures like trees, queues, linked lists, etc., and can be a good fit for real-time leaderboards and pub-sub kinds of systems
- MySQL
  - You can easily add nodes to MySQL Cluster Data Nodes and build a cube to do OLAP (answer for a mock question)

3. Storage

Cloud Storage / Cloud SQL / Datastore / BigTable / BigQuery.
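
As a small illustration of the MapReduce-style processing that Spark handles in memory, here is a minimal PySpark word-count sketch; the gs:// paths are hypothetical, and on GCP this is the kind of job you would submit to a Dataproc cluster:

```python
from pyspark.sql import SparkSession

# Hypothetical gs:// paths; Dataproc clusters can read and write GCS directly
# through the built-in Cloud Storage connector.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("gs://my-analytics-staging/raw/notes.txt")

counts = (
    lines.flatMap(lambda line: line.split())  # map: one record per word
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)      # reduce: sum the counts per word
)

counts.saveAsTextFile("gs://my-analytics-staging/out/wordcount")
spark.stop()
```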

3.1. Cloud Storage (GCS)

- Blob storage. Upload any bytes to a location. The content isn't indexed at all, just stored. (comparable to Amazon S3)
- Virtually unlimited storage
- Nearline and Coldline: ~1 sec lookup for access, charged for the volume of data accessed
  - Nearline for access about once per month
  - Coldline for access about once per year
- buckets to segregate storage items
- geographical separation:
  - persistent, durable, replicated
  - spread data across zones to minimise the impact of service disruptions
  - spread data across regions to provide global access to data
- Ideal for storing data, but not for a high volume of reads/writes (e.g. sensor data)
- A way to store data that can be commonly used by Dataproc and BigQuery

Encryption:
- Google Cloud Platform encrypts customer data stored at rest by default
- Encryption options:
  - Server-side encryption:
    - Customer-supplied encryption keys: you can create and manage your own encryption keys for server-side encryption
    - Customer-managed encryption keys: you can generate and manage your encryption keys using Cloud Key Management Service
  - Client-side encryption: encryption that occurs before data is sent to Cloud Storage
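
As a concrete illustration of the blob model, here is a minimal sketch using the google-cloud-storage Python client; the project, bucket and object names are made up, and the default server-side encryption with Google-managed keys is assumed:

```python
from google.cloud import storage

# Hypothetical project/bucket/object names. Data at rest is encrypted with
# Google-managed keys by default; no extra configuration is needed.
client = storage.Client(project="my-project")
bucket = client.bucket("my-analytics-staging")

blob = bucket.blob("raw/events-2019-01-16.csv")
blob.upload_from_filename("events-2019-01-16.csv")  # upload a local file as a blob

# The same object can later be read by Dataproc or queried by BigQuery as a
# federated (external) data source via its gs:// URI.
print(f"gs://{bucket.name}/{blob.name}")
```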

3.2. Cloud SQL & Spanner

- Managed/no-ops relational database (MySQL and PostgreSQL), like Amazon RDS
- best for gigabytes of data with a transactional nature
- Low latency
- Doesn't scale well beyond GBs
- Spanner is a distributed and scalable solution for RDBMS, however it is also more expensive
- Management:
  - managed backups & automatic replication
  - fast connection with GCE/GAE
  - uses Google security
  - flexible pricing, pay for what you use

3.3. BigTable

Feature:
- Stored on Google's internal store Colossus
- no transactional support (so it can handle petabytes of data)
- not relational (no SQL or joins), ACID only at row level
  - avoid schema designs that require atomicity across rows
- high throughput: throughput grows linearly with node count if correctly balanced
- work with it using the HBase API
- no-ops, auto-balanced, replicated, compacted
- learns about your access patterns and will adjust the metadata stored in nodes in order to try to balance your workloads

Query:
- Single-key lookup. No property search.
- Stored lexicographically in big-endian format, so keys can be anything
- quick range lookup

Performance:
- Fast to petabyte scale; not a good solution for storing less than 1 TB of data
- Low-latency read/write access
- High-throughput analytics
- Native time-series support
- for large analytical and operational workloads
- designed for sparse tables
- knowledge of the data structures and underlying infrastructure is required

Key design:
- Design your keys for how you intend to query.
- If your most common query is for the most recent data, use a reverse date stamp at the end of the key.
- Ensure your keys are evenly distributed to avoid hotspotting. This is why a date stamp as a key, or at the start of a key, is bad practice: all of the most recent data is being written at the same time.
  - For historical data analytics, the hotspot issue may not be the biggest concern.
- For time-series data, use tall/narrow tables. Denormalize: prefer multiple tall and narrow tables.
- avoid hotspotting:
  - Field promotion (preferred): move fields from the column data into the row key to make writes non-contiguous.
  - Salting (only where field promotion does not resolve it): add an additional calculated element to the row key to artificially make writes non-contiguous.

Performance test:
- This takes minutes to hours and requires at least 300 GB of data
- use a production instance
- stay below the recommended storage utilization per node
- before you test, run a heavy pre-test for several minutes
- run your test for at least 10 minutes
- It is recommended to create your Compute Engine instance in the same zone as your Cloud Bigtable instance for the best possible performance (when using Cloud Bigtable with a Compute Engine-based application)

Data update:
- When querying, BigTable selects the most recent value that matches the key.
  - This means that when deleting/updating we actually write a new row with the desired data, and compaction can remove the deleted row later. This means that deleting data will temporarily increase the disk usage.
- append only (cannot update a single field like in CSQL/CDS)
- tables should be tall and narrow (store changes by appending new rows - tall; collapse flags into a single column - narrow)

Group columns:
- group columns of data you are likely to query together (for example address fields, first/last name and contact details)
- use short column names, organised into column families (groups, e.g. MD:symbol, MD:price)

Periodic compaction:
- BigTable periodically compacts the data for you. This means reorganising and removing deleted records once they're no longer needed.

Sorted String Tables:
- BigTable relies on Sorted String Tables to organise the data
- these are immutable, pre-sorted key/value pairs of strings or protobufs

Size limits:
- a single row key: 4 KB (soft limit?)
- a single value in a table cell: 100 MB
- all values in a single row: 256 MB

Access control:
- you can configure access control at the project level and the instance level (the lowest IAM resource you can control)

Architecture:
- A Cloud Bigtable instance is mostly just a container for your clusters and nodes, which do all of the real work.
- Tables belong to instances, not to clusters or nodes. So if you have an instance with up to 2 clusters, you can't assign tables to individual clusters.

Production & Development:
- Production: a standard instance with either 1 or 2 clusters, as well as 3 or more nodes in each cluster
  - use replication to provide high availability
- Development: a low-cost instance for development and testing, with performance limited to the equivalent of a 1-node cluster

Tools:
- cbt is a tool for doing basic interactions with Cloud Bigtable
- the HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables
- you can update any of the following settings without any downtime:
  - number of clusters / replication settings
  - upgrading a development instance to a production one (permanent)
- Impossible to switch between SSD and HDD; instead:
  - export the data from the existing instance and import the data into a new instance,
  - OR write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
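
To make the key-design advice concrete, here is a minimal sketch using the google-cloud-bigtable Python client; the instance, table, column family and the reverse-timestamp key scheme are illustrative assumptions, not part of the original notes:

```python
import datetime
from google.cloud import bigtable

# Hypothetical instance/table names. The row key uses field promotion
# (device id first) plus a reverse timestamp, so the most recent readings
# sort first without all writes hammering one contiguous key range.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("sensor-instance").table("sensor-readings")

def reverse_ts(ts: datetime.datetime) -> int:
    # Newer timestamps produce smaller numbers, so the newest rows sort first.
    return int((datetime.datetime(2100, 1, 1) - ts).total_seconds())

row_key = f"device#thermostat-42#{reverse_ts(datetime.datetime.utcnow())}".encode()

row = table.direct_row(row_key)
row.set_cell("md", "temp", b"21.5")  # short column family / qualifier names
row.commit()
```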

3.4. Datastore

- built on top of BigTable
- not consistent for every row (eventual consistency)
- a document DB for non-relational data
- Suitable for:
  - atomic transactions: can execute a set of operations where either all succeed, or none occur
  - ACID transactions, SQL-like queries
  - structured data
  - hierarchical document storage such as HTML

Query:
- can search by keys or properties (if indexed)
- key lookups somewhat like Amazon DynamoDB
- allows SQL-like querying down to property level
- does not support complex joins with multiple inequality filters

Performance:
- fast to terabyte scale, low latency
- quick reads, slow writes, as it relies on indexing every property (by default) and must update the indexes as updates/writes occur

RDBMS/Cloud SQL/Spanner vs. Datastore:
- Row ↔ Entity
- Tables ↔ Kind
- Fields ↔ Property
- Column values must be consistent between entities ↔ Properties can vary between entities
- Structured relational data ↔ Structured hierarchical data (html, xml)

Errors and error handling:
- UNAVAILABLE, DEADLINE_EXCEEDED: retry using exponential backoff
- INTERNAL: do not retry this request more than once
- Other: do not retry without fixing the problem

3.5. BigQuery

Feature:
- fully managed data warehouse
- for analytics; serverless
- an alternative to Hadoop with Hive
- has connectors for BigTable, GCS and Google Drive, and can import from Datastore backups, CSV, JSON and Avro

Performance:
- petabyte scale
- high latency; used more for analytics than for low-latency rapid lookups like an RDBMS such as Cloud SQL or Spanner

Query:
- Standard SQL (preferred) or Legacy SQL (old)
  - cannot use both Legacy and SQL2011 in the same query
- table partitioning
- distributed writing to files for output, e.g. `file-0001-of-0002`
- user-defined functions in JavaScript (UDFs)
- query jobs are actions executed asynchronously to load, export, query, or copy data
- if you use the LIMIT clause, BigQuery will still process the entire table
- avoid SELECT * (full scan); select only the columns needed (SELECT * EXCEPT)
- benefits of using denormalized data:
  - increases query speed
  - makes queries simpler
  - improves query performance and reduces costs
  - BUT: normalizing makes the dataset better organized, although less performance-optimized

Types of queries:
- Interactive: the query is executed immediately; counts toward daily/concurrent usage (default)
- Batch: batches of queries are queued and each query starts when idle resources are available; only counts toward daily usage, and switches to interactive if idle for 24 hours

Data import:
- batch (free): web console (local files), GCS, GDS
- stream (costly): data from Cloud Dataflow, Cloud Logging or POST calls
- raw files: federated data source, CSV/JSON/Avro on GCS, Google Sheets
- Google Drive:
  - loading data into BigQuery from Google Drive is not currently supported,
  - but you can query data in Google Drive using an external table
- by default, the BigQuery service expects all source data to be UTF-8 encoded
  - JSON files must always be encoded in UTF-8
- to support (occasionally) changing schemas you can use 'Automatically detect' for schema changes; automatic detection is not selected by default
- you cannot use the Web UI to:
  - upload a file greater than 10 MB in size
  - upload multiple files at the same time
  - upload a file in SQL format

Partitions:
- You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch.
- two types of partitioned tables:
  - Ingestion time: tables partitioned based on the data's ingestion (load) or arrival date. Each such table has the pseudo column _PARTITIONTIME, the time the data was loaded into the table. Pseudo columns are reserved for the table and cannot be used by the user.
  - Partitioned tables: tables that are partitioned based on a TIMESTAMP or DATE column.
- wildcard tables:
  - used if you want to union all similar tables with similar names: '*' (e.g. `project.dataset.Table*`)
- partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table
  - it can be used to query specific partitions in the WHERE clause

Windowing:
- window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a dataset, by providing complex operations without the need for many intermediate calculations
- they reduce the need for intermediate tables to store temporary data

Bucketing:
- like partitioning, but each split/partition should be the same size and is based on the hash of a column. Each bucket is a separate file, which makes for more efficient sampling and joining of data.

Legacy vs. standard SQL [3]:
- `project.dataset.tablename*`
- the dialect is set each time you run a query
- the default query language is:
  - Legacy SQL for the classic UI
  - Standard SQL for the Beta UI

[3] https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql

Anti-patterns:
- avoid self-joins
- partition/skew: avoid unequally sized partitions, or a value that occurs more often than any other value
- cross-join: avoid joins that generate more outputs than inputs
- update/insert single row/column: avoid point-specific DML; batch updates and inserts instead
- anti-patterns and schema design: https://cloud.google.com/bigtable/docs/schema-design

Access control:
- security can be applied at the project and dataset level, but not at the table or view level
- the three types of resources in BigQuery are organizations, projects, and datasets
- authorized views allow you to share query results with particular users/groups without giving them access to the underlying data
  - can be used to restrict access to particular columns or rows
  - create a separate dataset to store the view

Billing:
- based on storage (amount of data stored), querying (amount of data / number of bytes processed by a query), and streaming inserts
- storage options are active and long-term (modified or not within the past 90 days)
- query options are on-demand and flat-rate

Table types:
- Native tables: tables backed by native BigQuery storage
- External tables: tables backed by storage external to BigQuery (also known as a federated data source). For more information, see Querying External Data Sources.
- Views: virtual tables defined by a SQL query. For more information, see Using Views.

Caching:
- there is no charge for a query that retrieves its results from cache
- BigQuery caches query results for 24 hours
- by default, a query's results are cached unless:
  - a destination table is specified
  - any of the referenced tables or logical views have changed since the results were previously cached
  - any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table), even if no new rows have arrived
  - the query uses non-deterministic functions such as CURRENT_TIMESTAMP(), NOW() or CURRENT_USER()
  - you are querying multiple tables using a wildcard
  - the query runs against an external data source

Export:
- data can only be exported as JSON / CSV / Avro
- the only compression option available is GZIP
  - GZIP compression is not supported for Avro exports
- to export more than 1 GB of data, you need to put a wildcard in the destination filename (up to 1 GB of table data per file)

More: https://cloud.google.com/bigquery/docs/ and https://github.com/jorwalk/data-engineering-gcp/blob/master/know/bigquery.md
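
Tying the query and partitioning notes above together, here is a minimal sketch of querying a specific partition range, assuming the google-cloud-bigquery Python client; the project, dataset, table and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Filtering on _PARTITIONTIME prunes partitions, so far less data is scanned
# (and billed) than a full-table scan; note the explicit column list.
sql = """
    SELECT user_id, event_type, event_ts
    FROM `my-project.analytics.events_ingested`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-01-01')
                             AND TIMESTAMP('2019-01-07')
      AND event_type = 'purchase'
"""
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)  # Standard SQL
query_job = client.query(sql, job_config=job_config)        # asynchronous query job

for row in query_job.result():  # result() waits for the job to finish
    print(row.user_id, row.event_type, row.event_ts)
```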

4. Big Data Processing

Covering knowledge about BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab and Cloud Pub/Sub (e.g. Bigtable for low-latency use and BigQuery for data exploration).

4.1. App Engine

- run code on managed instances of machines with automated scaling and deployment
- handles sudden and extreme spikes of traffic which require immediate scaling

4.2. GCP Compute

- VMs
- pre-emptible instances (up to 80% discount, but they can be taken away)
- allocated on demand, and you only pay for the time they are up

4.3. Dataflow

Feature:
- executes Apache Beam pipelines (no-ops; Beam could also use Spark or Flink as a runner)
- can be used for batch or stream data
- scalable, fault-tolerant, multi-step processing of data
- often used for data preparation/ETL for data sets
  - filter, group, transform
- pipelines and how they work for ETL

Data sources:
- The Cloud Dataflow connector for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.
- can read data from multiple sources and can kick off multiple cloud functions in parallel, writing to multiple sinks in a distributed fashion
- pipelines cannot share data or transforms

Windowing:
- can apply windowing to streams, e.g. a rolling average over the window, the max in a window, etc.
- window types:
  - Fixed time windows
  - Sliding time windows (overlapping)
  - Session windows
  - Single global window
- the default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections

Triggers:
- a trigger determines when a window's contents should be output, based on certain criteria being met
- allows specifying a trigger to control when (in processing time) results for the given window can be produced
- if unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late-arriving data
- Time-based triggers:
  - Event time triggers: operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event-time based.
  - Processing time triggers: operate on the processing time, i.e. the time when the data element is processed at any given stage in the pipeline.
- Data-driven triggers: operate by examining the data as it arrives in each window, and firing when that data meets a certain property.
  - Currently, data-driven triggers only support firing after a certain number of data elements.
- Composite triggers: combine multiple triggers in various ways.

Tech:
- PCollection: an abstraction that represents a potentially distributed, multi-element data set and acts as the pipeline's data
- A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
- DirectPipelineRunner allows you to execute operations in the pipeline directly and locally
- create a cron job with the Google App Engine Cron Service to run a Cloud Dataflow job on a schedule

IAM:
- the dataflow.developer role enables the developer to interact with the Cloud Dataflow job, with data privacy
- the dataflow.worker role provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline

Pipeline update:
- with Update, you replace an existing pipeline in place with a new one and preserve Dataflow's exactly-once processing guarantee
- when updating a pipeline manually, use DRAIN instead of CANCEL to maintain in-flight data
  - the Drain command is supported for streaming pipelines only

Key things to focus on:
- event vs. processing time
- configuring ETL pipelines (see the windowed pipeline sketch below)
- how to integrate with BigQuery
- constraints you might have
- why you would use JSON or Java related to pipelines
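
A minimal sketch of a windowed streaming pipeline of the kind described above, assuming the Apache Beam Python SDK; the topic names are hypothetical, and the DirectRunner keeps it local (DataflowRunner plus project/region options would run it on Dataflow):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True, runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: (msg.decode("utf-8").split(",")[0], 1))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60 s fixed windows
        | "Count" >> beam.CombinePerKey(sum)                    # per-key count per window
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
        | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/counts")
    )
```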


4.4. Cloud Pub/Sub

Feature:
- serverless messaging
- decouples producers and consumers of data in large organisations / complex systems
- the glue that connects all the components
- order not guaranteed
- push and pull delivery
  - pull is the more efficient message delivery/consumption mechanism

Basic concepts:
- topics
- subscriptions

Message flow:
- a publisher creates a topic in the Cloud Pub/Sub service and sends messages to the topic
- messages are persisted in a message store until they are delivered and acknowledged by subscribers
- the Pub/Sub service forwards messages from a topic to all of its subscriptions, individually; each subscription receives messages by either push or pull
- the subscriber receives pending messages from its subscription and acknowledges each message
- when a message is acknowledged by the subscriber, it is removed from the subscription's message queue

Deduplication:
- maintain a database table to store the hash value and other metadata for each data entry
- Cloud Pub/Sub assigns a unique `message_id` to each message, which can be used to detect duplicate messages received by the subscriber
- lots of duplicate messages may happen when the endpoint is not acknowledging messages within the acknowledgement deadline

Asynchronous processing benefits:
- availability (buffer during outages)
- change management
- throughput (balance load among workers)
- unification (cross-organisational)
- latency (accept requests at the edge of the network)
- consistency
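
A minimal sketch of the publish/acknowledge flow above, assuming the google-cloud-pubsub Python client; the project, topic and subscription names are hypothetical:

```python
from google.cloud import pubsub_v1

project = "my-project"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "events")

# publish() returns a future whose result is the server-assigned message_id,
# the same id a subscriber could record to de-duplicate redeliveries.
future = publisher.publish(topic_path, b"order:42,purchase", source="checkout")
print("published message_id:", future.result())

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project, "events-sub")

def callback(message):
    print(message.message_id, message.data, dict(message.attributes))
    message.ack()  # unacknowledged messages are redelivered after the ack deadline

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # pull messages for ~30 s, then stop
except Exception:
    streaming_pull.cancel()
```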


4.5. Dataproc

Feature:
- Dataproc is a managed (no-ops) Hadoop cluster on GCP (i.e. managed Hadoop, Pig, Hive, Spark programs)
- automated cluster management and resizing
- code/query only
- job management screen in the console
- think in terms of a 'job-specific resource': for each job, create a cluster and then delete it
- used if migrating existing on-premise Hadoop or Spark infrastructure to Google Cloud Platform without redevelopment effort

Storage:
- can store data on disk (HDFS) or can use GCS
- GCS allows the use of preemptible machines, which can reduce costs significantly
- store in HDFS (split up on the cluster, but requires the cluster to be up) or in GCS (separate cluster and storage)

Customize the software:
- set initialization actions
- modify configuration files using cluster properties
- log into the master node and make changes from there
- NOT via Cloud Deployment Manager

Tech:
- when creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and zone
- you can access the YARN web interface by configuring a browser to connect through a SOCKS proxy
- you can SSH directly to the cluster nodes
- you can use a SOCKS proxy to connect your browser through an SSH tunnel
- the YARN ResourceManager and the HDFS NameNode interfaces are available on the master node

Billing:
- billed by the second. All Cloud Dataproc clusters are billed in one-second clock-time increments, subject to a 1-minute minimum billing.

IAM:
- service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role)
- they need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging

4.6. Dataprep

- managed Trifacta for preparing, analysing the quality of, and transforming the input data
- a service for visually exploring, cleaning, and preparing data for analysis; can transform data of any size stored in CSV, JSON, or relational-table formats

4.7. Cloud Functions

- Node.js functions as a service
- no ops, no server: just a code entry point and response, autoscaled by GCP
- can be triggered by Dataflow, GCS bucket events, Pub/Sub messages and HTTP calls

5. Machine Learning [4]

Covering knowledge of the GCP APIs (Vision API, Speech API, Natural Language API and Translate API) and TensorFlow.

• Embeddings
• Deploying models
• TensorFlow cheat sheet and terminology

Understand the different ML services available: ML Engine, ML APIs, and TensorFlow, as well as the relevance of Cloud Datalab. The questions are mostly about the ML domain (and not TensorFlow-specific), basically about training; nothing about the Cloud ML service itself.

[4] This part is not well organised yet.

5.1. ML Terms

- Label: the correct classification/value
- Input: predictor variables
- Example: an input + label sample used to train your model
- Model: a mathematical function; some work is done on inputs to produce an output
- Training: adjusting variable weights in a model to minimise error
- Prediction: using the model to guess the label for an input
- Supervised learning: training your model using example data to predict future data
- Unsupervised learning: data is analysed without labels, for patterns or clusters
- Neuron: a way to combine inputs and weight them to make a decision (one unit of input combination)
- Gradient descent: the process of iteratively testing error in order to minimise it; the error decreases towards a minimum. (This can be a global or a local minimum, and starting points and learning rates are important to ensure the process doesn't stop at a local minimum. Too low a learning rate and your model will train slowly; too high and it may miss the minimum.)
- Hidden layer: a set of neurons that act on the same input data
- Features: the data values/fields you choose to model; these can be transformed (x^2, y^2, etc.)
- Feature engineering: the process of building a set of feature combinations to act on inputs
- Precision: the positive predictive value; how many of the things predicted as a given class (e.g. cat) really were that class
- Recall: the true positive rate; how many of the things that ARE in the class (the actual cats) were found
  - e.g. only recognised 1/10 cats, but every one it flagged was right: 100% precision, 10% recall

Recommendation engine:
- cluster similar users: User A and User B both rate House Z as a 4
- cluster similar items (products, houses, etc.): most users rate House Y as a 2
- combine these two to produce a rating

5.2. ML Basics

- There are two main stages of ML, training and inference. Inference is often predictive in nature.
- Common models tend to include regression (what value?) and classification (what category?)
- Converting inputs to vectors for analysis is a well-documented problem that can actually be improved with the use of ML itself. Some public models already exist, including https://en.wikipedia.org/wiki/Word2vec
- Initial weight selection is hard to get right; human intuition can often help with this starting point for gradient descent
- Weights are iteratively tweaked: initial weights -> calculate error -> adjust weights -> recalculate error -> repeat until the error is minimised
- In image ML each pixel is represented by a number, so vectorising images is actually much easier than text, where you have to recognise that Man is to Woman as Boy is to Girl, so the vectors need to have similar magnitude differences
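
A toy sketch (not from the notes) of the weight-update loop described above: start from an initial weight, compute the error, step the weight along the negative gradient, and repeat:

```python
# Fit y = w * x to a few made-up points by minimising the sum of squared errors.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, roughly y = 2x

w = 0.0               # initial weight (the starting point matters)
learning_rate = 0.05  # too low trains slowly, too high may overshoot the minimum

for step in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in data)  # d(error)/dw
    w -= learning_rate * grad                         # step downhill along the gradient
    error = sum((w * x - y) ** 2 for x, y in data)    # recalculate the error

print(round(w, 3), round(error, 4))  # w converges near 2.0
```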


Effective ML:
- data collection
- data organisation
- model creation using human insight and domain knowledge
- use machines to flesh out the model from the input data
- your dataset should cover all cases, both positive and negative, and should have at least 5 cases of each, otherwise the model cannot correctly classify that case. Near misses are also important for this. Explore your data, find causes of problems and try to fix the issue. If it can't be fixed, try to find more 'bad' cases to train your model on; otherwise remove them from the data set.
- overfitting and how to correct it
- neural network basics (nodes / layers)
- use the TensorFlow Playground to understand neural networks

Neural networks:
- the goal is to minimize cost
- cost depends on the problem: usually the sum of squared errors for regression problems, or cross-entropy for classification problems
- feature engineering is less needed than for linear models, but still useful
- understand how to reduce noise

Wide & Deep Learning model:
- the wide model is used for memorization, while the deep model is used for generalization
- use for recommender systems, search, and ranking problems

Online training and/or continued learning:
- build a pipeline to continue training your model based on both new and old data

Machine Learning on GCP:
- TensorFlow: for the ML researcher; use the SDK
- Cloud ML: for the data scientist; use a custom model, scalable, no-ops (where you have enough data to train a model)
- ML APIs: for the app developer; use pre-built models, e.g. vision, speech, language (where you would need more data to train your own model)

5.3. ML Engine

- train and predict ML models
- can use multiple ML platforms such as TensorFlow, scikit-learn and XGBoost

Training cluster:
- the training service allocates the resources for the machine types you specify
- Replica: your running job on a given node is called a replica; each replica in the training cluster is given a single role or task in distributed training:
  - master: exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. If you are running a single-process job, the sole replica is the master for the job.
  - workers: one or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
  - parameter servers: one or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
- the CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of workers and parameter servers

Online versus batch prediction:
- Online:
  - optimized to minimize the latency of serving predictions
  - predictions are returned in the response message
  - returns as soon as possible
- Batch:
  - optimized to handle a high volume of instances in a job and to run more complex models
  - predictions are written to output files in a Cloud Storage location that you specify
  - asynchronous request

Exceptions:
- the training service runs until your job succeeds or encounters an unrecoverable error
- in the distributed case, it is the status of the master replica that signals the overall job status
- running a Cloud ML Engine training job locally (gcloud ml-engine local train) is especially useful for testing distributed models

5.4. TensorFlow

- open-source machine learning / deep learning platform
- lazy evaluation during graph build, full evaluation during execution
- machine learning library
- underpins many of Google's products
- C++ engine and API (so it can run on GPUs)
- Python API (so you can easily write code)
- To use TensorFlow:
  - collect predictor and target data
    - discard info that identifies a row (you need at least 5-10 examples of a particular value to avoid overfitting)
    - predictor columns must be numerical (not categorical / codes)
  - create the model
    - how many nodes and layers do we need?
  - train the model based on input data
    - a regression model predicts a number
    - a classification model predicts a category
  - use the model on new data

Feature engineering:
- a sparse vector is very long, with many zeros, and contains only a single 1
- if you don't know the set of possible values in advance, you can use categorical_column_with_hash_bucket instead
- an embedding is a mapping from discrete objects, such as words, to vectors of real numbers
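
A minimal sketch of the two feature-column ideas above (hash-bucketing an unknown vocabulary, then embedding it), assuming TensorFlow's tf.feature_column API (present in TF 1.x/2.x, deprecated in the newest releases); the feature name and sizes are made up:

```python
import tensorflow as tf

# Hash string values of the 'word' feature into 1000 buckets when the full
# vocabulary is not known in advance.
word = tf.feature_column.categorical_column_with_hash_bucket(
    "word", hash_bucket_size=1000
)

# One-hot/multi-hot: a sparse vector of length 1000 containing a single 1.
word_indicator = tf.feature_column.indicator_column(word)

# Embedding: map each discrete word to a dense 8-dimensional real vector.
word_embedding = tf.feature_column.embedding_column(word, dimension=8)

# Either column can then feed a model, e.g. through a DenseFeatures layer.
feature_layer = tf.keras.layers.DenseFeatures([word_embedding])
```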

5.5. Datalab

- Datalab: managed Jupyter notebooks, great for use with a Dataproc cluster to write PySpark jobs
- a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform
- How to run Datalab:
  - open-source notebook built on Jupyter
  - use existing Python packages
  - can also insert SQL & JS for BigQuery, HTML for web content, charts, tables
  - analyse data in BigQuery, GCE, GCS
  - free, just pay for resources
- Three ways to run Datalab:
  - locally (good, if only one person is using it)
  - Docker on GCE (better; used by multiple people through SSH / Cloud Shell, uses resources on GCE)
  - Docker + gateway (best; uses a gateway and proxy, runs locally)

6. Management

6.1. IAM & Billing

- 3 member types: service account, Google account and Google group
  - service accounts are for non-human users such as applications
  - Google accounts are for single users
  - Google groups are for multiple users
- project transfer fees are associated with the instigator
- billing access can be provided to a project or set of projects without granting access to the content. This is useful for separation of duties between finance/devs etc.
- how billing works across projects

6.2. Stackdriver

- for storing, searching, analysing, monitoring, and alerting on log data and events
- be sure to know the sub-products of Stackdriver (Debugger, Error Reporting, Alerting, Trace, Logging), what they do and when they should be used
- hybrid monitoring service
- how you can debug, monitor and log using Stackdriver
- how to use Stackdriver to help debug source code
- Audit Logs to review data access (e.g. BigQuery)
- Stackdriver Monitoring
  - can see the usage of BigQuery query slots
- Stackdriver Trace
  - a distributed tracing system for Google Cloud Platform that collects latency data from Google App Engine, Google HTTP(S) load balancers, and applications instrumented with the Stackdriver Trace SDKs, and displays it in near real time in the Google Cloud Platform Console

6.3. Data Studio 360

- data dashboards
- can use the existing YouTube data source
- the prefetch cache is only active for data sources that use the owner's credentials to access the underlying data
- disabling the query cache could result in higher data usage costs for paid data sources, such as BigQuery
- you can turn the prefetch cache off for a given report. You might want to do this if:
  - your data changes frequently and you want to prioritize freshness over performance
  - you are using a data source that incurs usage costs (e.g. BigQuery) and want to minimize those costs

6.4. Cloud Shell

- temporary VM
- recycled every 60 minutes (approx.)
6.5. Cloud Deployment Manager

- allows you to specify all the resources needed for your application in a declarative format using YAML
- repeatable deployment process

6.6. Data Transfer

- Storage Transfer Service
  - import online data into Cloud Storage
  - repeating schedule
  - transfer data within Cloud Storage, from one bucket to another
  - sources: GCS, S3, URL
- Transfer Appliance
  - one-time
  - rack, capture and then ship your offline data to Google Cloud
- if you have a large number of files to transfer you might want to use the gsutil -m option to perform a parallel (multi-threaded/multi-processing) copy
- compressing and combining smaller files into fewer larger files is also a best practice for speeding up transfer speeds

Avro data format [5]:
- is faster to load; the data can be read in parallel, even if the data blocks are compressed
- doesn't require typing or serialization
- is easier to parse because there are none of the encoding issues found in other formats such as ASCII
- compressed Avro files are not supported, but compressed data blocks are; BigQuery supports the DEFLATE and Snappy codecs

[5] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
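
A minimal sketch of loading Avro from GCS, assuming the google-cloud-bigquery Python client; the bucket, dataset and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Avro is self-describing, so no schema or autodetect flag is needed, and
# DEFLATE/Snappy-compressed data blocks inside the files are fine.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-analytics-exports/events-*.avro",  # wildcard picks up all shards
    "my-project.analytics.events_from_avro",
    job_config=job_config,
)
load_job.result()  # batch loads are free; only the resulting storage is billed
print(client.get_table("my-project.analytics.events_from_avro").num_rows)
```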

7. Product Selection

- if there is a requirement to search terabytes to petabytes of data relatively quickly, it will make more sense to simply store it in BigQuery (comparable to AWS Redshift)
- for Datastore, there is a possibility that it could work as a replacement for Cassandra; it is most likely that BigTable will be a better solution
- if the data set is relatively small (< 10 TB) then Datastore will be preferred
- if the data set is > 10 TB and/or there is no requirement for multiple indexes, then BigTable will be better
- be aware of any limitations regarding indexes and partitioning [6]
- searching for objects by attribute value: Datastore (BigTable can only search by the single row key)
- high-throughput writes of wide-column data: BigTable
- warehousing structured data: BigQuery

[6] https://cloud.google.com/storage-options/

8. Case Study

There are 2 case studies, which are the same as on the GCP website: a logistics company, Flowlogistic, and a communications hardware company, MJTelco. Each case study includes about 4 questions which ask how to transform that company's current technologies to use GCP technologies. We can learn details about these case studies on Linux Academy.

The correct answer is not necessarily the best technical solution, but the solution that achieves the right outcome for the company based on their current limitations. [7]

[7] https://www.linkedin.com/pulse/google-cloud-certified-professional-data-engineer-writeup-rix/

Flowlogistic: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic
- first, find a way to store the data that can be commonly used by Dataproc and BigQuery
- second, both Dataproc and BigQuery integrate well with Cloud Storage

MJTelco: https://cloud.google.com/certification/guides/data-engineer/casestudy-mjtelco

Other helpful case:
Spotify's Event Delivery – The Road to the Cloud: https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/

9. Resources and references

Resources:
- Google Data Engineering Cheatsheet: https://github.com/ml874/Data-Engineering-on-GCP-Cheatsheet
- Data Engineering Roadmap: https://github.com/hasbrain/data-engineer-roadmap
- Whizlab Mock Exam
- Let me know if any source is missing

YouTube videos to watch:
- Auto-awesome: advanced data science on Google Cloud Platform (Google Cloud Next '17): https://www.youtube.com/watch?v=Jp-qJFF9jww&list=PLIivdWyY5sqLq-eM4W2bIgbrpAsP5aLtZ
- Introduction to Google Cloud Machine Learning (Google Cloud Next '17): https://www.youtube.com/watch?v=COSXg5HKaO4
- Introduction to big data: tools that deliver deep insights (Google Cloud Next '17): https://www.youtube.com/watch?v=dlrP2HJMlZg
- Easily prepare data for analysis with Google Cloud (Google Cloud Next '17): https://www.youtube.com/watch?v=Q5GuTIgmt98
- Serverless data processing with Google Cloud Dataflow (Google Cloud Next '17): https://www.youtube.com/watch?v=3BrcmUqWNm0
- Data Modeling for BigQuery (Google Cloud Next '17): https://www.youtube.com/watch?v=Vj6ksosHdhw
- Migrating your data warehouse to Google BigQuery: Lessons Learned (Google Cloud Next '17): https://www.youtube.com/watch?v=TLpfGaYWshw
- Webinar: Building a real-time analytics pipeline with BigQuery and Cloud Dataflow (EMEA): https://www.youtube.com/watch?v=kdmAiQeYGgE
- Deconstructing a Customer Case: Data Engineer Exam: https://www.youtube.com/watch?v=r_yYDysfB-k