0% found this document useful (0 votes)
63 views

Optimizing and Operating v1.2

This document provides guidance on designing for security in Google Cloud environments. It discusses key concepts like identity and access management using groups and least privilege, data security with encryption, and establishing a resource hierarchy with organizations, folders, and projects. It also provides tips on topics like penetration testing, separation of duties, security controls, and key management. Examples are given around implementing techniques like VPC service controls and using service accounts. The resource hierarchy and how policies inherit down the hierarchy is also illustrated.

Uploaded by

Atul Sharma
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
63 views

Optimizing and Operating v1.2

This document provides guidance on designing for security in Google Cloud environments. It discusses key concepts like identity and access management using groups and least privilege, data security with encryption, and establishing a resource hierarchy with organizations, folders, and projects. It also provides tips on topics like penetration testing, separation of duties, security controls, and key management. Examples are given around implementing techniques like VPC service controls and using service accounts. The resource hierarchy and how policies inherit down the hierarchy is also illustrated.

Uploaded by

Atul Sharma
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 101

Optimizing and

Operating
Course agenda
Module 1: Understanding the Professional
Cloud Architect Certification

Module 2: Sample Case Studies for the


Professional Cloud Architect Exam

Module 3: Designing and Implementing


(Review and Preparation Tips)

Module 4: Optimizing and Operating


(Review and Preparation Tips)

Module 5: Resources and Next Steps


Module agenda
Designing for Security and
Compliance
Analyzing and Optimizing Technical
and Business Processes
Developing Procedures to Test
Resilience of Solution in Production
Managing Implementation

Ensuring Solution and Operations


Reliability
Cloud Architect case study 03: Security and
compliance
A large hedge fund wanted to improve their security posture, a
common FinServ requirement...
Security
Business Requirement: Data cannot traverse the public Internet.
Technical Requirement: Must have private API access to Google Cloud services as a good
security practice and to minimize data exfiltration.
Compliance
Business Requirement: Cloud provider must earn the trust of the business. How does
Google Cloud maintain the latest standards around security, availability, process integrity,
privacy, and confidentiality?

This customer had a common FinServ requirement. The customer did not want any
data to traverse the public internet, for obvious reasons. So they had a security
strategy that included a technical requirement to use private APIs to access Google
Cloud resources. They saw this as fundamental to their security strategy. Additionally,
they wanted to know how the Cloud Provider secured Standards Certifications, and
what they did to stay current. They were concerned that the provider might lose a
certification that they were relying on for business.

A large financial company wanted to improve their security posture, a common


FinServ requirement...

Security
● Business Requirement: Data cannot traverse the public Internet.
● Technical Requirement: Must have private API access to Google Cloud
services as a good security practice and to minimize data exfiltration.

Compliance
● Business Requirement: Cloud provider must earn the trust of the business.
How does Google Cloud maintain the latest standards around security,
availability, process integrity, privacy, and confidentiality?
Cloud Architect case study 03: Security and
compliance
We mapped that to technical requirements and Google Cloud’s
products and services...
Security
Ensure all traffic to Google Cloud is through secure methods, such as SSL/TLS, VPN,
Interconnect, and private APIs / private endpoints.
Compliance
Google Cloud has Standards, Regulations & Certifications that would meet their compliance
requirements and help earn their trust in our platform.

The first thing we did was made sure all access to Google Cloud was through secure
methods, including SSL, VPN, Interconnect, and private API.

We decided to use a new feature that was in alpha, called VPC Service control.
https://cloud.google.com/vpc-service-controls/ This enables a security perimeter. For
example, BigQuery could be placed inside a security perimeter, and then could only
be accessed at a private endpoint. And then there were standards and compliance
such as ISO and SOC. We provided these to the customer - and they needed to sign
agreements to be covered by Google's guarantees about these standards.

We mapped that to technical requirements and Google Cloud’s products and


services…

Security
● Ensure all traffic to Google Cloud is through secure methods, such as
SSL/TLS, VPN, Interconnect, and private APIs / private endpoints.

Compliance
● Google Cloud has Standards, Regulations & Certifications that would meet
their compliance requirements and help earn their trust in our platform.
Cloud Architect case study 03: Security and
compliance
And this is how we implemented that technical requirement.
VPC Service Controls / Secure Google Cloud API
● Restrict access to user’s Google Cloud resources based on the Google Cloud Virtual
Network or IP range.
● Restrict the set of Google APIs and Google Cloud resources accessible from user’s
Google Cloud Virtual Network.
Standards, Regulations & Certifications
● Products regularly undergo independent verification of:
○ Security / Privacy / Compliance Controls
○ Certifications
■ ISO 27001, 27017, and 27018 and SOC 1, 2, and 3
certifications.

Currently in Alpha, VPC Service Controls is likely to be on future versions of the Cloud
Architect certification exam.

All services are managed through a secured global API gateway infrastructure.

An interesting point about both security and compliance, is that it is a "shared


responsibility" model. So although we provided secure access and layered protection,
the customer needed to use IAM to manage access to its employees and implement
secure practices in its procedures. Also, the standards compliance covers the cloud
resources, but not the customer's application. So they may need to take extra steps to
ensure that the overall solution is compliant.
Designing for security
Exam outline Tips
Identity and Access Management (IAM) Use groups.

Data security (key management, encryption) Principle of least privilege.

Resource hierarchy (organizations, folders, Separation of responsibilities to match


projects) organization structure. TIP
Penetration testing Shape the scope and the objectives of the test. Can you apply
Isolate key roles. Always have an the "principle of
Separation of duties alternate/secondary who can take over if least privilege" in
needed. Service accounts. case examples?
How are you monitoring security?
Security controls What logs and reports are available to you?
What responses will you take?

Managing customer-supplied encryption keys Key management, key rotation, standards and
with Cloud KMS policy compliance.

One key to securing access is to request and establish groups that represent roles.
Then apply the permissions to the groups. And allow the people in the organization
who manage identity to assign membership to the groups. This creates a clean
interface between permission management on the cloud side, and group membership
on the personnel/IT side.

Another key to security is to craft security permissions. The standard roles are defined
for the most common use cases. But you might want to derive more granular and
restricted roles by customizing them.

Service accounts are a great way to separate system components and establish
secure communications between components.

A bastion host is a way to leverage a service account. For risky and uncommon
actions, make the user/admin start up and log into a bastion host. From there they
can "borrow" the service account assigned to the host to perform restricted functions.
One benefit is that the login process generates logs for accountability.
Cloud IAM resource hierarchy

Organization
example.com
Folders

Dept X Dept Y

Policy Inheritance
Projects

bookshelf static-assets stream-ingest

Compute App Cloud


Resources

Cloud
Pub/Sub BigQuery
Engine Engine Storage Storage

instance_a queue_a bucket_a bucket_b topic_a dataset_a

A policy is set on a resource, and each policy contains a set of:


● Roles
● Role members
● Resources inherit policies from parent:
● Resource policies are a union of parent and resource.
● If parent policy is less restrictive, it overrides a more restrictive resource policy.

https://cloud.google.com/iam/
https://cloud.google.com/iam/docs/
https://cloud.google.com/iam/docs/concepts
https://cloud.google.com/iam/docs/understanding-roles
https://cloud.google.com/iam/docs/service-accounts
Folders
Company

Org
Additional grouping mechanism and
isolation boundaries between projects:
● Different legal entities
Dept X Dept Y Shared
● Departments Infrastructure

Folders
● Teams
Team A Team B
Folders allow delegation of
administration rights.
Product 1 Product 2

Projects
Development Test Production

Resources

Compute Engine App Engine Cloud Storage

Folder map well to organization structure. It is a way to isolate organizations or users


or products while still having them share billing and corporate resources.

https://cloud.google.com/resource-manager/docs/managing-multiple-orgs
https://cloud.google.com/resource-manager/docs/access-control-org

https://cloud.google.com/resource-manager/docs/creating-managing-projects
https://cloud.google.com/resource-manager/docs/access-control-proj

https://cloud.google.com/resource-manager/docs/creating-managing-folders
https://cloud.google.com/resource-manager/docs/access-control-folders
Identity and access
● Separate responsibilities.
● Always have a backup or alternative in case the
responsible person is unreachable.
● Have a separate maintenance path when the normal
paths aren't working (e.g., bastion host).
● Use groups to allocate permissions, then separately
manage group membership.
● Customize roles for greater granularity of permissions.
● Give each group only the permissions they need to
perform that job or task.
● Place critical functions on service machines to create
accountability trail (login log, activity monitoring).
● Backup/spare logs and records; have a review,
analysis, and monitoring strategy (ex: monthly reports).

TIP: Commit a security checklist to memory. Sometimes just running down a list will
rapidly identify a solution.
Why do you bother locking doors?

The principal illustrated in this image is very important and will help you understand
the next several slides about security.

This is a very simple door lock. But it is on the inside of this door. The person or
people inside can unlock this very easily at any time. For them it does nothing but
puts a step in the way of using the door. But the lock is not supposed to do anything
for them. The purpose of the lock is to keep other people out. So to do that, you need
to lock your own people in.

This principal is on display in the way networking contributes to security.

Consider an internal firewall between two VMs. Why would you want to do that?

If you think about it functionally it makes no sense. Because the VMs can already
communicate. So adding a firewall between them does nothing for them. It only
makes it inconvenient if they want to communicate using a different protocol. Now you
have to go change the firewall rule to allow that protocol. And it seems like invented
work. Because if you just didn't put the firewall rule in place to begin with you wouldn't
have to modify it later. However, the firewall rule is not for those VMs. It doesn't do
anything for them. It is to keep others out of their communications. To prevent
someone from spoofing or injecting bad traffic as part of an attack to either violate
privacy or deny service.
Shared VPC: Keep others out by locking you in

Shared VPC allows an organization to connect resources from multiple projects to a


common VPC network. This allows the resources to communicate with each other
securely and efficiently using internal IPs from that network.

For example, in this diagram there is one network that belongs to the Web Application
Server’s project. This network is shared with three other projects, namely the
Recommendation Service, the Personalization Service, and the Analytics Service.
Each of those service projects has instances that are in the same network as the Web
Application Server, allowing for private communication to that server, using internal IP
addresses. The Web Application Server communicates with clients and on-premises
using the server’s external IP address. The backend services, on the other hand,
cannot be reached externally because they only communicate using internal IP
addresses.

When you use shared VPC, you designate a project as a host project and attach one
or more other service projects to it. In this case, the Web Application Server’s project
is the host project, and the three other projects are the service projects. The overall
VPC network is called the shared VPC network.
An example of Shared VPC keeping a system safe

This example has one host project and 3 service projects:

In this diagram, the Shared VPC Admin, which was nominated by an organization
admin, configured the Web Application Project to be a host project with subnet-level
permissions. Doing so allowed the Shared VPC Admin to selectively share subnets
from the VPC network.

Next, the Shared VPC Admin attached the three service projects to the host project
and gave each project owner the Network User role for the corresponding subnets.
Each project owner then created VM instances from their service projects in the
shared subnets.

The billing for those VM instances is attributed to the project where the resources are
created, which are the service projects.

Shared VPC Admins have full control over the resources in the host project, including
administration of the shared VPC network. They can optionally delegate the Network
Admin and Security Admin roles for the host project. Overall, shared VPC is a
centralized approach to multi-project networking because security and network policy
occurs in a single designated VPC network.

For a demo on how to create VM instances in a Shared VPC network, please refer
here:
https://storage.googleapis.com/cloud-training/gcpnet/student/M3%20-%20Shared%20
VPC.mp4
VPC peering keeps communications private and
on topic

VPC Network Peering allows private RFC 1918 connectivity across two VPC
networks, regardless of whether they belong to the same project or the same
organization. Now, remember that each VPC network will have firewall rules that
define what traffic is allowed or denied between the networks.

For example, in this diagram there are two organizations that represent a consumer
and a producer, respectively. Each organization has its own organization node, VPC
network, VM instances, Network Admin and Instance Admin. In order for VPC
Network Peering to be established successfully, the Producer Network Admin needs
to peer the Producer Network with the Consumer Network, and the Consumer
Network Admin needs to peer the Consumer Network with the Producer Network.
When both peering connections are created, the VPC Network Peering session
becomes Active and routes are exchanged This allows the VM instances to
communicate privately, using their internal IP addresses.

VPC Network Peering is a decentralized or distributed approach to multi-project


networking, because each VPC network may remain under the control of separate
administrator groups and maintains its own global firewall and routing tables.
Historically, such projects would consider external IP addresses or VPNs to facilitate
private communication between VPC networks. However, VPC Network Peering does
not incur the network latency, security, and cost drawbacks that are present when
using external IP addresses or VPNs.
Tips on designing solutions that use VPC peering

● Compute Engine, Google Kubernetes


Engine, and App Engine flexible
environments
● Peered VPC networks remain
administratively separate
● Each side of a peering association is set
up independently
● No subnet IP range overlap across
peered VPC networks
● Transitive peering is not supported

Now, there are some things that we want you to remember when using VPC Network
Peering:
● First of all, VPC Network Peering works with Compute Engine, Google
Kubernetes Engine, and App Engine flexible environments.
● Peered VPC networks remain administratively separate. This means that
routes, firewalls, VPNs, and other traffic management tools are administered
and applied separately in each of the VPC networks.
● Each side of a peering association is set up independently. Peering will be
active only when the configuration from both sides matches. This allows either
side to delete the peering association at any time.
● A subnet CIDR prefix in one peered VPC network cannot overlap with a
subnet CIDR prefix in another peered network. This means that two auto
mode VPC networks that only have the default subnets cannot peer.
● Only directly peered networks can communicate, meaning that transitive
peering is not supported. In other words, if VPC network N1 is peered with N2
and N3, but N2 and N3 are not directly connected, VPC network N2 cannot
communicate with VPC network N3 over the peering. This is critical if N1 is a
SaaS organization offering services to N2 and N3.
Should you use Shared VPC or VPC peering ?
Consideration Shared VPC VPC Network Peering

Across organizations No Yes

Within project No Yes

Network Administration Centralized Decentralized

Organization Admin Organization Admin (if same org)

Shared VPC Admin Security and Security and Security and


Network Network Network
Security and Network Admins Admins Admins Admins

Project Owner Project Owner Project Owner Project Owner Project Owner Project Owner

Should you use Shared VPC or VPC peering for a particular situation? That depends
on the business administration requirements.

The biggest difference between the two configurations is the network administration
models. Shared VPC is a centralized approach to multi-project networking, because
security and network policy occurs in a single designated VPC network. In contrast,
VPC Network Peering is a decentralized approach, because each VPC network can
remain under the control of separate administrator groups and maintains its own
global firewall and routing tables.

Also, consider the limits of VM instances per network and total among peered
networks: https://cloud.google.com/vpc/docs/quota#per_network

Now that we’ve compared both of these configurations for sharing VPC networks
across Google Cloud projects, let’s look at one last use case.
Remove external IPs using Private Google Access

The goal of Private Google Access is to remove external IPs from your VMs. Every
time you remove an external IP you reduce the number of opportunities for an
attacker to gain entry or try to deny service.

Private Google Access allows VM instances that only have internal IP addresses to
reach the external IP addresses of Google APIs and services. For example, if your
private VM instance needs to access a Cloud Storage bucket, you need to enable
Private Google Access.

You enable Private Google Access on a subnet-by-subnet basis. As you can see in
this diagram, subnet-a has Private Google Access enabled, and subnet-b has it
disabled. This allows VM A1 to access Google APIs and services, even though it has
no external IP address.

Private Google Access has no effect on instances that have external IP addresses.
That’s why VMs A2 and B2 can access Google APIs and services. The only VM that
can’t access those APIs and services is VM B1. This VM has no public IP address,
and it is in a subnet where Google Private Access is disabled.

For a list of the services supported by Private Google Access, see:


https://cloud.google.com/vpc/docs/private-access-options#pga-supported
Cloud NAT provides internet access to
private instances

Cloud NAT

Network Address Translation enables you to provide limited internet access without
adding an external IP to your VM, and without exposing your internal IP to the world.

Cloud NAT is Google’s managed network address translation service. It lets you
provision your application instances without public IP addresses, while also allowing
them to access the internet in a controlled and efficient manner. This means that your
private instances can access the internet for updates, patching, configuration
management, and more.

In this diagram, Cloud NAT enables two private instances to access an update server
on the internet, which is referred to as outbound NAT. However, Cloud NAT does not
implement inbound NAT. In other words, hosts outside your VPC network cannot
directly access any of the private instances behind the Cloud NAT gateway. This helps
you keep your VPC networks isolated and secure.
Google Cloud Armor works with HTTP(S)
load balancing Cloud Armor

Sarah in
San Francisco
(50.1.1.5) Cloud Armor Backend
Allowlist 50.1.0.0/16 us-west2

Shen in
Singapore
(50.1.1.20)

Backend
DDoS/Application
attacks blocked X Global HTTP(S) asia-southeast1
and mitigated Load Balancer
(120.1.1.1)

Physical hardware load balancers are a target for denial of service attacks. That is
because they have physical hardware network interfaces that have a high but limited
capacity. If an attacker can overrun an interface with traffic, or trigger an overrun from
the internal service out to the internet, they can cause congestion that will slow or halt
traffic.

Google Cloud Armor works with global HTTP(S) load balancing to provide built-in
defenses against Infrastructure Distributed Denial of Service or DDoS attacks.Google
Cloud Armor benefits from over a decade of experience protecting some of the
world’s largest internet properties like Google Search, Gmail, and YouTube.

Google Cloud Armor enables you to restrict or allow access to your HTTP(S) load
balancer at the edge of the Google Cloud network, meaning as close as possible to
the user and to malicious traffic. For example, in this diagram, Sarah in San Francisco
and Shen in Singapore are two employees who are allowed to access your HTTP
load balancer. Therefore, their traffic will be forwarded to the backend in the us-west2
region and the backend in the asia-southeast1 region, respectively. A DDoS attack, on
the other hand, can be blocked directly at the edge without consuming resources or
entering your VPC network.
Designing for legal compliance

Exam outline Tips

Legislation (e.g., Health Insurance Portability and


Accountability Act (HIPAA), Children’s Online PII - personally identifiable information
Privacy Protection Act (COPPA), etc.)

Audits (including logs) Do your policies conform to audit requirements?

Certification (e.g., Information Technology Firewall, IAM, and Keys. CSEK?


Infrastructure Library (ITIL) framework) Make sure that growth doesn't break compliance

What are the two most common compliance areas?


(1) privacy regulations (e.g., HIPAA, GDPR); (2) commercial and line-of-business
standards (e.g., PCI-DSS).
Defense in depth
Third-party virtual Cloud Load Balancing
appliances ● Built-in L3/L4 DDoS protection
● Intelligent anycast for Global IPv6
Cloud Interconnect ● Next-gen firewall and IPv4 LB
● Direct Peering ● Logging ● Autoscaling
● Carrier Interconnect ● Monitoring ● Cross-region overflow
● VPN ● Compliance ● Cross-region failover
● Private Interconnect ● SSL (TLS) termination

Secure VPC Google network


● VPC topologies for ● High capacity/High
isolation performance global
● Distributed firewall network
● Cloud IAM ● Protection against
● Private access to UDP-based attacks
Google services
● Bastion hosts Third-Party
DDoS Defense

Layers of protection. Each layer protects and complements the next internal layer.

● Secure VPC: Identify optimal VPC topology. Deploy distributed firewalls.


Control access with IAM permissions. Avoid adding Public IPs to instances.
Access Google services internally.
● Cloud Interconnect: Connect securely to on-prem or other-cloud
deployments. Private Interconnect. Carrier Interconnect. Direct Peering. VPN.
● 3rd party virtual appliances: Enhance VPC security with 3rd party
appliances. Next-gen firewalls. IDS/IPS (intrusion detection). Logging.
Monitoring. Scale 3rd party appliances using internal load balancing so you
don't create choke points in your VPC.
● Cloud Load Balancing provides edge protection and global infrastructure
protection for IPv4 and IPv6. Layer 3 and 4 DDoS protection. Anycast IP -
even if backends are in multiple regions. To absorb attacks for resiliency -
autoscaling, cross-region overflow, and cross-region failover.
● Google network: High capacity. High performance. SDN network
virtualization Global networks with subnets. Organization, folders,
cross-project networking, peering.
● 3rd Party DDOS: You can complement this infrastructure with additional
DDoS security from 3rd party providers.

https://cloud.google.com/security/
https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations
Map from traditional security to cloud
security concepts
Traditional security zone concepts Google Cloud network security concepts
Cloud Cloud Cloud
Armor Load Balancing Firewall Rules

Internet
Internet
Web
Server Compute Engine
FW Load Balancer FW Load Balancer

Firewall Choke point Cloud Firewall Rules

Load Balancer Load Balancer Compute Engine App 1 App 2 Compute Engine

Service
Accounts
Cloud Storage BigQuery
Dataset

Key concepts: Cloud Armor, Cloud Load Balancing, Firewall Rules, Service Accounts,
separation into front-end and back-end, isolation of resources using separate service
accounts between services.

TIP: Because of pervasive availability of firewall rules, you don't have to install
a router in the network at a particular location to get firewall protection. That
means you can layer the firewalls as shown in this example.

TIP: Because if pervasive support for Service accounts, you can "lock down"
connections between components.

Security
1. Security is an "umbrella" term for many specific services. Access.
Authorization. Accountability. Privacy. Authentication. Encryption.
2. Layering security measures provides a greater deterrent due to the synergy of
multiple methods working together.
3. In some cases, practical security involves raising the effort required to bypass
the security above the value of the item being protected. Sometimes called
"castle logic". The walls of the castle (and depth of the moat) should be higher
(and deeper) than the value of the treasure in the castle.

When faced with a security question on an exam (or in practice), determine which of
the specific technologies/services is being discussed (authentication, encryption) for
example. Then determine exactly what the goals are for sufficient security. Is it
deterrence? Is it meeting a standard for compliance? Is the goal to eliminate a
particular risk or vulnerability? This will help you define the scope of a solution,
whether on an exam or in application.
Encryption of VM disks and Cloud Storage buckets
Customer-Managed Customer-Supplied
Default Encryption Client-Side Encryption
Encryption Keys (CMEK) Encryption Keys (CSEK)

Data is automatically Google-generated data Keep keys on premises, Data is encrypted before it
encrypted prior to being encryption key (DEK) still and use them to encrypt is sent to the cloud.
written to disk. used. your cloud services.
Your keys.
Each encryption key is Allows you to create, use, Google can't recover them.
itself encrypted with a set and revoke the key Your tools.
of master keys. encryption key (KEK). Disk encryption on VMs.
Google has no idea if your
Uses Cloud Key Cloud Storage encryption. data is encrypted before it
Management Service is uploaded.
Keys are never stored on
(Cloud KMS).
disk unencrypted. No way to recover keys.

You provide your key at If you lose your keys,


each operation, and remember to delete the
Google purges it from its objects!
servers when each
operation completes.

Encryption options
https://cloud.google.com/security/encryption-at-rest/

Customer Managed Encryption Keys (CMEK) using Cloud KMS


When you use Dataproc, cluster and job data is stored on Persistent Disks (PDs)
associated with the Compute Engine VMs in your cluster and in a Cloud Storage
bucket. This PD and bucket data is encrypted using a Google-generated data
encryption key (DEK) and key encryption key (KEK). The CMEK feature allows you to
create, use, and revoke the key encryption key (KEK). Google still controls the data
encryption key (DEK).

Default Encryption
Encryption at rest uses the Key Management System (KMS) to generate KEKs and
DEKs.
https://cloud.google.com/security/encryption-at-rest/default-encryption/#key_manage
ment

https://pixabay.com/en/key-old-skeleton-lock-metal-door-30417/

Key Management Service (KMS): https://cloud.google.com/kms/


AES256 keys
Generate keys
Usage includes off-cloud
Key rotation
When a key is destroyed, there is a 24-hour delay
API support
Envelope DEK/KEK

Client-Side Encryption
https://cloud.google.com/storage/docs/encryption/client-side-keys
Scenario #1
Question

Which IAM roles apply to security auditors requiring visibility across all projects?

A. Org viewer, project owner

B. Org viewer, project viewer

C. Org admin, project browser

D. Project owner, network admin

Origin: CAPE
Scenario #1
Answer

Which IAM roles apply to security auditors requiring visibility across all projects?

A. Org viewer, project owner

B. Org viewer, project viewer

C. Org admin, project browser

D. Project owner, network admin

Origin: CAPE
Scenario #1
Rationale

B - Gives read-only access across the company.


A, C, D - The other options allow them to make changes. They should not be able to
make changes.

https://cloud.google.com/iam/docs/understanding-roles
https://cloud.google.com/iam/docs/granting-changing-revoking-access
Scenario #2
Question

A company’s security team has decided to standardize on AES256 for storage device
encryption. Which strategy should be used with Compute Engine instances?

A. Select SSDs rather than HDDs to ensure AES256 encryption.

B. Use the linux dm-crypt tool for whole-disk encryption.

C. Use Customer Supplied Encryption Keys (CSEK).

D. Use openSSL for AES256 file encryption.

Origin: Case study


Scenario #2
Answer

A company’s security team has decided to standardize on AES256 for storage device
encryption. Which strategy should be used with Compute Engine instances?

A. Select SSDs rather than HDDs to ensure AES256 encryption.

B. Use the linux dm-crypt tool for whole-disk encryption.

C. Use Customer Supplied Encryption Keys (CSEK).

D. Use openSSL for AES256 file encryption.

Origin: Case study


Scenario #2
Rationale

A - Selection of disk type determines the default method for whole-disk encryption.
HDDs use AES128 and SDDs use AES256.
B - This would be redundant with Compute Engine disk encryption.
C - Who manages the keys has nothing to do with whether it is AES128 or AES256.
D - File encryption is a different layer. The standard is for device encryption.

https://cloud.google.com/compute/docs/disks/customer-supplied-encryption
https://cloud.google.com/security/encryption-at-rest/default-encryption/

https://cloud.google.com/compute/docs/disks/customer-supplied-encryption
https://cloud.google.com/security/encryption-at-rest/default-encryption/

"In addition to the storage system level encryption described above, in most cases
data is also encrypted at the storage device level, with at least AES128 for hard disks
(HDD) and AES256 for new solid state drives (SSD), using a separate device-level
key (which is different than the key used to encrypt the data at the storage level). As
older devices are replaced, solely AES256 will be used for device-level encryption."
Module agenda
Designing for Security and
Compliance
Analyzing and Optimizing Technical
and Business Processes
Developing Procedures to Test
Resilience of Solution in Production
Managing Implementation

Ensuring Solution and Operations


Reliability
Cloud Architect case study 04: Technical and
business processes
A customer had this interesting business requirement...
● Pushing to Prod is a big event and happens once a month.

● Significant risk of downtime due to unforeseen issues.

● Downtimes exceeding SLA have revenue impact.

● Tl;dr: Need to develop and deploy features without burning the house down.
Pushing to Prod should be a non-event.

You may have seen this situation before. The customer experience is that pushing to
production is scary, because you never know when things might break. This customer
could only push to production once a month. There was significant risk of downtime.
And when there was downtime, the application breaks, and that impacts revenue. So
the summary is… we need to be able to develop and deploy features, need to present
to production, without it being an event.

A customer had this interesting business requirement...


● Pushing to Prod is a big event and happens once a month.
● Significant risk of downtime due to unforeseen issues.
● Downtimes exceeding SLA have revenue impact.
● Need to develop and deploy features without burning the house down.
Pushing to Prod should be a non-event.
Cloud Architect case study 04: Technical and
business processes
We mapped that to technical requirements like this.
Establish CI/CD pipeline:
● Single source repo per product; git-flow as branching model.
● Automate build, self-testing, rapid.
● Automate deployment.
● Set up robust monitoring, logging and alerting for visibility.
Promote team culture:
● Test Driven Development.
● Push often, address broken builds immediately.
● Transparency.
● Change management / Release process.

There are two passes through the problem in the architectural process. First, there is
the business pass, which includes both the business challenge and the people
processes - understanding what roles there are and what actions the people need to
be able to take. And then there is the technical pass which is mapping all these needs
and procedures to a technical solution.

We mapped that to technical requirements like this…

Establish CI/CD pipeline:


● Single source repo per product; git-flow as branching model.
● Automate build, self-testing, rapid.
● Automate deployment.
● Set up robust monitoring, logging and alerting for visibility.

Promote team culture:


● Test Driven Development.
● Push often, address broken builds immediately.
● Transparency.
● Change management / Release process.
Cloud Architect case study 04: Technical and
business processes
And this is how we implemented that technical requirement
Featuring:
● Cloud Source Repositories
● Cloud Build
● Container Registry
● Google Kubernetes Engine + Helm
● Spinnaker
● Cloud Load Balancing
● Google Cloud’s operations suite
● Cloud IAM + Service Accounts

To push to production on demand, we needed to analyze the development process.


We figured out that a single source repo made the most sense for the whole team.
We decided to go with git-flow as a branching model instead of other kinds of
development. We automated the build process. But also we made sure that the builds
were self-testing using SALT test coverage. And we measured the build process and
made sure that it was rapid. We also automated the deployment of the solution
software. Couple this automation with monitoring, logging, and alerting, and you get a
nice build-and-deploy system that can be handed off to a team.

But that doesn't solve the entire problem. The system has to be used. And this team
was not going to be accustomed to using the development paradigm we had
implemented. So we also needed to do was enhance their processes. This primarily
had to do with how test and development worked together. In the new paradigm they
would start writing the testing along with development. The new way was to push to
production often and push early, and as soon as something break -- fix it. This meant
a new team culture, one of accountability and transparency. Where all the
stakeholders could participate in identifying and resolving a problem. And that meant
change management and leadership buy-in was necessary for the technical solution
to be successful.

And this is how we implemented that technical requirement.


● Cloud Source Repositories - for hosting their repositories
● Cloud Build - that builds the docker container images
● Container Registry - that hosts those container images
● Google Kubernetes Engine + Helm - for running and managing
● Spinnaker - for CDP - continuous deliver
● Cloud Load Balancing, Google Cloud’s operations suite, Cloud IAM + Service
Accounts - management constructs for proper visibility
Analyzing and defining technical processes
Exam outline Tips
Choosing developer tools on Google Cloud
Software Development Lifecycle Plan (SDLC)
https://cloud.google.com/tools/docs/
Continuous integration/continuous deployment Kubernetes … Cloud Build
Blame isn't root cause. Focus on systems—what you
Troubleshooting/post mortem analysis culture
can change in your procedures, not who is at fault.
Testing and validation Testing and validation
Error budget versus perfection
Information Technology Infrastructure Library
IT enterprise process (e.g., ITIL) defines processes, skills, and handoff of
responsibilities
What's your D.R. story? How do you know it works, if
Business continuity and disaster recovery
you don't test it? Tradeoffs. What are your SLAs?

CI/CD with Kubernetes:


https://cloud.google.com/kubernetes-engine/continuous-deployment/
CI/CD tools integrations:
https://cloud.google.com/container-registry/docs/continuous-delivery
https://cloud.google.com/tools/docs/
Testing
Load testing - test overload conditions
Plan and review production SLOs and SLIs
Software Development Lifecycle Plan
(SDLC) to identify overload risks
Analyze and define business processes,
including policy for degraded service and
service outages
Develop procedures to test resilience of
solution in overload conditions
Periodic testing
Test recovery processes

TIP: It is handy to have a testing checklist in mind to help you consider all
options. Consider the question(s) you are trying to answer with testing; "will
the solution support the number of users", "will it handle peak traffic", "is
latency acceptable", and so forth.

● If you can, test during a low use time, such as at night (called a 'dark launch')
● If not, test in pre-production using a synthetic workload that closely resembles
a real workload. The results could be misleading if the workload is not
designed well:
● Correct mix of read-only and state mutating operations
● Sufficiently diverse workload (with realistic cache hitrate)
● Environment should resemble production
● If expensive operations (e.g. returning large lists) exist, exercise these
● Don't rely on load testing tools that minimize the number of concurrent
operations - gives optimistic view of peak throughput
Pricing calculator

https://cloud.google.com/products/calculator/

TIP: The pricing calculator can be used with BigQuery to estimate the cost of a query
before you submit it.
Optimizing VM cost
VM dimensioning: standard, high-CPU, high-mem, GPU, and custom
Sustained use discounts
Machine-type discounts
Inferred instance type discount
Committed use discounts
Compute Engine and Google Kubernetes Engine VMs
Not: App Engine flexible environment, Dataproc, Dataflow, or Cloud SQL VMs
Preemptible VMs

Discounting algorithms are subject to change. Please see current discounting or


details:
Committed use discounts:
https://cloud.google.com/compute/docs/instances/signing-up-committed-use-discount
s
Sustained use discounts:
https://cloud.google.com/compute/docs/sustained-use-discounts
https://cloud.google.com/compute/docs/instances/preemptible
Optimizing disk cost
Determine how much space you need.
Determine what performance characteristics your applications require.
● I/O Pattern: small reads and writes or large reads and writes.
● Configure your instances to optimize storage performance.

Monthly charges Standard PD SSD PD

10 GB $0.40 $1.70

1 TB $40 $170

16 TB $655.36 $5,570.56

https://cloud.google.com/compute/docs/disks/performance
Optimizing network cost
Internet egress
Internet
Egress between regions

Continent Continent

Region Region Region

Zone Zone Zone Zone Zone

Egress within Egress between Intercontinental Egress:


the same zones in the at Internet rate
zone: free same region

Ingress is free. Networking costs are similar per Google Cloud product but are billed
per product. So you need to view the pricing documentation per product for the
details.
Example: Cloud Storage has standard egress costs. But there are also separate
charges for data migration and for Cloud Storage operations.

Example: Egress between regions: $0.01 per GB. Egress to Internet: first terabyte to
world destinations, $0.12 per GB.

https://cloud.google.com/bigtable/pricing#network
https://cloud.google.com/storage/pricing
Network service tiers decision tree

There are performance and cost differences between the Network Service Tiers. This
decision tree provides guidance on choosing the tier that best meets business needs.

Ask yourself whether high performance or lower cost is most important for your
workload or resource. The Premium Tier is the clear choice for performance. If cost is
your main consideration, remember that the Standard Tier has other restrictions in
addition to network performance. If you want to deploy your backends or have users
in multiple regions, but don’t want to use the public internet over Google’s network for
inter-continental and cross region traffic, you want to choose the Premium Tier. Also, if
you want Global Load Balancing or Cloud CDN, you need to use the Premium Tier.

Otherwise, the Standard Tier is a great choice if you don’t need any of those services
and are okay using the public internet instead of Google’s network.
Analyzing and defining business processes

Exam outline Tips


Stakeholder management Who are the gatekeepers and what roles do they
(e.g., influencing and facilitation) play? How can you enable them to do their jobs?

Quality is a process, not a product. Change is


Change management
inevitable in cloud. Adapt process to be continuous.

Team assessment / skills readiness Do you need a playbook? Rehearsals?

Decision-making process How long do you have to make which decisions?

How do you know that what you delivered is still


Customer success management meeting your customer/client's needs?
When is it time for version +1 or version 2.0 ?

Cost optimization / resource optimization What are the financial goals and how are they
(Capex / Opex) measured and reported?
Instance overhead estimation
Of the 100% capacity of a VM, some will be consumed
by overhead.
If you can't test, 30% overhead is a cautious estimate.
Load testing can better estimate the overhead.
● OS image choice can make a difference (up to 20%).
○ Test several images
● Hardware architecture (specific to zone) can make a Open source CPU
measurement tools
difference for some workloads.
● Coremark
● OS firewalls turn off if Google Cloud firewalls are
● Spec CPU 2006
sufficient.
● On Windows, move Pagefile to Local SSD if
affordable.

TIP: If you can exchange a server-oriented architecture for a serverless service,


you no longer have to be concerned about instance overhead, just the SLAs of
the service.

This page lists the hardware architectures in the different zones. Sandy Bridge,
Haswell, Broadwell, IvyBridge, Skylake, and so forth.
https://cloud.google.com/compute/docs/regions-zones/regions-zones
There are a lot of benchmark comparisons online. It is suggested that you test your
application and workload in different zones to see what difference the hardware in the
zone might make.

Open source CPU measurement tools


Coremark
Spec CPU 2006
https://pixabay.com/en/board-circuits-trace-control-center-1709192/
Persistent Disk estimation
IOPS may be constrained by CPU
● n1-standard-4 = PD-SSD at capacity
● n1-standard-16 = Local SSD at capacity
If estimates are based on averages, consider
potential I/O bursts; estimate might be
under-provisioned.
Disk performance scales with disk size.
Open source disk measurement tools
● If you trade up, check for performance
over capacity. ● Bonnie++
● Copy
● If you trade down, check for
● Fio
performance under capacity.
● Synthetic Storage

TIP: There are new persistent disk features and options. People often assume
that a persistent disk is just a hard disk, when it has different features and
capabilities. Do you know if disk encryption or RAID configurations would work
with Persistent Disks? Would they be valuable?

Consider potential I/O bursts. If you have planned on IOPS based on an average, and
the actual disk usage is bursty, the disk could be under provisioned for dealing with
the bursts.
Persistent disk performance scales with the size of the disk. So if you trade up to a
larger disk in your design, revisit the performance to avoid over capacity, and if you
trade disk size down, check for under capacity.
Potential IOPS may be constrained by CPU. An n1-standard-4 can drive a PD-SSD at
capacity, and an n1-standard-16 can drive a Local SSD at capacity.

Open source disk measurement tools


Bonnie++
Copy
Fio
Synthetic Storage
Network capacity estimation
Consider potential I/O bursts.
Network capacity scales with the
number of cores.
Internal IPs and External IPs are
different speeds; they are not
symmetrical.

Open source network measurement tools


● Iperf
● Mesh
● Network
● Netperf
● Ping

In general, Internal IPs are faster than External IPs.

Open source network measurement tools


Iperf
Mesh
Network
Netperf
Ping

Consider potential I/O bursts. If you have planned on IOPS based on an average, and
the actual traffic is bursty, it could be under provisioned for dealing with the bursts.

Network capacity scales with the number of cores. So if you change out the number
of cores in the VMs in your design, revisit the network capacity to make sure the
design does not over- or under- provision.

Internal IPs and External IPs are different speeds, they are not symmetrical. Even
VM-to-VM in the same zone over External IPs can be ~1Gbps vs ~8.5Gbps for
Internal IPs.
Workload estimation
Throughput depends on:
● Types of requests or operations
● Requests that change state
● Request size (payload)
● Whether system uses sharding,
pipelining, or batching Open source workload measurement tools
● Aerospike
● Cassandra
● Hadoop
● HPCC
● MongoDB
● Oldisim
● Redis
Module agenda
Designing for Security and
Compliance
Analyzing and Optimizing Technical
and Business Processes
Developing Procedures to Test
Resilience of Solution in Production
Managing Implementation

Ensuring Solution and Operations


Reliability

Developing procedures to test resilience of solution in production (e.g., DiRT and


Simian Army)
Is this resilient?
Failover Design

Cloud Load Cloud Load


Balancing Balancing
Is this a good design for handling failover?
480
Will it work?
480 120
Why or why not?
120
What procedures would you implement? 960 240

Frontend Frontend
A B

max(1000) qps max(1000) qps

This design illustrates that in some cases technical design is not sufficient but must be
followed up with human procedures. The original design was intended to handle a
maximum of 1000 queries per second. It was originally a functional design. However,
over time the load has grown. While the system is still operational, if one of the
Frontend servers were to fail and the combined traffic was taken up by the other
Frontend, the total would be 1200 qps and would be above capacity. For this reason,
resiliency requires identifying key metrics (in this case, total load) and periodically
adjusting capacity to stay ahead of growth. On the other side, if load were diminishing
consistently over time, there could be cost savings in downsizing the Frontend
servers.
Scale-out policy decision

Managed instance group

100% 100% 100% 85% 100% 100%

Target CPU Utilization = 0.75


1. Actual utilization = 385/400 = 96.25%
2. If +1 then 385/500 = 77% (Still above target 75%)
3. If +2 then 385/600 = 64.16%

The percentage utilization that an additional VM contributes depends on the size of


the group. The 4th VM added to a group offers 25% increase in capacity to the group.
The 10th VM added to a group only offers 10% more capacity, even though the VMs
are the same size.

In this case shown in the diagram Autoscaler is conservative and rounds up. In other
words, it would prefer to start an extra VM that isn't really needed than to possibly run
out of capacity.

https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decision
s
https://cloud.google.com/compute/docs/autoscaler/multiple-policies
Scale-in policy decision

Managed instance group

80% 55% 40% 36% 73% 46%

Target CPU Utilization = 0.75


1. Actual utilization = 330/600 = 55%
2. If -1 then 330/500 = 66% (Still below target)
3. If -2 then 330/400 = 82.5%

In this example, removing one VM doesn't get close enough to the target of 75%.
Removing a second VM would exceed the target. Autoscaler behaves conservatively.
So it will shut down one VM rather than two VMs. It would prefer underutilization over
running out of resource when it is needed.

TIP: When would you use Cloud Monitoring metrics for autoscaling?
Scenario #1
Question

Which of these requirements will Cloud Monitoring dashboards, metrics, and reporting
satisfy?

A. Improve security by defining and adhering to a set of security and Identity and
Access Management (IAM) best practices for cloud

B. Encrypt data on the wire and at rest.

C. Analyze and optimize architecture for performance in the cloud.

D. Support multiple VPN connections between the production data center and
cloud environment.

Origin: Case study


Scenario #1
Answer

Which of these requirements will Cloud Monitoring dashboards, metrics, and reporting
satisfy?

A. Improve security by defining and adhering to a set of security and Identity and
Access Management (IAM) best practices for cloud

B. Encrypt data on the wire and at rest.

C. Analyze and optimize architecture for performance in the cloud.

D. Support multiple VPN connections between the production data center and
cloud environment.
Scenario #1
Rationale

C - Cloud Monitoring metrics will help to analyze and optimize performance for the
cloud, because it can be used to gather metrics -- and custom metrics if needed to get
to specific behavior of the applications being migrated.
A, B, D - Cloud Monitoring does not necessarily improve security, although alerts could
be set to identify problem circumstances that could indicate vulnerabilities.
Cloud Monitoring is incidental to VPN connections to on prem. It doesn't have anything
to do with encryption processes.

https://cloud.google.com/stackdriver/docs/
https://cloud.google.com/monitoring/agent/
https://cloud.google.com/monitoring/custom-metrics/
Scenario #2
Question

How can a company connect cloud applications to an Oracle database in its data
center to meet its business requirement of up to 10 GB of transactions with an SLA?

A. Implement a high-throughput Cloud VPN connection

B. Cloud Router with VPN

C. Dedicated Interconnect

D. Partner Interconnect
Scenario #2
Answer

How can a company connect cloud applications to an Oracle database in its data
center to meet its business requirement of up to 10 GB of transactions with an SLA?

A. Implement a high-throughput Cloud VPN connection

B. Cloud Router with VPN

C. Dedicated Interconnect

D. Partner Interconnect

Origin: Case study


Scenario #2
Rationale

D - Partner Interconnect is good up to 50 Gbps and provides an SLA.


A - VPN, even high-throughput VPN, has a 99.9% SLA on the VPN service being
available. That doesn't mean the Internet will be available to transport the data. So no
real SLA is possible.
B - Cloud Router with VPN is a BGP-based method of dynamically discovering routes. It
is incidental to connectivity.
C - Dedicated Interconnect will provide an SLA but is cost effective above 10 Gbps up to
80 Gbps or 2 x 100 Gbps (per dedicated connection)
https://cloud.google.com/vpn/sla
https://cloud.google.com/network-connectivity/docs/vpn/concepts
https://cloud.google.com/network-connectivity/docs/how-to/choose-product#cloud-interconnect
https://cloud.google.com/network-connectivity/docs/interconnect/concepts/partner-overview

The first way to distinguish the remaining options is by speed/volume of data.


● Cloud VPN is useful for low-volume connections.
● Partner Interconnect is useful for data up to 50 Gbps.
● Direct Interconnect is useful for data from 10 Gbps to 80 Gbps or 2 x 100
Gbps.

The Cloud VPN SLA covers the availability of the VPN service, not the availability of
the public internet.
Business internet SLAs from ISPs is commonly between 99% and 99.5% for a
dedicated line.
Therefore, even though the Cloud VPN service is available 99.9% of the time, the
communication it relies on will be down between ½% and 1% of the time. That doesn't
meet "guaranteed service availability" requirements.
https://cloud.google.com/vpn/sla
https://cloud.google.com/network-connectivity/docs/vpn/concepts
The VPN SLA provides 99.9% service availability. That doesn't mean that the you will
be able to communicate over the VPN. Because the VPN is dependent on the
availability of the Internet.

https://cloud.google.com/network-connectivity/docs/how-to/choose-product#cloud-inte
rconnect
Dedicated Interconnect:
● 10 Gbps to 80 Gbps, or 2 x 100 Gbps (200 Gbps),
● Google offers end-to-end SLA for the connectivity between your VPC and
● on-premises networks for Google-defined topologies.

Partner Interconnect:
https://cloud.google.com/network-connectivity/docs/interconnect/concepts/partner-ove
rview
● Up to 50 Gbps
● Google offers two SLA's that apply only to the connectivity between your VPC
network and the service provider's network.
● One is 99.99% and the other is 99.9%
● The SLA doesn't include the connectivity between your network and the
service provider's network.
Module agenda
Designing for Security and
Compliance
Analyzing and Optimizing Technical
and Business Processes
Developing Procedures to Test
Resilience of Solution in Production
Managing Implementation

Ensuring Solution and Operations


Reliability
Cloud Architect case study 05: Managing
implementation
A Finserv customer had this interesting business requirement...
● Encryption in transit and at rest for all developer operations.

● Follows Google Best Practices.

● All Keys must be managed by Company.

A lot of the customers we see are in the Enterprise space, so their needs are very
similar. This example came from a Financial Services company. We often see similar
requirements among FinServ companies.

A Finserv customer had this interesting business requirement...


● Encryption in transit and at rest for all developer operations
● Follows Google Best Practices
● All Keys must be managed by Company - they wanted to own the keys
Cloud Architect case study 05: Managing
implementation
We mapped that to technical requirements like this...
● Use Google Authentication.

● No Public IP access unless through bastion host.

● No Operations team access to production environment.

● Minimize downloaded keys.

● Keys accounted for via business logic application.

The real trick here is that the structure and solution had to be put into production at
one time. It couldn't be built in parts into production. It had to be all working when it
went into production. That caused us to think about what parts were inherent, and
what parts we could automate. So we ended up using a Jenkins pipeline and
Deployment Manager templates for parts of this automation.

We mapped that to technical requirements like this…


● Use Google Authentication.
● No Public IP access unless through bastion host.
● No Operations team access to production environment. That means "no ops" -
everything is automated.
● Minimize downloaded keys.
● Keys accounted for via business logic application.
Cloud Architect case study 05: Managing
implementation
And this is how we implemented that technical requirement.
● All Google APIs are encrypted in transit, and authenticated.

● Production has operations team access - all deployment pipelines via Jenkins and
Deployment Manager. Business Logic in python templates in Deployment Manager.

● The Cloud SDK was not installed in local machines - Cloud Shell ensures no keys are
downloaded.

● Service Account keys when needed for off-Google Cloud clients are managed via
deployment pipelines.

All of the Google APIs are encrypted in transit, and authenticated. So that requirement
was inherited and automatic. The production team needed operations access but
without handing them keys. So what we did is implemented all operations in
deployment pipelines using Jenkins and Deployment Manager. The business logic
was implemented using python in the Deployment Manager templates.

And this is how we implemented that technical requirement.


● All Google APIs are encrypted in transit, and authenticated.
● Production has operations team access - all deployment pipelines via Jenkins
and Deployment Manager. Business Logic in python templates in Deployment
Manager.
● Cloud SDK was not installed in local machines - Cloud Shell ensures no keys
are downloaded.
● Service Account keys when needed for off-Google Cloud clients are managed
via deployment pipelines.

There are two kinds of operations actions; on-Google Cloud actions and off-Google
Cloud actions. For on-Google Cloud actions, we didn't install Cloud SDK on local
machines. Instead, we set them up to use Cloud Shell. That ensured that no keys
were downloaded. For off-Google Cloud actions,the Service Account keys were
managed via the deployment pipelines. Any time there was a need for off-Google
Cloud access, the clients are managed via the deployment pipelines. So that means
there is full audit, control, and records of those keys, who had access to them, and
when and where they were used.
Advising development/operation teams to ensure
successful deployment of the solution
Exam outline Tips

Application development Kubernetes, App Engine, Containers

Be familiar with the Cloud SDK and how it works for


API best practices automating infrastructure. Be familiar with ML APIs
such as the Cloud Natural Language API.

Be familiar with the different kinds of testing: black


Testing frameworks (load/unit/integration) box/clear box, unit, integration.
Be able to differentiate them.

Especially Be familiar with the features for migrating


Data and system migration tooling or synchronizing data between on-premises or
non-Google Cloud cloud and Google Cloud.
Preparing for deployment and release management

Launch Release

● Reliability is
Dependencies fundamental
Capacities ● Automation;
SPoFs self-service release
Security ● Access control over
Access critical features
and processes
Phased rollout

Create a launch checklist. 08 ArchDP Deployment, monitoring and alerting, and


incident response v1.0.3, Slide 5.

Dependencies, Capacities, Single Points of Failure, Security and access, Rollout plan
(phased)

Release Management
● Create a launch checklist. 08 ArchDP Deployment, monitoring and alerting,
and incident response v1.0.3, Slide 6.
● Automate everything you can. Reliability is key. Self-service release process.
Implement access control over critical release features/processes.
Capacity planning for launch

Test first

Work through issues before serving


user traffic
● Identify bottlenecks

Slow, staged, iterative


● Dark launches
● Use invitations to stage launch

Use invitations to slow or stage a launch.


Interacting with Google Cloud using the Cloud SDK
(gcloud, gsutil and bq)

Exam outline Tips

Install tools on a computer or on a VM. Example: gsutil for


Local installation
backup to Cloud Storage from data center.

DevOps interface. Cloud-based shared VM with tools


Cloud Shell pre-installed. Convenient, but no SLA. So use a VM for
reliability because Compute Engine has an SLA.
Scenario #1
Question

Implement back-out/rollback for website with 100s of VMs. Site has frequent
critical updates.

A. Create a Nearline copy of static data in Cloud Storage.

B. Create a snapshot of each VM prior to update, in case of failure.

C. Use managed instance groups with the “rolling-action start-update” command


when starting a rolling update.

D. Only deploy changes using Deployment Manager templates.

Origin: CAPE
Scenario #1
Answer

Implement back-out/rollback for website with 100s of VMs. Site has frequent
critical updates.

A. Create a Nearline copy of static data in Cloud Storage.

B. Create a snapshot of each VM prior to update, in case of failure.

C. Use managed instance groups with the “rolling-action start-update” command


when starting a rolling update.

D. Only deploy changes using Deployment Manager templates.

Origin: CAPE
Scenario #1
Rationale

C - Allows compute engine to handle updates. Easy management of VMs.


Website with 100's of VMs. Load balanced. Likely already using a managed instance group.
D - Large overhead and chance for version conflicts between DM templates if an old
template is changed that running infrastructure relies on.
B - A valid approach. But with 100s of VMs, it will be slow and storing all those images will
be expensive. The recommendation is to create snapshots outside of normal production
periods to avoid the slow down. More data = larger snapshots, slower to create.
A - Unreliable recovery method. Can't roll back once the copy is overwritten.

https://cloud.google.com/compute/docs/instance-groups/updating-managed-instance-groups
https://cloud.google.com/compute/docs/disks/restore-and-delete-snapshots
Scenario #2
Question

A car reservation system has long-running transactions. Which one of the following
deployment methods should be avoided?

A. Execute canary releases.

B. Perform A/B testing prior to release.

C. Introduce a blue-green deployment model.

D. Introduce a pipeline deployment model.

Origin: Experience
Scenario #2
Answer

A car reservation system has long-running transactions. Which one of the following
deployment methods should be avoided?

A. Execute canary releases.

B. Perform A/B testing prior to release.

C. Introduce a blue-green deployment model.

D. Introduce a pipeline deployment model.

Origin: CAPE
Scenario #2
Rationale

C - Switching the load balancer from pointing at the green "good" environment to the blue
"new" environment is a fast way to rollback if there is a problem during release. However,
long-running transactions will be disrupted by that switch.
A - Testing the application with a few users before releasing to everyone will detect
problems early and confine their impact.
B - Performing testing of features "A" with the feature, "B" without the feature, will detect
problems before release.
D - Pipeline deployment - introducing orderly procedures into the QA process can improve
the effectiveness of QA.
https://cloud.google.com/load-balancing/docs/
https://cloud.google.com/load-balancing/docs/backend-service

The second link discusses long-running connections and how to support them.
Module agenda
Designing for Security and
Compliance
Analyzing and Optimizing Technical
and Business Processes
Developing Procedures to Test
Resilience of Solution in Production
Managing Implementation

Ensuring Solution and Operations


Reliability
Site Reliability Engineering

TIP Product

You need to know about every one Development


of these layers and their
order—how work is divided in SRE. Capacity Planning

Testing and Release


https://landing.google.com/sre/book/index.html Procedures

Post mortem/
Root cause analysis

Incident Response

Monitoring

Site Reliability Engineering or SRE


https://landing.google.com/sre/
The SRE Book -- Read Online for FREE:
https://landing.google.com/sre/book/index.html
Do you want to automate infrastructure or workflow?

Infrastructure Workflow
Automation Orchestration

Deployment Manager Cloud Composer


● Terraform ● Apache Airflow
● Ansible
● Chef
● Puppet

Infrastructure automation tools like Deployment Manager make creating cloud


infrastructure manageable and repeatable and provide a method of documenting the
infrastructure which is a common business requirement.

Workflow orchestration is the idea that most business applications rely on passing
data from one area of work concentration to another. For example, in the supply chain
raw materials are ordered. In the factory, the materials are manufactured into
products. At the loading dock the products are shipped. And in the store the products
are sold. Each of these parts of the business have systems associated with them:
resource management, manufacturing control, shipping and receiving, sales. And
those systems need to share information with one another to operate the business.
The overall solution is more about workflow between these systems than about the
infrastructure of any single system.

If you start to consider workflow in your overall design, you will start to take control
over the overall solution and the reliability and methods of operations of the total
system.

Q: Why should you have a conductor for your orchestra?


A: To ensure solution and operations reliability.
Create data infrastructure when the workflow
requires it
Dataproc Workflow Template
Create a cluster
Parameters

Select existing cluster

YAML
Template DAG Submit jobs

Wait on dependencies

Delete a cluster

A Dataproc Workflow Template is a simple example if the crossover between


workflow and infrastructure. In this case, the workflow (scheduled or on demand)
causes the service to create a Dataproc cluster to process data, and then deletes the
cluster when the work is done. Infrastructure becomes dynamic in a workflow
context.

Details:

The Dataproc Workflow Template is a YAML file that is processed through a Directed
Acyclic Graph (DAG). It can create a new cluster, select from an existing cluster,
submit jobs, hold jobs for submission until dependencies can complete, and it can
delete a cluster when the job is done.

It is currently available through the gcloud command and the REST API, but not
through Console.

The Workflow Template becomes active when it is instantiated into the DAG. The
Template can be submitted multiple times with different parameter values. You can
also write a template inline in the the gcloud command, and you can list workflows
and workflow metadata to help diagnose issues.
Cloud Composer: Extensible workflow orchestration

dag_folder Create DAG object

Scheduling
Cron parameters interval

P P Operations
REST API calls
P
Dependencies Relationships
between
Python
operations

Computational workflow and


Data processing pipeline DAG
orchestration in a single service

Cloud Composer, based on Apache Airflow, is a service that provides extensible


dependency management for complex workflows. Because the Directed Acyclic
Graphs (DAGs) are written in Python, Cloud Composer can be extended to
coordinate and orchestrate anything with a Python-compatible API -- which are most
services today.

In Airflow/Composer, there can be multiple DAGs, and they are defined in Python.
Each DAG lives in the dag_folder.

A few notes about Cloud Composer:

● Workflow orchestration
○ Concerned with the instructions necessary to complete each step.
○ Computational workflow
○ Data processing pipeline
○ Dependency management
○ Extensible operators
■ operations --> REST APIs
Monitoring/Logging/Alerting solution

You can only monitor


the interactions

Alert on customer pain


You can monitor the
business logic

TIP: Make sure you know the difference between black box monitoring and clear box
monitoring.
Dashboards can visualize utilization and
network traffic

Monitoring

Cloud Monitoring allows you to create custom dashboards that contain charts of the
metrics that you want to monitor. For example, you can create charts that display your
instances’ CPU utilization, the packets or bytes sent and received by those instances,
and the packets or bytes dropped by the firewall of those instances.

In other words, charts provide visibility into the utilization and network traffic of your
VM instances, as shown on this slide. These charts can be customized with filters to
remove noise, groups to reduce the number of time series, and aggregates to group
multiple time series together.

For a full list of supported metrics, please refer to the documentation:


https://cloud.google.com/monitoring/api/metrics_gcp
Alerting policies can notify you of certain conditions

Monitoring

Now, although charts are extremely useful, they can only provide insight while
someone is looking at them. But what if your server goes down in the middle of the
night or over the weekend? Do you expect someone to always look at dashboards to
determine whether your servers are available or have enough capacity or bandwidth?
If not, you want to create alerting policies that notify you when specific conditions are
met.

For example, as shown on this slide, you can create an alerting policy when the
network egress of your VM instance goes above a certain threshold for a specific
timeframe. When this condition is met, you or someone else can be automatically
notified through email, SMS, or other channels in order to troubleshoot this issue.
Uptime checks test the availability of your
public services

Monitoring

There are also uptime checks that you can configure to test the availability of your
public services from locations around the world, as you can see on this slide. The type
of uptime check can be set to HTTP, HTTPS, or TCP. The resource to be checked can
be an App Engine application, a Compute Engine instance, a URL of a host, or an
AWS instance or load balancer.

For each uptime check, you can create an alerting policy and view the latency of each
global location.
Google Cloud’s operations suite unifies
operations tools

Error
Monitoring Logging Reporting Trace Debugger Profiler

Nagios Fluentd Sentry LTTng GDB Dynatrace


Core Heapster OverOps Trace Gprof Microsoft
Zabbix Prometheus Airbrake Compass Valgrind System Center
Cacti Graylog2 Raygun Babeltrace LogicMonitor
OpenNMS Logcheck PagerDuty AppDynamics
Icinga Logwatch Pingdom
Logstash

TIP: Know how to use Trace and Debug

Examples of other tools that Google Cloud’s operations suite replaces. Note that it is
not just a collection of alternate tools that is the issue, but how you use them together.
The individual tools are not integrated or designed to work together, so a lot of manual
procedures and translation/massaging of data are required to use them together. With
Google Cloud’s operations suite, the integration is by design so that work disappears.
It is is also multi-cloud, able to manage projects across Google Cloud and AWS.
Playbooks and "Easy" buttons

TIP: Another old saying: "People don't plan to fail, they fail to plan." Another
way of saying this is "The only time we have to prepare for emergencies is
BEFORE they happen. Once the emergency is occurring it is too late to
prepare." You can design a great technical solution, but if it doesn't include
human processes then it might not be adaptive and resilient.

Easy buttons are tools and processes that automate common actions.
A playbook is a list of what to do when.

General rule: For every alert you should have a play in the playbook.

08 ArchDP Deployment, monitoring and alerting, and incident response v1.0.3, slides
25 and 26.

Use microservices with RESTful APIs


Use Google Cloud APIs
Drain a service for maintenance

For every alert the monitoring system can generate, you should have a Playbook
entry.
Standardize the format of the playbook.
Write for newly onboarded employees -- don't assume any company-specific
knowledge.
Develop a structured incident response

Monitoring dashboard Plans and tools for


Alerting regimen
responding to issues

TIP: What are the differences between a Dashboard, an Alert, and Incident
Response?

Being prepared: Fast response. Know what to do when.


Consistency: Reduce duplication of effort. Know who is doing what.
Rehearsal: practice incident response. Turn incidents into training -- learn from them
-- what worked and what didn't.

SRE tip: Rehearse. The time to make sure everything is working and people know
their procedures is BEFORE an emergency occurs, not during it.

https://docs.google.com/presentation/d/10og__CX4qygNG0mpH7GOphpkPeXOQ3g
RJ5dlzGKcbQs/edit#slide=id.g250387a28f_0_4050

● Plans and tools for responding to issues


● A set of EASY buttons
Supporting operational troubleshooting

Log Trace Debug

TIP: Find a lab that uses logging and trace and debug to identify and solve an
application problem. This will give you a sense of the value and how these
components work together.
https://cloud.google.com/logging/docs/
https://cloud.google.com/trace/docs/ ← App Engine Performance / Latency
information
https://cloud.google.com/debugger/docs/
Strategies for dealing with failure
Obviation Design a system in which a particular type of failure is impossible.

Prevention Take steps to ensure that a possible failure does not occur.

Detect a failure before or as it is happening and take steps to reduce or


Detection and Mitigation
eliminate the effects of it.

Instead of failing completely, handle stress and return to full service when
Graceful Degradation
the issue passes.

Repair Fix the problem. Hopefully, it won't come back, at least not in the same way.

Allow the problem to occur and get service back as quickly as possible.
Recover
Measure indicators of recovery and work to improve them.

SRE saying: "Hope is not a strategy."


Evaluating quality control measures

Measures
● Metrics, Reports, SLIs
● Objectives, goals, watermarks, limits
Human processes
● Evaluation
● Decisions
● Actions

Service Level Indicators (SLIs)

TIP: Qualities are often where our goals start. But figuring out how to measure
them quantitatively enables data-driven operations. It can be difficult to figure
out exactly what to measure, because sometimes what is easily measured is
not a good indicator of customer interests.

Example: CPU utilization may or may not indicate user satisfaction. Round-trip delay
or frequency of request errors might be a better measure of the user's experience.

What metrics are you using? Can you define metrics that relate directly to user
experience and service objectives?

What are the watermarks or alert levels at which human processes are engaged?
How are you setting those values? When do they need to be revisited and updated?
How do you know they are related to important events?

What are the people supposed to do? What decisions or actions are they supposed to
make/take? Are these documented?
Cloud Architect case study 06: Solution and
operations reliability
A customer had this interesting business requirement...

● The back-office system needs to support frequent updates.

● The back-office system needs to be available - especially between 06:00 CEST and
18:00 CEST.

● A failure in one part of the back-office system shouldn’t bring down the entire
system.

A customer had this interesting business requirement...

● The back-office system needs to support frequent updates


● The back-office system needs to be available - especially between 06:00
CEST and 18:00 CEST
● A failure in one part of the back-office system shouldn’t bring down the entire
system
● Customer wants to re-architect system. Does not want to bring down the entire
system when doing an update.
Cloud Architect case study 06: Solution and
operations reliability
We mapped that to technical requirements like this...

Microservices!

● Break apart the back-office system into independent services.

● Create a standard way for teams to publish logs and metrics for their services.

● Create a standard way for services to be rolled out.

We mapped that to technical requirements like this…

Microservices!
● Break apart the back-office system into independent services
● Create a standard way for teams to publish logs and metrics for their services
● Create a standard way for services to be rolled out

This was a natural fit for microservices. They knew that when they told development
groups that they would be developing their own microservices, that they needed
standards for reliability and scalability, and they want common ways to monitor the
applications.
Cloud Architect case study 06: Solution and
operations reliability
And this is how we implemented that technical requirement.

Google Kubernetes Engine

● Microservices deployed into a shared cluster


● Surging Rolling Deployments with K8s deployment resource

Cloud Monitoring

● Custom Metrics - a wrapper library around the Cloud Monitoring client libraries to:
○ Expose “common” metrics
○ Expose custom metrics

And this is how we implemented that technical requirement.

Google Kubernetes Engine


● Microservices deployed into a shared cluster
● Surging Rolling Deployments with K8s deployment resource

Cloud Monitoring
● Custom Metrics - a wrapper library around the Cloud Monitoring client libraries
to:
● Expose “common” metrics
● Expose custom metrics

Solution was to use Cloud Monitoring. Exposing the metrics could be done through
dashboards. Exposed metrics through prometheus standard, scraped from APIs, and
sent to Cloud Monitoring where it could be exposed through dashboards.

They used custom metrics in Cloud Monitoring, so they were able to monitor and
scale their microservices based on those metrics.
Scenario #1
Question

A microservice has intermittent problems that bursts logs. How can you trap it for
live debugging?

A. Log into machine with microservice and wait for the log messages.

B. Look for error in Error Reporting dashboard.

C. Configure microservice to send traces to Cloud Trace.

D. Set a log metric in Cloud Monitoring, alert on it past a threshold.

Origin: CAPE
Scenario #1
Answer

A microservice has intermittent problems that bursts logs. How can you trap it for
live debugging?

A. Log into machine with microservice and wait for the log messages.

B. Look for error in Error Reporting dashboard.

C. Configure microservice to send traces to Cloud Trace.

D. Set a log metric in Cloud Monitoring, alert on it past a threshold.


Scenario #1
Rationale

D - A Cloud Monitoring metric can identify a burst of log lines. You can set an alert. Then
connect to the machine while the problem is happening.
A - Chances of catching it on one machine is low.
B - Error reporting won’t necessarily catch the log lines unless they are stack traces in the
proper format. Additionally just because there is a pattern doesn’t mean you will know
exactly when and where to log in to debug.
C - Trace may tell you where time is being spent but wont let you hone in on the exact
host that the problem is occurring on because you generally only send samples of traces.
There is also no alerting on traces to notify exactly when the problem is happening.
https://cloud.google.com/logging/docs/
https://cloud.google.com/monitoring/alerts/
Scenario #2
Question

Last week a region had a 1% failure rate in web tier VMs? How should you respond?

A. Monitor the application for a 5% failure rate.

B. Duplicate the application on prem to compensate for failures in the cloud.

C. Perform a root cause analysis, reviewing cloud provider and deployment


details to prevent similar future failures.

D. Halt all development until the application issue can be found and fixed.

Origin: Experience
Scenario #2
Answer

Last week a region had a 1% failure rate in web tier VMs? How should you respond?

A. Monitor the application for a 5% failure rate.

B. Duplicate the application on prem to compensate for failures in the cloud.

C. Perform a root cause analysis, reviewing cloud provider and deployment


details to prevent similar future failures.

D. Halt all development until the application issue can be found and fixed.
Scenario #2
Rationale

C - Perform root cause analysis, because you don't know from the information given
whether the issue had to do with the Cloud Provider or was in the application or something
to do with the interface between the application and cloud resources. The goal of
identifying root cause is to prevent future failures, that might include changing procedures.
A - Raising the threshold doesn't help identify the underlying issue.
B - The assumption is that the cloud is unreliable and on prem is more reliable, so it needs
to act as a backup. That's a lot of work that might not be needed and still doesn't find the
cause.
D - The assumption is that the application is the problem. But a 1% error could be within
SLA for some services. It might not be the application at all. It could be an one-time issue.
The information doesn't tell us if this is a recurring problem.
Challenge Lab 02
PCA Prep - Update and Scale Out a
Containerized Application on a
Kubernetes Cluster

1:10

A Challenge Lab has minimal instructions. It explains the circumstance and the
expected results -- you have to figure out how to implement them.

This is a timed lab.


The lab will expire after 1 hour and 10 minutes.
The lab can be completed in 50 minutes.
o ne
this ass!
Try er cl
aft

Supplemental
Challenge Lab
PCA Prep - Deploy a Compute
Instance with a Remote Startup Script

1:30

A Challenge Lab has minimal instructions. It explains the circumstance and the
expected results -- you have to figure out how to implement them.

This is a timed lab.


The lab will expire after 1 hour and 30 minutes.
The lab can be completed in 1 hour and 15 minutes.

You might also like