
Snowflake Overview

Introduction to Snowflake

 Snowflake is a cloud data warehouse that differs from traditional on-premises databases and
data warehouses.
 It operates on cloud infrastructure provided by:
o Amazon Web Services (AWS)
o Microsoft Azure
o Google Cloud Platform (GCP)

Key Features and Benefits

 Niche Features:
o Time Travel: Allows users to access historical data at any point within a defined period.
o Fail Safe: Provides a seven-day period to recover historical data.
o Data Cloning: Enables instant, zero-copy clones of databases, schemas, and tables.
o Data Sharing: Facilitates secure and governed sharing of data across organizations.
 Cost Efficiency:
o Pay-as-you-use Model: Customers are charged based on compute and storage usage.
o Separation of Costs: Storage costs are separate from compute costs, providing flexibility
and cost savings.
 Virtual Warehouses:
o These are clusters or computing engines used to run queries.
o Charges are based on how long a warehouse runs to execute queries; storage is billed separately based on the amount of data stored.
 Infrastructure Management:
o Snowflake manages all hardware and software infrastructure.
o Provides a Software as a Service (SaaS) platform, eliminating the need for customers to
handle installations or maintenance.
 Scalability and Performance:
o Elastic and Highly Scalable: Can scale up or down based on workload.
o Fault Tolerant: Ensures high availability and reliability.
o Massive Parallel Processing (MPP): Capable of handling large workloads and complex
queries by spinning up multiple clusters.
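
The niche features listed above map to plain SQL. As a rough sketch (the table and database names here are hypothetical), Time Travel and zero-copy cloning look like this:

-- query a table as it looked 30 minutes ago (Time Travel)
select * from orders at(offset => -60*30);

-- create an instant, zero-copy clone of a database
create database analytics_dev clone analytics;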

User Experience

 Account Setup:
o Users can quickly create an account and set up a data warehouse without dealing with
infrastructure issues.
 Interface Exploration:
o Key options available on the Snowflake interface include:
 Databases
 Shares
 Warehouses
 Worksheets
 History
 Other options
o User-specific information, such as username and roles, can be accessed via a dropdown
menu on the right side of the screen.

Importance of Roles

 Roles in Snowflake:
o Roles play a significant role in managing access and permissions within Snowflake.
o Understanding and configuring roles is crucial for effective Snowflake usage and security
management.

Different Types of Roles in Snowflake


Snowflake Roles Overview
1. Account Admin

 Role: Account Administrator


 Level: Topmost role
 Capabilities:
o Full control over the account
o Access to account settings and notifications
 Caution: Should be granted to a limited number of users

2. Security Admin

 Role: Security Administrator


 Capabilities:
o Monitor and manage users and roles.
o Modify and monitor user sessions.
o Provide and revoke grants.
 Limitations: Cannot see account notifications

3. User Admin

 Role: User and Role Administrator


 Capabilities:
o Create users and roles.
o Provide privileges to other roles.

4. Sysadmin

 Role: System Administrator


 Capabilities:
o Create warehouses, databases, and other objects.
o Grant privileges on warehouses, databases, and objects to other roles
 Default Role: Automatically selected upon login.
5. Public

 Role: Default role for all users


 Capabilities:
o Own securable objects
o Automatically granted to every user and role

Key Points

 Account Admin and Security Admin can handle account settings.


 Sysadmin and Public roles do not have access to account settings and notifications.
 Each role has specific significance and should be assigned carefully.
 Understanding these roles is crucial for Snowflake certification and effective account
management.
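
As a hedged illustration of how these roles divide the work (the role, user, and warehouse names below are hypothetical, and the password is a placeholder):

use role useradmin;
create role if not exists analyst;                      -- USERADMIN creates roles and users
create user if not exists jdoe
  password = 'StrongPassw0rd!'                          -- placeholder credential
  default_role = analyst;

use role securityadmin;
grant role analyst to user jdoe;                        -- SECURITYADMIN manages role grants

use role sysadmin;
grant usage on warehouse compute_wh to role analyst;    -- SYSADMIN grants privileges on objects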

Session Notes: Snowflake Database Overview


URL Breakdown

1. Account ID: The unique alphanumeric account identifier that appears before the region in the URL.
2. Region: The cloud region selected during account creation (e.g., us-east-1).
3. Domain: The domain currently in use (e.g., snowflakecomputing.com).

Databases Overview

 Default Databases Provided by Snowflake:


o snowflake_sample_data
o demo_db
o util_db

Interacting with Databases

1. Selecting a Database:
o Click anywhere on the row of the database (not the hyperlink) to bring up a slider on the
right side.
o The slider allows granting privileges to the selected user or role (e.g., sysadmin).

2. Database Options:
o Create Clone
o Drop
o Transfer Ownership

3. Viewing Tables:
o Click on the hyperlink of the database name to view available tables.
o Example: Clicking on snowflake_sample_data shows multiple tables, including a
large table up to 10.9 TB.
4. Granting Privileges:
o Click on the row of a specific table to provide privileges.
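
The UI actions described above have SQL equivalents; a rough sketch (the clone name and the table used for the grant are illustrative):

-- clone a database without copying its storage
create database demo_db_clone clone demo_db;

-- grant privileges on a database and one of its tables to a role
grant usage on database demo_db to role sysadmin;
grant select on table demo_db.public.customers to role sysadmin;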

Database Functionalities

 Top Options:
o Tables
o Views
o Schemas
o Stages
o File Formats
o Sequences
o Pipes

 Common Elements:
o Tables, Views, Schemas: Familiar from traditional data warehouses.
o Stages, File Formats, Sequences, Pipes: Unique functionalities provided by Snowflake.

Exploring Tables

1. Breadcrumb Trail:
o Shows the navigation path (e.g., Databases > Database Name > Tables >
Table Name).

2. Table Details:
o Schema information in brackets.
o Metadata for each table, including columns, ordinal values, data types, nullability, and
comments.

Exploring Snowflake Architecture: Warehouses


Introduction to Warehouses in Snowflake

 Warehouses in Snowflake are not traditional data warehouses.


 They are virtual warehouses designed to handle query processing.

Snowflake Architecture Overview

Snowflake architecture consists of three layers:


1. Database Storage Layer (Storage Layer)
2. Virtual Warehouse Layer
3. Cloud Services Layer

1. Database Storage Layer

 Function: Handles data storage.


 Storage Type: Hybrid Columnar Storage.
o Hybrid Columnar Storage:
 Data is stored in compressed blocks, which speeds up query processing.
 Unlike pure row or pure column storage, data is fetched in column blocks within each partition, enhancing performance.
 The underlying data files reside in the cloud provider's object storage (e.g., Amazon S3).
 Benefit: Faster query processing by fetching compressed data blocks.

2. Virtual Warehouse Layer

 Function: Handles query processing.


 Virtual Warehouses:
o Known as the "muscle" of the system.
o Perform Massive Parallel Processing (MPP):
 Can handle large datasets (e.g., petabytes) by dividing data into chunks and
processing them in parallel.
o Scalability:
 Can automatically scale up or down based on user needs.
o Auto Suspend Feature:
 Automatically suspends warehouses when idle for a specified period (e.g., 10
minutes).
 Reactivates warehouses when a query is executed, ensuring cost efficiency.
 Benefit: Efficient and scalable query processing with cost-saving features.

3. Cloud Services Layer

 Function: Manages metadata, infrastructure, authentication, and access control.


 The cloud services layer is a collection of services that coordinate activities across Snowflake.
These services tie together all the different components of Snowflake to process user requests,
from login to query dispatch. The cloud services layer also runs on compute instances
provisioned by Snowflake from the cloud provider.
 Role: Known as the "brain" of the system.
 Capabilities:
o Stores metadata for tables, schemas, and databases.
o Optimizes queries automatically, eliminating the need for manual optimization.
o Manages infrastructure and various independent scalable services.
 Benefit: Automates crucial data management tasks and optimizes system performance.
Guide to Creating a Virtual Warehouse in Snowflake
Step-by-Step Process

1. Accessing Warehouses:
o Navigate to the Warehouses section in your Snowflake account.
o You will see a list of existing warehouses, including the default compute warehouse
provided by Snowflake.

2. Creating a New Virtual Warehouse:


o Click on the Create button to start the process of creating a new virtual warehouse.

3. Configuring the Virtual Warehouse:


o Name: Enter a name for your virtual warehouse, e.g., my_test_warehouse.
o Size: Select the size of the warehouse. Options range from X-Small to 4X-Large. Be
cautious with larger sizes as they consume more credits per hour (e.g., 4X-Large
consumes 128 credits per hour).
 For this example, select X-Small.
o Clusters:
 Maximum Clusters: Set the maximum number of clusters (e.g., 3). This allows
the warehouse to scale up to handle high loads.
 Minimum Clusters: Set the minimum number of clusters (e.g., 1).
o Scaling Policy: Choose between Standard and Economy. This will be discussed in detail
in a future session.
o Auto Suspend: Set the auto-suspend time to define how long the warehouse should
remain idle before suspending (e.g., 10 minutes).
o Auto Resume: Enable auto-resume to automatically reactivate the warehouse when a
query is issued.
o Comments: Optionally, add a comment for the warehouse (e.g., "This is my test virtual
warehouse").

4. Finalizing the Warehouse Creation:


o Click on Show SQL to view the SQL query used to create the warehouse. This is useful for creating warehouses via SQL scripts (a sketch of such a statement appears after this list).
o Click on Finish to complete the creation process.

5. Viewing and Managing the Virtual Warehouse:


o Once created, the new warehouse will appear in the list with its name, size, and other
configurations.
o The state of the warehouse (e.g., active or suspended) and the number of active clusters
will be displayed.

6. Granting Privileges:
o Click on the row of the warehouse to bring up the slider on the right side.
o Use the slider to grant privileges such as Modify, Monitor, Operate, and Usage to
specific roles (e.g., sysadmin).
o Optionally, enable the With Grant Option to allow the role to grant these privileges
to others.

7. Removing the Slider:


o To close the slider, click again on the row of the warehouse.
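
A hedged sketch of the SQL behind the wizard, assuming the settings from this walkthrough (multi-cluster settings require Enterprise edition or higher):

use role sysadmin;

create warehouse if not exists my_test_warehouse
  warehouse_size    = 'XSMALL'
  min_cluster_count = 1
  max_cluster_count = 3
  scaling_policy    = 'STANDARD'
  auto_suspend      = 600            -- seconds of idle time before suspending (10 minutes)
  auto_resume       = true
  comment           = 'This is my test virtual warehouse';

-- privileges granted through the slider map to GRANT statements
grant usage, operate, monitor, modify on warehouse my_test_warehouse to role sysadmin;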

Scaling Policies in Snowflake Virtual Warehouses


Introduction to Scaling Policies

When your virtual warehouse experiences a high load of queries, scaling policies come into play
to manage the workload efficiently. Snowflake offers two types of scaling policies:

1. Standard (Default)
2. Economy

1. Standard Scaling Policy

 Purpose: Minimize or completely prevent queuing of queries.


 Behavior:
o Automatic Cluster Start: When the system detects queries in the queue, it automatically
starts additional clusters.
o Query Processing: The new clusters handle the queued queries, ensuring minimal delay.
 Cluster Shutdown:
o Consecutive Checks: After 2 or 3 consecutive successful checks (performed at one-
minute intervals), the system determines if the load on the least loaded cluster can be
redistributed to other clusters.
o Redistribution: If redistribution is possible, the least loaded cluster is shut down to
conserve resources.
 Ideal Use Case: Production environments where query performance and speed are critical.

2. Economy Scaling Policy

 Purpose: Conserve credits by minimizing the number of running clusters.


 Behavior:
o Cluster Start: Additional clusters are not automatically started unless the system
estimates enough query load to keep the cluster busy for at least six minutes.
o Query Queuing: This may result in queries being queued and taking longer to complete.
 Cluster Shutdown:
o Consecutive Checks: Performs 5 to 6 checks (at one-minute intervals) to determine if the
load on the least loaded cluster can be redistributed.
o Redistribution: Similar to the standard policy, but with more checks to ensure resource
conservation.
 Ideal Use Case: Non-production environments where cost savings are prioritized over query
performance.

Comparison of Standard and Economy Policies


| Feature | Standard Policy | Economy Policy |
|---|---|---|
| Query Queuing | Minimizes or prevents queuing by starting additional clusters | May result in queries being queued |
| Cluster Start | Automatic when queries are in the queue | Only if there is enough load for at least six minutes |
| Cluster Shutdown Checks | 2-3 checks at one-minute intervals | 5-6 checks at one-minute intervals |
| Ideal For | Production environments | Non-production environments |

Recommendations

 Production Environments: Use the Standard Policy to ensure queries are processed quickly and
efficiently, minimizing delays.
 Non-Production Environments: Consider the Economy Policy to save on credits, especially if
query performance is not a critical factor.
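
The policy can also be set in SQL; a minimal sketch, assuming the example warehouse created earlier:

-- switch an existing multi-cluster warehouse between the two policies
alter warehouse my_test_warehouse set scaling_policy = 'ECONOMY';
alter warehouse my_test_warehouse set scaling_policy = 'STANDARD';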

Snowflake Account and Administrator Roles Overview


1. Account Administrator (ACCOUNTADMIN)

 Role: Most powerful role in the Snowflake system.


 Responsibilities:
o Account Configuration: Responsible for configuring parameters at the account level.
o Object Management: Can view and operate on all objects within the account.
o Billing and Credits: Can view and manage Snowflake billing and credit data.
o SQL Statement Control: Can stop any running SQL statements.
 Hierarchy: In the default access control hierarchy, the ACCOUNTADMIN role owns both the
SECURITYADMIN and SYSADMIN roles.

2. Security Administrator (SECURITYADMIN)

 Role: Focuses on security and user management.


 Responsibilities:
o User Management: Create and manage users.
o Role Management: Create and manage roles.
o Privileges: Includes the privileges to grant and revoke access to various resources within the account.
 Ownership: This role is owned by the ACCOUNTADMIN role in the default hierarchy.

3. System Administrator (SYSADMIN)

 Role: Manages system-level operations and resources.


 Responsibilities:
o Resource Management: Create and manage warehouses, databases, and other objects.
o Privileges: Grant privileges on warehouses, databases, and objects to other roles.
 Ownership: This role is also owned by the ACCOUNTADMIN role in the default hierarchy.

Summary of Roles and Responsibilities

| Role | Responsibilities | Ownership |
|---|---|---|
| ACCOUNTADMIN | Configure account parameters; view and operate on all objects; manage billing and credits; stop running SQL statements | Owns SECURITYADMIN and SYSADMIN |
| SECURITYADMIN | Create and manage users; create and manage roles; grant and revoke access privileges | Owned by ACCOUNTADMIN |
| SYSADMIN | Create and manage warehouses, databases, and other objects; grant privileges on resources | Owned by ACCOUNTADMIN |

Key Points

 The ACCOUNTADMIN role is the topmost role with comprehensive control over the Snowflake
account.
 The SECURITYADMIN role focuses on security aspects, including user and role management.
 The SYSADMIN role handles system-level operations, such as managing warehouses and
databases.
 Both SECURITYADMIN and SYSADMIN roles are owned by the ACCOUNTADMIN role in the
default access control hierarchy.

Recommendations

 Limit Access: Due to the extensive privileges associated with the ACCOUNTADMIN role, it should
be granted to a limited number of trusted users.
 Delegate Responsibilities: Use the SECURITYADMIN and SYSADMIN roles to delegate specific
responsibilities, ensuring a clear separation of duties and enhanced security.

By understanding and appropriately assigning these roles, you can effectively manage your
Snowflake account, ensuring both security and operational efficiency.
Snowflake Pricing Model Overview

Understanding the Snowflake pricing model is crucial for managing costs effectively while using
the platform. Snowflake separates compute and storage costs, and charges are based on
consumption calculated using Snowflake credits. Here’s a detailed breakdown of the components
included in the Snowflake pricing model:

1. Snowflake Editions and Credit Costs

Snowflake offers multiple editions, each with different features and credit costs:

 Standard Edition
 Enterprise Edition
 Business Critical Edition
 Virtual Private Snowflake (VPS)

Each edition has a different cost per credit, which impacts the overall pricing. The value of
Snowflake credits varies based on the edition you are using.

2. Converting Credits to Currency

Snowflake credits are the basis for calculating costs. These credits are converted into dollars or
other currencies when billing:

 Credit Value: The value of a Snowflake credit depends on the edition.


 Currency Conversion: Credits are converted into the billing currency (e.g., USD) based on the
current exchange rates and the edition-specific credit value.

3. Serverless Features and Pay-As-You-Go Model

Snowflake includes several serverless features that are charged on a pay-as-you-go basis:

 Serverless Features: Include services like Snowpipe (data ingestion), automatic clustering, and
materialized view maintenance.
 Pay-As-You-Go: Charges are based on the actual usage of these features, providing flexibility and
cost efficiency.

4. Snowflake Credits

Snowflake credits are the fundamental unit of consumption for both compute and storage costs:

 Compute Costs: Calculated based on the number of credits consumed by virtual warehouses.
 Storage Costs: Calculated based on the amount of data stored, either on-demand or pre-
purchased.
5. Storage Costs

Storage costs can be computed in two ways:

 On-Demand Storage: Charges are based on the actual amount of data stored each month.
 Pre-Purchased Storage: Offers discounted rates for committing to a certain amount of storage in
advance.

6. Virtual Warehouses (Compute Costs)

Compute costs are associated with the use of virtual warehouses:

 Virtual Warehouses: Charged based on the size and duration of usage.


 Credit Consumption: Larger warehouses consume more credits per hour.

7. Data Transfer Costs

Data transfer costs are incurred when moving data in and out of Snowflake:

 Internal Transfers: Typically free within the same region.


 External Transfers: Charged based on the amount of data transferred across regions or to other
cloud providers.

8. Cloud Services Costs

Cloud services costs cover the management and optimization services provided by Snowflake:

 Cloud Services: Include metadata management, query optimization, and infrastructure


management.
 Credit Usage: These services consume credits based on the level of activity and usage.

Pricing Examples

To understand the practical application of Snowflake pricing, let’s look at some examples:

Example 1: Compute Cost Calculation

 Virtual Warehouse Size: Medium (M)


 Usage Duration: 10 hours
 Credits per Hour: 4 credits/hour (for Medium warehouse)
 Total Credits Used: 4 credits/hour * 10 hours = 40 credits
 Cost in USD: 40 credits * Credit Value (based on edition)

Example 2: Storage Cost Calculation

 Storage Type: On-Demand


 Data Stored: 1 TB
 Monthly Rate: $23 per TB (example rate)
 Total Cost: 1 TB * $23 = $23

Example 3: Data Transfer Cost Calculation

 Data Transfer: 100 GB to another region


 Transfer Rate: $0.02 per GB
 Total Cost: 100 GB * $0.02 = $2

By understanding these components and how they contribute to the overall cost, you can manage
and optimize your Snowflake usage effectively.

Understanding Snowflake Credits


What are Snowflake Credits?

 Snowflake Credits: The fundamental unit of measure for consumption on the Snowflake
platform.
 Usage Tracking: Snowflake tracks all resource consumption in the form of credits, not actual
dollar amounts.
 Conversion: Credits can be converted into dollars or other currencies based on the specific
pricing of the Snowflake edition you are using.

How Snowflake Credits are Consumed

 Resource Consumption: Credits are consumed when resources are used, such as:
o Virtual Warehouses: The computing engines that run your queries.
o Cloud Services Layer: Performs work such as metadata management and query
optimization.
o Serverless Features: Includes services like Snowpipe for data ingestion, automatic
clustering, and materialized view maintenance.

Virtual Warehouses

 Role: The computing engines of Snowflake that execute queries.


 Credit Consumption: Virtual warehouses consume credits based on their size and the duration
they are running.

Serverless Features

 Consumption Tracking: All consumption from serverless features is tracked and represented as
Snowflake credits.

Tracking Consumption

 Credits Representation: All resource consumption is represented as Snowflake credits.
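
One way to see this credit-based tracking in practice is through the ACCOUNT_USAGE views; a hedged sketch (requires a role with access to the SNOWFLAKE database, ACCOUNTADMIN by default, and the view has some latency):

-- credits consumed per warehouse over the last 7 days
select warehouse_name, sum(credits_used) as credits
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -7, current_timestamp())
group by warehouse_name
order by credits desc;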


Snowflake Editions Overview

Snowflake offers several editions, each with unique features and capabilities tailored to different
business needs. Here’s a detailed breakdown of each edition:

1. Standard Edition

 Basic Level: The entry-level edition of Snowflake.


 Features:
o Complete SQL Data Warehouse: Supports all SQL queries.
o Secure Data Sharing: Share data across Snowflake accounts and cloud providers (AWS,
Azure, GCP).
o Premium Support: 24/7 support throughout the year.
o Time Travel: Provides 1 day of time travel.
o Enterprise-Grade Encryption: Data is encrypted both in transit and at rest.
o Dedicated Virtual Warehouses: Ensures no downtime or scalability issues.
o Federated Authentication: Integrate with Azure Active Directory for single sign-on.
o Data Replication: Replicate databases across regions or geographical locations.
o External Functions: Support for external cloud providers and native connectors (Python, Node.js, Spark).
o Snowsight Analytics: Built-in analytics capabilities in Snowflake's web UI.
o Data Exchange: Create and manage data exchanges.
o Data Marketplace Access: Access to multiple ETL and analytical tools.

2. Enterprise Edition

 Advanced Level: Includes all features of the Standard Edition plus additional capabilities.
 Additional Features:
o Multi-Cluster Warehouse: Supports multiple clusters for better performance.
o Extended Time Travel: Up to 90 days of time travel.
o Annual Key Rotation: Annual renewal of encryption keys.
o Materialized Views: Support for materialized views.
o Search Optimization Service: Enhanced search capabilities.
o Dynamic Data Masking: Mask sensitive data dynamically.
o External Data Tokenization: Tokenize data for enhanced security.

3. Business Critical Edition

 High-Security Level: Designed for environments with stringent security requirements.


 Includes All Enterprise Features Plus:
o Compliance: Meets guidelines such as HIPAA, PCI, GDPR.
o Enhanced Encryption: Data is encrypted throughout the Snowflake account.
o Tri-Secret Secure: Uses customer-managed keys stored in multiple key vaults for
enhanced security.
o PrivateLink Support: Secure connections using AWS PrivateLink or Azure Private Link.
o Database Failover and Failback: Ensures business continuity with failover and failback
capabilities.

4. Virtual Private Snowflake (VPS)

 Highest Level: Requires direct contact with Snowflake for setup.


 Includes All Business Critical Features Plus:
o Customer Dedicated Virtual Servers: Dedicated servers for enhanced performance and
security.
o In-Memory Encryption Key: Encryption keys are stored in memory.
o Customer Dedicated Metadata Store: Dedicated metadata storage for enhanced
security and performance.

Summary of Features by Edition


| Feature | Standard | Enterprise | Business Critical | Virtual Private Snowflake |
|---|---|---|---|---|
| SQL Data Warehouse | Yes | Yes | Yes | Yes |
| Secure Data Sharing | Yes | Yes | Yes | Yes |
| Premium Support | Yes | Yes | Yes | Yes |
| Time Travel | 1 day | Up to 90 days | Up to 90 days | Up to 90 days |
| Encryption | In transit and at rest | In transit and at rest | Throughout account | Throughout account |
| Dedicated Virtual Warehouses | Yes | Yes | Yes | Yes |
| Federated Authentication | Yes | Yes | Yes | Yes |
| Data Replication | Yes | Yes | Yes | Yes |
| External Functions | Yes | Yes | Yes | Yes |
| Snowsight Analytics | Yes | Yes | Yes | Yes |
| Data Exchange | Yes | Yes | Yes | Yes |
| Data Marketplace Access | Yes | Yes | Yes | Yes |
| Multi-Cluster Warehouse | No | Yes | Yes | Yes |
| Annual Key Rotation | No | Yes | Yes | Yes |
| Materialized Views | No | Yes | Yes | Yes |
| Search Optimization Service | No | Yes | Yes | Yes |
| Dynamic Data Masking | No | Yes | Yes | Yes |
| External Data Tokenization | No | Yes | Yes | Yes |
| Compliance (HIPAA, PCI, GDPR) | No | No | Yes | Yes |
| Tri-Secret Secure | No | No | Yes | Yes |
| PrivateLink Support | No | No | Yes | Yes |
| Database Failover and Failback | No | No | Yes | Yes |
| Customer Dedicated Virtual Servers | No | No | No | Yes |
| In-Memory Encryption Key | No | No | No | Yes |
| Customer Dedicated Metadata Store | No | No | No | Yes |

Conclusion

Each Snowflake edition is designed to cater to different business needs, from basic data
warehousing to high-security environments. By understanding the features and capabilities of
each edition, you can choose the one that best fits your organization's requirements.

Snowflake Serverless Features and Their Impact on Costs

Snowflake offers several serverless features that leverage Snowflake-managed compute


resources. These features consume Snowflake credits based on their usage, which can impact
your overall Snowflake costs. Here’s an overview of these serverless features and their cost
implications:
1. Snowpipe

 Function: Automatically ingests data into Snowflake tables.


 Compute Resource Usage: Uses Snowflake-managed compute resources to load data
continuously.
 Cost Implication: Consumes Snowflake credits based on the volume of data ingested and the
frequency of ingestion.

2. Database Replication

 Function: Replicates databases across different regions or accounts for disaster recovery and
data sharing.
 Compute Resource Usage: Uses compute resources to replicate and synchronize data.
 Cost Implication: Consumes Snowflake credits based on the amount of data replicated and the
frequency of replication.

3. Materialized Views Maintenance

 Function: Maintains materialized views by automatically refreshing them to reflect changes in


the underlying data.
 Compute Resource Usage: Uses compute resources to refresh and maintain materialized views.
 Cost Implication: Consumes Snowflake credits based on the frequency and complexity of the
refresh operations.

4. Automatic Clustering

 Function: Automatically manages the clustering of data to optimize query performance.


 Compute Resource Usage: Uses compute resources to reorganize data based on clustering keys.
 Cost Implication: Consumes Snowflake credits based on the volume of data and the frequency
of clustering operations.

5. Search Optimization Service

 Function: Enhances search performance by creating and maintaining search optimization


structures.
 Compute Resource Usage: Uses compute resources to build and maintain search optimization
structures.
 Cost Implication: Consumes Snowflake credits based on the volume of data and the complexity
of search optimization.

Summary of Serverless Features and Cost Implications


| Serverless Feature | Function | Compute Resource Usage | Cost Implication |
|---|---|---|---|
| Snowpipe | Automatically ingests data into Snowflake tables | Uses compute resources for continuous data loading | Consumes credits based on data volume and ingestion frequency |
| Database Replication | Replicates databases across regions or accounts | Uses compute resources for data replication and synchronization | Consumes credits based on data volume and replication frequency |
| Materialized Views Maintenance | Maintains materialized views by refreshing them | Uses compute resources for refresh operations | Consumes credits based on refresh frequency and complexity |
| Automatic Clustering | Manages data clustering to optimize query performance | Uses compute resources for data reorganization | Consumes credits based on data volume and clustering frequency |
| Search Optimization Service | Enhances search performance by maintaining optimization structures | Uses compute resources for building and maintaining structures | Consumes credits based on data volume and optimization complexity |

Impact on Overall Snowflake Costs

 Increased Consumption: The use of these serverless features will increase the consumption of
Snowflake credits, leading to higher overall costs.
 Cost Management: It is essential to monitor and manage the usage of these features to optimize
costs. Consider the following strategies:
o Usage Monitoring: Regularly monitor the usage of serverless features and their impact
on credit consumption.
o Cost Analysis: Analyze the cost-benefit ratio of using these features to ensure they
provide value relative to their cost.
o Optimization: Optimize the frequency and volume of operations to balance
performance improvements with cost efficiency.

By understanding the serverless features and their cost implications, you can make informed
decisions about their usage and manage your Snowflake costs effectively.

Storage Cost Options in Snowflake

Snowflake offers two primary storage cost options: On-Demand and Pre-Purchased. Each option
has its own advantages and considerations. Here’s a detailed breakdown of both:
1. On-Demand Storage

Overview:

 Flexibility: On-Demand storage is the most flexible and easiest way to purchase Snowflake
storage services.
 Pay-As-You-Go: Similar to the pay-as-you-go model used by cloud providers like AWS, Azure, and
GCP.
 Ideal For: New users or those who are unsure about their storage requirements.

Pricing:

 Fixed Rate: Customers are charged a fixed rate for the services consumed and are billed in
arrears every month.
 Common Price: The common price across regions is $40 per terabyte per month.
 Regional Variations: Prices can vary depending on the cloud provider and region, potentially
going up or down.

2. Pre-Purchased Storage

Overview:

 Capacity Commitment: Pre-purchased storage requires a specific dollar commitment to


Snowflake.
 Provisioned Storage: You must specify how much storage you will use each month, and that
amount will be provisioned for you.

Pricing:

 Cost Efficiency: Pre-purchased storage is generally cheaper than on-demand storage.


 Commitment: Requires careful analysis of your storage needs to avoid underutilization and
potential loss of money.

Considerations:

 Analyze Usage: Pre-analyze your monthly storage requirements to ensure you are consuming
the pre-purchased storage fully.
 Switching Strategy: A popular strategy is to start with on-demand storage, monitor usage, and
then switch to pre-purchased storage once you have a good understanding of your needs.
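
To support that analysis, actual storage consumption can be monitored before committing to capacity; a minimal sketch using the ACCOUNT_USAGE views (access to the SNOWFLAKE database is assumed):

-- daily storage footprint, including stage and fail-safe bytes
select usage_date, storage_bytes, stage_bytes, failsafe_bytes
from snowflake.account_usage.storage_usage
order by usage_date desc;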

Summary of Storage Cost Options


| Feature | On-Demand Storage | Pre-Purchased Storage |
|---|---|---|
| Flexibility | High | Low |
| Pricing Model | Pay-as-you-go | Fixed monthly commitment |
| Cost | $40 per TB per month (common price) | Generally cheaper than on-demand |
| Ideal For | New users, uncertain storage needs | Users with predictable storage needs |
| Billing | Billed in arrears every month | Pre-purchased and provisioned monthly |
| Commitment | None | Requires careful analysis of needs |


Snowflake Pricing Model: Key Points
1. Compute Costs

 Virtual Warehouses: Snowflake's compute resources are called virtual warehouses. Each virtual
warehouse consists of a cluster of compute resources.
 Pay-As-You-Go: Compute costs are based on the actual usage of virtual warehouses, measured
in credits. Users are billed per second, with a minimum of 60 seconds per usage.
 Credit Consumption: The number of credits consumed depends on the size of the virtual
warehouse (e.g., X-Small, Small, Medium, Large, etc.). Larger warehouses consume more credits
per second but provide more compute power.
 Auto-Suspend and Auto-Resume: Virtual warehouses can be configured to automatically
suspend when not in use and resume when needed, helping to manage and reduce compute
costs.
 Scaling: Snowflake supports multi-cluster warehouses that can automatically scale out (add
more clusters) and scale in (reduce clusters) based on the workload, optimizing performance and
cost.
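
As a worked illustration of per-second billing (using the per-hour rates discussed earlier): a Medium warehouse is billed at 4 credits per hour. If it runs for 90 seconds, the charge is 4 × 90 / 3600 = 0.1 credits. If it runs for only 20 seconds, the 60-second minimum applies, so the charge is 4 × 60 / 3600 ≈ 0.067 credits.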

2. Resource Monitors

 Purpose: Resource monitors are used to control and manage credit consumption within
Snowflake accounts, helping to prevent unexpected high usage and costs.
 Configuration: Administrators can set up resource monitors to track credit usage and define
thresholds for different actions.
 Thresholds and Actions:
o Notification: Send alerts when credit usage reaches a specified threshold.
o Suspension: Automatically suspend virtual warehouses or other compute resources
when a threshold is reached to prevent further credit consumption.
o Immediate Suspension: Suspend the assigned warehouses immediately, cancelling any running statements, when a threshold is reached.
 Granularity: Resource monitors can be applied at different levels, such as account-wide, specific
warehouses, or user-defined groups of warehouses.
 Monitoring Periods: Administrators can define monitoring periods (e.g., daily, weekly, monthly)
to reset and track credit usage over specific intervals.

Notes on Resource Monitors in Snowflake


1. Credit Quota

 Definition: Specifies the number of Snowflake credits allocated to the monitor for a specified
frequency interval.
 Frequency Intervals: Can be set to daily, weekly, or monthly.
 Reset Mechanism: The credit quota resets to zero at the beginning of each specified interval.
 Example: If the credit limit is set to 100 credits for September, it resets to zero at the beginning
of October.
 Usage Tracking: Tracks credits consumed by both user-managed virtual warehouses and virtual
warehouses used by cloud services.
 Alert Mechanism: If the combined credit consumption (e.g., 300 credits by user-managed virtual warehouses and 200 credits by cloud services) reaches the quota (e.g., 500 credits), an alert is triggered automatically.

2. Schedule

 Default Schedule: Starts monitoring credit usage immediately and resets used credits to zero at
the beginning of each calendar month.
 Custom Schedule Properties:
o Frequency: Interval at which used credits reset relative to the specified start date and
time. Options include daily, weekly, or monthly.
o Start Date and Time: Timestamp when the resource monitor starts monitoring the
assigned warehouses.
o End Date and Time: Timestamp when Snowflake suspends the warehouses associated
with the resource monitor, regardless of whether the used credits reached any
thresholds.

3. Monitor Level

 Definition: Specifies whether the resource monitor is used to monitor credit usage for the entire
account or specific individual warehouses.
 Options:
o Account Level: Monitors all warehouses in the account.
o Warehouse Level: Monitors specific individual warehouses.
 Importance: This property must be set; otherwise, the resource monitor does not monitor any
credit usage.

Resource Monitor Actions in Snowflake

When configuring a resource monitor in Snowflake, it is crucial to define actions that will be
triggered when the credit usage reaches specified thresholds. These actions help manage and
control credit consumption effectively. Below are the key actions that can be set for a resource
monitor:

1. Notify and Suspend

 Description: This action sends a notification to all account administrators with notifications
enabled and suspends all assigned warehouses.
 Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o Suspension: All assigned warehouses are suspended after all currently executing
statements are completed.
 Consideration: This action does not immediately stop running queries, which means there could
be additional credit consumption beyond the threshold if queries take time to complete.
 Use Case: Suitable when you want to ensure that ongoing queries are not abruptly terminated
but still want to control credit usage.

2. Notify and Suspend Immediately

 Description: This action sends a notification to all account administrators with notifications
enabled and suspends all assigned warehouses immediately.
 Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o Immediate Suspension: All assigned warehouses are suspended immediately, and any
running queries are stopped.
 Consideration: This action ensures that credit consumption stops exactly at the threshold, but it
may result in incomplete queries.
 Use Case: Suitable when you need a hard stop on credit usage to prevent any consumption
beyond the specified limit.

3. Notify

 Description: This action only sends a notification to all account administrators with notifications
enabled.
 Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o No Suspension: No action is taken on the virtual warehouses; they continue to run as
usual.
 Consideration: This action is purely informational and does not impact the operation of the
warehouses.
 Use Case: Suitable when you want to monitor credit usage and be alerted without interrupting
the operations of the warehouses.

Example Scenario

Let's assume you have set a credit limit of 100 credits for a resource monitor. You can define
actions based on different thresholds:

 10% Threshold (10 Credits):


o Action: Notify
o Behavior: Send a notification to administrators when 10 credits are consumed.

 50% Threshold (50 Credits):


o Action: Notify and Suspend
o Behavior: Send a notification and suspend all assigned warehouses after completing
ongoing queries when 50 credits are consumed.
 100% Threshold (100 Credits):
o Action: Notify and Suspend Immediately
o Behavior: Send a notification and immediately suspend all assigned warehouses,
stopping any running queries when 100 credits are consumed.
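
A hedged SQL sketch of this exact scenario (the monitor name is illustrative, and compute_wh stands in for your warehouse):

use role accountadmin;

create or replace resource monitor example_monitor with credit_quota = 100
  triggers on 10 percent do notify
           on 50 percent do suspend
           on 100 percent do suspend_immediate;

alter warehouse compute_wh set resource_monitor = example_monitor;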

Suspension and Resumption of Virtual Warehouses in Snowflake

Understanding the suspension and resumption of virtual warehouses is crucial for effective
resource management. Here is a detailed explanation of how these processes work in Snowflake:

Suspension of Virtual Warehouses

When a resource monitor's credit usage reaches a defined threshold, the assigned virtual
warehouses are suspended. This suspension can occur under two actions:

1. Suspend: Warehouses are suspended after completing all currently executing queries.
2. Suspend Immediately: Warehouses are suspended immediately, stopping any running queries.

Resumption of Virtual Warehouses

Suspended virtual warehouses can be resumed under the following conditions:

1. Next Interval Starts:


o Explanation: The credit quota resets at the beginning of the next interval (e.g., daily,
weekly, monthly) as dictated by the start date for the monitor.
o Example: If the resource monitor is set to run monthly and the current month is
September, the warehouses will be auto-resumed at the start of October when the
credit quota resets.

2. Credit Quota is Increased:


o Explanation: If the credit quota for the resource monitor is increased, the warehouses
can be resumed.
o Example: If the initial credit quota is 100 credits and it is increased to 150 credits, the
warehouses can be resumed once the new quota is set.

3. Suspend Action Threshold is Increased:


o Explanation: If the threshold for the suspend action is increased, the warehouses can be
resumed.
o Example: If the suspend action threshold is initially set at 80 credits and it is increased to
90 credits, the warehouses can be resumed.

4. Warehouses are No Longer Assigned to the Monitor:


o Explanation: If the warehouses are detached from the resource monitor, they can be
resumed.
o Example: If a warehouse is removed from the list of warehouses monitored by the
resource monitor, it can be resumed.

5. Monitor is Dropped:
o Explanation: If the resource monitor itself is dropped, all warehouses tied to that
monitor can be auto-resumed.
o Example: If the resource monitor is deleted, the warehouses that were assigned to it will
be resumed automatically.

Important Note

 Delay in Suspension: When credit quota thresholds are reached for a resource monitor, the
assigned warehouses may take some time to suspend, even when the action is "suspend
immediately." This delay can result in additional credit consumption beyond the threshold.

Example of Resource Monitors in Snowflake

Let's break down the example provided to understand how resource monitors work in Snowflake:

Resource Monitors Configuration

1. Resource Monitor 1 (Account Level)


o Credit Quota: 5000 credits
o Scope: Monitors all warehouses in the account
o Action: Sends an alert and suspends all warehouses when the credit quota is reached

2. Resource Monitor 2 (Specific Warehouse)


o Credit Quota: 1000 credits
o Scope: Monitors Warehouse 3
o Action: Sends an alert and suspends Warehouse 3 when the credit quota is reached

3. Resource Monitor 3 (Specific Warehouses)


o Credit Quota: 2500 credits
o Scope: Monitors Warehouse 4 and Warehouse 5
o Action: Sends an alert and suspends Warehouse 4 and Warehouse 5 when the credit
quota is reached

Warehouse Assignments to Departments

 Warehouse 1: Sales
 Warehouse 2: Marketing
 Warehouse 3: Tech
 Warehouse 4: Finance
 Warehouse 5: HR

Credit Consumption Scenario

 Warehouse 1 and Warehouse 2: Combined consumption of 3600 credits


 Warehouse 3: Consumption of 400 credits (out of 1000 credit quota)
 Warehouse 4 and Warehouse 5: Combined consumption of 1000 credits (out of 2500 credit
quota)

Total Credit Consumption Calculation

 Warehouse 1 and 2: 3600 credits


 Warehouse 3: 400 credits
 Warehouse 4 and 5: 1000 credits
 Total: 3600 + 400 + 1000 = 5000 credits

Since the total credit consumption reaches 5000 credits, Resource Monitor 1 (account level)
will be triggered. This will result in:

 An alert being sent to the account administrators.


 Suspension of all warehouses in the account, even if Warehouse 3, Warehouse 4, and
Warehouse 5 have not consumed their respective credit quotas for the month.

Key Points to Note

1. Individual Resource Monitors: It is best practice to set resource monitors individually


for each virtual warehouse to ensure that one department's warehouse consumption does
not affect the credit quota of another department's warehouse.
2. Non-Overriding Monitors: An account-level resource monitor does not override the
resource monitor assignment for individual warehouses. If either the account resource
monitor or the warehouse resource monitor reaches its defined threshold and a suspend
action is defined, the warehouses will be suspended.
3. Single Assignment: A resource monitor can track the consumption of multiple
warehouses. However, one warehouse cannot be tied to multiple resource monitors.
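
A hedged sketch of how this example could be configured in SQL (the monitor and warehouse names are hypothetical):

use role accountadmin;

-- Resource Monitor 1: account level, 5000 credits
create or replace resource monitor rm_account with credit_quota = 5000
  triggers on 100 percent do suspend;
alter account set resource_monitor = rm_account;

-- Resource Monitor 2: Warehouse 3 (Tech), 1000 credits
create or replace resource monitor rm_tech with credit_quota = 1000
  triggers on 100 percent do suspend;
alter warehouse tech_wh set resource_monitor = rm_tech;

-- Resource Monitor 3: Warehouses 4 and 5 (Finance, HR), 2500 credits
create or replace resource monitor rm_fin_hr with credit_quota = 2500
  triggers on 100 percent do suspend;
alter warehouse finance_wh set resource_monitor = rm_fin_hr;
alter warehouse hr_wh set resource_monitor = rm_fin_hr;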

Below is a summary of the steps to create a resource monitor in the Snowflake Web UI and the importance of enabling notifications first:

Steps to Create a Resource Monitor in Snowflake Web UI:

1. Log in to Snowflake Web UI:


o Open your web browser and log in to your Snowflake account.
2. Navigate to Resource Monitors:
o Go to the "Admin" section in the navigation panel.
o Select "Resource Monitors" from the dropdown menu.

3. Create a New Resource Monitor:


o Click on the "Create" button to start setting up a new resource monitor.

4. Configure Resource Monitor Settings:


o Name: Provide a unique name for the resource monitor.
o Credit Quota: Set the credit quota, which is the maximum number of credits the
monitor will track.
o Frequency: Choose the frequency (e.g., daily, weekly, monthly) for the credit quota to
reset.

5. Set Up Triggers:
o Define triggers to specify actions when certain thresholds are met:
 Threshold: Set the percentage of the credit quota that will trigger an action.
 Action: Choose the action to be taken (e.g., notify, suspend, or resume
warehouses).

6. Enable Notifications:
o Ensure that notifications are enabled to alert administrators or users when thresholds
are reached. This is crucial for proactive management and avoiding unexpected
disruptions.

7. Save the Resource Monitor:


o Review the settings and click "Save" to create the resource monitor.

Importance of Enabling Notifications:

 Proactive Management: Notifications allow administrators to be aware of resource usage and


take necessary actions before reaching critical limits.
 Avoid Disruptions: Timely alerts help prevent unexpected suspensions of warehouses, ensuring
continuous operation.
 Cost Control: By monitoring and receiving alerts on credit usage, organizations can better
manage and optimize their Snowflake costs.
 Compliance and Governance: Notifications support adherence to internal policies and
governance standards by keeping stakeholders informed about resource consumption.

By following these steps and enabling notifications, you can effectively manage and monitor
your Snowflake resources, ensuring optimal performance and cost efficiency.

/*

-- Create Resource monitor using SQL queries


To create a monitor that starts monitoring immediately, resets at the beginning of each month,
and suspends the assigned warehouse when the used credits reach 100% of the credit
quota:

*/

use role accountadmin;

create or replace resource monitor monitor1 with credit_quota=15

triggers on 100 percent do suspend;

alter warehouse compute_wh set resource_monitor = monitor1;

/*

To create a monitor that is similar to the first example, but suspends at 90% and suspends
immediately at 100% to prevent all warehouses in the account from consuming credits after the
quota has been reached:

*/

use role accountadmin;

create or replace resource monitor monitor2 with credit_quota=100

triggers on 90 percent do suspend

on 100 percent do suspend_immediate;

alter warehouse compute_wh set resource_monitor = monitor2;

/*

To create a monitor that is similar to the first example, but lets the assigned warehouse exceed
the quota by 10%, and includes two notification actions to alert account administrators as the
used credits reach the halfway and three-quarters points for the quota:

*/

use role accountadmin;


create or replace resource monitor monitor3 with credit_quota=120

triggers on 50 percent do notify

on 75 percent do notify

on 100 percent do suspend

on 110 percent do suspend_immediate;

alter warehouse compute_wh set resource_monitor = monitor3;

/*

To create a resource monitor that starts immediately (based on the current
timestamp), resets monthly on the same day and time, has no end date or time, and suspends the
assigned warehouse when the used credits reach 100% of the quota:

*/

use role accountadmin;

create or replace resource monitor monitor_freq1 with credit_quota=50

frequency = monthly

start_timestamp = immediately

triggers on 100 percent do suspend;

alter warehouse compute_wh set resource_monitor = monitor_freq1;

/*

To create a resource monitor that starts at a specific date and time in the future, resets weekly on
the same day and time,

has no end date or time, and performs two different suspend actions at different thresholds on
two assigned warehouses:

*/
use role accountadmin;

create or replace resource monitor monitor_freq2 with credit_quota=200

frequency = weekly

start_timestamp = '2020-09-22 00:00 PST'

triggers on 80 percent do suspend

on 100 percent do suspend_immediate;

alter warehouse compute_wh set resource_monitor = monitor_freq2;

alter warehouse pc_matillion_wh set resource_monitor = monitor_freq2;

-- Setting a Resource Monitor for Account

use role accountadmin;

create resource monitor account_monitor with credit_quota=500

triggers on 100 percent do suspend;

alter account set resource_monitor = account_monitor;

-- ALTER RESOURCE MONITORS

alter resource monitor monitor1 set credit_quota = 150;


Micro Partitioning in Snowflake
Introduction

In this section, we will discuss micro partitioning in Snowflake and how it enhances query
processing speed. Before diving into Snowflake's approach, let's understand traditional
partitioning methods used in data warehouses.

Traditional Partitioning in Data Warehouses

Definition: Partitioning is the process of dividing a table into multiple chunks to improve query
processing and data retrieval speed.

Characteristics:

 Independent Units: Each partition in a traditional warehouse is an independent unit of


management.
 Static Partitioning: Partitions are created based on specific criteria and remain static unless
manually reconfigured.

Benefits:

 Improved Performance: Partitioning large tables can lead to acceptable performance and better
scalability.

Limitations:

1. Maintenance Overhead:
o Continuous monitoring and repartitioning are required as data size increases.
o Repartitioning involves significant maintenance efforts to ensure optimal performance.

2. Data Skewness:
o Uneven distribution of data across partitions can occur.
o Example: Partitioning by a gender column with more observations for females than
males leads to uneven partition sizes.

3. Variable Partition Sizes:


o Partitions can vary significantly in size, leading to inefficiencies.
o Example: A gender-based partition might have a large female partition and a small male
partition.
Micro Partitioning in Snowflake

Definition: Micro partitioning is a technique used by Snowflake to automatically divide tables


into small, manageable chunks called micro partitions.

Advantages:

1. Automated Management:
o Snowflake automatically handles partitioning, eliminating the need for manual
intervention.
o Micro partitions are created and maintained by Snowflake without user input.

2. Improved Query Performance:


o Micro partitions enable faster query processing by allowing Snowflake to scan only the
relevant partitions.
o This reduces the amount of data scanned and speeds up query execution.

3. Efficient Data Storage:


o Micro partitions are compressed and optimized for storage efficiency.
o Snowflake's architecture ensures that data is stored in a way that minimizes storage
costs.

4. Dynamic Partitioning:
o Unlike traditional static partitioning, Snowflake's micro partitions are dynamic and adapt
to data changes.
o This reduces the need for manual repartitioning and maintenance.

5. Handling Data Skewness:


o Snowflake's micro partitioning mitigates data skewness by distributing data evenly
across partitions.
o This ensures balanced partition sizes and efficient query processing.

Conclusion: Micro partitioning in Snowflake offers significant advantages over traditional


partitioning methods. By automating partition management, improving query performance, and
efficiently handling data storage and skewness, Snowflake provides a robust solution for modern
data warehousing needs.
Micro Partitioning in Snowflake
Introduction

Micro partitioning is a unique and efficient way of partitioning data in Snowflake, which differs
significantly from traditional partitioning methods used in other data warehouses. This approach
offers several benefits that overcome the limitations of static partitioning.

Key Features of Micro Partitioning

1. Size of Micro Partitions:


o Each micro partition in Snowflake can contain between 50 MB to 500 MB of
uncompressed data.
o Once the data is loaded into Snowflake tables, the actual storage size is significantly
reduced due to Snowflake's efficient compression techniques.

2. Columnar Storage:
o Snowflake uses columnar storage for its micro partitions.
o Each micro partition contains a group of rows stored by each column, optimizing storage
and query performance.

3. Metadata Storage:
o Snowflake stores metadata about all rows in a micro partition.
o The metadata includes the range of values for each column, the number of distinct
values, and additional properties used for optimization and efficient query processing.

4. Cloud Services Layer:


o The cloud services layer in Snowflake, often referred to as the "brain of the system,"
stores all metadata about the rows in each micro partition.
o This layer knows the exact placement of each row within the micro partitions, enabling
efficient query processing.

Benefits of Micro Partitioning

1. Automated and Dynamic Partitioning:


o Unlike traditional static partitioning, Snowflake's micro partitioning is automated and
dynamic.
o This reduces the need for manual intervention and maintenance, as Snowflake
automatically manages the partitions.

2. Efficient Query Processing:


o When a user runs a query, Snowflake's cloud services layer uses the metadata to identify
and access only the relevant micro partitions.
o This selective access speeds up query processing by avoiding unnecessary scans of
unrelated data.

3. Handling Large Tables:


o Snowflake can handle tables with a large number of micro partitions, ranging from ten to
millions, depending on the size of the table.
o The cloud services layer efficiently manages and optimizes these partitions for
performance.

4. Optimized Storage:
o The columnar storage format and efficient compression techniques used by Snowflake
reduce the overall storage size.
o This leads to cost savings and better performance.

5. Scalability:
o Snowflake's micro partitioning allows for seamless scalability, accommodating growing
data volumes without compromising performance.

Detailed Process of Micro Partitioning

1. Data Loading:
o When data is loaded into a Snowflake table, it is automatically divided into micro
partitions.
o Each micro partition stores a group of rows in a columnar format.

2. Metadata Management:
o The cloud services layer captures and stores metadata for each micro partition.
o This metadata includes the range of values for each column, the number of distinct
values, and other properties for optimization.

3. Query Execution:
o During query execution, the cloud services layer uses the metadata to identify the
relevant micro partitions.
o Only the necessary micro partitions are accessed, improving query performance.

4. Optimization:
o Snowflake continuously optimizes the micro partitions and metadata to ensure efficient
query processing.
o This includes reorganizing partitions and updating metadata as needed.

Benefits of Micro Partitioning in Snowflake

Micro partitioning in Snowflake offers several significant benefits that enhance performance,
reduce maintenance overhead, and optimize storage. Let's delve into these benefits in detail:
1. Automated and Dynamic Partitioning

 Automatic Creation: Micro partitions are automatically created by Snowflake without the need
for explicit definition or maintenance by users. This reduces the manual effort required to
manage partitions.
 Dynamic Adjustment: Snowflake dynamically adjusts the micro partitions based on the data size
and usage patterns, ensuring optimal performance without user intervention.

2. Low Maintenance Overhead

 Negligible Maintenance: Since Snowflake handles the creation and management of micro
partitions, the maintenance overhead for users is minimal. This is particularly beneficial for large
tables containing terabytes of data.
 Scalability: Snowflake can create billions of micro partitions as needed, allowing it to efficiently
manage very large datasets without requiring manual partitioning.

3. Small and Uniform Partition Size

 Size Range: Each micro partition can store between 50 MB to 500 MB of uncompressed data.
This small size enables fine-grained tuning for faster queries.
 Uniformity: Snowflake ensures that micro partitions are uniformly small, which helps prevent
data skew and ensures balanced performance across partitions.

4. Fine-Grained Pruning

 Efficient Query Processing: Snowflake's cloud services layer knows the exact placement of each
row within the micro partitions. When a query is run, Snowflake scans only the relevant micro
partitions, significantly speeding up query execution.
 Example: If a query filters data for a specific country (e.g., India), Snowflake will only scan the
micro partitions containing data for India, rather than scanning the entire table.
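
A small sketch of this behavior (the table and column names are hypothetical): the WHERE clause lets Snowflake prune partitions using the per-column min/max metadata, and the system function shows how well the data is clustered on that column.

-- only micro partitions whose metadata range for COUNTRY includes 'India' are scanned
select customer_name, department_id
from customers
where country = 'India';

-- inspect clustering quality for the pruning column
select system$clustering_information('customers', '(country)');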

5. Prevention of Data Skew

 Overlapping Ranges: Micro partitions can overlap in their range of values, which helps distribute
data evenly and prevent skew. This ensures that no single partition becomes a bottleneck.
 Balanced Distribution: The uniform size and overlapping ranges of micro partitions contribute to
a balanced distribution of data, enhancing overall performance.

6. Columnar Storage

 Independent Column Storage: Columns are stored independently within micro partitions,
enabling efficient scanning of individual columns. Only the columns referenced by a query are
scanned, reducing the amount of data processed.
 Example: In a customer table, if a query requests only the department ID and customer name,
Snowflake will scan only these two columns, ignoring the rest. This improves query efficiency
and reduces compute costs.
7. Efficient Compression

 Column Compression: Columns are compressed individually within micro partitions. Snowflake
automatically determines the most efficient compression algorithm for each column, optimizing
storage and performance.
 Cost Savings: Efficient compression reduces storage costs and enhances query performance by
minimizing the amount of data that needs to be processed.

8. Cost and Time Efficiency

 Reduced Compute Costs: By scanning only the relevant data and using efficient compression,
Snowflake reduces the compute costs associated with query processing.
 Faster Query Results: The combination of fine-grained pruning, columnar storage, and efficient
compression leads to faster query results, saving time for users.

Understanding the Logical and Physical Structure of a Snowflake Table


Logical Structure of a Snowflake Table

The logical structure of a Snowflake table is what users typically interact with when querying or
managing data. This structure includes columns and rows, similar to traditional databases.

Example Table:

 Columns: type, name, country, date


 Rows: 24 rows

This logical view is straightforward and familiar to anyone who has worked with relational
databases. However, the physical storage of this data in Snowflake is quite different and
optimized for performance and scalability.

Physical Structure of a Snowflake Table

Snowflake uses a unique approach to store data physically, leveraging micro partitions and
columnar storage. This section explains how data is stored in Snowflake's storage layer.

Micro Partitions:

 Automatic Creation: Micro partitions are created automatically by Snowflake, with no manual
intervention required from users.
 Size: Each micro partition ranges from 50 MB to 500 MB of uncompressed data.
 Uniform Size: Snowflake ensures that micro partitions are uniformly sized, which helps in
efficient data management and query processing.

Example of Micro Partitions:

 Micro Partition 1: Contains rows 1 to 6


 Micro Partition 2: Contains rows 7 to 12
 Micro Partition 3: Contains rows 13 to 18
 Micro Partition 4: Contains rows 19 to 24

Each micro partition stores data in a columnar format, meaning each column's data is stored
separately within the partition.

Columnar Storage:

 Column Blocks: In each micro partition, data is stored in blocks by column. For example:
o Block for column 'type'
o Block for column 'name'
o Block for column 'country'
o Block for column 'date'
 Efficiency: This columnar storage format allows Snowflake to scan only the necessary columns
when executing queries, improving performance and reducing I/O.

Detailed Example of Columnar Storage

Row Store vs. Columnar Store:

 Row Store: Stores entire rows as single blocks. This is less efficient for analytical queries that
often require only specific columns.
 Columnar Store: Stores each column as separate blocks. This is more efficient for analytical
queries as it allows for selective column scanning.

Metadata Storage:

 Each file (micro partition) contains a header that stores metadata such as:
o Minimum value of each column
o Maximum value of each column
o Number of distinct values in each column

Example Metadata for Columns:

 Columns: type, name, country, date


 Metadata: Stored in the file header, providing quick access to column statistics and aiding in
query optimization.
Summary

Logical Structure:

 Columns: type, name, country, date


 Rows: 24 rows

Physical Structure:

 Micro Partitions: Automatically created, uniformly sized, 50 MB to 500 MB each


 Columnar Storage: Each column stored separately within micro partitions
 Metadata: Stored in file headers, including min/max values and distinct counts

How Micro Partitions Get Accessed in Snowflake


High-Level Architecture of Snowflake

Snowflake's architecture consists of three primary layers:

1. Centralized Storage: The bottom layer where all data is stored in micro partitions.
2. Multi-Cluster Compute (Virtual Warehouse Layer): The middle layer responsible for executing
queries.
3. Cloud Services Layer: The top layer, often referred to as the brain of the system, responsible for
metadata management, query optimization, and execution planning.

Query Execution Process

Let's walk through the process of how a query is executed in Snowflake, focusing on how micro
partitions are accessed:

1. Query Submission:
o A user submits a query, for example: SELECT type, name, country FROM
employee WHERE date = '11/2'.

2. Cloud Services Layer:


o The query first reaches the Cloud Services Layer.
o This layer creates an execution plan for the query.
o It identifies the relevant micro partitions that contain data for the specified date ('11/2')
using metadata.

3. Metadata Utilization:
o The Cloud Services Layer uses metadata to determine which micro partitions contain
data for '11/2'.
o It processes additional information and incorporates it into the execution plan.

4. Passing to Virtual Warehouse Layer:


o The execution plan, along with the identified micro partitions, is passed to the Virtual
Warehouse Layer.

5. Virtual Warehouse Layer:


o This layer reads the header files of each identified micro partition to confirm the
presence of the required data.
o It extracts the necessary columns (type, name, country) from each relevant micro
partition.

6. Centralized Storage Interaction:


o The Virtual Warehouse Layer interacts with the Centralized Storage to retrieve the actual
data from the identified micro partitions.
o Only the micro partitions containing data for '11/2' are accessed, while others are
pruned.

7. Data Retrieval and Result Delivery:


o The data is retrieved from the Centralized Storage.
o The results are processed and returned to the user.

Example of Micro Partition Selection

Consider the following micro partitions:

 Micro Partition 1: Contains data for '11/2' (6 rows)


 Micro Partition 2: Contains data for '11/2' (3 rows) and '11/3' (3 rows)
 Micro Partition 3: Contains data for '11/2' (3 rows)
 Micro Partition 4: Does not contain data for '11/2'

When the query is executed:

 Micro Partition 4 is pruned as it does not contain data for '11/2'.


 Micro Partitions 1, 2, and 3 are accessed:
o From Micro Partition 1, all 6 rows for '11/2' are retrieved.
o From Micro Partition 2, only the 3 rows for '11/2' are retrieved.
o From Micro Partition 3, only the 3 rows for '11/2' are retrieved.

Handling Overlapping Data

Micro partitions can contain overlapping data ranges. This is managed as follows:

 Overlapping Data: Micro partitions 2 and 3 both contain data for '11/2'.
 Data Insertion Timing: Data is loaded into micro partitions as it is inserted into the table. For
example:
o After inserting 3 rows for '11/2' into Micro Partition 2, 3 rows for '11/3' are inserted,
filling the partition.
o Subsequent rows for '11/2' are then inserted into Micro Partition 3.

Data Clustering in Snowflake


Introduction

Data clustering in Snowflake plays a vital role in optimizing data retrieval and query
performance. By clustering data, Snowflake ensures that similar kinds of data are stored together
in common micro partitions, which enhances the efficiency of data access.

Clustering on Micro Partitions

Purpose:

 Clustering organizes data within micro partitions to ensure that similar data is stored together.
 This process optimizes data retrieval, making queries faster and more efficient.

Example:

 Consider a table with rows for dates '11/2', '11/3', and '11/4'.
 By clustering on the date column, Snowflake ensures that data for the same date is stored in the
same or adjacent micro partitions.
 This prevents data for the same date from being scattered across multiple micro partitions,
which would degrade query performance.

Natural Clustering

Default Behavior:

 Snowflake automatically clusters data along natural dimensions, such as date, when data is
initially loaded into tables.
 This automatic clustering produces well-clustered tables that are optimized for common query
patterns.
Limitations:

 Over time, as users perform various operations (inserts, updates, deletes), the natural clustering
may become less optimal.
 In such cases, the default clustering may not be the best choice for sorting or ordering data
across the table.

Custom Clustering Keys

Custom Cluster Keys:

 Users can define their own cluster keys to optimize data retrieval based on specific query
patterns.
 A cluster key is a column or set of columns that Snowflake uses to cluster data within micro
partitions.

Testing Cluster Keys:

 Users can test multiple clustering keys to determine which performs better for their specific
queries.
 This involves analyzing query performance and adjusting the cluster keys accordingly.

Importance of Clustering for Large Tables

Query Performance:

 For very large tables, clustering becomes crucial to ensure efficient query performance.
 Unsorted or partially sorted data can significantly impact query performance, particularly for
large datasets.

Faster Data Retrieval:

 By clustering data based on a cluster key, Snowflake can quickly access the relevant micro
partitions, avoiding unnecessary scans.
 This accelerates query performance and reduces compute costs.

Clustering Metadata and Re-Clustering

Metadata Collection:

 When data is inserted or loaded into a table, Snowflake collects and records clustering metadata
for each micro partition.
 This metadata includes information about the clustering key and the performance of queries.

Automatic Re-Clustering:
 Once a clustering key is defined, Snowflake's cloud services layer automatically performs re-
clustering as needed.
 This ensures that the table remains well-clustered over time, with minimal maintenance
overhead for users.

Maintenance:

 There is no ongoing maintenance required for clustering once the key is defined, unless the user
decides to change or drop the clustering key in the future.

Example Scenario

Querying by Date:

 Suppose you have a table clustered by the date column.


 When querying data for a specific date, Snowflake can quickly identify and scan only the micro
partitions containing that date.
 This avoids scanning all micro partitions in the table, significantly improving query performance.

Efficiency:

 If the date column is well-clustered, Snowflake may need to scan only one or two micro
partitions to retrieve the data.
 This efficient access reduces the time and resources required for query execution.

Understanding Clustering Keys in Snowflake


Introduction

Clustering keys in Snowflake are designed to optimize data retrieval from tables by organizing
data within micro partitions. This process enhances query performance by reducing the amount
of data scanned during queries.

What are Clustering Keys?

 Purpose: Clustering keys perform clustering on micro partitions to optimize data retrieval.
 Definition: Clustering keys can be defined on a single column or multiple columns in a table.
They can also be expressions based on columns.

Example of Clustering Key

 Date Column: You can define a clustering key based on a date column. Instead of using the
complete date, you can extract the month and use it as the clustering key.
o Expression: Extract the month from the date column and use it as the clustering key.
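As a sketch of the syntax (the sales table and order_date column are hypothetical), a clustering key based on such an expression could be defined as follows:

SQL
-- Declare the clustering key when the table is created:
CREATE OR REPLACE TABLE sales (
    order_id   NUMBER,
    order_date DATE,
    amount     NUMBER(38,2)
)
CLUSTER BY (DATE_TRUNC('MONTH', order_date));

-- Or add / change it on an existing table:
ALTER TABLE sales CLUSTER BY (DATE_TRUNC('MONTH', order_date));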
Benefits of Clustering Keys

 Co-locating Data: Clustering keys ensure that similar data is stored together in the same micro
partitions.
 Improved Query Performance: Clustering keys are useful for very large tables as they improve
scan efficiency by skipping data that doesn't meet the filter criteria.

Default Clustering vs. Custom Clustering

 Default Clustering: Snowflake produces well-clustered tables by default.


 Custom Clustering: If the default clustering does not provide optimal query performance, you
can explicitly define clustering keys.

Maintenance of Clustering Keys

 Automatic Maintenance: Once a clustering key is defined, Snowflake automatically maintains
the clustering for future rows in the table.
 No Additional Administration: There is no additional administration required unless you drop or
modify the clustering key.

When to Use Clustering Keys

 Slow Queries: Use clustering keys when queries on the table are running slower than expected
or have degraded over time.
 Large Clustering Depth: Use clustering keys when the clustering depth (the degree of overlap
across micro partitions) is very large.

Considerations

 Computational Cost: Clustering can be computationally expensive. Only cluster when queries
will benefit substantially from it.
 Testing Clustering Keys: Test clustering keys on a table and check its clustering depth. Monitor
query performance to decide if you should keep the clustering key.
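Snowflake provides system functions for this kind of check; a sketch using the employee example from earlier (the table and column names are assumed):

SQL
-- Average overlap depth for the table, optionally restricted to specific columns:
SELECT SYSTEM$CLUSTERING_DEPTH('employee', '(country)');

-- Richer clustering statistics (partition counts, overlaps, depth histogram) returned as JSON:
SELECT SYSTEM$CLUSTERING_INFORMATION('employee', '(country)');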

Micro Partitioning Overlapping and Clustering Depth in Snowflake


Introduction

In this section, we will delve into the concepts of micro partitioning overlapping and clustering
depth in Snowflake. These concepts are crucial for understanding how Snowflake optimizes data
storage and query performance.
Micro Partitioning Overlapping

Definition:

 Overlapping: Overlapping occurs when the same data values are stored in multiple micro
partitions. This can happen due to the way data is inserted or updated in the table.

Example:

 Consider a table with 24 rows stored across four micro partitions. If the same date value ('11/2')
appears in multiple micro partitions, this is an example of overlapping.

Impact:

 Query Performance: Overlapping can affect query performance because Snowflake may need to
scan multiple micro partitions to retrieve the required data.
 Storage Efficiency: Overlapping can also impact storage efficiency as the same data is stored in
multiple locations.

Clustering Depth

Definition:

 Clustering Depth: Clustering depth is a measure of how well the data in a table is clustered. It
indicates the number of micro partitions that need to be scanned to retrieve the required data.

Calculation:

 Clustering depth is calculated based on the number of overlapping micro partitions and the
distribution of data within those partitions.

Example:

 If a query needs to retrieve data for a specific date ('11/2') and this date is stored in three
overlapping micro partitions, the clustering depth for this query is three.

Impact:

 Query Performance: Higher clustering depth means more micro partitions need to be scanned,
which can slow down query performance.
 Optimization: Lower clustering depth indicates better clustering and more efficient data
retrieval.

Metadata Stored by Snowflake

Clustering Metadata:
 Snowflake maintains clustering metadata for each table, which includes:
o Total Number of Micro Partitions: The total number of micro partitions that make up
the table.
o Overlapping Micro Partitions: The number of micro partitions containing overlapping
values.
o Clustering Depth: The depth of clustering for the overlapping micro partitions.

Usage:

 This metadata is used by Snowflake's cloud services layer to optimize query execution and
improve performance.

Actions Performed by Snowflake During Query Execution

1. Pruning Micro Partitions:


o Snowflake prunes (excludes) micro partitions that are not needed for the query based on
the clustering metadata.
o This reduces the number of micro partitions that need to be scanned.

2. Pruning Columns:
o Within the remaining micro partitions, Snowflake prunes the columns that are not
needed for the query.
o This further reduces the amount of data that needs to be processed.

Example Scenario

Query:

 Suppose you run a query to retrieve data for '11/2' from a table with 24 rows stored across four
micro partitions.

Steps:

1. Pruning Micro Partitions:


o Snowflake identifies and prunes the micro partitions that do not contain data for '11/2'.
o Only the micro partitions with overlapping values for '11/2' are scanned.

2. Pruning Columns:
o Within the remaining micro partitions, Snowflake prunes the columns that are not
needed for the query.
o For example, if the query only requests the 'type' and 'name' columns, the 'country' and
'date' columns are pruned.

Result:

 The query execution is optimized, and the required data is retrieved efficiently.
Clustering Depth in Snowflake
Introduction

Clustering depth is a critical metric in Snowflake that helps monitor the efficiency of data
clustering within a table. It tracks the overlapping of micro partitions and measures the average
depth of these overlaps for specified columns.

Key Points about Clustering Depth

1. Definition:
o Clustering depth measures the average number of overlapping micro partitions for
specified columns in a table.
o A smaller average depth indicates better clustering.

2. Advantages:
o Monitoring Clustering Health: Helps monitor the clustering health of a large table over
time.
o Determining Need for Clustering Keys: Assists in deciding whether a large table would
benefit from explicitly defining a clustering key.

3. Zero Clustering Depth:


o A table with no micro partitions has a clustering depth of zero.

4. Performance Monitoring:
o Clustering depth alone is not a perfect measure of clustering efficiency.
o Query performance over time should also be monitored to determine if the table is well-
clustered.
o If queries perform as expected, the table is likely well-clustered.
o If query performance degrades over time, the table may benefit from re-clustering or
defining a new clustering key.

Practical Example of Clustering Depth

Scenario:

 A table has five micro partitions, and a column contains values from A to Z.

Layers of Clustering:

1. Initial Layer:
o All five micro partitions contain overlapping values from A to Z.
o Overlapping micro partitions count: 5
o Clustering depth: 5

2. First Clustering:
o Data is aggregated into ranges A-D and E-J.
o Three micro partitions still contain overlapping values from K to Z.
o Overlapping micro partitions count: 3
o Clustering depth: 3

3. Second Clustering:
o Further clustering separates values A-D, E-J, and reduces overlap for K-N and L-Q.
o Overlapping micro partitions count: 3
o Clustering depth: 2

4. Final Layer:
o All micro partitions are separated, with no overlapping values.
o Overlapping micro partitions count: 0
o Clustering depth: 1 (for a populated table, the minimum possible overlap depth is 1)

Visual Illustration

Diagram:

 Top Layer: Five micro partitions with overlapping values A-Z.


 Next Layer: Three micro partitions with ranges A-D, E-J, and overlapping K-Z.
 Subsequent Layer: Further reduced overlap with ranges A-D, E-J, K-N, and L-Q.
 Final Layer: No overlapping values, fully separated micro partitions.

Understanding Clustering and Reclustering in Snowflake


Introduction

Clustering and reclustering are essential features in Snowflake that optimize data retrieval and
query performance. Once a clustering key is defined, Snowflake automatically manages the
clustering and reclustering processes, ensuring that data remains well-organized and efficiently
accessible.

Clustering in Snowflake

Clustering Key:

 A clustering key is defined on one or more columns of a table to organize data within micro
partitions.
 Example: Clustering a table based on the date column.

Automatic Reclustering:
 Snowflake automatically reclusters tables based on the defined clustering key.
 This process reorganizes the data to maintain optimal clustering as operations like insert, update,
delete, merge, and copy are performed.

Benefits:

 Improved Query Performance: By keeping similar data together, Snowflake can quickly access
the relevant micro partitions, reducing query execution time.
 Reduced Maintenance: Users do not need to manually manage the clustering operations, as
Snowflake handles this automatically.

Reclustering Process

Why Reclustering is Needed:

 Over time, as data is inserted, updated, or deleted, the clustering of a table may become less
optimal.
 Periodic reclustering is required to maintain the efficiency of data retrieval.

How Reclustering Works:

 Snowflake uses the clustering key to reorganize the column data, ensuring that related records
are relocated to the same micro partition.
 This process ensures that similar data resides in the same micro partitions, optimizing query
performance.

Example Scenario:

 Consider a table with four micro partitions, each containing six rows.
 The table has columns: date, country, name, and type.
 We focus on the date column for clustering.

Initial State:

 Micro partitions contain overlapping values for the date column (e.g., '11/2' appears in multiple
partitions).

Reclustering:

 Snowflake reclusters the table based on the date column.


 Micro partitions are reorganized to ensure that all rows for '11/2' are in the same or adjacent
partitions.
 Additionally, clustering can be done on a secondary column (e.g., type) within each micro
partition.
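A multi-column clustering key along these lines could be declared as sketched below (the events table with its event_date and event_type columns is a hypothetical stand-in for the date and type columns in this example):

SQL
-- Recluster primarily by date, then by type within each date:
ALTER TABLE events CLUSTER BY (event_date, event_type);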

Result:
 After reclustering, micro partitions are sorted based on date and then type.
 Example:
o Micro Partition 1: Contains all rows for '11/2' with similar type values.
o Micro Partition 2: Contains remaining rows for '11/2' with different type values.

Query Efficiency:

 When querying for data on '11/2', Snowflake only scans the relevant micro partitions (e.g., Micro
Partitions 1 and 2).
 This reduces the number of partitions scanned and improves query performance.

Managing Reclustering

Suspending and Resuming Reclustering:

 Users can suspend or resume reclustering operations as needed.


 This provides flexibility to manage compute resources and control costs.

Commands:

 Suspend Reclustering: ALTER TABLE <table_name> SUSPEND RECLUSTER;
 Resume Reclustering: ALTER TABLE <table_name> RESUME RECLUSTER;

Guide to Creating an AWS Free Tier Account


Step 1: Navigate to AWS Free Tier Page

1. Open your web browser.


2. Type "create AWS free tier account" in the search bar.
3. Click on the link that says "Free Tier – AWS" (Amazon.com).

Step 2: Explore AWS Free Tier Services

1. On the AWS Free Tier page, you will see an option to create a free tier account.
2. Before creating an account, scroll down to explore the services available under the free tier.
3. Note that the free tier includes over 60 products, but usage is limited.
4. The free tier is available for 12 months and includes some short-term free trial offers starting
from the activation date of each service.
5. To see the services, you can filter by various types and product categories.

Step 3: Check Specific Services

1. For example, to check database services, select the "Database" filter.


2. Scroll up to see the available services under the free tier for databases, such as Amazon RDS,
Amazon DynamoDB, and Amazon ElastiCache.
3. Hover over each service to see what is included in the free tier, such as storage limits and
provisioned write capacity units.

Step 4: Create Your Free Tier Account

1. Click on the "Create a Free Tier Account" link.


2. You will be directed to a page where you need to provide your email address, password, and
account name.
3. Fill in your email address, create a password (meeting the specified requirements), and confirm
the password.
4. Enter an account name (e.g., "koshish") and click "Continue".

Step 5: Provide Contact Information

1. Select the account type (e.g., Personal).


2. Enter your full name, phone number, country, address, city, state, and postal code.
3. Check the box to agree to the terms and conditions.

Step 6: Enter Payment Information

1. Provide your credit or debit card number.


2. AWS will deduct a small amount (e.g., INR 2) for verification, which will be refunded within a few
days.
3. Fill out the card details and verify.

Step 7: Select a Plan

1. After verifying your identity, select the plan you want.


2. Choose the "Basic Plan" for the free tier account.

Step 8: Sign In to the AWS Management Console

1. Select your role and area of interest (e.g., Business Analyst, AI, and Machine Learning).
2. Click "Submit".
3. Sign in to the console by clicking on the provided link or directly accessing the AWS Management
Console.
4. Choose "Root User" and enter your email address and password.
Step 9: Explore AWS Services

1. Once logged in, click on the drop-down menu to see the list of available AWS services.
2. For the Snowflake tutorials, focus on exploring Amazon S3 and IAM (under Security, Identity, and
Compliance).

Guide to Using AWS S3 and IAM for Snowflake Access


Step 1: Access AWS Services

1. Log in to your AWS account.


2. Click on the dropdown menu to see the list of available AWS services.
3. Scroll down to explore the various services.

Step 2: Focus on Required Services for Snowflake

For Snowflake access, we will primarily use two services:

1. Storage: Amazon S3 (Simple Storage Service)


2. Security, Identity, and Compliance: IAM (Identity and Access Management)

Amazon S3 (Simple Storage Service)


Step 3: Access S3 Management Console

1. Click on the S3 link to navigate to the S3 Management Console.


2. S3 is a simple storage service where you can create a data lake and store various types of objects
or files.

Step 4: Create an S3 Bucket

1. Click on the "Create bucket" button.


2. Provide a bucket name (e.g., test-snowflake).
3. Select the region that matches your Snowflake region (e.g., US East (N. Virginia)).
4. Choose whether to block public access. For now, block all public access.
5. Acknowledge the warning about public access.
6. Skip the advanced settings unless you need specific configurations.
7. Click on "Create bucket".

Step 5: Verify Bucket Creation

1. You will see a message confirming that your bucket has been successfully created.
2. The bucket will be listed with its name, region, access type, and creation timestamp.

Step 6: Create Folders within the Bucket

1. Click on the bucket name (e.g., test-snowflake).


2. Click on "Create folder".
3. Provide a folder name (e.g., snowflake).
4. Choose encryption options if needed (for now, select "None").
5. Click on "Save".

Step 7: Upload Objects to the Folder

1. Navigate to the created folder.


2. Click on "Upload" to add files or objects from your desktop.
3. You can upload various file types such as CSV, images, videos, text files, etc.

Identity and Access Management (IAM)


Step 8: Access IAM Management Console

1. Navigate to the IAM service under the Security, Identity, and Compliance section.
2. IAM is used to manage access to AWS services and resources securely.

Step 9: Create IAM Users and Roles

1. Create IAM users and roles to manage permissions for accessing S3 buckets and other AWS
resources.
2. Assign appropriate policies to the users and roles to ensure they have the necessary permissions
for Snowflake integration.

Introduction to AWS IAM (Identity and Access Management)

In the previous session, we learned about Amazon S3 and how to create buckets and folders. In
this session, we will focus on IAM (Identity and Access Management), which is essential for
managing access to AWS services securely.

Accessing IAM
Step 1: Navigate to IAM

1. Via Services Tab:


o Go to the AWS Management Console.
o Click on the "Services" tab.
o Search for "IAM" or navigate to "Security, Identity, and Compliance" and click on "IAM".

2. Via Security Credentials:


o Click on your account name at the top right corner.
o Select "My Security Credentials".
o You will be routed to the IAM Management Console.
Understanding IAM

IAM allows you to manage access to AWS services and resources securely. It enables you to
create and manage AWS users and groups, and use permissions to allow and deny their access to
AWS resources.

Key Components of IAM:

1. Users: Individual accounts that represent a person or service needing access to AWS resources.
2. Groups: Collections of users, which can be assigned specific permissions.
3. Roles: Permissions assigned to AWS resources, allowing them to interact with other AWS
services.
4. Policies: Documents that define permissions and can be attached to users, groups, and roles.

Creating IAM Users, Groups, and Roles


Step 2: Create IAM Users

1. In the IAM Management Console, click on "Users" in the left-hand menu.


2. Click on "Add user".
3. Enter a username and select the type of access (programmatic access, AWS Management
Console access, or both).
4. Set permissions by attaching policies directly, adding the user to a group, or copying permissions
from an existing user.
5. Review and create the user.

Step 3: Create IAM Groups

1. In the IAM Management Console, click on "Groups" in the left-hand menu.


2. Click on "Create New Group".
3. Enter a group name.
4. Attach policies to the group by selecting from the list of available policies.
5. Review and create the group.

Step 4: Assign Users to Groups

1. After creating a group, you can add users to it.


2. Select the group and click on the "Add Users to Group" button.
3. Choose the users you want to add and click "Add Users".

Step 5: Create IAM Roles

1. In the IAM Management Console, click on "Roles" in the left-hand menu.


2. Click on "Create role".
3. Select the type of trusted entity (AWS service, another AWS account, or web identity).
4. Attach policies to the role by selecting from the list of available policies.
5. Review and create the role.
Creating an IAM User in AWS

In the previous lecture, we created a group named "test policies". Now, we will move forward
and create a user. This user will have access to your AWS account and can use various services
based on the permissions you assign.

Step-by-Step Guide to Creating an IAM User


Step 1: Add a User

1. Navigate to IAM: Go to the IAM Management Console.


2. Click on "Users": In the left-hand menu, click on "Users".
3. Click on "Add user": This will start the process of creating a new user.

Step 2: Set User Details

1. Username: Enter a username (e.g., snowflake).


2. Access Type:
o Programmatic Access: Check this option to allow the user to access AWS through the
API, CLI, SDK, or other development tools.
o AWS Management Console Access: Check this option to allow the user to access the
AWS Management Console.
3. Console Password:
o Choose whether to autogenerate a password or create a custom password.
o If you create a custom password, you can also check the option to require the user to
reset their password upon first login.

Step 3: Set Permissions

1. Attach Policies or Groups:


o Add User to Group: Attach the user to an existing group (e.g., test policies).
o Copy Permissions from Existing User: Copy permissions from another user (not
applicable if no other users exist).
o Attach Existing Policies Directly: Select specific policies to attach to the user.
2. Attach Policies:
o For this example, we will attach the AmazonS3FullAccess policy.
o Search for S3 and select AmazonS3FullAccess.

Step 4: Review and Create User

1. Review: Check the details and permissions assigned to the user.


2. Create User: Click on "Create user".

Step 5: Access Keys and Credentials

1. Access Key ID and Secret Access Key: These are needed for programmatic access.
2. Download CSV: Download the CSV file containing the Access Key ID and Secret Access Key.
3. Provide Credentials: Share the URL, Access Key ID, Secret Access Key, username, and password
with the new user.

Step 6: Verify User Creation

1. User List: The new user (snowflake) will appear in the user list.
2. User Details: Click on the username to see the details.
o Permissions: Check the policies attached to the user.
o Groups: Verify if any groups are attached.
o Security Credentials: View the Access Key ID and its status.

Introduction to IAM Roles and How to Create Them

In this lecture, we will learn about IAM roles and how to create them. IAM roles are essential for
granting permissions to various applications or services within an AWS account. They allow
services like S3 and AWS Glue to interact with each other and enable external applications, such
as Snowflake, to access AWS resources.

What is an IAM Role?

IAM roles are a way to grant permissions to applications or services within your AWS account.
They are used to allow different AWS services to interact with each other or to enable external
applications to access AWS resources securely.

Use Case Example

For instance, if you want Snowflake to access data stored in your AWS S3 buckets, you need to
create an IAM role that grants Snowflake the necessary permissions to interact with S3.

Step-by-Step Guide to Creating an IAM Role


Step 1: Navigate to IAM Roles

1. Access IAM: Go to the IAM Management Console.


2. Click on "Roles": In the left-hand menu, click on "Roles".

Step 2: Create a New Role

1. Click on "Create role": This will start the process of creating a new role.
2. Select Trusted Entity:
o Choose "Another AWS account" for this example.
o Enter your AWS account ID. To find your account ID, go to "My Security Credentials" and
copy the account ID.

Step 3: Configure Role for Snowflake

1. Require External ID:


o Check the box for "Require external ID".
o Enter an external ID (e.g., 00000 for now). This ID will be provided by Snowflake.
2. Click "Next: Permissions".

Step 4: Attach Permissions

1. Attach Policies:
o Search for S3.
o Select AmazonS3FullAccess to grant full access to S3.
2. Click "Next: Tags".

Step 5: Add Tags (Optional)

1. Skip Tags: For this example, we will not add any tags.
2. Click "Next: Review".

Step 6: Review and Create Role

1. Role Name: Enter a role name (e.g., snowflake-role).


2. Role Description: Optionally, provide a description (e.g., "This is a role for accessing S3 objects
through Snowflake").
3. Review Details: Ensure the trusted entities, policies, and permissions are correct.
4. Click "Create role".

Step 7: Verify Role Creation

1. Role List: The new role (snowflake-role) will appear in the roles list.
2. Role Details: Click on the role name to see the details.
o Role ARN: Note the ARN (Amazon Resource Name) of the role.
o Permissions: Verify the attached policies.
o Trust Relationships: Check the trust relationships.

Step 8: Edit Trust Relationships

1. Click on "Trust relationships": In the role details, click on the "Trust relationships" tab.
2. Edit Trust Relationships:
o Click on "Edit trust relationship".
o You will see a JSON policy document. This document defines which entities can assume
the role.
o You will need to update this document with the ARN and external ID provided by
Snowflake when creating the stage object in Snowflake.

Uploading Data to S3 Buckets

In this lecture, we will learn how to upload data to S3 buckets. We will create folders within an
S3 bucket and upload files in different formats such as CSV and Parquet.
Step-by-Step Guide to Uploading Data to S3 Buckets
Step 1: Access S3 Console

1. Navigate to AWS Management Console: Log in to your AWS account.


2. Go to S3:
o If you have recently visited the S3 service, it should appear on your console homepage.
o Alternatively, click on the "Services" dropdown and select "S3".

Step 2: Access Your Bucket

1. Locate Your Bucket: Find the bucket you created earlier (e.g., test-snowflake).
2. Open the Bucket: Click on the bucket name to access it.

Step 3: Create Folders

1. Create Folders:
o Navigate to the folder you created earlier (e.g., snowflake).
o Click on "Create folder".
o Name the first folder CSV and click "Save".
o Repeat the process to create another folder named parquet.

Step 4: Upload Files to Folders

1. Upload CSV File:


o Navigate to the CSV folder.
o Click on "Upload".
o Drag and drop your CSV file (e.g., health.csv) from your desktop or click "Add files" to
select it.
o Click "Next".
o Review the permissions and settings (default settings are usually sufficient).
o Click "Next" and then "Upload".
o Wait for the upload to complete and verify that the file appears in the folder.

2. Upload Parquet File:


o Navigate back to the snowflake folder and then to the parquet folder.
o Click on "Upload".
o Drag and drop your Parquet file(s) (e.g., health.parquet, cdk1.parquet,
cdc2.parquet) from your desktop or click "Add files" to select them.
o Click "Upload" (you can skip the intermediate steps if you don't need to change any
settings).
o Wait for the upload to complete and verify that the files appear in the folder.

"Version": "2012-10-17",
"Statement": [

"Effect": "Allow",

"Principal": {

"AWS": "arn:aws:iam::<your-account-details>"

},

"Action": "sts:AssumeRole",

"Condition": {

"StringEquals": {

"sts:ExternalId": "<your-external-id>"


Creating a Table Schema in Snowflake for Data from AWS S3

In this lecture, we will create a table schema or table metadata in Snowflake. This involves
writing a CREATE TABLE DDL (Data Definition Language) statement to define the structure
of the table, including the columns and their data types. This is the first step in the process of
fetching data from AWS S3 and loading it into Snowflake.

Step-by-Step Guide to Creating a Table Schema


Step 1: Understand the Data Structure

1. Review the Data: The data loaded into S3 consists of 26 columns related to healthcare providers,
treatments, charges, payments, discharges, and other characteristics.
2. Column Overview:
o Columns include provider information, treatment descriptions, charges, payments,
discharges, regions, and reimbursement percentages.
o The data also includes provider ID, name, state, street address, and zip code.

Step 2: Define the Table Schema

1. Column Names and Data Types: Based on the nature of the data, assign appropriate data types
to each column. For example:
o Numeric columns (e.g., average covered payments) should be defined as NUMBER.
o Character columns (e.g., provider name, referral region) should be defined as VARCHAR.
o Columns with special characters (e.g., total payments with dollar and million signs)
should also be defined as VARCHAR.

Step 3: Write the CREATE TABLE DDL Statement

1. SQL Statement: Write the SQL statement to create the table in Snowflake. Here is an example
based on the provided data:

SQL
CREATE TABLE healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER,
-- Add other columns as needed
-- Ensure to match the data types with the nature of the data
);
Step 4: Execute the DDL Statement in Snowflake

1. Run the Query: Execute the CREATE TABLE statement in Snowflake to create the empty
healthcare table.
2. Verify Table Creation:
o Refresh the database to see the newly created healthcare table under the public
schema.
o Preview the table to ensure it is created with the correct columns and data types.

Example Execution

1. Navigate to Snowflake: Open your Snowflake console.


2. Run the CREATE TABLE Statement:
o Copy and paste the CREATE TABLE statement into the Snowflake query editor.
o Execute the query.

SQL
CREATE TABLE healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER,
-- Add other columns as needed
-- Ensure to match the data types with the nature of the data
);

3. Verify the Results:


o Confirm that the table healthcare is successfully created.
o Refresh the database to see the healthcare table under the public schema.
o Preview the table to ensure it is empty but has the correct columns.
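A couple of quick checks from the worksheet can confirm this (assuming the table was created in the demo_db.public schema used later in this guide):

SQL
-- Confirm the table exists:
SHOW TABLES LIKE 'HEALTHCARE' IN SCHEMA demo_db.public;

-- Confirm it is empty but has the expected columns:
SELECT * FROM demo_db.public.healthcare LIMIT 10;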

Creating an Integration Object in Snowflake to Connect with AWS S3

In this lecture, we will create an integration object in Snowflake that establishes a connection
between AWS S3 and Snowflake. This integration object will allow Snowflake to access data
stored in S3 buckets.

Step-by-Step Guide to Creating an Integration Object


Step 1: Understand the Integration Object Syntax

1. SQL Statement: The SQL statement to create an integration object in Snowflake is as follows:

SQL
CREATE OR REPLACE STORAGE INTEGRATION S3_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = '<AWS_ROLE_ARN>'
STORAGE_ALLOWED_LOCATIONS = ('s3://<BUCKET_NAME>/<FOLDER_NAME>/',
's3://<ANOTHER_BUCKET_NAME>/');
Step 2: Retrieve AWS Role ARN

1. Navigate to IAM: Go to the AWS Management Console, then to the IAM service.
2. Find the Role: Locate the role you created earlier (e.g., snowflake-role).
3. Copy the Role ARN: Copy the Role ARN from the role details.

Step 3: Define Storage Allowed Locations

1. Navigate to S3: Go to the S3 service in the AWS Management Console.


2. Identify Buckets and Folders: Determine which buckets and folders you want Snowflake to
access. For example, you may have a bucket named test-snowflake with folders CSV,
parquet, and Json.

Step 4: Create the Integration Object in Snowflake

1. SQL Statement: Use the following SQL statement to create the integration object in Snowflake.
Replace <AWS_ROLE_ARN>, <BUCKET_NAME>, and <FOLDER_NAME> with your actual values.

SQL
CREATE OR REPLACE STORAGE INTEGRATION S3_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://test-snowflake/snowflake/', 's3://test-xyz-
snowflake/');
Step 5: Execute the SQL Statement

1. Run the Query: Execute the SQL statement in the Snowflake query editor.
2. Verify Execution: Ensure that the integration object is created successfully.

Describing the Integration Object and Updating Trust Relationships

In this lecture, we will describe the integration object we created in Snowflake and update the
trust relationships in AWS IAM to establish a secure connection between AWS S3 and
Snowflake.

Step-by-Step Guide to Describing the Integration Object and Updating Trust


Relationships
Step 1: Describe the Integration Object

1. SQL Statement: Use the following SQL statement to describe the integration object in Snowflake:

SQL
DESC INTEGRATION S3_INT;

2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Review the Results: The results will display several properties of the integration object. Key
properties include:
o STORAGE_AWS_IAM_USER_ARN
o STORAGE_EXTERNAL_ID

Step 2: Update Trust Relationships in AWS IAM

1. Retrieve Property Values:


o STORAGE_AWS_IAM_USER_ARN: Copy the value from the results of the DESCRIBE
INTEGRATION command.
o STORAGE_EXTERNAL_ID: Copy the value from the results of the DESCRIBE INTEGRATION
command.

2. Navigate to IAM: Go to the AWS Management Console and open the IAM service.
3. Find the Role: Locate the role you created earlier (e.g., snowflake-role).
4. Edit Trust Relationships:
o Go to the "Trust relationships" tab.
o Click on "Edit trust relationship".
o Update the policy document with the STORAGE_AWS_IAM_USER_ARN and
STORAGE_EXTERNAL_ID values from Snowflake.

Step 3: Update the Trust Policy

1. Policy Document: Update the trust policy document in IAM with the following values:
o AWS IAM User ARN: Replace the existing ARN with the STORAGE_AWS_IAM_USER_ARN
from Snowflake.
o External ID: Replace the existing external ID with the STORAGE_EXTERNAL_ID from
Snowflake.

2. Example Trust Policy:

JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/snowflake-user"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "external-id-from-snowflake"
}
}
}
]
}

3. Update the Policy: Paste the updated ARN and external ID into the policy document and click
"Update Trust Policy".

Example Execution

1. Describe the Integration Object in Snowflake:


o Run the following SQL statement:

SQL
DESCRIBE INTEGRATION S3_INT;
2. Review the Results:
o Identify the STORAGE_AWS_IAM_USER_ARN and STORAGE_EXTERNAL_ID from the
results.

3. Update Trust Relationships in IAM:


o Navigate to the IAM role (snowflake-role).
o Edit the trust relationship with the following values:

JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/snowflake-user"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "external-id-from-snowflake"
}
}
}
]
}

4. Verify the Update:


o Confirm that the trust relationships have been updated with the correct ARN and
external ID.

Loading Data from S3 to Snowflake

In this lecture, we will go through the steps to load data from AWS S3 into Snowflake. This
involves creating a file format, creating a stage object, and using the COPY INTO command to
load the data into the Snowflake table.

Step-by-Step Guide to Loading Data from S3 to Snowflake


Step 1: Create a File Format

1. SQL Statement: Use the following SQL statement to create a file format for CSV files:

SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the file format is successfully created.
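To double-check the file format from SQL, something like the following should work:

SQL
-- List file formats in the schema and inspect the definition of csv_format:
SHOW FILE FORMATS IN SCHEMA demo_db.public;
DESC FILE FORMAT demo_db.public.csv_format;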

Step 2: Create a Stage Object

1. SQL Statement: Use the following SQL statement to create a stage object:

SQL
CREATE OR REPLACE STAGE demo_db.public.ext_stage
URL = 's3://test-snowflake/snowflake/CSV/'
STORAGE_INTEGRATION = S3_INT
FILE_FORMAT = demo_db.public.csv_format;

2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the stage object is successfully created.
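One way to confirm that the stage can actually see the files in S3 is to list its contents:

SQL
-- List the files visible through the external stage:
LIST @demo_db.public.ext_stage;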

Step 3: Copy Data into Snowflake Table

1. SQL Statement: Use the following SQL statement to copy data into the healthcare table:

SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';

2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Review the Results: Check the results to see if the data was partially loaded due to errors.

Handling Errors Due to Delimiters

1. Identify the Issue: The error occurs because some values in the CSV file contain
commas, which is the delimiter.
2. Preview the Data: Use the following steps to preview the data in S3:
o Go to the S3 console.
o Navigate to the test-snowflake/snowflake/CSV/ folder.
o Select the health.csv file and click on "Select from".
o Choose the file format as CSV and click on "Show file preview".

3. Download and Inspect the File: Download the CSV file and inspect it to identify rows
with commas within values.
4. Modify the COPY Command: Use the ON_ERROR = 'CONTINUE' option to bypass rows
with errors:

SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';

5. Run the Query: Execute the SQL statement in the Snowflake query editor.
6. Review the Results: Check the results to see the number of rows loaded and any errors
encountered.
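As an optional extra step, the COPY statement can be run in validation mode first to surface the problem rows without loading anything; a sketch:

SQL
-- Return the rows that would fail, without loading any data:
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
VALIDATION_MODE = 'RETURN_ERRORS';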

Loading Complete Data from S3 to Snowflake with a Custom Delimiter

In this lecture, we will load the complete data from AWS S3 into Snowflake by changing the
delimiter in the CSV file to avoid issues with commas within the data values. We will use a pipe
(|) as the delimiter and follow the steps to update the file, upload it to S3, and load it into
Snowflake.

Step-by-Step Guide to Loading Data with a Custom Delimiter


Step 1: Update the CSV File with a Pipe Delimiter

1. Open the CSV File:


o Open the original CSV file (health.csv) in a text editor like Notepad++.

2. Replace Commas with Pipes:


o Use the "Replace" feature (Ctrl + H) to replace all commas with pipes (|).
o Be cautious with values that legitimately contain commas: after the bulk replace, manually
change the pipes inside those values back to commas so they remain part of the data rather
than being treated as delimiters.

3. Save the Updated File:


o Save the updated file as health_pipe.csv.

Step 2: Upload the Updated File to S3

1. Navigate to S3:
o Go to the AWS S3 console.
o Navigate to the appropriate bucket and folder (e.g.,
test-snowflake/snowflake/CSV/).

2. Upload the File:


o Drag and drop the updated health_pipe.csv file into the S3 folder.
o Click "Upload" to complete the process.

Step 3: Create a File Format in Snowflake with Pipe Delimiter

1. SQL Statement: Use the following SQL statement to create a file format with a pipe delimiter:

SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = '|'
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;

2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the file format is successfully created.

Step 4: Create a Stage Object for the Updated File

1. SQL Statement: Use the following SQL statement to create a stage object for the updated file:

SQL
CREATE OR REPLACE STAGE demo_db.public.ext_stage
URL = 's3://test-snowflake/snowflake/CSV/health_pipe.csv'
STORAGE_INTEGRATION = S3_INT
FILE_FORMAT = demo_db.public.csv_format;

2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the stage object is successfully created.

Step 5: Copy Data into Snowflake Table

1. Recreate the Healthcare Table:


o Optionally, recreate the healthcare table to ensure it is empty:

SQL
CREATE OR REPLACE TABLE demo_db.public.healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER
-- Add other columns as needed
);

2. SQL Statement: Use the following SQL statement to copy data into the healthcare table:

SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';

3. Run the Query: Execute the SQL statement in the Snowflake query editor.
4. Review the Results: Check the results to see if the data was fully loaded without errors.
Loading CSV Data into Snowflake: Step-by-Step Guide

Below is the step-by-step guide to load CSV data from AWS S3 into Snowflake. This guide
includes creating the table, integration object, file format, stage object, and using the COPY
command to ingest data.

Step 1: Create the Table

SQL
CREATE OR REPLACE TABLE HEALTHCARE_CSV (
AVERAGE_COVERED_CHARGES NUMBER(38,6),
AVERAGE_TOTAL_PAYMENTS NUMBER(38,6),
TOTAL_DISCHARGES NUMBER(38,0),
BACHELORORHIGHER NUMBER(38,1),
HSGRADORHIGHER NUMBER(38,1),
TOTALPAYMENTS VARCHAR(128),
REIMBURSEMENT VARCHAR(128),
TOTAL_COVERED_CHARGES VARCHAR(128),
REFERRALREGION_PROVIDER_NAME VARCHAR(256),
REIMBURSEMENTPERCENTAGE NUMBER(38,9),
DRG_DEFINITION VARCHAR(256),
REFERRAL_REGION VARCHAR(26),
INCOME_PER_CAPITA NUMBER(38,0),
MEDIAN_EARNINGSBACHELORS NUMBER(38,0),
MEDIAN_EARNINGS_GRADUATE NUMBER(38,0),
MEDIAN_EARNINGS_HS_GRAD NUMBER(38,0),
MEDIAN_EARNINGSLESS_THAN_HS NUMBER(38,0),
MEDIAN_FAMILY_INCOME NUMBER(38,0),
NUMBER_OF_RECORDS NUMBER(38,0),
POP_25_OVER NUMBER(38,0),
PROVIDER_CITY VARCHAR(128),
PROVIDER_ID NUMBER(38,0),
PROVIDER_NAME VARCHAR(256),
PROVIDER_STATE VARCHAR(128),
PROVIDER_STREET_ADDRESS VARCHAR(256),
PROVIDER_ZIP_CODE NUMBER(38,0)
);

Step 2: Create Integration Object for External Stage

SQL
CREATE OR REPLACE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::435098453023:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://testsnowflake/snowflake/',
's3://testxyzsnowflake/');

Step 3: Describe Integration Object to Fetch External ID


SQL
DESC INTEGRATION s3_int;

Step 4: Create File Format

SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = '|'
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;

Step 5: Create Stage Object

SQL
CREATE OR REPLACE STAGE demo_db.public.ext_csv_stage
URL = 's3://testsnowflake/snowflake/csv'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = demo_db.public.csv_format;

Step 6: Use COPY Command to Ingest Data from S3

SQL
COPY INTO healthcare_csv
FROM @demo_db.public.ext_csv_stage
ON_ERROR = 'CONTINUE';

Step 7: Verify Data Load

SQL
SELECT * FROM healthcare_csv;

This guide provides a comprehensive step-by-step process to load CSV data from AWS S3 into Snowflake,
ensuring that all necessary configurations and commands are clearly outlined.

Loading JSON Data from S3 to Snowflake


1. Create a Schema

First, create a new schema to organize the JSON data:

SQL
CREATE SCHEMA Json_data;
USE SCHEMA Json_data;
2. Create a Table

Create a table to load the JSON data. This table will have 26 columns along with metadata
columns such as file name, file row number, and load timestamp:

SQL
CREATE TABLE healthcare_json_table (
-- Define your 26 columns here
column1 STRING,
column2 STRING,
-- ...
file_name STRING,
file_row_number NUMBER,
load_timestamp TIMESTAMP
);
3. Integration Object

Ensure you have an integration object configured to allow Snowflake to interact with your S3
bucket. This should have been set up previously for loading CSV data.

4. Create File Format

Create a file format for JSON:

SQL
CREATE OR REPLACE FILE FORMAT Json_format
TYPE = 'JSON';
5. Create Stage Object

Create an external stage to point to the JSON data in S3:

SQL
CREATE OR REPLACE STAGE Json_stage
URL = 's3://test-snowflake/snowflake/json/'
FILE_FORMAT = Json_format
STORAGE_INTEGRATION = your_integration_object;
6. Loading Data Using COPY Command

Use the COPY command to load data from S3 to Snowflake. Ensure that you reference the
correct JSON keys and maintain the exact case and spaces as in the JSON file:

SQL
COPY INTO healthcare_json_table
FROM @Json_stage
FILE_FORMAT = (FORMAT_NAME = Json_format)
ON_ERROR = 'CONTINUE';
7. Verify Data Load

Check the loaded data and verify that all rows have been loaded correctly:

SQL
SELECT file_name, COUNT(*) AS row_count
FROM healthcare_json_table
GROUP BY file_name;
Summary

 Schema Creation: Organize your data by creating a new schema.


 Table Creation: Define a table with the necessary columns and metadata.
 Integration Object: Ensure Snowflake can interact with S3.
 File Format: Define the JSON file format.
 Stage Object: Create an external stage pointing to the JSON data in S3.
 Copy Command: Load the data using the COPY command, ensuring correct referencing of JSON
keys.
 Verification: Verify that the data has been loaded correctly.

This process ensures that JSON data from S3 is accurately loaded into Snowflake, transforming
it into a relational table format for further analysis.

Loading JSON Data from S3 to Snowflake


1. Create the Table

Create a table to store the JSON data with the necessary columns and metadata:

SQL
CREATE OR REPLACE TABLE healthcare_json (
id VARCHAR(50),
AVERAGE_COVERED_CHARGES VARCHAR(150),
AVERAGE_TOTAL_PAYMENTS VARCHAR(150),
TOTAL_DISCHARGES INTEGER,
BACHELORORHIGHER FLOAT,
HSGRADORHIGHER VARCHAR(150),
TOTALPAYMENTS VARCHAR(128),
REIMBURSEMENT VARCHAR(128),
TOTAL_COVERED_CHARGES VARCHAR(128),
REFERRALREGION_PROVIDER_NAME VARCHAR(256),
REIMBURSEMENTPERCENTAGE VARCHAR(150),
DRG_DEFINITION VARCHAR(256),
REFERRAL_REGION VARCHAR(26),
INCOME_PER_CAPITA VARCHAR(150),
MEDIAN_EARNINGSBACHELORS VARCHAR(150),
MEDIAN_EARNINGS_GRADUATE VARCHAR(150),
MEDIAN_EARNINGS_HS_GRAD VARCHAR(150),
MEDIAN_EARNINGSLESS_THAN_HS VARCHAR(150),
MEDIAN_FAMILY_INCOME VARCHAR(150),
NUMBER_OF_RECORDS VARCHAR(150),
POP_25_OVER VARCHAR(150),
PROVIDER_CITY VARCHAR(128),
PROVIDER_ID VARCHAR(150),
PROVIDER_NAME VARCHAR(256),
PROVIDER_STATE VARCHAR(128),
PROVIDER_STREET_ADDRESS VARCHAR(256),
PROVIDER_ZIP_CODE VARCHAR(150),
filename VARCHAR,
file_row_number VARCHAR,
load_timestamp TIMESTAMP DEFAULT TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP)
);
2. Create JSON File Format

Create a file format for JSON:


SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.json_format
TYPE = 'json';
3. Create External Stage

Create an external stage to point to the JSON data in S3:

SQL
CREATE OR REPLACE STAGE demo_db.public.ext_json_stage
URL = 's3://testsnowflake/snowflake/json'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = demo_db.public.json_format;
4. Load Data Using COPY Command

Use the COPY command to load data from S3 to Snowflake:

SQL
COPY INTO demo_db.public.healthcare_json
FROM (
SELECT
$1:"_id"::VARCHAR,
$1:" Average Covered Charges "::VARCHAR,
$1:" Average Total Payments "::VARCHAR,
$1:" Total Discharges "::INTEGER,
$1:"% Bachelor's or Higher"::FLOAT,
$1:"% HS Grad or Higher"::VARCHAR,
$1:"Total payments"::VARCHAR,
$1:"% Reimbursement"::VARCHAR,
$1:"Total covered charges"::VARCHAR,
$1:"Referral Region Provider Name"::VARCHAR,
$1:"ReimbursementPercentage"::VARCHAR,
$1:"DRG Definition"::VARCHAR,
$1:"Referral Region"::VARCHAR,
$1:"INCOME_PER_CAPITA"::VARCHAR,
$1:"MEDIAN EARNINGS - BACHELORS"::VARCHAR,
$1:"MEDIAN EARNINGS - GRADUATE"::VARCHAR,
$1:"MEDIAN EARNINGS - HS GRAD"::VARCHAR,
$1:"MEDIAN EARNINGS- LESS THAN HS"::VARCHAR,
$1:"MEDIAN_FAMILY_INCOME"::VARCHAR,
$1:"Number of Records"::VARCHAR,
$1:"POP_25_OVER"::VARCHAR,
$1:"Provider City"::VARCHAR,
$1:"Provider Id"::VARCHAR,
$1:"Provider Name"::VARCHAR,
$1:"Provider State"::VARCHAR,
$1:"Provider Street Address"::VARCHAR,
$1:"Provider Zip Code"::VARCHAR,
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP)
FROM @demo_db.public.ext_json_stage
);
5. Verify Data Load

Check the loaded data:

SQL
SELECT * FROM healthcare_json;
6. Clean Up

If needed, truncate the table or drop it:

SQL
TRUNCATE TABLE healthcare_json;
DROP TABLE healthcare_json;
7. Check Other Tables

Verify data in other tables if required:

SQL
SELECT * FROM healthcare_csv;
SELECT * FROM healthcare_parquet;
SELECT * FROM healthcare_json;

This process ensures that JSON data from S3 is accurately loaded into Snowflake, transforming
it into a relational table format for further analysis.


Types of Tables in Snowflake

Snowflake supports three types of tables, each with distinct characteristics and use cases:

1. Permanent Tables
2. Temporary Tables
3. Transient Tables

1. Permanent Tables

 Default Table Type: When you create a table in Snowflake without specifying the type, it
defaults to a permanent table.
 Longevity: These tables are designed for long-term storage and are typically used for production
data.
 Data Protection: They have robust data protection and recovery mechanisms.
 Time Travel: Permanent tables support a Time Travel retention period of up to 90 days (Enterprise Edition and higher; 1 day on Standard Edition).
 Failsafe: They include a failsafe period of seven days, which provides an additional layer of data
recovery.

2. Temporary Tables

 Session-Specific: Temporary tables exist only within the session in which they are created. Once
the session ends, the table is automatically dropped.
 Non-Recoverable: Data in temporary tables cannot be recovered after the session ends.
 Isolation: These tables are not visible to other users or sessions.
 No Cloning: Temporary tables do not support features such as cloning.
 Naming Precedence: If a temporary table and a permanent table have the same name within
the same schema, the temporary table takes precedence when queried.

3. Transient Tables

 Similar to Permanent Tables: Transient tables are similar to permanent tables in terms of
structure and usage.
 No Failsafe: They do not have a failsafe period, which means they are not designed for the same
level of data protection and recovery.
 Cost Efficiency: Transient tables are designed for data that does not require long-term
protection, making them more cost-effective.
 Time Travel: Their Time Travel retention period is limited to a maximum of 1 day.
 Schema and Database: You can create transient databases and schemas. All objects within a
transient schema or database will also be transient by default.


Design Considerations for Snowflake Objects

When designing your Snowflake data platform, it's crucial to decide on the type of database,
schema, or table to create based on your requirements and cost management. Each type of table
has associated costs and features that affect storage and data protection.

Types of Tables and Their Costs

1. Permanent Tables
o Cost: Higher due to failsafe and longer time travel periods.
o Features: Failsafe period of 7 days, time travel up to 90 days.
2. Transient Tables
o Cost: Lower as they do not have a failsafe period.
o Features: Time travel retention period of 1 day, no failsafe.

3. Temporary Tables
o Cost: Typically lower as they exist only within a session.
o Features: Session-specific, non-recoverable after session ends.

Creating Databases and Schemas


Transient Database and Schema

To create a transient database:

SQL
CREATE OR REPLACE TRANSIENT DATABASE development;

 Note: All objects (schemas, tables, views) created under this database will be transient by
default.

To create a schema within the transient database:

SQL
USE DATABASE development;
CREATE SCHEMA employee;

 Result: The employee schema will also be transient.

To verify the database and schema:

SQL
SHOW DATABASES;
SHOW SCHEMAS;

 Output: You will see the development database and employee schema marked as transient.

To create a transient table within the schema:

SQL
USE SCHEMA employee;
CREATE TABLE employees (id INT, name STRING);

 Verification:

SQL
SHOW TABLES;

 Output: The employees table will be transient with a default retention time of 1 day.
Permanent Database and Schema

To create a permanent database:

SQL
CREATE OR REPLACE DATABASE development_perm;

 Note: No specific keyword is needed for a permanent database.

To create a schema within the permanent database:

SQL
USE DATABASE development_perm;
CREATE SCHEMA employee;

 Result: The employee schema will be permanent.

To verify the database and schema:

SQL
SHOW DATABASES;
SHOW SCHEMAS;

 Output: You will see the development_perm database and employee schema marked as
permanent.

To create a permanent table within the schema:

SQL
USE SCHEMA employee;
CREATE TABLE employees (id INT, name STRING);

 Verification:

SQL
SHOW TABLES;

 Output: The employees table will be permanent with a default retention time of 1 day, which
can be extended up to 90 days.


Types of Objects in Snowflake

When designing your Snowflake data platform, it's crucial to decide which type of database,
schema, or table to create based on your requirements and cost management. Snowflake offers
three types of objects:
1. Temporary
2. Transient
3. Permanent

Temporary Objects

 Existence: Only within the session in which they are created.


 Visibility: Not visible to other users or sessions.
 Features: Do not support features like cloning.
 Data Retention: Data is purged once the session ends and is not recoverable.

Transient Objects

 Existence: Persist until explicitly dropped.


 Visibility: Available to all users with appropriate privileges.
 Design: For transitory data that needs to be maintained beyond each session.
 Data Retention: Shorter time travel retention period and no failsafe period.

Permanent Objects (Default)

 Existence: Persist until explicitly dropped.


 Visibility: Available to all users with appropriate privileges.
 Design: For long-term data storage with additional data protection.
 Data Retention: High number of time travel retention days and a failsafe period for data
recovery.

Creating Databases, Schemas, and Tables


Create Transient Database, Schema, and Table
SQL
-- Create TRANSIENT Database
CREATE OR REPLACE TRANSIENT DATABASE DEVELOPMENT;

-- Show Databases
SHOW DATABASES;

-- Describe Database
DESC DATABASE DEVELOPMENT;

-- Use Database
USE DATABASE DEVELOPMENT;

-- Create Schema
CREATE OR REPLACE SCHEMA EMPLOYEE;

-- Show Schemas
SHOW SCHEMAS;

-- Create Transient Table
CREATE OR REPLACE TABLE EMPLOYEES (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

-- Show Tables
SHOW TABLES;

-- Drop Database
DROP DATABASE DEVELOPMENT;
Create Permanent Database, Schema, and Table
SQL
-- Create PERMANENT Database
CREATE OR REPLACE DATABASE DEVELOPMENT_PERM;

-- Show Databases
SHOW DATABASES;

-- Use Database
USE DATABASE DEVELOPMENT_PERM;

-- Create Schema
CREATE OR REPLACE SCHEMA EMPLOYEE;

-- Show Schemas
SHOW SCHEMAS;

-- Create Permanent Table
CREATE OR REPLACE TABLE EMPLOYEES (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

-- Show Tables
SHOW TABLES;

-- Drop Database
DROP DATABASE DEVELOPMENT_PERM;


Creating Transient and Permanent Schemas in Snowflake


Step 1: Switch to the Permanent Database

First, ensure you are using a permanent database. In this example, we will use demo_db.

SQL
USE DATABASE demo_db;
Step 2: Create a Transient Schema

To create a transient schema under the permanent database demo_db, use the following
command:

SQL
-- Create Transient Schema
CREATE OR REPLACE TRANSIENT SCHEMA employee;

 Explanation: The TRANSIENT keyword is used to specify that the schema is transient. This
means all objects created within this schema will inherit the transient property.

Verify the creation and properties of the schema:

SQL
-- Show Schemas
SHOW SCHEMAS;

 Output: You should see the employee schema listed as transient. The public schema remains permanent because it inherits the properties of the permanent database.

Step 3: Create a Table in the Transient Schema

Create a table within the transient schema:

SQL
-- Create Table in Transient Schema
USE SCHEMA employee;
CREATE OR REPLACE TABLE employees (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

Verify the table creation and its properties:

SQL
-- Show Tables
SHOW TABLES;

 Output: The employees table should be listed as transient. The icon for a transient table is
different from that of a permanent table, indicating its transient nature.

Step 4: Create a Permanent Schema

To create a permanent schema under the permanent database demo_db, use the following
command:
SQL
-- Create Permanent Schema
CREATE OR REPLACE SCHEMA employee_perm;

 Explanation: No specific keyword is needed for a permanent schema. By default, schemas are
permanent unless specified otherwise.

Verify the creation and properties of the schema:

SQL
-- Show Schemas
SHOW SCHEMAS;

 Output: You should see the employee_perm schema listed without the transient property.

Step 5: Create a Table in the Permanent Schema

Create a table within the permanent schema:

SQL
-- Create Table in Permanent Schema
USE SCHEMA employee_perm;
CREATE OR REPLACE TABLE employees (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

Verify the table creation and its properties:

SQL
-- Show Tables
SHOW TABLES;

 Output: The employees table should be listed as a permanent table. The icon for a permanent
table is different from that of a transient table.

By following these steps, you can effectively manage and create transient and permanent
schemas and tables in Snowflake, ensuring the appropriate use of resources and data protection
features.


Creating Different Types of Tables in Snowflake


Step 1: Create a Temporary Table

To create a temporary table, use the TEMPORARY keyword. Temporary tables exist only within the
session in which they are created and are not visible to other sessions.

SQL
-- Create Temporary Table
CREATE OR REPLACE TEMPORARY TABLE employees_temp (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

Verify the creation and properties of the table:

SQL
-- Show Tables
SHOW TABLES;

 Output: The employees_temp table should be listed as temporary. The icon for a temporary
table has a little clock sign.

Insert some rows into the temporary table:

SQL
-- Insert Rows into Temporary Table
INSERT INTO employees_temp (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);

Verify the data:

SQL
-- Select Data from Temporary Table
SELECT * FROM employees_temp;

 Note: If you try to access this table from a different session, you will get an error because
temporary tables are session-specific.

Step 2: Create a Transient Table

To create a transient table, use the TRANSIENT keyword. Transient tables persist until explicitly
dropped and are available to all users with the appropriate privileges.

SQL
-- Create Transient Table
CREATE OR REPLACE TRANSIENT TABLE employees_transient (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

Verify the creation and properties of the table:

SQL
-- Show Tables
SHOW TABLES;

 Output: The employees_transient table should be listed as transient. The icon for a
transient table is different from that of a permanent table.

Insert some rows into the transient table:

SQL
-- Insert Rows into Transient Table
INSERT INTO employees_transient (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);

Verify the data:

SQL
-- Select Data from Transient Table
SELECT * FROM employees_transient;

 Note: Unlike temporary tables, transient tables can be accessed from different sessions.

Step 3: Create a Permanent Table

To create a permanent table, you can simply use the CREATE OR REPLACE TABLE command.
Permanent tables are the default type and do not require a specific keyword.

SQL
-- Create Permanent Table
CREATE OR REPLACE TABLE employees_perm (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);

Verify the creation and properties of the table:

SQL
-- Show Tables
SHOW TABLES;

 Output: The employees_perm table should be listed as a permanent table. The icon for a
permanent table is different from that of a transient or temporary table.

Insert some rows into the permanent table:

SQL
-- Insert Rows into Permanent Table
INSERT INTO employees_perm (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);

Verify the data:

SQL
-- Select Data from Permanent Table
SELECT * FROM employees_perm;

 Note: Permanent tables can be accessed from different sessions and have a failsafe period for
data recovery.

To convert a table from one type to another in Snowflake, you create a copy of the table with the desired type using the CLONE keyword.

Converting Tables in Snowflake


Step 1: Convert a Permanent Table to a Transient Table

To convert a permanent table to a transient table, use the CREATE OR REPLACE TRANSIENT
TABLE command along with the CLONE keyword. This will create a transient copy of the
permanent table.

SQL
-- Convert Permanent Table to Transient Table
CREATE OR REPLACE TRANSIENT TABLE employees_transient CLONE employees_perm;
 Explanation: This command creates a new transient table named employees_transient by
cloning the existing permanent table employees_perm.

Verify the creation and properties of the table:

SQL
-- Show Tables
SHOW TABLES;

 Output: You should see the employees_transient table listed as transient. The original
employees_perm table will still exist.

If you no longer need the original permanent table, you can drop it:

SQL
-- Drop Permanent Table
DROP TABLE employees_perm;
Step 2: Convert a Transient Table to a Temporary Table

To convert a transient table to a temporary table, use the CREATE OR REPLACE TEMPORARY
TABLE command along with the CLONE keyword. This will create a temporary copy of the
transient table.

SQL
-- Convert Transient Table to Temporary Table
CREATE OR REPLACE TEMPORARY TABLE employees_temp CLONE employees_transient;

 Explanation: This command creates a new temporary table named employees_temp by cloning
the existing transient table employees_transient.

Verify the creation and properties of the table:

SQL
-- Show Tables
SHOW TABLES;

 Output: You should see the employees_temp table listed as temporary. The original
employees_transient table will still exist.

If you no longer need the original transient table, you can drop it:

SQL
-- Drop Transient Table
DROP TABLE employees_transient;

Understanding Time Travel in Snowflake


Introduction

In this section, we will discuss an important concept in Snowflake called Time Travel. Before
diving into Time Travel, it is recommended to review the previous section on different types of
tables in Snowflake for a better understanding.

What is Time Travel?

Time Travel in Snowflake allows users to access historical data at any point within a defined
retention period. This feature is part of Snowflake's continuous data protection lifecycle,
enabling users to query, clone, and restore data from the past.

Continuous Data Protection Lifecycle

Snowflake's continuous data protection lifecycle involves creating various objects such as
databases, schemas, tables, and views using different DDL statements and SQL operations. Time
Travel is a key component of this lifecycle, providing the ability to view and restore historical
data.

Example Timeline

To illustrate Time Travel, consider the following hypothetical timeline for an employees table:

 Day 1: Created the employees table and inserted 3 records.


 Day 5: Inserted 5 more records.
 Day 10: Updated 2 records.
 Day 12: Deleted 1 record.
 Day 16: Inserted 2 new records and updated 1 record.

Using Time Travel, you can view the state of the employees table at any of these points in time.

Key Features of Time Travel

1. Historical Data Access: View the state of a table at any point within the retention period.
2. Data Recovery: Restore data that has been updated or deleted.
3. Cloning: Create clones of tables, schemas, and databases at specific points in time.
4. Dropped Objects Recovery: Restore dropped tables, schemas, and databases.

Time Travel Retention Period

The retention period for Time Travel varies based on the Snowflake edition:

 Enterprise and Higher Editions: Up to 90 days.


 Standard Edition: Up to 1 day.

The retention period is a crucial property that determines how long historical data is preserved.
Failsafe

After the retention period, data moves to the Failsafe zone, where it is retained for an additional
7 days for permanent tables. However, data in the Failsafe zone cannot be queried or restored by
users and is reserved for disaster recovery by Snowflake.

Operations Using Time Travel

 Running Queries: Query historical data that has been updated or deleted.
 Creating Clones: Clone tables, schemas, and databases at specific points in time.
 Restoring Dropped Objects: Restore dropped tables, schemas, and databases.

Setting the Retention Period

The retention period can be set during the creation of objects and is managed by users with the
accountadmin role. For permanent tables, the retention period can be set up to 90 days, while
for transient and temporary tables, it is limited to 1 day.
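
The retention period can also be supplied when the object is created. A minimal sketch, assuming a permanent table named orders on an Enterprise (or higher) edition account:

SQL
-- Set the Time Travel retention period at creation time
CREATE TABLE orders (
    order_id INTEGER,
    order_date DATE
)
DATA_RETENTION_TIME_IN_DAYS = 30;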

Setting and Altering Data Retention Time in Snowflake


Introduction

In this section, we will explore how to set and alter the data retention time property for tables in
Snowflake. This property is crucial for managing the Time Travel feature, which allows users to
access historical data.

Creating and Setting Retention Time for a Table

Let's start by creating a table and setting its retention time property:

1. Create a Permanent Table:

SQL
CREATE TABLE employees (
id INT,
name STRING,
position STRING
);

2. Check the Table:

SQL
SHOW TABLES;

This will display the employees table with a default retention time.

3. Set Retention Time:

SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 90;

4. Verify Retention Time:

SQL
SHOW TABLES;

You should see the retention time set to 90 days.

Attempting to Set Retention Time Beyond Maximum Limit

If you try to set the retention time beyond the allowed limit (90 days for permanent tables),
Snowflake will return an error:

SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 95;

Error: Exceeds the maximum allowable retention time of 90 days.

Altering Retention Time

You can alter the retention time to any value between 0 and 90 days:

SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 30;

Verify the change:

SQL
SHOW TABLES;

The retention time should now be updated to 30 days.


Inheritance of Retention Time

When you create a schema or database with a specific retention time, all objects under it inherit
this property unless explicitly set otherwise.

1. Create a Schema with Retention Time:

SQL
CREATE SCHEMA employee_perm DATA_RETENTION_TIME_IN_DAYS = 10;

2. Create Tables Under the Schema:

SQL
CREATE TABLE employee_perm.employee_new (
id INT,
name STRING,
position STRING
);

3. Insert Data:

SQL
INSERT INTO employee_perm.employee_new VALUES (1, 'John Doe',
'Manager');

4. Check Retention Time:

SQL
SHOW TABLES IN SCHEMA employee_perm;

The employee_new table should have a retention time of 10 days.

Creating Transient and Temporary Tables

Transient and temporary tables have different retention time limits:

1. Create Transient Table:

SQL
CREATE TRANSIENT TABLE employee_perm.employee_transient (
id INT,
name STRING,
position STRING
);

2. Create Temporary Table:

SQL
CREATE TEMPORARY TABLE employee_perm.employee_temp (
id INT,
name STRING,
position STRING
);

3. Check Retention Time:

SQL
SHOW TABLES IN SCHEMA employee_perm;

o Transient table: Retention time is 1 day.


o Temporary table: Retention time is inherited from the schema but effectively limited to 1
day.

Verifying Actual Retention Time

To verify the actual retention time, you can query the metadata:

SQL
SELECT table_name, retention_time
FROM demo_db.information_schema.tables
WHERE table_schema = 'EMPLOYEE_PERM';  -- unquoted identifiers are stored in uppercase
Altering Schema Retention Time

You can alter the retention time for a schema, which will affect all new objects created under it:

SQL
ALTER SCHEMA employee_perm SET DATA_RETENTION_TIME_IN_DAYS = 55;

Verify the change:

SQL
SHOW SCHEMAS;

Querying Historical Data Using Time Travel in Snowflake


Introduction

In this section, we will explore how to query historical data by utilizing Snowflake's Time Travel
feature. We will demonstrate three methods: using timestamps, offsets, and query IDs.

Method 1: Using Timestamps

To query historical data based on a specific timestamp, follow these steps:

1. Check the Current Timestamp:

SQL
SELECT CURRENT_TIMESTAMP();
This returns the current timestamp in the session's current time zone.

2. Set Session Time Zone to UTC:

SQL
ALTER SESSION SET TIMEZONE = 'UTC';

3. Query Historical Data Using Timestamp:

SQL
SELECT * FROM employees AT (TIMESTAMP => '2023-10-12 12:00:00'::TIMESTAMP);

Replace '2023-10-12 12:00:00' with the desired timestamp.

4. Handling Errors: If the timestamp is beyond the allowed time travel period or before the
object creation time, you will receive an error:

SQL
SELECT * FROM employees AT (TIMESTAMP => '2023-10-11 12:00:00'::TIMESTAMP);

Error: Time travel data is not available for table employees. The
requested time is either beyond the allowed time travel period or before
the object creation time.

Method 2: Using Offsets

To query historical data based on an offset from the current time:

1. Query Historical Data Using Offset:

SQL
SELECT * FROM employees AT (OFFSET => -60 * 5);

This queries the state of the employees table 5 minutes ago (300 seconds).

2. Handling Errors: If the offset is beyond the allowed time travel period:

SQL
SELECT * FROM employees AT (OFFSET => -60 * 7);

Error: Time travel data is not available for table employees before
seven minutes.
Method 3: Using Query IDs

To query historical data based on a specific query ID:

1. Run a Query to Generate a Query ID:

SQL
SELECT * FROM employees;

2. Fetch the Query ID from History: Open the query history in Snowflake and copy the
query ID of the desired query.
3. Query Historical Data Using Query ID:

SQL
SELECT * FROM employees AT (STATEMENT => 'query_id');

Replace 'query_id' with the actual query ID copied from the history.

Cloning Historical Objects Using Time Travel in Snowflake


Introduction

In this section, we will explore how to clone historical objects using Snowflake's Time Travel
feature. Cloning allows you to create a duplicate of an object (table, schema, or database) at a
specified point in its history.

Cloning a Table Using Timestamp

To clone a table at a specific timestamp, follow these steps:

1. Get the Current Timestamp:

SQL
SELECT CURRENT_TIMESTAMP();

2. Clone the Table:

SQL
CREATE TABLE restore_table CLONE employees AT (TIMESTAMP => '2023-10-12 12:00:00'::TIMESTAMP);

Replace '2023-10-12 12:00:00' with the desired timestamp.

3. Verify the Cloned Table:

SQL
SELECT * FROM restore_table;
Check that the cloned table has the same data as the original table at the specified
timestamp.

Cloning a Schema Using Offset

To clone a schema using an offset from the current time:

1. Clone the Schema:

SQL
CREATE SCHEMA restore_schema CLONE employee_perm AT (OFFSET => -600);

This clones the employee_perm schema as it was 600 seconds (10 minutes) ago.

2. Verify the Cloned Schema:

SQL
SHOW SCHEMAS;

Check that the restore_schema has been created with the tables that existed in
employee_perm at the specified offset.

Cloning a Database Using Query ID

To clone a database using a specific query ID:

1. Get the Query ID: Run a query to generate a query ID:

SQL
SELECT * FROM employees;

2. Fetch the Query ID from History: Open the query history in Snowflake and copy the
query ID of the desired query.
3. Clone the Database:

SQL
CREATE DATABASE restore_db CLONE demo_db AT (STATEMENT => '01a2b3c4-d5e6-7f89-0a1b-2c3d4e5f6g7h');

Replace '01a2b3c4-d5e6-7f89-0a1b-2c3d4e5f6g7h' with the actual query ID.

4. Verify the Cloned Database:

SQL
SHOW DATABASES;

Check that the restore_db has been created with the schemas and tables that existed in
demo_db before the specified query ID.
Dropping and Restoring Objects Using Time Travel in
Snowflake
Introduction

In this session, we will learn how to drop and restore objects (tables, schemas, and databases) in
Snowflake using the Time Travel feature. When an object is dropped, it is retained for the data
retention period, during which it can be restored. Once the retention period has passed, the object
is moved to the Failsafe zone and cannot be restored by users.

Checking History for Objects

To check the history of different objects in Snowflake, you can use the following commands:

1. Show Table History:

SQL
SHOW TABLES HISTORY LIKE 'employees%' IN DATABASE demo_db;

This command displays the history of tables starting with 'employees' in the demo_db
database.

2. Show Schema History:

SQL
SHOW SCHEMAS HISTORY IN DATABASE demo_db;

This command displays the history of schemas in the demo_db database.

3. Show Database History:

SQL
SHOW DATABASES HISTORY;

This command displays the history of all databases.

Dropping Objects

Let's drop a database, schema, and table to see how the history is updated:

1. Drop Database:

SQL
DROP DATABASE development;
2. Drop Schema:

SQL
DROP SCHEMA demo_db.employee;

3. Drop Table:

SQL
DROP TABLE demo_db.employee_perm.employees;
Restoring Dropped Objects

To restore dropped objects within the retention period, use the UNDROP command:

1. Restore Table:

SQL
UNDROP TABLE demo_db.employee_perm.employees;

2. Restore Schema:

SQL
UNDROP SCHEMA demo_db.employee;

3. Restore Database:

SQL
UNDROP DATABASE development;
Verifying Restored Objects

After restoring the objects, you can verify their existence:

1. Verify Restored Table:

SQL
SHOW TABLES IN SCHEMA demo_db.employee_perm;

2. Verify Restored Schema:

SQL
SHOW SCHEMAS IN DATABASE demo_db;

3. Verify Restored Database:

SQL
SHOW DATABASES;
Important Notes

 If an object with the same name already exists, the UNDROP command will fail. You must rename the existing object before restoring the previous version (see the sketch after these notes).
 The SHOW ... HISTORY commands include an additional column dropped_on, which displays
the date and time when the object was dropped. If an object has been dropped more than once,
each version is included as a separate row in the output.
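
A minimal sketch of the rename-then-restore pattern, assuming a new employees table has since been created in the same schema:

SQL
-- An object named employees already exists, so rename it first
ALTER TABLE demo_db.employee_perm.employees RENAME TO demo_db.employee_perm.employees_current;

-- The dropped version can now be restored under its original name
UNDROP TABLE demo_db.employee_perm.employees;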

Understanding Fail Safe in Snowflake


Introduction

In this section, we will discuss the concept of Fail Safe in Snowflake, which is an essential
component of Snowflake's continuous data protection lifecycle. Fail Safe ensures that historical
data is protected and recoverable in the event of a system failure or disaster.

Continuous Data Protection Lifecycle

We have already seen the continuous data protection lifecycle in the Time Travel section. Let's
revisit the key points:

 DDL Operations: Create various objects such as databases, schemas, tables, and views.
 Time Travel: Allows querying and cloning historical data within a retention period (up to 90 days
for permanent tables, 1 day for transient and temporary tables).
 Fail Safe: Provides an additional 7-day period for permanent tables after the Time Travel
retention period ends.

Fail Safe Overview

Fail Safe is a non-configurable 7-day period during which historical data is recoverable by
Snowflake only. User interactions are not allowed in the Fail Safe zone, and it is intended for use
by Snowflake in case of hardware failures or disasters.

Key Points about Fail Safe

 No User Operations: Unlike Time Travel, users cannot perform any operations in the Fail Safe
zone.
 Data Recovery: Fail Safe is used by Snowflake to recover data in case of extreme operational
failures.
 Non-Configurable Period: The Fail Safe period is fixed at 7 days and cannot be altered.
 Cost Implications: Data in the Fail Safe zone incurs additional storage costs.
Example Timeline

Consider the following timeline for an employees table:

 Day 1: Created the employees table and inserted 3 records.


 Day 5: Inserted 5 more records.
 Day 10: Updated 2 records.
 Day 12: Deleted 1 record.

If the Time Travel retention period is set to 12 days:

 Day 13: Data from Day 1 moves to Fail Safe.


 Day 17: Data from Day 5 moves to Fail Safe.
 Day 22: Data from Day 10 moves to Fail Safe.
 Day 24: Data from Day 12 moves to Fail Safe.

Cost Considerations

Fail Safe can significantly impact storage costs due to multiple snapshots:

 Snapshot Size: Each snapshot taken during the day is considered for Fail Safe storage.
 Cumulative Cost: If multiple snapshots are taken, the cumulative size is used for cost calculation.

Recommendations

 Use Transient Tables: During development and testing phases, use transient tables to avoid Fail
Safe costs.
 Design Considerations: Carefully design your data retention and backup strategies to minimize
costs.

Fail Safe vs. Traditional Backups

Fail Safe provides a more efficient and cost-effective alternative to traditional backups:

 Eliminates Redundancy: Avoids the need for multiple full and incremental backups.
 Scalability: Scales with your data without the need for manual intervention.
 Reduced Downtime: Minimizes downtime and data loss during recovery.

Monitoring Fail Safe Storage Consumption in Snowflake


Introduction

In this section, we will explore how to monitor and access the storage consumption of the Fail
Safe zone in Snowflake. Understanding Fail Safe storage consumption is crucial for managing
costs and ensuring efficient data management.

Prerequisites

Ensure that you have switched your role to ACCOUNTADMIN (or a role granted the MONITOR USAGE privilege) to access the necessary account-level usage details.

Steps to Access Fail Safe Storage Consumption

1. Switch to Account Admin Role:

SQL
USE ROLE ACCOUNTADMIN;

2. Navigate to the Account Section:


o Log in to the Snowflake web interface.
o Click on the Account option in the top menu.

3. Access Usage Details:


o Under the Usage section, click on Average Storage Used.

4. View Storage Consumption:


o You will see different tabs for Total, Database, Stage, and Fail Safe.
o Click on the Fail Safe tab to view the storage consumption for the Fail Safe zone.

Understanding the Storage Consumption Details

 Total Storage: Displays the total storage used by Snowflake.


 Database Storage: Shows the storage used by databases.
 Stage Storage: Displays the storage used by stage objects.
 Fail Safe Storage: Shows the storage consumption for the Fail Safe zone.
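
The same figures can also be pulled with SQL from the shared SNOWFLAKE database. A minimal sketch; it assumes your role can read SNOWFLAKE.ACCOUNT_USAGE, and these views can lag real-time usage by an hour or two:

SQL
-- Tables currently holding data in the Fail Safe zone, largest first
SELECT table_catalog, table_schema, table_name,
       active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE failsafe_bytes > 0
ORDER BY failsafe_bytes DESC;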

Example Analysis

Let's analyze the storage consumption data:

1. Fail Safe Consumption Trend:


o The Fail Safe consumption is shown to be increasing day by day from day 5 to day 11.
o This increase is due to the snapshots of data taken for each table, as discussed in the
previous lecture.

2. Average Consumption:
o The average consumption for the Fail Safe zone is around 1.5 MB.
o The average consumption for the database is 1.68 MB.
o Although the data is not huge, it is important to note that Fail Safe storage can
sometimes exceed database storage due to multiple snapshots.

Best Practices for Managing Fail Safe Costs

 Design Considerations: Carefully design your data retention and backup strategies to minimize
costs.
 Use Transient Tables: During development and testing phases, use transient tables to avoid Fail
Safe costs, as Fail Safe is not applicable to transient tables.

Understanding and Leveraging Tasks in Snowflake


Introduction

In this section, we will learn about tasks in Snowflake and how to leverage them for automating
various operations. A task in Snowflake is a kind of trigger that gets executed at a specific time
or period. Tasks can be used to automate SQL statements or stored procedures at scheduled
intervals.

What is a Task in Snowflake?

A task is an object in Snowflake that allows you to schedule and automate the execution of SQL
statements or stored procedures. Tasks can be set to run at specific intervals or at a specific point
in time, and they will continue to run until manually stopped.

Key Features of Tasks

 Scheduled Execution: Tasks can be scheduled to run at specific intervals or times.


 Automated Operations: Tasks can automate operations such as creating tables, inserting or
deleting rows, and running stored procedures.
 No Event Source Trigger: Tasks cannot be triggered by events such as the creation of a new
table. They must be scheduled.
 Single Instance Execution: Only one instance of a task runs at a scheduled time.

Types of Tasks

1. Standalone Task: A task that does not have any child tasks and is not dependent on any parent
task.
2. Parent-Child Tasks: Tasks that have dependencies, where a parent task can have multiple child
tasks.
Example Use Cases

 Data Ingestion: Automate the ingestion of data into tables at regular intervals.
 Data Cleanup: Schedule tasks to delete old or unnecessary data from tables.
 Data Transformation: Automate the execution of stored procedures for data transformation.

How Tasks Get Executed

When a task is scheduled, it goes through the following stages:

1. Queued: The task is queued for execution.


2. Running: The task starts running once it is dequeued.

For example, if a task is set to run every minute:

 The task may be queued for a few seconds (e.g., 20 seconds) due to other queries or tasks being
executed.
 The task then runs for the remaining time (e.g., 40 seconds) within the one-minute window.
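
One way to see how long task runs actually waited in the queue is to compare the scheduled time with the query start time in the task history (a minimal sketch using the TASK_HISTORY table function covered later in this document):

SQL
-- Approximate queue time per task run
SELECT name, scheduled_time, query_start_time,
       DATEDIFF('second', scheduled_time, query_start_time) AS seconds_in_queue
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY scheduled_time DESC;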

Considerations for Warehouse Size

The size of the warehouse should be determined based on the number and volume of tasks you
will be executing. For heavy workloads, consider using a larger warehouse size to ensure
efficient execution of tasks.
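
A minimal sketch of a small, dedicated warehouse for tasks; the name task_wh and the sizes are assumptions to adjust for your workload:

SQL
-- Small, auto-suspending warehouse dedicated to task execution
CREATE WAREHOUSE IF NOT EXISTS task_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60      -- suspend after 60 seconds of inactivity
    AUTO_RESUME = TRUE;

-- Resize later if task workloads grow
ALTER WAREHOUSE task_wh SET WAREHOUSE_SIZE = 'MEDIUM';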

Creating and Managing Tasks

Let's see how to create and manage tasks in Snowflake:

1. Create a Task:

SQL
CREATE TASK my_task
WAREHOUSE = my_warehouse
SCHEDULE = '1 MINUTE'
AS
INSERT INTO my_table (column1, column2)
SELECT column1, column2
FROM source_table;

2. Start a Task:

SQL
ALTER TASK my_task RESUME;

3. Stop a Task:

SQL
ALTER TASK my_task SUSPEND;

4. Drop a Task:

SQL
DROP TASK my_task;
Monitoring Task Execution

You can monitor the execution of tasks by checking the query history:

 Query History: View the time taken to complete tasks and analyze performance.
 Task History: Check the status and execution details of tasks.

Understanding and Leveraging Tree of Tasks in Snowflake


Introduction

In this section, we will learn about the concept of a tree of tasks in Snowflake and how to
leverage it for automating complex workflows. A tree of tasks allows you to create a hierarchy of
tasks with dependencies, where a parent task can have multiple child tasks.

Tree of Tasks Overview

A tree of tasks is a hierarchical structure where tasks are organized in a parent-child relationship.
The topmost task in the hierarchy is known as the root task, which executes all the subtasks.

Example Tree of Tasks

Consider the following tree of tasks:

Text
                 Task A (Root Task)
                /                  \
           Task B                Task C
          /   |   \              /    \
     Task D  Task E  Task F  Task G  Task H

 Task A: Root task that executes Task B and Task C.


 Task B: Parent task for Task D, Task E, and Task F.
 Task C: Parent task for Task G and Task H.
 Task D, E, F: Child tasks of Task B.
 Task G, H: Child tasks of Task C.
Rules for Tree of Tasks

1. Single Path Between Nodes: An individual task can have only one parent task. For example, Task
B can only have Task A as its parent.
2. Root Task Schedule: The root task must have a defined schedule. Child tasks are triggered based
on the completion of their parent tasks.
3. Maximum Tasks: A tree of tasks can have a maximum of 1000 tasks, including the root task, in a
resumed state.
4. Maximum Child Tasks: A task can have a maximum of 100 child tasks.
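
Once a tree exists, you can inspect it with the TASK_DEPENDENTS table function. A minimal sketch, assuming a root task named root_task as in the example that follows:

SQL
-- List the root task and all of its direct and indirect child tasks
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_DEPENDENTS(
    TASK_NAME => 'root_task',
    RECURSIVE => TRUE
));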

Execution Flow

 Root Task Execution: The root task (Task A) is scheduled to run at specific intervals.
 Child Task Execution: Once the root task completes, its child tasks (Task B and Task C) are
executed.
 Subsequent Child Tasks: Child tasks of Task B (Task D, Task E, Task F) and Task C (Task G, Task H)
are executed after their respective parent tasks complete.

Example Execution Timeline

Consider a tree of tasks that requires 5 minutes on average to complete each run:

Text
Run 1:
- T1 (Root Task) starts and remains in queue for a few seconds, then runs.
- T2 (Child Task of T1) starts after T1 completes, remains in queue, then runs.
- T3 (Child Task of T1) starts after T1 completes, remains in queue, then runs.
- Total time: 5 minutes.

Run 2:
- T1 starts again at the beginning of the next 5-minute window.
- T2 and T3 follow the same execution pattern as in Run 1.
Creating and Managing Tree of Tasks

Let's see how to create and manage a tree of tasks in Snowflake:

1. Create Root Task:

SQL
CREATE TASK root_task
WAREHOUSE = my_warehouse
SCHEDULE = '5 MINUTE'
AS
CALL my_stored_procedure();

2. Create Child Task:


SQL
CREATE TASK child_task_1
AFTER root_task
WAREHOUSE = my_warehouse
AS
INSERT INTO my_table (column1, column2)
SELECT column1, column2
FROM source_table;

3. Create Subsequent Child Tasks:

SQL
CREATE TASK child_task_2
AFTER child_task_1
WAREHOUSE = my_warehouse
AS
DELETE FROM my_table WHERE condition;

CREATE TASK child_task_3
AFTER child_task_1
WAREHOUSE = my_warehouse
AS
UPDATE my_table SET column1 = value WHERE condition;

4. Start Root Task:

SQL
ALTER TASK root_task RESUME;
Monitoring Task Execution

You can monitor the execution of tasks by checking the query history and task history:

 Query History: View the time taken to complete tasks and analyze performance.
 Task History: Check the status and execution details of tasks.

Creating Tasks in Snowflake


Introduction

In this section, we will learn how to create and manage tasks in Snowflake. We will start by
creating a table and then create a task to insert records into the table at a specific interval.

Step-by-Step Guide
Step 1: Create a Table

First, we create a table named employees with three columns: employee_id, employee_name,
and load_time.
SQL
CREATE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR DEFAULT 'YourName',
load_time DATE
);

 employee_id: An integer column with auto-increment properties.


 employee_name: A varchar column with a default value.
 load_time: A date column to store the timestamp of when the record is inserted.

Step 2: Verify Table Creation

Verify that the table has been created successfully.

SQL
SHOW TABLES LIKE 'employees';
Step 3: Create a Task

Next, we create a task to insert records into the employees table at a specific interval (every one
minute).

SQL
CREATE OR REPLACE TASK employees_task
WAREHOUSE = compute_wh
SCHEDULE = '1 MINUTE'
AS
INSERT INTO employees (load_time)
VALUES (CURRENT_TIMESTAMP);

 employees_task: The name of the task.


 compute_wh: The virtual warehouse used to run the task.
 SCHEDULE = '1 MINUTE': The task is scheduled to run every one minute.
 INSERT INTO employees (load_time) VALUES (CURRENT_TIMESTAMP): The SQL
statement to be executed by the task.

Step 4: Verify Task Creation

Verify that the task has been created successfully.

SQL
SHOW TASKS LIKE 'employees_task';
Step 5: Resume the Task

By default, the task is in a suspended state. We need to resume the task to start its execution.

SQL
ALTER TASK employees_task RESUME;
Step 6: Verify Task Status

Check the status of the task to ensure it is running.

SQL
SHOW TASKS LIKE 'employees_task';
Step 7: Verify Records in the Table

After resuming the task, verify that records are being inserted into the employees table every
minute.

SQL
SELECT * FROM employees;

You should see records being inserted with the employee_id auto-incrementing, employee_name set to the default value, and load_time showing the current date (the inserted CURRENT_TIMESTAMP is cast to the DATE column type).
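
If you are not moving straight on to the tree-of-tasks example, suspend the task so it does not keep consuming warehouse credits, since a resumed task runs until it is explicitly stopped:

SQL
-- Suspend the task when you are done experimenting
ALTER TASK employees_task SUSPEND;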

Creating a Tree of Tasks in Snowflake


Introduction

In this lecture, we will create a tree of tasks in Snowflake, establishing a parent-child relationship
between tasks. We will build on the previous example by creating a root task and two child tasks
that will execute based on the completion of the parent task.

Step-by-Step Guide
Step 1: Verify and Suspend the Existing Task

First, let's verify that the previous task is working and then suspend it to create child tasks.

SQL
-- Verify the existing task
SHOW TASKS LIKE 'employees_task';

-- Suspend the existing task
ALTER TASK employees_task SUSPEND;
Step 2: Create the Employees Copy Table

Create a copy of the employees table without the auto-increment and default properties.

SQL
CREATE TABLE employees_copy (
employee_id INTEGER,
employee_name VARCHAR,
load_time DATE
);
Step 3: Create the Employees Copy Task

Create a task to insert records into the employees_copy table after the employees_task
completes.

SQL
CREATE OR REPLACE TASK employees_copy_task
WAREHOUSE = compute_wh
AFTER employees_task
AS
INSERT INTO employees_copy (employee_id, employee_name, load_time)
SELECT employee_id, employee_name, load_time
FROM employees;
Step 4: Create the Employees Copy 2 Table

Create another copy of the employees table.

SQL
CREATE TABLE employees_copy_2 (
employee_id INTEGER,
employee_name VARCHAR,
load_time DATE
);
Step 5: Create the Employees Copy 2 Task

Create a task to insert records into the employees_copy_2 table after the employees_task
completes.

SQL
CREATE OR REPLACE TASK employees_copy_2_task
WAREHOUSE = compute_wh
AFTER employees_task
AS
INSERT INTO employees_copy_2 (employee_id, employee_name, load_time)
SELECT employee_id, employee_name, load_time
FROM employees;
Step 6: Resume the Child Tasks

Resume the child tasks before resuming the parent task.

SQL
-- Resume the first child task
ALTER TASK employees_copy_task RESUME;

-- Resume the second child task
ALTER TASK employees_copy_2_task RESUME;
Step 7: Resume the Parent Task

Resume the parent task to start the execution of the tree of tasks.

SQL
ALTER TASK employees_task RESUME;
Execution Flow

 Parent Task: The employees_task runs every minute, inserting a record into the employees
table.
 Child Tasks: After the employees_task completes, the employees_copy_task and
employees_copy_2_task run, copying records from the employees table to their respective
tables.

Automating Stored Procedures with Tasks in Snowflake


Introduction

In this lecture, we will learn how to call stored procedures automatically using tasks in
Snowflake. We will build on the previous example by creating a stored procedure that inserts
values into a table and then create a task to call this stored procedure at regular intervals.

Step-by-Step Guide
Step 1: Create the Employees Table

First, we create the employees table with three columns: employee_id, employee_name, and
load_time.

SQL
CREATE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR DEFAULT 'YourName',
load_time DATE
);

 employee_id: An integer column with auto-increment properties.


 employee_name: A varchar column with a default value.
 load_time: A date column to store the timestamp of when the record is inserted.

Step 2: Create the Stored Procedure

Next, we create a stored procedure that inserts values into the employees table. The stored
procedure will take one argument, today_date, and use it to insert the current timestamp into
the load_time column.

SQL
CREATE OR REPLACE PROCEDURE load_employees_data(today_date VARCHAR)
RETURNS STRING NOT NULL
LANGUAGE JAVASCRIPT
AS
$$
var sql_command = `INSERT INTO employees (load_time) VALUES (?)`;
snowflake.execute({
sqlText: sql_command,
binds: [TODAY_DATE]  // JavaScript exposes procedure arguments as uppercase identifiers
});
return 'Succeeded';
$$;

 load_employees_data: The name of the stored procedure.


 today_date: A varchar argument representing the current date (referenced as TODAY_DATE inside the JavaScript body).
 sql_command: The SQL command to insert a record into the employees table.
 snowflake.execute: Executes the SQL command with the provided argument.

Step 3: Create the Task

Create a task to call the stored procedure every minute.

SQL
CREATE OR REPLACE TASK employees_load_task
WAREHOUSE = compute_wh
SCHEDULE = '1 MINUTE'
AS
CALL load_employees_data(CURRENT_TIMESTAMP::VARCHAR);

 employees_load_task: The name of the task.


 compute_wh: The virtual warehouse used to run the task.
 SCHEDULE = '1 MINUTE': The task is scheduled to run every minute.
 CALL load_employees_data(CURRENT_TIMESTAMP::VARCHAR): Calls the stored
procedure with the current timestamp as an argument.

Step 4: Verify and Resume the Task

Verify that the task has been created successfully and then resume it to start its execution.

SQL
-- Verify the task
SHOW TASKS LIKE 'employees_load_task';

-- Resume the task
ALTER TASK employees_load_task RESUME;
Step 5: Verify Records in the Table

After resuming the task, verify that records are being inserted into the employees table every
minute.

SQL
SELECT * FROM employees;
You should see records being inserted with the employee_id auto-incrementing, employee_name set to the default value, and load_time showing the current date (the timestamp argument is cast to the DATE column type).

Monitoring Task History in Snowflake


Introduction

Understanding how to monitor and analyze the history of tasks in Snowflake is crucial for
ensuring that tasks are executed as expected and for troubleshooting any issues that may arise. In
this section, we will explore various methods to track the task history.

Methods to Track Task History


Method 1: Retrieve Most Recent 100 Records

You can retrieve the most recent 100 records of task executions using the following query:

SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY SCHEDULED_TIME;

This query provides a comprehensive history of task executions, including:

 Task names
 Query IDs
 Database and schema information
 Query text
 Status (succeeded or failed)
 Error codes and messages (if any)
 Query start and completion times
 Root task information
 Run IDs

Method 2: Retrieve Task History Within a Specific Time Range

To retrieve task history within a specific time range, use the following query:

SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => '2023-10-01 11:00:00'::TIMESTAMP_LTZ,
SCHEDULED_TIME_RANGE_END => '2023-10-01 12:00:00'::TIMESTAMP_LTZ
));

This query filters the task history to show only the records within the specified time range. The
output includes the same metrics as the previous method but limited to the specified period.
Method 3: Retrieve Latest N Records for a Specific Task

To retrieve the latest N records for a specific task, use the following query:

SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => CURRENT_TIMESTAMP - INTERVAL '1 HOUR',
RESULT_LIMIT => 10,
TASK_NAME => 'employees_load_task'
));

This query allows you to specify:

 A time range (e.g., the last hour)


 A limit on the number of records (e.g., the latest 10 records)
 The specific task name for which you want to check the history

Understanding Streams in Snowflake


Introduction

Streams in Snowflake are a powerful feature that enables Change Data Capture (CDC) by
tracking changes made to tables, such as inserts, updates, and deletes. This is particularly useful
for performing incremental loads or capturing changes from various data sources to keep your
data warehouse up to date.

Why Use Streams?

When using Snowflake as your data warehouse, you may need to:

 Perform incremental loads from existing databases or data sources.


 Capture changes from CSV files or other databases.
 Load dimensions and facts accurately based on received inserts, deletes, or updates.

What Are Streams?

Streams in Snowflake are objects that record Data Manipulation Language (DML) changes made
to tables. This includes:

 Inserts
 Updates
 Deletes
 Metadata about each change
Streams help in capturing these changes with ease, allowing you to perform various operations
such as loading dimensions and facts accurately.

How Streams Work

 CDC Process: Streams facilitate the Change Data Capture process by recording changes at the
row level between two transactional points in time in a table.
 Tracking Changes: Once a stream is created on a source table, it starts tracking all changes from
that point in time.
 Row-Level Changes: Streams track changes at the row level between two points in time (e.g.,
Time T1 and Time T2).

Example Scenario

Suppose you have a table named employees and you create a stream on this table on Day 1. Any
DML operations (inserts, updates, deletes) performed on the employees table will be tracked by
the stream. On Day 2, you can check the stream to see what changes occurred between Day 1
and Day 2.

Key Concepts

 Source Table: The table on which the stream is created.


 Stream Object: The object that tracks changes made to the source table.
 Transactional Points: Specific points in time (e.g., T1 and T2) between which changes are
tracked.

Benefits of Using Streams

 Ease of Use: Streams make it easy to capture and track changes in tables.
 Accuracy: Ensures accurate loading of dimensions and facts based on tracked changes.
 Flexibility: Can be used with various data sources and staging tables in Snowflake.

Understanding How Streams Work in Snowflake


Introduction

Streams in Snowflake are used to track changes in tables, such as inserts, updates, and deletes.
Understanding how streams work under the hood is crucial for effectively using them in your
data warehouse operations.
How Streams Work
Initial Snapshot

When you create a stream on a table, it logically takes an initial snapshot of every row in the
source table. This snapshot serves as the baseline from which changes are tracked.

Change Tracking System

After the initial snapshot, the change tracking system records information about changes (inserts,
updates, deletes) committed after the snapshot was taken. For example, if the snapshot was taken
at Time T1, any changes made between T1 and T2 will be recorded.

Hidden Columns

Streams do not contain table data themselves. Instead, they create hidden columns in the original
table to track changes. Snowflake charges for the storage cost associated with these hidden
columns.

Offsets

A stream stores the offset for the source table and returns CDC records by leveraging the
versioning history for the source table. The offset represents a point in time in the transactional
version timeline of the source table.

Understanding Offsets
Concept of Offsets

Offsets are like bookmarks in a book, indicating a point in time from which changes are tracked.
When you start a stream, the offset is set to zero. As changes are made and consumed, the offset
is updated to reflect the new point in time.

Example Scenario

1. Initial Creation (T1):


o Create a stream on the employees table.
o Initial offset is set to zero.
2. First Change (T2):
o Make changes to the employees table.
o Consume the changes at T2.
o Offset is updated from zero to one.
3. Second Change (T3):
o Make further changes to the employees table.
o Consume the changes at T3.
o Offset is updated from one to two.
4. Third Change (T4):
o Make additional changes to the employees table.
o Consume the changes at T4.
o Offset is updated from two to three.

Practical Example
Creating a Stream
SQL
CREATE OR REPLACE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR,
load_time DATE
);

CREATE OR REPLACE STREAM employees_stream ON TABLE employees;


Making Changes and Consuming the Stream

1. Insert a Row:

SQL
INSERT INTO employees (employee_name, load_time) VALUES ('John Doe',
CURRENT_DATE);

2. Query the Stream:

SQL
SELECT * FROM employees_stream;

You should see one INSERT record for the new row. A plain SELECT only shows the pending changes; the offset advances only when the stream is consumed by a DML statement.

3. Update a Row:

SQL
UPDATE employees SET employee_name = 'John Smith' WHERE employee_id = 1;

4. Query the Stream Again:

SQL
SELECT * FROM employees_stream;

Because the insert and the update both happened after the current offset, the stream shows a single INSERT record containing the latest values ('John Smith') rather than separate records for each statement.
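
To consume the stream and move the offset forward, read it inside a DML statement; a minimal sketch, assuming a hypothetical employees_history table as the target:

SQL
-- Hypothetical target table for the consumed changes
CREATE TABLE IF NOT EXISTS employees_history (
employee_id INTEGER,
employee_name VARCHAR,
action VARCHAR
);

-- Reading the stream inside a DML statement consumes it and advances the offset
INSERT INTO employees_history (employee_id, employee_name, action)
SELECT employee_id, employee_name, METADATA$ACTION
FROM employees_stream;

After this INSERT commits, SELECT * FROM employees_stream returns no rows until new changes are made to the employees table.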
Key Points to Remember

1. Offsets: Offsets are updated each time changes are consumed. They represent the point in time
from which the stream will start tracking new changes.
2. Latest Action: If multiple statements change a row, the stream contains only the latest action
taken on that row.
3. Hidden Columns: Streams use hidden columns in the original table to track changes, and
Snowflake charges for the storage cost of these columns.
4. CDC Records: Streams return CDC records by leveraging the versioning history for the source
table.

Capturing Changes Using Streams in Snowflake


Introduction

Streams in Snowflake are a powerful feature for tracking changes in tables, such as inserts,
updates, and deletes. There are three types of streams in Snowflake:

1. Standard Streams: Track inserts, updates, and deletes.


2. Append-Only Streams: Track only inserts.
3. Insert-Only Streams: Used for external tables (e.g., tables residing on cloud storage like AWS,
Azure, or GCP).
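
Each type is created with slightly different syntax; a minimal sketch, using the employees table from earlier and a hypothetical external table named employees_ext:

SQL
-- Standard stream (default): tracks inserts, updates, and deletes
CREATE OR REPLACE STREAM standard_stream ON TABLE employees;

-- Append-only stream: tracks inserts only
CREATE OR REPLACE STREAM append_only_stream ON TABLE employees APPEND_ONLY = TRUE;

-- Insert-only stream: for external tables only
CREATE OR REPLACE STREAM insert_only_stream ON EXTERNAL TABLE employees_ext INSERT_ONLY = TRUE;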

Types of Streams
1. Standard Streams

Standard streams track all types of changes (inserts, updates, and deletes) in a table. Let's explore
how to create and use standard streams with an example.

Example: Using Standard Streams


Step 1: Create the Employees Table

First, create a table named employees with three columns: employee_id, salary, and
manager_id.

SQL
CREATE OR REPLACE TABLE employees (
employee_id INTEGER,
salary INTEGER,
manager_id INTEGER
);
Step 2: Create a Stream

Create a stream to track changes to the employees table.

SQL
CREATE OR REPLACE STREAM employees_stream ON TABLE employees;
Step 3: Verify the Stream

You can verify the stream by running the following commands:

SQL
-- Show all streams
SHOW STREAMS;

-- Describe the stream
DESCRIBE STREAM employees_stream;
Step 4: Check the Stream Offset

The offset indicates the point in time from which the stream starts tracking changes. Immediately after creation, the offset points at the moment the stream was created.
SQL
SELECT SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream');

Convert the offset to a timestamp:

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
Step 5: Insert Data into the Employees Table

Insert some records into the employees table.

SQL
INSERT INTO employees (employee_id, salary, manager_id) VALUES
(1, 50000, 101),
(2, 35000, 102),
(3, 31000, 103),
(4, 28000, 104),
(5, 90000, 105);
Step 6: Query the Stream

Query the stream to see the changes tracked.

SQL
SELECT * FROM employees_stream;

The stream output includes the original table columns (employee_id, salary, manager_id) and
additional metadata columns:

 METADATA$ACTION: Indicates the type of change (INSERT, DELETE).


 METADATA$ISUPDATE: Indicates whether the record is part of an UPDATE (updates are recorded as a DELETE/INSERT pair with this flag set to TRUE).
 METADATA$ROW_ID: Unique and immutable identifier for the row, useful for tracking changes to a specific row over time.
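
To see these metadata columns alongside the table data, you can project them by name; a small sketch:

SQL
SELECT employee_id,
       salary,
       manager_id,
       METADATA$ACTION,
       METADATA$ISUPDATE,
       METADATA$ROW_ID
FROM employees_stream;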

Step 7: Consume the Changes

Consume the changes from the stream by inserting them into a consumer table.

SQL
-- Create the consumer table
CREATE OR REPLACE TABLE employees_consumer (
employee_id INTEGER,
salary INTEGER
);

-- Insert changes from the stream into the consumer table
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary FROM employees_stream;

Verify the data in the consumer table:

SQL
SELECT * FROM employees_consumer;
Step 8: Check the Updated Stream Offset

After consuming the changes, the stream offset is updated.

SQL
SELECT SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream');

Convert the offset to a timestamp:

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

Performing Update Operations Using Streams in Snowflake


Introduction

In this example, we will demonstrate how to perform update operations using streams in
Snowflake. We will update rows in the employees table and track these changes using a stream.
Finally, we will consume the changes and insert them into a consumer table.

Step-by-Step Guide
Step 1: Verify the Stream

First, check if the stream has any rows.

SQL
SELECT * FROM employees_stream;

If the stream is empty, it means no new changes have been made to the source table since the stream was last consumed.

Step 2: Check the Stream Offset

Retrieve the current offset of the stream.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This timestamp indicates the point in time from which the stream starts tracking changes.

Step 3: Update Rows in the Employees Table

Update the employees table to increase the salary of employees whose salary is less than 33,000.

SQL
UPDATE employees
SET salary = salary + 10000
WHERE salary < 33000;
Step 4: Verify the Updated Employees Table

Check the employees table to see the updated rows.

SQL
SELECT * FROM employees ORDER BY salary;

You should see that the salaries of employees with employee_id 3 and 4 have been incremented.

Step 5: Check the Stream for Tracked Changes

Query the stream to see the tracked changes.

SQL
SELECT * FROM employees_stream;

You will see four records instead of two. This is because streams track updates by recording a
delete for the old row and an insert for the new row.

 Deleted Rows: The original rows before the update.


 Inserted Rows: The updated rows after the update.
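
You can look at the two halves of an update separately by filtering on the metadata columns; a small sketch (these SELECTs only read the stream and do not consume it):

SQL
-- Before-images: the row values as they were before the update
SELECT * FROM employees_stream
WHERE METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = TRUE;

-- After-images: the row values as they are after the update
SELECT * FROM employees_stream
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = TRUE;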

Step 6: Consume the Changes

Insert the changes from the stream into the employees_consumer table.

SQL
-- Create the consumer table only if it does not already exist
-- (CREATE OR REPLACE would remove the rows consumed earlier)
CREATE TABLE IF NOT EXISTS employees_consumer (
employee_id INTEGER,
salary INTEGER
);

-- Insert changes into the consumer table
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary
FROM employees_stream
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = TRUE;

This query inserts only the updated rows (with the new salaries) into the employees_consumer
table.

Step 7: Verify the Consumer Table

Check the employees_consumer table to see the inserted rows.

SQL
SELECT * FROM employees_consumer;
You should see that the table now contains two rows each for employee_id 3 and 4, one with the old salary and one with the new salary.

Step 8: Check the Updated Stream Offset

After consuming the changes, the stream offset is updated.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This new timestamp indicates the point in time from which the stream will start tracking new
changes.

Capturing Delete Operations Using Streams in Snowflake


Introduction

In this lecture, we will learn how to leverage streams to capture delete operations on a table in
Snowflake. We will delete rows from the employees table and track these changes using a
stream. Finally, we will consume the changes and delete the corresponding rows from a
consumer table.

Step-by-Step Guide
Step 1: Verify the Stream Offset

First, check the current offset of the stream.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This timestamp indicates the point in time from which the stream starts tracking changes.

Step 2: Delete Rows from the Employees Table

Delete rows from the employees table where the salary is less than 40,000.

SQL
DELETE FROM employees
WHERE salary < 40000;
Step 3: Verify the Deleted Rows in the Employees Table

Check the employees table to see the remaining rows.

SQL
SELECT * FROM employees ORDER BY salary;

You should see that the rows with employee_id 2 and 4 have been deleted.
Step 4: Check the Stream for Tracked Changes

Query the stream to see the tracked delete operations.

SQL
SELECT * FROM employees_stream;

You will see two records with the DELETE action for the deleted rows.

Step 5: Delete Corresponding Rows from the Consumer Table

Delete the corresponding rows from the employees_consumer table by consuming the changes
from the stream.

SQL
DELETE FROM employees_consumer
WHERE employee_id IN (
SELECT DISTINCT employee_id
FROM employees_stream
WHERE METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = FALSE
);

This query deletes the rows from the employees_consumer table that match the employee_id of
the deleted rows in the employees table.

Step 6: Verify the Consumer Table

Check the employees_consumer table to see the remaining rows.

SQL
SELECT * FROM employees_consumer;

You should see that the rows with employee_id 2 and 4 have been deleted.

Step 7: Check the Updated Stream Offset

After consuming the changes, the stream offset is updated.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This new timestamp indicates the point in time from which the stream will start tracking new
changes.
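
In practice, the insert, update, and delete cases from the last few sections are often consumed together with a single MERGE statement that reads the stream. A minimal sketch, using the employees_consumer table as the target (the matching logic will depend on your own target table):

SQL
MERGE INTO employees_consumer target
USING (
    -- Keep inserts, update after-images, and true deletes; ignore update before-images
    SELECT employee_id, salary, METADATA$ACTION AS change_type
    FROM employees_stream
    WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = TRUE)
) changes
ON target.employee_id = changes.employee_id
WHEN MATCHED AND changes.change_type = 'DELETE' THEN DELETE
WHEN MATCHED AND changes.change_type = 'INSERT' THEN
    UPDATE SET salary = changes.salary
WHEN NOT MATCHED AND changes.change_type = 'INSERT' THEN
    INSERT (employee_id, salary) VALUES (changes.employee_id, changes.salary);

Because the MERGE reads the stream inside a DML statement, it consumes the stream and advances the offset just like the INSERT and DELETE statements shown above.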

Understanding Transactions and Streams in Snowflake


Introduction

In this lecture, we will explore how streams in Snowflake behave when changes are made within
a transaction. We will demonstrate that streams only capture changes after the transaction is
committed.

Step-by-Step Guide
Step 1: Verify the Stream Offset

First, check the current offset of the stream.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This timestamp indicates the point in time from which the stream starts tracking changes.

Step 2: Begin a Transaction

Start a new transaction.

SQL
BEGIN;

Verify that the transaction has started.

SQL
SHOW TRANSACTIONS;

You should see the transaction ID and session ID indicating that the transaction is currently
running.

Step 3: Insert Rows Within the Transaction

Insert some rows into the employees table within the transaction.

SQL
INSERT INTO employees (employee_id, salary, manager_id) VALUES
(6, 45000, 106),
(7, 55000, 107),
(8, 65000, 108);
Step 4: Verify the Employees Table and Stream

Check the employees table to see the current rows. The session that opened the transaction can see the uncommitted rows, but they are not visible to any other session until the transaction is committed.

SQL
SELECT * FROM employees ORDER BY salary;
Check the stream to see whether it has captured any changes. When a stream is queried inside an open transaction, it returns changes as of the point when the transaction began, so it will not show these uncommitted rows and appears empty.

SQL
SELECT * FROM employees_stream;
Step 5: Commit the Transaction

Commit the transaction to make the changes permanent.

SQL
COMMIT;
Step 6: Verify the Employees Table and Stream After Commit

Check the employees table again to see the new rows.

SQL
SELECT * FROM employees ORDER BY salary;

Check the stream to see if it has captured the changes after the commit.

SQL
SELECT * FROM employees_stream;

You should see the new rows captured by the stream.

Step 7: Consume the Stream

Create a consumer table and insert the changes from the stream into the consumer table.

SQL
-- Create the consumer table only if it does not already exist, so previously consumed rows are kept
CREATE TABLE IF NOT EXISTS employees_consumer (
employee_id INTEGER,
salary INTEGER
);

-- Insert changes into the consumer table
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary
FROM employees_stream;

Verify the data in the consumer table.

SQL
SELECT * FROM employees_consumer;

You should see the new rows inserted into the consumer table.
Step 8: Check the Updated Stream Offset

After consuming the changes, the stream offset is updated.

SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));

This new timestamp indicates the point in time from which the stream will start tracking new
changes.

Step 9: Set Comments for Streams

It's good practice to set a comment on each stream to document its purpose.

SQL
ALTER STREAM employees_stream SET COMMENT = 'This stream is used to capture
changes from the employees table.';

Verify the comment by showing the streams.

SQL
SHOW STREAMS;

You should see the updated comment for the stream.
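
The comment can also be set when the stream is created; a small sketch (note that CREATE OR REPLACE recreates the stream, which resets its offset):

SQL
CREATE OR REPLACE STREAM employees_stream ON TABLE employees
COMMENT = 'This stream is used to capture changes from the employees table.';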

Step 10: Drop the Stream

If the stream is no longer needed, you can drop it.

SQL
DROP STREAM employees_stream;

Verify that the stream has been dropped.

SQL
SHOW STREAMS;

You should see that the stream no longer exists in your schema.
