Snowflake
Introduction to Snowflake
Snowflake is a cloud data warehouse that differs from traditional on-premises databases and
data warehouses.
It operates on cloud infrastructure provided by:
o Amazon Web Services (AWS)
o Microsoft Azure
o Google Cloud Platform (GCP)
Niche Features:
o Time Travel: Allows users to access historical data at any point within a defined period.
o Fail Safe: Provides a seven-day period to recover historical data.
o Data Cloning: Enables instant, zero-copy clones of databases, schemas, and tables.
o Data Sharing: Facilitates secure and governed sharing of data across organizations.
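As a quick illustration, here is a minimal SQL sketch of Time Travel and zero-copy cloning; the object names (orders, prod_db) are placeholders.
SQL
-- Time Travel: query the table as it existed five minutes ago (orders is a placeholder name)
SELECT * FROM orders AT (OFFSET => -60*5);
-- Zero-copy cloning: create an instant clone without duplicating storage (prod_db is a placeholder)
CREATE DATABASE prod_db_dev CLONE prod_db;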
Cost Efficiency:
o Pay-as-you-use Model: Customers are charged based on compute and storage usage.
o Separation of Costs: Storage costs are separate from compute costs, providing flexibility
and cost savings.
Virtual Warehouses:
o These are clusters or computing engines used to run queries.
o Charges are based on the duration of query execution and the amount of data stored.
Infrastructure Management:
o Snowflake manages all hardware and software infrastructure.
o Provides a Software as a Service (SaaS) platform, eliminating the need for customers to
handle installations or maintenance.
Scalability and Performance:
o Elastic and Highly Scalable: Can scale up or down based on workload.
o Fault Tolerant: Ensures high availability and reliability.
o Massively Parallel Processing (MPP): Capable of handling large workloads and complex queries by spinning up multiple clusters.
User Experience
Account Setup:
o Users can quickly create an account and set up a data warehouse without dealing with
infrastructure issues.
Interface Exploration:
o Key options available on the Snowflake interface include:
Databases
Shares
Warehouses
Worksheets
History
Other options
o User-specific information, such as username and roles, can be accessed via a dropdown
menu on the right side of the screen.
Importance of Roles
Roles in Snowflake:
o Roles play a significant role in managing access and permissions within Snowflake.
o Understanding and configuring roles is crucial for effective Snowflake usage and security
management.
1. Account Admin
2. Security Admin
3. User Admin
4. Sysadmin
Key Points
1. Account ID: The unique alphanumeric text that appears before the region (e.g., us-east-1) in your account URL.
2. Region: The region selected during account creation (e.g., us-east-1).
3. Domain: The domain currently in use.
Databases Overview
1. Selecting a Database:
o Click anywhere on the row of the database (not the hyperlink) to bring up a slider on the
right side.
o The slider allows granting privileges to the selected user or role (e.g., sysadmin).
2. Database Options:
o Create Clone
o Drop
o Transfer Ownership
3. Viewing Tables:
o Click on the hyperlink of the database name to view available tables.
o Example: Clicking on snowflake_sample_data shows multiple tables, including one as large as 10.9 TB.
4. Granting Privileges:
o Click on the row of a specific table to provide privileges.
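The slider actions correspond to ordinary GRANT statements; a minimal sketch, assuming a database named demo_db and a table named customers:
SQL
-- Grant the SYSADMIN role access to a database and one of its tables (names are placeholders)
GRANT USAGE ON DATABASE demo_db TO ROLE SYSADMIN;
GRANT USAGE ON SCHEMA demo_db.public TO ROLE SYSADMIN;
GRANT SELECT ON TABLE demo_db.public.customers TO ROLE SYSADMIN;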
Database Functionalities
Top Options:
o Tables
o Views
o Schemas
o Stages
o File Formats
o Sequences
o Pipes
Common Elements:
o Tables, Views, Schemas: Familiar from traditional data warehouses.
o Stages, File Formats, Sequences, Pipes: Unique functionalities provided by Snowflake.
Exploring Tables
1. Breadcrumb Trail:
o Shows the navigation path (e.g., Databases > Database Name > Tables >
Table Name).
2. Table Details:
o Schema information in brackets.
o Metadata for each table, including columns, ordinal values, data types, nullability, and
comments.
1. Accessing Warehouses:
o Navigate to the Warehouses section in your Snowflake account.
o You will see a list of existing warehouses, including the default compute warehouse
provided by Snowflake.
6. Granting Privileges:
o Click on the row of the warehouse to bring up the slider on the right side.
o Use the slider to grant privileges such as Modify, Monitor, Operate, and Usage to
specific roles (e.g., sysadmin).
o Optionally, enable the With Grant Option to allow the role to grant these privileges
to others.
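These slider options map to the following GRANT statement; a minimal sketch, assuming a warehouse named compute_wh:
SQL
-- Grant warehouse privileges to SYSADMIN, allowing it to re-grant them (warehouse name is a placeholder)
GRANT MODIFY, MONITOR, OPERATE, USAGE ON WAREHOUSE compute_wh TO ROLE SYSADMIN WITH GRANT OPTION;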
When your virtual warehouse experiences a high load of queries, scaling policies come into play
to manage the workload efficiently. Snowflake offers two types of scaling policies:
1. Standard (Default)
2. Economy
Cluster Shutdown Checks
o Standard (Default): 2-3 checks at one-minute intervals before a cluster is shut down.
o Economy: 5-6 checks at one-minute intervals before a cluster is shut down.
Recommendations
Production Environments: Use the Standard Policy to ensure queries are processed quickly and
efficiently, minimizing delays.
Non-Production Environments: Consider the Economy Policy to save on credits, especially if
query performance is not a critical factor.
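A minimal sketch of setting a scaling policy on a multi-cluster warehouse; the warehouse name, size, and cluster counts are illustrative (multi-cluster warehouses require the Enterprise Edition or higher):
SQL
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'ECONOMY'   -- or 'STANDARD' (the default)
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;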
Key Points
The ACCOUNTADMIN role is the topmost role with comprehensive control over the Snowflake
account.
The SECURITYADMIN role focuses on security aspects, including user and role management.
The SYSADMIN role handles system-level operations, such as managing warehouses and
databases.
Both SECURITYADMIN and SYSADMIN roles are owned by the ACCOUNTADMIN role in the
default access control hierarchy.
Recommendations
Limit Access: Due to the extensive privileges associated with the ACCOUNTADMIN role, it should
be granted to a limited number of trusted users.
Delegate Responsibilities: Use the SECURITYADMIN and SYSADMIN roles to delegate specific
responsibilities, ensuring a clear separation of duties and enhanced security.
By understanding and appropriately assigning these roles, you can effectively manage your
Snowflake account, ensuring both security and operational efficiency.
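A minimal sketch of this delegation in SQL; the role name (analyst) and user name (jane_doe) are placeholders:
SQL
-- SECURITYADMIN creates and grants roles; the new role is attached under SYSADMIN
USE ROLE SECURITYADMIN;
CREATE ROLE analyst;
GRANT ROLE analyst TO ROLE SYSADMIN;   -- keeps custom roles inside the default hierarchy
GRANT ROLE analyst TO USER jane_doe;   -- placeholder user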
Snowflake Pricing Model Overview
Understanding the Snowflake pricing model is crucial for managing costs effectively while using
the platform. Snowflake separates compute and storage costs, and charges are based on
consumption calculated using Snowflake credits. Here’s a detailed breakdown of the components
included in the Snowflake pricing model:
Snowflake offers multiple editions, each with different features and credit costs:
Standard Edition
Enterprise Edition
Business Critical Edition
Virtual Private Snowflake (VPS)
Each edition has a different cost per credit, which impacts the overall pricing. The value of
Snowflake credits varies based on the edition you are using.
Snowflake credits are the basis for calculating costs. These credits are converted into dollars or
other currencies when billing:
Snowflake includes several serverless features that are charged on a pay-as-you-go basis:
Serverless Features: Include services like Snowpipe (data ingestion), automatic clustering, and
materialized view maintenance.
Pay-As-You-Go: Charges are based on the actual usage of these features, providing flexibility and
cost efficiency.
4. Snowflake Credits
Snowflake credits are the fundamental unit of consumption for both compute and storage costs:
Compute Costs: Calculated based on the number of credits consumed by virtual warehouses.
Storage Costs: Calculated based on the amount of data stored, either on-demand or pre-
purchased.
5. Storage Costs
On-Demand Storage: Charges are based on the actual amount of data stored each month.
Pre-Purchased Storage: Offers discounted rates for committing to a certain amount of storage in
advance.
Data transfer costs are incurred when moving data in and out of Snowflake.
Cloud services costs cover the management and optimization services provided by Snowflake.
Pricing Examples
To understand the practical application of Snowflake pricing, let’s look at some examples:
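As a hedged illustration (the per-credit rate varies by edition, region, and cloud provider; $2.00 per credit is only an assumed Standard Edition rate): a Medium virtual warehouse consumes 4 credits per hour, so running it for 2 hours per day over 30 days consumes 4 × 2 × 30 = 240 credits, or roughly $480 of compute. Adding 1 TB of on-demand storage at the common $40 per TB per month brings the monthly total to approximately $520.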
By understanding these components and how they contribute to the overall cost, you can manage
and optimize your Snowflake usage effectively.
Snowflake Credits: The fundamental unit of measure for consumption on the Snowflake
platform.
Usage Tracking: Snowflake tracks all resource consumption in the form of credits, not actual
dollar amounts.
Conversion: Credits can be converted into dollars or other currencies based on the specific
pricing of the Snowflake edition you are using.
Resource Consumption: Credits are consumed when resources are used, such as:
o Virtual Warehouses: The computing engines that run your queries.
o Cloud Services Layer: Performs work such as metadata management and query
optimization.
o Serverless Features: Includes services like Snowpipe for data ingestion, automatic
clustering, and materialized view maintenance.
Virtual Warehouses
Serverless Features
Consumption Tracking: All consumption from serverless features is tracked and represented as
Snowflake credits.
Tracking Consumption
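Credit consumption can be reviewed with the built-in ACCOUNT_USAGE views; a minimal sketch (querying SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY typically requires the ACCOUNTADMIN role or granted access):
SQL
-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits DESC;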
Snowflake offers several editions, each with unique features and capabilities tailored to different
business needs. Here’s a detailed breakdown of each edition:
1. Standard Edition
2. Enterprise Edition
Advanced Level: Includes all features of the Standard Edition plus additional capabilities.
Additional Features:
o Multi-Cluster Warehouse: Supports multiple clusters for better performance.
o Extended Time Travel: Up to 90 days of time travel.
o Annual Key Rotation: Annual renewal of encryption keys.
o Materialized Views: Support for materialized views.
o Search Optimization Service: Enhanced search capabilities.
o Dynamic Data Masking: Mask sensitive data dynamically.
o External Data Tokenization: Tokenize data for enhanced security.
Feature comparison (Standard / Enterprise / Business Critical / VPS):
o Dedicated Virtual Warehouses: Yes / Yes / Yes / Yes
o Customer-Dedicated Metadata Store: No / No / No / Yes
Conclusion
Each Snowflake edition is designed to cater to different business needs, from basic data
warehousing to high-security environments. By understanding the features and capabilities of
each edition, you can choose the one that best fits your organization's requirements.
2. Database Replication
Function: Replicates databases across different regions or accounts for disaster recovery and
data sharing.
Compute Resource Usage: Uses compute resources to replicate and synchronize data.
Cost Implication: Consumes Snowflake credits based on the amount of data replicated and the
frequency of replication.
4. Automatic Clustering
Search Optimization Service
o Function: Enhances search performance by maintaining optimization structures.
o Compute Resource Usage: Uses compute resources for building and maintaining the optimization structures.
o Cost Implication: Consumes credits based on data volume and optimization complexity.
Increased Consumption: The use of these serverless features will increase the consumption of
Snowflake credits, leading to higher overall costs.
Cost Management: It is essential to monitor and manage the usage of these features to optimize
costs. Consider the following strategies:
o Usage Monitoring: Regularly monitor the usage of serverless features and their impact
on credit consumption.
o Cost Analysis: Analyze the cost-benefit ratio of using these features to ensure they
provide value relative to their cost.
o Optimization: Optimize the frequency and volume of operations to balance
performance improvements with cost efficiency.
By understanding the serverless features and their cost implications, you can make informed
decisions about their usage and manage your Snowflake costs effectively.
Snowflake offers two primary storage cost options: On-Demand and Pre-Purchased. Each option
has its own advantages and considerations. Here’s a detailed breakdown of both:
1. On-Demand Storage
Overview:
Flexibility: On-Demand storage is the most flexible and easiest way to purchase Snowflake
storage services.
Pay-As-You-Go: Similar to the pay-as-you-go model used by cloud providers like AWS, Azure, and
GCP.
Ideal For: New users or those who are unsure about their storage requirements.
Pricing:
Fixed Rate: Customers are charged a fixed rate for the services consumed and are billed in
arrears every month.
Common Price: The common price across regions is $40 per terabyte per month.
Regional Variations: Prices can vary depending on the cloud provider and region, potentially
going up or down.
2. Pre-Purchased Storage
Overview:
Pricing:
Considerations:
Analyze Usage: Pre-analyze your monthly storage requirements to ensure you are consuming
the pre-purchased storage fully.
Switching Strategy: A popular strategy is to start with on-demand storage, monitor usage, and
then switch to pre-purchased storage once you have a good understanding of your needs.
Cost: On-demand storage is $40 per TB per month (common price); pre-purchased storage is generally cheaper than on-demand.
Ideal For: On-demand suits new users or uncertain storage needs; pre-purchased suits users with predictable storage needs.
Virtual Warehouses: Snowflake's compute resources are called virtual warehouses. Each virtual
warehouse consists of a cluster of compute resources.
Pay-As-You-Go: Compute costs are based on the actual usage of virtual warehouses, measured
in credits. Users are billed per second, with a minimum of 60 seconds per usage.
Credit Consumption: The number of credits consumed depends on the size of the virtual
warehouse (e.g., X-Small, Small, Medium, Large, etc.). Larger warehouses consume more credits
per second but provide more compute power.
Auto-Suspend and Auto-Resume: Virtual warehouses can be configured to automatically
suspend when not in use and resume when needed, helping to manage and reduce compute
costs.
Scaling: Snowflake supports multi-cluster warehouses that can automatically scale out (add
more clusters) and scale in (reduce clusters) based on the workload, optimizing performance and
cost.
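A minimal sketch of configuring auto-suspend and auto-resume; the warehouse name and timeout value are illustrative:
SQL
ALTER WAREHOUSE analytics_wh SET
  AUTO_SUSPEND = 60     -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;   -- resume automatically when a query arrives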
2. Resource Monitors
Purpose: Resource monitors are used to control and manage credit consumption within
Snowflake accounts, helping to prevent unexpected high usage and costs.
Configuration: Administrators can set up resource monitors to track credit usage and define
thresholds for different actions.
Thresholds and Actions:
o Notification: Send alerts when credit usage reaches a specified threshold.
o Suspension: Automatically suspend virtual warehouses or other compute resources
when a threshold is reached to prevent further credit consumption.
o Custom Actions: Define custom actions to be taken when thresholds are met, such as
running specific SQL commands or triggering external processes.
Granularity: Resource monitors can be applied at different levels, such as account-wide, specific
warehouses, or user-defined groups of warehouses.
Monitoring Periods: Administrators can define monitoring periods (e.g., daily, weekly, monthly)
to reset and track credit usage over specific intervals.
Definition: Specifies the number of Snowflake credits allocated to the monitor for a specified
frequency interval.
Frequency Intervals: Can be set to daily, weekly, or monthly.
Reset Mechanism: The credits used against the quota reset to zero at the beginning of each specified interval.
Example: If the credit quota is set to 100 credits for September, the used-credit counter resets to zero at the beginning of October.
Usage Tracking: Tracks credits consumed by both user-managed virtual warehouses and virtual
warehouses used by cloud services.
Alert Mechanism: If the combined credit consumption (e.g., 300 credits by virtual warehouses and 200 credits by cloud services) reaches the limit (e.g., 500 credits), an alert is triggered automatically.
2. Schedule
Default Schedule: Starts monitoring credit usage immediately and resets used credits to zero at
the beginning of each calendar month.
Custom Schedule Properties:
o Frequency: Interval at which used credits reset relative to the specified start date and
time. Options include daily, weekly, or monthly.
o Start Date and Time: Timestamp when the resource monitor starts monitoring the
assigned warehouses.
o End Date and Time: Timestamp when Snowflake suspends the warehouses associated
with the resource monitor, regardless of whether the used credits reached any
thresholds.
3. Monitor Level
Definition: Specifies whether the resource monitor is used to monitor credit usage for the entire
account or specific individual warehouses.
Options:
o Account Level: Monitors all warehouses in the account.
o Warehouse Level: Monitors specific individual warehouses.
Importance: This property must be set; otherwise, the resource monitor does not monitor any
credit usage.
When configuring a resource monitor in Snowflake, it is crucial to define actions that will be
triggered when the credit usage reaches specified thresholds. These actions help manage and
control credit consumption effectively. Below are the key actions that can be set for a resource
monitor:
1. Suspend
Description: This action sends a notification to all account administrators with notifications
enabled and suspends all assigned warehouses.
Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o Suspension: All assigned warehouses are suspended after all currently executing
statements are completed.
Consideration: This action does not immediately stop running queries, which means there could
be additional credit consumption beyond the threshold if queries take time to complete.
Use Case: Suitable when you want to ensure that ongoing queries are not abruptly terminated
but still want to control credit usage.
2. Suspend Immediately
Description: This action sends a notification to all account administrators with notifications
enabled and suspends all assigned warehouses immediately.
Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o Immediate Suspension: All assigned warehouses are suspended immediately, and any
running queries are stopped.
Consideration: This action ensures that credit consumption stops exactly at the threshold, but it
may result in incomplete queries.
Use Case: Suitable when you need a hard stop on credit usage to prevent any consumption
beyond the specified limit.
3. Notify
Description: This action only sends a notification to all account administrators with notifications
enabled.
Behavior:
o Notification: Administrators receive an alert when the credit usage reaches the specified
threshold.
o No Suspension: No action is taken on the virtual warehouses; they continue to run as
usual.
Consideration: This action is purely informational and does not impact the operation of the
warehouses.
Use Case: Suitable when you want to monitor credit usage and be alerted without interrupting
the operations of the warehouses.
Example Scenario
Let's assume you have set a credit limit of 100 credits for a resource monitor. You can define
actions based on different thresholds:
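A minimal sketch of defining such a monitor; the monitor name (limit_100), warehouse name (analytics_wh), and threshold choices are illustrative:
SQL
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE RESOURCE MONITOR limit_100 WITH CREDIT_QUOTA = 100
  TRIGGERS ON 50 PERCENT DO NOTIFY               -- informational alert at 50 credits
           ON 75 PERCENT DO NOTIFY               -- second alert at 75 credits
           ON 90 PERCENT DO SUSPEND              -- suspend after running queries finish
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;  -- hard stop at the quota
ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = limit_100;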
Understanding the suspension and resumption of virtual warehouses is crucial for effective
resource management. Here is a detailed explanation of how these processes work in Snowflake:
When a resource monitor's credit usage reaches a defined threshold, the assigned virtual
warehouses are suspended. This suspension can occur under two actions:
1. Suspend: Warehouses are suspended after completing all currently executing queries.
2. Suspend Immediately: Warehouses are suspended immediately, stopping any running queries.
5. Monitor is Dropped:
o Explanation: If the resource monitor itself is dropped, all warehouses tied to that
monitor can be auto-resumed.
o Example: If the resource monitor is deleted, the warehouses that were assigned to it will
be resumed automatically.
Important Note
Delay in Suspension: When credit quota thresholds are reached for a resource monitor, the
assigned warehouses may take some time to suspend, even when the action is "suspend
immediately." This delay can result in additional credit consumption beyond the threshold.
Let's break down the example provided to understand how resource monitors work in Snowflake:
Warehouse 1: Sales
Warehouse 2: Marketing
Warehouse 3: Tech
Warehouse 4: Finance
Warehouse 5: HR
Since the total credit consumption reaches 5,000 credits, Resource Monitor 1 (account level) will be triggered, and its configured action (notification and/or suspension) will apply to all warehouses in the account.
Here is a summary of the steps to create a resource monitor in the Snowflake Web UI and why notifications should be enabled first:
5. Set Up Triggers:
o Define triggers to specify actions when certain thresholds are met:
Threshold: Set the percentage of the credit quota that will trigger an action.
Action: Choose the action to be taken (e.g., notify, suspend, or resume
warehouses).
6. Enable Notifications:
o Ensure that notifications are enabled to alert administrators or users when thresholds
are reached. This is crucial for proactive management and avoiding unexpected
disruptions.
By following these steps and enabling notifications, you can effectively manage and monitor
your Snowflake resources, ensuring optimal performance and cost efficiency.
/*
To create a monitor that is similar to the first example, but suspends at 90% and suspends
immediately at 100% to prevent all warehouses in the account from consuming credits after the
quota has been reached:
*/
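-- A hedged sketch of the statement described above; the monitor name (limit1) and quota value are placeholders
use role accountadmin;
create or replace resource monitor limit1 with credit_quota = 1000
  triggers on 90 percent do suspend
           on 100 percent do suspend_immediate;
alter account set resource_monitor = limit1;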
/*
To create a monitor that is similar to the first example, but lets the assigned warehouse exceed the quota by 10%, and includes two notification actions to alert account administrators as the used credits reach the halfway and three-quarters points of the quota:
*/
use role accountadmin;
-- hedged sketch: the monitor name, warehouse name, and quota value are placeholders
create or replace resource monitor limit1 with credit_quota = 1000
  triggers on 50 percent do notify
           on 75 percent do notify
           on 100 percent do suspend
           on 110 percent do suspend_immediate;
alter warehouse wh1 set resource_monitor = limit1;
/*
To create an account-level resource monitor that starts immediately (based on the current
timestamp), resets monthly on the same day and time, has no end date or time, and suspends the
assigned warehouse when the used credits reach 100% of the quota:
*/
use role accountadmin;
-- hedged sketch: the monitor name and quota value are placeholders
create or replace resource monitor limit1 with credit_quota = 1000
  frequency = monthly
  start_timestamp = immediately
  triggers on 100 percent do suspend;
alter account set resource_monitor = limit1;
/*
To create a resource monitor that starts at a specific date and time in the future, resets weekly on
the same day and time,
has no end date or time, and performs two different suspend actions at different thresholds on
two assigned warehouses:
*/
use role accountadmin;
-- hedged sketch: the monitor name, quota, start timestamp, and warehouse names are placeholders
create or replace resource monitor limit1 with credit_quota = 2000
  frequency = weekly
  start_timestamp = '2025-01-06 00:00 PST'
  triggers on 80 percent do suspend
           on 100 percent do suspend_immediate;
alter warehouse wh1 set resource_monitor = limit1;
alter warehouse wh2 set resource_monitor = limit1;
In this section, we will discuss micro partitioning in Snowflake and how it enhances query
processing speed. Before diving into Snowflake's approach, let's understand traditional
partitioning methods used in data warehouses.
Definition: Partitioning is the process of dividing a table into multiple chunks to improve query
processing and data retrieval speed.
Characteristics:
Benefits:
Improved Performance: Partitioning large tables can lead to acceptable performance and better
scalability.
Limitations:
1. Maintenance Overhead:
o Continuous monitoring and repartitioning are required as data size increases.
o Repartitioning involves significant maintenance efforts to ensure optimal performance.
2. Data Skewness:
o Uneven distribution of data across partitions can occur.
o Example: Partitioning by a gender column with more observations for females than
males leads to uneven partition sizes.
Advantages:
1. Automated Management:
o Snowflake automatically handles partitioning, eliminating the need for manual
intervention.
o Micro partitions are created and maintained by Snowflake without user input.
4. Dynamic Partitioning:
o Unlike traditional static partitioning, Snowflake's micro partitions are dynamic and adapt
to data changes.
o This reduces the need for manual repartitioning and maintenance.
Micro partitioning is a unique and efficient way of partitioning data in Snowflake, which differs
significantly from traditional partitioning methods used in other data warehouses. This approach
offers several benefits that overcome the limitations of static partitioning.
2. Columnar Storage:
o Snowflake uses columnar storage for its micro partitions.
o Each micro partition contains a group of rows stored by each column, optimizing storage
and query performance.
3. Metadata Storage:
o Snowflake stores metadata about all rows in a micro partition.
o The metadata includes the range of values for each column, the number of distinct
values, and additional properties used for optimization and efficient query processing.
4. Optimized Storage:
o The columnar storage format and efficient compression techniques used by Snowflake
reduce the overall storage size.
o This leads to cost savings and better performance.
5. Scalability:
o Snowflake's micro partitioning allows for seamless scalability, accommodating growing
data volumes without compromising performance.
1. Data Loading:
o When data is loaded into a Snowflake table, it is automatically divided into micro
partitions.
o Each micro partition stores a group of rows in a columnar format.
2. Metadata Management:
o The cloud services layer captures and stores metadata for each micro partition.
o This metadata includes the range of values for each column, the number of distinct
values, and other properties for optimization.
3. Query Execution:
o During query execution, the cloud services layer uses the metadata to identify the
relevant micro partitions.
o Only the necessary micro partitions are accessed, improving query performance.
4. Optimization:
o Snowflake continuously optimizes the micro partitions and metadata to ensure efficient
query processing.
o This includes reorganizing partitions and updating metadata as needed.
Micro partitioning in Snowflake offers several significant benefits that enhance performance,
reduce maintenance overhead, and optimize storage. Let's delve into these benefits in detail:
1. Automated and Dynamic Partitioning
Automatic Creation: Micro partitions are automatically created by Snowflake without the need
for explicit definition or maintenance by users. This reduces the manual effort required to
manage partitions.
Dynamic Adjustment: Snowflake dynamically adjusts the micro partitions based on the data size
and usage patterns, ensuring optimal performance without user intervention.
Negligible Maintenance: Since Snowflake handles the creation and management of micro
partitions, the maintenance overhead for users is minimal. This is particularly beneficial for large
tables containing terabytes of data.
Scalability: Snowflake can create billions of micro partitions as needed, allowing it to efficiently
manage very large datasets without requiring manual partitioning.
Size Range: Each micro partition can store between 50 MB to 500 MB of uncompressed data.
This small size enables fine-grained tuning for faster queries.
Uniformity: Snowflake ensures that micro partitions are uniformly small, which helps prevent
data skew and ensures balanced performance across partitions.
4. Fine-Grained Pruning
Efficient Query Processing: Snowflake's cloud services layer knows the exact placement of each
row within the micro partitions. When a query is run, Snowflake scans only the relevant micro
partitions, significantly speeding up query execution.
Example: If a query filters data for a specific country (e.g., India), Snowflake will only scan the
micro partitions containing data for India, rather than scanning the entire table.
Overlapping Ranges: Micro partitions can overlap in their range of values, which helps distribute
data evenly and prevent skew. This ensures that no single partition becomes a bottleneck.
Balanced Distribution: The uniform size and overlapping ranges of micro partitions contribute to
a balanced distribution of data, enhancing overall performance.
6. Columnar Storage
Independent Column Storage: Columns are stored independently within micro partitions,
enabling efficient scanning of individual columns. Only the columns referenced by a query are
scanned, reducing the amount of data processed.
Example: In a customer table, if a query requests only the department ID and customer name,
Snowflake will scan only these two columns, ignoring the rest. This improves query efficiency
and reduces compute costs.
7. Efficient Compression
Column Compression: Columns are compressed individually within micro partitions. Snowflake
automatically determines the most efficient compression algorithm for each column, optimizing
storage and performance.
Cost Savings: Efficient compression reduces storage costs and enhances query performance by
minimizing the amount of data that needs to be processed.
Reduced Compute Costs: By scanning only the relevant data and using efficient compression,
Snowflake reduces the compute costs associated with query processing.
Faster Query Results: The combination of fine-grained pruning, columnar storage, and efficient
compression leads to faster query results, saving time for users.
The logical structure of a Snowflake table is what users typically interact with when querying or
managing data. This structure includes columns and rows, similar to traditional databases.
Example Table:
This logical view is straightforward and familiar to anyone who has worked with relational
databases. However, the physical storage of this data in Snowflake is quite different and
optimized for performance and scalability.
Snowflake uses a unique approach to store data physically, leveraging micro partitions and
columnar storage. This section explains how data is stored in Snowflake's storage layer.
Micro Partitions:
Automatic Creation: Micro partitions are created automatically by Snowflake, with no manual
intervention required from users.
Size: Each micro partition ranges from 50 MB to 500 MB of uncompressed data.
Uniform Size: Snowflake ensures that micro partitions are uniformly sized, which helps in
efficient data management and query processing.
Each micro partition stores data in a columnar format, meaning each column's data is stored
separately within the partition.
Columnar Storage:
Column Blocks: In each micro partition, data is stored in blocks by column. For example:
o Block for column 'type'
o Block for column 'name'
o Block for column 'country'
o Block for column 'date'
Efficiency: This columnar storage format allows Snowflake to scan only the necessary columns
when executing queries, improving performance and reducing I/O.
Row Store: Stores entire rows as single blocks. This is less efficient for analytical queries that
often require only specific columns.
Columnar Store: Stores each column as separate blocks. This is more efficient for analytical
queries as it allows for selective column scanning.
Metadata Storage:
Each file (micro partition) contains a header that stores metadata such as:
o Minimum value of each column
o Maximum value of each column
o Number of distinct values in each column
Logical Structure:
Physical Structure:
1. Centralized Storage: The bottom layer where all data is stored in micro partitions.
2. Multi-Cluster Compute (Virtual Warehouse Layer): The middle layer responsible for executing
queries.
3. Cloud Services Layer: The top layer, often referred to as the brain of the system, responsible for
metadata management, query optimization, and execution planning.
Let's walk through the process of how a query is executed in Snowflake, focusing on how micro
partitions are accessed:
1. Query Submission:
o A user submits a query, for example: SELECT type, name, country FROM
employee WHERE date = '11/2'.
3. Metadata Utilization:
o The Cloud Services Layer uses metadata to determine which micro partitions contain
data for '11/2'.
o It processes additional information and incorporates it into the execution plan.
Micro partitions can contain overlapping data ranges. This is managed as follows:
Overlapping Data: Micro partitions 2 and 3 both contain data for '11/2'.
Data Insertion Timing: Data is loaded into micro partitions as it is inserted into the table. For
example:
o After inserting 3 rows for '11/2' into Micro Partition 2, 3 rows for '11/3' are inserted,
filling the partition.
o Subsequent rows for '11/2' are then inserted into Micro Partition 3.
Data clustering in Snowflake plays a vital role in optimizing data retrieval and query
performance. By clustering data, Snowflake ensures that similar kinds of data are stored together
in common micro partitions, which enhances the efficiency of data access.
Purpose:
Clustering organizes data within micro partitions to ensure that similar data is stored together.
This process optimizes data retrieval, making queries faster and more efficient.
Example:
Consider a table with rows for dates '11/2', '11/3', and '11/4'.
By clustering on the date column, Snowflake ensures that data for the same date is stored in the
same or adjacent micro partitions.
This prevents data for the same date from being scattered across multiple micro partitions,
which would degrade query performance.
Natural Clustering
Default Behavior:
Snowflake automatically clusters data along natural dimensions, such as date, when data is
initially loaded into tables.
This automatic clustering produces well-clustered tables that are optimized for common query
patterns.
Limitations:
Over time, as users perform various operations (inserts, updates, deletes), the natural clustering
may become less optimal.
In such cases, the default clustering may not be the best choice for sorting or ordering data
across the table.
Users can define their own cluster keys to optimize data retrieval based on specific query
patterns.
A cluster key is a column or set of columns that Snowflake uses to cluster data within micro
partitions.
Users can test multiple clustering keys to determine which performs better for their specific
queries.
This involves analyzing query performance and adjusting the cluster keys accordingly.
Query Performance:
For very large tables, clustering becomes crucial to ensure efficient query performance.
Unsorted or partially sorted data can significantly impact query performance, particularly for
large datasets.
By clustering data based on a cluster key, Snowflake can quickly access the relevant micro
partitions, avoiding unnecessary scans.
This accelerates query performance and reduces compute costs.
Metadata Collection:
When data is inserted or loaded into a table, Snowflake collects and records clustering metadata
for each micro partition.
This metadata includes information about the clustering key and the performance of queries.
Automatic Re-Clustering:
Once a clustering key is defined, Snowflake's cloud services layer automatically performs re-
clustering as needed.
This ensures that the table remains well-clustered over time, with minimal maintenance
overhead for users.
Maintenance:
There is no ongoing maintenance required for clustering once the key is defined, unless the user
decides to change or drop the clustering key in the future.
Example Scenario
Querying by Date:
Efficiency:
If the date column is well-clustered, Snowflake may need to scan only one or two micro
partitions to retrieve the data.
This efficient access reduces the time and resources required for query execution.
Clustering keys in Snowflake are designed to optimize data retrieval from tables by organizing
data within micro partitions. This process enhances query performance by reducing the amount
of data scanned during queries.
Purpose: Clustering keys perform clustering on micro partitions to optimize data retrieval.
Definition: Clustering keys can be defined on a single column or multiple columns in a table.
They can also be expressions based on columns.
Date Column: You can define a clustering key based on a date column. Instead of using the
complete date, you can extract the month and use it as the clustering key.
o Expression: Extract the month from the date column and use it as the clustering key.
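A minimal sketch of both options, assuming a table named sales with a sale_date column:
SQL
-- Cluster on the column itself
ALTER TABLE sales CLUSTER BY (sale_date);
-- Or cluster on an expression derived from the column, e.g. the month
ALTER TABLE sales CLUSTER BY (MONTH(sale_date));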
Benefits of Clustering Keys
Co-locating Data: Clustering keys ensure that similar data is stored together in the same micro
partitions.
Improved Query Performance: Clustering keys are useful for very large tables as they improve
scan efficiency by skipping data that doesn't meet the filter criteria.
Slow Queries: Use clustering keys when queries on the table are running slower than expected
or have degraded over time.
Large Clustering Depth: Use clustering keys when the clustering depth (overlapping in micro
partitions) is very large.
Considerations
Computational Cost: Clustering can be computationally expensive. Only cluster when queries
will benefit substantially from it.
Testing Clustering Keys: Test clustering keys on a table and check its clustering depth. Monitor
query performance to decide if you should keep the clustering key.
In this section, we will delve into the concepts of micro partitioning overlapping and clustering
depth in Snowflake. These concepts are crucial for understanding how Snowflake optimizes data
storage and query performance.
Micro Partitioning Overlapping
Definition:
Overlapping: Overlapping occurs when the same data values are stored in multiple micro
partitions. This can happen due to the way data is inserted or updated in the table.
Example:
Consider a table with 24 rows stored across four micro partitions. If the same date value ('11/2')
appears in multiple micro partitions, this is an example of overlapping.
Impact:
Query Performance: Overlapping can affect query performance because Snowflake may need to
scan multiple micro partitions to retrieve the required data.
Storage Efficiency: Overlapping can also impact storage efficiency as the same data is stored in
multiple locations.
Clustering Depth
Definition:
Clustering Depth: Clustering depth is a measure of how well the data in a table is clustered. It
indicates the number of micro partitions that need to be scanned to retrieve the required data.
Calculation:
Clustering depth is calculated based on the number of overlapping micro partitions and the
distribution of data within those partitions.
Example:
If a query needs to retrieve data for a specific date ('11/2') and this date is stored in three
overlapping micro partitions, the clustering depth for this query is three.
Impact:
Query Performance: Higher clustering depth means more micro partitions need to be scanned,
which can slow down query performance.
Optimization: Lower clustering depth indicates better clustering and more efficient data
retrieval.
Clustering Metadata:
Snowflake maintains clustering metadata for each table, which includes:
o Total Number of Micro Partitions: The total number of micro partitions that make up
the table.
o Overlapping Micro Partitions: The number of micro partitions containing overlapping
values.
o Clustering Depth: The depth of clustering for the overlapping micro partitions.
Usage:
This metadata is used by Snowflake's cloud services layer to optimize query execution and
improve performance.
2. Pruning Columns:
o Within the remaining micro partitions, Snowflake prunes the columns that are not
needed for the query.
o This further reduces the amount of data that needs to be processed.
Example Scenario
Query:
Suppose you run a query to retrieve data for '11/2' from a table with 24 rows stored across four
micro partitions.
Steps:
2. Pruning Columns:
o Within the remaining micro partitions, Snowflake prunes the columns that are not
needed for the query.
o For example, if the query only requests the 'type' and 'name' columns, the 'country' and
'date' columns are pruned.
Result:
The query execution is optimized, and the required data is retrieved efficiently.
Clustering Depth in Snowflake
Introduction
Clustering depth is a critical metric in Snowflake that helps monitor the efficiency of data
clustering within a table. It tracks the overlapping of micro partitions and measures the average
depth of these overlaps for specified columns.
1. Definition:
o Clustering depth measures the average number of overlapping micro partitions for
specified columns in a table.
o A smaller average depth indicates better clustering.
2. Advantages:
o Monitoring Clustering Health: Helps monitor the clustering health of a large table over
time.
o Determining Need for Clustering Keys: Assists in deciding whether a large table would
benefit from explicitly defining a clustering key.
4. Performance Monitoring:
o Clustering depth alone is not a perfect measure of clustering efficiency.
o Query performance over time should also be monitored to determine if the table is well-
clustered.
o If queries perform as expected, the table is likely well-clustered.
o If query performance degrades over time, the table may benefit from re-clustering or
defining a new clustering key.
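Clustering depth and related statistics can be checked with system functions; a minimal sketch, assuming a table named sales clustered on sale_date:
SQL
-- Average overlap depth for the given column(s)
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(sale_date)');
-- Fuller clustering statistics (total partitions, overlaps, depth histogram) returned as JSON
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');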
Scenario:
A table has five micro partitions, and a column contains values from A to Z.
Layers of Clustering:
1. Initial Layer:
o All five micro partitions contain overlapping values from A to Z.
o Overlapping micro partitions count: 5
o Clustering depth: 5
2. First Clustering:
o Data is aggregated into ranges A-D and E-J.
o Three micro partitions still contain overlapping values from K to Z.
o Overlapping micro partitions count: 3
o Clustering depth: 3
3. Second Clustering:
o Further clustering separates values A-D, E-J, and reduces overlap for K-N and L-Q.
o Overlapping micro partitions count: 3
o Clustering depth: 2
4. Final Layer:
o All micro partitions are separated, with no overlapping values.
o Overlapping micro partitions count: 0
o Clustering depth: 1 (minimum value for overlap depth is one or greater)
Clustering and reclustering are essential features in Snowflake that optimize data retrieval and
query performance. Once a clustering key is defined, Snowflake automatically manages the
clustering and reclustering processes, ensuring that data remains well-organized and efficiently
accessible.
Clustering in Snowflake
Clustering Key:
A clustering key is defined on one or more columns of a table to organize data within micro
partitions.
Example: Clustering a table based on the date column.
Automatic Reclustering:
Snowflake automatically reclusters tables based on the defined clustering key.
This process reorganizes the data to maintain optimal clustering as operations like insert, update,
delete, merge, and copy are performed.
Benefits:
Improved Query Performance: By keeping similar data together, Snowflake can quickly access
the relevant micro partitions, reducing query execution time.
Reduced Maintenance: Users do not need to manually manage the clustering operations, as
Snowflake handles this automatically.
Reclustering Process
Over time, as data is inserted, updated, or deleted, the clustering of a table may become less
optimal.
Periodic reclustering is required to maintain the efficiency of data retrieval.
Snowflake uses the clustering key to reorganize the column data, ensuring that related records
are relocated to the same micro partition.
This process ensures that similar data resides in the same micro partitions, optimizing query
performance.
Example Scenario:
Consider a table with four micro partitions, each containing six rows.
The table has columns: date, country, name, and type.
We focus on the date column for clustering.
Initial State:
Micro partitions contain overlapping values for the date column (e.g., '11/2' appears in multiple
partitions).
Reclustering:
Result:
After reclustering, micro partitions are sorted based on date and then type.
Example:
o Micro Partition 1: Contains all rows for '11/2' with similar type values.
o Micro Partition 2: Contains remaining rows for '11/2' with different type values.
Query Efficiency:
When querying for data on '11/2', Snowflake only scans the relevant micro partitions (e.g., Micro
Partitions 1 and 2).
This reduces the number of partitions scanned and improves query performance.
Managing Reclustering
Commands:
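A minimal sketch of the commands typically involved, assuming a table named sales:
SQL
-- Define or change the clustering key
ALTER TABLE sales CLUSTER BY (sale_date);
-- Pause or resume Snowflake's automatic reclustering for the table
ALTER TABLE sales SUSPEND RECLUSTER;
ALTER TABLE sales RESUME RECLUSTER;
-- Remove the clustering key entirely
ALTER TABLE sales DROP CLUSTERING KEY;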
1. On the AWS Free Tier page, you will see an option to create a free tier account.
2. Before creating an account, scroll down to explore the services available under the free tier.
3. Note that the free tier includes over 60 products, but usage is limited.
4. The free tier is available for 12 months and includes some short-term free trial offers starting
from the activation date of each service.
5. To see the services, you can filter by various types and product categories.
1. Select your role and area of interest (e.g., Business Analyst, AI, and Machine Learning).
2. Click "Submit".
3. Sign in to the console by clicking on the provided link or directly accessing the AWS Management
Console.
4. Choose "Root User" and enter your email address and password.
Step 9: Explore AWS Services
1. Once logged in, click on the drop-down menu to see the list of available AWS services.
2. For the Snowflake tutorials, focus on exploring Amazon S3 and IAM (under Security, Identity, and
Compliance).
1. You will see a message confirming that your bucket has been successfully created.
2. The bucket will be listed with its name, region, access type, and creation timestamp.
1. Navigate to the IAM service under the Security, Identity, and Compliance section.
2. IAM is used to manage access to AWS services and resources securely.
1. Create IAM users and roles to manage permissions for accessing S3 buckets and other AWS
resources.
2. Assign appropriate policies to the users and roles to ensure they have the necessary permissions
for Snowflake integration.
In the previous session, we learned about Amazon S3 and how to create buckets and folders. In
this session, we will focus on IAM (Identity and Access Management), which is essential for
managing access to AWS services securely.
Accessing IAM
Step 1: Navigate to IAM
IAM allows you to manage access to AWS services and resources securely. It enables you to
create and manage AWS users and groups, and use permissions to allow and deny their access to
AWS resources.
1. Users: Individual accounts that represent a person or service needing access to AWS resources.
2. Groups: Collections of users, which can be assigned specific permissions.
3. Roles: Permissions assigned to AWS resources, allowing them to interact with other AWS
services.
4. Policies: Documents that define permissions and can be attached to users, groups, and roles.
In the previous lecture, we created a group named "test policies". Now, we will move forward
and create a user. This user will have access to your AWS account and can use various services
based on the permissions you assign.
1. Access Key ID and Secret Access Key: These are needed for programmatic access.
2. Download CSV: Download the CSV file containing the Access Key ID and Secret Access Key.
3. Provide Credentials: Share the URL, Access Key ID, Secret Access Key, username, and password
with the new user.
1. User List: The new user (snowflake) will appear in the user list.
2. User Details: Click on the username to see the details.
o Permissions: Check the policies attached to the user.
o Groups: Verify if any groups are attached.
o Security Credentials: View the Access Key ID and its status.
In this lecture, we will learn about IAM roles and how to create them. IAM roles are essential for
granting permissions to various applications or services within an AWS account. They allow
services like S3 and AWS Glue to interact with each other and enable external applications, such
as Snowflake, to access AWS resources.
IAM roles are a way to grant permissions to applications or services within your AWS account.
They are used to allow different AWS services to interact with each other or to enable external
applications to access AWS resources securely.
For instance, if you want Snowflake to access data stored in your AWS S3 buckets, you need to
create an IAM role that grants Snowflake the necessary permissions to interact with S3.
1. Click on "Create role": This will start the process of creating a new role.
2. Select Trusted Entity:
o Choose "Another AWS account" for this example.
o Enter your AWS account ID. To find your account ID, go to "My Security Credentials" and
copy the account ID.
1. Attach Policies:
o Search for S3.
o Select AmazonS3FullAccess to grant full access to S3.
2. Click "Next: Tags".
1. Skip Tags: For this example, we will not add any tags.
2. Click "Next: Review".
1. Role List: The new role (snowflake-role) will appear in the roles list.
2. Role Details: Click on the role name to see the details.
o Role ARN: Note the ARN (Amazon Resource Name) of the role.
o Permissions: Verify the attached policies.
o Trust Relationships: Check the trust relationships.
1. Click on "Trust relationships": In the role details, click on the "Trust relationships" tab.
2. Edit Trust Relationships:
o Click on "Edit trust relationship".
o You will see a JSON policy document. This document defines which entities can assume
the role.
o You will need to update this document with the ARN and external ID provided by
Snowflake when creating the stage object in Snowflake.
In this lecture, we will learn how to upload data to S3 buckets. We will create folders within an
S3 bucket and upload files in different formats such as CSV and Parquet.
Step-by-Step Guide to Uploading Data to S3 Buckets
Step 1: Access S3 Console
1. Locate Your Bucket: Find the bucket you created earlier (e.g., test-snowflake).
2. Open the Bucket: Click on the bucket name to access it.
1. Create Folders:
o Navigate to the folder you created earlier (e.g., snowflake).
o Click on "Create folder".
o Name the first folder CSV and click "Save".
o Repeat the process to create another folder named parquet.
"Version": "2012-10-17",
"Statement": [
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<your-account-details>"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "<your-external-id>"
In this lecture, we will create a table schema or table metadata in Snowflake. This involves
writing a CREATE TABLE DDL (Data Definition Language) statement to define the structure
of the table, including the columns and their data types. This is the first step in the process of
fetching data from AWS S3 and loading it into Snowflake.
1. Review the Data: The data loaded into S3 consists of 26 columns related to healthcare providers,
treatments, charges, payments, discharges, and other characteristics.
2. Column Overview:
o Columns include provider information, treatment descriptions, charges, payments,
discharges, regions, and reimbursement percentages.
o The data also includes provider ID, name, state, street address, and zip code.
1. Column Names and Data Types: Based on the nature of the data, assign appropriate data types
to each column. For example:
o Numeric columns (e.g., average covered payments) should be defined as NUMBER.
o Character columns (e.g., provider name, referral region) should be defined as VARCHAR.
o Columns with special characters (e.g., total payments with dollar and million signs)
should also be defined as VARCHAR.
1. SQL Statement: Write the SQL statement to create the table in Snowflake. Here is an example
based on the provided data:
SQL
CREATE TABLE healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER,
-- Add other columns as needed
-- Ensure to match the data types with the nature of the data
);
Step 4: Execute the DDL Statement in Snowflake
1. Run the Query: Execute the CREATE TABLE statement in Snowflake to create the empty
healthcare table.
2. Verify Table Creation:
o Refresh the database to see the newly created healthcare table under the public
schema.
o Preview the table to ensure it is created with the correct columns and data types.
Example Execution
SQL
CREATE TABLE healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER,
-- Add other columns as needed
-- Ensure to match the data types with the nature of the data
);
In this lecture, we will create an integration object in Snowflake that establishes a connection
between AWS S3 and Snowflake. This integration object will allow Snowflake to access data
stored in S3 buckets.
1. SQL Statement: The SQL statement to create an integration object in Snowflake is as follows:
SQL
CREATE OR REPLACE STORAGE INTEGRATION S3_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = '<AWS_ROLE_ARN>'
STORAGE_ALLOWED_LOCATIONS = ('s3://<BUCKET_NAME>/<FOLDER_NAME>/',
's3://<ANOTHER_BUCKET_NAME>/');
Step 2: Retrieve AWS Role ARN
1. Navigate to IAM: Go to the AWS Management Console, then to the IAM service.
2. Find the Role: Locate the role you created earlier (e.g., snowflake-role).
3. Copy the Role ARN: Copy the Role ARN from the role details.
1. SQL Statement: Use the following SQL statement to create the integration object in Snowflake.
Replace <AWS_ROLE_ARN>, <BUCKET_NAME>, and <FOLDER_NAME> with your actual values.
SQL
CREATE OR REPLACE STORAGE INTEGRATION S3_INT
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://test-snowflake/snowflake/', 's3://test-xyz-snowflake/');
Step 5: Execute the SQL Statement
1. Run the Query: Execute the SQL statement in the Snowflake query editor.
2. Verify Execution: Ensure that the integration object is created successfully.
In this lecture, we will describe the integration object we created in Snowflake and update the
trust relationships in AWS IAM to establish a secure connection between AWS S3 and
Snowflake.
1. SQL Statement: Use the following SQL statement to describe the integration object in Snowflake:
SQL
DESC INTEGRATION S3_INT;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Review the Results: The results will display several properties of the integration object. Key
properties include:
o STORAGE_AWS_IAM_USER_ARN
o STORAGE_EXTERNAL_ID
2. Navigate to IAM: Go to the AWS Management Console and open the IAM service.
3. Find the Role: Locate the role you created earlier (e.g., snowflake-role).
4. Edit Trust Relationships:
o Go to the "Trust relationships" tab.
o Click on "Edit trust relationship".
o Update the policy document with the STORAGE_AWS_IAM_USER_ARN and
STORAGE_EXTERNAL_ID values from Snowflake.
1. Policy Document: Update the trust policy document in IAM with the following values:
o AWS IAM User ARN: Replace the existing ARN with the STORAGE_AWS_IAM_USER_ARN
from Snowflake.
o External ID: Replace the existing external ID with the STORAGE_EXTERNAL_ID from
Snowflake.
JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/snowflake-user"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "external-id-from-snowflake"
}
}
}
]
}
3. Update the Policy: Paste the updated ARN and external ID into the policy document and click
"Update Trust Policy".
Example Execution
SQL
DESCRIBE INTEGRATION S3_INT;
2. Review the Results:
o Identify the STORAGE_AWS_IAM_USER_ARN and STORAGE_EXTERNAL_ID from the
results.
JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/snowflake-user"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "external-id-from-snowflake"
}
}
}
]
}
In this lecture, we will go through the steps to load data from AWS S3 into Snowflake. This
involves creating a file format, creating a stage object, and using the COPY INTO command to
load the data into the Snowflake table.
1. SQL Statement: Use the following SQL statement to create a file format for CSV files:
SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the file format is successfully created.
1. SQL Statement: Use the following SQL statement to create a stage object:
SQL
CREATE OR REPLACE STAGE demo_db.public.ext_stage
URL = 's3://test-snowflake/snowflake/CSV/'
STORAGE_INTEGRATION = S3_INT
FILE_FORMAT = demo_db.public.csv_format;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the stage object is successfully created.
1. SQL Statement: Use the following SQL statement to copy data into the healthcare table:
SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Review the Results: Check the results to see if the data was partially loaded due to errors.
1. Identify the Issue: The error occurs because some values in the CSV file contain
commas, which is the delimiter.
2. Preview the Data: Use the following steps to preview the data in S3:
o Go to the S3 console.
o Navigate to the test-snowflake/snowflake/CSV/ folder.
o Select the health.csv file and click on "Select from".
o Choose the file format as CSV and click on "Show file preview".
3. Download and Inspect the File: Download the CSV file and inspect it to identify rows
with commas within values.
4. Modify the COPY Command: Use the ON_ERROR = 'CONTINUE' option to bypass rows
with errors:
SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';
5. Run the Query: Execute the SQL statement in the Snowflake query editor.
6. Review the Results: Check the results to see the number of rows loaded and any errors
encountered.
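If the problematic values are quoted in the source file, another option (not used in this walkthrough) is to keep the comma delimiter and declare the quote character in the file format. A minimal sketch, assuming double-quoted fields:
SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_quoted_format
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;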
In this lecture, we will load the complete data from AWS S3 into Snowflake by changing the
delimiter in the CSV file to avoid issues with commas within the data values. We will use a pipe
(|) as the delimiter and follow the steps to update the file, upload it to S3, and load it into
Snowflake.
1. Navigate to S3:
o Go to the AWS S3 console.
o Navigate to the appropriate bucket and folder (e.g.,
test-snowflake/snowflake/CSV/).
2. Upload the Updated File: Upload the pipe-delimited version of the file (e.g., health_pipe.csv) to this folder.
1. SQL Statement: Use the following SQL statement to create a file format with a pipe delimiter:
SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = '|'
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the file format is successfully created.
1. SQL Statement: Use the following SQL statement to create a stage object for the updated file:
SQL
CREATE OR REPLACE STAGE demo_db.public.ext_stage
URL = 's3://test-snowflake/snowflake/CSV/health_pipe.csv'
STORAGE_INTEGRATION = S3_INT
FILE_FORMAT = demo_db.public.csv_format;
2. Run the Query: Execute the SQL statement in the Snowflake query editor.
3. Verify Creation: Ensure that the stage object is successfully created.
1. SQL Statement: Use the following SQL statement to create the healthcare table:
SQL
CREATE OR REPLACE TABLE demo_db.public.healthcare (
provider_id VARCHAR,
provider_name VARCHAR,
provider_state VARCHAR,
street_address VARCHAR,
zip_code VARCHAR,
average_covered_charges NUMBER,
total_payments VARCHAR,
total_discharges NUMBER
-- Add other columns as needed
);
2. SQL Statement: Use the following SQL statement to copy data into the healthcare table:
SQL
COPY INTO demo_db.public.healthcare
FROM @demo_db.public.ext_stage
ON_ERROR = 'CONTINUE';
3. Run the Query: Execute the SQL statement in the Snowflake query editor.
4. Review the Results: Check the results to see if the data was fully loaded without errors.
Loading CSV Data into Snowflake: Step-by-Step Guide
Below is the step-by-step guide to load CSV data from AWS S3 into Snowflake. This guide
includes creating the table, integration object, file format, stage object, and using the COPY
command to ingest data.
SQL
CREATE OR REPLACE TABLE HEALTHCARE_CSV (
AVERAGE_COVERED_CHARGES NUMBER(38,6),
AVERAGE_TOTAL_PAYMENTS NUMBER(38,6),
TOTAL_DISCHARGES NUMBER(38,0),
BACHELORORHIGHER NUMBER(38,1),
HSGRADORHIGHER NUMBER(38,1),
TOTALPAYMENTS VARCHAR(128),
REIMBURSEMENT VARCHAR(128),
TOTAL_COVERED_CHARGES VARCHAR(128),
REFERRALREGION_PROVIDER_NAME VARCHAR(256),
REIMBURSEMENTPERCENTAGE NUMBER(38,9),
DRG_DEFINITION VARCHAR(256),
REFERRAL_REGION VARCHAR(26),
INCOME_PER_CAPITA NUMBER(38,0),
MEDIAN_EARNINGSBACHELORS NUMBER(38,0),
MEDIAN_EARNINGS_GRADUATE NUMBER(38,0),
MEDIAN_EARNINGS_HS_GRAD NUMBER(38,0),
MEDIAN_EARNINGSLESS_THAN_HS NUMBER(38,0),
MEDIAN_FAMILY_INCOME NUMBER(38,0),
NUMBER_OF_RECORDS NUMBER(38,0),
POP_25_OVER NUMBER(38,0),
PROVIDER_CITY VARCHAR(128),
PROVIDER_ID NUMBER(38,0),
PROVIDER_NAME VARCHAR(256),
PROVIDER_STATE VARCHAR(128),
PROVIDER_STREET_ADDRESS VARCHAR(256),
PROVIDER_ZIP_CODE NUMBER(38,0)
);
SQL
CREATE OR REPLACE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::435098453023:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://testsnowflake/snowflake/',
's3://testxyzsnowflake/');
SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.csv_format
TYPE = 'CSV'
FIELD_DELIMITER = '|'
SKIP_HEADER = 1
NULL_IF = ('NULL', 'null')
EMPTY_FIELD_AS_NULL = TRUE;
SQL
CREATE OR REPLACE STAGE demo_db.public.ext_csv_stage
URL = 's3://testsnowflake/snowflake/csv'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = demo_db.public.csv_format;
SQL
COPY INTO healthcare_csv
FROM @demo_db.public.ext_csv_stage
ON_ERROR = 'CONTINUE';
SQL
SELECT * FROM healthcare_csv;
This guide provides a comprehensive step-by-step process to load CSV data from AWS S3 into Snowflake,
ensuring that all necessary configurations and commands are clearly outlined.
Loading JSON Data into Snowflake
1. Create a Schema
SQL
CREATE SCHEMA Json_data;
USE SCHEMA Json_data;
2. Create a Table
Create a table to load the JSON data. This table will have 26 columns along with metadata
columns such as file name, file row number, and load timestamp:
SQL
CREATE TABLE healthcare_json_table (
-- Define your 26 columns here
column1 STRING,
column2 STRING,
-- ...
file_name STRING,
file_row_number NUMBER,
load_timestamp TIMESTAMP
);
3. Integration Object
Ensure you have an integration object configured to allow Snowflake to interact with your S3
bucket. This should have been set up previously for loading CSV data.
4. Create File Format
SQL
CREATE OR REPLACE FILE FORMAT Json_format
TYPE = 'JSON';
5. Create Stage Object
SQL
CREATE OR REPLACE STAGE Json_stage
URL = 's3://test-snowflake/snowflake/json/'
FILE_FORMAT = Json_format
STORAGE_INTEGRATION = your_integration_object;
6. Loading Data Using COPY Command
Use the COPY command to load data from S3 to Snowflake. Ensure that you reference the
correct JSON keys and maintain the exact case and spaces as in the JSON file:
SQL
COPY INTO healthcare_json_table
FROM @Json_stage
FILE_FORMAT = (FORMAT_NAME = Json_format)
ON_ERROR = 'CONTINUE';
7. Verify Data Load
Check the loaded data and verify that all rows have been loaded correctly:
SQL
SELECT file_name, COUNT(*) AS row_count
FROM healthcare_json_table
GROUP BY file_name;
Summary
This process ensures that JSON data from S3 is accurately loaded into Snowflake, transforming
it into a relational table format for further analysis.
1. Create the Table
Create a table to store the JSON data with the necessary columns and metadata:
SQL
CREATE OR REPLACE TABLE healthcare_json (
id VARCHAR(50),
AVERAGE_COVERED_CHARGES VARCHAR(150),
AVERAGE_TOTAL_PAYMENTS VARCHAR(150),
TOTAL_DISCHARGES INTEGER,
BACHELORORHIGHER FLOAT,
HSGRADORHIGHER VARCHAR(150),
TOTALPAYMENTS VARCHAR(128),
REIMBURSEMENT VARCHAR(128),
TOTAL_COVERED_CHARGES VARCHAR(128),
REFERRALREGION_PROVIDER_NAME VARCHAR(256),
REIMBURSEMENTPERCENTAGE VARCHAR(150),
DRG_DEFINITION VARCHAR(256),
REFERRAL_REGION VARCHAR(26),
INCOME_PER_CAPITA VARCHAR(150),
MEDIAN_EARNINGSBACHELORS VARCHAR(150),
MEDIAN_EARNINGS_GRADUATE VARCHAR(150),
MEDIAN_EARNINGS_HS_GRAD VARCHAR(150),
MEDIAN_EARNINGSLESS_THAN_HS VARCHAR(150),
MEDIAN_FAMILY_INCOME VARCHAR(150),
NUMBER_OF_RECORDS VARCHAR(150),
POP_25_OVER VARCHAR(150),
PROVIDER_CITY VARCHAR(128),
PROVIDER_ID VARCHAR(150),
PROVIDER_NAME VARCHAR(256),
PROVIDER_STATE VARCHAR(128),
PROVIDER_STREET_ADDRESS VARCHAR(256),
PROVIDER_ZIP_CODE VARCHAR(150),
filename VARCHAR,
file_row_number VARCHAR,
load_timestamp TIMESTAMP DEFAULT TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP)
);
2. Create JSON File Format
SQL
CREATE OR REPLACE FILE FORMAT demo_db.public.json_format
TYPE = 'JSON';
3. Create Stage Object
SQL
CREATE OR REPLACE STAGE demo_db.public.ext_json_stage
URL = 's3://testsnowflake/snowflake/json'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = demo_db.public.json_format;
4. Load Data Using COPY Command
SQL
COPY INTO demo_db.public.healthcare_json
FROM (
SELECT
$1:"_id"::VARCHAR,
$1:" Average Covered Charges "::VARCHAR,
$1:" Average Total Payments "::VARCHAR,
$1:" Total Discharges "::INTEGER,
$1:"% Bachelor's or Higher"::FLOAT,
$1:"% HS Grad or Higher"::VARCHAR,
$1:"Total payments"::VARCHAR,
$1:"% Reimbursement"::VARCHAR,
$1:"Total covered charges"::VARCHAR,
$1:"Referral Region Provider Name"::VARCHAR,
$1:"ReimbursementPercentage"::VARCHAR,
$1:"DRG Definition"::VARCHAR,
$1:"Referral Region"::VARCHAR,
$1:"INCOME_PER_CAPITA"::VARCHAR,
$1:"MEDIAN EARNINGS - BACHELORS"::VARCHAR,
$1:"MEDIAN EARNINGS - GRADUATE"::VARCHAR,
$1:"MEDIAN EARNINGS - HS GRAD"::VARCHAR,
$1:"MEDIAN EARNINGS- LESS THAN HS"::VARCHAR,
$1:"MEDIAN_FAMILY_INCOME"::VARCHAR,
$1:"Number of Records"::VARCHAR,
$1:"POP_25_OVER"::VARCHAR,
$1:"Provider City"::VARCHAR,
$1:"Provider Id"::VARCHAR,
$1:"Provider Name"::VARCHAR,
$1:"Provider State"::VARCHAR,
$1:"Provider Street Address"::VARCHAR,
$1:"Provider Zip Code"::VARCHAR,
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP)
FROM @demo_db.public.ext_json_stage
);
5. Verify Data Load
SQL
SELECT * FROM healthcare_json;
6. Clean Up
SQL
TRUNCATE TABLE healthcare_json;
DROP TABLE healthcare_json;
7. Check Other Tables
SQL
SELECT * FROM healthcare_csv;
SELECT * FROM healthcare_parquet;
SELECT * FROM healthcare_json;
This process ensures that JSON data from S3 is accurately loaded into Snowflake, transforming
it into a relational table format for further analysis.
In this section, we will look at the different types of tables in Snowflake.
Snowflake supports three types of tables, each with distinct characteristics and use cases:
1. Permanent Tables
2. Temporary Tables
3. Transient Tables
1. Permanent Tables
Default Table Type: When you create a table in Snowflake without specifying the type, it
defaults to a permanent table.
Longevity: These tables are designed for long-term storage and are typically used for production
data.
Data Protection: They have robust data protection and recovery mechanisms.
Time Travel: Permanent tables support a high number of time travel retention days.
Failsafe: They include a failsafe period of seven days, which provides an additional layer of data
recovery.
2. Temporary Tables
Session-Specific: Temporary tables exist only within the session in which they are created. Once
the session ends, the table is automatically dropped.
Non-Recoverable: Data in temporary tables cannot be recovered after the session ends.
Isolation: These tables are not visible to other users or sessions.
No Cloning: Temporary tables do not support features such as cloning.
Naming Precedence: If a temporary table and a permanent table have the same name within
the same schema, the temporary table takes precedence when queried.
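For instance, here is a minimal sketch of naming precedence (the table name employees_demo and its columns are illustrative):
SQL
-- A permanent table already exists in the schema
CREATE OR REPLACE TABLE employees_demo (id INT, name STRING);
-- A temporary table with the same name shadows it for this session
CREATE TEMPORARY TABLE employees_demo (id INT, name STRING);
-- This query now resolves to the temporary table until the session ends
SELECT * FROM employees_demo;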
3. Transient Tables
Similar to Permanent Tables: Transient tables are similar to permanent tables in terms of
structure and usage.
No Failsafe: They do not have a failsafe period, which means they are not designed for the same
level of data protection and recovery.
Cost Efficiency: Transient tables are designed for data that does not require long-term
protection, making them more cost-effective.
Time Travel: They have a shorter time travel retention period compared to permanent tables.
Schema and Database: You can create transient databases and schemas. All objects within a
transient schema or database will also be transient by default.
In this section, we will look at creating and managing different types of databases, schemas, and tables in Snowflake, focusing on cost considerations and practical steps.
When designing your Snowflake data platform, it's crucial to decide on the type of database,
schema, or table to create based on your requirements and cost management. Each type of table
has associated costs and features that affect storage and data protection.
1. Permanent Tables
o Cost: Higher due to failsafe and longer time travel periods.
o Features: Failsafe period of 7 days, time travel up to 90 days.
2. Transient Tables
o Cost: Lower as they do not have a failsafe period.
o Features: Time travel retention period of 1 day, no failsafe.
3. Temporary Tables
o Cost: Typically lower as they exist only within a session.
o Features: Session-specific, non-recoverable after session ends.
Transient Database and Schema
SQL
CREATE OR REPLACE TRANSIENT DATABASE development;
Note: All objects (schemas, tables, views) created under this database will be transient by
default.
SQL
USE DATABASE development;
CREATE SCHEMA employee;
SQL
SHOW DATABASES;
SHOW SCHEMAS;
Output: You will see the development database and employee schema marked as transient.
SQL
USE SCHEMA employee;
CREATE TABLE employees (id INT, name STRING);
Verification:
SQL
SHOW TABLES;
Output: The employees table will be transient with a default retention time of 1 day.
Permanent Database and Schema
SQL
CREATE OR REPLACE DATABASE development_perm;
SQL
USE DATABASE development_perm;
CREATE SCHEMA employee;
SQL
SHOW DATABASES;
SHOW SCHEMAS;
Output: You will see the development_perm database and employee schema marked as
permanent.
SQL
USE SCHEMA employee;
CREATE TABLE employees (id INT, name STRING);
Verification:
SQL
SHOW TABLES;
Output: The employees table will be permanent with a default retention time of 1 day, which
can be extended up to 90 days.
Below is a structured summary of the SQL commands for creating transient and permanent databases, schemas, and tables in Snowflake.
When designing your Snowflake data platform, it's crucial to decide which type of database,
schema, or table to create based on your requirements and cost management. Snowflake offers
three types of objects:
1. Temporary
2. Transient
3. Permanent
Transient Objects
Create Transient Database, Schema, and Table
SQL
-- Create TRANSIENT Database
CREATE OR REPLACE TRANSIENT DATABASE DEVELOPMENT;
-- Show Databases
SHOW DATABASES;
-- Describe Database
DESC DATABASE DEVELOPMENT;
-- Use Database
USE DATABASE DEVELOPMENT;
-- Create Schema
CREATE OR REPLACE SCHEMA EMPLOYEE;
-- Show Schemas
SHOW SCHEMAS;
-- Show Tables
SHOW TABLES;
-- Drop Database
DROP DATABASE DEVELOPMENT;
Create Permanent Database, Schema, and Table
SQL
-- Create PERMANENT Database
CREATE OR REPLACE DATABASE DEVELOPMENT_PERM;
-- Show Databases
SHOW DATABASES;
-- Use Database
USE DATABASE DEVELOPMENT_PERM;
-- Create Schema
CREATE OR REPLACE SCHEMA EMPLOYEE;
-- Show Schemas
SHOW SCHEMAS;
-- Show Tables
SHOW TABLES;
-- Drop Database
DROP DATABASE DEVELOPMENT_PERM;
In this section, we will walk through the steps for creating transient and permanent schemas in Snowflake, along with the necessary SQL commands and explanations.
Step 1: Use a Permanent Database
First, ensure you are using a permanent database. In this example, we will use demo_db.
SQL
USE DATABASE demo_db;
Step 2: Create a Transient Schema
To create a transient schema under the permanent database demo_db, use the following
command:
SQL
-- Create Transient Schema
CREATE OR REPLACE TRANSIENT SCHEMA employee;
Explanation: The TRANSIENT keyword is used to specify that the schema is transient. This
means all objects created within this schema will inherit the transient property.
SQL
-- Show Schemas
SHOW SCHEMAS;
Output: You should see the employee schema listed as transient. The public schema will remain permanent because it inherits the properties of the permanent database.
SQL
-- Create Table in Transient Schema
USE SCHEMA employee;
CREATE OR REPLACE TABLE employees (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);
SQL
-- Show Tables
SHOW TABLES;
Output: The employees table should be listed as transient. The icon for a transient table is
different from that of a permanent table, indicating its transient nature.
To create a permanent schema under the permanent database demo_db, use the following
command:
SQL
-- Create Permanent Schema
CREATE OR REPLACE SCHEMA employee_perm;
Explanation: No specific keyword is needed for a permanent schema. By default, schemas are
permanent unless specified otherwise.
SQL
-- Show Schemas
SHOW SCHEMAS;
Output: You should see the employee_perm schema listed without the transient property.
SQL
-- Create Table in Permanent Schema
USE SCHEMA employee_perm;
CREATE OR REPLACE TABLE employees (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);
SQL
-- Show Tables
SHOW TABLES;
Output: The employees table should be listed as a permanent table. The icon for a permanent
table is different from that of a transient table.
By following these steps, you can effectively manage and create transient and permanent
schemas and tables in Snowflake, ensuring the appropriate use of resources and data protection
features.
In this section, we will walk through the steps for creating temporary, transient, and permanent tables in Snowflake, along with the necessary SQL commands and explanations.
To create a temporary table, use the TEMPORARY keyword. Temporary tables exist only within the
session in which they are created and are not visible to other sessions.
SQL
-- Create Temporary Table
CREATE OR REPLACE TEMPORARY TABLE employees_temp (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);
SQL
-- Show Tables
SHOW TABLES;
Output: The employees_temp table should be listed as temporary. The icon for a temporary
table has a little clock sign.
SQL
-- Insert Rows into Temporary Table
INSERT INTO employees_temp (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);
SQL
-- Select Data from Temporary Table
SELECT * FROM employees_temp;
Note: If you try to access this table from a different session, you will get an error because
temporary tables are session-specific.
To create a transient table, use the TRANSIENT keyword. Transient tables persist until explicitly
dropped and are available to all users with the appropriate privileges.
SQL
-- Create Transient Table
CREATE OR REPLACE TRANSIENT TABLE employees_transient (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);
SQL
-- Show Tables
SHOW TABLES;
Output: The employees_transient table should be listed as transient. The icon for a
transient table is different from that of a permanent table.
SQL
-- Insert Rows into Transient Table
INSERT INTO employees_transient (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);
SQL
-- Select Data from Transient Table
SELECT * FROM employees_transient;
Note: Unlike temporary tables, transient tables can be accessed from different sessions.
To create a permanent table, you can simply use the CREATE OR REPLACE TABLE command.
Permanent tables are the default type and do not require a specific keyword.
SQL
-- Create Permanent Table
CREATE OR REPLACE TABLE employees_perm (
employee_id NUMBER,
empl_join_date DATE,
dept VARCHAR(10),
salary NUMBER,
manager_id NUMBER
);
SQL
-- Show Tables
SHOW TABLES;
Output: The employees_perm table should be listed as a permanent table. The icon for a
permanent table is different from that of a transient or temporary table.
SQL
-- Insert Rows into Permanent Table
INSERT INTO employees_perm (employee_id, empl_join_date, dept, salary,
manager_id)
VALUES
(1, '2023-01-01', 'HR', 50000, 101),
(2, '2023-02-01', 'IT', 60000, 102),
(3, '2023-03-01', 'Finance', 70000, 103),
(4, '2023-04-01', 'Marketing', 80000, 104),
(5, '2023-05-01', 'Sales', 90000, 105),
(6, '2023-06-01', 'Support', 55000, 106),
(7, '2023-07-01', 'Admin', 65000, 107),
(8, '2023-08-01', 'Operations', 75000, 108);
SQL
-- Select Data from Permanent Table
SELECT * FROM employees_perm;
Note: Permanent tables can be accessed from different sessions and have a failsafe period for
data recovery.
In this section, we will walk through the steps for converting a permanent table to a transient table, and a transient table to a temporary table, in Snowflake. This process uses the CLONE keyword to create copies of tables with different properties.
Step 1: Convert a Permanent Table to a Transient Table
To convert a permanent table to a transient table, use the CREATE OR REPLACE TRANSIENT TABLE command along with the CLONE keyword. This will create a transient copy of the permanent table.
SQL
-- Convert Permanent Table to Transient Table
CREATE OR REPLACE TRANSIENT TABLE employees_transient CLONE employees_perm;
Explanation: This command creates a new transient table named employees_transient by
cloning the existing permanent table employees_perm.
SQL
-- Show Tables
SHOW TABLES;
Output: You should see the employees_transient table listed as transient. The original
employees_perm table will still exist.
If you no longer need the original permanent table, you can drop it:
SQL
-- Drop Permanent Table
DROP TABLE employees_perm;
Step 2: Convert a Transient Table to a Temporary Table
To convert a transient table to a temporary table, use the CREATE OR REPLACE TEMPORARY
TABLE command along with the CLONE keyword. This will create a temporary copy of the
transient table.
SQL
-- Convert Transient Table to Temporary Table
CREATE OR REPLACE TEMPORARY TABLE employees_temp CLONE employees_transient;
Explanation: This command creates a new temporary table named employees_temp by cloning
the existing transient table employees_transient.
SQL
-- Show Tables
SHOW TABLES;
Output: You should see the employees_temp table listed as temporary. The original
employees_transient table will still exist.
If you no longer need the original transient table, you can drop it:
SQL
-- Drop Transient Table
DROP TABLE employees_transient;
In this section, we will discuss an important concept in Snowflake called Time Travel. Before
diving into Time Travel, it is recommended to review the previous section on different types of
tables in Snowflake for a better understanding.
Time Travel in Snowflake allows users to access historical data at any point within a defined
retention period. This feature is part of Snowflake's continuous data protection lifecycle,
enabling users to query, clone, and restore data from the past.
Snowflake's continuous data protection lifecycle involves creating various objects such as
databases, schemas, tables, and views using different DDL statements and SQL operations. Time
Travel is a key component of this lifecycle, providing the ability to view and restore historical
data.
Example Timeline
To illustrate Time Travel, consider a hypothetical timeline for an employees table: rows are inserted at time T1, some rows are updated at T2, and some rows are deleted at T3.
Using Time Travel, you can view the state of the employees table at any of these points in time.
1. Historical Data Access: View the state of a table at any point within the retention period.
2. Data Recovery: Restore data that has been updated or deleted.
3. Cloning: Create clones of tables, schemas, and databases at specific points in time.
4. Dropped Objects Recovery: Restore dropped tables, schemas, and databases.
The retention period for Time Travel varies based on the Snowflake edition: Standard Edition supports a maximum of 1 day, while Enterprise Edition and higher support up to 90 days for permanent tables (the default is 1 day).
The retention period is a crucial property that determines how long historical data is preserved.
Failsafe
After the retention period, data moves to the Failsafe zone, where it is retained for an additional
7 days for permanent tables. However, data in the Failsafe zone cannot be queried or restored by
users and is reserved for disaster recovery by Snowflake.
Running Queries: Query historical data that has been updated or deleted.
Creating Clones: Clone tables, schemas, and databases at specific points in time.
Restoring Dropped Objects: Restore dropped tables, schemas, and databases.
The retention period can be set during the creation of objects and is managed by users with the
accountadmin role. For permanent tables, the retention period can be set up to 90 days, while
for transient and temporary tables, it is limited to 1 day.
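For example, the retention period can be supplied when the object is created (a sketch; the table name is illustrative):
SQL
CREATE TABLE orders_history (id INT)
DATA_RETENTION_TIME_IN_DAYS = 90;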
In this section, we will explore how to set and alter the data retention time property for tables in
Snowflake. This property is crucial for managing the Time Travel feature, which allows users to
access historical data.
Let's start by creating a table and setting its retention time property:
SQL
CREATE TABLE employees (
id INT,
name STRING,
position STRING
);
SQL
SHOW TABLES;
This will display the employees table with a default retention time.
SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 90;
SQL
SHOW TABLES;
If you try to set the retention time beyond the allowed limit (90 days for permanent tables),
Snowflake will return an error:
SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 95;
You can alter the retention time to any value between 0 and 90 days:
SQL
ALTER TABLE employees SET DATA_RETENTION_TIME_IN_DAYS = 30;
SQL
SHOW TABLES;
When you create a schema or database with a specific retention time, all objects under it inherit
this property unless explicitly set otherwise.
1. Create a Schema with a Retention Time:
SQL
CREATE SCHEMA employee_perm DATA_RETENTION_TIME_IN_DAYS = 10;
2. Create a Table in the Schema:
SQL
CREATE TABLE employee_perm.employee_new (
id INT,
name STRING,
position STRING
);
3. Insert Data:
SQL
INSERT INTO employee_perm.employee_new VALUES (1, 'John Doe',
'Manager');
SQL
SHOW TABLES IN SCHEMA employee_perm;
SQL
CREATE TRANSIENT TABLE employee_perm.employee_transient (
id INT,
name STRING,
position STRING
);
SQL
CREATE TEMPORARY TABLE employee_perm.employee_temp (
id INT,
name STRING,
position STRING
);
SQL
SHOW TABLES IN SCHEMA employee_perm;
To verify the actual retention time, you can query the metadata:
SQL
SELECT table_name, retention_time
FROM demo_db.information_schema.tables
WHERE table_schema = 'EMPLOYEE_PERM';
Altering Schema Retention Time
You can alter the retention time for a schema, which will affect all new objects created under it:
SQL
ALTER SCHEMA employee_perm SET DATA_RETENTION_TIME_IN_DAYS = 55;
SQL
SHOW SCHEMAS;
In this section, we will explore how to query historical data by utilizing Snowflake's Time Travel
feature. We will demonstrate three methods: using timestamps, offsets, and query IDs.
Method 1: Using Timestamps
SQL
SELECT CURRENT_TIMESTAMP();
This will return the current timestamp in UTC format.
SQL
ALTER SESSION SET TIMEZONE = 'UTC';
SQL
SELECT * FROM employees AT (TIMESTAMP => '2023-10-12 12:00:00');
4. Handling Errors: If the timestamp is beyond the allowed time travel period or before the
object creation time, you will receive an error:
SQL
SELECT * FROM employees AT (TIMESTAMP => '2023-10-11 12:00:00');
Error: Time travel data is not available for table employees. The
requested time is either beyond the allowed time travel period or before
the object creation time.
Method 2: Using Offsets
SQL
SELECT * FROM employees AT (OFFSET => -60 * 5);
This queries the state of the employees table 5 minutes ago (300 seconds).
2. Handling Errors: If the offset is beyond the allowed time travel period:
SQL
SELECT * FROM employees AT (OFFSET => -60 * 7);
Error: Time travel data is not available for table employees before
seven minutes.
Method 3: Using Query IDs
1. Run a Query on the Table:
SQL
SELECT * FROM employees;
2. Fetch the Query ID from History: Open the query history in Snowflake and copy the
query ID of the desired query.
3. Query Historical Data Using Query ID:
SQL
SELECT * FROM employees AT (STATEMENT => 'query_id');
Replace 'query_id' with the actual query ID copied from the history.
In this section, we will explore how to clone historical objects using Snowflake's Time Travel
feature. Cloning allows you to create a duplicate of an object (table, schema, or database) at a
specified point in its history.
SQL
SELECT CURRENT_TIMESTAMP();
SQL
CREATE TABLE restore_table CLONE employees AT (TIMESTAMP => '2023-10-12 12:00:00');
SQL
SELECT * FROM restore_table;
Check that the cloned table has the same data as the original table at the specified
timestamp.
SQL
CREATE SCHEMA restore_schema CLONE employee_perm AT (OFFSET => -600);
This clones the employee_perm schema as it was 600 seconds (10 minutes) ago.
SQL
SHOW SCHEMAS;
Check that the restore_schema has been created with the tables that existed in
employee_perm at the specified offset.
1. Run a Query on the Table:
SQL
SELECT * FROM employees;
2. Fetch the Query ID from History: Open the query history in Snowflake and copy the
query ID of the desired query.
3. Clone the Database:
SQL
CREATE DATABASE restore_db CLONE demo_db AT (STATEMENT => '01a2b3c4-d5e6-7f89-0a1b-2c3d4e5f6g7h');
SQL
SHOW DATABASES;
Check that the restore_db has been created with the schemas and tables that existed in
demo_db before the specified query ID.
Dropping and Restoring Objects Using Time Travel in
Snowflake
Introduction
In this session, we will learn how to drop and restore objects (tables, schemas, and databases) in
Snowflake using the Time Travel feature. When an object is dropped, it is retained for the data
retention period, during which it can be restored. Once the retention period has passed, the object
is moved to the Failsafe zone and cannot be restored by users.
To check the history of different objects in Snowflake, you can use the following commands:
SQL
SHOW TABLES HISTORY LIKE 'employees%' IN DATABASE demo_db;
This command displays the history of tables starting with 'employees' in the demo_db
database.
SQL
SHOW SCHEMAS HISTORY IN DATABASE demo_db;
SQL
SHOW DATABASES HISTORY;
Dropping Objects
Let's drop a database, schema, and table to see how the history is updated:
1. Drop Database:
SQL
DROP DATABASE development;
2. Drop Schema:
SQL
DROP SCHEMA demo_db.employee;
3. Drop Table:
SQL
DROP TABLE demo_db.employee_perm.employees;
Restoring Dropped Objects
To restore dropped objects within the retention period, use the UNDROP command:
1. Restore Table:
SQL
UNDROP TABLE demo_db.employee_perm.employees;
2. Restore Schema:
SQL
UNDROP SCHEMA demo_db.employee;
3. Restore Database:
SQL
UNDROP DATABASE development;
Verifying Restored Objects
SQL
SHOW TABLES IN SCHEMA demo_db.employee_perm;
SQL
SHOW SCHEMAS IN DATABASE demo_db;
SQL
SHOW DATABASES;
Important Notes
If an object with the same name already exists, the UNDROP command will fail. You must rename
the existing object before restoring the previous version.
The SHOW ... HISTORY commands include an additional column dropped_on, which displays
the date and time when the object was dropped. If an object has been dropped more than once,
each version is included as a separate row in the output.
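For example, a minimal sketch of the rename-then-restore pattern described above (assuming a new employees table was created after the original was dropped):
SQL
-- Rename the newer table out of the way
ALTER TABLE demo_db.employee_perm.employees RENAME TO employees_current;
-- Restore the dropped version under its original name
UNDROP TABLE demo_db.employee_perm.employees;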
In this section, we will discuss the concept of Fail Safe in Snowflake, which is an essential
component of Snowflake's continuous data protection lifecycle. Fail Safe ensures that historical
data is protected and recoverable in the event of a system failure or disaster.
We have already seen the continuous data protection lifecycle in the Time Travel section. Let's
revisit the key points:
DDL Operations: Create various objects such as databases, schemas, tables, and views.
Time Travel: Allows querying and cloning historical data within a retention period (up to 90 days
for permanent tables, 1 day for transient and temporary tables).
Fail Safe: Provides an additional 7-day period for permanent tables after the Time Travel
retention period ends.
Fail Safe is a non-configurable 7-day period during which historical data is recoverable by
Snowflake only. User interactions are not allowed in the Fail Safe zone, and it is intended for use
by Snowflake in case of hardware failures or disasters.
No User Operations: Unlike Time Travel, users cannot perform any operations in the Fail Safe
zone.
Data Recovery: Fail Safe is used by Snowflake to recover data in case of extreme operational
failures.
Non-Configurable Period: The Fail Safe period is fixed at 7 days and cannot be altered.
Cost Implications: Data in the Fail Safe zone incurs additional storage costs.
Cost Considerations
Fail Safe can significantly impact storage costs due to multiple snapshots:
Snapshot Size: Each snapshot taken during the day is considered for Fail Safe storage.
Cumulative Cost: If multiple snapshots are taken, the cumulative size is used for cost calculation.
Recommendations
Use Transient Tables: During development and testing phases, use transient tables to avoid Fail
Safe costs.
Design Considerations: Carefully design your data retention and backup strategies to minimize
costs.
Fail Safe provides a more efficient and cost-effective alternative to traditional backups:
Eliminates Redundancy: Avoids the need for multiple full and incremental backups.
Scalability: Scales with your data without the need for manual intervention.
Reduced Downtime: Minimizes downtime and data loss during recovery.
In this section, we will explore how to monitor and access the storage consumption of the Fail
Safe zone in Snowflake. Understanding Fail Safe storage consumption is crucial for managing
costs and ensuring efficient data management.
Prerequisites
Ensure that you have switched your account role to ACCOUNTADMIN or SECURITYADMIN to access
the necessary account-level details.
SQL
USE ROLE ACCOUNTADMIN;
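One way to inspect Fail Safe storage programmatically is to query the account usage views; below is a minimal sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS view:
SQL
SELECT table_catalog, table_schema, table_name,
       active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY failsafe_bytes DESC;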
Example Analysis
Average Consumption:
o The average consumption for the Fail Safe zone is around 1.5 MB.
o The average consumption for the database is 1.68 MB.
o Although the data is not huge, it is important to note that Fail Safe storage can
sometimes exceed database storage due to multiple snapshots.
Design Considerations: Carefully design your data retention and backup strategies to minimize
costs.
Use Transient Tables: During development and testing phases, use transient tables to avoid Fail
Safe costs, as Fail Safe is not applicable to transient tables.
In this section, we will learn about tasks in Snowflake and how to leverage them for automating
various operations. A task in Snowflake is a kind of trigger that gets executed at a specific time
or period. Tasks can be used to automate SQL statements or stored procedures at scheduled
intervals.
A task is an object in Snowflake that allows you to schedule and automate the execution of SQL
statements or stored procedures. Tasks can be set to run at specific intervals or at a specific point
in time, and they will continue to run until manually stopped.
Types of Tasks
1. Standalone Task: A task that does not have any child tasks and is not dependent on any parent
task.
2. Parent-Child Tasks: Tasks that have dependencies, where a parent task can have multiple child
tasks.
Example Use Cases
Data Ingestion: Automate the ingestion of data into tables at regular intervals.
Data Cleanup: Schedule tasks to delete old or unnecessary data from tables.
Data Transformation: Automate the execution of stored procedures for data transformation.
The task may be queued for a few seconds (e.g., 20 seconds) due to other queries or tasks being
executed.
The task then runs for the remaining time (e.g., 40 seconds) within the one-minute window.
The size of the warehouse should be determined based on the number and volume of tasks you
will be executing. For heavy workloads, consider using a larger warehouse size to ensure
efficient execution of tasks.
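For example, a dedicated warehouse for tasks could be created as follows (the name and size here are illustrative, not taken from the lecture):
SQL
CREATE OR REPLACE WAREHOUSE task_wh
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;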
1. Create a Task:
SQL
CREATE TASK my_task
WAREHOUSE = my_warehouse
SCHEDULE = '1 MINUTE'
AS
INSERT INTO my_table (column1, column2)
SELECT column1, column2
FROM source_table;
2. Start a Task:
SQL
ALTER TASK my_task RESUME;
3. Stop a Task:
SQL
ALTER TASK my_task SUSPEND;
4. Drop a Task:
SQL
DROP TASK my_task;
Monitoring Task Execution
You can monitor the execution of tasks by checking the query history:
Query History: View the time taken to complete tasks and analyze performance.
Task History: Check the status and execution details of tasks.
In this section, we will learn about the concept of a tree of tasks in Snowflake and how to
leverage it for automating complex workflows. A tree of tasks allows you to create a hierarchy of
tasks with dependencies, where a parent task can have multiple child tasks.
A tree of tasks is a hierarchical structure where tasks are organized in a parent-child relationship.
The topmost task in the hierarchy is known as the root task, which executes all the subtasks.
Task A (Root Task)
├── Task B
│   ├── Task D
│   ├── Task E
│   └── Task F
└── Task C
    ├── Task G
    └── Task H
1. Single Path Between Nodes: An individual task can have only one parent task. For example, Task
B can only have Task A as its parent.
2. Root Task Schedule: The root task must have a defined schedule. Child tasks are triggered based
on the completion of their parent tasks.
3. Maximum Tasks: A tree of tasks can have a maximum of 1000 tasks, including the root task, in a
resumed state.
4. Maximum Child Tasks: A task can have a maximum of 100 child tasks.
Execution Flow
Root Task Execution: The root task (Task A) is scheduled to run at specific intervals.
Child Task Execution: Once the root task completes, its child tasks (Task B and Task C) are
executed.
Subsequent Child Tasks: Child tasks of Task B (Task D, Task E, Task F) and Task C (Task G, Task H)
are executed after their respective parent tasks complete.
Consider a tree of tasks that requires 5 minutes on average to complete each run:
Run 1:
- T1 (Root Task) starts and remains in queue for a few seconds, then runs.
- T2 (Child Task of T1) starts after T1 completes, remains in queue, then
runs.
- T3 (Child Task of T1) starts after T1 completes, remains in queue, then
runs.
- Total time: 5 minutes.
Run 2:
- T1 starts again at the beginning of the next 5-minute window.
- T2 and T3 follow the same execution pattern as in Run 1.
Creating and Managing Tree of Tasks
SQL
CREATE TASK root_task
WAREHOUSE = my_warehouse
SCHEDULE = '5 MINUTE'
AS
CALL my_stored_procedure();
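The first child task referenced below (child_task_1) is not shown in these notes; a minimal sketch of what it might look like, so that child_task_2 has a predecessor (the INSERT statement is illustrative):
SQL
CREATE TASK child_task_1
WAREHOUSE = my_warehouse
AFTER root_task
AS
INSERT INTO my_table (column1, column2)
SELECT column1, column2
FROM source_table;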
SQL
CREATE TASK child_task_2
AFTER child_task_1
WAREHOUSE = my_warehouse
AS
DELETE FROM my_table WHERE condition;
SQL
ALTER TASK root_task RESUME;
Monitoring Task Execution
You can monitor the execution of tasks by checking the query history and task history:
Query History: View the time taken to complete tasks and analyze performance.
Task History: Check the status and execution details of tasks.
In this section, we will learn how to create and manage tasks in Snowflake. We will start by
creating a table and then create a task to insert records into the table at a specific interval.
Step-by-Step Guide
Step 1: Create a Table
First, we create a table named employees with three columns: employee_id, employee_name,
and load_time.
SQL
CREATE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR DEFAULT 'YourName',
load_time DATE
);
Step 2: Verify the Table
SQL
SHOW TABLES LIKE 'employees';
Step 3: Create a Task
Next, we create a task to insert records into the employees table at a specific interval (every one
minute).
SQL
CREATE OR REPLACE TASK employees_task
WAREHOUSE = compute_wh
SCHEDULE = '1 MINUTE'
AS
INSERT INTO employees (load_time)
VALUES (CURRENT_TIMESTAMP);
Step 4: Verify the Task
SQL
SHOW TASKS LIKE 'employees_task';
Step 5: Resume the Task
By default, the task is in a suspended state. We need to resume the task to start its execution.
SQL
ALTER TASK employees_task RESUME;
Step 6: Verify Task Status
SQL
SHOW TASKS LIKE 'employees_task';
Step 7: Verify Records in the Table
After resuming the task, verify that records are being inserted into the employees table every
minute.
SQL
SELECT * FROM employees;
You should see records being inserted with the employee_id auto-incrementing,
employee_name set to the default value, and load_time showing the current timestamp.
In this lecture, we will create a tree of tasks in Snowflake, establishing a parent-child relationship
between tasks. We will build on the previous example by creating a root task and two child tasks
that will execute based on the completion of the parent task.
Step-by-Step Guide
Step 1: Verify and Suspend the Existing Task
First, let's verify that the previous task is working and then suspend it to create child tasks.
SQL
-- Verify the existing task
SHOW TASKS LIKE 'employees_task';
-- Suspend the root task before adding child tasks
ALTER TASK employees_task SUSPEND;
Step 2: Create the Employees Copy Table
Create a copy of the employees table without the auto-increment and default properties.
SQL
CREATE TABLE employees_copy (
employee_id INTEGER,
employee_name VARCHAR,
load_time DATE
);
Step 3: Create the Employees Copy Task
Create a task to insert records into the employees_copy table after the employees_task
completes.
SQL
CREATE OR REPLACE TASK employees_copy_task
WAREHOUSE = compute_wh
AFTER employees_task
AS
INSERT INTO employees_copy (employee_id, employee_name, load_time)
SELECT employee_id, employee_name, load_time
FROM employees;
Step 4: Create the Employees Copy 2 Table
SQL
CREATE TABLE employees_copy_2 (
employee_id INTEGER,
employee_name VARCHAR,
load_time DATE
);
Step 5: Create the Employees Copy 2 Task
Create a task to insert records into the employees_copy_2 table after the employees_task
completes.
SQL
CREATE OR REPLACE TASK employees_copy_2_task
WAREHOUSE = compute_wh
AFTER employees_task
AS
INSERT INTO employees_copy_2 (employee_id, employee_name, load_time)
SELECT employee_id, employee_name, load_time
FROM employees;
Step 6: Resume the Child Tasks
SQL
-- Resume the first child task
ALTER TASK employees_copy_task RESUME;
-- Resume the second child task
ALTER TASK employees_copy_2_task RESUME;
Step 7: Resume the Parent Task
Resume the parent task to start the execution of the tree of tasks.
SQL
ALTER TASK employees_task RESUME;
Execution Flow
Parent Task: The employees_task runs every minute, inserting a record into the employees
table.
Child Tasks: After the employees_task completes, the employees_copy_task and
employees_copy_2_task run, copying records from the employees table to their respective
tables.
In this lecture, we will learn how to call stored procedures automatically using tasks in
Snowflake. We will build on the previous example by creating a stored procedure that inserts
values into a table and then create a task to call this stored procedure at regular intervals.
Step-by-Step Guide
Step 1: Create the Employees Table
First, we create the employees table with three columns: employee_id, employee_name, and
load_time.
SQL
CREATE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR DEFAULT 'YourName',
load_time DATE
);
Step 2: Create the Stored Procedure
Next, we create a stored procedure that inserts values into the employees table. The stored
procedure will take one argument, today_date, and use it to insert the current timestamp into
the load_time column.
SQL
CREATE OR REPLACE PROCEDURE load_employees_data(today_date VARCHAR)
RETURNS STRING NOT NULL
LANGUAGE JAVASCRIPT
AS
$$
// Argument names are uppercased by Snowflake, so reference TODAY_DATE inside JavaScript
var sql_command = `INSERT INTO employees (load_time) VALUES (?)`;
snowflake.execute({
sqlText: sql_command,
binds: [TODAY_DATE]
});
return 'Succeeded';
$$;
Step 3: Create the Task to Call the Stored Procedure
SQL
CREATE OR REPLACE TASK employees_load_task
WAREHOUSE = compute_wh
SCHEDULE = '1 MINUTE'
AS
CALL load_employees_data(CURRENT_TIMESTAMP::VARCHAR);
Step 4: Verify and Resume the Task
Verify that the task has been created successfully and then resume it to start its execution.
SQL
-- Verify the task
SHOW TASKS LIKE 'employees_load_task';
-- Resume the task
ALTER TASK employees_load_task RESUME;
After resuming the task, verify that records are being inserted into the employees table every
minute.
SQL
SELECT * FROM employees;
You should see records being inserted with the employee_id auto-incrementing,
employee_name set to the default value, and load_time showing the current timestamp.
Understanding how to monitor and analyze the history of tasks in Snowflake is crucial for
ensuring that tasks are executed as expected and for troubleshooting any issues that may arise. In
this section, we will explore various methods to track the task history.
Method 1: Retrieve the Most Recent Task Executions
You can retrieve the most recent 100 records of task executions using the following query:
SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY SCHEDULED_TIME;
The output includes details such as:
Task names
Query IDs
Database and schema information
Query text
Status (succeeded or failed)
Error codes and messages (if any)
Query start and completion times
Root task information
Run IDs
Method 2: Retrieve Task History Within a Time Range
To retrieve task history within a specific time range, use the following query:
SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => '2023-10-01T11:00:00',
SCHEDULED_TIME_RANGE_END => '2023-10-01T12:00:00'
));
This query filters the task history to show only the records within the specified time range. The
output includes the same metrics as the previous method but limited to the specified period.
Method 3: Retrieve Latest N Records for a Specific Task
To retrieve the latest N records for a specific task, use the following query:
SQL
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => CURRENT_TIMESTAMP - INTERVAL '1 HOUR',
RESULT_LIMIT => 10,
TASK_NAME => 'employees_load_task'
));
Streams in Snowflake are a powerful feature that enables Change Data Capture (CDC) by
tracking changes made to tables, such as inserts, updates, and deletes. This is particularly useful
for performing incremental loads or capturing changes from various data sources to keep your
data warehouse up to date.
When using Snowflake as your data warehouse, you often need to capture incremental changes from source systems and keep downstream tables up to date.
Streams in Snowflake are objects that record Data Manipulation Language (DML) changes made
to tables. This includes:
Inserts
Updates
Deletes
Metadata about each change
Streams help in capturing these changes with ease, allowing you to perform various operations
such as loading dimensions and facts accurately.
CDC Process: Streams facilitate the Change Data Capture process by recording changes at the
row level between two transactional points in time in a table.
Tracking Changes: Once a stream is created on a source table, it starts tracking all changes from
that point in time.
Row-Level Changes: Streams track changes at the row level between two points in time (e.g.,
Time T1 and Time T2).
Example Scenario
Suppose you have a table named employees and you create a stream on this table on Day 1. Any
DML operations (inserts, updates, deletes) performed on the employees table will be tracked by
the stream. On Day 2, you can check the stream to see what changes occurred between Day 1
and Day 2.
Key Concepts
Ease of Use: Streams make it easy to capture and track changes in tables.
Accuracy: Ensures accurate loading of dimensions and facts based on tracked changes.
Flexibility: Can be used with various data sources and staging tables in Snowflake.
Streams in Snowflake are used to track changes in tables, such as inserts, updates, and deletes.
Understanding how streams work under the hood is crucial for effectively using them in your
data warehouse operations.
How Streams Work
Initial Snapshot
When you create a stream on a table, it logically takes an initial snapshot of every row in the
source table. This snapshot serves as the baseline from which changes are tracked.
After the initial snapshot, the change tracking system records information about changes (inserts,
updates, deletes) committed after the snapshot was taken. For example, if the snapshot was taken
at Time T1, any changes made between T1 and T2 will be recorded.
Hidden Columns
Streams do not contain table data themselves. Instead, they create hidden columns in the original
table to track changes. Snowflake charges for the storage cost associated with these hidden
columns.
Offsets
A stream stores the offset for the source table and returns CDC records by leveraging the
versioning history for the source table. The offset represents a point in time in the transactional
version timeline of the source table.
Understanding Offsets
Concept of Offsets
Offsets are like bookmarks in a book, indicating a point in time from which changes are tracked.
When you start a stream, the offset is set to zero. As changes are made and consumed, the offset
is updated to reflect the new point in time.
Practical Example
Creating a Stream
SQL
CREATE OR REPLACE TABLE employees (
employee_id INTEGER AUTOINCREMENT START 1 INCREMENT 1,
employee_name VARCHAR,
load_time DATE
);
SQL
CREATE OR REPLACE STREAM employees_stream ON TABLE employees;
1. Insert a Row:
SQL
INSERT INTO employees (employee_name, load_time) VALUES ('John Doe',
CURRENT_DATE);
2. Query the Stream:
SQL
SELECT * FROM employees_stream;
3. Update a Row:
SQL
UPDATE employees SET employee_name = 'John Smith' WHERE employee_id = 1;
4. Query the Stream Again:
SQL
SELECT * FROM employees_stream;
Key Points to Remember
1. Offsets: Offsets are updated each time changes are consumed. They represent the point in time
from which the stream will start tracking new changes.
2. Latest Action: If multiple statements change a row, the stream contains only the latest action
taken on that row.
3. Hidden Columns: Streams use hidden columns in the original table to track changes, and
Snowflake charges for the storage cost of these columns.
4. CDC Records: Streams return CDC records by leveraging the versioning history for the source
table.
Streams in Snowflake are a powerful feature for tracking changes in tables, such as inserts,
updates, and deletes. There are three types of streams in Snowflake:
Types of Streams
1. Standard Streams
Standard streams track all types of changes (inserts, updates, and deletes) in a table. Let's explore
how to create and use standard streams with an example.
Step 1: Create the Employees Table
First, create a table named employees with three columns: employee_id, salary, and manager_id.
SQL
CREATE OR REPLACE TABLE employees (
employee_id INTEGER,
salary INTEGER,
manager_id INTEGER
);
Step 2: Create a Stream
SQL
CREATE OR REPLACE STREAM employees_stream ON TABLE employees;
Step 3: Verify the Stream
SQL
-- Show all streams
SHOW STREAMS;
Step 4: Check the Stream Offset
The offset indicates the point in time from which the stream starts tracking changes. Initially, the offset is zero.
SQL
SELECT SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream');
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
Step 5: Insert Data into the Employees Table
SQL
INSERT INTO employees (employee_id, salary, manager_id) VALUES
(1, 50000, 101),
(2, 60000, 102),
(3, 70000, 103),
(4, 80000, 104),
(5, 90000, 105);
Step 6: Query the Stream
SQL
SELECT * FROM employees_stream;
The stream output includes the original table columns (employee_id, salary, manager_id) and additional metadata columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.
Step 7: Consume the Changes
Consume the changes from the stream by inserting them into a consumer table.
SQL
-- Create the consumer table
CREATE OR REPLACE TABLE employees_consumer (
employee_id INTEGER,
salary INTEGER
);
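The consume statement itself is not shown in these notes; a minimal sketch of what it could look like, inserting every change captured as an insert:
SQL
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary
FROM employees_stream
WHERE METADATA$ACTION = 'INSERT';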
SQL
SELECT * FROM employees_consumer;
Step 8: Check the Updated Stream Offset
SQL
SELECT SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream');
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
In this example, we will demonstrate how to perform update operations using streams in
Snowflake. We will update rows in the employees table and track these changes using a stream.
Finally, we will consume the changes and insert them into a consumer table.
Step-by-Step Guide
Step 1: Verify the Stream
SQL
SELECT * FROM employees_stream;
Step 2: Check the Stream Offset
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This timestamp indicates the point in time from which the stream starts tracking changes.
Step 3: Update the Employees Table
Update the employees table to increase the salary of employees whose salary is less than 33,000.
SQL
UPDATE employees
SET salary = salary + 10000
WHERE salary < 33000;
Step 4: Verify the Updated Employees Table
SQL
SELECT * FROM employees ORDER BY salary;
You should see that the salaries of employees with employee_id 3 and 4 have been incremented.
SQL
SELECT * FROM employees_stream;
You will see four records instead of two. This is because streams track updates by recording a
delete for the old row and an insert for the new row.
Insert the changes from the stream into the employees_consumer table.
SQL
-- Create the consumer table if not already created
CREATE OR REPLACE TABLE employees_consumer (
employee_id INTEGER,
salary INTEGER
);
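The insert that consumes the update records is not shown in these notes; a minimal sketch, assuming we keep only the new row versions (an update appears in the stream as an INSERT with METADATA$ISUPDATE = TRUE):
SQL
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary
FROM employees_stream
WHERE METADATA$ACTION = 'INSERT'
  AND METADATA$ISUPDATE = TRUE;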
This query inserts only the updated rows (with the new salaries) into the employees_consumer
table.
SQL
SELECT * FROM employees_consumer;
You should see that the table now contains duplicate records for employee_id 3 and 4,
representing both the old and new salaries.
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This new timestamp indicates the point in time from which the stream will start tracking new
changes.
In this lecture, we will learn how to leverage streams to capture delete operations on a table in
Snowflake. We will delete rows from the employees table and track these changes using a
stream. Finally, we will consume the changes and delete the corresponding rows from a
consumer table.
Step-by-Step Guide
Step 1: Verify the Stream Offset
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This timestamp indicates the point in time from which the stream starts tracking changes.
Step 2: Delete Rows from the Employees Table
Delete rows from the employees table where the salary is less than 40,000.
SQL
DELETE FROM employees
WHERE salary < 40000;
Step 3: Verify the Deleted Rows in the Employees Table
SQL
SELECT * FROM employees ORDER BY salary;
You should see that the rows with employee_id 2 and 4 have been deleted.
Step 4: Check the Stream for Tracked Changes
SQL
SELECT * FROM employees_stream;
You will see two records with the DELETE action for the deleted rows.
Delete the corresponding rows from the employees_consumer table by consuming the changes
from the stream.
SQL
DELETE FROM employees_consumer
WHERE employee_id IN (
SELECT DISTINCT employee_id
FROM employees_stream
WHERE METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = FALSE
);
This query deletes the rows from the employees_consumer table that match the employee_id of
the deleted rows in the employees table.
SQL
SELECT * FROM employees_consumer;
You should see that the rows with employee_id 2 and 4 have been deleted.
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This new timestamp indicates the point in time from which the stream will start tracking new
changes.
In this lecture, we will explore how streams in Snowflake behave when changes are made within
a transaction. We will demonstrate that streams only capture changes after the transaction is
committed.
Step-by-Step Guide
Step 1: Verify the Stream Offset
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This timestamp indicates the point in time from which the stream starts tracking changes.
Step 2: Start a Transaction
SQL
BEGIN;
SQL
SHOW TRANSACTIONS;
You should see the transaction ID and session ID indicating that the transaction is currently
running.
Step 3: Insert Rows Within the Transaction
Insert some rows into the employees table within the transaction.
SQL
INSERT INTO employees (employee_id, salary, manager_id) VALUES
(6, 45000, 106),
(7, 55000, 107),
(8, 65000, 108);
Step 4: Verify the Employees Table and Stream
Check the employees table to see the current rows. Since the transaction is not yet committed,
the new rows will not be visible.
SQL
SELECT * FROM employees ORDER BY salary;
Check the stream to see if it has captured any changes. Since the transaction is not yet
committed, the stream will be empty.
SQL
SELECT * FROM employees_stream;
Step 5: Commit the Transaction
SQL
COMMIT;
Step 6: Verify the Employees Table and Stream After Commit
SQL
SELECT * FROM employees ORDER BY salary;
Check the stream to see if it has captured the changes after the commit.
SQL
SELECT * FROM employees_stream;
Step 7: Consume the Changes
Create a consumer table and insert the changes from the stream into the consumer table.
SQL
-- Create the consumer table if not already created
CREATE OR REPLACE TABLE employees_consumer (
employee_id INTEGER,
salary INTEGER
);
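As in the earlier examples, the consume insert is not shown; a minimal sketch that moves the newly committed rows into the consumer table:
SQL
INSERT INTO employees_consumer (employee_id, salary)
SELECT employee_id, salary
FROM employees_stream
WHERE METADATA$ACTION = 'INSERT';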
SQL
SELECT * FROM employees_consumer;
You should see the new rows inserted into the consumer table.
Step 8: Check the Updated Stream Offset
SQL
SELECT TO_TIMESTAMP(SYSTEM$STREAM_GET_TABLE_TIMESTAMP('employees_stream'));
This new timestamp indicates the point in time from which the stream will start tracking new
changes.
It's good practice to set comments for each stream to track their purpose.
SQL
ALTER STREAM employees_stream SET COMMENT = 'This stream is used to capture changes from the employees table.';
SQL
SHOW STREAMS;
SQL
DROP STREAM employees_stream;
SQL
SHOW STREAMS;