2 - Snowflake de Feb25

Snowflake is a cloud-based data warehousing solution that offers scalable storage, processing, and analytics services with a pay-as-you-go model. Its architecture separates compute and storage, allowing for dynamic scaling and efficient data management, including support for semi-structured data and real-time data sharing. Key features include automatic micro-partitioning, clustering for performance optimization, and various connectivity options for integration with BI and ETL tools.

1. Snowflake
2. Snowflake architecture
3. Connecting to snowflake
4. Virtual warehouse
5. Micro-partitions
6. Clustering in Snowflake
7. Snowflake edition
8. Snowflake pricing
9. Data Loading in Snowflake
10. Loading Snowflake data with ETL tools
11. Stages
12. Loading Data from AWS S3, Azure, and GCP into Snowflake
13. Snowpipe
14. Time travel and fail safe
15. Zero copy cloning
16. Tables
17. External Tables in Snowflake
18. Access control in snowflake
19. Views
20. Dynamic data masking
21. Data sharing
22. Scheduling in snowflake - Tasks
23. Streams in Snowflake
24. User Defined Functions (UDFs)
25. Stored Procedures in Snowflake
26. Caching in Snowflake
27. Unloading Data in Snowflake
1. What is Snowflake?
Snowflake is a cloud-based data warehousing solution that provides data storage,
processing, and analytics services.
Key Features:
 Founded in 2012.
 Offers data storage and analytics services.
 No on-premises infrastructure—it runs entirely on the cloud.
 Operates on Amazon S3, Microsoft Azure, and Google Cloud Platform.
 Available as Software-as-a-Service (SaaS).

Why Choose Snowflake?


Advantages of Snowflake:
Pay-as-you-go model – Pay only for the resources you use.
No infrastructure cost – Fully managed cloud platform.
More than a data warehouse – Supports data transformations, pipelines, and even
visualization.
High scalability – Supports automatic scaling (scale-up and scale-out).
Advanced data management – Data recovery, backup, sharing, and masking.
Semi-structured data support – Can analyze external files (e.g., JSON, Parquet, ORC, Avro).
Seamless integration – Works with popular data visualization and reporting tools.
Snowflake offers a range of features that make it a popular choice for data warehousing:
 Elasticity: Snowflake’s architecture allows for dynamic scaling of both compute and
storage resources. It can automatically scale up or down based on workload
requirements, ensuring optimal performance and cost efficiency.
 Separation of Compute and Storage: Snowflake separates the storage and compute
layers, enabling independent scaling of both components. This flexibility allows
businesses to scale compute resources for high-demand workloads without affecting
storage and vice versa.
 Native Support for Semi-structured Data: Snowflake natively supports semi-
structured data formats like JSON, Avro, and Parquet, which eliminates the need for
pre-processing before ingestion.
 Zero Management: Snowflake is a fully managed service, meaning that it takes care
of database management tasks like indexing, tuning, and partitioning, reducing
administrative overhead.
 Concurrency: Snowflake can handle multiple concurrent users and workloads
without impacting performance, thanks to its multi-cluster architecture.
 Data Sharing: Snowflake allows businesses to securely share data in real-time across
different organizations without the need to replicate or move the data, enhancing
collaboration.
 Security and Compliance: Snowflake includes robust security measures such as
encryption (at rest and in transit), role-based access control (RBAC), multi-factor
authentication, and compliance with standards like HIPAA and PCI DSS.

Traditional Data Warehouse vs. Snowflake

Feature | Traditional Data Warehouse | Snowflake
Infrastructure Cost | Requires high setup costs | No infrastructure cost (cloud-based)
Semi-structured Data Handling | Needs ETL tools | Supports semi-structured data natively
Data Loading & Unloading | Requires ETL tools | Simple with COPY command
Scalability | Complex scaling process | Highly scalable (automatic scaling)
Database Administration | Requires manual optimization | Automated optimization (micro-partitions, clustering)
Data Backup | Needs additional storage | No extra cost (via Cloning)
Data Recovery | Complex and expensive | Easy with Time Travel
Data Sharing | Difficult | Easy with Secure Data Sharing
Change Data Capture | Requires ETL tools | Built-in Streams feature
Job Scheduling | Requires third-party tools | Handled within Snowflake using Tasks

2. Snowflake Architecture
Snowflake's architecture is designed to separate compute, storage, and cloud services,
ensuring high performance, scalability, and cost efficiency. It consists of three key layers:
1 Database Storage Layer
2 Query Processing Layer
3 Cloud Services Layer

1. Database Storage Layer (Storage Layer)


This layer is responsible for efficiently storing data in a highly optimized, columnar format.
Key Features:
Stores table data and query results.
Data is stored in a compressed columnar format.
Uses micro-partitions to optimize data organization.
Snowflake automatically manages storage—including file size, compression, metadata, and
statistics.
Customers cannot access raw storage files directly; they can only interact with data
through SQL queries.
Cluster keys can be defined on large tables to improve query performance.
2. Query Processing Layer (Compute Layer)
This is the actual processing unit of Snowflake, where SQL queries are executed.
Key Features:
Snowflake processes queries using Virtual Warehouses.
Each Virtual Warehouse consists of multiple compute nodes allocated from a cloud
provider.
On AWS, Virtual Warehouses use EC2 instances, while on Azure, they use Virtual
Machines.
Compute costs are based on query execution time on Virtual Warehouses.
Highly scalable – Can scale up and scale down easily.
Supports Auto-Suspend and Auto-Resume, reducing costs by stopping unused warehouses.
Virtual Warehouses act as the "muscle" of Snowflake, handling query execution.

3. Cloud Services Layer (Control Layer)


This is the brain of Snowflake, responsible for coordinating and managing various cloud
services.
Key Features:
Manages authentication and access control.
Handles infrastructure management.
Performs metadata management for optimized query performance.
Ensures security and governance.
Manages serverless features like:
 Snowpipe (automated data ingestion).
 Tasks (scheduling and automation).
 Materialized View Maintenance (ensuring up-to-date query results).

Why Is Snowflake’s Architecture Unique?


• Decoupled storage and compute – Pay separately for storage and computing power.
• Elastic scaling – Auto-scale compute power up or down as needed.
• Fully managed – No need to worry about infrastructure or manual tuning.

3. Connecting to Snowflake
Snowflake provides multiple ways to connect, making it flexible for different use cases.
1. Web-Based User Interface (UI)
A browser-based interface to manage and use Snowflake.
Provides access to:
 Query execution
 Database and warehouse management
 Security and user access controls
Ideal for administrators, developers, and analysts.
2. Command-Line Interface (CLI) - SnowSQL
SnowSQL is a command-line tool for interacting with Snowflake.
Supports SQL queries, scripting, and automation.
Useful for developers and DevOps teams.
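A minimal SnowSQL session sketch (the account identifier and user name are placeholders; once connected, standard SQL runs from the prompt):
snowsql -a <account_identifier> -u <user_name>
USE WAREHOUSE my_warehouse;
SELECT CURRENT_VERSION();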
3. ODBC & JDBC Drivers
Snowflake provides ODBC and JDBC drivers to integrate with external applications.
Enables connectivity with BI tools like Tableau, Power BI, and Looker.
Suitable for analytics, reporting, and third-party integrations.
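For reference, a typical JDBC connection URL follows this pattern (all values below are placeholders, not taken from this document):
jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?warehouse=my_warehouse&db=my_db&schema=public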
4. Native Connectors for ETL Tools
Snowflake supports built-in connectors for ETL tools like:
 Informatica
 Datastage
 Talend
 Apache Nifi
Helps in data extraction, transformation, and loading (ETL) workflows.

Why Is Snowflake’s Connectivity Powerful?


• Multiple connection methods – Web UI, CLI, API, and drivers.
• Seamless integration with BI tools, ETL platforms, and cloud applications.
• Secure and scalable – Supports role-based access and encryption.

4. Virtual Warehouse
A virtual warehouse in Snowflake is a cluster of compute resources that performs all
computational tasks, such as data loading, querying, and transformations. Snowflake’s
architecture separates compute from storage, so virtual warehouses can be resized (scaled
up or down) and turned on or off independently of the data storage layer. This enables fine-
grained control over performance and cost, allowing users to allocate more resources for
complex operations and scale down when resources are not needed.
Users can create multiple virtual warehouses to handle different workloads, such as ETL
jobs, reporting, and ad-hoc queries. Snowflake can automatically scale a warehouse up or
down based on workload demands, ensuring that performance remains optimal.
Warehouse Selection Based on Requirements
Choose a small warehouse for light workloads (e.g., small queries, occasional data
processing).
Use a larger warehouse for high concurrency, large data loads, and intensive queries.
Warehouse Sizing & Scaling
Snowflake warehouses come in different sizes, which determine the number of compute
nodes in the cluster.

Warehouse Size | Compute Power (No. of Servers on AWS) | Use Case
X-Small (XS) | 1 Server | Testing, small queries
Small (S) | 2 Servers | Light ETL workloads
Medium (M) | 4 Servers | General-purpose querying
Large (L) | 8 Servers | High concurrency & larger queries
X-Large (XL) | 16 Servers | Heavy data transformations
2X-Large (2XL) | 32 Servers | Enterprise-level workloads
3X-Large (3XL) | 64 Servers | High-performance analytics
4X-Large (4XL) | 128 Servers | Massive-scale processing

Important Notes:
 Each increase in warehouse size doubles the number of compute nodes and cost.
 If there are insufficient resources, queries get queued until resources become
available.
Scaling Options in Snowflake

Snowflake provides two ways to increase computing power:


1 Scale Up (Vertical Scaling)
Increase the size of a Virtual Warehouse (VW).
Helps when queries are slow or data loads take too long.
Can be done anytime using the Web UI or SQL interface.
Example:
ALTER WAREHOUSE my_warehouse SET WAREHOUSE_SIZE = 'LARGE';
2 Scale Out (Horizontal Scaling)
Increase the number of clusters in a Virtual Warehouse.
Used for handling high concurrency (many users running queries at the same time).
Prevents query queuing by automatically adding clusters.
Automatically removes clusters when not needed (Multi-Clustering).
Multi-Clustering is available only in the Enterprise Edition.
Example:
ALTER WAREHOUSE my_warehouse SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT
= 3;
Auto Suspend & Auto Resume (Cost Optimization)
Auto Suspend – Automatically pauses a warehouse after a period of inactivity to save costs.
Auto Resume – Automatically resumes a warehouse when a query is executed.
Example:
ALTER WAREHOUSE my_warehouse SET AUTO_SUSPEND = 300; -- Auto suspend after 5
minutes
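Auto Resume is set the same way:
ALTER WAREHOUSE my_warehouse SET AUTO_RESUME = TRUE;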
Creating a Warehouse in Snowflake
To create a new Virtual Warehouse, use the following SQL command:
CREATE WAREHOUSE my_warehouse
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;

5. Micro-Partitions
Agenda
How data is stored in micro-partitions
Metadata of micro-partitions
Benefits of micro-partitions
Clustering & Cluster Keys
How to define and choose Cluster Keys
What are Micro-Partitions?
Snowflake uses a unique partitioning technique called micro-partitioning.
Micro-partitioning is automatic – users don’t need to define partitions.
Tables are partitioned based on the order of data insertion.
Micro-partitions are small in size (50 MB - 500 MB).
Data is compressed, and Snowflake automatically chooses the best compression algorithm.

Metadata of Micro-Partitions
Snowflake automatically maintains metadata about micro-partitions, which includes:
• Number of distinct values in each column
• Range of values in each column
• Other useful statistics for query optimization
Query Pruning (Metadata-Based Filtering)
Snowflake uses metadata to filter out unnecessary micro-partitions during query execution.
This process is called Query Pruning.
Instead of scanning the entire table, only relevant micro-partitions are scanned.
Example:
SELECT type, country FROM MY_TABLE WHERE name = 'Y';
Only the micro-partitions containing ‘Y’ will be scanned (instead of scanning the entire
table).
Only the required columns (type and country) will be queried, ignoring unnecessary data.

Benefits of Micro-Partitioning
No need for manual partitioning – Snowflake does it automatically.
Optimized query performance – Faster execution due to query pruning.
Columnar storage – Only relevant columns are scanned, improving efficiency.
Efficient compression – Reduces storage costs.
Enables fine-grained pruning – Minimizes data scanning and enhances speed.
What is Clustering?
• Clustering improves query performance by organizing data within micro-partitions.
• Helps when queries filter on specific columns frequently.
• Snowflake automatically clusters data, but manual clustering is needed for large tables
with frequent updates.
What is a Cluster Key?
A Cluster Key is one or more columns used to logically group data within micro-partitions.
Helps in query pruning by reducing the number of scanned micro-partitions.
Example of Defining a Cluster Key:
ALTER TABLE sales CLUSTER BY (region, date);
This clusters the sales table based on region and date, improving queries that filter by these
columns.
How to Choose Cluster Keys?
• Choose columns that are frequently used in WHERE, GROUP BY, and JOIN conditions.
• Select columns with high cardinality (many unique values).
• Avoid too many columns, as it increases clustering costs.
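To evaluate a candidate key, Snowflake exposes a system function that reports clustering statistics; a sketch against the sales example above:
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(region, date)');
-- returns JSON with average clustering depth and partition-overlap statistics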
Summary
• Micro-partitioning is automatic – No user maintenance needed.
• Metadata-based pruning speeds up queries by reducing scanned data.
• Clustering improves performance for large datasets with frequent filtering.
• Cluster Keys should be chosen carefully to optimize storage and query execution.

6. Clustering in Snowflake
What is Clustering?
Clustering in Snowflake refers to the process of organizing data in a way that improves
query performance, particularly for large datasets. Snowflake uses automatic clustering by
default, meaning it automatically manages data distribution and storage optimization. Users
can define cluster keys to help Snowflake organize data more efficiently based on
commonly queried columns. This allows for faster retrieval of data and optimized query
performance, especially when working with large volumes of data.
Why is Clustering Important?
 Snowflake stores data automatically in micro-partitions.
 By default, Snowflake determines how to distribute and sort data when it's loaded.
 However, as data grows, it may not be stored optimally for queries.
 Clustering organizes data based on specific columns (clustering keys) to improve
query performance.
How Clustering Works in Snowflake
1. Micro-partitions store data in compressed, columnar format.
2. Without clustering, queries may scan multiple partitions, leading to performance
overhead.
3. With clustering, Snowflake orders the data logically based on a clustering key to
improve partition pruning.
4. Queries that filter or join on clustered columns will scan fewer partitions, improving
efficiency and cost-effectiveness.
Defining Clustering Keys
A clustering key is a set of columns in a table that determines how Snowflake should
organize data in micro-partitions.
Good Clustering Keys Should:
 Be frequently used in the WHERE clause.
 Be used as JOIN keys.
 Be used in aggregations or GROUP BY operations.
 Have a high cardinality (many distinct values).

Example 1: Creating a Table with Clustering


Let's create a sales table and cluster it by region since queries frequently filter by region:
CREATE TABLE sales (
order_id INT,
customer_id INT,
region STRING,
amount DECIMAL(10,2),
order_date DATE
)
CLUSTER BY (region);
Why Cluster by region?
 Queries that filter by region will scan fewer partitions.
Example Query:
SELECT SUM(amount)
FROM sales
WHERE region = 'West';

Optimized query execution because Snowflake will prune irrelevant partitions.

Example 2: Modifying an Existing Table’s Clustering


If we notice that queries often filter by region and order_date, we can modify clustering:
ALTER TABLE sales CLUSTER BY (region, order_date);
Now queries filtering by region and date will be optimized.
Example Query:
SELECT SUM(amount)
FROM sales
WHERE region = 'West' AND order_date >= '2024-01-01';

** Without clustering, the query scans all partitions. With clustering, Snowflake scans only
relevant partitions, reducing cost.

Example 3: Clustering with Expressions


Snowflake allows expressions as clustering keys.
Example: If queries frequently use YEAR(order_date) instead of just order_date:
ALTER TABLE sales CLUSTER BY (region, YEAR(order_date));
Benefit: Instead of scanning all years, Snowflake prunes irrelevant years.

Re-Clustering in Snowflake
Clustering is not maintained indefinitely on its own; as data gets fragmented, the table needs to be re-clustered (either manually or via Automatic Clustering, below).
Manual Re-Clustering
ALTER TABLE sales RECLUSTER;
Re-clustering costs Snowflake credits, so it should be used carefully.
Automatic Re-Clustering
Snowflake also supports Automatic Clustering (Enterprise Edition or above). It is active by default once a clustering key is defined, and it can be suspended or resumed per table:
ALTER TABLE sales RESUME RECLUSTER;
Snowflake will then continuously optimize clustering as new data is inserted.

Best Practices for Clustering in Snowflake

Best Practice | Reason
Use clustering on large tables | Small tables don’t benefit much from clustering.
Use columns frequently in WHERE/JOIN | Helps Snowflake optimize pruning.
Don’t cluster on more than 4 columns | Too many keys increase overhead.
Use expressions for clustering when necessary | Example: YEAR(date_column) or SUBSTRING(code,1,6).
Enable auto-clustering for dynamic data | Avoids manual reclustering costs.
Scenario
You're working with a large e-commerce dataset containing millions of sales records. You
need to optimize query performance by defining clustering keys.

Step 1: Create the sales Table


Run the following SQL command to create a sales dataset:
CREATE OR REPLACE TABLE sales (
order_id INT,
customer_id INT,
region STRING,
category STRING,
amount DECIMAL(10,2),
order_date DATE
)
CLUSTER BY (region, order_date);

This clusters data by region and order_date, making it efficient for regional sales analysis.
Step 2: Load Sample Data
Now, insert some sample records:
INSERT INTO sales (order_id, customer_id, region, category, amount, order_date) VALUES
(1, 101, 'North', 'Electronics', 500.00, '2024-01-01'),
(2, 102, 'South', 'Clothing', 200.00, '2024-01-05'),
(3, 103, 'East', 'Electronics', 700.00, '2024-01-10'),
(4, 104, 'West', 'Clothing', 150.00, '2024-01-15'),
(5, 105, 'North', 'Electronics', 900.00, '2024-01-20');

This loads some sample data for testing.

Step 3: Query Optimization Check


Run a query before clustering:
SELECT SUM(amount)
FROM sales
WHERE region = 'North' AND order_date >= '2024-01-01';

** Without clustering, Snowflake scans many partitions, increasing query time.


With clustering, Snowflake scans fewer partitions, improving performance.

Step 4: Modifying Clustering Keys


After analyzing queries, you realize category is frequently used. Modify the table to add
category as a clustering key:
ALTER TABLE sales CLUSTER BY (region, order_date, category);
Now, the table is clustered by region, order_date, and category.

Step 5: Manual Re-Clustering


To apply clustering to existing data, run re-clustering:
ALTER TABLE sales RECLUSTER;
Snowflake reorganizes data based on the new clustering keys.

Step 6: Enable Auto-Clustering (Optional)


If your dataset is growing dynamically, make sure Automatic Clustering is active for the table (it can be resumed if it was suspended):
ALTER TABLE sales RESUME RECLUSTER;
Snowflake will automatically optimize clustering as new data is inserted.

Step 7: Performance Testing


Run the query again and check Snowflake's query profile:
SELECT SUM(amount)
FROM sales
WHERE region = 'North' AND order_date >= '2024-01-01';

You should see fewer partitions scanned, leading to better performance.

7. Snowflake Editions & Features


Snowflake offers 4 editions:
1 Standard Edition
2 Enterprise Edition
3 Business Critical Edition
4 Virtual Private Snowflake (VPS)
Cost depends on the edition you choose!
 Most organizations go with Enterprise or Business Critical editions.
1. Standard Edition (Basic)
-> Ideal for small businesses & startups.
-> Includes core features like automatic scaling, security, and SQL support.
** No multi-cluster warehouses (limits concurrent workloads).
** Limited security & compliance (no HIPAA, PCI DSS).

2. Enterprise Edition (Recommended for Most Organizations)


-> All Standard Edition features.
-> Multi-cluster warehouses (for better performance).
-> Time Travel (up to 90 days of data recovery).
-> Materialized Views for faster queries.
-> More security & governance options.

3. Business Critical Edition (For Highly Regulated Industries)


-> All Enterprise Edition features.
-> Enhanced security (HIPAA, PCI DSS, FedRAMP, and more).
-> Tri-Secret Secure – Customer Managed Encryption Keys.
-> Failover & Replication across regions.
-> PrivateLink support for AWS, Azure, and GCP.

4. Virtual Private Snowflake (VPS) – Highest Security Level


-> All Business Critical features.
-> Completely isolated environment.
-> Custom security controls.
-> Best for Government & highly regulated sectors.
Choosing the Right Edition

Edition | Best For | Key Features
Standard | Small businesses, basic users | Core Snowflake features
Enterprise | Mid-size & large companies | Multi-cluster, Time Travel (90 days)
Business Critical | Financial, healthcare, regulated industries | Compliance, Encryption, Failover
VPS | Government, highly secure orgs | Fully isolated, max security


8. Snowflake Pricing & Cost Breakdown
1. What Affects Snowflake Cost?
Snowflake Edition (Standard, Enterprise, Business Critical, VPS)
Region (where Snowflake account is created)
Cloud Provider (AWS, Azure, GCP)
Virtual Warehouse Size (XS, S, M, L, XL, etc.)

2. Types of Snowflake Costs


1. Storage Cost
2. Compute Cost
3. Cloud Services Cost

3. Storage Cost
Snowflake charges for storage per terabyte (TB) per month (compressed).
2 Storage Plans:

Storage Type | Cost | Best For
On-Demand Storage | $40/TB per month | Flexible, pay-as-you-go
Capacity Storage | $23/TB per month | Pre-purchased, lower cost

How to choose?
 Not sure about data size? → Start with On-Demand
 Stable data volume? → Switch to Capacity Storage

4. Compute Cost (Snowflake Credits)


Compute cost is based on Virtual Warehouse usage (per second, min. 1 min).
Larger warehouses consume more credits per second.

Warehouse Size | Credits per Hour | Example: 30 min Usage
X-Small (XS) | 1 Credit | 0.5 Credit
Small (S) | 2 Credits | 1 Credit
Medium (M) | 4 Credits | 2 Credits
Large (L) | 8 Credits | 4 Credits
X-Large (XL) | 16 Credits | 8 Credits

Example Calculation:
 If you use a Large warehouse (L) for 30 min → 4 Credits
 If you use an XS warehouse for 1 hour → 1 Credit

5. What is a Snowflake Credit?


A Snowflake Credit = Unit of compute usage in Snowflake
Used only when you are running compute resources (like Virtual Warehouses).
Free Trial? → Snowflake offers $400 worth of free credits.
Snowflake Credit Cost by Edition:
 Standard → $2.7 per Credit
 Enterprise → $4 per Credit
 Business Critical → $5.4 per Credit
 VPS → Depends on Org
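Putting the numbers above together, an illustrative compute-cost calculation:
 A Large (L) warehouse running for 2 hours → 8 credits/hour × 2 = 16 credits
 On the Enterprise edition (≈ $4 per credit above) → 16 credits × $4 = $64 of compute
 Storage is billed separately, per compressed TB per month.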

6. Serverless Features (Auto-Compute)


 Some features use Snowflake-managed compute & consume credits.
 Common Serverless Features:
o Auto-clustering
o Query acceleration service
o Search optimization
o Snowflake Tasks (if using serverless mode)

9. Data Loading in Snowflake

Agenda
1. Load Types
2. Bulk Loading vs. Continuous Loading
3. Using the COPY Command
4. Transforming Data During Load
5. Other Data Loading Methods
1. Load Types in Snowflake
Snowflake provides two primary ways to load data:
Bulk Loading Using the COPY Command
 Used for large datasets.
 Loads batch files from cloud storage or local machines.
 Requires a virtual warehouse for processing.
Continuous Loading Using Snowpipe
 Best for real-time or streaming data.
 Uses Snowpipe, which is serverless (additional cost applies).
 Loads data automatically when new files appear in a stage.

2. Bulk Loading Using the COPY Command


How it Works
1. Data files are staged (either Internal or External).
2. The COPY INTO command loads the data into a Snowflake table.
3. Uses a Virtual Warehouse to execute the load.

Hands-On Example
Step 1: Create a Table
CREATE OR REPLACE TABLE customers (
customer_id INT,
name STRING,
email STRING
);

Step 2: Create a Named Stage (Optional)


CREATE OR REPLACE STAGE my_stage
URL = 's3://my-bucket/path/'
STORAGE_INTEGRATION = my_s3_integration;

Step 3: Load Data Using COPY


COPY INTO customers
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

Benefits:
 Handles large files efficiently.
 Allows transformations while loading.

3. Continuous Loading Using Snowpipe


How it Works
1. Files are automatically detected in a cloud storage stage.
2. Snowpipe loads them into Snowflake in near real-time.
3. No need for a Virtual Warehouse (serverless).
Hands-On Example
Step 1: Create a Table
CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE
);

Step 2: Create a Pipe for Snowpipe


CREATE OR REPLACE PIPE my_pipe
AUTO_INGEST = TRUE
AS
COPY INTO orders
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

Benefits:
 Automated & Continuous data ingestion.
 Ideal for real-time analytics.

4. Transforming Data During Load


Snowflake allows basic data transformations within the COPY command by selecting from the staged files.
Example: Transform Data While Loading (reorder columns and skip bad rows)
COPY INTO customers (customer_id, name, email)
FROM (SELECT $1, $3, $2 FROM @my_stage)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';

Transformations Supported:
 Column Reordering
 Column Omission
 String Operations
 Auto-Increment Fields

5. Other Ways to Load Data


Snowflake integrates with ETL tools like: Informatica, Matillion, Hevo, DataStage, Azure
Data Factory, AWS Glue

10. Loading Snowflake data with ETL tools


Agenda
1. What is ETL in Snowflake?
2. ETL Process Flow
3. Step-by-Step Guide for Popular ETL Tools
 Matillion
 Informatica
 Azure Data Factory (ADF)
 Hevo
 DataStage
4 Best Practices

1. What is ETL in Snowflake?


ETL (Extract, Transform, Load) is a process used to move raw data from various sources,
clean and transform it, and then load it into Snowflake for analytics.
ETL vs ELT
 ETL: Transformation happens before loading the data into Snowflake.
 ELT: Data is loaded first, then transformations happen inside Snowflake (more
efficient with large datasets).
ETL (Extract, Transform, Load)

 Useful for Data Warehouse Ingestion


 Extracts data from various sources, transforms it into a structured format, and then
loads it into a data warehouse.
 Schema-on-Write: Data is structured before loading.
 Examples: Informatica, Talend, Apache Nifi.
ELT (Extract, Load, Transform)

 Useful for Data Lake Ingestion


 Extracts data, loads it in raw format into storage, and then applies transformations
as needed.
 Schema-on-Read: Data remains raw until queried.
 Examples: Snowflake, Google BigQuery, Amazon Redshift.

2. ETL Process Flow


1. Extract: Retrieve data from databases, APIs, or cloud storage.
2. Transform: Apply business rules, cleansing, and formatting.
3. Load: Store the data in Snowflake tables.

3. Step-by-Step Guide for Popular ETL Tools


(A) Matillion ETL for Snowflake
Matillion is a cloud-native ETL tool designed for Snowflake.
Steps to Load Data Using Matillion
1. Connect to Snowflake
 In Matillion, create a Snowflake connection with credentials.
2. Create a Job
 Go to Orchestration Job → Create New Job.
 Drag and drop the Extract component (e.g., MySQL, S3, or API).
3. Transform Data
 Use Transformation Job for data cleansing and aggregation.
4. Load into Snowflake
 Drag the "Table Output" component and map columns to your Snowflake table.
 Run the Job to load data.
Example Query Used in Matillion:
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

(B) Informatica ETL for Snowflake


Informatica Cloud Data Integration (IICS) supports Snowflake as a target.
Steps to Load Data Using Informatica
1. Create a Snowflake Connection
 Go to Administrator → Connections → New Connection.
 Choose Snowflake and provide credentials.
2. Create a Mapping Task
 Select Source (Oracle, SQL Server, API, etc.).
 Apply Transformations (filter, sort, join).
 Choose Snowflake Table as Target.
3. Run the Task
 Deploy and schedule the mapping task to load data into Snowflake.
Example Query in Informatica:
INSERT INTO snowflake_table (col1, col2, col3)
SELECT col1, col2, col3 FROM source_table;
(C) Azure Data Factory (ADF) for Snowflake
ADF is a cloud-based ETL tool from Microsoft that integrates with Snowflake.
Steps to Load Data Using ADF
1. Create a Linked Service
 In ADF, create a Linked Service for Snowflake.
2. Create a Pipeline
 Drag and drop the Copy Data Activity.
 Select Source (Azure SQL, Blob Storage, etc.).
 Choose Snowflake as the Destination.
3. Run and Monitor the Pipeline
 Execute the pipeline and check logs in Monitor.
Example Query in ADF:
COPY INTO snowflake_table
FROM @azure_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

(D) Hevo ETL for Snowflake


Hevo is a no-code ETL tool that supports automatic data pipelines.
Steps to Load Data Using Hevo
1. Create a Pipeline
 Select Source (Google Sheets, Salesforce, MySQL, etc.).
 Choose Snowflake as the Destination.
2. Apply Transformations
 Use Hevo’s UI to clean and format data.
3. Start the Data Pipeline
 Enable the pipeline for real-time sync into Snowflake.

(E) IBM DataStage for Snowflake


DataStage is an enterprise ETL tool used for large-scale data integration.
Steps to Load Data Using DataStage
1. Create a Snowflake Connector
 Define a connection in DataStage Designer.
2. Create an ETL Job
 Drag and drop the Extract Stage (e.g., Oracle, SQL Server).
 Apply transformations (Join, Lookup, Aggregation).
3. Load Data to Snowflake
 Use the Snowflake Connector Stage to push data.

4. Best Practices for ETL in Snowflake


Use ELT Instead of ETL (Load raw data first, transform inside Snowflake).
Optimize Cluster Keys to improve query performance.
Monitor Compute Costs (ETL tools consume Snowflake credits).
Use Staging for Large Loads (Internal/External stages).
Automate Pipeline Scheduling for efficiency.
11. Stage in Snowflake
Agenda
1. What is a Stage in Snowflake?
2. Types of Stages
 External Stages
 Internal Stages (User, Table, Named)
3. Creating Stages in Snowflake
4. Loading Data from Stages
5. Best Practices for Using Stages

1. What is a Stage in Snowflake?


A Stage in Snowflake is a storage location where data files are temporarily stored before
loading into tables.
Why Use Stages?
 Stages help organize data files before inserting them into tables.
 Improve performance by reducing direct loads from external sources.
 Enable batch processing for large data files.

2. Types of Stages in Snowflake


There are two main types of Stages:
(A) External Stages
 Store files outside Snowflake in cloud storage (S3, Azure Blob, Google Cloud
Storage).
 Requires a Storage Integration for authentication.
(B) Internal Stages
 Store files inside Snowflake before loading them into tables.
 3 Types:
1. User Stage (@~) → Assigned to a specific user.
2. Table Stage (@%) → Tied to a specific table.
3. Named Internal Stage (@stage_name) → Can be used across multiple tables.

3. Creating Stages in Snowflake


(A) Creating an External Stage
External Stages store files in cloud storage (AWS S3, Azure Blob, Google Cloud Storage).
Example: Creating an External Stage for Amazon S3
CREATE OR REPLACE STAGE mydb.external_stages.s3_stage
URL = 's3://my-bucket-name/'
STORAGE_INTEGRATION = my_s3_integration;

 This stage points to an S3 bucket.


 You need to set up a Storage Integration for authentication.

(B) Creating an Internal Stage


Internal Stages store files inside Snowflake.
1. User Stage (Default for Each User)
 Every Snowflake user automatically gets a stage.
 Example: Upload file to User Stage
PUT file:///data/sales_data.csv @~;
2. Table Stage (Tied to a Specific Table)
 Each table in Snowflake automatically has its own stage.
 Example: Upload file to Table Stage
PUT file:///data/sales_data.csv @%sales_table;
3. Named Internal Stage (Reusable Across Multiple Tables)
 Manually created and stored in a schema.
 Example: Creating a Named Internal Stage
CREATE OR REPLACE STAGE my_internal_stage;
 Example: Upload file to Named Internal Stage
PUT file:///data/sales_data.csv @my_internal_stage;
4. Loading Data from Stages into Snowflake Tables
Once data is staged, we use COPY INTO to load it into Snowflake tables.
(A) Loading from External Stage
COPY INTO sales_table
FROM @s3_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);

(B) Loading from Internal User Stage


COPY INTO sales_table
FROM @~
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
(C) Loading from Internal Table Stage
COPY INTO sales_table
FROM @%sales_table
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);
(D) Loading from Named Internal Stage
COPY INTO sales_table
FROM @my_internal_stage
FILE_FORMAT = (FORMAT_NAME = my_csv_format);

5. Best Practices for Using Stages


Use External Stages for large datasets stored in cloud storage.
Use Named Internal Stages for flexibility across multiple tables.
Compress Files (GZIP, Parquet) to reduce storage costs.
Monitor Staged Files using the LIST command:
LIST @my_internal_stage;
Use Auto-Ingest with Snowpipe for real-time data loading.

Scenario: Handling Data Load Errors and Optimizing Load Performance


Business Requirement:
You are loading customer data from an external S3 stage into Snowflake.
Sometimes, the files contain inconsistent data (e.g., missing values, long strings, or incorrect
data types).
You need to handle these errors efficiently while optimizing the data load.
Steps We’ll Cover:
1. Create a Sample Table for Customers
2. Set Up an External Stage (Amazon S3)
3. Practice Different COPY Command Options

Step 1: Create a Table in Snowflake


Let's define the customer_data table.
CREATE OR REPLACE TABLE customer_data (
customer_id INT,
customer_name STRING(50),
email STRING(100),
age INT,
city STRING(50)
);

Why are we limiting string sizes?


 This helps test the behavior of TRUNCATECOLUMNS and ENFORCE_LENGTH.

Step 2: Create an External Stage (S3 Bucket)


If your files are stored in AWS S3, create an external stage.
CREATE OR REPLACE STAGE customer_stage
URL = 's3://your-bucket-name/path/'
STORAGE_INTEGRATION = your_storage_integration;

What does this do?


 customer_stage is a reference to S3 files.
 STORAGE_INTEGRATION ensures secure access.
Verify Files in the Stage:
LIST @customer_stage;

Step 3: Load Data with Different COPY Command Options


1. Basic Data Load
COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1);
Loads data without any additional options.

2. Validation Mode (Check Errors Before Loading)


COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
VALIDATION_MODE = RETURN_ERRORS;

Returns error messages without inserting data.

3. Handling Errors with ON_ERROR


Skip bad records and load valid data
COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
ON_ERROR = CONTINUE;

Loads valid rows, skipping errors.


Skip entire file if too many errors
COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
ON_ERROR = SKIP_FILE_10; -- Skip file if 10 or more errors occur

Useful for large batch loads.

4. Force Load (Even If Already Loaded Before)


COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
FORCE = TRUE;

Forces reloading files.

5. Set Maximum Data Load Size (SIZE_LIMIT)


COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
SIZE_LIMIT = 5000000; -- Limit to 5MB

Stops loading if 5MB of data is reached.

6. Handle Long Strings (TRUNCATECOLUMNS vs. ENFORCE_LENGTH)


Truncate long text instead of failing
COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
TRUNCATECOLUMNS = TRUE;

Cuts long string values instead of throwing errors.


Fail if data is too long (default behavior)
COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
ENFORCE_LENGTH = TRUE;

Ensures data fits within defined column sizes.

7. Automatically Delete Files After Load (PURGE)


COPY INTO customer_data
FROM @customer_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' SKIP_HEADER=1)
PURGE = TRUE;

Deletes files from S3 after successful load.

Summary

Created a customer_data table


Created an S3 External Stage
Practiced multiple COPY command options
12. Loading Data from AWS S3, Azure, and GCP into Snowflake
Common Steps:
 Create storage integration between Snowflake and the cloud provider.
 Create an external stage to access cloud storage.
 Use COPY INTO to load data into Snowflake.

1. Loading Data from AWS S3 → Snowflake


Step 1: Create Storage Integration
CREATE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::account-id:role/snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/path/')
ENABLED = TRUE;

What this does:


 Grants Snowflake access to AWS S3 using an IAM Role.
 You must configure trust policy in AWS IAM.
Verify Integration:
DESC STORAGE INTEGRATION s3_int;

Step 2: Create an External Stage (S3)


CREATE OR REPLACE STAGE s3_stage
URL = 's3://your-bucket-name/path/'
STORAGE_INTEGRATION = s3_int
FILE_FORMAT = (TYPE = CSV SKIP_HEADER=1);

What this does:


 References the S3 bucket in Snowflake.
 Specifies CSV file format.
Verify Files in S3 Stage:
LIST @s3_stage;

Step 3: Load Data from S3 into Snowflake


COPY INTO your_table
FROM @s3_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"');

What this does:


 Loads data from S3 to Snowflake.

2. Loading Data from Azure Blob → Snowflake


Step 1: Create Storage Integration
CREATE STORAGE INTEGRATION azure_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'AZURE'
AZURE_TENANT_ID = '<tenant-id>'
STORAGE_ALLOWED_LOCATIONS = ('azure://youraccount.blob.core.windows.net/container-name/')
ENABLED = TRUE;

What this does:


 Grants Snowflake access to Azure Blob Storage.
 You must register Snowflake in Azure Active Directory (AAD).
Verify Integration:
DESC STORAGE INTEGRATION azure_int;

Step 2: Create an External Stage (Azure Blob)


CREATE OR REPLACE STAGE azure_stage
URL = 'azure://youraccount.blob.core.windows.net/container-name/'
STORAGE_INTEGRATION = azure_int
FILE_FORMAT = (TYPE = CSV SKIP_HEADER=1);

What this does:


 References Azure Blob Storage.
Verify Files in Azure Stage:
LIST @azure_stage;

Step 3: Load Data from Azure into Snowflake


COPY INTO your_table
FROM @azure_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"');

What this does:


 Loads data from Azure Blob Storage to Snowflake.

3. Loading Data from Google Cloud Storage (GCS) → Snowflake


Step 1: Create Storage Integration
CREATE STORAGE INTEGRATION gcs_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'GCS'
STORAGE_ALLOWED_LOCATIONS = ('gcs://your-bucket-name/path/')
ENABLED = TRUE;

What this does:


 Grants Snowflake access to GCS.
 You need to create a Google Cloud IAM Role.
Verify Integration:
DESC STORAGE INTEGRATION gcs_int;

Step 2: Create an External Stage (GCS)


CREATE OR REPLACE STAGE gcs_stage
URL = 'gcs://your-bucket-name/path/'
STORAGE_INTEGRATION = gcs_int
FILE_FORMAT = (TYPE = CSV SKIP_HEADER=1);

What this does:


 References Google Cloud Storage (GCS).
Verify Files in GCS Stage:
LIST @gcs_stage;
Step 3: Load Data from GCS into Snowflake
COPY INTO your_table
FROM @gcs_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"');

What this does:


 Loads data from GCS to Snowflake

Summary
AWS S3, Azure Blob, and GCP Storage all follow similar steps:
1. Create storage integration → Allows Snowflake to access cloud storage.
2. Create an external stage → Defines where data is stored.
3. Use COPY INTO → Loads data into Snowflake.
13. Snowpipe - Continuous Data Loading
1. What is Continuous Loading?
Loads small volumes of data continuously (e.g., every 10 mins, every hour)
Supports real-time or near-real-time data ingestion
Ensures latest data is available for analytics
Uses Snowpipe (a serverless feature) for automatic ingestion

2. What is Snowpipe?
-> A named database object that contains a COPY command
-> Loads data within minutes after files are added to a stage
-> Serverless & managed by Snowflake
-> One-time setup for automation
-> Prevents duplicate file loading
-> Optimal file size: 100-250 MB

3. How Snowpipe Works?


1. A file is added to a cloud storage location
2. Snowflake detects the new file and triggers Snowpipe
3. The COPY command inside the pipe loads the data
4. The file metadata is tracked to avoid duplicates

4. Steps to Create a Snowpipe


1. Create a Storage Integration (to connect Snowflake to cloud storage)
2. Create a Stage Object (to define cloud storage location)
3. Test the COPY Command (to ensure data loads correctly)
4. Create the Pipe using the COPY Command
5. Set Up Event Notifications (AWS S3, Azure Blob, or GCP to trigger Snowpipe)

5. Snowpipe Syntax
CREATE OR REPLACE PIPE PIPE_NAME
AUTO_INGEST = TRUE
AS COPY INTO <table_name> FROM @<stage_name>;

6. Snowpipe DDL Commands

Command | Purpose
CREATE PIPE | Creates a new Snowpipe
ALTER PIPE | Modifies a pipe (pause/resume)
DROP PIPE | Deletes a pipe
DESCRIBE PIPE | Shows pipe properties & ARN
SHOW PIPES | Lists all pipes

7. Troubleshooting Snowpipe Issues


Step 1: Check Pipe Status
SELECT SYSTEM$PIPE_STATUS('pipe_name');
If timestamps don’t match, check:
Cloud storage configuration (e.g., AWS SQS settings)
Snowflake stage object path
Step 2: View COPY History
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
table_name => 'your_table',
START_TIME => 'timestamp'
));

Check for errors or failed loads


Step 3: Validate Data Files
SELECT * FROM TABLE(INFORMATION_SCHEMA.VALIDATE_PIPE_LOAD(
PIPE_NAME => 'pipe_name',
START_TIME => 'timestamp'
));
Identifies errors in files before loading

8. Managing Snowpipes
-> View pipe properties:
DESC PIPE pipe_name;
-> List all pipes:
SHOW PIPES;
-> Pause/Resume a Pipe:
ALTER PIPE pipe_name SET PIPE_EXECUTION_PAUSED = TRUE; -- Pause
ALTER PIPE pipe_name SET PIPE_EXECUTION_PAUSED = FALSE; -- Resume
-> When to Pause & Resume?
Before modifying the stage object
Before modifying the file format object
Before modifying the COPY command
To modify the COPY command, you must recreate the pipe!

Final Takeaways
-> Snowpipe automates continuous data loading
-> Prevents duplicate file loading with metadata tracking
-> Requires proper event notifications in cloud storage
-> Monitor pipe status & copy history for troubleshooting
14. Time Travel & Fail-safe in Snowflake
1. What is Time Travel?
Allows access to historical data that has been changed or deleted
Restores tables, schemas, and databases that were dropped
Enables querying past data at any point within the retention period
Used for data analysis, backup, and auditing
No need to enable manually; it is enabled by default

2. Retention Periods
Determines how long historical data is stored
Higher retention → Higher storage cost

Snowflake Edition | Retention Period
Standard | 1 day (can be set to 0)
Enterprise & Higher | 0-90 days (default is 1 day)

Retention can be modified using ALTER command


Setting retention to 0 disables Time Travel
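For example, a sketch of changing a single table's retention (the 30-day setting assumes an Enterprise or higher edition):
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 30;
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 0; -- disables Time Travel for this table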

3. Querying Historical Data


Query at a Specific Timestamp
SELECT * FROM my_table AT(TIMESTAMP => '2025-02-27 16:20:00'::timestamp_tz);
Query Data as of 5 Minutes Ago
SELECT * FROM my_table AT(OFFSET => -60*5);
Query Data Before a Specific Query Execution
SELECT * FROM my_table BEFORE(STATEMENT => 'query_id');

4. Restoring Dropped Objects


Dropped objects (tables, schemas, databases) remain in Snowflake during the retention
period
Once the retention period expires, the object is permanently deleted
Restore a Dropped Table, Schema, or Database
UNDROP TABLE my_table;
UNDROP SCHEMA my_schema;
UNDROP DATABASE my_database;

5. Fail-safe (Last Resort Data Recovery)


** Fail-safe is not user-accessible; Snowflake Support must be contacted
** Cannot query or restore Fail-safe data directly
** Takes hours to days for recovery
Provides a 7-day recovery period after Time Travel ends
Ensures compliance and disaster recovery

6. Continuous Data Protection Lifecycle


1. Time Travel (Up to 90 days)
2. Fail-safe (7 days, recovery by Snowflake Support)
3. After Fail-safe, data is permanently lost

Key Takeaways
-> Use Time Travel for quick recovery & historical analysis
-> Fail-safe is a last resort but requires Snowflake Support
-> Higher retention = higher cost, choose wisely
-> Always back up critical data before retention expires
15. Zero-Copy Cloning in Snowflake
1. What is Zero-Copy Cloning?
Creates an instant copy of a table, schema, or database without duplicating storage
No additional storage cost at the time of cloning
Snapshot of source data is taken at the moment of cloning
Cloned object is independent of the source object
Changes in the source or clone do not affect each other

2. Key Use Cases


Clone production data into Dev/Test environments for safe testing
🗄 Take instant backups before making critical changes

3. Syntax for Cloning


CREATE OR REPLACE <CLONED_OBJECT_TYPE> <CLONED_OBJECT_NAME> CLONE
<SOURCE_OBJECT>;
Example: Clone a table
CREATE OR REPLACE TABLE new_table CLONE existing_table;
Example: Clone a schema
CREATE OR REPLACE SCHEMA test_schema CLONE prod_schema;
Example: Clone a database
CREATE OR REPLACE DATABASE dev_db CLONE prod_db;

4. What Can Be Cloned?


Data Storage Objects
Databases
Schemas
Tables
Streams
Data Configuration Objects
File Formats
Stages
Tasks
5. How Zero-Copy Cloning Works?
1. When a clone is created, it shares the same storage blocks as the original
2. New changes in either object do not affect the other
3. If rows are modified in the clone or source, only the modified rows take extra storage
4. Efficient for quick backups and testing environments
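A small sketch of points 2-3, reusing the sales table from earlier sections (table and column names are assumptions):
CREATE OR REPLACE TABLE sales_clone CLONE sales;       -- instant; shares the existing micro-partitions
UPDATE sales_clone SET amount = 0 WHERE order_id = 1;  -- only the changed rows consume new storage
SELECT amount FROM sales WHERE order_id = 1;           -- the source row is unaffected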

Key Takeaways
-> Fast cloning without extra storage cost initially
-> Ideal for backups, testing, and versioning
-> Changes in the clone do not affect the original
-> Supports various objects (tables, schemas, databases, tasks, etc.)
16. Snowflake Table Types – Permanent vs Transient vs Temporary
1. Types of Tables in Snowflake
Permanent Tables (Default, standard storage with fail-safe)
Transient Tables (No fail-safe, short retention period)
Temporary Tables (Session-specific, auto-dropped after session ends)

2. Permanent Tables
-> Default table type in Snowflake
-> Exists until explicitly dropped
-> Supports Time Travel (0-90 days, depending on edition)
-> Has a 7-day Fail-Safe period
Best for: Storing business-critical data
Syntax:
CREATE TABLE my_table (id INT, name STRING);

3. Transient Tables
-> Similar to Permanent Tables but with no Fail-Safe
-> Shorter retention period (0-1 day)
-> Exists until explicitly dropped
-> Best for temporary or intermediate processing
Best for: Staging tables, intermediate results
Syntax:
CREATE TRANSIENT TABLE my_transient_table (id INT, name STRING);

4. Temporary Tables
-> Exists only within the session
-> Automatically dropped when session ends
-> Not visible to other sessions or users
-> No fail-safe period
Best for: Development, testing, and temporary processing in stored procedures
Syntax:
CREATE TEMPORARY TABLE my_temp_table (id INT, name STRING);

5. Comparison of Table Types


Table Type | Persistence | Time Travel Retention | Fail-Safe
Permanent | Until explicitly dropped | 0-90 days (depends on edition) | 7 days
Transient | Until explicitly dropped | 0-1 day (default 1) | No fail-safe
Temporary | Only for session duration | 0-1 day (default 1) | No fail-safe

6. Key Points to Remember


-> Tables cannot be converted from one type to another
-> Transient databases/schemas default to transient tables
-> Temporary tables take precedence if a table of the same name exists
-> Find table type using:
SHOW TABLES;
(Look at the ‘Kind’ field for the table type)

17. External Tables in Snowflake


1. What is an External Table?
• External Tables allow querying data stored in external storage (Amazon S3, Azure Blob,
GCP) without loading it into Snowflake.
• They store metadata about files but not actual data.
• Read-Only – No INSERT, UPDATE, DELETE operations allowed.
• Can be used in queries, joins, views, and materialized views.
• Slower performance than normal Snowflake tables.
• Useful for analyzing raw data without storage costs.
External Storage Locations:
Amazon S3
Google Cloud Storage (GCS)
Azure Blob Storage

2. Metadata of External Tables


External tables include pseudocolumns to track metadata.

Metadata Column | Description
VALUE | VARIANT column representing each row from the external file.
METADATA$FILENAME | Filename & path in the stage storage.
METADATA$FILE_ROW_NUMBER | Row number in the staged file.

Example Query:
SELECT VALUE, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
FROM SAMPLE_EXT;

Helps in tracking file origin and row positioning.

3. How to Create an External Table?


Steps to Create an External Table
1. Create a File Format
2. Create an External Stage (Cloud Storage Location)
3. Create the External Table
External Table Syntax
CREATE EXTERNAL TABLE <table_name>
( column_definitions )
WITH LOCATION = <external_stage>
FILE_FORMAT = <file_format_object>;
Example: Create an External Table
1. Create a File Format
CREATE FILE FORMAT MYS3CSV
TYPE = CSV
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
SKIP_HEADER = 1;
2. Create an External Stage (S3 Example)
CREATE STAGE MYS3STAGE
URL = 's3://mybucket/data/'
CREDENTIALS = (AWS_KEY_ID='xxxx' AWS_SECRET_KEY='xxxx')
FILE_FORMAT = MYS3CSV;

3. Create an External Table


CREATE OR REPLACE EXTERNAL TABLE SAMPLE_EXT (
ID INT AS (VALUE:C1::INT),
NAME VARCHAR(20) AS (VALUE:C2::VARCHAR),
DEPT INT AS (VALUE:C3::INT)
)
WITH LOCATION = @MYS3STAGE
FILE_FORMAT = MYS3CSV;

Table references files in S3 without loading them into Snowflake!

4. Refreshing External Tables


• External tables auto-refresh to sync metadata with new files.
• What happens during refresh?
New files are added to the table metadata.
Modified files update existing metadata.
Deleted files are removed from the metadata.
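The metadata can also be refreshed manually, for example on the SAMPLE_EXT table defined above:
ALTER EXTERNAL TABLE SAMPLE_EXT REFRESH;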

5. Key Benefits of External Tables


No need to load data into Snowflake storage.
Lower storage costs for data that doesn’t change often.
Useful for data lakes, raw logs, and semi-structured data.
Compatible with Snowflake queries, views, and materialized views.
Use External Tables when you want to analyze data stored externally without copying it
into Snowflake.
18. Snowflake Access Control – Roles & Privileges
1. What is Access Control?
• Access control defines who can access what in Snowflake.
• Two types of access control models:
 Discretionary Access Control (DAC) – Object owner grants access.
 Role-Based Access Control (RBAC) – Access is given to roles, and roles are assigned
to users.

2. Key Concepts

Concept | Description
Securable Object | Any entity like tables, schemas, databases, views, warehouses, etc.
Role | A set of privileges that can be granted to users or other roles.
Privilege | The level of access (e.g., SELECT, INSERT, DELETE, etc.).
User | People or system accounts to whom roles are assigned.

3. Privileges in Snowflake

Privilege | Usage
SELECT | Read data from a table.
INSERT | Add new rows to a table.
UPDATE | Modify existing data.
DELETE | Remove rows from a table.
TRUNCATE | Remove all rows from a table.
ALL PRIVILEGES | Grant all permissions except OWNERSHIP.
OWNERSHIP | Full control over an object.

To grant privileges:
GRANT SELECT ON TABLE sales TO role analyst;
To revoke privileges:
REVOKE SELECT ON TABLE sales FROM role analyst;

4. Object Hierarchy in Snowflake


Access is granted at different levels:
Database → Schema → Tables, Views, Stages, File Formats, etc.

5. Roles in Snowflake
System-Defined Roles
1 ORGADMIN – Manages organization-level operations (e.g., creating accounts).
2 ACCOUNTADMIN – Highest role in an account, manages users, roles, usage, and billing.
3 SECURITYADMIN – Manages grants and security policies.
4 USERADMIN – Creates and manages users & roles.
5 SYSADMIN – Creates warehouses, databases, and other objects.
6 PUBLIC – Default role assigned to all users.
To check current role:
SELECT CURRENT_ROLE();
To switch role:
USE ROLE SYSADMIN;

6. Role Hierarchy in Snowflake


┌───────────────┐
│ ACCOUNTADMIN │
└───────────────┘

┌───────────────┐
│ SECURITYADMIN │
└───────────────┘

┌───────────────┐
│ USERADMIN │
└───────────────┘

┌───────────────┐
│ SYSADMIN │
└───────────────┘

┌───────────────┐
│ CUSTOM ROLES │
└───────────────┘
Key Points:
• Custom roles should be assigned under SYSADMIN.
• ACCOUNTADMIN inherits all system roles.
• PUBLIC role is assigned to all users automatically.

7. Custom Roles
• Created by USERADMIN or any role with CREATE ROLE privilege.
• Not assigned to any user by default.
• Best practice: Assign custom roles under SYSADMIN for better management.
Create a custom role:
CREATE ROLE analyst_role;
Assign privileges to the custom role:
GRANT SELECT, INSERT ON TABLE sales TO ROLE analyst_role;
Assign the role to a user:
GRANT ROLE analyst_role TO USER madhu;

8. Best Practices for Access Control


• Follow Role-Based Access Control (RBAC).
• Limit ACCOUNTADMIN access to a few users.
• Use custom roles instead of granting permissions directly to users.
• Regularly audit user roles to remove unnecessary access.
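Statements commonly used for such audits (role, user, and table names reuse the examples above):
SHOW GRANTS TO ROLE analyst_role;  -- privileges granted to the role
SHOW GRANTS TO USER madhu;         -- roles granted to the user
SHOW GRANTS ON TABLE sales;        -- which roles can access the table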

19. Snowflake Views – Normal, Secure & Materialized

1. What are Views?


Views are useful for exposing selected rows and columns from one or more tables. A view makes it possible to obtain the result of a query as if it were a table; the CREATE VIEW statement defines that query. Snowflake supports two different types of views:
Non-materialized views (often simply called "views") - The results are computed by executing the defining query at the moment the view is referenced. Compared to materialized views, performance is slower.
Materialized views - Although named as a type of view, a materialized view behaves more like a table in many respects: its results are stored similarly to a table's data. This allows faster access, but it requires storage space and active maintenance, both of which incur extra cost.

• Views are virtual tables based on a SQL query.


• They don’t store data; they retrieve it dynamically from base tables.
• Uses of Views:
 Combine and filter data.
 Restrict access to sensitive columns.
 Simplify complex queries.
Creating a View:
CREATE OR REPLACE VIEW sales_view AS
SELECT product_id, revenue FROM sales;

2. Types of Views in Snowflake

Type | Description
Normal View (Non-Materialized View) | Standard view that executes its SQL dynamically.
Secure View | Hides the underlying query from unauthorized users.
Materialized View | Stores precomputed results for faster query performance.

3. Secure Views
• A secure view hides its SQL definition from unauthorized users.
• Only users with the required role can see its definition.
• Useful for:
 Data Security – Hide sensitive logic.
 Access Control – Restrict users from seeing base tables.
Creating a Secure View:
CREATE OR REPLACE SECURE VIEW sales_secure_view AS
SELECT product_id, revenue FROM sales;

Checking if a View is Secure:


SELECT table_name, is_secure
FROM mydb.information_schema.views
WHERE table_name = 'SALES_SECURE_VIEW';

4. Materialized Views
• Unlike normal views, Materialized Views store precomputed results.
• Improves performance for repetitive queries on large datasets.
• Cannot be created on multiple tables or complex queries.

Creating a Materialized View:


CREATE OR REPLACE MATERIALIZED VIEW sales_mv AS
SELECT product_id, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_id;

Checking Materialized Views:


SHOW MATERIALIZED VIEWS LIKE 'sales_mv';

5. Refreshing Materialized Views


• No manual refresh required – Snowflake automatically updates them.
• A background process refreshes the view when the base table changes.
• Refreshes typically complete within minutes of the change.
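To see when refreshes ran and what they cost, the Information Schema table function for materialized view refresh history can be queried (a sketch; materialized views require an edition that supports them, and the view name is resolved in uppercase):
SELECT *
FROM TABLE(INFORMATION_SCHEMA.MATERIALIZED_VIEW_REFRESH_HISTORY(
    MATERIALIZED_VIEW_NAME => 'SALES_MV'));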

6. Cost of Materialized Views


Storage Cost: Since the view stores data, it increases storage usage.
Compute Cost: Snowflake automatically refreshes the view using its compute resources.

7. When to Use Materialized Views?


Use Materialized Views when:
 Query results don’t change often.
 Query runs frequently.
 Query takes a long time to execute (e.g., aggregations).
Use Regular Views when:
 Data changes frequently.
 Query is simple or uses multiple tables.
 View results are not accessed often.

8. Advantages of Materialized Views


• Faster Performance – Precomputed results reduce query time.
• No Manual Refreshing – Snowflake handles updates automatically.
• Always Up-to-Date – Even with frequent DML on base tables.
9. Limitations of Materialized Views
• Can be created on only one table (no joins).
• Doesn’t support all aggregate & window functions.
• Cannot reference:
1. Another Materialized View.
2. A Regular View.
3. A User-Defined Function (UDF).

20. Dynamic Data Masking in Snowflake


1. Column-Level Security
• Protects sensitive data (e.g., PHI, bank details) by applying security policies at the column
level.
• Two methods:
 Dynamic Data Masking – Masks data dynamically based on user roles.
 External Tokenization – Replaces data with ciphertext using external cloud
functions.

2. Masking Policies
• Schema-level objects that define how data should be masked.
• Applied dynamically at query runtime without modifying actual data.
• One policy can be applied to multiple columns.

3. Dynamic Data Masking


• Data remains unchanged in the table.
• The query result varies based on the user’s role.
• Data can be fully masked, partially masked, obfuscated, or tokenized.
• Unauthorized users can perform operations but cannot view raw data.
Example:

Role          SSN Output
PAYROLL       123-45-6789
Other Users   ******
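As an illustration of this behavior (a sketch assuming the employee table and policy shown below, plus a hypothetical ANALYST role), the same query returns different results depending on CURRENT_ROLE():
USE ROLE PAYROLL;
SELECT ssn FROM employee;   -- returns 123-45-6789

USE ROLE ANALYST;
SELECT ssn FROM employee;   -- returns ******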

4. Creating Masking Policies


• Based on Role
CREATE MASKING POLICY employee_ssn_mask AS (val STRING)
RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('PAYROLL') THEN val
ELSE '******'
END;

• Based on Conditions
CREATE MASKING POLICY email_visibility AS
(email VARCHAR, visibility STRING) RETURNS VARCHAR ->
CASE
WHEN CURRENT_ROLE() = 'ADMIN' THEN email
WHEN visibility = 'Public' THEN email
ELSE '***MASKED***'
END;

5. Applying Masking Policies


• Applied at the column level.
• Can be used on multiple tables and views.
Apply Masking Policy to a Column:
ALTER TABLE public.employee
MODIFY COLUMN ssn SET MASKING POLICY employee_ssn_mask;

Apply Multiple Masking Policies at Once:


ALTER TABLE public.employee
MODIFY COLUMN ssn SET MASKING POLICY employee_ssn_mask,
MODIFY COLUMN email SET MASKING POLICY email_visibility USING(email, visibility);

6. Removing Masking Policies


Unset a Masking Policy:
ALTER TABLE public.employee
MODIFY COLUMN ssn UNSET MASKING POLICY;

Unset Multiple Policies at Once:


ALTER TABLE public.employee
MODIFY COLUMN ssn UNSET MASKING POLICY,
MODIFY COLUMN email UNSET MASKING POLICY;

7. Altering & Dropping Policies


Modify an Existing Policy:
ALTER MASKING POLICY employee_ssn_mask SET BODY ->
CASE
WHEN CURRENT_ROLE() IN ('HR') THEN val
ELSE '#####'
END;
Rename a Masking Policy:
ALTER MASKING POLICY employee_ssn_mask RENAME TO ssn_protection_policy;
Drop a Masking Policy (After Unsetting from Tables):
DROP MASKING POLICY employee_ssn_mask;

8. Limitations
• Must unset a policy before dropping it.
• Input and output data types must match.
• Does not encrypt data, only masks it at query runtime.

21. Data Sharing


1.What is Data Sharing?

Snowflake’s data sharing feature allows organizations to securely share data across
different Snowflake accounts without the need to copy or move the data. This is done in
real time, meaning that data can be shared as soon as it is available, with no delays.
The sharing process works by creating a share in Snowflake that contains selected data
(tables, views, schemas, etc.) and then granting access to another Snowflake account. This
access is read-only, so the recipient can query the shared data but cannot modify it.
Data sharing is secure and governed by Snowflake’s role-based access control (RBAC),
ensuring that only authorized users have access to the data. This feature is commonly used
for sharing data between business partners or departments within a large organization,
without the overhead of data duplication.

• Securely share data with both Snowflake and non-Snowflake users.


• Consumers can query shared data using their own compute resources.
• No data duplication – shared data remains in the provider’s Snowflake account.
Key Roles:
 Provider – The account sharing the data by creating a share object.
 Consumer – The account accessing the shared data.

2. What Can Be Shared?


• Supported Objects for Sharing:
Tables
External Tables
Secure Views
Secure Materialized Views
Secure UDFs

3. What is a Share?
• A Share is a named database object that includes:
 The database & schema being shared
 The grants (permissions) on objects
 The consumer account details
Creating a Share:
CREATE SHARE my_share;
Adding a Database to the Share:
GRANT USAGE ON DATABASE my_database TO SHARE my_share;
Adding a Table to the Share:
GRANT SELECT ON TABLE my_database.public.my_table TO SHARE my_share;
Assigning a Consumer to the Share:
ALTER SHARE my_share ADD ACCOUNTS = <consumer_account_identifier>;
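On the consumer side, the shared objects are typically made queryable by creating a database from the share. A minimal sketch, assuming the provider’s account identifier is provider_acct and the consumer runs this with a role that has the IMPORT SHARE privilege:
-- Run in the consumer account
CREATE DATABASE shared_sales FROM SHARE provider_acct.my_share;
SELECT * FROM shared_sales.public.my_table;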

4. Reader Accounts (For Non-Snowflake Users)


• Data sharing is only supported between Snowflake accounts by default.
• If the consumer doesn’t have a Snowflake account, we can create a Reader Account.
• Reader accounts are read-only (No DML operations like INSERT, UPDATE, DELETE).
• Provider pays for the compute usage of the reader account.

Creating a Reader Account:


CREATE MANAGED ACCOUNT reader_account
ADMIN_NAME = 'reader_admin'
ADMIN_PASSWORD = 'StrongPassword123'
TYPE = READER;

Granting Access to the Reader Account:


GRANT USAGE ON DATABASE my_database TO SHARE my_share;
GRANT SELECT ON TABLE my_database.public.my_table TO SHARE my_share;
ALTER SHARE my_share ADD ACCOUNTS = reader_account;

5. Benefits of Secure Data Sharing


• No Data Duplication – Consumers query the shared data without copying it.
• Live & Real-Time Data – Changes made by the provider are instantly visible to the
consumer.
• Cross-Cloud Sharing – Data can be shared across AWS, Azure, and GCP.
• Cost-Effective – No storage costs for consumers (they only pay for compute).

6. Limitations
** Consumers cannot modify the shared data.
** Cannot share non-secure views or standard materialized views.
** Providers must manage access control.
22. Scheduling in Snowflake Using Tasks
1. What is a Task?
• Tasks in Snowflake allow scheduling SQL execution at defined intervals.
• Used for automating ETL processes, stored procedure execution, and change data
capture (CDC).
• Can be Snowflake-managed (serverless) or user-managed (virtual warehouses).
Use Cases:
Automating batch processing
Implementing CDC (Change Data Capture)
Running stored procedures on schedule
Managing dependencies between tasks (DAG – Directed Acyclic Graph)

2. How to Create a Task?


• Use the CREATE TASK command to define a new task.
• Tasks can be time-based (using CRON) or dependency-based (AFTER another task).
Example 1: Task to Insert Data Every 10 Minutes
CREATE OR REPLACE TASK CUSTOMER_INSERT
WAREHOUSE = COMPUTE_WH
SCHEDULE = '10 MINUTE'
AS
INSERT INTO CUSTOMERS (CREATE_DATE) VALUES (CURRENT_TIMESTAMP);

Example 2: Task to Call a Stored Procedure Daily at 9:30 UTC


CREATE OR REPLACE TASK CUSTOMER_LOAD
WAREHOUSE = MY_WH
SCHEDULE = 'USING CRON 30 9 * * * UTC'
AS
CALL PROC_LOAD_CUSTOMERS();

3. Altering a Task
• Modify task properties using ALTER TASK.
Modify Schedule, Dependencies, or Query:
ALTER TASK emp_task SET SCHEDULE = '5 MINUTE';
ALTER TASK emp_task SUSPEND;
ALTER TASK task_dept ADD AFTER task_emp;
ALTER TASK task_dept REMOVE AFTER task_emp;
Resuming a Suspended Task:
ALTER TASK emp_task RESUME;

4. Using CRON for Scheduling


• CRON syntax allows flexible time-based scheduling.
• Format: minute hour day-of-month month day-of-week timezone
Examples:
Every Day at 9:30 AM UTC:
SCHEDULE = 'USING CRON 30 9 * * * UTC'
Every Monday at 12 AM UTC:
SCHEDULE = 'USING CRON 0 0 * * 1 UTC'
Every Hour:
SCHEDULE = 'USING CRON 0 * * * * UTC'

5. DAG of Tasks (Task Dependencies)


• DAG (Directed Acyclic Graph) ensures tasks run in sequence.
• Root task triggers dependent child tasks automatically.
Example DAG:
CREATE OR REPLACE TASK TASK_A
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 30 9 * * * UTC'
AS <SQL query 1>;

CREATE OR REPLACE TASK TASK_B
WAREHOUSE = COMPUTE_WH
AFTER TASK_A
AS <SQL query 2>;

CREATE OR REPLACE TASK TASK_C
WAREHOUSE = COMPUTE_WH
AFTER TASK_A
AS <SQL query 3>;

CREATE OR REPLACE TASK TASK_D
WAREHOUSE = COMPUTE_WH
AS <SQL query 4>;

ALTER TASK TASK_D ADD AFTER TASK_B;
ALTER TASK TASK_D ADD AFTER TASK_C;
Execution Order:
TASK_A runs first
Then TASK_B and TASK_C run in parallel
TASK_D runs after TASK_B and TASK_C complete

6. Checking Task History


• Use TASK_HISTORY to monitor task execution status.
View All Task Executions (Latest First):
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
ORDER BY scheduled_time DESC;

Check History for a Specific Task (Last 6 Hours):


SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
scheduled_time_range_start => DATEADD('HOUR', -6, CURRENT_TIMESTAMP()),
task_name => 'Task Name'
));
Check Task History for a Given Time Range:
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
scheduled_time_range_start => TO_TIMESTAMP_LTZ('2022-07-17 10:00:00.000 -0700'),
scheduled_time_range_end => TO_TIMESTAMP_LTZ('2022-07-17 11:00:00.000 -0700')
));

7. Troubleshooting Tasks
If your task is not running, follow these steps:
• Step 1: Check Task Status
SHOW TASKS;
➡ If the status is SUSPENDED, resume it using:
ALTER TASK my_task RESUME;
• Step 2: Check Task History for Failures
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE state = 'FAILED';

➡ Identify Query ID and check error details.


• Step 3: Verify Permissions
➡ Ensure the task owner has permissions to the warehouse, database, and tables.
• Step 4: If Using Streams, Verify Changes Exist
➡ Run the following to check if the stream has data:
SELECT SYSTEM$STREAM_HAS_DATA('my_stream');
➡ If result = FALSE, no new data to process.

8. Summary
Tasks help automate SQL execution in Snowflake.
Supports time-based scheduling (CRON) and dependency-based execution (DAG).
Monitor & troubleshoot tasks using TASK_HISTORY.
Ensure permissions & check stream data availability for CDC.

23. Streams in Snowflake (Change Data Capture - CDC)


1. What is a Stream?
• Streams in Snowflake track row-level changes (INSERT, UPDATE, DELETE) made to tables.
• They store metadata about these changes but do not store actual data.
• Used for Change Data Capture (CDC) and real-time data processing.
• Streams work with Tasks to automate data movement.
Use Case:
Detect new records (INSERTs)
Track modifications (UPDATEs as DELETE + INSERT pairs)
Identify deleted records (DELETEs)
Merge changes into a target table
Continuous Data Pipelines:
Snowpipe + Stream + Task → Real-time Data Processing

2. Metadata of Streams
• Each stream maintains metadata for tracking DML changes:

Metadata Column Description

METADATA$ACTION DML Operation Type: INSERT, DELETE

METADATA$ISUPDATE Part of an UPDATE? TRUE (UPDATE) / FALSE (INSERT/DELETE)

METADATA$ROW_ID Unique ID for tracking row-level changes

How Updates Are Stored?


➡ UPDATE = DELETE (Old Row) + INSERT (New Row)
➡ METADATA$ISUPDATE = TRUE

3. How a Stream Works? (Data Flow)


• Stream does not store changed data, it tracks changes using an offset pointer.
• When changes are consumed, offset moves forward.
• Once consumed, changes are no longer available in the stream.
• If multiple tables need to consume the same changes, create multiple streams.
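For example (a sketch assuming a hypothetical change_log table whose columns match the stream output, including the METADATA$ columns), consuming the stream inside a DML statement advances its offset:
-- Reading the stream inside a DML statement consumes the recorded changes
INSERT INTO change_log SELECT * FROM my_stream;

-- The offset has moved forward, so the stream is empty until new DML occurs on the base table
SELECT SYSTEM$STREAM_HAS_DATA('my_stream');   -- returns FALSE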

4. Consuming Data from Streams


• Use MERGE to apply stream changes to target tables.
Identify Insert Records:
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = 'FALSE'
Identify Update Records:
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = 'TRUE'
Identify Delete Records:
WHERE METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = 'FALSE'
MERGE Stream Data into Target Table:
MERGE INTO target_table AS T
USING (SELECT * FROM my_stream) AS S
ON T.ID = S.ID
WHEN MATCHED AND S.METADATA$ACTION = 'DELETE' THEN DELETE
WHEN MATCHED AND S.METADATA$ISUPDATE = 'TRUE' THEN UPDATE SET T.name =
S.name
WHEN NOT MATCHED THEN INSERT (ID, name) VALUES (S.ID, S.name);

Applies CDC logic to target table automatically.

5. Types of Streams
1. Standard Streams
• Tracks INSERTs, UPDATEs, DELETEs, and TRUNCATEs.
• Best for full change tracking (CDC).
Create a Standard Stream:
CREATE OR REPLACE STREAM my_stream ON TABLE my_table;

2. Append-Only Streams
• Tracks only INSERT operations.
• Ignores DELETEs and UPDATEs.
• Best for append-only tables (logs, event data, etc.).
Create an Append-Only Stream:
CREATE OR REPLACE STREAM my_stream ON TABLE my_table APPEND_ONLY = TRUE;

3. Insert-Only Streams (For External Tables)


• Tracks only INSERTs for External Tables.
• Deletes are NOT tracked.
Create an Insert-Only Stream:
CREATE OR REPLACE STREAM my_stream ON EXTERNAL TABLE my_table INSERT_ONLY =
TRUE;
6. Summary
Streams track table changes (INSERT, UPDATE, DELETE) without storing actual data.
Metadata columns help in identifying DML operations.
Use MERGE to apply changes to target tables.
Different types of streams for different use cases (Standard, Append-Only, Insert-Only).
Combine Streams + Tasks + Snowpipe for real-time data pipelines.

24. User-Defined Functions (UDF) in Snowflake


1. What is a UDF?
• User-Defined Functions (UDFs) allow custom operations not available in built-in
functions.
• Useful when the same logic needs to be reused multiple times.
• Supports overloading (same function name, different parameters).
• Return Types:
 Scalar UDFs → Return a single value.
 Tabular UDFs → Return multiple rows.
• Supported Languages:
SQL
JavaScript
Java
Python

2. Sample UDFs
Scalar UDF (Returns a Single Value)
Example: Area of a Circle
CREATE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS
$$
PI() * radius * radius
$$;
Usage:
SELECT area_of_circle(4.5);
Returns: 63.617251

Tabular UDF (Returns Multiple Rows)


Example: Returning Sample Data
CREATE FUNCTION sample_people()
RETURNS TABLE (name VARCHAR, age NUMBER)
AS
$$
SELECT 'Ravi', 34
UNION
SELECT 'Latha', 27
UNION
SELECT 'Madhu', 25
$$;

Usage:
SELECT * FROM TABLE(sample_people());
Returns:
Name Age

Ravi 34

Latha 27

Madhu 25

3. Key Benefits of UDFs


Reusable & Modular – Avoid repeating logic in multiple places.
Overloading – Can define multiple UDFs with the same name but different parameters.
Custom Processing – Extend Snowflake’s built-in functions.
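As a sketch of overloading (hypothetical function name add_values), two UDFs can share a name as long as their parameter lists differ:
CREATE OR REPLACE FUNCTION add_values(a NUMBER, b NUMBER)
RETURNS NUMBER
AS
$$
  a + b
$$;

CREATE OR REPLACE FUNCTION add_values(a NUMBER, b NUMBER, c NUMBER)
RETURNS NUMBER
AS
$$
  a + b + c
$$;

SELECT add_values(1, 2), add_values(1, 2, 3);   -- returns 3 and 6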

25. Stored Procedures in Snowflake


1. What is a Stored Procedure?
• Stored procedures allow you to write procedural code that includes:
SQL statements
Conditional statements (IF, CASE, etc.)
Looping (FOR, WHILE, etc.)
Cursors
• Supported Languages:
SQL (Snowflake Scripting)
JavaScript
Java
Scala
Python
• Key Features:
 Supports branching and looping.
 Can return single values or tabular data.
 Can dynamically generate and execute SQL.

2. Sample Stored Procedure


Example: Insert Data Using JavaScript
CREATE OR REPLACE PROCEDURE LOAD_TABLE1()
RETURNS VARCHAR
LANGUAGE javascript
AS
$$
var rs = snowflake.execute( { sqlText:
`INSERT INTO table1 ("column 1")
SELECT 'value 1' AS "column 1";`
});
return 'Done';
$$;

Execution:
CALL LOAD_TABLE1();
Returns: 'Done'
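For comparison, here is a minimal Snowflake Scripting (SQL) sketch that uses a variable and branching, assuming a sales table with a revenue column:
CREATE OR REPLACE PROCEDURE check_sales(threshold NUMBER)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  total NUMBER;
BEGIN
  -- Assign the aggregate result to the local variable
  SELECT SUM(revenue) INTO :total FROM sales;
  IF (total > :threshold) THEN
    RETURN 'Target met';
  ELSE
    RETURN 'Below target';
  END IF;
END;
$$;

CALL check_sales(100000);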

3. UDFs vs Stored Procedures

Feature                         UDF (User-Defined Function)                   Stored Procedure
Return Type                     Single value or table                         Single value or table
Logic Complexity                Simple calculations                           Complex logic with loops & conditions
Supports Dynamic SQL?           No                                            Yes
Supports Branching & Looping?   No                                            Yes
Use Case                        Reusable calculations (e.g., tax, discount)   ETL, data processing, dynamic queries
26. Caching in Snowflake



1 What is Caching?
• Caching is a temporary storage mechanism that stores copies of query results or data for
faster access in future queries.
• Benefits:
Improves performance
Reduces query cost
Speeds up repeated queries

2 Types of Caching in Snowflake


1. Query Results Cache (Results Cache)
• Location: Cloud Services Layer
• Retention: Cached data is available for 24 hours
• Scope: Shared across all Virtual Warehouses (VWs)
• Condition: Query must be identical to a previous one
• Invalidation: Cache is invalidated if:
 Underlying data changes
 Query is not identical (e.g., column reordering, subset of data)
Example:
SELECT * FROM EMPLOYEES WHERE DEPT = 'HR';
• If the same query runs within 24 hours, Snowflake retrieves it from cache instead of re-
executing it.
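To verify whether a query is being served from the results cache, the cache can be disabled for the current session, which is a common testing technique:
-- Disable the results cache for this session only, then re-enable it
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
SELECT * FROM EMPLOYEES WHERE DEPT = 'HR';   -- forces re-execution on the warehouse
ALTER SESSION SET USE_CACHED_RESULT = TRUE;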

2. Local Disk Cache


• Location: Inside Virtual Warehouse (VW) on SSD/RAM
• Scope: Stores data blocks (not results) fetched from Remote Storage
• Retention: Cache is lost when VW is suspended
• Subset Queries Work! (Unlike Query Results Cache)
• Depends on VW Size:
 X-Small VW → Limited caching capacity
 Larger VW → More data can be cached
Example:
1 First Query:
SELECT * FROM EMPLOYEES LIMIT 10000;
➡ 10,000 rows are cached in Local Disk Cache.
2 Second Query:
SELECT * FROM EMPLOYEES LIMIT 3000;
• Since 3,000 rows are a subset of the cached 10,000 rows, Snowflake retrieves them from
Local Disk Cache instead of Remote Storage.
3. Metadata Cache
 Scope: Global across Snowflake.
 Location: Snowflake's control plane.
 How It Works:
o Stores metadata like table structure, statistics, and query execution plans.
o Speeds up queries by eliminating the need to scan metadata from storage.
o Cached metadata is automatically refreshed when table schema or partitions
change.
Example:
 Running SHOW TABLES or DESCRIBE TABLE is much faster due to metadata caching.
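Similarly, simple aggregate lookups can often be answered from micro-partition metadata alone. A sketch assuming the EMPLOYEES table used earlier (HIRE_DATE is a hypothetical column):
-- Typically answered from metadata without scanning table data
SELECT COUNT(*) FROM EMPLOYEES;
SELECT MIN(HIRE_DATE), MAX(HIRE_DATE) FROM EMPLOYEES;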

4. Cloud Services Layer Cache


 Scope: Global across all users in an account.
 Location: Snowflake's Cloud Services layer.
 How It Works:
o Caches query plans and execution strategies.
o Helps optimize repeated queries across different users and sessions.

3. Key Differences: Query Results Cache vs Local Disk Cache

Feature                    Query Results Cache                       Local Disk Cache
Location                   Cloud Services Layer                      Virtual Warehouse (SSD/RAM)
Stores                     Query results                             Raw data blocks
Scope                      Shared across all warehouses              Limited to a single VW
Retention                  24 hours                                  Until VW is suspended
Supports Subset Queries?   No                                        Yes
Invalidation               Data changes, different query structure   VW suspension

27. Unloading Data in Snowflake


1. What is Unloading?
• Unloading refers to exporting data from a Snowflake table into a file stored in a stage
(Snowflake/internal/external).
• Use case: When you need to move data out of Snowflake for backup, reporting, or
migration purposes.

2. Steps to Unload Data


Step 1: Use COPY INTO to move data to a stage
COPY INTO @MYSTAGE
FROM EMPLOYEES;
Step 2: Download files from the stage
 From a Snowflake internal stage (using SnowSQL):
 GET @MYSTAGE file://<local_directory_path>;
 From S3 or Azure: Use the respective cloud storage tools (AWS CLI, Azure Storage
Explorer).

3. Syntax of COPY INTO (Unloading Data)


COPY INTO @STAGE_LOCATION
FROM TABLE_NAME
<OPTIONS>;

4. Unloading Options
Option             Description                               Example
OVERWRITE          Overwrites existing files                 OVERWRITE = TRUE
SINGLE             Exports data into a single file           SINGLE = TRUE
MAX_FILE_SIZE      Specifies max file size (in bytes)        MAX_FILE_SIZE = 10000000
INCLUDE_QUERY_ID   Adds a unique identifier to each file     INCLUDE_QUERY_ID = TRUE
DETAILED_OUTPUT    Shows file details (name, size, rows)     DETAILED_OUTPUT = TRUE

5. Example: Unload Data with Custom Options


• Unload EMPLOYEES table as a single file to @MYSTAGE
COPY INTO @MYSTAGE/emp_data.csv
FROM EMPLOYEES
FILE_FORMAT = (TYPE = CSV)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;
• Unload SALES table into multiple Parquet files with max file size
COPY INTO @MYSTAGE/sales_data/
FROM SALES
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 50000000
INCLUDE_QUERY_ID = TRUE;
Snowflake interview QnA

2. In Snowflake, how are data and information secured?

Snowflake incorporates several layers of security to protect data and ensure compliance:

 Encryption: All data in Snowflake is encrypted both in transit (using TLS/SSL) and at rest
(using AES-256 encryption). This ensures that data is protected during transmission and
storage.

 Role-based Access Control (RBAC): Snowflake uses RBAC to manage permissions and access
control. Users are assigned specific roles, and these roles determine the actions they can
perform and the data they can access.

 Multi-factor Authentication (MFA): Snowflake supports MFA, requiring users to provide


additional authentication beyond just a password, enhancing security.

 Data Masking: Snowflake supports dynamic data masking, which allows administrators to
mask sensitive data at the column level based on the user’s role.

 Network Policies: Snowflake provides network policies to control which IP addresses can
access Snowflake, adding an extra layer of security.

These features, combined with Snowflake’s rigorous compliance certifications, ensure data is
protected and meets regulatory requirements.

3. Is Snowflake an ETL (Extract, Transform, and Load) tool?

No, Snowflake is not an ETL tool by itself. It is primarily a cloud-based data warehouse that is
designed for data storage, querying, and analytics. However, Snowflake can be used in conjunction
with ETL tools like Informatica, Talend, and Matillion to extract, transform, and load data into
Snowflake. It also supports ELT (Extract, Load, Transform) workflows, where raw data is loaded into
Snowflake first, and transformations are performed within the warehouse.

4. Snowflake is what kind of database?

Snowflake is a cloud-based data warehouse designed for analytical processing. It supports


structured, semi-structured, and unstructured data and is optimized for fast querying and data
analysis. Snowflake is not just a traditional database but also an integrated platform for data
warehousing, data lakes, and data sharing.

5. How does Snowflake handle semi-structured data like JSON, Avro, and Parquet?
Snowflake provides native support for semi-structured data, allowing users to ingest, store, and
query data formats like JSON, Avro, Parquet, and XML without requiring any transformation before
loading. The platform uses a special data type called VARIANT to store semi-structured data.

When loading semi-structured data into Snowflake, users can store the data in VARIANT columns,
which can hold nested and complex data structures. Snowflake provides several built-in functions to
parse, query, and manipulate semi-structured data directly within SQL queries.

For example, users can use the : and [] operators, along with functions such as TO_VARIANT and
FLATTEN, to access and transform nested JSON objects. Snowflake’s support for semi-structured data
helps organizations avoid the need for pre-processing or conversion, making it easier to work with
diverse data sources.
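A minimal sketch, assuming a hypothetical raw_events table with a VARIANT column named payload:
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

-- Dot/bracket notation plus casts extract nested JSON values
SELECT payload:customer.name::STRING AS customer_name,
       payload:items[0].sku::STRING  AS first_sku
FROM raw_events;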

6. What are Snowflake’s best practices for performance optimization?

Snowflake offers several best practices to optimize query performance, including:

 Clustering: Use clustering keys on large tables to organize the data for faster access.
Snowflake automatically manages clustering, but for large tables or specific query patterns,
defining a cluster key can significantly improve performance.

 Micro-Partitioning: Snowflake automatically divides data into small, manageable partitions.


Query performance can be improved by ensuring that queries filter on partitioned columns,
reducing the amount of data that needs to be scanned.

 Query Optimization: Snowflake has an intelligent query optimizer that automatically


optimizes queries. However, users can improve performance by writing efficient queries,
avoiding complex joins, and limiting the number of queries run simultaneously on a single
virtual warehouse.

 Materialized Views: Use materialized views for frequently queried or aggregate data.
Materialized views store precomputed results, which can improve performance by reducing
the need for recalculating results on every query.

 Virtual Warehouses: Choose the right size for virtual warehouses based on workload. Virtual
warehouses can be resized vertically or horizontally to meet specific demands.

 Data Caching: Snowflake automatically caches query results, making subsequent queries
faster. Leveraging this cache by reusing previous query results can reduce the load on the
system and improve performance.

 Data Storage Optimization: Use compression for large datasets, and store only necessary
data to avoid large, unoptimized tables.

7. What are the advantages of Snowflake’s multi-cluster architecture?

Snowflake’s multi-cluster architecture offers several advantages:

 Concurrency Scaling: Snowflake automatically spins up multiple clusters to handle high


concurrency without performance degradation. This is especially useful for organizations
with many users or varied workloads.
 Separation of Compute and Storage: Compute and storage are decoupled, so users can
scale compute resources independently based on demand without affecting storage. This
flexibility allows Snowflake to handle multiple workloads simultaneously without conflicts.

 Zero Impact on Other Workloads: With multi-cluster architecture, different virtual


warehouses can run independently, ensuring that resource-intensive queries or tasks do not
impact others. For instance, heavy ETL processes can run on one cluster while another
cluster serves live analytics queries.

 Automatic Scaling: Snowflake automatically handles the creation and management of


additional compute clusters when needed, providing on-demand scalability to match
workload fluctuations.

These advantages make Snowflake particularly well-suited for environments with unpredictable
query loads, frequent data uploads, and large numbers of users.
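As an illustration of concurrency scaling, a multi-cluster warehouse might be defined like this (a sketch with hypothetical sizing values; multi-cluster warehouses require an edition that supports them):
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4          -- Snowflake adds clusters as concurrency grows
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;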

13. What is the use of Snowflake Connectors?

The Snowflake connector is a piece of software that allows us to connect to the Snowflake data
warehouse platform and conduct activities such as Read/Write, Metadata import, and Bulk data
loading.

The Snowflake connector can be used to execute the following tasks:

 Read data from or publish data to tables in the Snowflake data warehouse.

 Load data in bulk into a Snowflake data warehouse table.

 You can insert or bulk-load data into multiple tables at the same time by using the
multiple input connections functionality.

 Look up records from a table in the Snowflake data warehouse.

Following are the types of Snowflake Connectors:

 Snowflake Connector for Kafka

 Snowflake Connector for Spark

 Snowflake Connector for Python

14. Does Snowflake use Indexes?

No, Snowflake does not use traditional indexes. Instead, it relies on micro-partitions and metadata-based
pruning, which is one of the reasons Snowflake scales so well for analytical queries.

16. Does Snowflake maintain stored procedures?

Yes, Snowflake supports stored procedures. Like a function, a stored procedure is created once and
reused many times. It is created with the CREATE PROCEDURE command and executed with the CALL
command. Stored procedures can be written in Snowflake Scripting (SQL), JavaScript, Python, Java, or
Scala; the JavaScript API, for example, enables a procedure to execute database operations such as
SELECT, UPDATE, and CREATE.
17. How do we execute the Snowflake procedure?

Stored procedures allow us to create modular code comprising complicated business logic by combining
multiple SQL statements with procedural logic. A stored procedure is executed with the CALL command;
inside the procedure (for example, via the JavaScript API), the typical steps are:

 Run a SQL statement

 Extract the query results

 Extract the result set metadata

18. Explain Snowflake Compression

All data loaded into Snowflake is compressed automatically. Snowflake uses modern data compression
algorithms to compress and store the data, and customers pay for the compressed data, not the raw
size.

Following are the advantages of the Snowflake Compression:

 Storage expenses are lesser than original cloud storage because of compression.

 No storage expenditure for on-disk caches.

 Approximately zero storage expenses for data sharing or data cloning.

24. What strategies would you employ to optimize storage costs in Snowflake while maintaining
query performance?

To optimize storage costs in Snowflake while maintaining query performance, I would consider the
following strategies:

1. Implement appropriate data retention policies and leverage Time Travel judiciously.

2. Use column-level compression where applicable to reduce storage requirements.

3. Employ table clustering to improve data locality and query efficiency.

4. Utilize materialized views for frequently accessed query results.

5. Regularly archive or purge unnecessary data.

6. Take advantage of Snowflake's automatic data compression and deduplication features.

Look for candidates who can balance cost optimization with performance considerations. They
should understand Snowflake's storage architecture and be able to explain how different storage
strategies impact both costs and query performance. Follow up by asking about their experience in
implementing such strategies in real-world scenarios.

25. What Are Snowflake’s Roles and Why Are They Important?

Snowflake has a role-based access control (RBAC) model to enable secure access and data
protection. Some key aspects are:
 Roles centralize access control

 Roles can be assigned privileges like creating/accessing tables, operating warehouses

 Roles can have other roles granted to them

 Roles allow the grouping of privileges and inheritance of privileges

 Roles, combined with user-level controls such as multi-factor authentication, strengthen security

Proper role setup is crucial for access control and security.

27.Explain Snowflake Streams and Tasks.

Snowflake Streams capture changes to tables and provide change data to consumers in near real-
time. Tasks help run async pieces of code like ETL transformations.

Key differences:

 Streams capture changes; Tasks execute code

 Streams record changes continuously; Tasks run on a schedule or when triggered

 Streams are read-only; Tasks can transform data

 Streams are created on specific tables; Tasks can run any SQL statement or procedure

 Streams are for data capture; Tasks are for general-purpose execution

They can be used together to capture changes and process them.

28.What Is a Snowflake Pipeline?

In Snowflake, a pipeline refers to the architecture used for loading and transforming data. Key
aspects:

 Handles moving data from sources into Snowflake

 Handles transformation, cleansing, and business logic

 Has stages for landing raw data

 Tables for storing transformed data

 Tasks, streams, and Snowpipe form the pipeline

 ETL/ELT orchestration happens in pipelines before analytics

Pipelines enable large-scale cloud ETL.

29.How Can You Monitor and Optimize Snowflake's Performance?

Some ways to monitor and optimize performance are:

 Reviewing query history and query profiles to identify slow queries

 Checking warehouse utilization for optimal sizing


 Tuning queries to leverage clustering, partitioning

 Using appropriate caching for hot data

 Scaling up or down warehouses for concurrency

 Tracing long-running queries to identify bottlenecks

 Collecting statistics on tables for query optimization

 Checking usage patterns to optimize workloads
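For example, slow queries can be identified from the ACCOUNT_USAGE query history (a sketch; this view can lag real time by up to about 45 minutes and requires access to the SNOWFLAKE database):
SELECT query_id,
       query_text,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 10;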

31.How Can You Achieve High Concurrency in Snowflake?

Some ways to achieve high concurrency are:

 Using multiple virtual warehouses to distribute load

 Leveraging micro-partitions and sharding architecture

 Cloning tables/databases to prevent conflicts

 Using resource monitors to monitor and optimize concurrency

 Scaling up/down warehouses automatically based on usage

 Caching hot data to reduce compute requirements

 Tuning queries to run efficiently in parallel

 Optimizing ETL/ELT pipelines for efficiency

32.What Are Snowflake Resource Monitors?

Resource monitors in Snowflake allow you to monitor and control the usage of resources like
warehouses, data processed, credits used, etc. Key features:

 Set usage limits for resources

 Get notifications when limits are approached

 Configure actions such as notifying users or suspending warehouses when limits are reached

 Track credit consumption at the account or warehouse level

 Historical monitoring for usage analysis

 Prevent overspending by setting credit limits

 Role-based access control for governance
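A minimal sketch of a resource monitor that notifies at 90% and suspends a warehouse at 100% of a monthly credit quota (hypothetical names and limits):
CREATE OR REPLACE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse so the limits apply to its credit usage
ALTER WAREHOUSE compute_wh SET RESOURCE_MONITOR = monthly_limit;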

36. Difference Between Warehouse Clusters and Serverless Compute in Snowflake


Snowflake offers two types of compute resources:
1️⃣ Virtual Warehouse Clusters → User-managed, dedicated compute resources.
2️⃣ Serverless Compute → Fully managed by Snowflake, billed only when used.
1️⃣ Virtual Warehouse Clusters (User-Managed Compute)
A warehouse cluster in Snowflake is a dedicated set of compute resources used to process
queries, perform transformations, and load data.
🔹 Key Features:
User-defined size & scaling (XS, S, M, L, etc.).
Supports multi-cluster (automatically scales up/down based on workload).
Charges for running time (even if idle).
Best for consistent workloads that require predictable performance.
🔹 Example:
CREATE WAREHOUSE my_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
 Fixed compute size: Always reserves CPU and memory.
 Ideal for: ETL pipelines, scheduled reports, complex queries.

2️⃣ Serverless Compute (Snowflake-Managed Compute)


Serverless compute is on-demand compute that Snowflake automatically provisions and
scales based on query needs.
🔹 Key Features:
No manual sizing or tuning → Snowflake dynamically allocates resources.
Charges only when running (billed by execution time).
Used for specific tasks like Snowpipe, Tasks, or Materialized Views Refresh.
Best for irregular workloads or event-driven processing.
🔹 Example Use Cases:
 Snowpipe: Auto-ingests data without requiring a running warehouse.
 Materialized Views Refresh: Auto-updates views without dedicated compute.
 Tasks (Stored Procedures Execution): Scheduled jobs run with serverless compute.
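As a sketch (hypothetical staging_events table), a serverless task simply omits the WAREHOUSE parameter and lets Snowflake size the compute:
CREATE OR REPLACE TASK serverless_cleanup
  SCHEDULE = '60 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'   -- optional sizing hint; no WAREHOUSE specified
AS
  DELETE FROM staging_events
  WHERE load_date < DATEADD('day', -7, CURRENT_DATE());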

Key Differences Between Warehouse Clusters & Serverless Compute

Feature               Warehouse Clusters (User-Managed Compute)   Serverless Compute (Fully Managed)
Compute Management    User-defined (size, scaling)                Fully managed by Snowflake
Billing               Pay for running time (even if idle)         Pay per usage (execution time only)
Scaling               Manual or multi-cluster scaling             Auto-scales dynamically
Use Cases             ETL, BI reports, transformations            Snowpipe, Tasks, Materialized Views
Performance Control   Predictable, user-controlled                Snowflake handles optimization
Query Execution       Requires an active warehouse                Runs without a user-managed warehouse

When to Use Which?


Use Virtual Warehouse Clusters if:
 You need consistent compute power (scheduled ETL jobs, dashboards).
 You want manual control over warehouse size and scaling.
 You want predictable costs.
Use Serverless Compute if:
 You have irregular or event-driven workloads (Snowpipe, materialized views).
 You don’t want to manage warehouse sizing or tuning.
 You want cost efficiency (pay only per execution).

Example Scenario: ETL vs. Snowpipe


 ETL Pipeline: Uses a dedicated virtual warehouse for transformations.
 Streaming Data Ingestion (Snowpipe): Uses serverless compute, auto-scales as
needed.

Comparison: Snowflake vs. Google BigQuery vs. Amazon Redshift


These three are popular cloud-based data warehouses, but they differ in architecture,
performance, pricing, and usability. Below is a detailed comparison to help understand why
Snowflake is often preferred.
1️. Architecture & Storage
Why Snowflake?
 Unlike Redshift, Snowflake allows fully independent scaling of compute and storage.
 Unlike BigQuery, Snowflake allows dedicated warehouses for better consistent
performance (BigQuery is completely serverless).

2️. Performance & Speed


Why Snowflake?
 Snowflake’s multi-cluster warehouses prevent performance bottlenecks.
 Redshift requires manual performance tuning (vacuum, analyze, distribution keys).
 BigQuery is best for batch processing, but latency is high for frequent queries.

3️.Pricing Model
Why Snowflake?
 Snowflake: Pay only for what you use (separate storage & compute billing).
 BigQuery: Pricing depends on data scanned, which can be unpredictable.
 Redshift: Expensive if not optimized properly (fixed-size clusters).

4️. Data Sharing & Security


Why Snowflake?
 Snowflake allows seamless cross-cloud data sharing with no data copy needed.
 Redshift and BigQuery do not support native real-time data sharing.

5️. Ease of Use & Administration


Why Snowflake?
 No infrastructure management compared to Redshift.
 More flexibility compared to BigQuery (which is fully serverless but lacks control).

Final Verdict: Which One to Choose?


Use Case                              Best Choice
General Purpose Data Warehousing      Snowflake (Balanced Performance, Cost, Features)
Serverless, Ad-hoc Queries            BigQuery (Best for occasional queries, pay-per-query model)
High-performance Analytics (On AWS)   Redshift (MPP for structured workloads, but requires tuning)
Multi-cloud & Data Sharing Needs      Snowflake (Best for hybrid cloud environments)

Why Prefer Snowflake Over Others?


1. Better Performance: Multi-cluster architecture handles concurrency better than Redshift
and BigQuery.
2. Cost-Efficient: Pay for compute and storage separately rather than per cluster (Redshift) or
per query (BigQuery).
3. Cross-Cloud Flexibility: Snowflake works across AWS, Azure, and GCP, while Redshift is
AWS-only and BigQuery is GCP-only.
4. Zero Maintenance: No infrastructure management needed (unlike Redshift).
5. Advanced Data Sharing & Security: Supports cross-region & cross-cloud data sharing.
6. Time Travel: Built-in Time Travel (and Fail-safe) for data recovery.
