Snowflake
1. What is Snowflake?
2. Snowflake Architecture
3. Connecting to Snowflake
4. Virtual Warehouses
5. Micro-Partitions
6. Clustering in Snowflake
7. Snowflake Editions
8. Snowflake Pricing
9. Data Loading in Snowflake
10. Loading Snowflake Data with ETL Tools
11. Stages
12. Loading Data from AWS S3, Azure, and GCP into Snowflake
13. Snowpipe
14. Time Travel and Fail-safe
15. Zero-Copy Cloning
16. Tables
17. External Tables in Snowflake
18. Access Control in Snowflake
19. Views
20. Dynamic Data Masking
21. Data Sharing
22. Scheduling in Snowflake - Tasks
23. Streams in Snowflake
24. User-Defined Functions (UDFs)
25. Stored Procedures in Snowflake
26. Caching in Snowflake
27. Unloading Data in Snowflake
1. What is Snowflake?
Snowflake is a cloud-based data warehousing solution that provides data storage,
processing, and analytics services.
Key Features:
Founded in 2012.
Offers data storage and analytics services.
No on-premises infrastructure; it runs entirely in the cloud.
Runs on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
Available as Software-as-a-Service (SaaS).
Feature | Snowflake | Traditional Data Warehouse
Semi-structured Data Handling | Supports semi-structured data natively | Needs ETL tools
Data Backup | No extra cost (via Cloning) | Needs additional storage
Job Scheduling | Handled within Snowflake using Tasks | Requires third-party tools
2. Snowflake Architecture
Snowflake's architecture is designed to separate compute, storage, and cloud services,
ensuring high performance, scalability, and cost efficiency. It consists of three key layers:
1 Database Storage Layer
2 Query Processing Layer
3 Cloud Services Layer
3. Connecting to Snowflake
Snowflake provides multiple ways to connect, making it flexible for different use cases.
1. Web-Based User Interface (UI)
A browser-based interface to manage and use Snowflake.
Provides access to:
Query execution
Database and warehouse management
Security and user access controls
Ideal for administrators, developers, and analysts.
2. Command-Line Interface (CLI) - SnowSQL
SnowSQL is a command-line tool for interacting with Snowflake.
Supports SQL queries, scripting, and automation.
Useful for developers and DevOps teams.
3. ODBC & JDBC Drivers
Snowflake provides ODBC and JDBC drivers to integrate with external applications.
Enables connectivity with BI tools like Tableau, Power BI, and Looker.
Suitable for analytics, reporting, and third-party integrations.
4. Native Connectors for ETL Tools
Snowflake supports built-in connectors for ETL tools like:
Informatica
Datastage
Talend
Apache Nifi
Helps in data extraction, transformation, and loading (ETL) workflows.
4. Virtual Warehouse
A virtual warehouse in Snowflake is a cluster of compute resources that performs all
computational tasks, such as data loading, querying, and transformations. Snowflake’s
architecture separates compute from storage, so virtual warehouses can be resized (scaled
up or down) and turned on or off independently of the data storage layer. This enables fine-
grained control over performance and cost, allowing users to allocate more resources for
complex operations and scale down when resources are not needed.
Users can create multiple virtual warehouses to handle different workloads, such as ETL
jobs, reporting, and ad-hoc queries. Snowflake can automatically scale a warehouse up or
down based on workload demands, ensuring that performance remains optimal.
Warehouse Selection Based on Requirements
Choose a small warehouse for light workloads (e.g., small queries, occasional data
processing).
Use a larger warehouse for high concurrency, large data loads, and intensive queries.
Warehouse Sizing & Scaling
Snowflake warehouses come in different sizes, which determine the number of compute
nodes in the cluster.
Important Notes:
Each increase in warehouse size doubles the number of compute nodes and cost.
If there are insufficient resources, queries get queued until resources become
available.
Scaling Options in Snowflake
Scale up (resize): increase the warehouse size to handle larger or more complex queries.
Scale out (multi-cluster): add clusters to the same warehouse to handle higher concurrency (Enterprise edition and above).
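A minimal sketch of creating and scaling a warehouse; the warehouse name, suspend timeout, and cluster limits are illustrative, and multi-cluster scaling requires Enterprise edition or above.
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;
-- Scale up: resize the warehouse for heavier queries
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'LARGE';
-- Scale out: allow up to 3 clusters for higher concurrency
ALTER WAREHOUSE demo_wh SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 3;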
5. Micro-Partitions
Agenda
How data is stored in micro-partitions
Metadata of micro-partitions
Benefits of micro-partitions
Clustering & Cluster Keys
How to define and choose Cluster Keys
What are Micro-Partitions?
Snowflake uses a unique partitioning technique called micro-partitioning.
Micro-partitioning is automatic – users don’t need to define partitions.
Tables are partitioned based on the order of data insertion.
Micro-partitions are small (50-500 MB of uncompressed data each; the stored size is smaller after compression).
Data is compressed, and Snowflake automatically chooses the best compression algorithm.
Metadata of Micro-Partitions
Snowflake automatically maintains metadata about micro-partitions, which includes:
• Number of distinct values in each column
• Range of values in each column
• Other useful statistics for query optimization
Query Pruning (Metadata-Based Filtering)
Snowflake uses metadata to filter out unnecessary micro-partitions during query execution.
This process is called Query Pruning.
Instead of scanning the entire table, only relevant micro-partitions are scanned.
Example:
SELECT type, country FROM MY_TABLE WHERE name = 'Y';
Only the micro-partitions whose metadata shows they may contain name = 'Y' are scanned (instead of
scanning the entire table).
Because storage is columnar, only the required columns (type and country) are read, ignoring unnecessary data.
Benefits of Micro-Partitioning
No need for manual partitioning – Snowflake does it automatically.
Optimized query performance – Faster execution due to query pruning.
Columnar storage – Only relevant columns are scanned, improving efficiency.
Efficient compression – Reduces storage costs.
Enables fine-grained pruning – Minimizes data scanning and enhances speed.
What is Clustering?
• Clustering improves query performance by organizing data within micro-partitions.
• Helps when queries filter on specific columns frequently.
• Snowflake automatically clusters data, but manual clustering is needed for large tables
with frequent updates.
What is a Cluster Key?
A Cluster Key is one or more columns used to logically group data within micro-partitions.
Helps in query pruning by reducing the number of scanned micro-partitions.
Example of Defining a Cluster Key:
ALTER TABLE sales CLUSTER BY (region, date);
This clusters the sales table based on region and date, improving queries that filter by these
columns.
How to Choose Cluster Keys?
• Choose columns that are frequently used in WHERE, GROUP BY, and JOIN conditions.
• Select columns with high cardinality (many unique values).
• Avoid too many columns, as it increases clustering costs.
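To check how well a table is clustered on its keys, Snowflake provides system functions; a minimal sketch, assuming the sales table clustered above.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(region, date)');  -- detailed clustering statistics
SELECT SYSTEM$CLUSTERING_DEPTH('sales', '(region, date)');        -- average overlap depth (lower is better)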
Summary
• Micro-partitioning is automatic – No user maintenance needed.
• Metadata-based pruning speeds up queries by reducing scanned data.
• Clustering improves performance for large datasets with frequent filtering.
• Cluster Keys should be chosen carefully to optimize storage and query execution.
6. Clustering in Snowflake
What is Clustering?
Clustering in Snowflake refers to the process of organizing data in a way that improves
query performance, particularly for large datasets. Snowflake uses automatic clustering by
default, meaning it automatically manages data distribution and storage optimization. Users
can define cluster keys to help Snowflake organize data more efficiently based on
commonly queried columns. This allows for faster retrieval of data and optimized query
performance, especially when working with large volumes of data.
Why is Clustering Important?
Snowflake stores data automatically in micro-partitions.
By default, Snowflake determines how to distribute and sort data when it's loaded.
However, as data grows, it may not be stored optimally for queries.
Clustering organizes data based on specific columns (clustering keys) to improve
query performance.
How Clustering Works in Snowflake
1. Micro-partitions store data in compressed, columnar format.
2. Without clustering, queries may scan multiple partitions, leading to performance
overhead.
3. With clustering, Snowflake orders the data logically based on a clustering key to
improve partition pruning.
4. Queries that filter or join on clustered columns will scan fewer partitions, improving
efficiency and cost-effectiveness.
Defining Clustering Keys
A clustering key is a set of columns in a table that determines how Snowflake should
organize data in micro-partitions.
Good Clustering Keys Should:
Be frequently used in the WHERE clause.
Be used as JOIN keys.
Be used in aggregations or GROUP BY operations.
Have a high cardinality (many distinct values).
** Without clustering, the query scans all partitions. With clustering, Snowflake scans only
relevant partitions, reducing cost.
Re-Clustering in Snowflake
Clustering does not happen automatically over time. When data gets fragmented, we need
to re-cluster the table.
Manual Re-Clustering
ALTER TABLE sales RECLUSTER;
Re-clustering costs Snowflake credits, so it should be used carefully.
Automatic Re-Clustering
Snowflake also supports automatic clustering (Enterprise Edition or above):
ALTER TABLE sales RESUME RECLUSTER;  -- use SUSPEND RECLUSTER to pause automatic clustering
Snowflake will continuously optimize clustering as new data is inserted.
Best practices:
Use clustering on large tables: small tables don't benefit much from clustering.
Don't cluster on more than 4 columns: too many keys increase overhead.
Scenario
You're working with a large e-commerce dataset containing millions of sales records. You
need to optimize query performance by defining clustering keys.
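Step 1: Create the Table with Clustering Keys
The notes move straight to Step 2, so this is a minimal sketch; the column types are assumptions matching the sample INSERT below.
CREATE OR REPLACE TABLE sales (
    order_id INT,
    customer_id INT,
    region STRING,
    category STRING,
    amount NUMBER(10,2),
    order_date DATE
)
CLUSTER BY (region, order_date);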
This clusters data by region and order_date, making it efficient for regional sales analysis.
Step 2: Load Sample Data
Now, insert some sample records:
INSERT INTO sales (order_id, customer_id, region, category, amount, order_date) VALUES
(1, 101, 'North', 'Electronics', 500.00, '2024-01-01'),
(2, 102, 'South', 'Clothing', 200.00, '2024-01-05'),
(3, 103, 'East', 'Electronics', 700.00, '2024-01-10'),
(4, 104, 'West', 'Clothing', 150.00, '2024-01-15'),
(5, 105, 'North', 'Electronics', 900.00, '2024-01-20');
3. Storage Cost
Snowflake charges for storage per terabyte (TB) per month (compressed).
2 Storage Plans: On-Demand Storage (pay monthly for the storage actually used) and Capacity Storage (pre-purchased up front at a discount).
How to choose?
Not sure about data size? → Start with On-Demand
Stable data volume? → Switch to Capacity Storage
Example Calculation (compute credits):
If you use a Large warehouse (L) for 30 min → 4 Credits (a Large warehouse consumes 8 credits/hour)
If you use an XS warehouse for 1 hour → 1 Credit (an XS warehouse consumes 1 credit/hour)
Agenda
1. Load Types
2. Bulk Loading vs. Continuous Loading
3. Using the COPY Command
4. Transforming Data During Load
5. Other Data Loading Methods
1. Load Types in Snowflake
Snowflake provides two primary ways to load data:
Bulk Loading Using the COPY Command
Used for large datasets.
Loads batch files from cloud storage or local machines.
Requires a virtual warehouse for processing.
Continuous Loading Using Snowpipe
Best for real-time or streaming data.
Uses Snowpipe, which is serverless (additional cost applies).
Loads data automatically when new files appear in a stage.
Hands-On Example
Step 1: Create a Table
CREATE OR REPLACE TABLE customers (
customer_id INT,
name STRING,
email STRING
);
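Step 2: Stage and Load the Data. A minimal bulk-load sketch using the COPY command; the internal stage and file name (customers.csv) are assumptions.
CREATE OR REPLACE STAGE customer_stage;
-- Upload a local file to the stage (run from SnowSQL):
-- PUT file:///tmp/customers.csv @customer_stage;
COPY INTO customers
FROM @customer_stage/customers.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);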
Benefits:
Handles large files efficiently.
Allows transformations while loading.
Benefits:
Automated & Continuous data ingestion.
Ideal for real-time analytics.
Transformations Supported:
Column Reordering
Column Omission
String Operations
Auto-Increment Fields
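A hedged example of transforming data during the load, assuming the staged customers.csv lays its columns out as id, email, name; the SELECT reorders and keeps only the needed columns by position.
COPY INTO customers (customer_id, name, email)
FROM (
    SELECT $1, $3, $2        -- reorder: file columns are id, email, name
    FROM @customer_stage/customers.csv
)
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);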
Summary
AWS S3, Azure Blob, and GCP Storage all follow similar steps:
1. Create storage integration → Allows Snowflake to access cloud storage.
2. Create an external stage → Defines where data is stored.
3. Use COPY INTO → Loads data into Snowflake.
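A minimal end-to-end sketch for AWS S3; the IAM role ARN, bucket, and stage names are placeholders, and Azure/GCP follow the same pattern with their own credentials.
CREATE OR REPLACE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/data/');
CREATE OR REPLACE STAGE s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
COPY INTO customers
FROM @s3_stage;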
13. Snowpipe - Continuous Data Loading
1. What is Continuous Loading?
Loads small volumes of data continuously (e.g., every 10 mins, every hour)
Supports real-time or near-real-time data ingestion
Ensures latest data is available for analytics
Uses Snowpipe (a serverless feature) for automatic ingestion
2. What is Snowpipe?
-> A named database object that contains a COPY command
-> Loads data within minutes after files are added to a stage
-> Serverless & managed by Snowflake
-> One-time setup for automation
-> Prevents duplicate file loading
-> Optimal file size: 100-250 MB
5. Snowpipe Syntax
CREATE OR REPLACE PIPE PIPE_NAME
AUTO_INGEST = TRUE
AS COPY INTO <table_name> FROM @<stage_name>;
8. Managing Snowpipes
-> View pipe properties:
DESC PIPE pipe_name;
-> List all pipes:
SHOW PIPES;
-> Pause/Resume a Pipe:
ALTER PIPE pipe_name SET PIPE_EXECUTION_PAUSED = TRUE; -- Pause
ALTER PIPE pipe_name SET PIPE_EXECUTION_PAUSED = FALSE; -- Resume
-> When to Pause & Resume?
Before modifying the stage object
Before modifying the file format object
Before modifying the COPY command
To modify the COPY command, you must recreate the pipe!
Final Takeaways
-> Snowpipe automates continuous data loading
-> Prevents duplicate file loading with metadata tracking
-> Requires proper event notifications in cloud storage
-> Monitor pipe status & copy history for troubleshooting
14. Time Travel & Fail-safe in Snowflake
1. What is Time Travel?
Allows access to historical data that has been changed or deleted
Restores tables, schemas, and databases that were dropped
Enables querying past data at any point within the retention period
Used for data analysis, backup, and auditing
No need to enable manually; it is enabled by default
2. Retention Periods
Determines how long historical data is stored
Higher retention → Higher storage cost
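A few hedged Time Travel examples, assuming a sales table that is still within its retention period.
-- Query the table as it was 30 minutes ago
SELECT * FROM sales AT (OFFSET => -60*30);
-- Query the table as of a specific timestamp
SELECT * FROM sales AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_LTZ);
-- Restore a dropped table
UNDROP TABLE sales;
-- Change the retention period (in days)
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 30;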
Key Takeaways
-> Use Time Travel for quick recovery & historical analysis
-> Fail-safe is a last resort but requires Snowflake Support
-> Higher retention = higher cost, choose wisely
-> Always back up critical data before retention expires
15. Zero-Copy Cloning in Snowflake
1. What is Zero-Copy Cloning?
Creates an instant copy of a table, schema, or database without duplicating storage
No additional storage cost at the time of cloning
Snapshot of source data is taken at the moment of cloning
Cloned object is independent of the source object
Changes in the source or clone do not affect each other
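Minimal cloning sketches; the object names are illustrative.
CREATE TABLE sales_clone CLONE sales;
CREATE SCHEMA reporting_clone CLONE reporting;
CREATE DATABASE dev_db CLONE prod_db;
-- Clone as of a point in time by combining cloning with Time Travel
CREATE TABLE sales_clone_yesterday CLONE sales AT (OFFSET => -60*60*24);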
Key Takeaways
-> Fast cloning without extra storage cost initially
-> Ideal for backups, testing, and versioning
-> Changes in the clone do not affect the original
-> Supports various objects (tables, schemas, databases, tasks, etc.)
16. Snowflake Table Types – Permanent vs Transient vs Temporary
1. Types of Tables in Snowflake
Permanent Tables (Default, standard storage with fail-safe)
Transient Tables (No fail-safe, short retention period)
Temporary Tables (Session-specific, auto-dropped after session ends)
2. Permanent Tables
-> Default table type in Snowflake
-> Exists until explicitly dropped
-> Supports Time Travel (0-90 days, depending on edition)
-> Has a 7-day Fail-Safe period
Best for: Storing business-critical data
Syntax:
CREATE TABLE my_table (id INT, name STRING);
3. Transient Tables
-> Similar to Permanent Tables but with no Fail-Safe
-> Shorter retention period (0-1 day)
-> Exists until explicitly dropped
-> Best for temporary or intermediate processing
Best for: Staging tables, intermediate results
Syntax:
CREATE TRANSIENT TABLE my_transient_table (id INT, name STRING);
4. Temporary Tables
-> Exists only within the session
-> Automatically dropped when session ends
-> Not visible to other sessions or users
-> No fail-safe period
Best for: Development, testing, and temporary processing in stored procedures
Syntax:
CREATE TEMPORARY TABLE my_temp_table (id INT, name STRING);
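17. External Tables in Snowflake
External tables let Snowflake query data that stays in an external stage rather than in Snowflake storage; each row is exposed through a VALUE column plus file metadata. A minimal sketch of how the SAMPLE_EXT table queried below might be defined; the stage name and file format are assumptions.
CREATE OR REPLACE EXTERNAL TABLE SAMPLE_EXT
  WITH LOCATION = @my_ext_stage
  FILE_FORMAT = (TYPE = 'CSV');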
Example Query:
SELECT VALUE, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
FROM SAMPLE_EXT;
2. Key Concepts
Securable Object: any entity such as a table, schema, database, view, or warehouse.
Role: an entity to which privileges are granted; roles are assigned to users.
Privilege: a defined level of access to a securable object.
User: an identity that can be assigned one or more roles.
3. Privileges in Snowflake
Common privileges include SELECT, INSERT, UPDATE, and DELETE on tables; USAGE on databases, schemas, and warehouses; and OPERATE on warehouses and tasks.
To grant privileges:
GRANT SELECT ON TABLE sales TO ROLE analyst;
To revoke privileges:
REVOKE SELECT ON TABLE sales FROM ROLE analyst;
5. Roles in Snowflake
System-Defined Roles
1 ORGADMIN – Manages organization-level operations (e.g., creating accounts).
2 ACCOUNTADMIN – Highest role in an account, manages users, roles, usage, and billing.
3 SECURITYADMIN – Manages grants and security policies.
4 USERADMIN – Creates and manages users & roles.
5 SYSADMIN – Creates warehouses, databases, and other objects.
6 PUBLIC – Default role assigned to all users.
To check current role:
SELECT CURRENT_ROLE();
To switch role:
USE ROLE SYSADMIN;
7. Custom Roles
• Created by USERADMIN or any role with CREATE ROLE privilege.
• Not assigned to any user by default.
• Best practice: Assign custom roles under SYSADMIN for better management.
Create a custom role:
CREATE ROLE analyst_role;
Assign privileges to the custom role:
GRANT SELECT, INSERT ON TABLE sales TO ROLE analyst_role;
Assign the role to a user:
GRANT ROLE analyst_role TO USER madhu;
Types of views: standard (non-materialized) views, secure views, and materialized views.
3. Secure Views
• A secure view hides its SQL definition from unauthorized users.
• Only users with the required role can see its definition.
• Useful for:
Data Security – Hide sensitive logic.
Access Control – Restrict users from seeing base tables.
Creating a Secure View:
CREATE OR REPLACE SECURE VIEW sales_secure_view AS
SELECT product_id, revenue FROM sales;
4. Materialized Views
• Unlike normal views, Materialized Views store precomputed results.
• Improves performance for repetitive queries on large datasets.
• Cannot be created over multiple tables (no joins) or on complex queries, and requires Enterprise edition or above.
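A minimal materialized view sketch, assuming the sales table used elsewhere in these notes.
CREATE OR REPLACE MATERIALIZED VIEW region_revenue_mv AS
SELECT region, SUM(amount) AS total_revenue
FROM sales
GROUP BY region;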
2. Masking Policies
• Schema-level objects that define how data should be masked.
• Applied dynamically at query runtime without modifying actual data.
• One policy can be applied to multiple columns.
Example: a user with the PAYROLL role sees the full value 123-45-6789, while other roles see a masked value.
• Conditional masking (based on another column's value):
CREATE MASKING POLICY email_visibility AS
(email VARCHAR, visibility STRING) RETURNS VARCHAR ->
CASE
WHEN CURRENT_ROLE() = 'ADMIN' THEN email
WHEN visibility = 'Public' THEN email
ELSE '***MASKED***'
END;
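To take effect, the policy must be attached to a column. A minimal sketch, assuming an employees table with email and visibility columns.
ALTER TABLE employees MODIFY COLUMN email
  SET MASKING POLICY email_visibility USING (email, visibility);
-- Detach the policy before dropping it
ALTER TABLE employees MODIFY COLUMN email UNSET MASKING POLICY;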
8. Limitations
** Must unset a policy before dropping it.
** Input and output data types must match.
** Does not encrypt data, only masks it at query runtime.
Snowflake’s data sharing feature allows organizations to securely share data across
different Snowflake accounts without the need to copy or move the data. This is done in
real time, meaning that data can be shared as soon as it is available, with no delays.
The sharing process works by creating a share in Snowflake that contains selected data
(tables, views, schemas, etc.) and then granting access to another Snowflake account. This
access is read-only, so the recipient can query the shared data but cannot modify it.
Data sharing is secure and governed by Snowflake’s role-based access control (RBAC),
ensuring that only authorized users have access to the data. This feature is commonly used
for sharing data between business partners or departments within a large organization,
without the overhead of data duplication.
3. What is a Share?
• A Share is a named database object that includes:
The database & schema being shared
The grants (permissions) on objects
The consumer account details
Creating a Share:
CREATE SHARE my_share;
Adding a Database to the Share:
GRANT USAGE ON DATABASE my_database TO SHARE my_share;
Adding a Table to the Share:
GRANT SELECT ON TABLE my_database.public.my_table TO SHARE my_share;
Assigning a Consumer to the Share:
ALTER SHARE my_share ADD ACCOUNTS = <consumer_account_id>;
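On the consumer account, the shared data is accessed by creating a database from the share; a minimal sketch, where the provider account identifier and role name are assumptions.
CREATE DATABASE shared_sales FROM SHARE provider_account.my_share;
GRANT IMPORTED PRIVILEGES ON DATABASE shared_sales TO ROLE analyst_role;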
6. Limitations
** Consumers cannot modify the shared data.
** Cannot share non-secure views or standard materialized views.
** Providers must manage access control.
22. Scheduling in Snowflake Using Tasks
1. What is a Task?
• Tasks in Snowflake allow scheduling SQL execution at defined intervals.
• Used for automating ETL processes, stored procedure execution, and change data
capture (CDC).
• Can be Snowflake-managed (serverless) or user-managed (virtual warehouses).
Use Cases:
Automating batch processing
Implementing CDC (Change Data Capture)
Running stored procedures on schedule
Managing dependencies between tasks (DAG – Directed Acyclic Graph)
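2. Creating a Task
A minimal sketch using the emp_task name referenced in the ALTER examples below; the warehouse, schedule, and source/target tables are assumptions.
CREATE OR REPLACE TASK emp_task
  WAREHOUSE = demo_wh
  SCHEDULE = '5 MINUTE'
AS
  INSERT INTO emp_history SELECT * FROM employees;  -- source/target tables are assumptions
-- Newly created tasks are suspended until explicitly resumed (see below)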
3. Altering a Task
• Modify task properties using ALTER TASK.
Modify Schedule, Dependencies, or Query:
ALTER TASK emp_task SET SCHEDULE = '5 MINUTE';
ALTER TASK emp_task SUSPEND;
ALTER TASK task_dept ADD AFTER task_emp;
ALTER TASK task_dept REMOVE AFTER task_emp;
Resuming a Suspended Task:
ALTER TASK emp_task RESUME;
7. Troubleshooting Tasks
If your task is not running, follow these steps:
• Step 1: Check Task Status
SHOW TASKS;
➡ If the status is SUSPENDED, resume it using:
ALTER TASK my_task RESUME;
• Step 2: Check Task History for Failures
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE state = 'FAILED';
8. Summary
Tasks help automate SQL execution in Snowflake.
Supports time-based scheduling (CRON) and dependency-based execution (DAG).
Monitor & troubleshoot tasks using TASK_HISTORY.
Ensure permissions & check stream data availability for CDC.
2. Metadata of Streams
• Each stream maintains metadata columns for tracking DML changes: METADATA$ACTION (INSERT or DELETE), METADATA$ISUPDATE (TRUE when the row change is part of an UPDATE), and METADATA$ROW_ID (a unique row identifier).
5. Types of Streams
1. Standard Streams
• Tracks INSERTs, UPDATEs, DELETEs, and TRUNCATEs.
• Best for full change tracking (CDC).
Create a Standard Stream:
CREATE OR REPLACE STREAM my_stream ON TABLE my_table;
2. Append-Only Streams
• Tracks only INSERT operations.
• Ignores DELETEs and UPDATEs.
• Best for append-only tables (logs, event data, etc.).
Create an Append-Only Stream:
CREATE OR REPLACE STREAM my_stream ON TABLE my_table APPEND_ONLY = TRUE;
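Consuming a stream: reading it shows the pending changes plus the metadata columns, and using it in a DML statement advances its offset. A minimal sketch, where the id and name columns and the audit target table are assumptions.
-- Inspect pending changes
SELECT * FROM my_stream;
-- Consume the stream (advances its offset) into an audit table
INSERT INTO my_table_audit (id, name, action)
SELECT id, name, METADATA$ACTION
FROM my_stream;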
2. Sample UDFs
Scalar UDF (Returns a Single Value)
Example: Area of a Circle
CREATE FUNCTION area_of_circle(radius FLOAT)
RETURNS FLOAT
AS
$$
PI() * radius * radius
$$;
Usage:
SELECT area_of_circle(4.5);
Returns: 63.617251
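Table UDF (Returns a Set of Rows)
The definition of the sample_people function is not shown above, so this is a minimal SQL table-function sketch that would produce the output below; the column names and types are assumptions.
CREATE OR REPLACE FUNCTION sample_people()
RETURNS TABLE (name VARCHAR, age INT)
AS
$$
    -- VALUES exposes columns COLUMN1, COLUMN2, matched positionally to the RETURNS clause
    SELECT column1, column2
    FROM VALUES ('Ravi', 34), ('Latha', 27), ('Madhu', 25)
$$;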
Usage:
SELECT * FROM TABLE(sample_people());
Returns:
Name Age
Ravi 34
Latha 27
Madhu 25
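Stored Procedure Example
The body of LOAD_TABLE1 is not shown above; a minimal JavaScript stored procedure sketch that returns 'Done' might look like this, with the source and target table names as assumptions.
CREATE OR REPLACE PROCEDURE LOAD_TABLE1()
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
    // Source and target table names are assumptions for illustration
    var cmd = "INSERT INTO table1 SELECT * FROM staging_table1";
    snowflake.execute({ sqlText: cmd });
    return 'Done';
$$;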
Execution:
CALL LOAD_TABLE1();
Returns: 'Done'
26. Caching in Snowflake
Feature | Result Cache | Warehouse (Local Disk) Cache
Location | Cloud Services Layer | Virtual Warehouse (SSD/RAM)
Supports Subset Queries? | No | Yes
4. Unloading Options
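A minimal unloading sketch using COPY INTO a stage location; the stage name and path are assumptions.
-- Unload query results to an internal stage as compressed CSV files
COPY INTO @my_unload_stage/sales/
FROM (SELECT * FROM sales)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
OVERWRITE = TRUE;
-- Download the unloaded files to a local machine (run from SnowSQL):
-- GET @my_unload_stage/sales/ file:///tmp/unload/;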
Snowflake incorporates several layers of security to protect data and ensure compliance:
Encryption: All data in Snowflake is encrypted both in transit (using TLS/SSL) and at rest
(using AES-256 encryption). This ensures that data is protected during transmission and
storage.
Role-based Access Control (RBAC): Snowflake uses RBAC to manage permissions and access
control. Users are assigned specific roles, and these roles determine the actions they can
perform and the data they can access.
Data Masking: Snowflake supports dynamic data masking, which allows administrators to
mask sensitive data at the column level based on the user’s role.
Network Policies: Snowflake provides network policies to control which IP addresses can
access Snowflake, adding an extra layer of security.
These features, combined with Snowflake’s rigorous compliance certifications, ensure data is
protected and meets regulatory requirements.
Is Snowflake an ETL tool?
No, Snowflake is not an ETL tool by itself. It is primarily a cloud-based data warehouse that is
designed for data storage, querying, and analytics. However, Snowflake can be used in conjunction
with ETL tools like Informatica, Talend, and Matillion to extract, transform, and load data into
Snowflake. It also supports ELT (Extract, Load, Transform) workflows, where raw data is loaded into
Snowflake first, and transformations are performed within the warehouse.
5. How does Snowflake handle semi-structured data like JSON, Avro, and Parquet?
Snowflake provides native support for semi-structured data, allowing users to ingest, store, and
query data formats like JSON, Avro, Parquet, and XML without requiring any transformation before
loading. The platform uses a special data type called VARIANT to store semi-structured data.
When loading semi-structured data into Snowflake, users can store the data in VARIANT columns,
which can hold nested and complex data structures. Snowflake provides several built-in functions to
parse, query, and manipulate semi-structured data directly within SQL queries.
For example, users can use the : and [] operators and functions such as TO_VARIANT and FLATTEN to access and transform nested JSON
objects. Snowflake's support for semi-structured data helps organizations avoid the need for pre-
processing or conversion, making it easier to work with diverse data sources.
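A minimal sketch of storing and querying JSON in a VARIANT column; the table name and JSON fields are illustrative.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);
INSERT INTO raw_events
SELECT PARSE_JSON('{"user": {"id": 101, "name": "Ravi"}, "tags": ["a", "b"]}');
-- Navigate nested fields with : and [], casting with ::
SELECT payload:user.name::STRING AS user_name,
       payload:tags[0]::STRING   AS first_tag
FROM raw_events;
-- Explode the array into one row per element
SELECT t.value::STRING AS tag
FROM raw_events, LATERAL FLATTEN(input => payload:tags) t;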
Clustering: Use clustering keys on large tables to organize the data for faster access.
Snowflake automatically manages clustering, but for large tables or specific query patterns,
defining a cluster key can significantly improve performance.
Materialized Views: Use materialized views for frequently queried or aggregate data.
Materialized views store precomputed results, which can improve performance by reducing
the need for recalculating results on every query.
Virtual Warehouses: Choose the right size for virtual warehouses based on workload. Virtual
warehouses can be resized vertically or horizontally to meet specific demands.
Data Caching: Snowflake automatically caches query results, making subsequent queries
faster. Leveraging this cache by reusing previous query results can reduce the load on the
system and improve performance.
Data Storage Optimization: Use compression for large datasets, and store only necessary
data to avoid large, unoptimized tables.
These advantages make Snowflake particularly well-suited for environments with unpredictable
query loads, frequent data uploads, and large numbers of users.
The Snowflake connector is a piece of software that allows us to connect to the Snowflake data
warehouse platform and perform activities such as read/write operations, metadata import, and bulk data
loading.
Read data from, or write data to, tables in the Snowflake data warehouse.
Insert or bulk-load data into multiple tables at the same time by using the multiple input
connections functionality.
No, Snowflake does not use traditional indexes. This is one of the aspects that helps Snowflake scale so well
for queries: it relies on micro-partition metadata and pruning instead.
Yes, Snowflake supports stored procedures. A stored procedure is similar to a function: it is
created once and can be reused many times. We create it with the CREATE PROCEDURE command
and execute it with the CALL command. In Snowflake, stored procedures are commonly
developed with the JavaScript API, which enables them to execute database
operations such as SELECT, UPDATE, and CREATE.
17. How do we execute a Snowflake procedure?
Stored procedures allow us to create modular code comprising complicated business logic by combining
SQL statements with procedural logic. To execute a Snowflake procedure, carry out these
steps: create it with CREATE PROCEDURE, grant the calling role the required privileges, and then run it with the CALL command (e.g., CALL my_procedure();).
All the data we load into Snowflake is compacted systematically. Snowflake uses modern
data compression algorithms to compress and store the data. Customers pay for the
compressed data, not the raw data size.
Storage expenses are therefore lower than raw cloud storage costs because of compression.
24. What strategies would you employ to optimize storage costs in Snowflake while maintaining
query performance?
To optimize storage costs in Snowflake while maintaining query performance, I would consider the
following strategies:
1. Implement appropriate data retention policies and leverage Time Travel judiciously.
2. Use transient or temporary tables for staging and intermediate data to avoid Fail-safe storage costs.
3. Use zero-copy cloning instead of physical copies for development and testing environments.
4. Rely on Snowflake's automatic compression and drop unused tables, stages, and clones.
Look for candidates who can balance cost optimization with performance considerations. They
should understand Snowflake's storage architecture and be able to explain how different storage
strategies impact both costs and query performance. Follow up by asking about their experience in
implementing such strategies in real-world scenarios.
25. What Are Snowflake’s Roles and Why Are They Important?
Snowflake has a role-based access control (RBAC) model to enable secure access and data
protection. Some key aspects are:
Roles centralize access control: privileges are granted to roles, and roles are granted to users.
System-defined roles (ACCOUNTADMIN, SECURITYADMIN, SYSADMIN, USERADMIN, PUBLIC) form a hierarchy for administration, and custom roles allow finer-grained access.
Snowflake Streams capture changes to tables and provide change data to consumers in near real
time. Tasks run scheduled pieces of work, such as ETL transformations.
Key differences: a Stream tracks what changed in a table (CDC), while a Task controls when SQL or a stored procedure runs.
They can be used together for capturing changes and processing them.
In Snowflake, a pipeline refers to the architecture used for loading and transforming data. Key
aspects: stages and Snowpipe for ingestion, streams for change tracking, and tasks for scheduled transformations.
Resource monitors in Snowflake allow you to monitor and control the usage of compute resources
such as warehouse credits. Key features: credit quotas, threshold-based actions (notify, suspend, or suspend immediately), and assignment at the account or warehouse level.
Feature | Warehouse Clusters (User-Managed Compute) | Serverless Compute (Fully Managed)
Compute Management | User-defined (size, scaling) | Fully managed by Snowflake
Performance Control | Predictable, user-controlled | Snowflake handles optimization
Example: Query Execution | Requires an active warehouse | Runs without a warehouse
3. Pricing Model
Why Snowflake?
Snowflake: Pay only for what you use (separate storage & compute billing).
BigQuery: Pricing depends on data scanned, which can be unpredictable.
Redshift: Expensive if not optimized properly (fixed-size clusters).
Use Case | Recommended Platform
High-performance Analytics (on AWS) | Redshift (MPP for structured workloads, but requires tuning)
Multi-cloud & Data Sharing Needs | Snowflake (best for hybrid cloud environments)