
Programming in Snowflake Masterclass 2024: Hands-On!
Programming in Snowflake

• SQL & Data Analytics
• Snowflake Scripting
• Snowflake SQL REST API
• JSON Semi-Structured Data
• Stored Procedures
• User-Defined Functions
• User-Defined Table Functions
• SnowSQL & SnowCD
• Client Drivers (Python)
• Snowpark API & Data Frames
• Streamlit Web Applications
• Streamlit in Snowflake Apps
• Native App Framework
• Data Pipelines & Data Sharing
• Snowpipe and Snowpipe API
• Data Exchange & Marketplace
Best Ways to Benefit from this Course

• Hands-On + Reviews (slides)


• Quizzes to Test Your Knowledge
• VSCode Project Setup + GitHub
• Captions + Playback Speed
• Q&A + Money Back Guarantee
• Review
Course Structure

• Section Introduction
• Client Request
• Review of Client Request
• Section Summary

• Section Content
• Hands-On Programming Experiments
• Review Checkpoint Slideshows
• Check Your Knowledge Quiz
Architecture Diagram: Private Data Sharing

[Diagram: from the Snowflake provider account, the Provider Studio publishes listing(s) with provider profile(s) for a shared database, exposed as a secure data share; the Snowflake consumer account gets a proxy database for the share, using credentials.]
• Former Snowflake Data Superhero


• Former SnowPro Certification SME
• Four Snowflake Certification Exams

• 100% Specialized in Snowflake


• World-Class Expert in Snowflake

• ...You Are in Good Hands!


Real-Life Applications

Hierarchical Data Viewer
• with local CSV Python files
• as local/remote Streamlit web app
• as Python client to Snowflake
• as STREAMLIT in Snowflake (SiS) App
• as APPLICATION Native App for private Data Exchange

Hierarchical Metadata Viewer
• Data Lineage
• Object Dependencies
• Role Hierarchy
• STREAMLIT in Snowflake App

Enhanced Query Profile
• Query Analyzer
• STREAMLIT in Snowflake App
Real-Life Projects
Client Request

We are considering moving to Snowflake; would you have time for a quick demo?
We are particularly interested in its compute scalability, as we may need to run a few heavy, complex queries on big data at some point.
Why is Snowflake so interesting today, and what are some of the best practices we should be aware of?
Review of Client Request

We are considering moving to Snowflake; would you have time for a quick demo?
We are particularly interested in its compute scalability, as we may need to run a few heavy, complex queries on big data at some point.
Why is Snowflake so interesting today, and what are some of the best practices we should be aware of?
Section Summary

• Sign-up for free trial account


• Snowsight (Snowflake's Web UI)
• Architecture and Virtual Warehouses
• Run Query with a Very Large Warehouse
• Resume a Large Multi-Cluster Warehouse
• Snowflake Editions & Pricing
• Best Practices for Compute and Storage
signup.snowflake.com
Snowsight (Web UI)
Classic Web UI (obsolete now)
Snowflake Architecture (EPP)

• Cloud Data Warehouse only (no local version)
• Uses AWS, Azure or GCP infrastructure
• Everything goes through the SQL REST API
• Snowsight uses the REST API for dashboards and worksheets
• Compute and Storage completely separated
• Snowflake Data Cloud Storage

[Diagram: the Services layer sits on top of the Compute layer (multiple Virtual Warehouses), which sits on top of the Storage layer.]
Services
• Authentication: basic, key pair, OAuth, SSO, MFA
• Infrastructure Management
• Metadata Management
• Query Parsing and Optimization
• Access Control: users, roles, privileges
• Serverless Tasks
Compute
• Virtual Warehouses: ~your car engine
• Query Processing: single/multi-user, parallel processing
• Use a bigger warehouse for a more complex query
• Use multi-cluster warehouses (Enterprise+ only) when multiple users run queries concurrently
• Use cached query results when possible
Storage
• Back-End Data Storage: private to Snowflake
• Time Travel and Fail-safe storage
• Internal Stages: named, user, table stages
• Local and Remote Warehouse Storage
• Query Result Data Caches
• Data transfer in is free, but transfer out costs money!
Large Virtual Warehouse

[Diagram: a 4X-Large virtual warehouse (8x16 = 128 nodes) next to an X-Small virtual warehouse (1 node), both running on top of the same Storage layer in Snowflake.]
Large Multi-Cluster Virtual Warehouse
[Diagram: 3 clusters x 3X-Large virtual warehouse (3 x 64 nodes = 192 nodes) next to an X-Small virtual warehouse (1 node).]
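A hedged sketch of creating such warehouses in SQL; the names, sizes and settings below are illustrative, not from the course:

-- illustrative names and sizes; multi-cluster requires Enterprise edition or higher
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend after 1 minute of inactivity
  AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS demo_mc_wh
  WAREHOUSE_SIZE = 'XLARGE'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3    -- scales out when many users run queries concurrently
  SCALING_POLICY = 'ECONOMY';

ALTER WAREHOUSE demo_mc_wh RESUME;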
Snowflake Editions & Pricing
Best Practices: Compute
• You are charged for the time a virtual warehouse is up, not per executed query!
• Keep an X-Small warehouse
• Auto-suspend after 1 minute (the minimum!)
• Keep the Standard edition
• Use the Economy scaling policy
• Avoid querying the Account Usage schema too often
Best Practices: Storage
• Do not duplicate data; try zero-copy cloning or data sharing
• Do not store large amounts of data
• No Time Travel or Fail-safe unless necessary
• Use a resource monitor to alert on over-spending (see the sketch below)
• Restrict access by IP address
• Switch away from ACCOUNTADMIN for day-to-day work
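A minimal sketch of such a resource monitor, assuming a hypothetical 10-credit monthly quota and the default COMPUTE_WH warehouse:

-- hypothetical quota and warehouse name; requires the ACCOUNTADMIN role
CREATE RESOURCE MONITOR monthly_cap WITH
  CREDIT_QUOTA = 10
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE compute_wh SET RESOURCE_MONITOR = monthly_cap;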
Client Request

You convinced us, we'll move to Snowflake!
Here is a small CSV file with our employee test data; try to upload it into Snowflake.
We expect to expand in time to thousands of employees, each with thousands of related daily transactions, so some tables may reach sizes on the order of terabytes.
Review of Client Request

You convinced us, we'll move to Snowflake!
Here is a small CSV file with our employee test data; try to upload it into Snowflake.
We expect to expand in time to thousands of employees, each with thousands of related daily transactions, so some tables may reach sizes on the order of terabytes.
Section Summary

• Demo Project Setup


• SQL Worksheets
• Query Context and Identifiers
• Internal/External Stages
• Direct Access to Staged CSV File
• Schema Inference and DDL Script
• COPY File into Table
SQL Worksheets
[Diagram: the user types SQL statements into a SQL Worksheet in the Web UI; the SQL Engine (Compute) runs them against the data in Snowflake.]
Query Context

1. Role

2. Warehouse

3. Database

4. Schema
Query Context: Built-In Functions
• CURRENT_ROLE/WAREHOUSE/DATABASE/SCHEMA()
• CURRENT_USER/SESSION/STATEMENT/TRANSACTION()
• CURRENT_DATE/TIME/TIMESTAMP()
• CURRENT_ACCOUNT/CLIENT/VERSION/REGION/IP_ADDRESS()
• CURRENT_ACCOUNT/ORGANIZATION_NAME()
• CURRENT_AVAILABLE/SECONDARY_ROLES()

• IS_ROLE_IN_SESSION()
• LAST_TRANSACTION/QUERY_ID()
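A quick way to check the current context from a worksheet, assuming a session is already open:

SELECT CURRENT_USER(), CURRENT_ROLE(), CURRENT_WAREHOUSE(),
  CURRENT_DATABASE(), CURRENT_SCHEMA(),
  CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_VERSION();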
Stages: Uploading and Unloading Data
• PUT file stage (upload) → upload local data files (CSV, JSON, …) from the local computer into a stage (~ "$ aws s3 put" to S3 blob storage in the Amazon cloud account)
• GET stage file (unload) → download staged files back to the local computer (~ "$ aws s3 get")
• Internal stages: named (@stage), table (@%table), user (@~)
• External stage: @stage, pointing to cloud storage such as an S3 bucket
• COPY INTO table FROM location (upload) → load staged files into a table
• COPY INTO location FROM table (unload) → unload a table into staged files
Stages: Examples
• LIST @~;  list files from any stage
• REMOVE ...  remove files from a stage

• PUT file://C:\data\data.csv @%my_table;  not from Snowsight!


• COPY INTO my_table FROM @%my_table;

• CREATE TEMPORARY STAGE my_int_stage;


• PUT file://C:\data\data.csv @my_int_stage;  not from Snowsight!
• DESCRIBE STAGE my_int_stage;

• CREATE STAGE my_ext_stage URL='s3://load/files/';


• COPY INTO my_table FROM @my_ext_stage/data.csv;
• SHOW STAGES;
Stages: External Stage in AWS S3
• Create an S3 bucket, in the same AWS region
• Create a folder and upload some CSV files
• Create an IAM policy to access this bucket
• Create an IAM user, w/ the previous policy attached
• Create access keys for the user → take note of both
CREATE STAGE mystage_s3
URL='s3://mybucket/spool/'
CREDENTIALS=(
AWS_KEY_ID='AKIAW6WN772VQZKDEOIY'
AWS_SECRET_KEY='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');

LIST @mystage_s3;
Schema Inference
• INFER_SCHEMA
• LOCATION => '@mystage'  internal/external named stage
• FILES => 'emp.csv', ...  1+ uploaded files
• FILE_FORMAT => 'myfmt'  CREATE FILE FORMAT ... PARSE_HEADER=TRUE

-- show column definitions


SELECT *
FROM TABLE(INFER_SCHEMA(...))

-- generate create table DDL


SELECT GENERATE_COLUMN_DESCRIPTION(
ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS COLUMNS
FROM TABLE(INFER_SCHEMA(...));

-- create table directly from inferred schema


CREATE TABLE ... USING TEMPLATE(
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(INFER_SCHEMA(...)));
COPY Command
• COPY INTO table FROM stage  upload data from staged file(s) into a table
• COPY INTO stage FROM table  unload data from a table into staged file(s)

• LOCATION => '@mystage'  internal/external named stage


• FILES => 'emp.csv', ...  1+ uploaded files
• FILE_FORMAT => 'myfmt'  CREATE FILE FORMAT ... PARSE_HEADER=TRUE
• PATTERN => 'reg_exp'
• VALIDATION_MODE => RETURN_n_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS

COPY INTO EMP FROM @mystage


FILES = ('emp.csv')
FILE_FORMAT = (FORMAT_NAME = mycsvformat)
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE
FORCE = TRUE;
Client Request

What data file formats can we use with


Snowflake?
We have another data file with our
departments, but it is in JSON format.
Please upload it in Snowflake, stored as
tabular data.
Total salaries per department is something
rarely updated, but frequently looked at.
Review of Client Request

What data file formats can we use with


Snowflake?
We have another data file with our
departments, but it is in JSON format.
Please upload it in Snowflake, stored as
tabular data.
Total salaries per department is something
rarely updated, but frequently looked at.
Section Summary

• File Formats: CSV, JSON, Parquet…


• Upload JSON Data into VARIANT
• JSON Objects and Arrays
• Flatten Arrays and Lateral Joins
• Primary and Foreign Key Constraints
• Temporary and Transient Tables
• Materialized Views
File Formats
• CSV (Comma-Separated Values) = tabular row-oriented (good for transactions).
Includes any text-delimited (like tab-delimited values). First row may contain
column names (but data types must be inferred).
• JSON (JavaScript Object Notation) = for hierarchical semi-structured data, including
ND JSON (“Newline Delimited JSON”) for both loading/unloading.
• XML (Extensible Markup Language) = for hierarchical semi-structured data, only for
loading. More verbose than JSON.
• PARQUET = tabular format, but binary compressed and column-oriented (good for
analytics). Used for Hadoop (Cloudera and Twitter collab).
• ORC (Optimized Row Columnar) = only for loading. Tabular format, but column-
oriented (good for analytics). Used for Hadoop (Hortonworks and Facebook collab).
• AVRO = only for loading. Row-oriented (good for transactions), compressed binary
format. Additional schema evolution metadata (in JSON), for RPC serialization.
Used for Hadoop.
Named File Formats

• CREATE FILE FORMAT … → define a named file format once
• CREATE STAGE … FILE_FORMAT=(FORMAT_NAME='...') → attach it to a stage
• COPY INTO … FILE_FORMAT=(FORMAT_NAME='...') → use it in a COPY command
• CREATE TABLE … STAGE_FILE_FORMAT=(FORMAT_NAME='...') → set it as the default for a table stage
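A minimal sketch, with hypothetical names, of one named file format reused by a stage and a COPY command:

CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"';

CREATE STAGE my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

COPY INTO my_table FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');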
JSON Objects Cheat Sheet

• OBJECT_CONSTRUCT('age', 32) → { "age": 32 }
• OBJECT_AGG('age', 32) → { "age": 32 }
• OBJECT_INSERT({}, 'age', 32) → { "age": 32 }
• AS_OBJECT({ "age": 32 }) → { "age": 32 }
• OBJECT_DELETE({ "age": 32 }, 'age') → {}
• OBJECT_PICK({ "age": 32 }, 'name') → {}
• GET({ "age": 32 }, 'age') → 32
• OBJECT_KEYS({ "age": 32 }) → [ "age" ]
JSON Arrays Cheat Sheet
These calls all build [1, 2, 3]:
• ARRAY_CONSTRUCT(1, 2, 3)
• ARRAY_PREPEND([2, 3], 1)
• ARRAY_APPEND([1, 2], 3)
• ARRAY_INSERT([1, 3], 1, 2)
• ARRAY_CAT([1, 2], [3])
• ARRAY_COMPACT([1, 2, NULL, 3])
• ARRAY_FLATTEN([[1, 2], [3]])
• ARRAY_EXCEPT([1, 2, 3, 4], [4])
• ARRAY_DISTINCT([1, 1, 2, 3])

Applied to [1, 2, 3]:
• ARRAYS_OVERLAP([1, 4]) → TRUE
• ARRAY_CONTAINS(2) → TRUE
• ARRAY_TO_STRING(',') → '1,2,3'
• ARRAY_MAX() → 3
• ARRAY_SIZE() → 3
• ARRAY_SLICE(1, 3) → [2, 3]
• ARRAY_SORT(FALSE) → [3, 2, 1]
• ARRAY_REMOVE(2) → [1, 3]
• ARRAY_REMOVE_AT(2) → [1, 2]
• GET(2) → 3
• ARRAY_POSITION(2) → 1
Flattening Arrays: JSON to VARIANT
create or replace table json(name string, v variant) as
select 'John', parse_json($${
"managers": [
{ "name": "Bill", "years": [2021, 2022] },
{ "name": "Linda", "years": [2020] }
]}$$)
union
select 'Mary', parse_json($${
"managers": [
{ "name": "Bill", "years": [2022, 2023] }
]}$$);

select *
from json;
Flattening Arrays: One Level
select j.name,
m.value, m.value:name::string, m.value:years
from json j,
table(flatten(input => j.v, path => 'managers')) m;

select j.name,
m.value, m.value:name::string, m.value:years
from json j,
lateral flatten(input => j.v, path => 'managers') m;
Flattening Arrays: Two Levels
select name,
m.value, m.value:name::string, m.value:years,
y.value
from json j,
lateral flatten(input => j.v, outer => TRUE, path => 'managers') m,
lateral flatten(input => m.value, path => 'years') y;
Table Functions

• FLATTEN(…)  when flattening JSON array data


• LATERAL  when joining on UDF/UDTF input args
• RETURNS TABLE(col1 type, …)  in a UDTF definition
• SELECT * FROM TABLE(f(…))  when calling a UDTF
• RESULT_SCAN(…)  when getting data from cache
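A small example of RESULT_SCAN pulling the previous statement's results back from the query result cache (the table name is illustrative):

SELECT * FROM dept;
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));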
Temporary Tables

• CREATE TEMPORARY TABLE …


• With no Fail-safe period (but they may have time travel – can
UNDROP!).
• Only exist within a session, not visible to other users or sessions.
• Auto-purged completely when the session ends, with no possibility to
recover.
• Use for non-permanent, transitory data: ETL data, private session-
specific data.
• Can create with the same name, and their name takes precedence over
other tables!
• See also other temporary objects (like temp stages).
Transient Tables

• CREATE TRANSIENT TABLE …


• With no Fail-safe period (but they may have time travel – can
UNDROP!).
• Persist until explicitly dropped, but available to all users with
appropriate privileges.
• Use for transitory data that needs to be maintained beyond a
session.
• Use for data that does not need the same level of protection and
recovery.
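A short sketch, with hypothetical table names, contrasting the two:

CREATE TEMPORARY TABLE etl_staging (id INT, payload VARIANT);   -- session-only, auto-purged at session end
CREATE TRANSIENT TABLE daily_metrics (day DATE, total NUMBER);  -- persists, but with no Fail-safe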
Materialized Views: Use Cases

• CREATE MATERIALIZED VIEW ...


• Query results have a lower number of rows/columns.
• Query results contain results of aggregated data.
• Query results come from semi-structured data analysis.
• Query on external table needs improved performance.
• Base table does not change frequently.
• Query results do not change often.
• Results are used often.
• Queries consume a lot of resources.
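A minimal sketch, assuming an EMP table with DEPTNO and SAL columns; this matches the "total salaries per department" request (note that materialized views require Enterprise edition or higher):

CREATE MATERIALIZED VIEW dept_total_salaries AS
  SELECT deptno, SUM(sal) AS total_sal
  FROM emp
  GROUP BY deptno;

SELECT * FROM dept_total_salaries;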
Client Request

We don't have customer data today, so we


will need to either extract a sample from an
existing similar test table, or generate some
synthetic data from scratch.
The customer ID must also be generated automatically.
Review of Client Request

We don't have customer data today, so we


will need to either extract a sample from an
existing similar test table, or generate some
synthetic data from scratch.
The customer ID must also be generated automatically.
Section Summary

• Snowflake Tutorials
• Snowflake Sample Databases
• Sample Data Extraction
• Synthetic Data Generation
• External Data Generation
• Sequences
• Identity Columns
Snowflake Tutorials
Snowflake Sample Databases: TPC-H

• TPCH_SF1 = ~ millions of elements
• Tutorial 1: Sample queries on TPC-H data
• TPCH_SF10 = ~10 x millions of elements
• TPCH_SF100 = ~100 x millions of elements
• TPCH_SF1000 = ~ billions of elements
Snowflake Sample Databases: TPC-DS

• TPCDS_SF10TCL = 10 TB , 65M customers, 400K+ items


• STORE_SALES table ~30B rows, fact tables 56+B rows
• Tutorial 2: Sample queries on TPC-DS data
• Tutorial 3: TPC-DS 10TB Complete Query Test

• TPCDS_SF100TCL = 100 TB, 100M customers, 500K+ items


• STORE_SALES table ~300B rows, fact tables 560B rows
• Tutorial 4: TPC-DS 100TB Complete Query Test
Sample Data Extraction
• FROM ... SAMPLE (~TABLESAMPLE)  extracts subset of rows from a table
• BERNOULLI (~ROW) | SYSTEM (~BLOCK)  sampling method
• <probability> | <n> ROWS  sample size
• REPEATABLE (~SEED) (<seed>)  seed value (to make it deterministic)
Sample Data Extraction Example

SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
SAMPLE (1000000 ROWS);
Synthetic Data Generation
• FROM TABLE(GENERATOR([rowcount], [timelimit])) → generates rows (w/o columns)
• Random values
  • RANDOM/RANDSTR → 64-bit integer / string with a given length
  • UUID_STRING → UUID
• Controlled distribution
  • NORMAL/UNIFORM/ZIPF → number w/ a specific distribution / integer
• Deterministic values (for unique IDs)
  • SEQ1/SEQ2/SEQ4/SEQ8 → sequence of integers

[Diagram: GENERATOR produces the rows; RANDOM, NORMAL, SEQn... produce the column values.]


Data Generation Example

select
randstr(uniform(10, 30, random(1)), uniform(1, 100000, random(1)))::varchar(30) as name,
randstr(uniform(10, 30, random(2)), uniform(1, 10000, random(2)))::varchar(30) as city,
randstr(10, uniform(1, 100000, random(3)))::varchar(10) as license_plate,
randstr(uniform(10, 30, random(4)), uniform(1, 200000, random(4)))::varchar(30) as email
from table(generator(rowcount => 1000));
Faker Data Generation Example
Python Code

from faker import Faker
import pandas as pd

fake = Faker()
output = [{
    "name": fake.name(),
    "address": fake.address(),
    "city": fake.city(),
    "state": fake.state(),
    "email": fake.email()
} for _ in range(1000)]
df = pd.DataFrame(output)
print(df)
Unique Identifiers

• Sequences
• Identity Columns
• UUIDs
Sequences

• CREATE SEQUENCE …
• START start INCREMENT incr  default (1, 1)
• ORDER | NOORDER  default ORDER (ASC)

• seq.nextval  for next value (no seq.currval!)


• TABLE(GETNEXTVAL(seq)) → ~seq.nextval
Identity Columns

• CREATE TABLE …
• AUTOINCREMENT | IDENTITY  auto-gen number (~seq
number)
• START start INCREMENT incr  alt. to (start, incr), def (1, 1)
• ORDER | NOORDER  default ORDER (ASC)
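A small sketch, with hypothetical names, showing both approaches for an auto-generated customer ID:

CREATE SEQUENCE customer_seq START 1 INCREMENT 1;

CREATE TABLE customers (
  id INT DEFAULT customer_seq.NEXTVAL,   -- sequence-based ID
  code INT AUTOINCREMENT (1, 1),         -- identity column
  name STRING);

INSERT INTO customers (name) VALUES ('Acme');
SELECT * FROM customers;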
Client Request

We need a better visual representation of


the hierarchical employee-manager
relationship. Please provide some solutions
in plain SQL, if any.
Also, the name of the manager for an
employee, and the names of employees
directly supervised, will be some common
functionalities used in queries.
Review of Client Request

We need a better visual representation of


the hierarchical employee-manager
relationship. Please provide some solutions
in plain SQL, if any.
Also, the name of the manager for an
employee, and the names of employees
directly supervised, will be some common
functionalities used in queries.
Section Summary

• Querying Hierarchical Data in SQL


• Views
• Stored Procedures
• JavaScript Stored Procedures API
• User-Defined Functions (UDFs)
• User-Defined Table Functions (UDTFs)
Hierarchical Data: SQL Queries
child_parent view → indented name + path

1. Multi-Level JOINs

2. CONNECT BY

3. Recursive CTEs

4. Recursive Views
Hierarchical Data: (1) Multi-Level JOINs
select coalesce(m3.employee || '.', '')
|| coalesce(m2.employee || '.', '')
|| coalesce(m1.employee || '.', '')
|| e.employee as path,
regexp_count(path, '\\.') as level,
repeat(' ', level) || e.employee as name
from employee_manager e
left join employee_manager m1 on e.manager = m1.employee
left join employee_manager m2 on m1.manager = m2.employee
left join employee_manager m3 on m2.manager = m3.employee
order by path;
Hierarchical Data: (2) CONNECT BY

select repeat(' ', level-1) || employee as name,


ltrim(sys_connect_by_path(employee, '.'), '.') as path
from employee_manager
start with manager is null
connect by prior employee = manager
order by path;
Hierarchical Data: (3) Recursive CTEs

with recursive cte (level, name, path, employee) as (


select 1, employee, employee, employee
from employee_manager
where manager is null
union all
select m.level + 1,
repeat(' ', level) || e.employee,
path || '.' || e.employee,
e.employee
from employee_manager e join cte m on e.manager = m.employee)

select name, path


from cte
order by path;
Hierarchical Data: (4) Recursive Views

create recursive view employee_hierarchy (level, name, path, employee) as (


select 1, employee, employee, employee
from employee_manager
where manager is null
union all
select m.level + 1,
repeat(' ', level) || e.employee,
path || '.' || e.employee,
e.employee
from employee_manager e join employee_hierarchy m on e.manager = m.employee);

select name, path


from employee_hierarchy
order by path;
Stored Procedures and Functions

• CREATE PROCEDURE / CREATE FUNCTION


• RETURNS … [NOT NULL]  data type (TABLE for UDTFs)
• EXECUTE AS OWNER/CALLER  access rights
• CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT

• LANGUAGE ...  SQL/JavaScript or Python/Java/Scala (w/ Snowpark)


• for Python/Java/Scala
• RUNTIME_VERSION = '...'  fixed
• HANDLER = '…'  internal class[+function]
• PACKAGES = ('…', …)  imported packages
• IMPORTS = ('…', ...)  stage path and file names to read
• TARGET_PATH = '…'  stage path and JAR file name to write
Stored Procedures

Call from SQL

call proc1(22.5);
SELECT * FROM TABLE(
RESULT_SCAN(LAST_QUERY_ID()))

SQL Scripting Code

create procedure proc1(num float)
returns string
language sql
as $$
  return '+' || to_varchar(num);
$$;

JavaScript Code

create procedure proc1(num float)
returns string
language javascript
as $$
  return '+' + NUM.toString();
$$;
JavaScript Stored Procedures API

• snowflake → createStatement(…), execute(…), log(…), addEvent(…), setSpanAttribute(…)
• Statement → execute(…), getSqlText(), getColumnName/Type/Scale(…), getColumnSqlType(…), isColumnNullable/Text/…(…), getColumnCount(), getRowCount(), getNumRowsAffected(), getNumRowsInserted/Updated/Deleted(), getQueryId()
• ResultSet → next(), getSqlText(), getColumnSqlType(…), getColumnValue[AsString](…), getColumnCount(), getNumRowsAffected(), getQueryId()
• SfDate → getEpochSeconds(), getNanoSeconds(), getTimezone(), getScale()
UDFs (User-Defined Functions)

Call from SQL

select fct1(22.5);

SQL Scripting Code

create function fct1(num float)
returns string not null
language sql
as
'select ''+'' || to_varchar(num)';

JavaScript Code

create function fct1(num float)
returns string
language javascript
as
'return \'+\' + NUM.toString()';
UDFs (cont.)

Call from SQL

select fct1(22.5);

Python Code

create function fct1(num float)
returns string
language python
runtime_version = '3.8'
handler = 'proc1'
as $$
def proc1(num: float):
    return '+' + str(num)
$$;

Java Code

create function fct1(num float)
returns string
language java
runtime_version = 11
handler = 'MyClass.fct1'
as $$
class MyClass {
  public String fct1(float num) {
    return "+" + Float.toString(num);
}}
$$;

Scala Code

create function fct1(num float)
returns string
language scala
runtime_version = 2.12
handler = 'MyClass.fct1'
as $$
object MyClass {
  def fct1(num: Float): String = {
    return "+" + num.toString
}}
$$;
UDTFs (User-Defined Table Functions)
Call from SQL

select * from
table(fctt1('abc'));

SQL Scripting Code

create function fctt1(s string)
returns table(out varchar)
as
begin
  select s
  union all
  select s
end;

JavaScript Code

create function fctt1(s string)
returns table(out varchar)
language javascript
strict
as $$
{
  processRow: function f(row, rowWriter, context) {
    rowWriter.writeRow({OUT: row.S});
    rowWriter.writeRow({OUT: row.S});
}}
$$;
UDTFs (cont.)
Call from SQL

select * from
table(fctt1('abc'));

Python Code

create function fctt1(s string)
returns table(out varchar)
language python
runtime_version = '3.8'
handler = 'MyClass'
as $$
class MyClass:
    def process(self, s: str):
        yield (s,)
        yield (s,)
$$;

Java Code

create function fctt1(s string)
returns table(out varchar)
language java
runtime_version = 11
handler = 'MyClass'
as $$
import java.util.stream.Stream;
class OutputRow {
  public String out;
  public OutputRow(String outVal) {
    this.out = outVal; }
}
class MyClass {
  public static Class getOutputClass() {
    return OutputRow.class; }
  public Stream<OutputRow> process(String inVal) {
    return Stream.of(
      new OutputRow(inVal),
      new OutputRow(inVal));
}}
$$;
Client Request

Our DBA would like a quick intro to what's


different in Snowflake's SQL. What are a few
things we should be aware of when we run
our first SQL queries?
Please make the hierarchical display more
generic. We would like to show an indented
name with a path for any table or view with
child-parent as the first two columns.
Review of Client Request

Our DBA would like a quick intro to what's


different in Snowflake's SQL. What are a few
things we should be aware of when we run
our first SQL queries?
Please make the hierarchical display more
generic. We would like to show an indented
name with a path for any table or view with
child-parent as the first two columns.
Section Summary

• Object Identifiers
• DDL Statements
• Zero-Copy Cloning
• DML Statements
• Snowflake Scripting
• SQL vs SQL Scripting
• Cursor and ResultSet
• Transactions
Object Identifiers
• identifiers
• NAME/Name/name → NAME
• "Name" → Name
• "This is a name" → This is a name

• IDENTIFIER/TABLE functions
• IDENTIFIER/TABLE('MY_TABLE')  MY_TABLE/My_Table/my_table
• IDENTIFIER/TABLE('"my_table"')  "my_table"
• IDENTIFIER/TABLE($table_name)  SET table_name = 'my_table';

• database.schema.object  object name resolution
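A short example of IDENTIFIER() resolving a session variable to an object name (the table name is illustrative):

SET table_name = 'my_table';
CREATE TABLE IDENTIFIER($table_name) (id INT);
SELECT * FROM IDENTIFIER($table_name);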


Column References and JSON Properties
• SELECT $1, $2 ...
• from any table type (including external tables)
• from staged files

• v:myobj.prop1.prop2['name2'].array1[2]::string
• v: = table column name or alias
• myobj = top JSON object
• myobj.prop1 = myobj['prop1'], prop2.name2 = prop2['name2']
• array1[2] = 3rd element in JSON array
• ::string = cast conversion
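Two hedged examples, assuming a staged CSV file (names are illustrative) and the JSON table with the VARIANT column v created earlier:

-- positional column references from a staged file
SELECT $1, $2 FROM @my_stage/emp.csv;

-- traversing a JSON document stored in a VARIANT column
SELECT v:managers[0].name::string FROM json;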
Variables
• session variables = global variables
• SET var = ..., UNSET var, $var  SHOW VARIABLES
• SnowSQL variables = extensions, w/ var substitution
• local variables = in blocks (Snowflake Scripting / stored procs / functions)
• var1 [type1] [DEFAULT expr1]
• LET var1 := [type1] [DEFAULT / := expr1]
• SELECT col1 INTO :var1
• bind variables = for parameterized queries, w/ runtime param values
• SELECT (:1), (:2), TO_TIMESTAMP((?), :2)
• environment variables = for Bash (Linux/macOS) or PowerShell (Windows)
• SET/EXPORT name=value → $name or %name%
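A quick session-variable example (the values are illustrative):

SET min_sal = 2000;
SHOW VARIABLES;
SELECT * FROM emp WHERE sal >= $min_sal;
UNSET min_sal;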
Structured Query Language (SQL)
• Data Definition Language (DDL)
  • CREATE | ALTER | DROP
  • COMMENT | USE
  • SHOW | DESCRIBE
• Data Manipulation Language (DML)
  • INSERT | UPDATE
  • DELETE | TRUNCATE
  • MERGE | EXPLAIN
• Data Query Language (DQL)
  • SELECT | CALL
• Data Control Language (DCL)
  • USER | ROLE
  • GRANT | REVOKE
• Transaction Control Language (TCL)
  • BEGIN TRANSACTION
  • COMMIT | ROLLBACK
  • DESCRIBE TRANSACTION
  • SHOW TRANSACTIONS | LOCKS
DML Commands
• INSERT [OVERWRITE]  with truncate
• [ALL] INTO <table> [(<cols>)] … VALUES (…), …
• INTO <table> [(<cols>)] … SELECT …
• FIRST|ALL WHEN … THEN INTO … ELSE INTO …  multi-table insert
• UPDATE <table> SET <col> = …, … [FROM …] [WHERE …]
• DELETE FROM <table> [USING …] [WHERE …]
• TRUNCATE [TABLE] [IF EXISTS] <table>  w/ metadata delete

-- conditional multi-table insert


insert first
when deptno <= 20
then into dept_ctas
else into dept_clone
select * from dept;
DDL Commands
• CREATE [OR REPLACE] <type> [IF NOT EXISTS] <name> …
• ALTER <type> <name> …
• DROP <type> [IF EXISTS] <name> [CASCADE | RESTRICT]

• COMMENT [IF EXISTS] ON <type> <name> IS …


• USE [<type>] <name>  context!

• SHOW <types> [LIKE …] [IN …]


• DESC[RIBE] <type> <name>
Create Tables
• CREATE TABLE <target>
• (col1 type, ...)  from scratch, empty

• LIKE <source>  empty copy


• AS SELECT ... FROM <source>  full copy CTAS (Create Table As Select)
• CLONE <source>  zero-copy clone (no data copied)

create table dept_like like dept;

create table dept_ctas as select * from dept;

create table dept_clone clone dept;


Zero-Copy Cloning

• CREATE TABLE|SCHEMA|DATABASE <target> CLONE <source>

• A clone shares all initial data with its source  referenced storage
• Any further change is stored separately  owned storage
• Can clone from a specific point back in time  time travel

[Diagram: the <target> clone initially references all of the <source>'s storage (referenced storage); any further change on either table goes into its own owned storage.]
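A hedged sketch combining cloning with Time Travel (the table name and the 10-minute offset are illustrative):

CREATE TABLE dept_back_in_time CLONE dept
  AT(OFFSET => -600);   -- the DEPT table as it was 10 minutes ago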
Snowflake Scripting
• SQL vs Scripting - ~PL/SQL (Oracle), Transact-SQL (Microsoft SQL Server)
• Procedural language (SQL is declarative), as SQL extension, since Feb 2022
• used only for Stored Procs (UDFs/UDTFs w/ SQL)
• Temporary bug in SnowSQL and Classic Console → requires $$ .. $$

• BEGIN .. END code blocks (w/ optional DECLARE and EXCEPTION)


• Branching & Looping Statements
• CURSOR & RESULTSET Objects
• Variables & Return Value, Built-In Variables
• Exceptions
Snowflake Scripting: Code Block Template
<code block> template:

[declare]
  var1 float;
  my_exc exception (-202, 'Raised');
begin
  var1 := 3;
  let var2 := 4;
  if (should_raise_exc) then
    raise my_exc;
  return var1;
[exception]
  when statement_error then
    return object_construct(
      'Error type', 'STATEMENT_ERROR',
      'SQLCODE', sqlcode,
      'SQLERRM', sqlerrm,
      'SQLSTATE', sqlstate);
end;

anonymous block:

-- anonymous block
<code block>

or:

execute immediate
<code block>

stored procedure:

-- stored proc
create procedure proc()
returns ...
as
<code block>
Snowflake Scripting: Compared to SQL

SQL Client → sends one SQL statement at a time to the SQL Engine:

UPDATE past_orders
SET price = 22.40;

SQL Script Client → sends a whole SQL script to the SQL Engine (Compute), next to the data in Snowflake:

FOR order IN orders DO
  UPDATE past_orders
  SET price = order.price;
END FOR;
Snowflake Scripting: Variables
• Optional top DECLARE block, for SQL scalar/RESULTSET/CURSOR/EXCEPTION
• Optional data type + DEFAULT value
• LET for vars not DECLAREd, w/ optional data type + := initialization
• Can be used in RETURN
• Reference w/ : prefix only when in inline SQL

[declare]
var1 FLOAT DEFAULT 2.3;
res1 RESULTSET DEFAULT (SELECT ...);
cur1 CURSOR FOR SELECT (?) FROM ...
exc1 EXCEPTION (-202, 'Raised');
begin
var1 := 3;
LET var2 := var1 + 4;
LET cur2 CURSOR FOR SELECT :var1, ...;
LET res2 RESULTSET := (SELECT ...);
RETURN var1;
end;
Snowflake Scripting: Built-In Variables
• for last executed DML statement
• SQLROWCOUNT = rows affected by last statement, ~getNumRowsAffected()
• SQLFOUND = true if last statement affected 1+ rows
• SQLNOTFOUND = true if last statement affected 0 rows
• SQLID = ID of the last executed query (of any kind)
• exception classes  check in EXCEPTION block, with WHEN ...
• STATEMENT_ERROR = execution error
• EXPRESSION_ERROR = expression-related error
• exception info  to use in EXCEPTION block
• SQLCODE = exception_number
• SQLERRM = exception_message
• SQLSTATE = from ANSI SQL standard
Snowflake Scripting: Branching
• IF ... THEN ... [ELSEIF ... THEN ... [...]] ELSE ... END IF
• CASE cond WHEN val1 THEN ... [...] ELSE ... END  simple
• CASE WHEN cond1 THEN ... [...] ELSE ... END  searched

-- IF-THEN-ELSE
IF (FLAG = 1) THEN
  RETURN 'one';
ELSEIF (FLAG = 2) THEN
  RETURN 'two';
ELSE
  RETURN 'Unexpected!';
END IF;

-- simple CASE
CASE (v)
  WHEN 'first' THEN
    RETURN 'one';
  WHEN 'second' THEN
    RETURN 'two';
  ELSE
    RETURN 'unexpected';
END;

-- searched CASE
CASE
  WHEN v = 'first' THEN
    RETURN 'one';
  WHEN v = 'second' THEN
    RETURN 'two';
  ELSE
    RETURN 'unexpected';
END;
Snowflake Scripting: Looping
• FOR row IN cursor DO ... END FOR
• FOR i IN start TO end DO ... END FOR
• WHILE cond DO ... END WHILE
• REPEAT ... UNTIL cond END REPEAT
• LOOP ... [IF cond THEN BREAK/CONTINUE END IF] END LOOP
-- cursor-based
FOR rec IN c1 DO
  i := i + 1;
END FOR;

-- counter-based
FOR i IN 1 TO 20 DO
  i := i + 1;
END FOR;

-- WHILE-DO
WHILE (i <= 8) DO
  i := i + 1;
END WHILE;

-- REPEAT-UNTIL
REPEAT
  i := i - 1;
UNTIL (i > 0)
END REPEAT;

-- LOOP + BREAK/CONTINUE
LOOP
  i := i + 1;
  IF (i > 5) THEN
    BREAK;
  ELSEIF (i < 20) THEN
    CONTINUE;
  END IF;
END LOOP;
Snowflake Scripting: CURSOR and RESULTSET
• cur1 CURSOR FOR SELECT (?), ...  w/ optional bind vars
• res1 RESULTSET DEFAULT (SELECT :var ,...)  w/ optional var subst.
-- RESULTSET consumed through a cursor
declare
  var1 DEFAULT 0;
  res1 RESULTSET DEFAULT (SELECT :var1, ...);
begin
  LET cur1 CURSOR FOR res1;
  FOR row1 IN cur1 DO
    ...;
  END FOR;
end;

-- CURSOR with OPEN/FETCH/CLOSE
declare
  id1 INTEGER DEFAULT 0;
  cur1 CURSOR FOR SELECT ...;
begin
  OPEN cur1;
  FETCH cur1 INTO id1;
  CLOSE cur1;
  RETURN id1;
end;

-- CURSOR with bind variables
begin
  LET cur1 CURSOR FOR SELECT (?) ...;
  OPEN cur1 USING (:col1);
  RETURN TABLE(
    RESULTSET_FROM_CURSOR(cur1));
end;

-- RESULTSET from dynamic SQL
begin
  LET var1 := 'FROM ...';
  LET stm1 := 'SELECT ...' || var1;
  LET res1 RESULTSET :=
    (EXECUTE IMMEDIATE :stm1);
  RETURN TABLE(res1);
end;
Snowflake Scripting: Exceptions
• exc1 EXCEPTION (...)  custom exception object
• RAISE exc1  go to EXCEPTION section, check WHEN exc1
DECLARE
exc1 EXCEPTION (-2002, 'Raised');

BEGIN
LET err := true;
IF (err) THEN
RAISE exc1;
END IF;

EXCEPTION
WHEN STATEMENT_ERROR THEN ...
WHEN exc1 THEN ...
WHEN OTHER THEN
RETURN SQLCODE;
RAISE;
END;
Transaction Control Language (TCL)

• BEGIN TRANSACTION
• COMMIT | ROLLBACK

• DESCRIBE TRANSACTION
• SHOW TRANSACTIONS

• SHOW LOCKS
Transactions

• explicit = BEGIN TRANSACTION ... COMMIT/ROLLBACK


• implicit = single DDL stmt / DML stmt if AUTOCOMMIT on
• inner = no commit/rollback if already rolled-back/committed
• nested = no such thing (see before)
• scoped = begins and ends entirely inside a stored procedure, in the same session

• read committed = the only isolation level supported for tables


• system$abort_transaction = abort a running transaction by ID
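An explicit transaction sketch (the table and values are illustrative):

BEGIN TRANSACTION;
INSERT INTO dept (deptno, dname) VALUES (50, 'QA');
SHOW TRANSACTIONS;
ROLLBACK;   -- or COMMIT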
Client Request

Our customers table is owned by an external department, but we need to keep an up-to-date copy in our database as well. Please make sure it is synced frequently, so we can see all changes in near real-time.
New employees can be added through CSV files dropped into an Amazon S3 bucket folder. We cannot update or delete existing employees through this method.
Review of Client Request

Our customers table is owned by an external department, but we need to keep an up-to-date copy in our database as well. Please make sure it is synced frequently, so we can see all changes in near real-time.
New employees can be added through CSV files dropped into an Amazon S3 bucket folder. We cannot update or delete existing employees through this method.
Section Summary

• Change Data Capture (CDC)


• MERGE Statement
• Change Tracking
• Streams and Tasks
• Dynamic Tables
• Snowpipe
Batch/Stream Data Transfer to Snowflake

[Diagram: (1) Initial Batch → a full data transfer copies the source table from the remote OLTP database (PostgreSQL, Oracle, Microsoft SQL Server, SAP...) into the target table in the Snowflake data warehouse; CDC is then enabled on the source, and tools like Fivetran, Airbyte or Stitch read the CDC log; (2) Incremental → the CDC updates are streamed into a stage table in Snowflake.]
Change Data Capture (CDC)

[Example: (1) APPEND 3 → three source rows (1 John, 2 Mary, 3 George, all with del=False) are replicated to the target as 3 x INSERT; (2) APPEND 2 → an update (1 → Mark, del=False) and a logical delete (2, del=True) are replicated as UPDATE + DELETE, leaving the target with 1 Mark and 3 George.]
CDC Methods
• CDC with Manual MERGE Statement
• MERGE INTO ... USING ... WHEN ...
• CDC with Change Tracking
• ALTER TABLE ... SET CHANGE_TRACKING = TRUE;
• CHANGES(information => DEFAULT/APPEND_ONLY)
• CDC with Stream and Task
• CREATE STREAM ... ON TABLE source
• CREATE TASK ... AS ...
• SYSTEM$HAS_STREAM_DATA(stream)
• METADATA$ACTION, METADATA$ISUPDATE
• CDC with Dynamic Table
• CREATE DYNAMIC TABLE ...
Manual CDC with MERGE Statement
-- source (table) --> target (table)
CREATE TABLE source(del BOOLEAN, id INT, name STRING);
CREATE TABLE target(id INT, name STRING);

-- call after each batch of INSERT/DELETE/UPDATE on the source table


MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED AND del
THEN DELETE
WHEN MATCHED AND NOT del
THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED AND NOT del
THEN INSERT (id, name) VALUES (s.id, s.name);
CDC with Change Tracking
-- source (table) --> target (table)
CREATE TABLE source(id INT, name STRING);

-- enable change tracking and save that initial point in time


ALTER TABLE source SET CHANGE_TRACKING = TRUE;
SET ts1 = (SELECT CURRENT_TIMESTAMP());

-- perform INSERT/UPDATE/DELETE on the source table

-- see all INSERTs (since that point in time)


SELECT * FROM source
CHANGES(information => APPEND_ONLY) AT(timestamp => $ts1);

-- create target with all changes (since that point in time)


CREATE OR REPLACE TABLE target AS
SELECT id, name FROM source
CHANGES(information => DEFAULT) AT(timestamp => $ts1);
CDC with Stream and Task
CREATE TABLE source(id INT, name STRING);  source table
CREATE TABLE target(id INT, name STRING);  target table

CREATE STREAM stream1 ON TABLE source;  stream on source table


CREATE TASK task1  task for continuous CDC
WAREHOUSE = compute_wh
SCHEDULE = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('stream1')
AS
MERGE INTO target t USING stream1 s ON t.id = s.id
WHEN MATCHED AND metadata$action = 'DELETE'  DELETE
AND metadata$isupdate = 'FALSE'
THEN DELETE
WHEN MATCHED AND metadata$action = 'INSERT'  DELETE + INSERT
AND metadata$isupdate = 'TRUE'
THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED AND metadata$action = 'INSERT'  INSERT
THEN INSERT (id, name) VALUES (s.id, s.name);

ALTER TASK task1 RESUME/SUSPEND;  suspended by default!


EXECUTE TASK task1;  manual execution
CDC with Dynamic Table
CREATE TABLE source(id INT, name STRING);  source table

CREATE DYNAMIC TABLE target  dynamic table


WAREHOUSE = compute_wh  for internal task
TARGET_LAG = '1 minute'  ~SCHEDULE
AS
SELECT id, name FROM source;  w/ inferred schema

-- perform INSERT/UPDATE/DELETE on the source table

ALTER DYNAMIC TABLE target SUSPEND;  stop internal task


Snowpipes

• CREATE PIPE pipe AS COPY INTO table FROM @stage;

• AUTO_INGEST on
• automatic data loading
• from external stages only (S3, Azure Storage, Google Storage)
• AUTO_INGEST off
• no automatic data loading
• from external/internal named/table stages (not user stages!)
• only with Snowpipe REST API endpoint calls
Snowpipe on S3
• Create a continuous loading pipe based on the external S3 stage created before
• AUTO_INGEST = True
• COPY INTO new table FROM external stage
• Add an event notification for the S3 bucket/folder
• for "All object create events"
• on an SQS queue, with the ARN copied from SHOW PIPES for the created pipe
• Upload some CSV files in the folder
• check pipe status: select system$pipe_status('mypipe_s3');

CREATE PIPE mypipe_s3


AUTO_INGEST = TRUE
AS
COPY INTO emp_s3 FROM @mystage_s3
FILE_FORMAT = (TYPE = 'CSV')
ON_ERROR = 'CONTINUE';
Snowpipe on S3

[Diagram: (1) continuous data inflow of data files; (2) the files are uploaded to the S3 external stage (@stage) in the AWS account; (3) each file is queued in an SQS queue; (4) an SNS notification reaches the Snowflake account; (5) the pipe is triggered; (6) the pipe runs COPY INTO the table, using compute.]
Snowpipe on S3: Alternative

[Diagram: (1) continuous data inflow of data files; (2) the files are uploaded into the S3 external stage (@stage) in the AWS account; (3) each file is queued in an SQS queue; (4) a notification invokes a Lambda function; (5)-(6) the Lambda function calls a stored procedure in the Snowflake account; (7) the stored procedure runs COPY INTO the table, using compute.]
Client Request

Hello, the VP of Sales here...


Are there some other storage data formats better
suited for such an employee-manager
hierarchical topology?
What other free or open-source data visualization
options exist for this kind of relationship?
You already came up with recursive queries in SQL, but I am wondering whether we could use graphs or charts instead...
Review of Client Request

Hello, the VP of Sales here...


Are there some other storage data formats
better suited for such an employee-manager
hierarchical topology?
What other free or open-source data
visualization options exist for this kind of
relationship?
You already came up with recursive queries in SQL, but I am wondering whether we could use graphs or charts instead...
Section Summary

• Hierarchical Data Formats


• Graphs (with GraphViz)
• Charts (with Plotly)
• Trees (with D3)
Hierarchical Data Formats: JSON, XML, YAML
JSON:
{
  "name": "KING",
  "children": [
    { "name": "BLAKE",
      "children": [
        { "name": "MARTIN" },
        { "name": "JAMES" } ] },
    { "name": "JONES",
      "children": [
        { "name": "FORD" } ] }
  ]
}

XML:
<?xml version="1.0"?>
<object>
  <name>KING</name>
  <children>
    <object>
      <name>BLAKE</name>
      <children>
        <object>
          <name>MARTIN</name>
        </object>
        <object>
          <name>JAMES</name>
        </object>
      </children>
    </object>
    …
  </children>
</object>

YAML:
KING
- BLAKE
  - MARTIN
  - JAMES
- JONES
  - FORD
GraphViz DOT Notation: Edges

KING → BLAKE
BLAKE → MARTIN
BLAKE → JAMES
KING → JONES
JONES → FORD

digraph {
BLAKE -> KING;
MARTIN -> BLAKE;
JAMES -> BLAKE;
JONES -> KING;
FORD -> JONES;
}
GraphViz DOT Notation: Nodes & Edges

digraph {
n7839 [label="KING"];
n7698 [label="BLAKE"];
n7566 [label="JONES"];
n7902 [label="FORD"];
n7654 [label="MARTIN"];
n7900 [label="JAMES"];

n7698 -> n7839;


n7654 -> n7698;
n7900 -> n7698;
n7566 -> n7839;
n7902 -> n7566;
}
GraphViz DOT Notation: Styling
digraph {
graph [rankdir="BT"
bgcolor="#ffffff"
splines="ortho"]
node [style="filled"
fillcolor="lightblue"]
edge [arrowhead="None"]

n7839 [label="KING"
  shape="rect" color="red"];
n7698 [label="BLAKE"];
n7566 [label="JONES"];
n7902 [label="FORD"];
n7654 [label="MARTIN"];
n7900 [label="JAMES"];

n7698 -> n7839 [style="dashed"];


n7654 -> n7698 [style="dashed"];
n7900 -> n7698 [style="dashed"];
n7566 -> n7839;
n7902 -> n7566;
}
Plotly Charts: Treemap
Plotly Charts: Icicle
Plotly Charts: Sunburst
Plotly Charts: Sankey
D3: Collapsible Tree
D3: Linear Dendrogram
D3: Radial Dendrogram
D3: Network Graph
Client Request

Can you make it customizable, to be able to


upload any CSV file, and select the columns
we want?
And find an easy way to share it online. We here in Sales have limited technical knowledge and could not easily install and run your demo project as it is…
Review of Client Request

Can you make it customizable, to be able to


upload any CSV file, and select the columns
we want?
And find an easy way to share it online. We here in Sales have limited technical knowledge and could not easily install and run your demo project as it is…
Section Summary

• Introduction to Streamlit
• Layout Components
• Interactive Widgets
• State and Callbacks
• Data Cache
• Multi-Page Applications
• Test as Local Web App
• Deploy and Share as Web App
Introduction to Streamlit
• History
• Bought by Snowflake in 2022 for $800M
• Integrated with Snowpark: Streamlit in Snowflake + Native Apps
• Features
• RAD framework for data science experiments (~VB, Access, Python at the app level)
• Connect to all sorts of data sources (Snowflake etc)
• Instant rendering as charts or HTML content, using many third-party components
• Architecture
• Minimalistic layout components and simple input controls (single event per control!)
• Full page rerun after any input control interaction (tricky at the beginning!)
• Cache between reruns with session state and control callbacks (added recently)
• Data and resource/object cache (to avoid data reloads between page reruns)
• Development
• Great for prototyping and proof-of-concept simple apps (not like heavy React apps!)
• Support for single and multi-page applications
• Test as local web app (not standalone!)
• Share and deploy as remote web app to Streamlit Cloud (for 100% free!)
Layout Components
• st.sidebar  collapsible left sidebar
• st.tabs  tab control

• st.columns  side-by-side horizontal containers


• st.expander  collapsible container
• st.container  multi-element container
• st.empty  single-element placeholder
Layout Components
cont = st.container()
cont.write("First in …")
st.write("Outside the …")
cont.write("Second in …")

tabs = st.tabs(["Tab 1", "Tab 2", "Tab 3"])
tabs[0].write("Text in first tab")
tabs[1].write("Text in second tab")

with st.expander("Expanded"):
    st.write("This is expanded")

exps = st.expander("Collapsed", expanded=False)
exps.write("This is collapsed")

st.sidebar.selectbox("Select Box:", ["S", "M"])

cols = st.columns(3)
cols[0].write("Column 1")
cols[1].write("Column 2")
cols[2].write("Column 3")

with st.empty():
    st.write("Replace this...")
    st.write("...by this one")
Display Text
• st.write('Most objects'), st.write(['st', 'is <', 3])
• st.text('Fixed width text')
• st.title/header/subheader/caption('My title')

• st.code('for i in range(8): foo()')  source code (w/ optional line numbers)


• st.markdown('_Markdown_')  write markdown
• st.latex(r''' e^{i\pi} + 1 = 0 ''')  write formulas

• st.divider  ~HR
Interactive Widgets
• st.text/number/date/time_input("First name")  on_change() event
• st.text_area("Text to translate")  on_change() event

• st.selectbox("Pick one", ["cats", "dogs"])  on_change() event


• st.multiselect("Buy", ["milk", "apples", "potatoes"])  on_change() event
• st.slider("Pick a number", 0, 100)  on_change() event
• st.select_slider("Pick a size", ["S", "M", "L"])  on_change() event
• st.color_picker("Pick a color")  on_change() event
Interactive Widgets
on_change() → full page re-run

app.py (front-end)

import streamlit as st

st.multiselect("Select:",
    ["S", "M", "L"], default=["S", "M"])

st.selectbox("Choose:",
    ["S", "M", "L"])

st.select_slider("Choose:",
    ["S", "M", "L"], value="M")

st.radio("Choose:",
    ["S", "M", "L"], index=2)

[Diagram: when the user clicks on S, the front-end sends API calls to the server (back-end) and the whole page re-runs.]
Buttons
• st.button("Click me")  buttons on single line, on_click() event
• st.toggle("Enable")  on_change() event
• st.checkbox("I agree")  on_change() event
• st.radio("Pick one", ["cats", "dogs"])  on_change() event

• st.file_uploader("Upload a CSV")  on_change() event


• st.download_button("Download file", data)  on_click() event
• st.link_button("Go to gallery", url)  no events
Buttons

on_change/on_click() → full page re-run

app.py

import streamlit as st

if st.button("Button", key="my-button"):
    st.write("You clicked!")

if st.toggle("Toggle", key="my-toggle"):
    st.write("You toggled!")

cache (st.session_state)

{
  "my-button": false,
  "my-toggle": false
}
State and Callbacks

callbacks → full page re-run

app.py

import streamlit as st

def on_button_click(msg):
    st.write(f"{msg}, {st.session_state.name}")

st.session_state["name"] = "Chris"

st.button("Button", key="my-button",
    on_click=on_button_click, args=("Hi", ))

cache (st.session_state)

{
  "name": "Chris",
  "my-button": false
}
Data Cache

on_click() → full page re-run

app.py

import streamlit as st
import datetime

@st.cache_data        # or @st.cache_resource
def now():
    return datetime.datetime.now()

if st.button("Show Current Time"):
    st.write(now())                     # always the cached value: 2023-09-27 08:25:44.934312
    st.write(datetime.datetime.now())   # the current time: 2023-09-27 08:26:44.961236
Multi-Page Applications (obsolete now)

app.py

import streamlit as st

def main(): st.write("My main page")
def page_one(): st.write("My Page One page")
def page_two(): st.write("My Page Two page")
def about(): st.write("My About page")

funcs = {
    "-": main,
    "Page One": page_one,
    "Page Two": page_two,
    "About": about }
name = st.sidebar.selectbox(
    "Select Page:", funcs.keys())
funcs[name]()
Multi-Page Applications

1_Page_One.py
import streamlit as st

st.set_page_config(
page_title="Plotting Demo",
page_icon=" ")
Other Controls
• display progress/status
• with st.spinner(text='In progress'): …
• bar = st.progress(50) … bar.progress(100)
• with st.status('Authenticating...') as s: … s.update(label='Response')
• st.error/warning/info/success/toast('Error message')
• st.exception(e)
• st.balloons/snow()
• media
• st.image  show image/list of images
• st.audio/video  show audio/video player
• st.camera_input("Take a picture")  on_change() event
• chat
• with st.chat_message("user"): …  response to a chat message
• st.chat_input("Say something")  prompt chat widget, on_submit() event
Data Rendering
• st.dataframe  w/ dataframes from Pandas, PyArrow, Snowpark, PySpark
• st.table  show static table
• st.data_editor  show widget, on_change() event
• st.json  show pretty-printed JSON string
Charts
• st.area/bar/line/scatter_chart(df)
• st.map(df)  geo map w/ scatterplot
• st.graphviz_chart(fig)
• st.altair_chart(chart)
• st.bokeh_chart(fig)
• st.plotly_chart(fig)  interactive Plotly chart
• st.pydeck_chart(chart)  free maps
• st.pyplot(fig)  w/ matplotlib
• st.vega_lite_chart(df)
• st.column_config  insert spark lines!
• st.metric  show perf metric number in large
Deploy your Web App to Streamlit Cloud
• publish your app into GitHub
• Streamlit will create access keys! → need authorization
• your app will automatically refresh on each new GitHub push
• can later add "Open in Streamlit" button in GitHub
• deploying in Streamlit Cloud → for free, if public!
• sign-up with your Google email at share.streamlit.io
• make sure your app can be shared publicly → see limits! use subdomain
• replace any app\myapp.py to app/myapp.py, if from subfolder (\ → /)
• prefix any relative file names with os.path.dirname(__file__) + '/'
• make sure requirements.txt is updated → check black sidebar log
• add any passwords or confidential data as Secrets (see Advanced Options)
• make sure you'll run the same version of Python when deployed
• can later add the link in Medium posts → expanded as gadget
Client Request

It looks great! And we’re now approved to go


ahead and implement the Hierarchical Data
Viewer for our employee-manager data
already loaded in Snowflake!
Your web app should now connect by default to our employee table from Snowflake instead. Also show the hierarchical SQL query results with a path, as you demonstrated not long ago. Make it generic as well.
Review of Client Request

It looks great! And we’re now approved to


go ahead and implement the Hierarchical
Data Viewer for our employee-manager
data already loaded in Snowflake!
Your web app should now connect by default to our employee table from Snowflake instead. Also show the hierarchical SQL query results with a path, as you demonstrated not long ago. Make it generic as well.
Section Summary

• SnowCD (Connectivity Diagnostic)


• SnowSQL (Snowflake’s CLI)
• Visual Studio Code Extension
• Client Drivers
• Client-Side vs Server-Side Cursors
• Snowflake Connector for Python
• Connecting to Snowflake
• Snowflake Connectors for .NET and NodeJS
SnowCD (Connectivity Diagnostic Tool)
select system$allowlist() -- former system$whitelist()
-- copy JSON result into allowlist.json file
SnowCD: Private Links
select system$allowlist_privatelink()
-- copy JSON result into allowlist_privatelink.json file
SnowSQL (CLI Client)
• Overview
• Installs as command line tool for Linux/macOS/Windows
• Developed with Python Connector for Snowflake
• Features
• Connect to Snowflake account
• Perform SQL (DDL/DML) operations
• Run SQL deploy scripts (including PUT/GET for local files)
• Automate SQL deployments from bash shell through batch scripts
• Abort/pause running queries
• Hidden Gems
• ~/.snowsql/config file may be re-used to connect to Snowflake from client apps
• Can use SNOWSQL_PWD env var to save local password
• Customize scripts through variable substitution (for partner databases)
• Generate JWT tokens for key pair authentication (for SQL REST API)
Snowflake Extension for VS Code
• Enable users to write and execute SQL statements directly in VS Code.
• Install from Code > Preferences > Extensions.
• Sign-in to your cloud Snowflake account.
• Can use SnowSQL configuration files.
• Execute commands or queries, from SQL files.
• Query History.
• Query Results pane.
• Database Explorer.
• Upload/download files from stages directly.
Snowflake Extension for VS Code
Client Drivers

• Snowflake Connector for Python


• .NET Driver – for C#
• Node.js Driver – for JavaScript
• Go Snowflake Driver
• PHP PDO Driver for Snowflake

• JDBC Driver  for Java, Scala…


• ODBC Driver  from Windows
Client-Side vs Server-Side Cursors
Python Client → sends multiple SQL statements to the SQL Engine, through the Snowflake Connector for Python:

for order in orders:
    conn.cursor().execute(f"""
        UPDATE past_orders
        SET price = {order.price};
    """)

SQL Scripting Client → sends a single SQL script to the SQL Engine (Compute), next to the data in Snowflake:

FOR order IN orders DO
  UPDATE past_orders
  SET price = order.price;
END FOR;
Snowflake Connector for Python

[Diagram: the client's Python code calls the Snowflake Connector for Python, which sends SQL statements to the SQL Engine (Compute) running against the data in Snowflake.]
Python Connector API
• snowflake.connector module → apilevel, threadsafety, paramstyle, connect(…)
• QueryStatus → ABORTING/SUCCESS/RUNNING/QUEUED/BLOCKED/NO_DATA
• Exception → msg/raw_msg, errno, sqlstate, sfqid
• Connection → cursor(…), commit/rollback(), autocommit(…), close(), get_query_status(…), is_still_running/is_an_error(…), execute_string(…), execute_stream(…)
• Cursor → execute(…), execute_many(…), execute_async(…), fetchone/many/all(), fetch_pandas_all(…), fetch_pandas_batches(…), get_result_batches(…), describe(…), rowcount, close()
• ResultMetadata → name, type_code, display/internal_size, precision/scale, is_nullable
• ResultBatch → compressed/uncompressed_size, to_pandas() → pandas.DataFrame
Python Connector: Common Pattern
Python Client

import snowflake.connector

# show all property=value pairs for the current user
with snowflake.connector.connect(…) as conn:
    with conn.cursor() as cur:
        cur.execute("show parameters")
        for row in cur:
            print(f'{str(row[0])}={str(row[1])}')
Connecting to Snowflake
Basic (Username/Password)

conn = snowflake.connector.connect(
    account = …,
    user = …,
    role = …,
    database = …,
    schema = …,
    warehouse = …,
    password = os.getenv('SNOWSQL_PWD'))

SSO

conn = snowflake.connector.connect(
    account = …,
    user = …,
    role = …,
    database = …,
    schema = …,
    warehouse = …,
    authenticator = "externalbrowser")

Key Pair

with open(f"{Path.home()}/.ssh/id_rsa", "rb") as key:
    p_key = serialization.load_pem_private_key(
        key.read(),
        password = os.environ['SNOWSQL_PWD'].encode(),
        backend = default_backend())

pkb = p_key.private_bytes(
    encoding = serialization.Encoding.DER,
    format = serialization.PrivateFormat.PKCS8,
    encryption_algorithm = serialization.NoEncryption())

conn = snowflake.connector.connect(
    account = …,
    user = …,
    role = …,
    database = …,
    schema = …,
    warehouse = …,
    private_key = pkb)
Snowflake Connector for .NET
• Create new Console App (.NET Framework) C# project in free Visual Studio CE IDE
• Add the Snowflake.Data NuGet package and reference it with "using Snowflake.Data.Client;"

C# Client Code

var user = "cristiscu";
var pwd = Environment.GetEnvironmentVariable("SNOWSQL_PWD");
var connStr = $"account=BTB76003;user={user};password={pwd}";
using (var conn = new SnowflakeDbConnection(connStr)) {
    conn.Open();
    using (var cmd = conn.CreateCommand()) {
        cmd.CommandText = "show parameters";
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                Console.WriteLine($"{reader[0]}={reader[1]}");
    }
    conn.Close();
}
Client Request

We have some data scientists familiar with


Python, but not with SQL. Is it possible to
write and deploy queries in Python only,
the way we do it with Pandas DataFrame?
What about some other functionality, to
write it in pure Python and transparently
deploy it into Snowflake?
What if we need to load additional third-
party components? Or access intermediate
resources?
Review of Client Request

We have some data scientists familiar with


Python, but not with SQL. Is it possible to
write and deploy queries in Python only,
the way we do it with Pandas DataFrame?
What about some other functionality, to
write it in pure Python and transparently
deploy it into Snowflake?
What if we need to load additional third-
party components? Or access intermediate
resources?
Section Summary

• Client vs Server-Side Programming


• Snowpark for Python Architecture
• Snowpark API: The Object Model
• Creating Queries with DataFrame
• SPs/UDFs/UDTFs in Python/Java/Scala
• Loading Additional Components
• Python Worksheets
Client-Side vs Server-Side Programming
[Diagram 1, client-side programming: the UI (front-end) and the business logic both run on the client (with the Python Connector), which talks to Snowflake compute (back-end) and data over the Internet.]
[Diagram 2, server-side programming: only the UI (front-end) stays on the client; the business logic runs in Snowflake compute (back-end, with Snowpark), next to the data.]
Snowpark for Python
• Client API (Snowpark): DataFrame queries go through the Query Translator and become SQL statements; Python SPs/UDFs/UDTFs go through the Object Serializer and become Python bytecode; both are sent through the Snowflake Connector for Python
• Compute: the SQL Engine runs the SQL statements; the Python Sandbox (with Anaconda packages) runs the serialized Python code, next to the data in Snowflake


Snowpark API:
The Object Model

• Personal Blog Post from Jul 2023


• Also on the Snowflake DSH Blog
• Reposted by 40+ Snowflake
employees on LinkedIn

• Visual Class Diagrams


• Classes grouped by functionality
Session Class
• Context → get_active_session()
• Session.builder → SessionBuilder
• Session → sql(…), createDataFrame(…), range(…), generator(…), flatten(…), table_function(…), write_pandas(…), table(name) → Table, call(…), sproc, add_import/packages/requirements(…), get_imports/packages(), remove_import/package(…), clear_imports/packages(…), use_database/schema(…), use_role/warehouse(…), get_current_database/schema(), get_current_account/role/warehouse(…), query_history() → QueryHistory, file → FileOperation, reader → DataFrameReader, close/cancel_all(), query_tag, sql_simplifier_enabled, telemetry_enabled
• QueryRecord → query_id/sql_text, count/index(…)
• FileOperation → get/get_stream(…) → GetResult, put/put_stream(…) → PutResult
• GetResult → file, size, status/message
• PutResult → source/target, source/target_size, source/target_compression, status/message
DataFrame Class
• DataFrame → select/selectExpr(…), filter/where(…), sort/orderBy(…), union/unionAll/unionByName(…), intersect/except_/minus/subtract(…), withColumn/withColumnRenamed(…), with_columns(…), join/crossJoin/natural_join(…), join_table_function/flatten(…), limit(…), agg(…), drop/dropna/fillna(…), crosstab/unpivot(…), describe/rename/replace(…), sample/sampleBy(…), randomSplit(…), toDF(…), col(…), collect/collect_nowait(…), show(), explain(), count(…), take(…), first(…), is_cached(), queries, schema/columns, stat → DataFrameStatFunctions, na → DataFrameNaFunctions
• Column → alias/name/as_(), name/getName(), getItem(…), over(…), within_group(…), asc/desc(), asc/desc_nulls_first/last(), isin/in_(…), like/rlike/regexp(…), startswith/endswith(), substr/substring(…), equal_null/nan(), eqNullSafe(), bitand/or/xor(…), bitwiseAnd/OR/XOR(…), cast/asType/try_cast(…), between(…), collate(…)
• DataFrameStatFunctions → approxQuantile(…), corr/cov(…), crosstab(…), sampleBy(…)
• DataFrameNaFunctions → drop/fill/replace(…)

rows = get_active_session().sql("SELECT ...").collect()


Grouping Functions
(arg)
group_by_grouping_sets(…) GroupingSets
groupBy
cube/rollup(…) RelationalGroupedDataFrame
pivot(…)
DataFrame count(…)
sum/avg/min/max(…) builtin(…)
col(…) mean/median(…) function(…)
agg(…)

(inherits)
Column CaseExpr

within_group(…) when(…)
over(…) otherwise/else(…)
(arg)
Window
currentRow WindowSpec
unboundedPreceding orderBy/partitionBy(…)
unboundedFollowing rangeBetween/rowsBetween(…)
orderBy/partitionBy(…)
rangeBetween/rowsBetween(…)
MERGE Statement
when_matched()
update(…)
WhenMatchedClause
delete()

delete(…) DeleteResult
update(…) delete(…) rows_deleted
Table
rows_inserted UpdateResult
rows_updated
update(…) rows_updated
multi_joined_rows_updated
multi_joined_rows_updated
rows_deleted
drop_table()

insert(…) MergeResult
when_not_matched() rows_inserted
merge(…) rows_updated
rows_deleted
WhenNotMatchedClause
insert(…)
Input/Output
pandas.DataFrame
toPandas(…)
to_pandas_batches(…)

DataFrameWriter
copy_into_location(…)
saveAsTable(…) write
mode(…)

option/options(…)
schema(…)
DataFrameReader DataFrame
csv/json/xml(…)
table(…) parquet/orc/avro(…)
sample(…)
Table
table_name
is_cached cache_result(…)

Row collect(…)
count/index(…)
asDict(…)
Create Query with DataFrame

Python Code using DataFrame

emps = (session.table("EMP")
    .select("DEPTNO", "SAL"))
depts = (session.table("DEPT")
    .select("DEPTNO", "DNAME"))
q = emps.join(depts,
    emps.deptno == depts.deptno)
q = q.filter(q.dname != 'RESEARCH')
q = (q.select("DNAME", "SAL")
    .group_by("DNAME")
    .agg({"SAL": "sum"})
    .sort("DNAME"))
q.show()

Generated SQL

select dname, sum(sal)
from emp join dept
on emp.deptno = dept.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
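Before running the DataFrame query above, you can check what Snowpark will actually send to Snowflake. A minimal sketch, assuming the q DataFrame and the open session from this slide:

# inspect the SQL that Snowpark generated for this DataFrame
print(q.queries["queries"][-1])   # the final SELECT statement sent to Snowflake
q.explain()                       # prints the query plan without materializing results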
Snowpark Stored Procedures
Python Code

create procedure proc1(num float)
returns string
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'proc1'
as $$
import snowflake.snowpark as snowpark
def proc1(sess: snowpark.Session, num: float):
    return '+' + str(num)
$$;

Call from SQL

call proc1(22.5);

Java Code

create procedure proc1(num float)
returns string
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.proc1'
as $$
import com.snowflake.snowpark_java.*;
class MyClass {
    public String proc1(Session sess, float num) {
        return "+" + Float.toString(num); }}
$$;

Scala Code

create procedure proc1(num float)
returns string
language scala
runtime_version = 2.12
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.proc1'
as $$
import com.snowflake.snowpark.Session;
object MyClass {
    def proc1(sess: Session, num: Float): String = {
        return "+" + num.toString }}
$$;
Snowpark UDFs (User-Defined Functions)
Python Code

create function fct1(num float)
returns string
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'proc1'
as $$
import snowflake.snowpark as snowpark
def proc1(num: float):
    return '+' + str(num)
$$;

Call from SQL

select fct1(22.5);

Java Code

create function fct1(num float)
returns string
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.fct1'
as $$
import com.snowflake.snowpark_java.*;
class MyClass {
    public String fct1(float num) {
        return "+" + Float.toString(num);
}}
$$;

Scala Code

create function fct1(num float)
returns string
language scala
runtime_version = 2.12
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.fct1'
as $$
import com.snowflake.snowpark.Session;
object MyClass {
    def fct1(num: Float): String = {
        return "+" + num.toString
}}
$$;
Snowpark UDTFs (User-Defined Table Functions)
Python Code

create function fctt1(s string)
returns table(out varchar)
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'MyClass'
as $$
import snowflake.snowpark as snowpark
class MyClass:
    def process(self, s: str):
        yield (s,)
        yield (s,)
$$;

Call from SQL

select * from
table(fctt1('abc'));

Java Code

create function fctt1(s string)
returns table(out varchar)
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass'
as $$
import com.snowflake.snowpark_java.*;
import java.util.stream.Stream;
class OutputRow {
    public String out;
    public OutputRow(String outVal) {
        this.out = outVal; }
}

class MyClass {
    public static Class getOutputClass() {
        return OutputRow.class; }
    public Stream<OutputRow> process(String inVal) {
        return Stream.of(
            new OutputRow(inVal),
            new OutputRow(inVal));
    }
}
$$;
Snowpark for Python
Data Frame query Python SPs/UDFs/UDTFs

Query Translator Object Serializer


Snowpark
Client API
Snowflake Connector for Python

SQL statements Python bytecode

Compute

SQL Engine Python Sandbox

Snowflake

Data Anaconda Packages


Python Worksheets

Web UI

SQL Worksheet Python Worksheet

Compute Snowflake

SQL Engine Stored Procedure

Data Anaconda Packages


Functions

sproc
StoredProcedureRegistration StoredProcedure

udf
Session UDFRegistration UserDefinedFunction

udtf
UDTFRegistration UserDefinedTableFunction
Functions in Snowpark Python
• creating
• sproc/udf/udtf(lambda: ..., [name="..."], ...)  anonymous/named
• sproc/udf/udtf.register(name="...", is_permanent=True, ...)  registered
• @sproc/@udf/@udtf(name="...", is_permanent=True, ...)  registered
• UDTF handler class
• __init__(self) - optional
• process(self, ...) - required, for each input row → tuples w/ tabular value
• end_partition(self) - optional, to finalize processing of input partitions
• calling
• name(...) / fct = function("name")  by name/function pointer
• session.call/call_function/call_udf("name", ...)  SP/UDF
• session.table_function(...) / dataframe.join_table_function(...)  UDTF
• session.sql("call name(...)").collect()  SP
External Dependencies
• imports  local/staged JAR/ZIP/Python/XML files/folders
• IMPORTS = (‘path’)  in CREATE PROCEDURE/FUNCTION
• session.add_import("path")  session level

• packages  referenced internal (Anaconda) / external libraries (loaded with "import")


• PACKAGES = ('name', ...)  in CREATE PROCEDURE/FUNCTION
• session.add_packages("name", ...)  session level
• @udtf(packages=["name", ...])  @sproc/@udf/@udtf
• session.add_requirements("path/requirements.txt")  session level
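A minimal sketch of both mechanisms at the session level (the stage path, file names and UDF name are illustrative):

# Anaconda packages resolved server-side, plus your own staged code
session.add_packages("numpy", "pandas")
session.add_import("@mystage/my_helpers.zip")
# or pin everything from a local file:
# session.add_requirements("requirements.txt")

from snowflake.snowpark.functions import udf

@udf(name="sqrt_udf", replace=True, packages=["numpy"])
def sqrt_udf(x: float) -> float:
    import numpy as np             # import third-party packages inside the handler
    return float(np.sqrt(x))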
Python Worksheets

Web UI

SQL Worksheet Python Worksheet

Compute Snowflake

SQL Engine Stored Procedure

Data Anaconda Packages


Python Worksheets: Template
import snowflake.snowpark as snowpark  Snowpark required

def main(session: snowpark.Session):  default handler (can change)


# your Python code goes here...

Python Stored Procedure wrapper


with <worksheet_name> as procedure ()  on-the-fly stored proc (wrapper)
returns Table()  from Settings: String/Variant/Table()
language python  always Python
runtime_version=3.8  always Python 3.8
packages=('snowflake-snowpark-python')  from Packages
handler='main'  from Settings
as '  your Python Worksheet code
import snowflake.snowpark as snowpark  Python Worksheet

def main(session: snowpark.Session):  default handler (can change)


# your Python code goes here...
'
call <worksheet_name>();  execute on-the-fly stored proc
Python Worksheets: Custom Signature
Generated SQL Worksheet  Deploy > Open in Worksheets
create procedure <proc_name>(par1 int, par2 string, ...)
returns Table()
language python
runtime_version = 3.8
packages =('snowflake-snowpark-python==*')
handler = 'main'
as '
import snowflake.snowpark as snowpark

def main(session: snowpark.Session, par1, par2, ...):


# your Python code goes here...
'

call <proc_name>(100, "abc", ...);


Client Request

For security and performance reasons, can


you deploy your whole Hierarchical Data
Viewer in Snowflake, closer to data?
How can we log messages and trace events in
Snowflake? We need to be alerted on the
most serious errors.
Review of Client Request

For security and performance reasons, can


you deploy your whole Hierarchical Data
Viewer in Snowflake, closer to data?
How can we log messages and trace events
in Snowflake? We need to be alerted on the
most serious errors.
Section Summary

• Streamlit in Snowflake Architecture


• Test as Local Streamlit Web App
• Deploy as Streamlit in Snowflake App
• Event Tables
• Log Messages & Trace Events
• Alerts & Email Notifications
Streamlit in Snowflake
web app client

Database
DataFile
Data File Streamlit
Python Files Named Stage

Snowpark API

Snowflake
Compute

SQL Engine Python Sandbox web server

Data Anaconda Packages


Streamlit in Snowflake
• Create and test as a local Streamlit web app
• Create a local Streamlit app, with one or more Python files.
• Connect locally to Snowflake through Snowpark.
• Test your application as a local Streamlit web app.

• Deploy as a Streamlit in Snowflake app


• Create a Snowflake database with a named stage.
• Upload your Python and other app files into this stage.
• Create a STREAMLIT object, mentioning the entry point file.
• In Snowsight, start your new app in the new Streamlit tab.
• Connect to Snowflake through get_active_session()
• Continue editing, running, and testing the app in Snowsight.
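A minimal sketch of the entry-point file for the steps above (streamlit_app.py; the table name is borrowed from the EMP examples used in this course):

import streamlit as st
from snowflake.snowpark.context import get_active_session

st.title("Hierarchical Data Viewer")

session = get_active_session()     # in Snowsight; locally, build a Session with Session.builder instead
df = session.table("EMPLOYEES.PUBLIC.EMP").select("ENAME", "MGR").to_pandas()
st.dataframe(df)                   # rendered by Streamlit's web server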
Streamlit in Snowflake
Event Tables
• CREATE EVENT TABLE myevents  predefined cols, max 1MB/row
• ALTER ACCOUNT SET EVENT_TABLE = myevents  one per account
• ALTER ACCOUNT UNSET EVENT_TABLE
• SHOW PARAMETERS LIKE 'event_table' IN ACCOUNT

• levels
• ALTER … SET LOG_LEVEL = OFF/DEBUG/WARN/INFO/ERROR  log messages
• ALTER … SET TRACE_LEVEL = OFF/ALWAYS/ON_EVENT  trace events

• SYSTEM$LOG('level', message)
• SYSTEM$LOG_TRACE/DEBUG/INFO/WARN/ERROR/FATAL(message)
Event Table Columns
• TIMESTAMP - log time / event creation / end of a time span
• START/OBSERVED_TIMESTAMP - start of a time span

• RECORD_TYPE - LOG / SPAN_EVENT / SPAN


• VALUE - primary event value (as VARIANT)

• TRACE - trace_id/span_id, as tracing context


• RESOURCE_ATTRIBUTES - source db/schema/user/warehouse of event
• SCOPE - class names for logs (name...)
• RECORD - fixed values for each record type (severity_text...)
• RECORD_ATTRIBUTES - variable attributes for each record type

SELECT RECORD['severity_text'] AS SEVERITY, VALUE AS MESSAGE


FROM myevents
WHERE SCOPE['name'] = 'python_logger' AND RECORD_TYPE = 'LOG';
Event Tables: Log Messages & Trace Events
Python Code
# log messages
import logging

logger = logging.getLogger("mylog")
logger.info/debug/warning/error/log/...("This is an INFO test.")

# trace events
from snowflake import telemetry

telemetry.add_event("FunctionEmptyEvent")
telemetry.add_event("FunctionEventWithAttributes", {"key1": "value1", ...})
Alerts
• CREATE ALERT ...
• WAREHOUSE  for compute resources
• SCHEDULE  cron expression, for periodical evaluation
• IF (EXISTS(condition))  SELECT/SHOW/CALL stmt to check condition
• THEN action  SQL CRUD/script/CALL, can also use system$send_email(...)
• ALTER ALERT ... SUSPEND/RESUME
• INFORMATION_SCHEMA.ALERT_HISTORY(...)  table function

CREATE ALERT myalert


WAREHOUSE = mywarehouse
SCHEDULE = '1 minute'
IF(EXISTS(
SELECT value FROM gauge WHERE value > 200
))
THEN
INSERT INTO gauge_history VALUES (current_timestamp());
Email Notifications
• CREATE NOTIFICATION INTEGRATION ...
• TYPE = EMAIL
• ALLOWED_RECIPIENTS = (email1, ...)  list of email addresses for TO
• SYSTEM$SEND_EMAIL(integration, to_list, subject, body, [mime_type])
• Snowflake Computing <[email protected]>  from

CREATE NOTIFICATION INTEGRATION my_notif_int


TYPE = EMAIL
ENABLED = TRUE
ALLOWED_RECIPIENTS = ('[email protected]', '[email protected]');

-- send email
CALL SYSTEM$SEND_EMAIL(
'my_notif_int',
'[email protected], [email protected]',
'Email Alert: Task A has finished.',
'Task A has successfully finished.\nEnd Time: 12:15:45');
Client Request

Our database could be accessed by local


admins, editors and guests. Editors cannot
make metadata changes, and guests can
only see the data. Some other employees
could be responsible for uploading data.
We also need a generic script to create
specialized databases and prod/test envs for
each of our partners.
Review of Client Request

Our database could be accessed by local


admins, editors and guests. Editors cannot
make metadata changes, and guests can
only see the data. Some other employees
could be responsible for uploading data.
We also need a generic script to create
specialized databases and prod/test envs for
each of our partners.
Section Summary

• Account and Schema-Level Objects


• Access Control Framework
• Data Control Language (DCL)
• The Roles Hierarchy
• System-Defined Roles
• Access Control Privileges
• Variables and Variable Substitution
Account-Level Objects System Role
Custom Role
Database Role
Application Role
Instance Roles
Managed Account Organization Role User
Reader Account
Global Account
Replication Account
Share Account

Application Database Warehouse


(class) (integration)
Package
Budget Application API Integration
Anomaly_Detection External Access Integration
Forecast Schema Notification Integration
Security Integration
Storage Integration
Resource Monitor
schema-level objects Network Policy
Connection
Schema-Level Objects

Alert Secret File Format Pipe Stream

Tag Streamlit Schema Sequence Task

Table View Stage (programming) (policy)

Temporary Table Materialized View Table Stage Procedure Masking Policy


Transient Table User Stage Function Row Access Policy
External Table Named Stage External Function Password Policy
Dynamic Table External Stage Session Policy
Event Table Packages Policy
(Hybrid Table) Network Rule
Access Control Framework

• DAC (Discretionary Access Control) - each object has an owner, who grants access to that object


• RBAC (Role-Based Access Control) - privileges on objects are granted to roles, which are granted to users

CREATE USER user CREATE ...


user object
GRANT/REVOKE ROLE role
TO/FROM USER user

GRANT/REVOKE privilege
CREATE ROLE role ON object TO/FROM role
role privilege

GRANT/REVOKE ROLE role


TO/FROM ROLE role
Data Control Language (DCL)
• CREATE ROLE role
• GRANT ROLE role TO ROLE role  create the role hierarchy
• CREATE USER user
• GRANT ROLE role TO USER user  assign roles to users
• GRANT privilege, … ON obj_type obj_name TO ROLE role
• REVOKE privilege, … ON obj_type obj_name FROM ROLE role
The Role Hierarchy
ACCOUNTADMIN ORGADMIN

SECURITYADMIN SYSADMIN DATABASE ROLE

Functional
USERADMIN ADMIN EDITOR GUEST Custom Roles
(~User Groups)

Database
RW_ROLE RO_ROLE Custom Roles

PUBLIC
System-Defined Roles
• ACCOUNTADMIN - top-level role
• as SYSADMIN + SECURITYADMIN (+ ORGADMIN)
• SECURITYADMIN - to CREATE ROLE and GRANT ROLE to ROLE/USER
• GRANT/REVOKE privileges, inherits USERADMIN
• USERADMIN - to CREATE USER
• SYSADMIN - to CREATE WAREHOUSE/DATABASE/...
• GRANT privileges to objects, should inherit from any custom role

• ORGADMIN - to CREATE ACCOUNT in org


• SHOW ORGANIZATION ACCOUNTS and SHOW REGIONS, view usage info in org
• PUBLIC - automatically granted to every user/role in your account.
Access Control Privileges
• OWNERSHIP - full control over an object to one single role
• MANAGE GRANTS - can grant/revoke privileges on any object (~OWNERSHIP)
• IMPORTED PRIVILEGES - may enable other roles to access a shared db
• ALL [PRIVILEGES] - all privileges, except OWNERSHIP
• USAGE - can USE/SHOW a db/schema/function/stage/warehouse etc
• SELECT - can query and display R/O table/view/stream data
• INSERT/UPDATE/DELETE - enable R/W CRUD operations on table data
• REFERENCES - can display the structure of a table/view or set table constraints
• EXECUTE - can run a task/alert
• CREATE/MODIFY - can create/alter different types of objects
• APPLY - can add/drop policies/tags
• ON FUTURE - on db/schema objects yet to be created
Create Roles and Users
-- create roles
USE ROLE SECURITYADMIN;
CREATE ROLE ADMIN;
CREATE ROLE EDITOR;
CREATE ROLE GUEST;
CREATE ROLE RO_ROLE;
CREATE ROLE RW_ROLE;

-- create the role hierarchy
USE ROLE SECURITYADMIN;
GRANT ROLE ADMIN TO ROLE SYSADMIN;
GRANT ROLE EDITOR TO ROLE SYSADMIN;
GRANT ROLE GUEST TO ROLE SYSADMIN;
GRANT ROLE RO_ROLE TO ROLE ADMIN;
GRANT ROLE RW_ROLE TO ROLE ADMIN;
GRANT ROLE RO_ROLE TO ROLE EDITOR;
GRANT ROLE RW_ROLE TO ROLE EDITOR;
GRANT ROLE RO_ROLE TO ROLE GUEST;

-- create users
USE ROLE USERADMIN;
CREATE USER MARK;
CREATE USER CLAUDE;
CREATE USER MARY;

-- assign roles to users
USE ROLE SECURITYADMIN;
GRANT ROLE ADMIN TO USER MARK;
GRANT ROLE EDITOR TO USER CLAUDE;
GRANT ROLE GUEST TO USER MARY;
Grant Database Privileges to Roles
-- create database (w/ PUBLIC schema)
CREATE DATABASE security;

-- R/O privileges (RO_ROLE)


GRANT OPERATE, USAGE ON WAREHOUSE compute_wh TO ROLE RO_ROLE;

GRANT USAGE ON DATABASE security TO ROLE RO_ROLE;


GRANT USAGE ON SCHEMA security.public TO ROLE RO_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA security.public TO ROLE RO_ROLE;
GRANT SELECT ON FUTURE TABLES IN SCHEMA security.public TO ROLE RO_ROLE;

-- R/W privileges (RW_ROLE)


GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA security.public TO ROLE RW_ROLE;
GRANT INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA security.public TO ROLE RW_ROLE;
GRANT CREATE TABLE, CREATE VIEW ON SCHEMA security.public TO ROLE RW_ROLE;

-- ADMIN privileges (ADMIN)


GRANT ALL ON DATABASE security TO ROLE ADMIN;
GRANT ALL ON FUTURE SCHEMAS IN DATABASE security TO ROLE ADMIN;
Inspect Database Object Privileges

select privilege_type, is_grantable, grantee


from security.information_schema.object_privileges
where object_catalog = 'SECURITY';
Multi-Tenant Architecture
HP Partner AT&T Partner

HP_ADM_PROD HP_ADM_DEV ATT_ADM_PROD ATT_ADM_DEV


HP_ETL_PROD HP_ETL_DEV ATT_ETL_PROD ATT_ETL_DEV

Compute

HP_WH_PROD HP_WH_DEV ATT_WH_PROD ATT_WH_DEV

HP_DB_PROD HP_DB_DEV ATT_DB_PROD ATT_DB_DEV

Snowflake Hosting Account


Multi-Tenant Databases and Environments
Command-Line Call
> snowsql -c my_conn -f create.sql -D tenant=HP -D env=PROD
SQL Script w/ var substitution
-- have variable substitution ON
!SET VARIABLE_SUBSTITUTION=true;

-- create new roles for tenant Admin and tenant ETL data engineer
USE ROLE SECURITYADMIN;
CREATE OR REPLACE ROLE &{tenant}_ADM_&{env};
CREATE OR REPLACE ROLE &{tenant}_ETL_&{env};

-- grant privileges to new tenant Admin role


USE ROLE ACCOUNTADMIN;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE &{tenant}_ADM_&{env};
GRANT CREATE WAREHOUSE ON ACCOUNT TO ROLE &{tenant}_ADM_&{env};

-- create tenant database, w/ new tenant Admin role


USE ROLE &{tenant}_ADM_&{env};
CREATE DATABASE &{tenant}_DB_&{env};
GRANT USAGE ON DATABASE &{tenant}_DB_&{env} TO ROLE &{tenant}_ETL_&{env};

Client Request

First, we'd like to run a Snowflake query


from a PowerShell script, without any
application or other code.
Second, when we drop employee data files
into an internal named stage, we want those
files automatically loaded into a table.
Review of Client Request

First, we'd like to run a Snowflake query


from a PowerShell script, without any
application or other code.
Second, when we drop employee data files
into an internal named stage, we want those
files automatically loaded into a table.
Section Summary

• Configure Key Pair Authentication


• Generate JWT with SnowSQL
• Snowpipe REST API
• Snowflake SQL REST API
• Cancel Running Query
• POST Command with curl
Key Pair Authentication

• Required for: Snowpipe API, SQL REST API

• Create and save a passphrase (in local env var) for an encrypted private key
• Generate and save (always local) a private key → rsa_key.p8
• Generate and save a public key (based on the private key) → rsa_key.pub
• Connect w/ basic authN and set RSA_PUBLIC_KEY for current user
• Reconnect w/ private key (and passphrase, if encrypted)
• Can later generate temporary JWT token w/ SnowSQL
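A minimal sketch of a Python connection that uses the key pair instead of a password (account locator and user name reuse values from earlier demos; the passphrase is read from an environment variable):

import os
import snowflake.connector
from cryptography.hazmat.primitives import serialization

# load the encrypted private key generated in the steps above
with open(os.path.expanduser("~/.ssh/rsa_key.p8"), "rb") as f:
    pkey = serialization.load_pem_private_key(
        f.read(), password=os.environ["SNOWSQL_PRIVATE_KEY_PASSPHRASE"].encode())

pkb = pkey.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption())

conn = snowflake.connector.connect(
    account="BTB76003", user="cristiscu", private_key=pkb)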
Key Pair Authentication: Configuration

2. Generate public key


1. Generate private key (using the private key) C:\Users\cristiscu\.ssh\
(with optional
passphrase)
private key public key
rsa_key.p8 rsa_key.pub

3. Connect with 4. Connect with


username and private key (and
password passphrase)

ALTER USER cristiscu


SET RSA_PUBLIC_KEY='...';
Snowflake
DESCRIBE USER cristiscu;
SQL REST API
• https://<orgname>-<accountname>.snowflakecomputing.com/api  API endpoint

• POST /api/v2/statements  submit SQL statements for execution


• GET /api/v2/statements/handle  check execution status of a statement
• POST /api/v2/statements/handle/cancel  cancel statement execution

• Required authentication with either OAuth or Key Pair, with JWT token.
• Data is returned in partitions.
• Can fetch query results concurrently.
• No PUT or GET commands.
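A minimal sketch with the requests package; the JWT is assumed to sit in a JWT_TOKEN environment variable (it can be produced with snowsql --generate-jwt), and the warehouse/database names are the ones used throughout this course:

import os, requests

url = "https://<orgname>-<accountname>.snowflakecomputing.com/api/v2/statements"
headers = {
    "Authorization": f"Bearer {os.environ['JWT_TOKEN']}",
    "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
    "Content-Type": "application/json"}
body = {"statement": "select dname, sum(sal) from emp join dept using(deptno) group by dname",
        "warehouse": "COMPUTE_WH", "database": "EMPLOYEES", "schema": "PUBLIC"}

resp = requests.post(url, json=body, headers=headers)
resp.raise_for_status()
print(resp.json()["data"])         # first partition of the result set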
SQL REST API

Python code Postman / curl SnowSQL


Clients
(import requests) (REST API tools) (Snowflake CLI)
POST /api/v2/statements
GET /api/v2/statements/handle
POST /api/v2/statements/handle/cancel
REST API Endpoint
SQL statements
Compute
Snowflake
SQL Engine

Data
Snowpipe REST API
• https://<acct>.snowflakecomputing.com/v1/data/pipes/<name>  API endpoint

• POST /insertFiles?requestId=id  triggers COPY TO cmd, to ingest list of files


• GET /insertReport?requestId=id&beginMark=mark  report of previous ingestion
• GET /loadHistoryScan?startTimeIncl=stime&endTimeExcl=etime&requestId=id 
report of ingestion during period

• Required authentication with Key Pair and generated JWT token.
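A minimal sketch of the two main endpoints with the requests package (the pipe name EMP_PIPE and the file name are illustrative; the JWT comes from key pair authentication, as above):

import os, uuid, requests

base = ("https://BTB76003.snowflakecomputing.com"
        "/v1/data/pipes/EMPLOYEES.PUBLIC.EMP_PIPE")
headers = {"Authorization": f"Bearer {os.environ['JWT_TOKEN']}",
           "Content-Type": "application/json"}

# trigger the COPY for files already uploaded to the pipe's stage
resp = requests.post(f"{base}/insertFiles",
                     params={"requestId": str(uuid.uuid4())},
                     headers=headers,
                     json={"files": [{"path": "emp01.csv"}]})
print(resp.status_code, resp.json())

# then poll the ingestion report
report = requests.get(f"{base}/insertReport", headers=headers).json()
print(report.get("files", []))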


Snowpipe REST Endpoints
continuous insertFiles()
data inflow

DataFile
Data File Internal
Data File Named Stage REST Endpoints
@stage
upload files
Ingest
Queue

Compute

Pipe

COPY INTO table

Table

Snowflake Account
Client Request

We need to auto-discover and mask the


sensitive customer data. Employee salaries
as well. And queries must be automatically
tagged with user's department.
Employees of the RESEARCH department
cannot see the actual year of the
HIREDATE in the employees table, and
should be restricted to their own records.
Review of Client Request

We need to auto-discover and mask the


sensitive customer data. Employee salaries
as well. And queries must be automatically
tagged with user's department.
Employees of the RESEARCH department
cannot see the actual year of the
HIREDATE in the employees table, and
should be restricted to their own records.
Section Summary

• Data Governance
• Object Tagging
• Query Tagging
• Data Classification
• Masking Policies
• Row Access Policies
Data Governance
• Object Tagging
• Query Tagging
• Data Classification
• System tags & categories
• Column-Level Masking Policies
• Dynamic Data Masking
• External Tokenization
• Tag-Based Masking
• Row Access Policies
Object Tagging
• Monitors sensitive data for compliance, discovery, protection, resource usage.
• Schema-level object, inherited by any child object → tag lineage

• CREATE TAG <tag> ALLOWED_VALUES <value1>, …


• ALTER TAG <tag> DROP ALLOWED_VALUES <value1>, …
• SHOW TAGS …

• CREATE <object> WITH TAG (<tag> = <value>)


• ALTER <object> SET TAG <tag> = <value>
• ALTER <object> UNSET TAG <tag>

• SYSTEM$GET_TAG(<tag>, <obj>, <domain>) → assigned object tag value


• INFORMATION_SCHEMA.TAG_REFERENCES table function
• ACCOUNT_USAGE.TAG_REFERENCES view
Query Tag
• QUERY_TAG  session-level parameter to auto-tag queries
• QUERY_HISTORY  query w/ the tag

• can also be assigned at the user level, to tag all queries from a group of users


• or set at the account level, for all queries
• similar to, but better than, appending comments at the end of the query text
• analytic queries run by Looker/Tableau/Power BI can use this feature
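A minimal sketch (the tag value dept:research is illustrative, and session is an open Snowpark session):

session.sql("alter session set query_tag = 'dept:research'").collect()
session.sql("select count(*) from emp").collect()

# later: find every query that carried this tag
rows = session.sql("""
    select query_text, total_elapsed_time
    from table(information_schema.query_history())
    where query_tag = 'dept:research'
    order by start_time desc""").collect()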
Data Classification
• Applies system-defined tags to recognize and protect sensitive data
• SNOWFLAKE.CORE.PRIVACY_CATEGORY
• identifier = name, email, IP address, url, bank account, driver license…
• quasi-identifier = age, gender, country, date of birth, occupation, city…
• sensitive/insensitive = salary
• SNOWFLAKE.CORE.SEMANTIC_CATEGORY
• personal attributes: name, age, email, bank account...

show tags in schema SNOWFLAKE.CORE;


select "name", "allowed_values" from table(result_scan(last_query_id()));
Data Classification: Generate Fake Data
import snowflake.snowpark as snowpark
from snowflake.snowpark.types import StructType, StructField, StringType
from faker import Faker
def main(session: snowpark.Session):
f = Faker()
output = [[f.name(), f.address(), f.city(), f.state(), f.email()]
for _ in range(10000)]
schema = StructType([
StructField("NAME", StringType(), False),
StructField("ADDRESS", StringType(), False),
StructField("CITY", StringType(), False),
StructField("STATE", StringType(), False),
StructField("EMAIL", StringType(), False)])
df = session.create_dataframe(output, schema)
df.write.mode("overwrite").save_as_table("CUSTOMERS_FAKE")
return df
Data Classification: Extract Semantic Categories
-- SELECT EXTRACT_SEMANTIC_CATEGORIES('CUSTOMERS_FAKE');

SELECT f.key::varchar as column_name,


f.value:"recommendation":"privacy_category"::varchar as privacy_category,
f.value:"recommendation":"semantic_category"::varchar as semantic_category,
f.value:"recommendation":"confidence"::varchar as confidence,
f.value:"recommendation":"coverage"::number(10,2) as coverage,
f.value:"details"::variant as details, f.value:"alternates"::variant as alts
FROM TABLE(FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('CUSTOMERS_FAKE')::VARIANT)) AS f;
Data Classification: Apply Semantic Categories
CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
'EMPLOYEES.PUBLIC.CUSTOMERS_FAKE',
EXTRACT_SEMANTIC_CATEGORIES('EMPLOYEES.PUBLIC.CUSTOMERS_FAKE'))

Applied tag semantic_category to 3 columns.


Applied tag privacy_category to 3 columns.

select * from table(information_schema.tag_references(


'EMPLOYEES.PUBLIC.CUSTOMERS_FAKE.EMAIL', 'column'));
Tagged Objects in Snowsight
Masking & Access Policies

access
(row-level)

masking
(column-level)
Masking Policies (Column-Level)
• Dynamic Data Masking
• to mask stored data with built-in function  data visualization protection
• (***) ***-****  masked (or NULL)
• (***) ***-4465  partially-masked (604) 555-4465

• External Tokenization
• to store tokenized data w/ external function  data storage protection
• (Gw6) fk2-cHSl  obfuscated
• dslgklknbsdfsdfxzc  tokenized (encoded)

• Tag-Based Masking
• ~dynamic data masking, but with a 'PII' security tag value
Masking Policies (Column-Level)
create masking policy research_on_year
as (hiredate date) returns date ->
case when current_role() <> 'RESEARCH' then hiredate
else date_from_parts(2000, month(hiredate), day(hiredate)) end;

alter table emp


modify column hiredate
set masking policy research_on_year;

select * from emp;  with role RESEARCH


Tag-Based Column Masking
CREATE TAG security_class ALLOWED_VALUES 'PII', 'PCA', 'PHI';

create masking policy research_on_year_tag


as (hiredate date) returns date ->
case when SYSTEM$GET_TAG_ON_CURRENT_COLUMN(
'EMPLOYEES.PUBLIC.SECURITY_CLASS') <> 'PII' then hiredate
else date_from_parts(2000, month(hiredate), day(hiredate)) end;

ALTER TAG security_class


SET MASKING POLICY research_on_year_tag;

ALTER TABLE emp


ALTER COLUMN hiredate SET TAG security_class = 'PII';
Access Policies (Row-Level)

create row access policy research_on_emp


as (deptno int) returns boolean ->
deptno = 20 or current_role() <> 'RESEARCH';

alter table emp


add row access policy research_on_emp
on (deptno);

select * from emp;  with role RESEARCH


Client Request

Most departments will get their own


Snowflake account, and we will need to
share our data with them.
Some departments may not have their own
account, but we still need to come up with
a solution.
We also need to share some aggregate data
with our partners. But they should not be
able to guess individual row values, while
we cannot access their data at all.
Review of Client Request

Most departments will get their own


Snowflake account, and we will need to
share our data with them.
Some departments may not have their own
account, but we still need to come up with
a solution.
We also need to share some aggregate data
with our partners. But they should not be
able to guess individual row values, while
we cannot access their data at all.
Section Summary

• Secure Data Sharing


• Secure Functions and Views
• Reader Accounts
• Private Share (Data Exchange)
• Public Share (Snowflake Marketplace)
• Data Clean Rooms
Secure Data Sharing
• CREATE SHARE …  by producer, always w/ R/O access!
• GRANT USAGE ON DATABASE/SCHEMA ... TO SHARE ...  required!
• GRANT SELECT ON VIEW ... TO SHARE ...  secure views
• GRANT USAGE ON FUNCTION ... TO SHARE ...  secure functions
• ALTER SHARE ... ADD ACCOUNTS = ..., ...  consumer & reader accounts

• CREATE DATABASE ... FROM SHARE ...  by consumer, as proxy db


• GRANT IMPORTED PRIVILEGES ... ON DATABASE ... TO ROLE ...
Inbound/Outbound Data Shares
own connection partner connection
R/O
view1 share1 data db2

consumer secure proxy


fcts/views
Secure Data Share
database database

own mixed
Secure Data Share own table
view
database

Own Snowflake Account (producer)

create share share1;

grant usage on database db1 to share share1;
grant usage on schema db1.sch1 to share share1;
grant select on view view1 to share share1;

alter share share1 add accounts = consumer;

Partner Snowflake Account (consumer)

create database db2
    from share producer.share1;

select * from db2.view1;

grant imported privileges
    on database db2 to role role2;
Cannot Share Already Shared Data!

shared tbls/secure Secure Data


database fcts/views Share
R/O
shared data
data Secure Data secure Secure Data
proxy
Share database fcts/views Share

Snowflake Account
Secure Functions and Views
• Secure UDFs/Store Procedures
• CREATE SECURE FUNCTION …, IS_SECURE field
• Users cannot see code definitions (body, header info, imports…)
• No internal optimizations (may be slower), avoid push-down
• No exposed amount of data scanned, in queries

• Secure Views/Materialized Views


• CREATE SECURE VIEW …, IS_SECURE field
• Users cannot see view definitions (base tables…)
• No user access to underlying data (function calls…)
• No internal optimizations (may be slower)
• No exposed amount of data scanned, in queries
Reader Accounts
• CREATE MANAGED ACCOUNT name TYPE = READER → locator + URL (same region and edition)
• ADMIN_NAME/PASSWORD  username/password
• or Data > Private Sharing > Reader Accounts
• features
• cannot modify existing data or create database objects
• cannot upload new data or unload data through storage integrations
• can create users/roles/warehouses/resource monitors + shared databases (on the inbound shares  cannot see anything else)

shared secure
database views/fcts
Secure
Data proxy
database
Share
Reader proxy
"Account" database

R/O
data
Snowflake Producer Account Snowflake Consumer Account
Private/Public Shares: Listings
• private share = Data Exchange
• w/ other specific consumers (separate/reader accounts)
• no need for approvals
• can share data through secure views/functions, or native apps

• public share = Snowflake Marketplace


• public, for everybody
• needs approval for Provider Profile + each published Listing
• may offer free shares w/ Get or Request (wait for approval)
• free or could monetize
Private Share: Data Exchange

Provider
Studio Shared Database

Snowflake
Provider
Publish
Account
Secure
Provider
Provider Profile
Provider
Profile Listing(s) Data
Profile(s)
Share

Get
Snowflake
Consumer
Proxy Database Account
Public Share: Snowflake Marketplace

Snowflake
Provider
Shared Database Provider
Studio
Account

Submit for Approval

Submit for Approval + Publish Secure


Data Marketplace
Provider
Provider Profile
Provider
Profile (Snowflake-Owned
Listing(s) Share
Profile(s) Account)

Approve Request + Get


Snowflake
Proxy Database Consumer
Account
Data Clean Room: Yao’s Millionaire Problem

Bob (Producer) Alice (Consumer)

$1,250,000 $1,110,000

SELECT …

row access
policy
• Bob has full access to his wealth
• Alice can only run “SELECT …” Bob is richer
Data Clean Room: Design Steps
• The producer creates and attaches a row access policy on its table.

• The policy allows only the producer to get full access to its data.

• The policy may allow a consumer role to run some allowed statements.
• The consumer must run the exact statements allowed by the producer.
• Any other statement run by the consumer will return no data.

• The producer will have no access to any consumer data, at any time.
Data Clean Room: with Secure Data Share
customers associates
name sales fullname profession
Mark Dole $12,000 John Doe Teacher
John Doe $2,300 Emma Brown Dentist
Emma Brown $1,300 George Lou Teacher

Secure SELECT a.profession, AVG(c.sales)


row access FROM customers c JOIN associates a
policies Data ON c.name = a.fullname
Share GROUP BY a.profession

allowed_statements
profession AVG(sales)
statement
Teacher $2,100
SELECT a.profession, AVG(c.sales)…
Dentist $3,200
SELECT COUNT(*)…
Clerk $1,230

Producer Consumer
(Your Snowflake Account) (Partner Snowflake Account)
Client Request

Could you extend your Hierarchical Data


Viewer to render Snowflake metadata as well?
We're interested in better data visualizations
in these particular areas:
* Entity-relationship diagrams
* Users and the role hierarchy
* Task dependencies
* Data lineage
* Database object dependencies
Review of Client Request

Could you extend your Hierarchical Data


Viewer to render Snowflake metadata as well?
We're interested in better data visualizations
in these particular areas:
* Entity-relationship diagrams
* Users and the role hierarchy
* Task dependencies
* Data lineage
* Database object dependencies
Section Summary

• Information Schema vs Account Usage


• Table Constraints
• Entity-Relationship Diagrams
• Users and Roles
• Task Workflows & Task Runs
• Data Lineage
• Object Dependencies
Information Schema vs Account Usage
• <db>.INFORMATION_SCHEMA • SNOWFLAKE.ACCOUNT_USAGE
• local to a database • global, in the Snowflake app
• no dropped objects • dropped objects included
• 7 days..6 months retention time • 1 year retention time
• instant data access • 45 min..3h data latency (2h mostly)

• read-only views & table functions • mirror Info schema views/functions


• extensive metadata info • many historical views
• metering history (cost info)

• READER_ACCOUNT_USAGE
• ORGANIZATION_USAGE
Information Schema Views
• Inventory
• Tables, Columns, Views, Event_Tables, External_Tables
• Databases, Stages, Sequences, Pipes, File_Formats
• Replication_Databases, Replication_Groups
• Programming
• Packages, Functions & Procedures
• Class_Instances, Class_Instance_Functions, Class_Instance_Procedures
• Constraints
• Table_Constraints, Referential_Constraints
• Roles & Privileges
• Enabled_Roles, Applicable_Roles
• Applied_Privileges, Usage_Privileges, Object_Privileges, Table_Privileges
• Metrics
• Table_Storage_Metrics, Load_History
Account Usage Views (Historical Only)
• Query_History, Access_History, Login_History
• Alert_History, Task_History, Serverless_Task_History
• Copy_History, Load_History, Pipe_Usage_History
• Data_Transfer_History
• Metering_History, Metering_Daily_History
• Warehouse_Load_History, Warehouse_Metering_History
• Storage_Usage
• Object_Dependencies
• Tag_References
• Sessions
Table Constraints
• NOT NULL = the only one always enforced.
• PRIMARY KEY (PK) = for referential integrity, as unique table row identifier,
never enforced.
• FOREIGN KEY (FK) = for referential integrity, as propagation of a PK, never
enforced.
• UNIQUE = for unique combination of column values, other than the PK,
never enforced.
• ENFORCED/DEFERRABLE/INITIALLY = never enforced.
• MATCH/UPDATE/DELETE = for FK only, never enforced.
• CLUSTERING KEYS = optional, similar to PKs, but used for better micro-
partitioning, not referential integrity.
ER (Entity-Relationship) Diagrams

• SHOW DATABASES — database names


• SHOW SCHEMAS IN DATABASE <db> — database schemas
• SHOW TABLES IN SCHEMA <db>.<sch> — schema tables
• SHOW COLUMNS IN SCHEMA <db>.<sch> — table columns
• SHOW UNIQUE KEYS IN SCHEMA <db>.<sch> — UNIQUE constraints
• SHOW PRIMARY KEYS IN SCHEMA <db>.<sch> — PK constraints
• SHOW IMPORTED KEYS IN SCHEMA <db>.<sch> — FK constraints
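A minimal sketch that turns these SHOW commands into the nodes and edges of an ER diagram (the database/schema names are illustrative; session is an open Snowpark session):

tables = session.sql("show tables in schema EMPLOYEES.PUBLIC").collect()
fks = session.sql("show imported keys in schema EMPLOYEES.PUBLIC").collect()

nodes = [row["name"] for row in tables]
edges = [(row["fk_table_name"], row["pk_table_name"]) for row in fks]  # child -> parent
print(nodes)   # one box per table
print(edges)   # e.g. [('EMP', 'DEPT')], drawn as FK arrows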
ER (Entity-Relationship) Diagram
Security: The Role Hierarchy
ACCOUNTADMIN ORGADMIN

SECURITYADMIN SYSADMIN DATABASE ROLE

Functional
USERADMIN ADMIN EDITOR GUEST Custom Roles
(~User Groups)

Database
RW_ROLE RO_ROLE Custom Roles

PUBLIC
Security: Roles in Snowsight
Security: Parsing Users and Roles

# users
users = {}
rows = runQuery("show users")
for row in rows:
    users[str(row["name"])] = []

# user roles
for user in users:
    rows = runQuery(f'show grants to user "{user}"')
    for row in rows:
        users[user].append(str(row["role"]))

# roles
roles = {}
rows = runQuery("show roles")
for row in rows:
    roles[str(row["name"])] = []

# role hierarchy
for role in roles:
    rows = runQuery(f'show grants to role "{role}"')
    for row in rows:
        if (str(row["privilege"]) == "USAGE"
                and str(row["granted_on"]) == "ROLE"):
            roles[role].append(str(row["name"]))
Security: All Users and Roles
Security: All Users and Custom Roles
Security: All Custom and System-Defined Roles
Security: All System-Defined Roles
Task Workflows (DAGs)
• CREATE TASK name … AS …
• AFTER …, …  parent task(s)
• CONFIG  key-value pairs accessed by all tasks in a DAG
• ALLOW_OVERLAPPING_EXECUTION  concurrent DAG runs
• SHOW TASKS … IN SCHEMA …

• SYSTEM$TASK_DEPENDENTS_ENABLE(name)
• enable all children before DAG run
• INFORMATION_SCHEMA.TASK_HISTORY(task_name=>name))
• show all task runs, with errors and status: COMPLETED/FAILED/SCHEDULED
• sort DESC by RUN_ID to see most recent runs
Data Lineage: Table-Level Views
select distinct
substr(directSources.value:objectName, len($SCH)+2) as source,
substr(object_modified.value:objectName, len($SCH)+2) as target
from snowflake.account_usage.access_history ah,
lateral flatten(input => objects_modified) object_modified,
lateral flatten(input => object_modified.value:"columns", outer => true) cols,
lateral flatten(input => cols.value:directSources, outer => true) directSources
where directSources.value:objectName like $SCH || '%'
or object_modified.value:objectName like $SCH || '%'
Data Lineage: Column-Level Graph View
OBJECT_DEPENDENCIES View

• REFERENCING → REFERENCED object (pairs)


• object name / id / domain (type)
• database + schema

• DEPENDENCY_TYPE
• BY_NAME = view/UDF… → view/UDF…
• BY_ID = ext stage → storage integration, stream → table/view
• BY_NAME_AND_ID = materialized view → table
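A minimal sketch that pulls child-parent pairs out of this view for the Hierarchical Data Viewer (the database filter is illustrative; ACCOUNT_USAGE latency applies, as noted earlier):

rows = session.sql("""
    select referencing_object_name as child,
           referenced_object_name  as parent,
           dependency_type
    from snowflake.account_usage.object_dependencies
    where referencing_database = 'EMPLOYEES'""").collect()

pairs = [(r["CHILD"], r["PARENT"]) for r in rows]   # child-parent pairs, like EMPNO-MGR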
Object Dependencies: Tabular View
Object Dependencies: Graph View
Task Dependencies: Initial DAG Topology

create task t1 as select SYSTEM$WAIT(1);


create task t2 after t1 as select SYSTEM$WAIT(2);
create task t3 after t1 as select SYSTEM$WAIT(1);
create task t4 after t2, t3 as select SYSTEM$WAIT(1);
create task t5 after t1, t4 as select SYSTEM$WAIT(1);
create task t6 after t5 as select SYSTEM$WAIT(1);
create task t7 after t6 as select SYSTEM$WAIT(1);
create task t8 after t6 as select SYSTEM$WAIT(2);
Task Dependencies: Task-Parent Pairs
show tasks in schema tasks.public;
select t."name" task,
split_part(p.value::string, '.', -1) parent
from table(result_scan(last_query_id())) t,
lateral flatten(input => t."predecessors",
outer => true) p;

digraph {
rankdir="BT";
edge [dir="back"];
T2 -> T1;
T3 -> T1;
T4 -> T2;
T4 -> T3;
T5 -> T1;
T5 -> T4;
T6 -> T5;
T7 -> T6;
T8 -> T6;
}
Task Workflows: Examine DAG Task Runs
select SYSTEM$TASK_DEPENDENTS_ENABLE('tasks.public.t1');
execute task t1;

select name, state, scheduled_time, query_start_time, completed_time


from table(information_schema.task_history())
where run_id = (select top 1 run_id
from table(information_schema.task_history(task_name => 'T1'))
order by query_start_time desc)
order by query_start_time;
Task Workflows: Gantt Chart with DAG Task Run
Client Request

We would like to expose your Hierarchical


Data Viewer application to our internal
partners. We could publish the app in
private, and our partners would subscribe
only if they want to.
The app should be able to process any
child-parent data pairs from their
Snowflake accounts, similar to our
employee-manager hierarchy.
Review of Client Request

We would like to expose your Hierarchical


Data Viewer application to our internal
partners. We could publish the app in
private, and our partners would subscribe
only if they want to.
The app should be able to process any
child-parent data pairs from their
Snowflake accounts, similar to our
employee-manager hierarchy.
Section Summary

• Snowflake Native App Framework


• Application Package
• Application
• Load and Test Apps in Snowflake
• Publish and Consume Native Apps
Native App: Prepare and Upload
• Create and test first as a local Streamlit web app
• Create a script.sql file, to prepare data on the consumer’s side.
• Create a readme.md file, for the first info page of the app.
• Create a manifest.yml file, pointing to the two previous files.

• In Snowflake
• Upload all your app files into a named stage.
• Create an APPLICATION PACKAGE with the files uploaded in the stage.
• Create an APPLICATION for this package.
• Create a STREAMLIT object for the code.
Native App: Test and Deploy
• In Snowsight
• Start your new app in the new Apps tab.
• Connect to Snowflake through get_active_session()
• Continue editing, running, testing the app in Snowsight, as a producer.

• In the Marketplace/Data Exchange → public/private share


• Create [and get approved by Snowflake] a provider profile.
• Publish your app [and get approved in the Marketplace] as a Native App.
Native App: Private Share (Data Exchange)

Compute
DataFile
Data File Application
App Files stage Application
Package Snowflake
uploads Provider
Publish Account

Listing(s) Secure Data


Share

Get Snowflake
Compute
Consumer
Proxy Database Application
Account
GRANT ... TO APPLICATION ...
Native App: Public Share (Snowflake Marketplace)

Compute Snowflake
DataFile
Data File Application
App Files stage Application Provider
Package Account
uploads

Submit for Approval Submit for Approval


+ Publish Marketplace
Secure Data (Snowflake
Provider Provider Profile
Provider Listing(s)
Provider Profile Share Account)
Studio Profile(s)

Approve Request
+ Get Snowflake
Compute
Consumer
Proxy Database Application Account

GRANT ... TO APPLICATION ...


Native App: Provider-Consumer

Provider listing
Provider
AppProfile
Files
Profile

Snowflake Provider Account Snowflake Consumer Account


Client Request

Hi there. I'm the Lead Data Analyst....


People wonder how much we consume with
Snowflake, and on which queries.
The Usage charts are good, but we need
more insights.
Review of Client Request

Hi there. I'm the Lead Data Analyst....


People wonder how much we consume with
Snowflake, and on which queries.
The Usage charts are good, but we need
more insights.
Section Summary

• Admin Usage Charts


• Hidden Cost Traps
• Metering History & Daily History
• Warehouse Metering History
• Query & Load History
• Storage Usage
Admin Usage Charts
Hidden Cost Traps
Admin Dashboard: METERING Views
• METERING_HISTORY
• Credits Used [between …]
• METERING_DAILY_HISTORY
• Credits Billed by Month
• Credit Breakdown by Day with Cloud Services Adjustment
Admin Dashboard: WAREHOUSE_METERING_HISTORY
• Credit Usage by Warehouse [between …]
• Jobs by Warehouse [between …]
• Credit Usage by Warehouse over Time [between …]
• Warehouse Usage Greater than 7 Day Average [between …]
• Compute and Cloud Services by Warehouse [between …]
Admin Dashboard: QUERY_HISTORY
• Total Number of Executed Jobs [since …]
• Execution Time by Query Type
• Top 25 Longest Queries
• Total Execution Time by Repeated Queries
• Average Query Execution Time (By User)
• Cloud Service Credit Utilization by Query Type (Top 10)
Admin Dashboard: Miscellaneous
• LOAD_HISTORY
• Total Row Loaded by Day [between …]
• LOGIN_HISTORY
• Logins by User or Client [between …]
• STORAGE_USAGE
• Average Current Storage (for Data, Stages and Fail-Safe) [since …]
• Data Storage Used over Time
Client Request

A few of our employees would need a crash


course in intermediate SQL.
In particular, could you refresh our memory on:
• Conversion of inline subqueries to CTEs
• GROUP BY queries with grouping sets
• Pivot and unpivot queries
We also need to look at data back in time at any
moment. What choices do we have?
Review of Client Request

A few of our employees would need a crash


course in intermediate SQL.
In particular, could you refresh our memory
on:
• Conversion of inline subqueries to CTEs
• GROUP BY queries with grouping sets
• Pivot and unpivot queries
We also need to look at data back in time at
any moment. What choices do we have?
Section Summary

• Data Analytics
• SELECT Statement
• Subqueries vs CTEs
• Group Queries
• Pivot/Unpivot Queries
• Time Travel and Fail-safe
SELECT Statement
• SELECT …  projection
• DISTINCT …  dedup
• FROM ...  sources (joins)
• PIVOT | UNPIVOT ...  dicing
• GROUP BY [CUBE/ROLLUP/…] …  grouping
• WHERE | HAVING | QUALIFY ...  filters
• ORDER BY ...  sorting
• TOP | LIMIT | OFFSET | FETCH ...  slicing
Subqueries vs CTEs
Subqueries

select ee.deptno,
    sum(ee.sal) as sum_sal,
    (select max(sal)
     from emp
     where deptno = ee.deptno) as max_sal
from emp ee
where ee.empno in
    (select empno
     from emp e
     join dept d on e.deptno = d.deptno
     where d.dname <> 'RESEARCH')
group by ee.deptno
order by ee.deptno;

CTEs

with q1 as
    (select empno
     from emp e
     join dept d on e.deptno = d.deptno
     where d.dname <> 'RESEARCH'),
q2 as
    (select deptno, max(sal) max_sal
     from emp
     group by deptno)
select ee.deptno,
    sum(ee.sal) as sum_sal,
    max(q2.max_sal) as max_sal
from emp ee
join q2 on q2.deptno = ee.deptno
join q1 on q1.empno = ee.empno
group by ee.deptno
order by ee.deptno;
GROUP BY with WHERE and HAVING
select deptno,
to_char(year(hiredate)) as year,
sum(sal) sals
from emp
where year > '1980'
group by deptno, year -- all
having sum(sal) > 5000
order by deptno, year;
GROUP BY with QUALIFY
select deptno,
row_number() over (order by deptno) as rn,
to_char(year(hiredate)) as year,
sum(sal) sals
from emp
where year > '1980'
group by deptno, year
having sum(sal) > 5000
qualify rn > 1
order by deptno, year;
GROUP BY with GROUPING SETS
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by grouping sets (deptno, year)
having sum(sal) > 5000
order by deptno, year;
GROUP BY with ROLLUP
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by rollup (deptno, year)
having sum(sal) > 5000
order by deptno, year;
GROUP BY with CUBE
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by cube (deptno, year)
having sum(sal) > 5000
order by deptno, year;
PIVOT Query
GROUP BY Query

select dname,
    to_char(year(hiredate)) as year,
    sum(sal) as sals
from emp
join dept on emp.deptno = dept.deptno
where year >= '1982'
group by dname, year
order by dname, year;

PIVOT Query

with q as
    (select dname,
        year(hiredate) as year,
        sum(sal) as sals
     from emp
     join dept on emp.deptno = dept.deptno
     where year >= 1982
     group by dname, year
     order by dname, year)

select * from q
    pivot (sum(sals)
        for year in (1982, 1983)) as p;
UNPIVOT Query
PIVOT Query

with q as
    (select dname,
        year(hiredate) as year,
        sum(sal) as sals
     from emp
     join dept on emp.deptno = dept.deptno
     where year >= 1982
     group by dname, year
     order by dname, year)

select * from q
    pivot (sum(sals)
        for year in (1982, 1983)) as p;

UNPIVOT Query (→ back to GROUP BY)

with q as
    (select …),

p as
    (select * from q
     pivot (sum(sals)
        for year in (1982, 1983)) as p)

select * from p
    unpivot (sals
        for year in ("1982", "1983"));
Time Travel and Fail-safe
• Time Travel
• = for DATABASE|SCHEMA|TABLE
• DATA_RETENTION_TIME_IN_DAYS  in CREATE … / ALTER … SET ...
• set to zero to disable
• 1 for transient/temporary tables, or permanent tables in Standard Edition
• 1..90 days for permanent tables in Enterprise Edition

• Fail-safe
• = additional days to restore tables (no SQL, call Snowflake support!)
• 7 days for permanent tables  regardless of the edition + cannot disable!
• 0 days for transient tables
Time Travel
• Looking Back in Time
• SELECT ... FROM … AT(TIMESTAMP => <timestamp>) …
• SELECT ... FROM … AT(OFFSET => <time_diff>) …
• SELECT ... FROM … AT(STREAM => '<name>') …
• SELECT ... FROM … AT|BEFORE(STATEMENT => <id>) …
• In Zero-Copy Cloning
• CREATE DATABASE | SCHEMA | TABLE <t> CLONE <s> AT|BEFORE(…)
• Restoring Dropped Objects
• DROP DATABASE | SCHEMA | TABLE …
• SHOW DATABASES | SCHEMAS | TABLES HISTORY […]  dropped
• UNDROP DATABASE | SCHEMA | TABLE …  restore dropped obj
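A minimal sketch of the three flavors above, run through Snowpark on the EMP table (the clone name is illustrative):

# look one hour back, clone that state, then drop and restore the clone
session.sql("select * from emp at(offset => -3600)").collect()
session.sql("create table emp_backup clone emp at(offset => -3600)").collect()
session.sql("drop table emp_backup").collect()
session.sql("undrop table emp_backup").collect()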
Client Request

Our employees know basic SQL, but


advanced analytics is not their area of
expertise.
Could you quickly walk us through the main
categories of window functions and a few
statistical SQL extensions, with practical
demos on our small live database?
Review of Client Request

Our employees know basic SQL, but


advanced analytics is not their area of
expertise.
Could you quickly walk us through the main
categories of window functions and a few
statistical SQL extensions, with practical
demos on our small live database?
Section Summary

• Window Functions
• Ranking Functions
• Offset Functions
• Statistical Functions
• Regression Functions
Window Functions: OVER Clause
• PARTITION BY
• ORDER BY

• ROWS/RANGE  window framing (# rows / values)


Window Frame
select ename, hiredate, sal,
round(avg(sal) over (order by hiredate
rows between 1 preceding and 1 following), 2) as avg
from emp
order by hiredate;
Rank Functions

• ROW_NUMBER()  1, 2, 3, 4, 5, 6, 7 …

• RANK()  1, 1, 3, 4, 4, 4, 7 …
• DENSE_RANK()  1, 1, 2, 3, 3, 3, 4, …
• PERCENT_RANK()  0%, 0%, 23%, 48%, …

• NTILE(n)  1, 1, 1, 2, 2, 2, 3, 3 …
• CUME_DIST()
Rank Functions
select deptno, ename,
row_number() over (order by deptno) row_number,
rank() over (order by deptno) rank,
dense_rank() over (order by deptno) dense_rank,
round(percent_rank() over (order by deptno) * 100) || '%' percent_rank
from emp
order by deptno;
Offset Functions

• LEAD(expr, offset=1)
• LAG(expr, offset=1)

• FIRST_VALUE(expr)
• LAST_VALUE(expr)
• NTH_VALUE(expr,offset)
• RATIO_TO_REPORT(expr)
Offset Functions
select ename,
lead(sal, 1) over (order by ename) lead,
sal,
lag(sal, 1) over (order by ename) lag,
first_value(sal) over (order by ename) first,
last_value(sal) over (order by ename) last,
nth_value(sal, 1) over (order by ename) nth
from emp
order by ename;
Statistical Functions
• VAR_POP/SAMP
• VARIANCE
• STDDEV_POP/SAMP
• STDDEV
• COVAR_POP/SAMP
• CORR

• SKEW(expr)
• KURTOSIS(expr)
Skew & Kurtosis

select count(*), avg(sal), median(sal), skew(sal), kurtosis(sal) from emp;

Skew (negative/positive) Kurtosis


Distributions
select row_number() over(order by sal) rn, sal
from emp
order by sal;

select width_bucket(SAL, 800, 5000, 10) as sals


from emp
order by sal;
Regression Functions
• REGR_SLOPE
• REGR_INTERCEPT
• REGR_R2

• REGR_COUNT
• REGR_SXX/SYY/SXY
• REGR_AVGX/AVGY
Linear Regression (y = x * SLOPE + INTERCEPT)
select REGR_SLOPE(sals, year), REGR_INTERCEPT(sals, year), REGR_R2(sals, year)
from (select year(hiredate) as year, sal as sals from emp order by year);
Client Request

In practice, some queries will always be


slower.
Could you give us some pointers where to
look at to eventually improve the query
performance?
How can we estimate if the storage is
appropriate for a query? If not, will
clustering truly improve the performance
on some specific queries?
Review of Client Request

In practice, some queries will always be


slower.
Could you give us some pointers where to
look at to eventually improve the query
performance?
How can we estimate if the storage is
appropriate for a query? If not, will
clustering truly improve the performance
on some specific queries?
Section Summary

• Query Performance Optimization


• Query History & Load History
• Query Result Caching
• Query Execution Plan (EXPLAIN)
• Query Profile
• Clustering Keys
• Enhanced Query Profile & Analysis
Query Performance Optimization Methods
• Check Query History and Load History  in Web UI / Admin Dashboard
• Check Execution Times  use Query Hashes
• Top Longest Queries Chart
• Long Running Repeated Queries Chart
• Check Query Profile & Query Plan (EXPLAIN Statement)
• Avoid Spilling from RAM to Local (SSD) or Remote (S3) Storage
• Look for “Exploding” Joins
• Understand Caching + Use Materialized Views
• Need Clustering? Understand Micro-Partitioning
• Inefficient Pruning  check Partition Scanned/Total in TableScan
• Optimize Warehouse Performance
• Increase Warehouse Size
• Query Acceleration
• Check Queued Queries
Query History
• QUERY_HISTORY
• INFORMATION_SCHEMA - table function, for the past 7 days, limited, by session/user/wh
• ACCOUNT_USAGE - view, for the past year, with latency
• query types
• most frequently executed queries → count(*), check task schedules
• most time-consuming queries, overall (in total) → is it ok to exec so frequently?
• slowest queries, with longest execution time (in average) → may need optimization
• heaviest queries, with most scanned data (in average)
• information
• which KPI am I looking at? → order by + row_number() sort
• is my query in this list? on which place? → find by query text + use row_number()
• filters
• only successfully executed, by period (last mo, year) + limit to max 10,000
• by query type (SELECT, CALL...), warehouse size, query tag... → optional
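A minimal sketch against ACCOUNT_USAGE (one year of history, with latency): the slowest successful SELECTs of the last month, averaged over repeated runs:

rows = session.sql("""
    select query_text, warehouse_size,
           avg(total_elapsed_time) as avg_ms, count(*) as runs
    from snowflake.account_usage.query_history
    where start_time >= dateadd(month, -1, current_timestamp())
      and execution_status = 'SUCCESS'
      and query_type = 'SELECT'
    group by query_text, warehouse_size
    order by avg_ms desc
    limit 10""").collect()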
Enhanced Query Analysis
Query Result Caching
user query

Services
past 24h
HOT Metadata Result Cache max 31 days

Compute
local SSD
WARM Warehouse Local Disk RAM

Storage
cloud blob
COLD Remote Disk storage

Snowflake
Query Result Caching
• HOT
• result cache on
• cache hit → SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
• WARM
• result cache off or no cache hit
• warehouse up and result still in the warehouse (in RAM or local disks)
• get query result from warehouse cache (BYTES_SCANNED > 0)
• COLD
• no warehouse cache (warehouse suspended, not up)
• result cache off (ALTER SESSION SET USE_CACHED_RESULT = false) or no
cache hit
• must run query
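A minimal sketch to observe the HOT case: the second run of an identical query should be served from the result cache, showing BYTES_SCANNED = 0 in the query history (session is an open Snowpark session):

q = "select dname, sum(sal) from emp join dept using(deptno) group by dname"
session.sql("alter session set use_cached_result = true").collect()
session.sql(q).collect()                 # first run: the warehouse scans data
session.sql(q).collect()                 # second run: served from the result cache
qid = session.sql("select last_query_id()").collect()[0][0]

session.sql(f"""
    select query_id, bytes_scanned
    from table(information_schema.query_history_by_session())
    where query_id = '{qid}'""").collect()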
Query Profile
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Enhanced Query Profile
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Enhanced Query Profile
“Exploding” Joins Problem
Query Execution Plan: EXPLAIN
• EXPLAIN [USING TABULAR] <query>
• ~EXPLAIN_JSON function (JSON → tabular output)
• EXPLAIN USING JSON <query>
• ~SYSTEM$EXPLAIN_PLAN_JSON function (JSON output)
• EXPLAIN USING TEXT <query>
• ~SYSTEM$EXPLAIN_JSON_TO_TEXT function (JSON → TEXT output)
Query Execution Plan
explain
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Clustering
• Clustering Keys
• SYSTEM$CLUSTERING_DEPTH(table, (col1, …))
• SYSTEM$CLUSTERING_INFORMATION(table, (col1, …))

• micro-partitions = table data storage segments, with min/max values


on specific (groups of) column values, that can drastically improve the
query search.
• pruning = possibility of scanning fewer micro-partitions for a query.
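A minimal sketch: define a clustering key on one of your own large tables, then check the result with the functions above (table and column names are borrowed from the sample schema for illustration only):

session.sql("alter table store_sales cluster by (ss_sold_date_sk)").collect()
session.sql("""select system$clustering_information(
                   'store_sales', '(ss_sold_date_sk)')""").collect()
session.sql("select system$clustering_depth('store_sales', '(ss_sold_date_sk)')").collect()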
Clustering: With Full Clustering Keys
SQL Code → JSON Result
select system$clustering_information(
'snowflake_sample_data.tpcds_sf100tcl.store_sales'); -- ~300B rows

{
"cluster_by_keys" : "LINEAR(ss_sold_date_sk, ss_item_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 9,  so-so (higher is better)
"average_overlaps" : 3.4849,  bad (many overlaps)
"average_depth" : 2.7497,  system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 3,
"00002" : 180604,  so-so (most on depth 2-3)
"00003" : 540900,
"00004" : 0,

"00016" : 0  good (none so deep)
},
"clustering_errors" : [ ]
}
Clustering: With Partial Clustering Keys
SQL Code → JSON Result
select system$clustering_information(
'snowflake_sample_data.tpcds_sf100tcl.store_sales', -- ~300B rows
'(ss_sold_date_sk)');

{
"cluster_by_keys" : "LINEAR(ss_sold_date_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 719687,  good! (high)
"average_overlaps" : 0.0132,  good! (low)
"average_depth" : 1.0076,  system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 719203,  good! (most w/ depth 1)
"00002" : 366,
"00003" : 1197,
"00004" : 624,

"00032" : 22  so-so (22 partitions on depth 32)
},
"clustering_errors" : [ ]
}
Clustering: Worst Case Scenario
SQL Code → JSON Result
select system$clustering_information(
'snowflake_sample_data.tpcds_sf100tcl.store_sales', -- ~300B rows
'(ss_store_sk)');

{
"cluster_by_keys" : "LINEAR(ss_store_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 0,  very bad (no constant partition)
"average_overlaps" : 721506.0,  very bad (almost all overlap)
"average_depth" : 721507.0,  system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,

"1048576" : 721507  very bad (all w/ full scan)
},
"clustering_errors" : [ ]
}
Programming
in Snowflake
Masterclass
2024 Hands-On!

You might also like