Programming in Snowflake Masterclass
2024 Hands-On!
Programming in Snowflake
• Section Introduction
• Client Request
• Review of Client Request
• Section Summary
• Section Content
• Hands-On Programming Experiments
• Review Checkpoint Slideshows
• Check Your Knowledge Quiz
Architecture Diagram: Private Data Sharing
[Diagram: the provider account publishes one or more listings (provider profile + secure share of the shared database) through Provider Studio; the consumer account gets the share as a proxy database]
Services
• Authentication: basic, key pair, OAuth, SSO, MFA
• Infrastructure Management
• Metadata Management
• Query Parsing and Optimization
• Access Control: users, roles, privileges
• Serverless Tasks
Compute
• Virtual Warehouses: ~your car engine
• Query Processing: single/multi-user, parallel processing
• Use bigger warehouse for a more complex query
• Use multi-cluster warehouses (Enterprise+ only) when multiple users run concurrent queries
• Use query cache results when possible
Storage
• Back-End Data Storage: private to Snowflake
• Time Travel and Fail-safe storage
• Internal Stages: named, user, table stages
• Local and Remote Warehouse Storage
• Query Result Data Caches
• Data transfer in is free, but transfer out costs money!
Large Virtual Warehouse
[Diagram: a 4X-Large virtual warehouse (8 x 16 = 128 nodes) next to an X-Small virtual warehouse (1 node), both reading from the same storage layer]
Large Multi-Cluster Virtual Warehouse
[Diagram: a multi-cluster warehouse with 3 clusters x 3X-Large (3 x 64 = 192 nodes) next to an X-Small virtual warehouse (1 node)]
Snowflake Editions & Pricing
Best Practices: Compute
• You are charged for the time a virtual warehouse is up, not per executed query!
• Keep an X-Small warehouse whenever possible
• Auto-Suspend after 1 minute (the minimum!)
• Keep the Standard Edition
• Prefer the Economy mode (scaling policy)
• Avoid querying the Account Usage schema too often
Best Practices: Storage
• Do not duplicate data, try zero-copy clone or data share
• Do not store large amounts of data
• No time travel or fail-safe unless necessary
Web UI
[Diagram: SQL statements typed in a SQL Worksheet are executed by the SQL engine (Compute) against the Data (Storage)]
Query Context
1. Role
2. Warehouse
3. Database
4. Schema
Query Context: Built-In Functions
• CURRENT_ROLE/WAREHOUSE/DATABASE/SCHEMA()
• CURRENT_USER/SESSION/STATEMENT/TRANSACTION()
• CURRENT_DATE/TIME/TIMESTAMP()
• CURRENT_ACCOUNT/CLIENT/VERSION/REGION/IP_ADDRESS()
• CURRENT_ACCOUNT/ORGANIZATION_NAME()
• CURRENT_AVAILABLE/SECONDARY_ROLES()
• IS_ROLE_IN_SESSION()
• LAST_TRANSACTION/QUERY_ID()
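Most of these can be combined in one statement to inspect the current context; a minimal sketch (whatever role/warehouse/database happens to be in use):

select current_user(), current_role(), current_warehouse(),
    current_database(), current_schema(),
    current_account(), current_region(), current_version();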
Stages: Uploading and Unloading Data
$ aws s3 get/put
Local Computer
Table
COPY INTO table FROM location (upload)
COPY INTO location FROM table (unload)
Snowflake
Stages: Examples
• LIST @~; list files from a stage (here @~ = the current user stage)
• REMOVE ... remove files from a stage
LIST @mystage_s3;
Schema Inference
• INFER_SCHEMA
• LOCATION => '@mystage' internal/external named stage
• FILES => 'emp.csv', ... 1+ uploaded files
• FILE_FORMAT => 'myfmt' CREATE FILE FORMAT ... PARSE_HEADER=TRUE
CREATE STAGE … FILE_FORMAT=(FORMAT_NAME='...') file format attached to the stage
CREATE TABLE … STAGE_FILE_FORMAT=(FORMAT_NAME='...') file format attached to the table stage
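A minimal sketch of schema inference, reusing the stage/file/format names from the bullets above (assumed to already exist):

create file format myfmt type = csv parse_header = true;
select *
from table(infer_schema(
    location => '@mystage',
    files => 'emp.csv',
    file_format => 'myfmt'));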
JSON Objects Cheat Sheet
select *
from json;
Flattening Arrays: One Level
select j.name,
m.value, m.value:name::string, m.value:years
from json j,
table(flatten(input => j.v, path => 'managers')) m;
select j.name,
m.value, m.value:name::string, m.value:years
from json j,
lateral flatten(input => j.v, path => 'managers') m;
Flattening Arrays: Two Levels
select name,
m.value, m.value:name::string, m.value:years,
y.value
from json j,
lateral flatten(input => j.v, outer => TRUE, path => 'managers') m,
lateral flatten(input => m.value, path => 'years') y;
Table Functions
• Snowflake Tutorials
• Snowflake Sample Databases
• Sample Data Extraction
• Synthetic Data Generation
• External Data Generation
• Sequences
• Identity Columns
Snowflake Tutorials
Snowflake Sample Databases: TPC-H
• TPCH_SF1 = ~millions of rows
• Tutorial 1: Sample queries on TPC-H data
• TPCH_SF1000 = ~billions of rows
Snowflake Sample Databases: TPC-DS
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
SAMPLE (1000000 ROWS);
Synthetic Data Generation
• FROM TABLE(GENERATOR([rowcount], [timelimit])) generates rows (w/o columns)
• Random (but seed-deterministic) values
  • RANDOM/RANDSTR random 64-bit integer / random string with given length
  • UUID_STRING random UUID
• Controlled distributions and unique ID values
  • NORMAL/UNIFORM/ZIPF number w/ a specific distribution
  • SEQ1/SEQ2/SEQ4/SEQ8 sequence of integers
GENERATOR
select
randstr(uniform(10, 30, random(1)), uniform(1, 100000, random(1)))::varchar(30) as name,
randstr(uniform(10, 30, random(2)), uniform(1, 10000, random(2)))::varchar(30) as city,
randstr(10, uniform(1, 100000, random(3)))::varchar(10) as license_plate,
randstr(uniform(10, 30, random(4)), uniform(1, 200000, random(4)))::varchar(30) as email
from table(generator(rowcount => 1000));
Faker Data Generation Example
Python Code
from faker import Faker
import pandas as pd

fake = Faker()
output = [{
"name": fake.name(),
"address": fake.address(),
"city": fake.city(),
"state": fake.state(),
"email": fake.email()
} for _ in range(1000)]
df = pd.DataFrame(output)
print(df)
Unique Identifiers
• Sequences
• Identity Columns
• UUIDs
Sequences
• CREATE SEQUENCE …
• START start INCREMENT incr default (1, 1)
• ORDER | NOORDER default ORDER (ASC)
• CREATE TABLE …
• AUTOINCREMENT | IDENTITY auto-gen number (~sequence number)
• START start INCREMENT incr alt. to (start, incr), default (1, 1)
• ORDER | NOORDER default ORDER (ASC)
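A short sketch contrasting the two approaches (t1/t2 are hypothetical table names):

create or replace sequence seq1 start 1 increment 1 order;
create or replace table t1 (id int default seq1.nextval, name string);
create or replace table t2 (id int identity(1, 1) order, name string);

insert into t1 (name) values ('a'), ('b');
select * from t1;   -- id = 1, 2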
Client Request
1. Multi-Level JOINs
2. CONNECT BY
3. Recursive CTEs
4. Recursive Views
Hierarchical Data: (1) Multi-Level JOINs
select coalesce(m3.employee || '.', '')
|| coalesce(m2.employee || '.', '')
|| coalesce(m1.employee || '.', '')
|| e.employee as path,
regexp_count(path, '\\.') as level,
repeat(' ', level) || e.employee as name
from employee_manager e
left join employee_manager m1 on e.manager = m1.employee
left join employee_manager m2 on m1.manager = m2.employee
left join employee_manager m3 on m2.manager = m3.employee
order by path;
Hierarchical Data: (2) CONNECT BY
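The original CONNECT BY example is not preserved in this extract; a minimal sketch over the same employee_manager(employee, manager) table used above might look like this (assuming top managers have a NULL manager):

select sys_connect_by_path(employee, '.') as path,
    level as lvl,
    repeat('  ', level - 1) || employee as name
from employee_manager
start with manager is null
connect by manager = prior employee
order by path;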
call proc1(22.5);
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
[Class diagram (JavaScript stored procedure API): the snowflake object (createStatement, execute, log, addEvent, setSpanAttribute); the Statement, ResultSet, and SfDate classes, with members such as getSqlText, getColumnName/Type/Scale, getColumnSqlType, isColumnNullable/Text/…, getColumnCount, getColumnRowsAffected, getNumRowsInserted/Updated/Deleted, getRowCount, getQueryId, next, getColumnValue[AsString], getNumColumnRowsAffected, getEpochSeconds, getNanoSeconds, getTimezone, getScale]
UDFs (User-Defined Functions)
select fct1(22.5);
select * from
table(fctt1('abc'));
• Object Identifiers
• DDL Statements
• Zero-Copy Cloning
• DML Statements
• Snowflake Scripting
• SQL vs SQL Scripting
• Cursor and ResultSet
• Transactions
Object Identifiers
• identifiers
• NAME/Name/name → NAME
• "Name" → Name
• "This is a name" → This is a name
• IDENTIFIER/TABLE functions
• IDENTIFIER/TABLE('MY_TABLE') MY_TABLE/My_Table/my_table
• IDENTIFIER/TABLE('"my_table"') "my_table"
• IDENTIFIER/TABLE($table_name) SET table_name = 'my_table';
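For example, a table name kept in a session variable can be resolved at runtime (my_table is a hypothetical name):

set table_name = 'my_table';
select * from identifier($table_name);
select * from table($table_name);   -- equivalent, using the TABLE keyword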
• v:myobj.prop1.prop2['name2'].array1[2]::string
• v: = table column name or alias
• myobj = top JSON object
• myobj.prop1 = myobj['prop1'], prop2.name2 = prop2['name2']
• array1[2] = 3rd element in JSON array
• ::string = cast conversion
Variables
• session variables = global variables
• SET var = ..., UNSET var, $var SHOW VARIABLES
• SnowSQL variables = extensions, w/ var substitution
• local variables = in blocks (Snowflake Scripting / stored procs / functions)
• var1 [type1] [DEFAULT expr1]
• LET var1 := [type1] [DEFAULT / := expr1]
• SELECT col1 INTO :var1
• bind variables = for parameterized queries, w/ runtime param values
• SELECT (:1), (:2), TO_TIMESTAMP((?), :2)
• environment variables = for Bash (Linux/macOS) or PowerShell (Windows)
• SET/EXPORT name=value → $name or %name%
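A small sketch mixing a session variable with a local variable in an anonymous Snowflake Scripting block (emp is the course's sample table; the threshold is arbitrary; in SnowSQL/Classic Console wrap the block in $$ … $$):

set min_sal = 2000;
select $min_sal;

declare
    total float;
begin
    select sum(sal) into :total from emp where sal >= $min_sal;
    return total;
end;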
Structured Query Language (SQL)
• Data Definition Language (DDL)
  • CREATE | ALTER | DROP
  • COMMENT | USE
  • SHOW | DESCRIBE
• Data Manipulation Language (DML)
  • INSERT | UPDATE
  • DELETE | TRUNCATE
  • MERGE | EXPLAIN
• Transaction Control Language (TCL)
  • BEGIN TRANSACTION
  • COMMIT | ROLLBACK
  • DESCRIBE TRANSACTION
  • SHOW TRANSACTIONS | LOCKS
• Data Control Language (DCL)
  • USER | ROLE
  • GRANT | REVOKE
Zero-Copy Cloning
• A clone shares all initial data with its source → referenced storage
• Any further change is stored separately → owned storage
• Can clone from a specific point back in time → time travel
[Diagram: <source> and <target> objects pointing to the same referenced storage]
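A minimal cloning sketch (emp_clone is a hypothetical name; the offset goes one hour back):

create table emp_clone clone emp at(offset => -3600);
update emp_clone set sal = sal * 1.1;   -- changed micro-partitions now use the clone's own storage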
Snowflake Scripting
• SQL vs Scripting - ~PL/SQL (Oracle), Transact-SQL (Microsoft SQL Server)
• Procedural language (SQL is declarative), as SQL extension, since Feb 2022
• used only for stored procedures and scripting blocks (UDFs/UDTFs use plain SQL)
• Temporary bug in SnowSQL and Classic Console → requires $$ .. $$
[Diagram: a single SQL statement and a SQL script are both executed by the SQL engine (Compute) over the Data]
Snowflake Scripting: Variables
• Optional top DECLARE block, for SQL scalar/RESULTSET/CURSOR/EXCEPTION
• Optional data type + DEFAULT value
• LET for vars not DECLAREd, w/ optional data type + := initialization
• Can be used in RETURN
• Reference w/ : prefix only when in inline SQL
[declare]
var1 FLOAT DEFAULT 2.3;
res1 RESULTSET DEFAULT (SELECT ...);
cur1 CURSOR FOR SELECT (?) FROM ...
exc1 EXCEPTION (-202, 'Raised');
begin
var1 := 3;
LET var2 := var1 + 4;
LET cur2 CURSOR FOR SELECT :var1, ...;
LET res2 RESULTSET := (SELECT ...);
RETURN var1;
end;
Snowflake Scripting: Built-In Variables
• for last executed DML statement
• SQLROWCOUNT = rows affected by last statement, ~getNumRowsAffected()
• SQLFOUND = true if last statement affected 1+ rows
• SQLNOTFOUND = true if last statement affected 0 rows
• SQLID = ID of the last executed query (of any kind)
• exception classes check in EXCEPTION block, with WHEN ...
• STATEMENT_ERROR = execution error
• EXPRESSION_ERROR = expression-related error
• exception info to use in EXCEPTION block
• SQLCODE = exception_number
• SQLERRM = exception_message
• SQLSTATE = from ANSI SQL standard
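A small sketch using these variables after a DML statement (emp/deptno come from the course's sample data):

execute immediate $$
begin
    update emp set sal = sal + 100 where deptno = 10;
    if (sqlfound) then
        return 'Updated ' || sqlrowcount || ' rows';
    elseif (sqlnotfound) then
        return 'No rows updated';
    end if;
end;
$$;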
Snowflake Scripting: Branching
• IF ... THEN ... [ELSEIF ... THEN ... [...]] ELSE ... END IF
• CASE cond WHEN val1 THEN ... [...] ELSE ... END simple
• CASE WHEN cond1 THEN ... [...] ELSE ... END searched
BEGIN
LET err := true;
IF (err) THEN
RAISE exc1;
END IF;
EXCEPTION
WHEN STATEMENT_ERROR THEN ...
WHEN exc1 THEN ...
WHEN OTHER THEN
RETURN SQLCODE;
RAISE;
END;
Transaction Control Language (TCL)
• BEGIN TRANSACTION
• COMMIT | ROLLBACK
• DESCRIBE TRANSACTION
• SHOW TRANSACTIONS
• SHOW LOCKS
Transactions
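The original example is not preserved in this extract; a minimal sketch of an explicit transaction (the inserted row is arbitrary):

begin transaction;
insert into emp (empno, ename, sal, deptno) values (9999, 'TEST', 1000, 10);
show transactions;
rollback;   -- or commit, to persist the change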
Change Data Capture (CDC)
[Diagram: (2) incremental updates: a stream of CDC updates flows from a remote OLTP database (source) into the Snowflake Data Warehouse (target)]
• AUTO_INGEST on
• automatic data loading
• from external stages only (S3, Azure Storage, Google Storage)
• AUTO_INGEST off
• no automatic data loading
• from external/internal named/table stages (not user stages!)
• only with Snowpipe REST API endpoint calls
Snowpipe on S3
• Create a continuous loading pipe based on the external S3 stage created before
• AUTO_INGEST = True
• COPY INTO new table FROM external stage
• Add an event notification for the S3 bucket/folder
• for "All object create events"
• on an SQS queue, w/ the ARN copied from SHOW PIPES for the current pipe
• Upload some CSV files in the folder
• check pipe status: select system$pipe_status('mypipe_s3');
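A sketch of such a pipe, reusing the mystage_s3 stage name from before (mytable is hypothetical):

create or replace pipe mypipe_s3
    auto_ingest = true
    as copy into mytable from @mystage_s3;

show pipes;   -- copy the notification_channel ARN into the S3 event notification
select system$pipe_status('mypipe_s3');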
[Diagram: Snowpipe with AUTO_INGEST: (2) upload data files to the external S3 stage (@stage), (5) an event notification triggers the pipe (Compute), (6) the pipe runs COPY INTO table]
[Diagram: Snowpipe with the REST API: (2) upload data files to the external S3 stage (@stage), (3) queue the file on the SQS queue, (4) notify a Lambda function, (5) call a stored procedure, (6) call the pipe (Compute), (7) the pipe runs COPY INTO table]
KING → BLAKE
BLAKE → MARTIN
BLAKE → JAMES
KING → JONES
JONES → FORD
digraph {
BLAKE -> KING;
MARTIN -> BLAKE;
JAMES -> BLAKE;
JONES -> KING;
FORD -> JONES;
}
GraphViz DOT Notation: Nodes & Edges
digraph {
n7839 [label="KING"];
n7698 [label="BLAKE"];
n7566 [label="JONES"];
n7902 [label="FORD"];
n7654 [label="MARTIN"];
n7900 [label="JAMES"];
n7839 [label="KING"
    shape="rect" color="red"];
n7698 [label="BLAKE"];
n7566 [label="JONES"];
n7902 [label="FORD"];
n7654 [label="MARTIN"];
n7900 [label="JAMES"];
• Introduction to Streamlit
• Layout Components
• Interactive Widgets
• State and Callbacks
• Data Cache
• Multi-Page Applications
• Test as Local Web App
• Deploy and Share as Web App
Introduction to Streamlit
• History
• Bought by Snowflake in 2022 for $800M
• Integrated with Snowpark: Streamlit in Snowflake + Native Apps
• Features
• RAD framework for data science experiments (~VB, Access, Python at the app level)
• Connect to all sorts of data sources (Snowflake etc)
• Instant rendering as charts or HTML content, using many third-party components
• Architecture
• Minimalistic layout components and simple input controls (single event per control!)
• Full page rerun after any input control interaction (tricky at the beginning!)
• Cache between reruns with session state and control callbacks (added recently)
• Data and resource/object cache (to avoid data reloads between page reruns)
• Development
• Great for prototyping and proof-of-concept simple apps (not like heavy React apps!)
• Support for single and multi-page applications
• Test as local web app (not standalone!)
• Share and deploy as remote web app to Streamlit Cloud (for 100% free!)
Layout Components
• st.sidebar collapsible left sidebar
• st.tabs tab control
with st.expander("Expanded"):
    st.write("This is expanded")

exps = st.expander(
    "Collapsed", expanded=False)
exps.write("This is collapsed")

st.sidebar.selectbox(
    "Select Box:", ["S", "M"])
cols = st.columns(3)
cols[0].write("Column 1")
cols[1].write("Column 2")
cols[2].write("Column 3")
with st.empty():
st.write("Replace this...")
st.write("...by this one")
Display Text
• st.write('Most objects'), st.write(['st', 'is <', 3])
• st.text('Fixed width text')
• st.title/header/subheader/caption('My title')
• st.divider ~HR
Interactive Widgets
• st.text/number/date/time_input("First name") on_change() event
• st.text_area("Text to translate") on_change() event
app.py (front-end)
import streamlit as st
st.multiselect("Select:",
["S", "M", "L"], default=["S", "M"])
st.selectbox( "Choose:",
["S", "M", "L"])
st.select_slider("Choose:",
["S", "M", "L"], value="M")
st.radio("Choose:",
    ["S", "M", "L"], index=2)
[Diagram: the front-end (app.py) sends widget events to the server (back-end): user clicks on S → …<API calls>]
Buttons
• st.button("Click me") buttons on single line, on_click() event
• st.toggle("Enable") on_change() event
• st.checkbox("I agree") on_change() event
• st.radio("Pick one", ["cats", "dogs"]) on_change() event
callbacks vs full page re-run:

def on_button_click(msg):
    st.write(f"{msg}, {st.session_state.name}")
cache: st.session_state = { … }

Data Cache
app.py
import streamlit as st
import datetime

@st.cache_data   # or @st.cache_resource
def now():
    return datetime.datetime.now()

if st.button("Show Current Time"):
    st.write(now())                     # always the cached value: 2023-09-27 08:25:44.934312
    st.write(datetime.datetime.now())   # 2023-09-27 08:26:44.961236
Multi-Page Applications (obsolete now)
app.py
import streamlit as st
funcs = {
    "-": main,
    "Page One": page_one,
    "Page Two": page_two,
    "About": about }

name = st.sidebar.selectbox(
    "Select Page:", funcs.keys())
funcs[name]()
Multi-Page Applications
1_Page_One.py
import streamlit as st
st.set_page_config(
page_title="Plotting Demo",
page_icon=" ")
Other Controls
• display progress/status
• with st.spinner(text='In progress'): …
• bar = st.progress(50) … bar.progress(100)
• with st.status('Authenticating...') as s: … s.update(label='Response')
• st.error/warning/info/success/toast('Error message')
• st.exception(e)
• st.balloons/snow()
• media
• st.image show image/list of images
• st.audio/video show audio/video player
• st.camera_input("Take a picture") on_change() event
• chat
• with st.chat_message("user"): … response to a chat message
• st.chat_input("Say something") prompt chat widget, on_submit() event
Data Rendering
• st.dataframe w/ dataframes from Pandas, PyArrow, Snowpark, PySpark
• st.table show static table
• st.data_editor show widget, on_change() event
• st.json show pretty-printed JSON string
Charts
• st.area/bar/line/scatter_chart(df)
• st.map(df) geo map w/ scatterplot
• st.graphviz_chart(fig)
• st.altair_chart(chart)
• st.bokeh_chart(fig)
• st.plotly_chart(fig) interactive Plotly chart
• st.pydeck_chart(chart) free maps
• st.pyplot(fig) w/ matplotlib
• st.vega_lite_chart(df)
• st.column_config insert sparklines!
• st.metric show a performance metric number in large type
Deploy your Web App to Streamlit Cloud
• publish your app into GitHub
• Streamlit will create access keys! → need authorization
• your app will automatically refresh on each new GitHub push
• can later add "Open in Streamlit" button in GitHub
• deploying in Streamlit Cloud → for free, if public!
• sign-up with your Google email at share.streamlit.io
• make sure your app can be shared publicly → see limits! use subdomain
• replace any app\myapp.py to app/myapp.py, if from subfolder (\ → /)
• prefix any relative file names with os.path.dirname(__file__) + '/'
• make sure requirements.txt is updated → check black sidebar log
• add any passwords or confidential data as Secrets (see Advanced Options)
• make sure you'll run the same version of Python when deployed
• can later add the link in Medium posts → expanded as gadget
Client Request
[Diagram: multiple single SQL statements vs one SQL script, both executed by the SQL engine (Compute) over the Data]
Snowflake Connector for Python
[Diagram: Python code on the client uses the Snowflake Connector for Python to send SQL statements to the Snowflake SQL engine (Compute), which reads the Data]
Python Connector API
[Class diagram: the snowflake.connector module (apilevel, threadsafety, paramstyle, connect(…)); Exception (msg/raw_msg, errno, sqlstate, sfqid); QueryStatus (ABORTING/SUCCESS/RUNNING/QUEUED/BLOCKED/NO_DATA); Connection (commit/rollback(), autocommit(…), close(), cursor(…), execute_string(…), execute_stream(…), get_query_status(…), is_still_running/is_an_error(…)); ResultMetadata (name, type_code, display/internal_size, precision/scale, is_nullable); Cursor (execute(…), execute_many(…), execute_async(…), describe(…), fetchone/many/all(), fetch_pandas_all/batches(…), get_result_batches(…), rowcount, close()); ResultBatch (compressed/uncompressed_size, to_pandas(), fetchone/many/all()); pandas.DataFrame]
Python Connector: Common Pattern
Python Client
import snowflake.connector
C# Client Code
var user = "cristiscu";
var pwd = Environment.GetEnvironmentVariable("SNOWSQL_PWD");
var connStr = $"account=BTB76003;user={user};password={pwd}";
using (var conn = new SnowflakeDbConnection(connStr)) {
conn.Open();
using (var cmd = conn.CreateCommand()) {
cmd.CommandText = "show parameters";
using (var reader = cmd.ExecuteReader())
while (reader.Read())
Console.WriteLine($"{reader[0]}={reader[1]}");
}
conn.Close();
}
Client Request
[Diagram: UI (front-end), business logic (Snowpark), and data layers]
Snowpark for Python
[Diagram: DataFrame queries and Python SPs/UDFs/UDTFs run on Snowflake compute]
Session
[Class diagram: Session (add_import/packages/requirements(…), get_imports/packages(), remove_import/package(…), clear_imports/packages(…), call(…), close/cancel_all(), use_database/schema(…), use_role/warehouse(…), get_current_database/schema(), get_current_account/role/warehouse(…), query_tag, sql_simplifier_enabled, telemetry_enabled, table(name), sproc, file, reader, query_history()); QueryHistory; FileOperation (get/get_stream(…), put/put_stream(…)); GetResult (status/message, file/size); PutResult (status/message, source/target, source/target_size, source/target_compression); DataFrameReader; Table]
DataFrame Class
[Class diagram: DataFrame (select/selectExpr(…), filter/where(…), sort/orderBy(…), union/unionAll/unionByName(…), intersect/except_/minus/subtract, withColumn/withColumnRenamed(…), with_columns(…), join/crossJoin/natural_join(…), join_table_function/flatten(…), limit(…), agg(…), drop/dropna/fillna(…), crosstab/unpivot(…), describe/rename/replace(…), sample/sampleBy(…), randomSplit(…), toDF(…), collect/collect_nowait(…), show(), explain(), count(…), take(…), first(…), is_cached(), queries, schema/columns, stat, na); Column (alias/name/as_(), name/getName(), getItem(…), starts/endswith(…), isin/in_(…), like/rlike/regexp(…), equal_null/nan(), eqNullSafe(), cast/asType/try_cast(…), between(…), substr/substring(…), collate(…), over(…), within_group(…), asc/desc(), asc/desc_nulls_first/last(), bitand/or/xor(…), bitwiseAnd/OR/XOR(…)); DataFrameStatFunctions (approxQuantile(…), corr/cov(…), crosstab(…), sampleBy(…)); DataFrameNaFunctions (drop/fill/replace(…)); CaseExpr inherits Column (when(…), otherwise/else(…)); Window (currentRow, unboundedPreceding, unboundedFollowing, orderBy/partitionBy(…), rangeBetween/rowsBetween(…)) → WindowSpec (orderBy/partitionBy(…), rangeBetween/rowsBetween(…))]
MERGE Statement
[Class diagram: Table (update(…), delete(…), merge(…), drop_table()); when_matched() → WhenMatchedClause (update(…), delete()); when_not_matched() → WhenNotMatchedClause (insert(…)); UpdateResult (rows_updated, multi_joined_rows_updated); DeleteResult (rows_deleted); MergeResult (rows_inserted, rows_updated, rows_deleted)]
Input/Output
[Class diagram: DataFrame (toPandas(…), to_pandas_batches(…) → pandas.DataFrame; write → DataFrameWriter); DataFrameWriter (copy_into_location(…), saveAsTable(…), mode(…), option/options(…)); DataFrameReader (schema(…), csv/json/xml(…), parquet/orc/avro(…), table(…) → Table); Table (table_name, is_cached, cache_result(…), sample(…)); Row (collect(…), count/index(…), asDict(…))]
Create Query with DataFrame
q = q.filter(q.dname != 'RESEARCH')
q = (q.select("DNAME", "SAL")
.group_by("DNAME")
.agg({"SAL": "sum"})
.sort("DNAME"))
q.show()
Snowpark Stored Procedures
Call from SQL:
call proc1(22.5);

Python Code:
create procedure proc1(num float)
returns string
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'proc1'
as $$
import snowflake.snowpark as snowpark
def proc1(sess: snowpark.Session, num: float):
    return '+' + str(num)
$$;
Java Code:
create procedure proc1(num float)
returns string
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.proc1'
as $$
import com.snowflake.snowpark_java.*;
class MyClass {
    public String proc1(Session sess, float num) {
        return "+" + Float.toString(num); }}
$$;

Scala Code:
create procedure proc1(num float)
returns string
language scala
runtime_version = 2.12
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.proc1'
as $$
import com.snowflake.snowpark.Session;
object MyClass {
    def proc1(sess: Session, num: Float): String = {
        return "+" + num.toString }}
$$;
Snowpark UDFs (User-Defined Functions)
Call from SQL:
select fct1(22.5);

Python Code:
create function fct1(num float)
returns string
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'proc1'
as $$
import snowflake.snowpark as snowpark
def proc1(num: float):
    return '+' + str(num)
$$;
Java Code:
create function fct1(num float)
returns string
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.fct1'
as $$
import com.snowflake.snowpark_java.*;
class MyClass {
    public String fct1(float num) {
        return "+" + Float.toString(num);
}}
$$;

Scala Code:
create function fct1(num float)
returns string
language scala
runtime_version = 2.12
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass.fct1'
as $$
import com.snowflake.snowpark.Session;
object MyClass {
    def fct1(num: Float): String = {
        return "+" + num.toString
}}
$$;
Snowpark UDTFs (User-Defined Table Functions)
Call from SQL:
select * from
table(fctt1('abc'));

Python Code:
create function fctt1(s string)
returns table(out varchar)
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python')
handler = 'MyClass'
as $$
import snowflake.snowpark as snowpark
class MyClass:
    def process(self, s: str):
        yield (s,)
        yield (s,)
$$;

Java Code:
create function fctt1(s string)
returns table(out varchar)
language java
runtime_version = 11
packages = ('com.snowflake:snowpark:latest')
handler = 'MyClass'
as $$
import com.snowflake.snowpark_java.*;
import java.util.stream.Stream;
class OutputRow {
    public String out;
    public OutputRow(String outVal) {
        this.out = outVal; }
}
class MyClass {
    public static Class getOutputClass() {
        return OutputRow.class; }
    public Stream<OutputRow> process(String inVal) {
        return Stream.of(
            new OutputRow(inVal),
            new OutputRow(inVal));
    }
}
$$;
Snowpark for Python
[Diagram: DataFrame queries and Python SPs/UDFs/UDTFs, sent from the client or the Web UI, run on Snowflake compute]
[Class diagram: Session.sproc → StoredProcedureRegistration → StoredProcedure; Session.udf → UDFRegistration → UserDefinedFunction; Session.udtf → UDTFRegistration → UserDefinedTableFunction]
Functions in Snowpark Python
• creating
• sproc/udf/udtf(lambda: ..., [name="..."], ...) anonymous/named
• sproc/udf/udtf.register(name="...", is_permanent=True, ...) registered
• @sproc/@udf/@udtf(name="...", is_permanent=True, ...) registered
• UDTF handler class
• __init__(self) - optional
• process(self, ...) - required, for each input row → tuples w/ tabular value
• end_partition(self) - optional, to finalize processing of input partitions
• calling
• name(...) / fct = function("name") by name/function pointer
• session.call/call_function/call_udf("name", ...) SP/UDF
• session.table_function(...) / dataframe.join_table_function(...) UDTF
• session.sql("call name(...)").collect() SP
External Dependencies
• imports local/staged JAR/ZIP/Python/XML files/folders
• IMPORTS = ('path') in CREATE PROCEDURE/FUNCTION
• session.add_import("path") session level
Web UI
[Diagram: Streamlit/Python data files are uploaded to a named stage in a database and run on Snowflake compute through the Snowpark API]
• levels
• ALTER … SET LOG_LEVEL = OFF/DEBUG/WARN/INFO/ERROR log messages
• ALTER … SET TRACE_LEVEL = OFF/ALWAYS/ON_EVENT trace events
• SYSTEM$LOG('level', message)
• SYSTEM$LOG_TRACE/DEBUG/INFO/WARN/ERROR/FATAL(message)
Event Table Columns
• TIMESTAMP - log time / event creation / end of a time span
• START/OBSERVED_TIMESTAMP - start of a time span
# log messages
import logging
logger = logging.getLogger("mylog")
logger.info/debug/warning/error/log/...("This is an INFO test.")
# trace events
from snowflake import telemetry
telemetry.add_event("FunctionEmptyEvent")
telemetry.add_event("FunctionEventWithAttributes", {"key1": "value1", ...})
Alerts
• CREATE ALERT ...
• WAREHOUSE for compute resources
• SCHEDULE cron expression, for periodical evaluation
• IF (EXISTS(condition)) SELECT/SHOW/CALL stmt to check condition
• THEN action SQL CRUD/script/CALL, can also use system$send_email(...)
• ALTER ALERT ... SUSPEND/RESUME
• INFORMATION_SCHEMA.ALERT_HISTORY(...) table function
-- send email
CALL SYSTEM$SEND_EMAIL(
'my_notif_int',
'[email protected], [email protected]',
'Email Alert: Task A has finished.',
'Task A has successfully finished.\nEnd Time: 12:15:45');
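A sketch of a full alert wrapping such an email action (the warehouse name and condition are placeholders):

create or replace alert my_alert
    warehouse = compute_wh
    schedule = 'USING CRON 0 * * * * UTC'
    if (exists (select 1 from emp where sal > 10000))
    then call system$send_email(
        'my_notif_int',
        '[email protected]',
        'Email Alert',
        'At least one salary is above 10000.');

alter alert my_alert resume;   -- alerts are created suspended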
Client Request
CREATE ROLE role
GRANT/REVOKE privilege ON object TO/FROM role
[Diagram: role hierarchy: USERADMIN; functional custom roles (~user groups: ADMIN, EDITOR, GUEST); database custom roles (RW_ROLE, RO_ROLE); PUBLIC]
System-Defined Roles
• ACCOUNTADMIN - top-level role
• as SYSADMIN + SECURITYADMIN (+ ORGADMIN)
• SECURITYADMIN - to CREATE ROLE and GRANT ROLE to ROLE/USER
• GRANT/REVOKE privileges, inherits USERADMIN
• USERADMIN - to CREATE USER
• SYSADMIN - to CREATE WAREHOUSE/DATABASE/...
• GRANT privileges to objects, should inherit from any custom role
-- create new roles for tenant Admin and tenant ETL data engineer
USE ROLE SECURITYADMIN;
CREATE OR REPLACE ROLE &{tenant}_ADM_&{env};
CREATE OR REPLACE ROLE &{tenant}_ETL_&{env};
• Create and save a passphrase (in local env var) for an encrypted private key
• Generate and save (always local) a private key → rsa_key.p8
• Generate and save a public key (based on the private key) → rsa_key.pub
• Connect w/ basic authN and set RSA_PUBLIC_KEY for current user
• Reconnect w/ private key (and passphrase, if encrypted)
• Can later generate temporary JWT token w/ SnowSQL
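In SQL, attaching the public key looks roughly like this (cristiscu is the user from the later client examples; paste the key value from rsa_key.pub):

alter user cristiscu set rsa_public_key = '<contents of rsa_key.pub>';
describe user cristiscu;   -- check the RSA_PUBLIC_KEY_FP fingerprint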
Key Pair Authentication: Configuration
SQL REST API
• Requires authentication with either OAuth or Key Pair, with a JWT token.
• Data is returned in partitions.
• Can fetch query results concurrently.
• No PUT or GET commands.
Snowpipe REST API
• https://acct.snowflakecomputing.com/v1/data/pipes/name API endpoint
[Diagram: data files are uploaded to an internal named stage (@stage); Snowpipe REST endpoint calls add them to the ingest queue; the pipe (Compute) copies them into the table, all within the Snowflake account]
Client Request
• Data Governance
• Object Tagging
• Query Tagging
• Data Classification
• Masking Policies
• Row Access Policies
Data Governance
• Object Tagging
• Query Tagging
• Data Classification
• System tags & categories
• Column-Level Masking Policies
• Dynamic Data Masking
• External Tokenization
• Tag-Based Masking
• Row Access Policies
Object Tagging
• Monitors sensitive data for compliance, discovery, protection, resource usage.
• Schema-level object, inherited by any child object → tag lineage
[Diagram: row access policies (row-level) vs masking policies (column-level)]
Masking Policies (Column-Level)
• Dynamic Data Masking
• to mask stored data with built-in function data visualization protection
• (***) ***-**** masked (or NULL)
• (***) ***-4465 partially-masked (604) 555-4465
• External Tokenization
• to store tokenized data w/ external function data storage protection
• (Gw6) fk2-cHSl obfuscated
• dslgklknbsdfsdfxzc tokenized (encoded)
• Tag-Based Masking
• ~dynamic data masking, but with a 'PII' security tag value
Masking Policies (Column-Level)
create masking policy research_on_year
as (hiredate date) returns date ->
case when current_role() <> 'RESEARCH' then hiredate
else date_from_parts(2000, month(hiredate), day(hiredate)) end;
Secure Data Share
[Diagram: a secure data share from a Snowflake account can expose own tables, own views, and mixed database objects]
Secure Functions and Views
• Secure UDFs/Stored Procedures
• CREATE SECURE FUNCTION …, IS_SECURE field
• Users cannot see the code definitions (body, header info, imports…)
• No internal optimizations (may be slower), push-down is avoided
• The amount of data scanned is not exposed in queries
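A minimal sketch of a secure view over the course's emp table (emp_public is a hypothetical name):

create or replace secure view emp_public as
    select ename, deptno from emp;   -- salary columns not exposed

show views like 'EMP_PUBLIC';           -- IS_SECURE = true
select get_ddl('view', 'emp_public');   -- only the owner still sees the definition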
[Diagram: a secure data share from the Snowflake producer account exposes secure views/functions from a shared database; the consumer account (or a reader account) sees it as a read-only proxy database]
Private/Public Shares: Listings
• private share = Data Exchange
• w/ other specific consumers (separate/reader accounts)
• no need for approvals
• can share data through secure views/functions, or native apps
[Diagram: the provider account publishes one or more listings (provider profile + secure share of the shared database) through Provider Studio; the consumer account gets the share as a proxy database]
Public Share: Snowflake Marketplace
[Diagram: the provider account publishes the shared database to the Snowflake Marketplace through Provider Studio]
[Diagram: a row access policy filters the "SELECT …" query that Alice runs against Bob's data]
• Bob has full access to his wealth
• Alice can only run the allowed "SELECT …" query, which tells her only that Bob is richer
Data Clean Room: Design Steps
• The producer creates and attaches a row access policy on its table.
• The policy allows only the producer to get full access to its data.
• The policy may allow a consumer role to run some allowed statements.
• The consumer must run the exact statements allowed by the producer.
• Any other statement run by the consumer will return no data.
• The producer will have no access to any consumer data, at any time.
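A hedged sketch of the producer-side policy (role/table names are hypothetical; allowed_statements holds the exact query texts the consumer may run):

create or replace row access policy clean_room_policy
as (sales number) returns boolean ->
    current_role() = 'PRODUCER_ROLE'
    or exists (select 1 from allowed_statements
               where statement = current_statement());

alter table customers add row access policy clean_room_policy on (sales);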
Data Clean Room: with Secure Data Share
Producer (Your Snowflake Account):
customers (name, sales): Mark Dole $12,000; John Doe $2,300; Emma Brown $1,300
allowed_statements (statement): SELECT a.profession, AVG(c.sales)…; SELECT COUNT(*)…

Consumer (Partner Snowflake Account):
associates (fullname, profession): John Doe Teacher; Emma Brown Dentist; George Lou Teacher
query result (profession, AVG(sales)): Teacher $2,100; Dentist $3,200; Clerk $1,230
Client Request
• READER_ACCOUNT_USAGE
• ORGANIZATION_USAGE
Information Schema Views
• Inventory
• Tables, Columns, Views, Event_Tables, External_Tables
• Databases, Stages, Sequences, Pipes, File_Formats
• Replication_Databases, Replication_Groups
• Programming
• Packages, Functions & Procedures
• Class_Instances, Class_Instance_Functions, Class_Instance_Procedures
• Constraints
• Table_Constraints, Referential_Constraints
• Roles & Privileges
• Enabled_Roles, Applicable_Roles
• Applied_Privileges, Usage_Privileges, Object_Privileges, Table_Privileges
• Metrics
• Table_Storage_Metrics, Load_History
Account Usage Views (Historical Only)
• Query_History, Access_History, Login_History
• Alert_History, Task_History, Serverless_Task_History
• Copy_History, Load_History, Pipe_Usage_History
• Data_Transfer_History
• Metering_History, Metering_Daily_History
• Warehouse_Load_History, Warehouse_Metering_History
• Storage_Usage
• Object_Dependencies
• Tag_References
• Sessions
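For example, the most expensive recent queries can be pulled from Account Usage (these views have some ingestion latency):

select query_text, warehouse_name, total_elapsed_time / 1000 as seconds
from snowflake.account_usage.query_history
where start_time > dateadd('day', -7, current_timestamp())
order by total_elapsed_time desc
limit 10;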
Table Constraints
• NOT NULL = the only one always enforced.
• PRIMARY KEY (PK) = for referential integrity, as unique table row identifier, never enforced.
• FOREIGN KEY (FK) = for referential integrity, as propagation of a PK, never enforced.
• UNIQUE = for unique combination of column values, other than the PK, never enforced.
• ENFORCED/DEFERRABLE/INITIALLY = never enforced.
• MATCH/UPDATE/DELETE = for FK only, never enforced.
• CLUSTERING KEYS = optional, similar to PKs, but used for better micro-partitioning, not referential integrity.
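A short sketch over the course's EMP/DEPT model, showing that only NOT NULL is enforced:

create or replace table dept (
    deptno int primary key,
    dname  string not null);
create or replace table emp (
    empno  int primary key,
    ename  string not null,
    sal    number(10,2),
    deptno int references dept (deptno));   -- FK: not enforced, but used by ER diagram tools

insert into dept values (10, 'ACCOUNTING'), (10, 'RESEARCH');   -- duplicate PK values are accepted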
ER (Entity-Relationship) Diagrams
[Diagram: role hierarchy: USERADMIN; functional custom roles (~user groups: ADMIN, EDITOR, GUEST); database custom roles (RW_ROLE, RO_ROLE); PUBLIC]
Security: Roles in Snowsight
Security: Parsing Users and Roles
• SYSTEM$TASK_DEPENDENTS_ENABLE(name)
• enable all children before DAG run
• INFORMATION_SCHEMA.TASK_HISTORY(task_name => name)
• show all task runs, with errors and status: COMPLETED/FAILED/SCHEDULED
• sort DESC by RUN_ID to see most recent runs
Data Lineage: Table-Level Views
select distinct
substr(directSources.value:objectName, len($SCH)+2) as source,
substr(object_modified.value:objectName, len($SCH)+2) as target
from snowflake.account_usage.access_history ah,
lateral flatten(input => objects_modified) object_modified,
lateral flatten(input => object_modified.value:"columns", outer => true) cols,
lateral flatten(input => cols.value:directSources, outer => true) directSources
where directSources.value:objectName like $SCH || '%'
or object_modified.value:objectName like $SCH || '%'
Data Lineage: Column-Level Graph View
OBJECT_DEPENDENCIES View
• DEPENDENCY_TYPE
• BY_NAME = view/UDF… → view/UDF…
• BY_ID = ext stage → storage integration, stream → table/view
• BY_NAME_AND_ID = materialized view → table
Object Dependencies: Tabular View
Object Dependencies: Graph View
Task Dependencies: Initial DAG Topology
digraph {
rankdir="BT";
edge [dir="back"];
T2 -> T1;
T3 -> T1;
T4 -> T2;
T4 -> T3;
T5 -> T1;
T5 -> T4;
T6 -> T5;
T7 -> T6;
T8 -> T6;
}
Task Workflows: Examine DAG Task Runs
select SYSTEM$TASK_DEPENDENTS_ENABLE('tasks.public.t1');
execute task t1;
• In Snowflake
• Upload all your app files into a named stage.
• Create an APPLICATION PACKAGE with the files uploaded in the stage.
• Create an APPLICATION for this package.
• Create a STREAMLIT object for the code.
Native App: Test and Deploy
• In Snowsight
• Start your new app in the new Apps tab.
• Connect to Snowflake through get_active_session()
• Continue editing, running, testing the app in Snowsight, as a producer.
[Diagram: app files are uploaded to a stage in the provider account and bundled into an APPLICATION PACKAGE; the published package is installed in the consumer account as an APPLICATION over a proxy database]
GRANT ... TO APPLICATION ...
Native App: Public Share (Snowflake Marketplace)
[Diagram: app files are uploaded to a stage and bundled into an APPLICATION PACKAGE in the provider account; a provider profile and listing publish it to the Snowflake Marketplace; after an approved request, the consumer account gets the APPLICATION over a proxy database]
• Data Analytics
• SELECT Statement
• Subqueries vs CTEs
• Group Queries
• Pivot/Unpivot Queries
• Time Travel and Fail-safe
SELECT Statement
• SELECT … projection
• DISTINCT … dedup
• FROM ... sources (joins)
• PIVOT | UNPIVOT ... dicing
• GROUP BY [CUBE/ROLLUP/…] … grouping
• WHERE | HAVING | QUALIFY ... filters
• ORDER BY ... sorting
• TOP | LIMIT | OFFSET | FETCH ... slicing
Subqueries vs CTEs
Subqueries:
select ee.deptno,
    sum(ee.sal) as sum_sal,
    (select max(sal)
     from emp
     where deptno = ee.deptno) as max_sal
from emp ee
where ee.empno in
    (select empno
     from emp e
     join dept d on e.deptno = d.deptno
     where d.dname <> 'RESEARCH')
group by ee.deptno
order by ee.deptno;

CTEs:
with q1 as
    (select empno
     from emp e
     join dept d on e.deptno = d.deptno
     where d.dname <> 'RESEARCH'),
q2 as
    (select deptno, max(sal) max_sal
     from emp
     group by deptno)
select ee.deptno,
    sum(ee.sal) as sum_sal,
    max(q2.max_sal) as max_sal
from emp ee
join q2 on q2.deptno = ee.deptno
join q1 on q1.empno = ee.empno
group by ee.deptno
order by ee.deptno;
GROUP BY with WHERE and HAVING
select deptno,
to_char(year(hiredate)) as year,
sum(sal) sals
from emp
where year > '1980'
group by deptno, year -- all
having sum(sal) > 5000
order by deptno, year;
GROUP BY with QUALIFY
select deptno,
row_number() over (order by deptno) as rn,
to_char(year(hiredate)) as year,
sum(sal) sals
from emp
where year > '1980'
group by deptno, year
having sum(sal) > 5000
qualify rn > 1
order by deptno, year;
GROUP BY with GROUPING SETS
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by grouping sets (deptno, year)
having sum(sal) > 5000
order by deptno, year;
GROUP BY with ROLLUP
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by rollup (deptno, year)
having sum(sal) > 5000
order by deptno, year;
GROUP BY with CUBE
select deptno,
to_char(year(hiredate)) as year,
grouping(deptno) deptno_g,
grouping(year) year_g,
grouping(deptno, year) deptno_year_g,
sum(sal) sals
from emp where year > '1980'
group by cube (deptno, year)
having sum(sal) > 5000
order by deptno, year;
PIVOT Query
GROUP BY Query:
select dname,
    to_char(year(hiredate)) as year,
    sum(sal) as sals
from emp
join dept on emp.deptno = dept.deptno
where year >= '1982'
group by dname, year
order by dname, year;

PIVOT Query:
with q as
    (select dname,
        year(hiredate) as year,
        sum(sal) as sals
    from emp
    join dept on emp.deptno = dept.deptno
    where year >= 1982
    group by dname, year
    order by dname, year)
select * from q
pivot (sum(sals)
    for year in (1982, 1983)) as p;
UNPIVOT Query
PIVOT Query:
with q as
    (select dname,
        year(hiredate) as year,
        sum(sal) as sals
    from emp
    join dept on emp.deptno = dept.deptno
    where year >= 1982
    group by dname, year
    order by dname, year)
select * from q
pivot (sum(sals)
    for year in (1982, 1983)) as p;

UNPIVOT Query (→ back to GROUP BY):
with q as
    (select …),
p as
    (select * from q
    pivot (sum(sals)
        for year in (1982, 1983)) as p)
select * from p
unpivot (sals
    for year in ("1982", "1983"));
Time Travel and Fail-safe
• Time Travel
• = for DATABASE|SCHEMA|TABLE
• DATA_RETENTION_TIME_IN_DAYS in CREATE … / ALTER … SET ...
• set to zero to disable
• 1 for transient/temporary tables, or permanent tables in Standard Edition
• 1..90 days for permanent tables in Enterprise Edition
• Fail-safe
• = additional days to restore tables (no SQL, call Snowflake support!)
• 7 days for permanent tables regardless of the edition + cannot disable!
• 0 days for transient tables
Time Travel
• Looking Back in Time
• SELECT ... FROM … AT(TIMESTAMP => <timestamp>) …
• SELECT ... FROM … AT(OFFSET => <time_diff>) …
• SELECT ... FROM … AT(STREAM => '<name>') …
• SELECT ... FROM … AT|BEFORE(STATEMENT => <id>) …
• In Zero-Copy Cloning
• CREATE DATABASE | SCHEMA | TABLE <t> CLONE <s> AT|BEFORE(…)
• Restoring Dropped Objects
• DROP DATABASE | SCHEMA | TABLE …
• SHOW DATABASES | SCHEMAS | TABLES HISTORY […] dropped
• UNDROP DATABASE | SCHEMA | TABLE … restore dropped obj
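A few usage sketches against the course's emp table (the offset and statement ID are placeholders):

select * from emp at(offset => -60*5);                                  -- as of 5 minutes ago
create table emp_restored clone emp before(statement => '<query id>');
drop table emp;
show tables history like 'EMP%';
undrop table emp;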
Client Request
• Window Functions
• Ranking Functions
• Offset Functions
• Statistical Functions
• Regression Functions
Window Functions: OVER Clause
• PARTITION BY
• ORDER BY
• ROW_NUMBER() 1, 2, 3, 4, 5, 6, 7 …
• RANK() 1, 1, 3, 4, 4, 4, 7 …
• DENSE_RANK() 1, 1, 2, 3, 3, 3, 4, …
• PERCENT_RANK() 0%, 0%, 23%, 48%, …
• NTILE(n) 1, 1, 1, 2, 2, 2, 3, 3 …
• CUME_DIST()
Rank Functions
select deptno, ename,
row_number() over (order by deptno) row_number,
rank() over (order by deptno) rank,
dense_rank() over (order by deptno) dense_rank,
round(percent_rank() over (order by deptno) * 100) || '%' percent_rank
from emp
order by deptno;
Offset Functions
• LEAD(expr, offset=1)
• LAG(expr, offset=1)
• FIRST_VALUE(expr)
• LAST_VALUE(expr)
• NTH_VALUE(expr,offset)
• RATIO_TO_REPORT(expr)
Offset Functions
select ename,
lead(sal, 1) over (order by ename) lead,
sal,
lag(sal, 1) over (order by ename) lag,
first_value(sal) over (order by ename) first,
last_value(sal) over (order by ename) last,
nth_value(sal, 1) over (order by ename) nth
from emp
order by ename;
Statistical Functions
• VAR_POP/SAMP
• VARIANCE
• STDDEV_POP/SAMP
• STDDEV
• COVAR_POP/SAMP
• CORR
• SKEW(expr)
• KURTOSIS(expr)
Skew & Kurtosis
• REGR_COUNT
• REGR_SXX/SYY/SXY
• REGR_AVGX/AVGY
Linear Regression (y = x * SLOPE + INTERCEPT)
select REGR_SLOPE(sals, year), REGR_INTERCEPT(sals, year), REGR_R2(sals, year)
from (select year(hiredate) as year, sal as sals from emp order by year);
Client Request
[Diagram: cache layers: HOT = Metadata and Result Cache in the Services layer (past 24h, kept max 31 days); WARM = warehouse local disk (SSD) and RAM in the Compute layer; COLD = remote disk (cloud blob storage) in the Storage layer]
Query Result Caching
• HOT
• result cache on
• cache hit → SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
• WARM
• result cache off or no cache hit
• warehouse up and result still in the warehouse (in RAM or local disks)
• get query result from warehouse cache (BYTES_SCANNED > 0)
• COLD
• no warehouse cache (warehouse suspended, not up)
• result cache off (ALTER SESSION SET USE_CACHED_RESULT = false) or no cache hit
• must run the query
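A quick way to observe the cases, reusing the course's emp/dept tables:

alter session set use_cached_result = true;
select dname, sum(sal) from emp join dept on emp.deptno = dept.deptno group by dname;
-- rerunning the exact same text is a HOT hit: BYTES_SCANNED = 0 in the query profile
select * from table(result_scan(last_query_id()));   -- reuse the last result set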
Query Profile
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Enhanced Query Profile
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Enhanced Query Profile
"Exploding" Joins Problem
Query Execution Plan: EXPLAIN
• EXPLAIN [USING TABULAR] <query>
• ~EXPLAIN_JSON function (JSON → tabular output)
• EXPLAIN USING JSON <query>
• ~SYSTEM$EXPLAIN_PLAN_JSON function (JSON output)
• EXPLAIN USING TEXT <query>
• ~SYSTEM$EXPLAIN_JSON_TO_TEXT function (JSON → TEXT output)
Query Execution Plan
explain
select dname, sum(sal)
from employees.public.emp e
join employees.public.dept d
on e.deptno = d.deptno
where dname <> 'RESEARCH'
group by dname
order by dname;
Clustering
• Clustering Keys
• SYSTEM$CLUSTERING_DEPTH(table, (col1, …))
• SYSTEM$CLUSTERING_INFORMATION(table, (col1, …))
{
"cluster_by_keys" : "LINEAR(ss_sold_date_sk, ss_item_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 9, so-so (higher is better)
"average_overlaps" : 3.4849, bad (many overlaps)
"average_depth" : 2.7497, system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 3,
"00002" : 180604, so-so (most on depth 2-3)
"00003" : 540900,
"00004" : 0,
…
"00016" : 0 good (none so deep)
},
"clustering_errors" : [ ]
}
Clustering: With Partial Clustering Keys
SQL Code → JSON Result
select system$clustering_information(
'snowflake_sample_data.tpcds_sf100tcl.store_sales', -- ~300B rows
'(ss_sold_date_sk)');
{
"cluster_by_keys" : "LINEAR(ss_sold_date_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 719687, good! (high)
"average_overlaps" : 0.0132, good! (low)
"average_depth" : 1.0076, system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 719203, good! (most w/ depth 1)
"00002" : 366,
"00003" : 1197,
"00004" : 624,
…
"00032" : 22 so-so (22 partitions on depth 32)
},
"clustering_errors" : [ ]
}
Clustering: Worst Case Scenario
SQL Code → JSON Result
select system$clustering_information(
'snowflake_sample_data.tpcds_sf100tcl.store_sales', -- ~300B rows
'(ss_store_sk)');
{
"cluster_by_keys" : "LINEAR(ss_store_sk)",
"total_partition_count" : 721507,
"total_constant_partition_count" : 0, very bad (no constant partition)
"average_overlaps" : 721506.0, very bad (almost all overlap)
"average_depth" : 721507.0, system$clustering_depth
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 0,
"00003" : 0,
"00004" : 0,
…
"1048576" : 721507 very bad (all w/ full scan)
},
"clustering_errors" : [ ]
}
Programming in Snowflake Masterclass
2024 Hands-On!