
Snowflake

What is Snowflake?

I define Snowflake as a cloud-native data platform offered as a service.
The idea behind calling Snowflake a data platform instead of strictly a data warehouse, as is quite common, is to point out that Snowflake has features and capabilities beyond what you'd expect from a traditional data warehouse.
It has support for additional workloads such as data science
and features such as native processing of semi-structured
data, like you'd find in a data lake.
Let's run through the six key workloads Snowflake supports as a data platform.
Data Warehouse
Snowflake is very often referred to as a data warehouse. It organizes its data into databases, schemas, and tables.
A Snowflake user can ingest structured data, such as CSV files, into a table and then query it using ANSI-standard SQL.
All SQL statements are also ACID compliant.
Think of Snowflake as an advanced version of a traditional data warehouse with strong capabilities in other areas.
Data Lake
The Snowflake service can scale out storage and compute
to handle big data in the petabytes.
You can also store raw files without needing to specify the
schema upfront.
Snowflake can natively process semi-structured data
formats, such as JSON or Avro, with the ability to
manipulate complex data structures using built-in functions
and SQL extensions.
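For example, here's a minimal sketch of what that looks like in practice (the table, column, and JSON field names are assumptions for illustration): semi-structured data lands in a VARIANT column and is queried with the colon notation and FLATTEN.

-- Land raw JSON in a VARIANT column (names are illustrative)
CREATE TABLE RAW_EVENTS (V VARIANT);

-- Query nested fields with the colon notation and explode arrays with FLATTEN
SELECT
  V:event_type::STRING AS EVENT_TYPE,
  V:user.id::NUMBER    AS USER_ID,
  F.VALUE:sku::STRING  AS SKU
FROM RAW_EVENTS,
     LATERAL FLATTEN(INPUT => V:items) F;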
Data Engineering.
Snowflake has methods to simplify data ingestion for batch and streaming workloads, including the COPY INTO statement and the serverless feature Snowpipe (sketched below).
With Snowflake's architecture, ETL and analytics jobs can be isolated from each other by instantiating separate compute clusters on the fly.
Snowflake also offers native objects, such as tasks and streams, to create data pipelines.
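As a rough sketch of both ingestion methods (the stage, table, and pipe names are assumptions):

-- Batch ingestion: copy staged files into a table
COPY INTO SALES
  FROM @SALES_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Continuous ingestion: a Snowpipe object wrapping the same COPY INTO
CREATE PIPE SALES_PIPE AUTO_INGEST = TRUE AS
  COPY INTO SALES
  FROM @SALES_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);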
By default Snowflake comes with many security features,
including encryption of all data at rest and in transit, role-
based access control, multi-factor authentication, and
more.

Data Science
Data scientists have historically faced difficulties preparing
data assets, building models, moving those models into
production, and actively monitoring their AI and machine
learning assets after deployment.
One key aspect of Snowflake which helps data scientists is
its centralized storage.
Data scientists typically spend most of their time finding,
retrieving, and cleaning data, and a minority on actually
building, training, and deploying their models.
Having all their data in one location which has been
curated really helps in removing some of the data
management roadblocks they face.
Snowflake also has a partner ecosystem of third-party vendors and tools that have been certified to integrate with the Snowflake platform.
Data science tools like Amazon SageMaker, DataRobot, and
others integrate natively with Snowflake.
Data Sharing
Snowflake has also implemented some interesting features to simplify the sharing of table data.
Secure Data Sharing is a feature that allows one Snowflake account to privately share table data with another account in the same region, which is not possible on many platforms.
There's also an online marketplace, accessible through the Snowflake UI, where we can produce and consume datasets.
And using the partner ecosystem, we can expose our curated data to select customers via BI tools like Tableau or Power BI.
Data Applications
Let's talk about data application development.
Snowflake maintains many connectors and drivers for high-level languages, like Python, .NET, and Go, and database connectors, like JDBC and ODBC, to help in building data-intensive applications.
You can also extend the built-in functions with UDFs and
stored procedures, which can be written in Java, JavaScript,
Python, or SQL.
We can also make use of external UDFs to write custom
code, which resides outside of Snowflake on a cloud
platform service like AWS Lambda or Azure Functions.
There's also a framework for building data applications called Snowpark, which allows for programmatic querying and processing of data using Java, Python, or Scala.

Cloud Native
Snowflake’s software is purpose built for the Cloud
Snowflake is a cloud-native solution. All the software that
allows Snowflake to handle the workloads we just went
through was purpose-built for the cloud.
It's not a retrofitted version of an existing technology like
Hadoop or Microsoft SQL Server.
The query engine, the storage file format, the metadata
store, the architecture generally was designed with cloud
infrastructure in mind.

This makes Snowflake a cloud-native solution, not what is sometimes called a cloud-washed solution.
The actual hardware (the storage, the compute, the load balancers), all the infrastructure that makes up a single Snowflake account and runs the software Snowflake designed, is provisioned into one of three leading cloud providers.
All Snowflake infrastructure runs on the Cloud in either
AWS, GCP or Azure.
It was originally designed to run on AWS and then ported to
Azure in 2018, and then to Google Cloud Platform in 2020.
Snowflake makes use of Cloud’s elasticity, scalability, high
availability, cost-efficiency & durability.
Snowflake inherits the many benefits of the cloud: high availability, so we always have access to services and data; scalability, so we can very easily add compute or storage resources; elasticity, so we can dynamically scale out based on demand; and durability guarantees, so we know our data is replicated and won't be lost.
Software as a service (SaaS)
You could have virtual machines hosted remotely, which
are accessed and managed by a user, like an EC2 instance
in AWS.
Snowflake is slightly different from that.
It's a pure SaaS product, or software-as-a-service product.
This means that as users there's no management of our
account infrastructure at all, so no SSHing into instances or
checking data files once loaded into Snowflake tables.
There's also very limited management of software with the
only real thing to configure being Snowflake's connectors,
like the SnowSQL command line tool.
In Snowflake, the user performs no manual upgrades or
patches to software.
There are weekly online updates and patches which are
completely transparent to the user and cause no visible
downtime.
As with many SaaS products, Snowflake offers a flexible pay-as-you-use subscription model. This flexibility allows users to pay only for the resources they actually need, which is generally cheaper than fixed-capacity licensing.
Accessing Snowflake is super simple, no need to set up any
software. You interact with a Snowflake account through
the UI by authenticating with a username and password.
This can optionally be extended with connectors and drivers for programmatic access.
Data optimization activities, such as compression and micro-partitioning, are also done automatically as part of the loading process.

Multi-cluster Shared Data Architecture


Distributed Architectures
Most traditional data storage and analysis systems, such as on-prem data warehouses, data lakes, and cloud-ported versions of these solutions, arrange their underlying hardware and networking into one of two distributed computing architectures:
 shared-disk
 shared-nothing

Shared-disk architecture
This was the first move away from single-node architectures. It works by scaling out the nodes in a compute cluster while keeping storage in a single location. In other words, all data is accessible from all cluster nodes, and any machine can read or write any portion of the data.
Shared-nothing architecture
The dominant architecture for high-performance storage and querying, used by systems like Hadoop and Spark, has become the shared-nothing architecture.
As the name indicates, the nodes in this architecture do not share any hardware.
Storage and compute are co-located on machines that are networked together, and each node in the cluster holds a subset of the data locally instead of the nodes sharing one central data repository.

Shared-disk architecture: Advantages and Disadvantages
Advantages
Shared-disk architecture is relatively simple to manage and acts as a single source of truth; there aren't multiple copies of your data out there.
However, that management simplicity comes with trade-offs.
Disadvantages
The centralized storage is a single point of failure.
There can also be bandwidth and network latency limitations, with multiple nodes communicating with the centralized storage.
Using this architecture limits scalability: only a relatively small number of queries can run against the centralized storage device at once, which results in resource contention.
Shared-nothing architecture: Advantages and Disadvantages
Advantages
Compute and storage are co-located on each node. This avoids the bandwidth and network latency limitations of communicating with a centralized storage device.
Shared-nothing is generally cheaper as well, as each node in the cluster can be built from off-the-shelf commodity hardware instead of the big enterprise systems generally associated with the shared-disk architecture.
It's also a lot simpler to scale the shared-nothing architecture; you can relatively easily add a networked node to the cluster.
Disadvantages
Scaling was still limited, because shuffling data between nodes was expensive. As the cluster scaled, data that was poorly distributed among the nodes caused poor performance, and it was also quite difficult to redistribute the data among the nodes once it was ingested.
Storage and compute were tightly coupled: if you wanted more compute, you'd have to pay for more storage.
Shared-nothing architectures had a tendency towards over-provisioning of hardware, because they didn't have the flexibility to easily decrease compute resources when they weren't needed.
And again, resource contention was still a common issue.

Multi-cluster Shared Data Architecture


Let's take a look at how Snowflake has reimagined storage, compute, and networking in the cloud with what it calls the Multi-cluster Shared Data Architecture, built specifically for the cloud.
It's a service-oriented architecture consisting of three physically separated but logically integrated layers.
At the bottom is a centralized cloud storage layer, which stores all our table data organized into databases, conceptually similar to the shared-disk architecture.
Moving up a level, we have the query processing layer
made up of separate compute clusters that Snowflake
called virtual warehouses. They're responsible for
executing the computation required to process a query,
users issue SQL commands to create virtual warehouses
that Snowflake then provision and manage. All virtual
warehouses created, have consistent access to the
centralized storage and have the ability to cache locally the
table data used during query processing. Each cluster
working in a similar fashion to the shared nothing
architecture.
At the top, we have the cloud services layer, which coordinates the whole show: handling authentication to Snowflake, managing the cloud infrastructure, parsing and optimizing queries, and a lot more.
And when I refer to this architecture as service-oriented, what I mean is that each layer under the hood is a separate physical service, or collection of services, that Snowflake manages, communicating over a network via RESTful interfaces.
For example, the actual physical machines which compute
the query results, which make up virtual warehouses are
entirely separate from where we keep the long-term table
data.

It allows for some quite unique capabilities.


Firstly, and this is key, storage, compute and management
services can now be scaled independently. They're
decoupled.
Secondly, because the layers are decoupled and
provisioned in a cloud environment, there is no hard limit
on how much each layer can be scaled.
We can throw as much data as we want into the cloud
storage layer, and scale out compute near infinitely.
The multi-cluster aspect of the multi-cluster shared data
architecture is the ability to create as many separate
compute clusters as desired, and they'd be isolated from
each other as different virtual warehouses. This allows for
workload isolation.
For example, allowing analytics and pipeline jobs to not
contend for resources.
There's also a fourth layer called the cloud agnostic
layer, which ensures the three main layers will perform
identically on any of the cloud providers an account is
deployed into.
However, this won't feature on the exam, so just good to
know.

Storage Layer
The storage layer is really just the blob storage service of
the cloud provider you've deployed your Snowflake account
into.

The storage layer is persistent and infinitely scalable cloud storage residing in the cloud provider's blob storage service, such as AWS S3.
Data loaded into Snowflake is organized by databases and schemas and is accessible primarily as tables.
Snowflake users by proxy get the availability and durability guarantees of the cloud provider's blob storage.
Both structured and semi-structured data files can be loaded and stored in Snowflake.

So, for example, for our trial account, which was created in AWS, the storage layer would use the S3 service under the hood.
As Snowflake users, we inherit the native ability of services like S3 to scale out our storage layer almost infinitely.
We also get the availability and durability guarantees of the cloud provider's blob storage.
S3 is designed to be both highly available and durable, replicating data across three physically separated availability zones within an AWS region.
This is how Snowflake achieves excellent availability and redundancy of stored table data.
To the user, data loaded or inserted into Snowflake is
organized into databases, schemas, and tables. All table
data is stored in the centralized storage layer. Acting as a
single source of truth.
Snowflake also has native support for ingesting structured and semi-structured file formats, such as delimited files like CSV or TSV, as well as semi-structured files like JSON, Avro, and Parquet.
So, what is Snowflake actually storing in blob storage when data is loaded into a table?
When we load a data file or insert some records via SQL, the data is reorganized into Snowflake's proprietary columnar file format.

When data files are loaded or rows are inserted into a table, Snowflake reorganizes the data into its proprietary compressed, columnar table file format, completely transparently to the user, and stores it in the scalable blob storage.
A columnar format stores the data values of the same column physically next to each other on disk, unlike a CSV file, for example, which stores data in a row-oriented manner, where each data value of a row is stored next to the others.
Storing our table data in a columnar file format is great for
the types of OLAP queries we'll be running on Snowflake.
Because it effectively allows you to skip past columns,
which are not needed for a query, unlike in a row oriented
data file. This is what we call read optimization. Reducing
the amount of data needed to be fetched.
The data files are also compressed during the loading
process. Compression is particularly good in a columnar file
format.
As columns usually contain data of the same type. So a
compression algorithm can be more efficient.
And lastly for this point, all data files are automatically encrypted by default using AES-256 strong encryption.
Data loaded into Snowflake will also be divided into what
Snowflake calls micro partitions. This is done so Snowflake
can optimize queries by ignoring partitions, that aren't
needed to compute the result of a query.
A Snowflake account is charged a flat rate per terabyte for table data stored in the storage layer, calculated at the end of each month.
All the processes like encryption, compression, columnar storage, and micro-partitioning are entirely automated and managed by Snowflake.
Table data is only accessible via SQL, not directly in the blob storage.

Query Processing Layer

The query processing layer is also referred to as the compute layer.
As mentioned, this layer performs the processing tasks on the data gathered from the storage layer to answer user queries.
It consists of Snowflake-managed compute clusters called
virtual warehouses.
The query processing layer consists of “Virtual
Warehouses” that execute the processing tasks required to
return results for most SQL statements.
A virtual warehouse is a Snowflake object you create via
SQL commands. It's a named abstraction for a cluster of
cloud-based compute instances that Snowflake provision
and manage.
CREATE WAREHOUSE MY_WH WAREHOUSE_SIZE=LARGE;

If we were to execute this CREATE WAREHOUSE statement in our trial account, which is deployed into AWS, behind the scenes the virtual warehouse would be composed of EC2 instances arranged together as a distributed compute cluster.
These run the software of the execution engine Snowflake
designed for the cloud.
This carries out the computation task of the query plan
generated by the services layer.
And because Snowflake is a SaaS solution, as users, we
have no direct access to those nodes. We interact with just
the abstracted warehouse object.
Under the hood, these compute clusters operate in a very similar way to a shared-nothing architecture.
Virtual warehouses make a remote call to the storage layer
when a query is issued to them, and then the raw table
data retrieved is stored on a local cache made up of high-
speed storage.
This cache data can be used to compute results for
subsequent queries. Virtual warehouses are ephemeral and
highly flexible.
By ephemeral, we mean that virtual warehouse objects can be created and dropped instantly by the user, just like a table or a database.
Behind the scenes, Snowflake will provision the compute
instances when a create statement is issued and remove
them when it's dropped.
Once they've been created, virtual warehouses can be
paused or resumed.
Virtual warehouses can be created or removed instantly.
A virtually unlimited number of virtual warehouses can be created, each with its own configuration.
Virtual warehouses can be paused or resumed.
Virtual warehouses come in multiple “t-shirt” sizes
indicating their relative compute power.
All running virtual warehouses have consistent access to
the same data in the storage layer.
In the paused state, for example, you're not charged for compute.
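As a sketch of these operations against the warehouse created earlier (MY_WH):

ALTER WAREHOUSE MY_WH SUSPEND;                      -- pause: no compute charges while suspended
ALTER WAREHOUSE MY_WH RESUME;                       -- resume when needed
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = XLARGE;  -- resize for a heavier workload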
The ability to create many virtual warehouses gives us a couple of key advantages.
One, it allows us to scale out compute to handle highly
concurrent workloads.
And two, each warehouse is isolated from each other. This
means we could create a virtual warehouse for data
loading and another for data analysis,and there would be
no resource contention between them.
On top of this, each virtual warehouse can have its own
configuration for each unique workload.
A key virtual warehouse configuration is its size. Virtual
warehouses come in many sizes, which indicate their
relative compute power.
They range from extra small to six times extra large. The size relates to the number of cloud compute instances in the cluster.
This allows you to scale up the individual virtual
warehouses to meet the computation requirements for
different workloads.
Some of the larger virtual warehouses take a bit longer to
provision, but it's generally pretty quick.
All running virtual warehouses have access to the same
data in the storage layer, regardless of how much data we
have stored there or the total number of virtual
warehouses running.
You can imagine the complexity involved in synchronizing
all the reads and writes from potentially hundreds of virtual
warehouses to a single storage location.
How does Snowflake handle this?
Snowflake is not an eventually consistent system, it uses
strict ACID-compliant processing to ensure that all updates
and inserts are immediately available to all virtual
warehouses.
This is primarily achieved by a service in the global
services layer called the transaction manager, which
synchronizes data access.

Services Layer
The services layer is a collection of highly available and
scalable services that coordinate activities such as
authentication and query optimization across all Snowflake
accounts.
Similar to the underlying virtual warehouse resources, the
services layer also runs on cloud compute instances.
Services managed by this layer include:
• Authentication & Access Control
• Infrastructure Management
• Transaction Management
• Metadata Management
• Query parsing and optimisation
• Security
As users, we don't strictly need to understand the inner workings of how Snowflake manages things like infrastructure.
Okay, so what is the services layer?
It's a collection of highly available and scalable services
that coordinate activities across all Snowflake accounts in
order to process user requests.
Think of activities like authentication or query optimization.
This is why you might hear it referred to as the global
services layer.
Having a global multi-tenancy model like this, instead of
creating an account-specific version of all the services
every time an account is requested, allows Snowflake to
achieve certain economies of scale, and also makes
implementation of some interesting features, like secure
data sharing, much simpler to achieve.
Behind the scenes,
the services run on cloud-based compute instances, much
like virtual warehouses. However, we have no control or
visibility into their creation or how they work.
So what services actually make up the services layer, and what do they do?
Let's first take a look at authentication and access control.
This is about proving who you are and if you have valid
login credentials, as well as determining, once you're
logged into an account, what level of privileges you have to
perform certain actions.
We also have infrastructure management.
This service handles the creation and management of the
underlying cloud resources, such as the blob storage and
compute instances required for the storage and query
processing layers.
Next up, we have transaction management.
As mentioned briefly in the Virtual Warehouse section,
Snowflake is an ACID-compliant data warehouse, which
uses transactions to ensure, among other things, the data
is consistently accessible by all virtual warehouses.
Okay, the metadata management service keeps
information and statistics on objects and the data they
manage. The services layer also handles query parsing and
optimization. This service takes the SQL query we submit
and turns it into an actionable plan the virtual warehouses
can execute.
And lastly here is security. This is a broad category of
services, which handle things like data encryption and key
rotation.

Snowflake Editions & Key Features


Different editions of Snowflake exist to better fit how an
organization would use Snowflake and what level of
features and service they require.
The Snowflake edition affects the amount charged for
compute and data storage.
The Standard edition is the introductory level offering and
contains the core functionality of Snowflake. With this
edition, we get the basic ANSI standard SQL features you'd
expect, including most DDL and DML statements, as well as
use of advanced DML statements, such as multi table
insert, merge, and windowing functions.
We also get access to Snowflake's suite of security,
governance and data protection features Snowflake
collectively call Continuous Data Protection. This includes
the likes of Time Travel and network policies.
Next along, we have the Enterprise edition.
With this, we get all the features and services of the
Standard Edition, along with features designed specifically
for the needs of large-scale enterprises and
organizations.
Some examples include multi-cluster warehouses and
database failover.
The Business Critical edition offers higher levels of data
protection and enhanced security to support organizations
with very sensitive data,
for example, data that must comply with regulations. With
this edition, we can enable private connectivity to our
Snowflake account, or use our own customer-
managed key to encrypt data in Snowflake. Business
Critical includes all the features and services of the
Enterprise and Standard editions.
And finally, we have the VPS edition, or Virtual Private
Snowflake.
This offers the highest level of security for organizations
that have strict requirements around data protection, such
as governmental bodies or financial institutions.
It includes all the features of the Business Critical Edition
and below, but in a completely separate environment,
walled off from all other Snowflake accounts.
Ordinarily, the services layer is shared between accounts. If you were to create a VPS edition account, this wouldn't be the case; services like the metadata store wouldn't be shared.

Snowflake Object Model


Objects like databases, schemas, and tables, the building
blocks of how we organize and store our data.
We also have objects which are specific to Snowflake and
enable a lot of its advanced features, things like streams,
stages, and pipes.
Everything in this object model is an object.
An object in Snowflake is simply something you can
interact with, something you can issue commands against,
such as create table or drop stream.

For example, we can manage account properties by executing commands against an account object.
It's important to point out at this stage that every object in Snowflake is securable. This means that privileges on objects, such as the ability to read a table, can be granted to roles.
Roles are then granted to users, and this determines what a user can see and do in Snowflake (a sketch follows below).
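A minimal sketch of that flow (role, user, and object names are assumptions):

-- Grant the privileges needed to read a table to a role
GRANT USAGE ON DATABASE MY_DATABASE TO ROLE ANALYST_ROLE;
GRANT USAGE ON SCHEMA MY_DATABASE.MY_SCHEMA TO ROLE ANALYST_ROLE;
GRANT SELECT ON TABLE MY_DATABASE.MY_SCHEMA.MY_TABLE TO ROLE ANALYST_ROLE;

-- Grant the role to a user, who can then see and query the table
GRANT ROLE ANALYST_ROLE TO USER SOME_USER;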
Snowflake objects fit into a hierarchy with one to many
relationships.
At the top here we have an organization, which is a
relatively new feature responsible for managing one or
more Snowflake accounts, and you might have noticed that
there is no reference to an organization in your trial
account.
This is because the organization feature is not enabled by
default. It's available on request from Snowflake support.
You can configure an organization's properties and perform
administration tasks by interacting with an organization
object.
Now, going down a level, we have the account.
The word account is used to refer to both the collection of
services Snowflake provide you when you sign up, the UI,
the compute, the storage, all of that, which is accessible
via the URL, but it also refers to an account object.
You can interact with this object via SQL to change account
level parameters, like how we handle date strings across
the account.

Just below the account object, we have a row of what are referred to in the Snowflake documentation as account-level objects.
These typically don't hold data, but are used to configure
different parts of your account, such as which users exist,
and how many virtual warehouses you'd like.
For now, let's zoom in on the series of containers that hold
our data, starting with databases.
Databases form the first main way we can organize our
data stored in Snowflake, like any other SQL data
warehouse, and one account can have many databases.
One level down, databases can be further organized into
schemas.
One schema belongs to one database. Schemas
themselves are comprised of many different types of
objects.
The table is the logical representation of our stored data and is what we will primarily be interacting with.
There are many more schema level objects which serve a
variety of purposes.

Organisation, Account, Database & Schema


What is the organization feature intended for?
Firstly, it's essentially a way to manage multiple Snowflake accounts, including creating them. Within a real-world company or organization, there may be many Snowflake accounts.
Secondly, with organizations enabled, you can set up and
administer Snowflake features, which make use of multiple
accounts within an organization such as database
replication and failover.
And thirdly, it's used for monitoring usage and billing
across multiple accounts.
So how do we go about enabling organizations?
The standard approach is to contact Snowflake
support.
During this process, you provide, among other things, an
organization name and designate an existing account as
the primary account.
The account you nominate will have access to a role
called org admin.
In the classic console, for example, with the org admin role
active, you can view the accounts created under that
organization.
The ORGADMIN role allows a user to manage the lifecycle of an account.
When a user has the ORGADMIN role active, they can create accounts with a query like the one sketched below. When creating an account, you specify the cloud platform, a region, and the Snowflake edition.
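A hedged sketch of such a statement, with illustrative values for the admin credentials, edition, and region:

CREATE ACCOUNT MY_SECOND_ACCOUNT
  ADMIN_NAME = ADMIN_USER
  ADMIN_PASSWORD = '<a strong password>'
  EMAIL = 'admin@example.com'
  EDITION = ENTERPRISE
  REGION = AWS_US_EAST_1;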
By default, there is a limit of 25 accounts.
You'll have to reach out to Snowflake support to increase
this cap.
Running the SHOW ORGANIZATION ACCOUNTS command allows the user to view all accounts in an organization. You can also list the regions available for an organization by executing the SHOW REGIONS command.
With this role active, we can also enable cross account
features.
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER(
  'UT677AA',
  'ENABLE_ACCOUNT_DATABASE_REPLICATION',
  'true');
Shown here is the command to enable replication for an account.
And lastly, monitoring account usage: we can query usage and billing information for all accounts in an organization via the ORGANIZATION_USAGE schema in the SNOWFLAKE database.

Account
An account is the administrative name for a collection of
storage, compute and cloud services deployed and
managed entirely on a selected cloud platform.
When we use the word account in Snowflake, we're either
referring to the administrative name for the collection of
storage, compute, and cloud services, or an account object
itself, which is used to change account properties and
manage account level objects.
Each account is hosted on a single cloud provider, either
Amazon Web Services, Google Cloud platform, or Microsoft
Azure.
And for each cloud provider, there are several regions an
account can be provisioned into.
An account resides in a single geographic region, as does
the data in that account, as there are regulatory
considerations for moving data between regions.
By default, Snowflake doesn't move data between regions unless requested.
Each account is created with a single Snowflake edition; however, the edition can be changed later on.
An account is created with the system-defined role
ACCOUNTADMIN.
By default, accounts contain a number of system-defined roles. To configure account-level properties and manage account-level objects like warehouses, we can use the ACCOUNTADMIN role.
However, this role is quite powerful, so it should be
granted to users sparingly.
This allows us to enforce the security best practice of least
privilege.
Account Regions
This is where your data would physically reside, and you'd
be subject to the regulatory requirements of that region.
The region in which your account is provisioned affects:
• the price of compute and storage,
• which regulatory certifications you can achieve,
• which Snowflake features you have access to,
• and the network latency you'll experience if your account is in a different region from where you're connecting.
Account URL
This URL uniquely identifies your account and is the hostname used to connect to the Snowflake service, whether that's through the UI, the SnowSQL command line tool, or a Python connector.
Whatever method of connectivity you use, with this URL,
you'll gain access to your remotely hosted account.
Your trial account URL is most likely composed of three
parts.
The account locator, the cloud services region ID, and the
cloud service provider.
All these components together form an account
identifier.
Depending on the region and cloud platform your account
is deployed into, your account identifier might only be
composed of an account locator or some mixture of all
three.
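For illustration, a three-part account identifier in the URL follows this general pattern (the locator, region, and provider shown are hypothetical):

https://<account_locator>.<cloud_region_id>.<cloud_provider>.snowflakecomputing.com
-- e.g. https://xy12345.us-east-2.aws.snowflakecomputing.com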
If the account setup is not done through an automated
provisioning process, like with the trial, but with someone
from Snowflake support, you can request a specific unique
account identifier.
If your account was created by a user with the org admin
role, the account identifier would look a bit different.
It would be comprised of a unique organization name and
an account name set when creating the account.
Database.
This is an important object.
It's the first logical container for your data. It groups
together schemas, which themselves hold schema level
objects such as tables and views.

Let's step through some of the properties of a database.


Databases must have a unique name in an account.
A database name must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes.
Identifiers enclosed in double quotes are case sensitive; those without are not.
Let's go through a few sample commands.
CREATE DATABASE MY_DATABASE;

CREATE DATABASE MY_DB_CLONE CLONE MYTESTDB;

CREATE DATABASE MYDB1
  AS REPLICA OF MYORG.ACCOUNT1.MYDB1
  DATA_RETENTION_TIME_IN_DAYS = 10;

CREATE DATABASE SHARED_DB FROM SHARE UTT783.SHARE;

Databases can be created from a clone of another database in the same account.
Databases and their child objects can be replicated to another account.
And databases can be created from a share object provided by another Snowflake account.

schema objects
Schemas are a way to further segment a database.
One database can contain many schemas, and each
schema belongs to a single database.
A schema name must be unique within a database.
Like a database name, a schema name must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes.
Here we have the create statement for a schema.

CREATE SCHEMA MY_SCHEMA;


CREATE SCHEMA MY_SCHEMA_CLONE CLONE MY_SCHEMA;

This code snippet shows how schemas can also be cloned.


This provides a mirrored version of a schema and its child
objects.
The database name and schema name together form a
namespace in Snowflake.

You can prepend the namespace to a table name to refer to a table unambiguously, or you can set the namespace in the session context in the UI or on the command line.
Once a namespace is set, using the USE DATABASE and USE SCHEMA commands, all subsequent queries will execute within that database and schema combination, as sketched below.
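A quick sketch of both approaches (database, schema, and table names are assumptions):

-- Fully qualified reference: database.schema.table
SELECT * FROM MY_DATABASE.MY_SCHEMA.MY_TABLE;

-- Or set the namespace as session context first
USE DATABASE MY_DATABASE;
USE SCHEMA MY_SCHEMA;
SELECT * FROM MY_TABLE;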
Table and View Types
let's take a look at the types of tables and views we get in
Snowflake.
Tables are a logical abstraction over the data in the storage layer; they describe the structure of your data so that you can query it.
There are four types of tables in Snowflake, each with a
different set of requirements around data retention.

The first is the permanent table type.


This is the default table type when a table is created, and is
the most commonly used.
As the name implies, this table will exist until it's explicitly
dropped by a user.
Each table type has different limitations when it comes to
these features.
The time travel retention period for a table is the duration
of time we can go back and perform data restoration tasks,
things like undropping a table.
A permanent table can set the time travel retention
period up to 90 days if we're on an enterprise
edition or higher Snowflake account.
If on the standard edition, the max is one day.
Permanent tables also have access to the non-configurable
period of seven days for fail-safe, in which Snowflake can
restore deleted data for us.
The second type of table is the temporary table.
A temporary table is for storing non-permanent, transitory data.
This could be something like a step in a complex ETL
process storing the result of a query for downstream
processing.
A temporary table persists for the duration of a session.
Once a session is over, the table data is completely purged
from the system.
Also, a temporary table cannot be converted to
another type of table.
Temporary tables have a maximum Time Travel retention period of one day, regardless of the Snowflake edition.
They also don't have a fail-safe period. Once a session is
over, the table data cannot be recovered by Snowflake.
Transient tables
Transient tables are similar to permanent tables in that they exist until explicitly dropped.
However, they differ in that they have a maximum Time Travel retention period of one day, regardless of the Snowflake edition, and do not have a Fail-safe period.
Data retained for Time Travel and Fail-safe contributes to storage costs, so this is one reason we might use a transient table over a permanent table.
External tables
They provide the ability to query data which resides
outside of Snowflake.
For example, data stored in an Azure container. You can
overlay a structure onto this data and Snowflake will record
metadata about the files in the container.
External tables are read-only, and are generally slower to
return a result when queried, and external tables do not
support time travel or fail-safe.
To understand which of these four types a table is, you can either execute a SHOW TABLES query and check the KIND column, or check the object browser in the UI, where each table type has its own icon. The create statements for the different table types are sketched below.
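A hedged sketch of the create statements for each type (table names and columns are assumptions; the external table also needs an existing external stage to point at):

CREATE TABLE ORDERS (ID NUMBER, AMOUNT FLOAT);            -- permanent (the default)
CREATE TEMPORARY TABLE STAGING_ORDERS (ID NUMBER, AMOUNT FLOAT);
CREATE TRANSIENT TABLE RAW_EVENTS (PAYLOAD VARIANT);

-- External table over files in an external stage
CREATE EXTERNAL TABLE EXT_EVENTS
  LOCATION = @MY_EXTERNAL_STAGE
  FILE_FORMAT = (TYPE = PARQUET);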
Snowflake views
Okay, so Snowflake views also come in several flavors.

standard view
It's an object that stores a select query definition, not any
data.
Here's some sample code creating a view.
CREATE VIEW MY_VIEW AS SELECT COL1, COL2 FROM
MY_TABLE;
You give it a name and define a query. The query
references a source table, and when a query is executed
against the view, the data is retrieved from the source
table.
We can use a view, much like a table, they can be joined to
tables or other views, you can reference them in
subqueries, and you can use "order by", "group by", and
"where" clauses with a view.
Because they don't store any data, standard views do not
contribute to storage costs.
And if the source table for a view is dropped, querying the
view returns an "object does not exist" error.
One of the functions of a standard view is to restrict the
contents of a table, revealing only a subset of the columns
of a table or subset of rows.

materialized view.
A materialized view also stores a query definition, but unlike a standard view, the result of that query is actively maintained and stored.
Snowflake calls this a pre-computed dataset.
You might see this phrasing in the exam.
Snowflake charges compute for the background process
that periodically updates the view with the latest results of
the defined query.
This is why the materialized view is known as a
serverless feature.
It doesn't make use of the user managed virtual
warehouses to keep itself up-to-date.
There's also additional storage costs to store the results of
the view.
And lastly, for materialized views, they can be used
to boost the performance of external tables.
Secure View

Both standard and materialized views can be designated as secure by adding the keyword SECURE to the view definition, as sketched below.
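A sketch of both variants (view and table names are assumptions):

CREATE SECURE VIEW MY_SECURE_VIEW AS
  SELECT COL1, COL2 FROM MY_TABLE;

CREATE SECURE MATERIALIZED VIEW MY_SECURE_MV AS
  SELECT COL1, COL2 FROM MY_TABLE;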
With a secure view, the query definition is only
visible to authorized users.
Commands like "get DDL" and "describe" will not return a
view definition to unauthorized users.
Some optimizations the Snowflake query optimizer performs may inadvertently expose data that is ordinarily hidden from users.
Designating a view as secure bypasses these optimizations. It's a way to be more certain that sensitive data is not shown to the wrong audience.

However, a secure view might, under certain circumstances, have worse performance, as it doesn't make use of these optimizations.
As with table types, we can check the view type by executing the SHOW VIEWS and SHOW MATERIALIZED VIEWS commands, or by identifying it in the UI via the icon in the object browser.
User Defined Functions (UDFs)

User-defined functions (UDFs) are schema-level objects that enable users to write their own functions in several languages:
• SQL
• JavaScript
• Python
• Java

UDFs accept 0 or more parameters.


UDFs can return scalar or tabular results (UDTF).
UDFs can be called as part of a SQL statement.
UDFs can be overloaded.
User-defined function variations
You can write a UDF in one of several variations, depending
on the input and output requirements your function must
meet.
Variations:
• User-defined function (UDF): also known as a scalar function; returns one output row for each input row, where the returned row consists of a single column/value.
• User-defined aggregate function (UDAF): operates on values across multiple rows to perform mathematical calculations such as sum, average, counting, finding minimum or maximum values, standard deviation, and estimation, as well as some non-mathematical operations.
• User-defined table function (UDTF): returns a tabular value for each input row.
• Vectorized user-defined function (vectorized UDF): receives batches of input rows as Pandas DataFrames and returns batches of results as Pandas arrays or Series.
• Vectorized user-defined table function (vectorized UDTF): receives batches of input rows as Pandas DataFrames and returns tabular results.
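A sketch of creating a simple SQL UDF matching the SELECT call shown just below (the function body is an assumption based on the name):

CREATE FUNCTION AREA_OF_CIRCLE(RADIUS FLOAT)
  RETURNS FLOAT
  AS
  $$
    PI() * RADIUS * RADIUS
  $$;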

SELECT AREA_OF_CIRCLE(col1) FROM MY_TABLE;

As the CREATE statement shows, all UDFs, regardless of their language, are composed of a function name, input parameters (of which there can be zero or more), the return type, and the function definition, delimited by a pair of dollar signs.
The result of a UDF can either be scalar or tabular.
A scalar function returns one output row.
The returned row consists of a single column value.
The example shown is scalar,
as it returns a single column and a single row.
A tabular function, also called a table function, or UDTF,
returns zero, one, or multiple rows.
Let's amend our code example to turn it into a tabular
function.
We do this by specifying a return clause that contains the
TABLE keyword and specifies the names and data types of
the columns in the table result.
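A hedged sketch of that amendment (the table, column names, and logic are assumptions):

CREATE FUNCTION ORDERS_FOR_CUSTOMER(CUST_ID NUMBER)
  RETURNS TABLE (ORDER_ID NUMBER, ORDER_TOTAL FLOAT)
  AS
  $$
    SELECT ORDER_ID, ORDER_TOTAL
    FROM ORDERS
    WHERE CUSTOMER_ID = CUST_ID
  $$;

-- A UDTF is queried in the FROM clause with the TABLE keyword
SELECT * FROM TABLE(ORDERS_FOR_CUSTOMER(42));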
And UDFs can be called as part of a SQL statement, as you
can see in the code example.
This will become more relevant when we contrast UDFs and
stored procedures, as stored procedures can't be used as
part of a SQL statement.
And lastly for this slide, UDFs can be overloaded, meaning
we can create multiple functions with the same name if
their input parameters are different.
JavaScript UDF

Let's take a quick look at implementations of JavaScript and Java UDFs and see how they differ from SQL UDFs.
One of the first differences to point out is the LANGUAGE
parameter.
With this, you can designate a UDF as expecting JavaScript
code.
SQL UDFs don't need to include the LANGUAGE property, as
it's the default language.
You can make use of JavaScript's high-level programming
features such as branching and looping, error handling, and
in-built functions through the standard JavaScript library.
However, it cannot include or call additional libraries from
within the code.
And unlike SQL UDFs, JavaScript UDFs can refer to
themselves recursively or from within their own code,
enabling some interesting use cases.
It's also important to consider the data types of the input
parameters to the function. Because Snowflake and
JavaScript have different data types, when passing
between the two environments, the data types have to be
mapped.
For example, both JavaScript and Snowflake support
strings,
so they're transferred as is.
However, JavaScript doesn't have an integer data type, so
all numbers passed to a JavaScript function are represented
as doubles in the code.
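As an illustrative sketch of a JavaScript UDF (the function itself is an assumption), note the LANGUAGE parameter and that the numeric input arrives as a double and is referenced in uppercase inside the JavaScript body:

CREATE FUNCTION JS_FACTORIAL(D DOUBLE)
  RETURNS DOUBLE
  LANGUAGE JAVASCRIPT
  AS
  $$
    // Parameters are referenced in uppercase inside the JavaScript body
    if (D <= 0) {
      return 1;
    }
    var result = 1;
    for (var i = 2; i <= D; i++) {
      result = result * i;
    }
    return result;
  $$;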
Java UDF
The Java UDF, like the Python UDF, is still in preview mode, so it's unlikely to come up in the certification exam. But let's quickly take a look at a Java UDF CREATE statement.
You can see the LANGUAGE option specified as JAVA.
Behind the scenes, Snowflake will spin up a JVM, a Java
virtual machine, to execute the code specified in the
function body.
And Snowflake currently supports writing UDFs in Java
versions 8, 9, 10, and 11.
Similar to JavaScript, Snowflake restricts access to libraries outside the standard Java libraries.
One interesting feature of Java UDFs is that they can
specify their definition as either inline Java code, so we can
write it out during the creation of the function, as we see
on the left, or pre-compiled code in the form of a JAR file.
You'll notice in the CREATE statement two additional parameters, HANDLER and TARGET_PATH.
HANDLER specifies the class and method you'd like to execute, and TARGET_PATH specifies the location of a JAR file.
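A hedged sketch of an inline Java UDF (class, method, and function names are assumptions):

CREATE FUNCTION ECHO_VARCHAR(X VARCHAR)
  RETURNS VARCHAR
  LANGUAGE JAVA
  HANDLER = 'TestFunc.echoVarchar'   -- the class and method to execute
  AS
  $$
  class TestFunc {
    public static String echoVarchar(String x) {
      return x;
    }
  }
  $$;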
One limitation of Java UDFs is that they cannot be designated as secure, whereas SQL and JavaScript UDFs can.
External functions
An external function is a user-defined function which calls
code that is maintained and executed outside of Snowflake.

From our perspective as users, an external UDF looks and functions just like a regular UDF.
It's quite a powerful feature and addresses some of the
limitations of the internal UDFs.
You can use various different languages, such as Go or C#,
and reference third-party libraries.
To give you a concrete example,
we could call an AWS Lambda function in a separate AWS
account that does something like calculate the sentiment
of a string.
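A hedged sketch of what that external function might look like (the function name, integration name, and endpoint URL are placeholders):

CREATE EXTERNAL FUNCTION CALCULATE_SENTIMENT(INPUT_TEXT VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = MY_API_INTEGRATION
  AS 'https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/sentiment';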
We have some familiar parts like the function name, the
function parameters, and the return type.
Next here, we have an object called an API_INTEGRATION.
This object is configured with the required information to
authenticate with the cloud provider hosting the function
code, AWS, for example, and attached to an external
function.
The URL specified here is for the proxy service.
This is a service that sits between Snowflake and the code
stored externally.
In AWS, this will be something like the API Gateway Service.
They handle the tasks involved in accepting and processing
API calls.
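A hedged sketch of the integration object itself (the role ARN and URL prefix are placeholders):

CREATE API INTEGRATION MY_API_INTEGRATION
  API_PROVIDER = AWS_API_GATEWAY
  API_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/my-external-function-role'
  API_ALLOWED_PREFIXES = ('https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;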
This code example shows a CREATE statement for an API
INTEGRATION object. It holds information about which cloud
provider we're integrating with.
In this case, we're using AWS.
We also store the external AWS role we'd like Snowflake to
be able to assume, which has attached to it certain
privileges.
And finally here, we can restrict which APIs we have access
to through the proxy service.
Let's now run through the lifecycle of an external UDF call.
An external function is called from a SQL statement.
For example, here we're calling the calculate sentiment
UDF.
The external function will reference the security credentials
in the API_INTEGRATION object when making a call to the
proxy service. The proxy service acts as a middleman. It
relays requests between Snowflake and the remote service.
The remote service is another name for the remotely-
executed code.
In our example here, we're using AWS, so the remote
service will be an AWS Lambda function.
However, it could be Azure's Function service, GCP Cloud
Functions, or a standalone web API.
The result of the remote service is then passed back via the
proxy service to Snowflake and then to the user.
Looking at this flow, we might wonder why we have something sitting between Snowflake and the remote service. It primarily serves as an additional layer of security.

Okay, let's run through some limitations of external functions.
Because Snowflake can't see the code of the remote
service, the query optimizer might not be able to perform
optimizations it would on internal functions, which could
make them execute slower.
Also, by virtue of being remote, potentially in a different
region, some latency could be added.
Currently, external functions can only be scalar, returning a
single value for each input row.
External functions cannot be shared with other accounts
using secure data sharing.
The use of external functions can raise additional security
concerns.
For example, if you call a function which makes use of a
third-party library, that library could potentially store
sensitive data outside of Snowflake.
In some situations, Snowflake can charge for data moved
to a different cloud platform or region.

Stored Procedures

In the world of relational database management systems,


stored procedures were named collections of SQL
statements, often containing procedural logic.
They allowed us to bundle together or modularize
commonly executed queries that generally performed
administrative tasks.
Here we could introduce stored procedures. We lift all the
code we routinely execute and wrap it in a CREATE
PROCEDURE statement.
With this object created, any authorized user simply needs
to invoke this procedure and it will go away and execute all
the statements configured in its body.
Although the general principle is the same, in Snowflake we can implement stored procedures in a few different ways, and Snowflake has included some enhancements over past systems that we'll walk through.
There are three methods of implementing stored procedures in Snowflake:
 using JavaScript,
 using Snowflake Scripting, which is just SQL with support for procedural logic,
 or using Snowpark, which allows us to create stored procedures in Python, Java, and Scala.

Stored procedures are database objects, meaning they're created in a specific database and schema.
Stored Procedure: JavaScript

Stored procedures can take zero or more input parameters.


We call this part the signature.
Next, we specify a return data type.
And although specifying a return type is mandatory, in the
code body of a stored procedure, it's optional to return a
value.
In fact, it's customary for stored procedures not to return
anything or only return a warning or error code.
For a JavaScript stored procedure, you can only return a
scalar value.
But in the case of the Snowflake Scripting method, where
you'd set the language to SQL, you can return tabular data.
And stored procedures have a unique privilege model: they can execute with either the privileges of the role that created the procedure (owner's rights) or the privileges of the role calling the procedure (caller's rights).
We now get to the body of the stored procedure, delimited
here by two dollar signs.
This holds the JavaScript code we'd like the stored
procedure to execute when invoked.
In it, we can use higher-level programming features, like
branching, looping and error handling.
For now, we'll keep things simple and on our first line, take
our input parameter and store it as a JavaScript variable we
can use later on.
One notable aspect unique to stored procedures in
Snowflake is that they can mix JavaScript and SQL in their
definition. In it, we can dynamically create SQL commands
and then execute them using Snowflake's JavaScript API.
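Putting those pieces together, here's a hedged sketch of a JavaScript stored procedure (the procedure name and logic are assumptions):

CREATE OR REPLACE PROCEDURE COUNT_ROWS(TABLE_NAME VARCHAR)
  RETURNS FLOAT
  LANGUAGE JAVASCRIPT
  AS
  $$
    // Take the input parameter and store it as a JavaScript variable
    var table_name = TABLE_NAME;
    // Dynamically build a SQL command and execute it via the JavaScript API
    var rs = snowflake.execute({ sqlText: "SELECT COUNT(*) FROM " + table_name });
    rs.next();
    return rs.getColumnValue(1);
  $$;

CALL COUNT_ROWS('MY_TABLE');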
Stored procedures are invoked using the CALL keyword. A procedure is called as an independent statement, not as part of a SQL statement, and a single executable statement can only call one stored procedure, as in the code example above.

Only UDFs can be called as part of a SQL statement.


For example, when selecting columns.
Both UDFs and procedures can be overloaded.
Both UDFs and procedures can take zero or more input
arguments.
Procedures can make use of the JavaScript API, allowing us
to combine SQL and JavaScript, whereas in UDFs, we can't
do this.
Stored procedures don't necessarily have to return a value,
whereas UDFs always do.
And the value returned by a stored procedure, unlike the
value returned by a function, cannot be used directly in
SQL.
And lastly, not all UDFs can refer to themselves recursively,
whereas stored procedures can.
Sequences
A sequence is a schema-level object commonly found in many SQL databases. In Snowflake, it's used to generate sequential, unique numbers automatically.
A common use case for this is to increment something like
an employee ID or a transaction ID.
Sequences cannot guarantee their values will be gap free.

Values generated by a sequence are globally unique.


In practice, this means that even if two queries make a call
to a sequence at the same time, they would not return the
same number.
A sequence can also be used to supply the default value for a table column, as sketched below.
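A short sketch (sequence, table, and column names are assumptions):

CREATE SEQUENCE MY_SEQ START = 1 INCREMENT = 1;

-- Call the sequence directly
SELECT MY_SEQ.NEXTVAL;

-- Or use it as the default value for a table column
CREATE TABLE EMPLOYEES (
  EMPLOYEE_ID NUMBER DEFAULT MY_SEQ.NEXTVAL,
  NAME        VARCHAR
);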
Tasks & Streams

Task and Stream objects are often referenced together. When combined, they provide a method to continuously process new or changed records in a table. Tasks and Streams can also be used independently.
Task
A Task is simply an object used to schedule the execution of a SQL statement, a stored procedure, or some procedural logic using Snowflake Scripting.
You can use a Task to periodically copy some data, execute
a maintenance type routine, like clearing down some old
tables or periodically populating a reporting table.
Any use case that requires a statement to run on a
schedule is suited for a Task.
To get up and running with Tasks, you'll first need access to the ACCOUNTADMIN role or a custom role with the CREATE TASK privilege assigned to it.
The second thing we need to do is issue a CREATE TASK command like the one sketched below.
It's configured with a Task identifier unique in the schema and a virtual warehouse the Task will use for its execution. Optionally, we can omit the warehouse parameter to use Snowflake-managed compute resources instead of a user-managed warehouse; however, using the serverless model comes with limitations.

Thirdly, we have a triggering mechanism. This can be an interval in minutes, as shown, or a CRON expression.
You can also trigger a Task to execute after another Task is
completed, in which case you won't need to specify the
SCHEDULE parameter.
And lastly, we have the command we'd like to execute. In
this case, we're performing a COPY INTO statement, which
will copy the contents of a stage into a table every half an
hour.
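A hedged sketch of such a Task (warehouse, stage, and table names are assumptions):

CREATE TASK COPY_SALES_DATA
  WAREHOUSE = MY_WH
  SCHEDULE = '30 MINUTE'
  AS
  COPY INTO SALES
  FROM @SALES_STAGE
  FILE_FORMAT = (TYPE = 'CSV');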
Note that creating a Task doesn't automatically start it. To kick it off, we must issue an ALTER TASK ... RESUME command.
This requires a few different privileges.
Firstly, we need a global privilege, EXECUTE TASK.
By default only the ACCOUNTADMIN has this privilege.
We also need either the ownership or operate privilege on
the Task object itself.
If you run this command, it will start the timer specified in
the Task configuration. So if you put 30 minutes, in 30
minutes from running the RESUME command, the first
execution of the Task will begin.
You can also pause a Task by issuing the opposite
command, ALTER TASK SUSPEND.
And Tasks can be chained together to form what is called a
DAG or directed acyclic graph.
In this example, we have four tasks.
T1 is our root task, which defines a schedule like a
standalone Task, specifying when a run of a DAG should
start.
T2 and T3 are child tasks. These are triggered only after
the successful completion of the root task, and therefore do
not define a schedule in their DDL.
We can then add further child tasks where the Tasks flow in
a single direction.
T4 is another child task. However, it has two Task
dependencies. It would have to wait for both dependencies
to successfully complete before executing itself.
A DAG can be composed of a maximum of 1,000 Tasks, and one Task can only be linked to 100 other Tasks. So a child task can reference at most 100 predecessor Tasks, and a parent task can have at most 100 child Tasks.
It's also worth pointing out that all Tasks in a DAG must have the same Task owner and must be stored in the same database and schema.
Here's a code snippet of how we would define a child task
with two dependencies.

It uses the keyword AFTER, followed by the parent task names separated by commas.
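A hedged sketch of T4 from the example above (the warehouse and the statement it runs are assumptions):

CREATE TASK T4
  WAREHOUSE = MY_WH
  AFTER T2, T3
  AS
  INSERT INTO REPORTING_TABLE SELECT * FROM STAGING_TABLE;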

Stream
A Stream is a schema level object, which allows you to view
the inserts, updates, and deletes made to a table between
two points in time.
A stream is an object created to view & track DML changes
to a source table – inserts, updates & deletes.
The code example below shows the create statement for a Stream: you create the Stream on top of a table, and the Stream itself is a queryable object.
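A minimal sketch, assuming a source table called transactions:

CREATE STREAM transactions_stream ON TABLE transactions;

SELECT * FROM transactions_stream;   -- returns only the changed rows, plus metadata columns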
When querying a Stream, the output will have an identical
structure to the base table defined during its creation, but
will only contain the changed records, not all the records in
that table.
If we run a SELECT * on a Stream, the output also contains three additional metadata columns, which give us some information about each change.
We have the METADATA$ACTION column, which indicates
whether the DML operation was an INSERT or a DELETE.
Next we have the METADATA$ISUPDATE column, which
indicates whether the operation in the action column was
part of an UPDATE statement.
Updates to rows in the source table are represented as a
pair of DELETE and INSERT records in the Stream with the
ISUPDATE column set to true.
And lastly is the METADATA$ROW_ID column. This specifies
the unique ID for a row. We can use it to track changes to a
specific row over time.
Let's walk through an example. To start, let's say we insert 10 records into a table, creating table version one.
I then create a Stream on top of that table.
The Stream object itself will store something called an offset.
This marks the point from which changes in the base table
are recorded.
In our case, it started to record any changes made to the
table after version one.
If we were to select from the Stream at this point, the
Stream would be empty. This is because Streams only show
changes made after the offset.
Any changes prior to the offset will not appear in the
Stream.
So now let's update two rows in the table. If we then follow
this with a SELECT on the Stream, we can now see it
started to show our changes.
And as we perform more changes to the base table, more
changes will accumulate in the Stream, which we can query
like a table along with the metadata.
For example, this code snippet shows us inserting the
changes made to the source table into a downstream table.
This progresses the offset, and here we can tie Streams to
Tasks.
We can get a Task to execute only when a Stream has data
in it.
Using the system function SYSTEM$STREAM_HAS_DATA in the Task's WHEN clause, which returns a Boolean, the Task will only execute at its scheduled time if the Stream contains change records.
We can configure a query in the Task to insert from the
Stream into a downstream table, thereby consuming the
changes of the source table
and progressing the offset, so the next time the Task runs,
we only get new changes.
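Tying Streams and Tasks together, a minimal sketch might look like this (it assumes a downstream table, transactions_changes, whose columns mirror the Stream's output):

CREATE TASK consume_transactions_task
  WAREHOUSE = etl_wh
  SCHEDULE = '30 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('transactions_stream')   -- skip the run if there are no new changes
AS
  INSERT INTO transactions_changes
  SELECT * FROM transactions_stream;                   -- consuming the Stream advances its offset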
Billing
When it comes to billing, the first thing to highlight is the two purchasing plans Snowflake offers: On Demand and Pre-purchased Capacity.
An On Demand account is similar to the pay-as-you-go pricing plans of other cloud providers such as AWS, where you only pay for the resources you use.
At the end of the month an invoice is created with details of usage for that month. There is a $25 minimum for every month.
Using this method we wouldn't have to enter any long-term
licensing agreements with Snowflake.
With the pre-purchased capacity option, a customer can
purchase a set dollar amount of Snowflake resources
upfront.
The major advantage of going with this plan is that the pre-
purchase rates will be quite a bit lower than on demand.
So if you know you're going to use at least a hundred hours
of compute in a month, this might be a good idea to save
some money.
If your usage pattern is a lot more variable, the on demand
option makes more sense.
From this point on all credits and pricing reflect the on
demand costs, but bear in mind prices will be generally
lower if the same activities were to be performed with a
capacity plan.
There are five main areas in which a Snowflake account is billed.
1. Virtual warehouse services: the cost of the individual customer-managed virtual warehouses a user creates and executes queries with.
2. Cloud services: operations that don't make use of user-managed virtual warehouses, but nevertheless cost Snowflake something to compute and execute. Metadata operations, such as creating tables or executing a SHOW or DESCRIBE command, are examples of commands that don't need an active virtual warehouse.
3. Serverless features: these are billed slightly differently. For this class of services, Snowflake manages the compute resources instead of users creating virtual warehouses. For example, the Snowpipe feature will automatically ingest files as they appear in a stage. It does not require a customer-managed virtual warehouse; instead, Snowflake will spin up some compute behind the scenes and itself manage the scaling, resizing, and duration of that compute.
4. Data storage: this could be a temporary holding location like a stage, or long-term storage like a table.
5. Data transfer: transferring data out of Snowflake, or between Snowflake accounts, if the destination is in a different region or cloud provider.
Services that use compute in one form or another,
virtual warehouse services, cloud services, and
serverless features have their cost calculated using
something called a Snowflake credit.
Whereas storage and data transfer are billed using
direct currency.
For example, a certain number of terabytes will cost a
specific amount of dollars.
So what are Snowflake credits?
They're billing units to measure compute resource
consumption.
The more compute you use, the more credits it'll cost.
A credit represents a monetary value based on which cloud
provider, region, and Snowflake edition your account is
deployed as.
For example, if I had a Standard edition account on AWS in the London region, I would be paying a different price per credit to a Business Critical account on Azure deployed into Canada Central.
For each service shown here there is a different method of
calculating the amount of credits you'll be charged for
using them.
The simplest one to rationalize about is probably virtual
warehouse services.
There are two key factors that contribute to the calculation
of credits.
The first is the virtual warehouse size. Virtual warehouses come in several sizes, and each size has a different hourly credit rate. For example, the smallest size, extra small, if run for one hour, would cost one Snowflake credit.
The second is the duration the virtual warehouse is in the started state.
Credit consumption is calculated on a per second basis
while a virtual warehouse is active.
When a warehouse is suspended, it doesn't consume
credits. This is quite an important point to remember.
Virtual warehouse cost isn't based on the number or
complexity of queries issued to it, but whether it's in the
started or suspended state.
Virtual warehouses have a minimum billable period of 60
seconds.
So if you were to start a virtual warehouse for 30 seconds
and then shut it down, you'd still be charged for a minute.
For any time after 60 seconds, you'll be billed on a per
second basis.
So tying this all together, if you were to run an extra small
virtual warehouse for two and a half hours, it would cost
you 2.5 credits.
And if in your cloud provider and region a credit costs $4,
the total cost would be $10.
Okay, let's take a look at cloud services.
Queries that make use of cloud services to return a result
are billed at a higher rate of 4.4 credits per compute hour.
An example of a query that makes use of cloud services
might be something like a create table command.
This doesn't use a virtual warehouse; it only requires cloud services compute.
Initially 4.4 credits per hour might seem comparatively
high, but it's important to know that queries that make use
of cloud services are generally very quick to complete.
There's another slightly more complex aspect of cloud
services billing that keeps its cost down.
Only cloud services usage that exceeds 10% of the daily
usage of virtual warehouse compute is charged.
Let's understand this with an example.
If you consume 16 credits of virtual warehouse compute by
running a large warehouse for two hours in a day and only
1.1 credits of cloud services compute, you wouldn't have to
pay for the cloud services compute.
This is because the 1.1 credits for the cloud services
compute is less than 10% of the 16 credits of virtual
warehouse compute.
This 10% calculation is called the cloud services
adjustment and is calculated daily.
Each serverless feature such as clustering, Snowpipe and
database replication has its own credit rate per compute
hour.

Serverless features make use of both compute and cloud services in the background, and each type of compute has its own rate.
For example, the compute required to keep a materialized
view up to date is 10 Snowflake credits per hour.
And the cloud services required to keep a materialized view
up to date is 5 Snowflake credits per hour.
And lastly, the cloud services adjustment does not apply to
serverless features.
Data Storage & Transfer Billing Overview

Unlike compute services, the cost is calculated with a flat rate instead of through a Snowflake credit.
Data storage billing is calculated monthly based on the
daily average of on disk bytes for all data stored in your
Snowflake account.
The bulk of this will be from database tables, including the
data required for time travel and fail safe features.
However, it also includes files stored in the Snowflake
managed stages used for data loading and unloading.
Files can be compressed in a stage to save on space, and
those monthly costs for storing data in Snowflake are
based on a flat rate per terabyte.
The dollar amount charged per terabyte depends on whether your account is capacity or on demand, as well as the cloud provider and region.
Snowflake also charges a per byte fee if you transfer data
from one region into cloud storage in another region, or if
you transfer data from one cloud platform to another.
The exact dollar amount you're charged per byte depends on the region and cloud platform your account is hosted in.
For example, if I executed the Snowflake copy into location
command unloading data from my account hosted in the
AWS London region into Azure blob storage in the Tokyo
region, I would pay an additional cost when compared with
doing the same operation, but to a bucket in the same
cloud provider and region.
Data transfer charges also apply when replicating data to a
Snowflake account in a region or cloud platform different
from where your primary Snowflake account is hosted.
And lastly, because external functions process data outside
of Snowflake, you may also incur some data transfer
charges using them.

SnowCD
SnowCD (Snowflake Connectivity Diagnostic Tool) is a tool that helps diagnose and troubleshoot network connection issues between your system and Snowflake.
Why Use SnowCD?
 To verify network configuration for Snowflake access.
 As part of automated deployment scripts.
 As a prerequisite check before deploying a service that
connects to Snowflake.
 For environment checks while starting a new machine.
 For periodic checks on running machines.

How Does SnowCD Work?
 Preparation:
You need to retrieve a list of allowed hostnames and ports from Snowflake to use with SnowCD.
- Connect to Snowflake through the web interface.
- Run the SYSTEM$ALLOWLIST() or SYSTEM$ALLOWLIST_PRIVATELINK() function, depending on your private connectivity setup.
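A minimal sketch of the preparation step (the output file name is illustrative; the snowcd executable is then run from a terminal against that file):

-- Generate the allowlist of hostnames and ports that SnowCD should test
SELECT SYSTEM$ALLOWLIST();
-- Save the returned JSON to a file, e.g. allowlist.json,
-- then run SnowCD against it from a terminal: snowcd allowlist.json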

SnowCD Output
SnowCD communicates its results through messages
displayed on the console. Here’s what the messages mean:
 Success Message:
If all checks are valid, SnowCD reports the number of checks performed on the number of hosts, along with the message All checks passed.

 Error Message:
This message appears if you try to run SnowCD without
providing the allowlist file generated by the
SYSTEM$ALLOWLIST() function.
Troubleshooting with SnowCD
If SnowCD detects an issue during its checks, it will display
details about the failed check(s) along with a
troubleshooting suggestion to help you fix the problem. For
instance, the following output indicates an invalid
hostname:

In this example, SnowCD couldn't perform a DNS lookup for the hostname www.google1.com. The suggestion points towards a potential DNS configuration issue that you might need to address on your DNS server.
Remember:
 SnowCD doesn’t detect all network issues.
SnowSQL

SnowSQL is the command-line interface (CLI) client designed for connecting to the Snowflake platform. It allows users to execute SQL queries and carry out a wide range of DDL and DML operations. SnowSQL offers a robust interface, enabling direct control of Snowflake right from a terminal or command prompt.

What is SnowSQL?
SnowSQL, also known as the Snowflake CLI (command line
interface), enables connecting to Snowflake from the
command line to execute SQL statements and scripts, load
and unload data, manage databases & warehouses, and
perform a whole lot of other administrative tasks.
SnowSQL isn't just another SQL command-line client; it's
packed with features that make it stand out. Here are some
key features about SnowSQL:
 Available on Linux, Windows, and MacOS
 One-step installation process
 Provides an interactive shell for executing SQL
commands
 Supports batch mode execution
 Includes output formatting options for result sets
 Comes with command history, auto-completion and
syntax highlighting
 Allows configuration profiles to save connection details
 Supports variables for parameterizing SQL statements
 Integrated help (!help) commands for on-the-fly
commands assistance
Usage of SnowSQL
SnowSQL, as Snowflake's command-line client, offers a ton of functionality. Here are some of its primary usages:
1.Execute SQL statements and scripts
 Run queries directly in the CLI
 Execute scripts in batch mode
 Issue DDL commands like CREATE, ALTER, DROP
 Insert, update, delete data (DML)
 Call stored procedures and user-defined functions
(UDFs)
2. Load and unload data
 Use COPY INTO to load data from files
 PUT command to upload data from local
 GET command to download result sets
 COPY INTO <location> to unload data into a stage
3. Query monitoring and tuning
 See execution plans using EXPLAIN
 Monitor resource usage with query history
 Tune queries based on execution metrics
4. User and security management
 Switch roles to control privileges
 Grant and revoke user privileges
 Manage user accounts and passwords
5. Database administration
 Create, clone, undrop databases
 Execute DDL on schemas, tables
 Switch contexts with USE DB and USE SCHEMA
6. Warehouse management
 Create and resize virtual warehouses
 Suspend, resume, or drop warehouses
 Switch warehouses to control usage
7. Session management
 Establish connections and authenticate
 Use MFA, OAuth, and other auth methods
 Create multiple named connection profiles
 Disconnect or quit sessions
8. Command line productivity
 Command history and auto-complete
 Pipe results between commands
 Format output using options
 Export results to files

How to connect to Snowflake using SnowSQL

After installing SnowSQL, you can establish connections to Snowflake by providing the necessary credentials and parameters. Proper configuration of SnowSQL is essential for ensuring a secure and efficient connection.

There are several approaches:

1) Passing Credentials Directly
snowsql -a <account> -u <user> -p <password>
2) Selecting the Database and Schema in SnowSQL
Before running any queries, it's essential to specify the
database and schema you want to work with. Snowflake
provides a sample database named
snowflake_sample_data, which is an excellent resource for
practice and exploration.
To select this database in SnowSQL, use the USE DATABASE
command:
USE DATABASE snowflake_sample_data;
How to disconnect from SnowSQL
When you're working within SnowSQL and wish to end the
current Snowflake session, you have a couple of command
options. You can either type !quit or !exit to safely
disconnect.

Connectivity: Connectors, Drivers and Partnered Tools

Connectors and Drivers
So far, we've seen a couple of different ways to connect to
Snowflake through the UI and with the SnowSQL command
line tool.
These are great, but what if you want to develop
applications with connectivity to Snowflake or push some
data from a streaming technology like Kafka into a table?
Snowflake maintain open source drivers and connectors for a range of technologies.

If you want to programmatically connect and issue commands with Python,
Snowflake maintain a Python package in a public Git repo.
As well as Python, Snowflake natively support connectivity
with the programming languages Go, PHP, .NET and
Node.js.
There's also a Spark connector enabling a Spark cluster to
read data from and write data to Snowflake tables.
From Spark's perspective, Snowflake looks similar to any of
the other Spark data sources: Postgres, HDFS or S3.
Snowflake make available a Kafka connector, either for the
Confluent version of Kafka or the open source version of
Kafka.
It reads data from one or more Kafka topics and loads it
into a Snowflake table.
You can also connect with JDBC or ODBC, allowing third-
party tools not officially partnered with Snowflake to
connect.
With these interfaces, you can also use programming
languages, like Java or C++ to connect and issue
commands.
But let's take a quick look at some sample code for the
Python package to get an idea of how we connect and
issue commands.
pip install snowflake-connector-python==2.6.2

In our Python environment, we install the snowflake-connector-python package, optionally specifying a version. In the code itself, the first thing we do is import the package.
We then create a connection object, which takes as input
what we'd expect when authenticating:
our username, password, and account identifier.
From that connection object, we create something called a
cursor.
A cursor allows us to execute SQL statements and traverse
the results,
similar to many database connectors.
We then use the cursor object to execute a SELECT
statement and print one row from the output of that
command.
So you can see it's quite simple to get up and running.
There are many different tools you can use alongside
Snowflake.
Snowflake certify third-party solutions on the market to
ensure they natively connect and integrate.
These are called Snowflake technology partners.
There are many reasons why you'd want to combine
Snowflake with an additional tool.
A common use case is to hook Snowflake into visualization
software to leverage its advanced dashboarding and
reporting capabilities.
Using a partner tool, you get that extra level of assurance that it'll connect to Snowflake and safely and consistently execute commands, which is quite useful when going into a production environment.
The partner tools are usually broken down into five
categories:
 business intelligence,
 data integration,
 security and governance,
 SQL development and management,
 machine learning and data science.
Let's take a closer look at each.
Business Intelligence is a class of software used for data
analysis, commonly associated with producing
visualizations or dashboards.
Some popular examples that have native support with
Snowflake are Tableau, Power BI, QlikView and
ThoughtSpot.
Broadly speaking, data integration is the process of taking
data from a source system and putting it into a target, in
this case, Snowflake, and doing some level of data
transformation in the process.
There are many partner tools that have been designed
specifically to solve common data integration problems.
For example, dbt, Informatica, Pentaho, and Fivetran.
Security and governance is quite a broad category of tools. There are data governance tools such as Collibra, monitoring tools such as Datadog, HashiCorp Vault for storing sensitive Snowflake access data such as tokens, passwords, and encryption keys, and data.world for data cataloging.
Snowflake offer a feature called Partner Connect, which is an extension of the partner program, used to expedite connectivity with Snowflake.
Through the Snowflake UI, a user can easily create a trial
account with a selection of partner tools.
All the ones marked are available for Partner Connect.
For example, let's take Fivetran, which is a tool to extract
and load data.
Using a setup wizard inside your account on the Snowflake
UI we create a trial account for Fivetran, giving us a
username and login to Fivetran's online portal.
On the Snowflake side, the Partner Connect process will
create objects like a database, warehouse and roles
specifically for use with Fivetran.
There are some very specialized tools out there in the
realm of machine learning and data science.
Running data science workloads using Snowflake data is
simplified with certified tools like DataRobot, Dataiku,
Amazon SageMaker, and Zepl.
Our last category is SQL development and management.
There's a variety of trusted third-party tools to help with
managing the modeling, development and deployment of
SQL code, such as SqlDBM, SeekWell, and Agile Data
Engine.

Snowflake Scripting

Another way we can interact with our data is using Snowflake Scripting.
It's an extension to Snowflake SQL, adding support for
procedural logic.
It adds constructs like looping, branching, exceptions, and
declaring and assigning values to variables.
The procedural code can be implemented inside a stored
procedure or executed directly in a worksheet.

The code itself is written within a scripting block, whose basic structure is made up of three sections.
The first is called declare, in which you can define
variables,
cursors, result sets, and exceptions.
The second is begin and end.
Here, you write SQL statements and scripting constructs
like loops or branching.
The begin keyword is also used to start a transaction.
These are used to explicitly execute a group of commands
together.
Snowflake recommends starting a transaction with begin
transaction instead of just begin to avoid confusion
between scripting and transactions.
Okay, and lastly, we have exception.
Like in higher level programming languages, this is where
we would specify exception handling code.
In case we encounter an error, we'd like to know what to do
in response to it.
Let's look at an example to bring this to life.
This sample code is what is called an anonymous block,
meaning it's executed outside of a stored procedure.
In the declare section, we declare two variables as type
number.
Bear in mind that these variables can only be used within
the scope of this block.
It's also possible to include a normal SQL statement, to create a table, for example. And unlike a variable, that object could be accessed outside of this block.

SnowSQL and the Classic Console do not correctly parse Snowflake Scripting blocks; they need to be wrapped in string constant delimiters, like dollar signs.

So the begin section contains our assignments and the logic to implement Pythagoras' theorem.
We firstly assign a value to length A of the triangle with the
colon equals notation.
The next line shows how we can declare and assign a
variable in the begin section using the keyword let.
This also shows how we can infer the data type of the
assigned value, not needing to specify the data type
number.
We then have the logic, which is to perform the square root
of A squared plus B squared.
And then return the result using the keyword return.
Note that we're using functions here in our block.
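Since the sample code isn't reproduced here, a minimal sketch of the anonymous block being described might look like this (as written, it runs in Snowsight; in SnowSQL or the Classic Console the block would need to be wrapped in dollar signs):

DECLARE
  a NUMBER;
  c NUMBER;
BEGIN
  a := 3;                      -- assignment with the colon-equals notation
  LET b := 4;                  -- declare and assign in the begin section; the type is inferred
  c := SQRT(a * a + b * b);    -- Pythagoras' theorem
  RETURN c;                    -- returns 5
END;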

And this example shows how we implement the same procedural logic inside the body of a stored procedure.
Anything we can do in the anonymous block, we can do in
the stored procedure.
Containing our logic in a long-lived object like this is good
for sharing amongst users and making our code
repeatable.
It's worth bearing in mind, SnowSQL and the Classic Console can't parse scripting blocks without wrapping them in dollar signs. However, on Snowsight, we don't need these.
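A minimal sketch of wrapping the same logic in a stored procedure (the procedure name is illustrative; the dollar signs around the body are required in SnowSQL and the Classic Console, and optional in Snowsight):

CREATE OR REPLACE PROCEDURE pythagoras()
  RETURNS NUMBER
  LANGUAGE SQL
AS
$$
DECLARE
  a NUMBER;
  c NUMBER;
BEGIN
  a := 3;
  LET b := 4;
  c := SQRT(a * a + b * b);
  RETURN c;
END;
$$;

CALL pythagoras();   -- returns 5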
Branching Constructs

We can also use branching constructs, such as if-else statements and case statements.
You might be familiar with these from high-level programming languages. Here we have an anonymous block using an if-else statement to check whether an integer is odd or even by checking the remainder of a division by two. In this example, we don't include either a declare or an exception section; these are optional.
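A minimal sketch of such a block, using an illustrative integer value:

BEGIN
  LET n := 7;
  IF (MOD(n, 2) = 0) THEN
    RETURN 'even';
  ELSE
    RETURN 'odd';              -- returns 'odd' for n = 7
  END IF;
END;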
Looping Constructs
We have at our disposal the for loop, the while loop, something called repeat, and one simply called loop.
It's not required that we have a detailed knowledge of all of
these for the certification exam.
So let's take a look at the standard for loop.
Firstly, we declare two variables total and max num.
Max num represents the number of iterations of the for
loop we'd like to execute.
And the total will hold a running total. In the begin section,
the for loop specifies an initial value of one assigned to I.
And we want it to iterate or go through up to our max num,
which would be 10 in this case.
For each iteration, we do a small calculation to add the
current value of I to the existing total. This returns 55, the
sum of the numbers 1 through 10.
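A minimal sketch of the for loop being described:

DECLARE
  total   NUMBER DEFAULT 0;
  max_num NUMBER DEFAULT 10;
BEGIN
  FOR i IN 1 TO max_num DO
    total := total + i;        -- add the current value of i to the running total
  END FOR;
  RETURN total;                -- returns 55, the sum of the numbers 1 through 10
END;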
Cursor

We might also want to loop over records in a table. To do this, we'll have to use something called a cursor.
We create a cursor like a variable.
We give it a name and follow that with the words cursor for.
We then specify a query we'd like to access the results of.
Here we're looping through each record in cursor c1 and
tallying the total amount from a table of transactions.
After the for loop reaches the end of the cursor, so the final
row, we return the total.
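A minimal sketch, assuming a transactions table with an amount column:

DECLARE
  total NUMBER DEFAULT 0;
  c1 CURSOR FOR SELECT amount FROM transactions;
BEGIN
  FOR record IN c1 DO
    total := total + record.amount;   -- tally the amount of each row
  END FOR;
  RETURN total;
END;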
RESULTSET
The last thing I'd like to take a look at is somewhat similar to a cursor. It's called a result set.
A result set is a SQL data type that, like a cursor, defines a query.
However, a result set's query is executed when you assign the query to the result set variable, unlike a cursor, whose query is executed when the cursor is opened.
Hence, a result set can be thought of as holding the results of a query.
The results can be accessed in two ways.
The first, shown in the simple code example below, is to pass the result set to the TABLE() function, returning all or a subset of rows in the result set.
The second is to iterate over the result set with a cursor, similar to our previous example.
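A minimal sketch of the TABLE() approach, reusing the illustrative transactions table:

DECLARE
  res RESULTSET DEFAULT (SELECT account_id, amount FROM transactions);
BEGIN
  RETURN TABLE(res);           -- returns the query results as a table
END;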

The same can be done using a cursor, as sketched below.
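A minimal sketch, again assuming the illustrative transactions table; the LET ... CURSOR FOR syntax opens a cursor over the result set:

DECLARE
  total NUMBER DEFAULT 0;
  res RESULTSET DEFAULT (SELECT amount FROM transactions);
BEGIN
  LET c1 CURSOR FOR res;       -- iterate over the rows held by the result set
  FOR record IN c1 DO
    total := total + record.amount;
  END FOR;
  RETURN total;
END;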

Snowpark
Snowpark is an API accessed outside of the Snowflake
interface.
It was implemented as an alternative to SQL, allowing us to
query and process our data using high-level programming
languages currently supporting Java, Scala, and Python.
The API provides methods and classes, such as select, join, drop, and union, instead of constructing query strings and executing those, like we've seen with stored procedures and UDFs.
Its main abstraction is something called a DataFrame. It's a
data structure that organizes data into a two-dimensional
table of rows and columns, conceptually similar to a
spreadsheet.
We'll have a look at this shortly.
If you've used Apache Spark or the pandas library,
Snowpark DataFrames will be very familiar and quite
intuitive to use.
There's a couple of things to know regarding Snowpark's
computation model.
Snowpark operations are executed lazily, meaning an
operation is only executed when an action is requested
rather than when it's declared.
Snowpark also works on a push down model, meaning all
operations are performed using Snowflake compute.
No data is transferred to where you're executing the
Snowpark code or to another cluster for processing.
This is often contrasted with the Snowflake connector for
Spark, which actually moves data out of Snowflake into the
Spark cluster to process the data.
A large part of why Snowpark was created was to avoid
developers having to move data out of Snowflake to use
their preferred language.
Okay, let's step through some sample code line by line
using the Python Snowpark API.
Let's say we have a simple requirement to build a data
pipeline whose purpose is to flag transactions over a
certain value and produce a table summarizing our results
for reporting.
In this code snippet, we're importing the Snowpark
libraries. Next, we define a dictionary in which we store our
connection parameters, including our account, user,
password, role, warehouse, database, and schema.
Here we're using our operating system's environment
variables to populate the parameters, but we could just use
plaintext strings or do something like use the result of a
call to a secrets manager.
In this code snippet, we then pass the dictionary into our
Session.builder method. This allows us to establish a
connection with Snowflake and provides methods for
creating DataFrames. For example, on line 15, we're using
the session object to create a DataFrame from a table called
transactions. Along with the table method here, the session
object also has a method called sql.
This accepts a SQL query string, so we could issue a
SELECT command as another way to create a DataFrame.
Okay, let's try and understand
what DataFrames are with this next command. If I were to
print out our transactions DataFrame by following it with a
dot operator and then the collect function, we'd see a list of
row objects output to the console.
A DataFrame is a collection of row objects with their
columns defined by a schema.
Each row object contains the column name and its values,
and because DataFrames are lazily evaluated, we would
retrieve the results from Snowflake when something like a
collect function is called, not when we initially created the
DataFrame, like we did with the session object.
Now that we have a DataFrame, we can start to do our data
transformations.
In line 18, we're using the DataFrame method filter. This is
similar to a WHERE clause in SQL. We're using it to filter our
DataFrame on the amount column for transactions over
1,000.
We then assign the output of that operation to a new
DataFrame.
Many of the operations from SQL have their own
programming constructs.
For example, here we have a group by and count. We can
group by our account holders and count how many
transactions they have over 1,000.
On line 20, we filter again to check if there are greater or
equal to two transactions over 1,000.
We can chain together DataFrame operations by using a
dot operator, and then the next method.
So here, after filtering, we then rename the count column
to flagged_count.
This takes the filtered DataFrame as input. We can then write
that DataFrame to a new table, selecting a mode, such as
append or overwrite.
This will create a new table in the database we set in our
session config, and it will match the schema and data of
our flagged_transactions DataFrame.
Let's use the DataFrame method show to print out the
contents of this DataFrame.
It's similar to collect, but returns our result in a tabular
format.
And finally, on our last line here, we close our session,
ending our connection to Snowflake.
