What is Snowflake?
Data Science
Data scientists have historically faced difficulties preparing
data assets, building models, moving those models into
production, and actively monitoring their AI and machine
learning assets after deployment.
One key aspect of Snowflake which helps data scientists is
its centralized storage.
Data scientists typically spend most of their time finding,
retrieving, and cleaning data, and a minority on actually
building, training, and deploying their models.
Having all their data curated in one location really helps in
removing some of the data management roadblocks they
face.
Snowflake also has a partner ecosystem of third-party
vendors and tools that have been certified to integrate
with the Snowflake platform.
Data science tools like Amazon SageMaker, DataRobot, and
others integrate natively with Snowflake.
Data Sharing
Snowflake has also implemented some interesting
features to simplify the sharing of table data.
Secure data sharing is a feature that allows one Snowflake
account to privately share table data with another account
in the same region, which is quite cool and not possible on
many platforms.
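To make this concrete, here's a minimal sketch of how a provider account might share a table (the object and account names are hypothetical, not from the course):
CREATE SHARE SALES_SHARE;
GRANT USAGE ON DATABASE SALES_DB TO SHARE SALES_SHARE;
GRANT USAGE ON SCHEMA SALES_DB.PUBLIC TO SHARE SALES_SHARE;
GRANT SELECT ON TABLE SALES_DB.PUBLIC.ORDERS TO SHARE SALES_SHARE;
ALTER SHARE SALES_SHARE ADD ACCOUNTS = CONSUMER_ACCOUNT;
The consumer account can then create a database from the share and query the table without any data being copied.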
There's also an online marketplace accessible through the
Snowflake UI where we can produce and consume datasets.
And using the partner ecosystem we can expose our
curated data with select customers via BI tools, like Tableau
or Power BI.
Data Applications
Let's talk about data application development.
Snowflake maintained many connectors and drivers for
high-level languages, like Python, .NET, and Go, and
database connectors, like JDBC and ODBC, to help in
building data-intensive applications.
You can also extend the built-in functions with UDFs and
stored procedures, which can be written in Java, JavaScript,
Python, or SQL.
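As a simple illustration of a SQL UDF (the function name and logic are made up for this example):
CREATE FUNCTION AREA_OF_CIRCLE(RADIUS FLOAT)
RETURNS FLOAT
AS 'PI() * RADIUS * RADIUS';
Once created, it can be called in queries just like a built-in function, e.g. SELECT AREA_OF_CIRCLE(2.0);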
We can also make use of external UDFs to write custom
code, which resides outside of Snowflake on a cloud
platform service like AWS Lambda or Azure Functions.
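An external function declaration might look roughly like this, assuming an API integration object has already been set up to point at the remote service (the names and URL are placeholders):
CREATE EXTERNAL FUNCTION REMOTE_ECHO(INPUT VARCHAR)
RETURNS VARIANT
API_INTEGRATION = MY_API_INTEGRATION
AS 'https://<my-endpoint>/echo';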
Snowflake also provides a framework for building data
applications called Snowpark. This allows for programmatic
querying and processing of data using Java, Python, or Scala.
Cloud Native
Snowflake’s software is purpose-built for the cloud
Snowflake is a cloud-native solution. All the software that
allows Snowflake to handle the workloads we just went
through was purpose-built for the cloud.
It's not a retrofitted version of an existing technology like
Hadoop or Microsoft SQL Server.
The query engine, the storage file format, the metadata
store, the architecture generally was designed with cloud
infrastructure in mind.
Shared-disk architecture.
This was the first move away from single-node
architectures and works by scaling out nodes in a compute
cluster while keeping storage in a single location. In other
words, all data is accessible from all cluster nodes, and any
machine can read or write any portion of the data.
Shared nothing architecture
The dominant design in database systems for
high-performance storage and querying has become the
shared-nothing architecture, used by technologies like
Hadoop and Spark.
As the name indicates, the nodes in this architecture do not
share any hardware.
Instead of the nodes sharing one central data repository,
storage and compute are distributed across the machines
of the cluster, which are networked together, and each
node holds a subset of the data locally.
Storage Layer
The storage layer is really just the blob storage service of
the cloud provider you've deployed your Snowflake account
into.
Services Layer
The services layer is a collection of highly available and
scalable services that coordinate activities such as
authentication and query optimization across all Snowflake
accounts.
Similar to the underlying virtual warehouse resources, the
services layer also runs on cloud compute instances.
Services managed by this layer include:
• Authentication & Access Control
• Infrastructure Management
• Transaction Management
• Metadata Management
• Query parsing and optimization
• Security
So we, as users, don't strictly need to understand the inner
workings of how Snowflake manages things like
infrastructure.
Okay, so what is the services layer?
It's a collection of highly available and scalable services
that coordinate activities across all Snowflake accounts in
order to process user requests.
Think of activities like authentication or query optimization.
This is why you might hear it referred to as the global
services layer.
Having a global multi-tenancy model like this, instead of
creating an account-specific version of all the services
every time an account is requested, allows Snowflake to
achieve certain economies of scale, and also makes
implementation of some interesting features, like secure
data sharing, much simpler to achieve.
Behind the scenes,
the services run on cloud-based compute instances, much
like virtual warehouses. However, we have no control or
visibility into their creation or how they work.
So what services actually make up the services layer,
and what do they do?
Let's first take a look at authentication and access control.
This is about proving who you are and if you have valid
login credentials, as well as determining, once you're
logged into an account, what level of privileges you have to
perform certain actions.
We also have infrastructure management.
This service handles the creation and management of the
underlying cloud resources, such as the blob storage and
compute instances required for the storage and query
processing layers.
Next up, we have transaction management.
As mentioned briefly in the Virtual Warehouse section,
Snowflake is an ACID-compliant data warehouse, which
uses transactions to ensure, among other things, the data
is consistently accessible by all virtual warehouses.
Okay, the metadata management service keeps
information and statistics on objects and the data they
manage. The services layer also handles query parsing and
optimization. This service takes the SQL query we submit
and turns it into an actionable plan the virtual warehouses
can execute.
And lastly here is security. This is a broad category of
services, which handle things like data encryption and key
rotation.
Account
An account is the administrative name for a collection of
storage, compute and cloud services deployed and
managed entirely on a selected cloud platform.
When we use the word account in Snowflake, we're either
referring to the administrative name for the collection of
storage, compute, and cloud services, or an account object
itself, which is used to change account properties and
manage account level objects.
Each account is hosted on a single cloud provider, either
Amazon Web Services, Google Cloud Platform, or Microsoft
Azure.
And for each cloud provider, there are several regions an
account can be provisioned into.
An account resides in a single geographic region, as does
the data in that account, as there are regulatory
considerations for moving data between regions.
By default, Snowflake doesn't move data between
regions unless requested.
Each account is created with a single Snowflake edition;
however, the edition can be changed later on.
An account is created with the system-defined role
ACCOUNTADMIN.
By default, accounts contain a number of system-defined
roles. To configure account-level properties and manage
account-level objects like warehouses, we can use the
account admin role.
However, this role is quite powerful, so it should be
granted to users sparingly.
This allows us to enforce the security best practice of least
privilege.
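As a quick illustration (the user name here is hypothetical), an administrator might switch into the role and grant it to a single named user only:
USE ROLE ACCOUNTADMIN;
GRANT ROLE ACCOUNTADMIN TO USER LEAD_ADMIN;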
Account Regions
This is where your data would physically reside, and you'd
be subject to the regulatory requirements of that region.
The region in which your account is provisioned affects:
• the price of compute and storage,
• which regulatory certifications you can achieve,
• which Snowflake features you have access to,
• the network latency you'll experience if your account is in a different region from where you're connecting from.
Account URL
This URL uniquely identifies your account and is the
host name used to connect to the Snowflake service,
whether that's through the UI, the SnowSQL command line
tool, or a language connector like Python.
Whatever method of connectivity you use, with this URL,
you'll gain access to your remotely hosted account.
Your trial account URL is most likely composed of three
parts.
The account locator, the cloud services region ID, and the
cloud service provider.
All these components together form an account
identifier.
Depending on the region and cloud platform your account
is deployed into, your account identifier might only be
composed of an account locator or some mixture of all
three.
If the account setup is not done through an automated
provisioning process, like with the trial, but with someone
from Snowflake support, you can request a specific unique
account identifier.
If your account was created by a user with the org admin
role, the account identifier would look a bit different.
It would be composed of a unique organization name and
an account name set when the account is created.
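For illustration only (these values are hypothetical), a locator-style account URL might look like xy12345.us-east-2.aws.snowflakecomputing.com, while an organization-style identifier would produce a URL like myorg-myaccount.snowflakecomputing.com.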
Database
This is an important object.
It's the first logical container for your data. It groups
together schemas, which themselves hold schema level
objects such as tables and views.
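A minimal sketch of this hierarchy (the object names are illustrative):
CREATE DATABASE MY_DB;
CREATE SCHEMA MY_DB.MY_SCHEMA;
CREATE TABLE MY_DB.MY_SCHEMA.MY_TABLE (COL1 NUMBER, COL2 VARCHAR);
Notice the fully qualified name: database, then schema, then the schema-level object.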
Schema Objects
Schemas are a way to further segment a database.
One database can contain many schemas, and each
schema belongs to a single database.
A schema name must be unique within a database.
Like a database name, a schema name must start with an
alphabetic character and cannot contain spaces or special
characters unless the entire identifier string is enclosed in
double quotes.
Here we have the create statement for a schema.
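A minimal example (the names and parameters are illustrative, not the course's exact snippet):
CREATE SCHEMA MY_SCHEMA
DATA_RETENTION_TIME_IN_DAYS = 1
COMMENT = 'example schema';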
Standard View
It's an object that stores a select query definition, not any
data.
Here's some sample code creating a view.
CREATE VIEW MY_VIEW AS
SELECT COL1, COL2
FROM MY_TABLE;
You give it a name and define a query. The query
references a source table, and when a query is executed
against the view, the data is retrieved from the source
table.
We can use a view much like a table: views can be joined to
tables or other views, you can reference them in
subqueries, and you can use "order by", "group by", and
"where" clauses with a view.
Because they don't store any data, standard views do not
contribute to storage costs.
And if the source table for a view is dropped, querying the
view returns an "object does not exist" error.
One of the functions of a standard view is to restrict the
contents of a table, revealing only a subset of its columns
or rows.
Materialized View
A materialized view also stores a query definition, but
unlike a standard view, the result of that query is actively
maintained as stored data.
Snowflake calls this a pre-computed dataset.
You might see this phrasing in the exam.
Snowflake charges compute for the background process
that periodically updates the view with the latest results of
the defined query.
This is why the materialized view is known as a
serverless feature: it doesn't make use of user-managed
virtual warehouses to keep itself up-to-date.
There are also additional storage costs to store the results
of the view.
And lastly, for materialized views, they can be used
to boost the performance of external tables.
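A minimal sketch of creating one (the table and column names are illustrative):
CREATE MATERIALIZED VIEW MY_MAT_VIEW AS
SELECT COL1, SUM(AMOUNT) AS TOTAL_AMOUNT
FROM MY_TABLE
GROUP BY COL1;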
Secure View
Stored Procedures
Stream
A Stream is a schema level object, which allows you to view
the inserts, updates, and deletes made to a table between
two points in time.
A stream is an object created to view & track DML changes
to a source table – inserts, updates & deletes.
This code example shows the create statement for a
Stream.
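A minimal form of that statement (the stream and table names are illustrative):
CREATE STREAM MY_STREAM ON TABLE MY_TABLE;
Querying MY_STREAM then returns the changed rows along with metadata columns describing the type of change.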
SnowCD
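SnowCD (Snowflake Connectivity Diagnostic Tool) is a command-line tool that runs a series of connection checks to evaluate and troubleshoot your network connection to Snowflake.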
SnowCD Output
SnowCD communicates its results through messages
displayed on the console. Here’s what the messages mean:
Success Message:
If all checks are valid, SnowCD reports the number of
checks performed on the number of hosts, along with the
message "All checks passed".
Error Message:
This message appears if you try to run SnowCD without
providing the allowlist file generated by the
SYSTEM$ALLOWLIST() function.
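In practice (the file name is just an example), you would run SELECT SYSTEM$ALLOWLIST(); in Snowflake, save the JSON result to a file such as allowlist.json, and then pass that file to SnowCD from your terminal with snowcd allowlist.json.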
Troubleshooting with SnowCD
If SnowCD detects an issue during its checks, it will display
details about the failed check(s) along with a
troubleshooting suggestion to help you fix the problem, for
instance flagging an invalid hostname.
What is SnowSQL?
SnowSQL, also known as the Snowflake CLI (command line
interface), enables connecting to Snowflake from the
command line to execute SQL statements and scripts, load
and unload data, manage databases & warehouses, and
perform a whole lot of other administrative tasks.
SnowSQL isn't just another SQL command-line client; it's
packed with features that make it stand out. Here are some
key features of SnowSQL:
• Available on Linux, Windows, and macOS
• One-step installation process
• Provides an interactive shell for executing SQL commands
• Supports batch mode execution
• Includes output formatting options for result sets
• Comes with command history, auto-completion, and syntax highlighting
• Allows configuration profiles to save connection details
• Supports variables for parameterizing SQL statements
• Integrated help (!help) command for on-the-fly assistance
Usage of SnowSQL
SnowSQL, as Snowflake's command-line client, offers
a ton of functionality. Here are some of its primary uses:
1. Execute SQL statements and scripts
• Run queries directly in the CLI
• Execute scripts in batch mode
• Issue DDL commands like CREATE, ALTER, DROP
• Insert, update, and delete data (DML)
• Call stored procedures and user-defined functions (UDFs)
2. Load and unload data
• Use COPY INTO <table> to load data from staged files
• PUT command to upload local files to a stage
• GET command to download files from a stage
• COPY INTO <location> to unload data to a stage
3. Query monitoring and tuning
• See execution plans using EXPLAIN
• Monitor resource usage with query history
• Tune queries based on execution metrics
4. User and security management
• Switch roles to control privileges
• Grant and revoke privileges
• Manage user accounts and passwords
5. Database administration
• Create, clone, and undrop databases
• Execute DDL on schemas and tables
• Switch contexts with USE DATABASE and USE SCHEMA
6. Warehouse management
• Create and resize virtual warehouses
• Suspend, resume, or drop warehouses
• Switch warehouses to control usage
7. Session management
• Establish connections and authenticate
• Use MFA, OAuth, and other authentication methods
• Create multiple named connection profiles
• Disconnect or quit sessions
8. Command-line productivity
• Command history and auto-completion
• Pipe results between commands
• Format output using options
• Export results to files
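For example, a typical invocation to run a script against a specific warehouse, database, and schema might look like this (all the names are placeholders):
snowsql -a myorg-myaccount -u jdoe -w my_wh -d my_db -s my_schema -f nightly_load.sql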
Snowflake Scripting
Snowpark
Snowpark is an API accessed outside of the Snowflake
interface.
It was implemented as an alternative to SQL, allowing us to
query and process our data using high-level programming
languages, with Java, Scala, and Python currently supported.
The API provides methods and classes, such as select, join,
drop, and union, instead of constructing query strings and
executing them, as we've seen with stored procedures and
UDFs.
Its main abstraction is something called a DataFrame. It's a
data structure that organizes data into a two-dimensional
table of rows and columns, conceptually similar to a
spreadsheet.
We'll have a look at this shortly.
If you've used Apache Spark or the pandas library,
Snowpark DataFrames will be very familiar and quite
intuitive to use.
There are a couple of things to know regarding Snowpark's
computation model.
Snowpark operations are executed lazily, meaning an
operation is only executed when an action is requested
rather than when it's declared.
Snowpark also works on a push down model, meaning all
operations are performed using Snowflake compute.
No data is transferred to where you're executing the
Snowpark code or to another cluster for processing.
This is often contrasted with the Snowflake connector for
Spark, which actually moves data out of Snowflake into the
Spark cluster to process the data.
A large part of why Snowpark was created was to avoid
developers having to move data out of Snowflake to use
their preferred language.
Okay, let's step through some sample code line by line
using the Python Snowpark API.
Let's say we have a simple requirement to build a data
pipeline whose purpose is to flag transactions over a
certain value and produce a table summarizing our results
for reporting.
In this code snippet, we're importing the Snowpark
libraries. Next, we define a dictionary in which we store our
connection parameters, including our account, user,
password, role, warehouse, database, and schema.
Here we're using our operating system's environment
variables to populate the parameters, but we could just use
plaintext strings or do something like use the result of a
call to a secrets manager.
In this code snippet, we then pass the dictionary into our
Session.builder method. This allows us to establish a
connection with Snowflake and provides methods for
creating DataFrames. For example, on line 15, we're using
the session object to create a DataFrame from a table called
transactions. Along with the table method here, the session
object also has a method called sql.
This accepts a SQL query string, so we could issue a
SELECT command as another way to create a DataFrame.
Okay, let's try and understand
what DataFrames are with this next command. If I were to
print out our transactions DataFrame by following it with a
dot operator and then the collect function, we'd see a list of
row objects output to the console.
A DataFrame is a collection of row objects with their
columns defined by a schema.
Each row object contains the column name and its values,
and because DataFrames are lazily evaluated, we would
retrieve the results from Snowflake when something like a
collect function is called, not when we initially created the
DataFrame, like we did with the session object.
Now that we have a DataFrame, we can start to do our data
transformations.
On line 18, we're using the DataFrame method filter. This is
similar to a WHERE clause in SQL. We're using it to filter our
DataFrame on the amount column for transactions over
1,000.
We then assign the output of that operation to a new
DataFrame.
Many of the operations from SQL have their own
programming constructs.
For example, here we have a group by and count. We can
group by our account holders and count how many
transactions they have over 1,000.
On line 20, we filter again to keep only the account holders
with two or more transactions over 1,000.
We can chain together DataFrame operations by using a
dot operator, and then the next method.
So here, after filtering, we then rename the count column
to flagged_count.
This takes the filtered DataFrame as input. We can then write
that DataFrame to a new table, selecting a mode, such as
append or overwrite.
This will create a new table in the database we set in our
session config, and it will match the schema and data of
our flagged_transactions DataFrame.
Let's use the DataFrame method show to print out the
contents of this DataFrame.
It's similar to collect, but returns our result in a tabular
format.
And finally, on our last line here, we close our session,
ending our connection to Snowflake.
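Putting the walkthrough together, a minimal sketch of the whole pipeline in the Python Snowpark API might look like the following. The table, column, and environment variable names are assumptions for illustration; this is not the course's exact code.

import os
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Connection parameters read from environment variables (illustrative names).
connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "role": os.environ["SNOWFLAKE_ROLE"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
}

# Establish a session with Snowflake.
session = Session.builder.configs(connection_parameters).create()

# Create a DataFrame from the source table (lazily evaluated).
transactions = session.table("transactions")

# Keep only transactions over 1,000 (similar to a WHERE clause).
large_txns = transactions.filter(col("amount") > 1000)

# Count large transactions per account holder, keep holders with two or more,
# and rename the count column to flagged_count.
flagged_transactions = (
    large_txns.group_by("account_holder")
    .count()
    .filter(col("count") >= 2)
    .with_column_renamed("count", "flagged_count")
)

# Write the summary to a new table and print it in tabular form.
flagged_transactions.write.mode("overwrite").save_as_table("flagged_transactions")
flagged_transactions.show()

# Close the session, ending the connection to Snowflake.
session.close()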