Optimizing Snowflake
A real-world guide
Matillion
Contents
Introduction
About this book
Snowflake Architecture
Snowflake Virtual Warehousing
Accessing Snowflake
Loading Data
Exporting Data
Storage and Compute Costs
Best Practices
Conclusion
Introduction
If you are reading this eBook, you are probably considering or have already selected Snowflake—the data warehouse built for the cloud—as your modern cloud data warehouse. That is an excellent choice. You are ahead of the curve (and your competitors, with their outdated on-premise databases).

Now what?

Whether you're a data warehouse developer, a data architect or manager, a business intelligence specialist, an analytics professional, or a tech-savvy marketer, you now need to make the most of the Snowflake platform to get the most out of your data. That's where this eBook comes in handy. It introduces the Snowflake platform and helps you understand the options available in the platform for ingesting, manipulating, and exporting data in a way that is performant and cost-efficient.

This eBook is brought to you by Matillion. Matillion is an industry-leading data transformation solution for cloud data warehouses. Delivering a true end-to-end data transformation solution (not just data prep or movement from one location to another), Matillion provides an instant-on experience to get you up and running in just a few clicks, a pay-as-you-go billing model to cut out lengthy procurement processes, and an intuitive user interface to minimize technical pain and speed up time to results. Matillion is available globally for Snowflake on AWS Marketplace, Microsoft Azure Marketplace, and Google Cloud Platform.

More information is available at www.matillion.com
About this book

Snowflake Architecture
In this chapter, you'll "look under the hood" to understand Snowflake's unique underlying architecture. The primary difference between Snowflake and other cloud offerings is Snowflake's "multi-cluster shared data" approach, which separates your compute and storage capabilities to improve concurrency. Data is accessible as your users need it, speeding up data processing and analysis.

Snowflake Virtual Warehousing
Snowflake is essentially a cluster of compute resources, which makes Snowflake a scalable and powerful solution. Knowing how to size your resources can help you take advantage of Snowflake's highly scalable, low-cost, and automatically managed enterprise data warehouse. Ultimately, this makes your business more performant and cost-efficient.

Accessing Snowflake
This chapter looks at Snowflake access options. These may be different from what you are used to, since you will have more options. Once you understand the inner workings of Snowflake and can access it, you can start making the most of it. Query your data using SnowSQL, which is specifically designed for Snowflake, or use a number of other methods, giving you the flexibility you need to get the data you need.

Loading Data
This chapter walks you through ways to bring data into Snowflake. This information is useful if your business is just starting out on Snowflake and you need to populate the data warehouse with historical data or external data from many different sources. It's also useful if you want to improve your data loading processes.

Exporting Data
This chapter describes the differences between exporting data to an internal or external stage and exporting to a local file system or other target. What you want to do with the data may affect which option is best for your use case.

Storage and Compute Costs
Costs for storage and compute resources couldn't be simpler due to Snowflake's pay-as-you-go pricing model. Although Snowflake is a cost-efficient solution, this chapter provides some tips that can help you keep costs down and under control.

Best Practices
The final chapter ties together all the information in this eBook into best practices that reduce costs, increase performance, and help you make the most out of your data.
Snowflake Architecture

Snowflake is a fully relational, analytical, SQL-based, virtual data warehouse built from the ground up to take advantage of the full elasticity and scalability of the cloud. Snowflake delivers a database-as-a-service (DBaaS) platform to relieve users of the complexity and administrative burdens that plague traditional architectures. Furthermore, Snowflake was built with a new, unique architecture called "multi-cluster shared data."

Snowflake can give your company the flexibility and agility to meet changing data needs. Flexible cloud storage allows you to store nearly unlimited amounts of structured and semi-structured data in a single location, consolidating your disparate data sources. Compute nodes, called "virtual warehouses," execute queries and perform transformations on your data. These compute nodes can be created at any time using either SQL commands or the Snowflake UI. This means your virtual warehouse is scalable, allowing your business to respond to growing data needs without additional procurement or managerial overhead.

If you have different workloads with different needs, you can size your warehouses to fit your current need. You can even create "multi-cluster warehouses" that scale automatically to accommodate the number of concurrent queries. Best of all, you can turn off any of the virtual warehouses at any time, so you pay only for what you use.

On top of its scalable storage and compute platform, Snowflake handles optimization, security, and availability for your business and provides many other overhead-reducing benefits. This makes it a fully managed solution. Lastly, you can move real-time data at nearly unlimited scale. With Snowflake Data Sharing, data can be shared simply with another Snowflake account.
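For instance, the virtual warehouses mentioned above can be created and resized with plain SQL. Here is a minimal sketch; the warehouse name, size, and auto-suspend settings are illustrative rather than recommendations:

create warehouse if not exists analytics_wh
  warehouse_size = 'XSMALL'   -- start small and resize later if needed
  auto_suspend   = 300        -- suspend after 5 minutes of inactivity
  auto_resume    = true;      -- wake automatically when a query arrives

-- Scale the same warehouse up for a heavier workload
alter warehouse analytics_wh set warehouse_size = 'LARGE';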
Snowflake Virtual Warehousing (continued)

This can be especially useful when you load data, and it enables faster parallel loads.

• Multi-cluster Warehouses
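Multi-cluster warehouses are created with ordinary DDL by specifying minimum and maximum cluster counts. A minimal sketch (the name and counts are illustrative, and multi-cluster warehouses are only available on Snowflake editions that support them):

create warehouse load_wh
  warehouse_size    = 'MEDIUM'
  min_cluster_count = 1    -- shrink back to one cluster when concurrency drops
  max_cluster_count = 4    -- add clusters automatically under concurrent load
  auto_suspend      = 60
  auto_resume       = true;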
Accessing Snowflake (continued)

SnowSQL: Snowflake's Command-Line Interface (CLI)

If you are a script junkie, you'll love SnowSQL. SnowSQL is a modern CLI that allows users to execute SQL queries, perform all DDL and DML operations—including loading and unloading data into and out of Snowflake—and perform many other tasks. SnowSQL may be used to access your Snowflake database from the command line quickly and, if required, to automate operations by running scripts.

See Installing SnowSQL for instructions on downloading and installing it.

SnowSQL (that is, the snowsql executable) can be run as an interactive shell or in batch mode. Here's an example of running a simple query where results are printed to stdout (the console). Note that the login credentials, represented in the example command by myaccount, are stored in the [connections] section of a configuration file.
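A command along the following lines fits that description; the connection name myaccount and the query are illustrative:

snowsql -c myaccount -q "select count(*) from mytable;"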
Here's another example that executes a script stored in a local file (input_script.sql) and stores the results in a local file. Here, a username and password are used instead of stored credentials.

snowsql -a abc123 -u jsmith -f //tmp/input_script.sql -o output_file=//tmp/output.csv -o quiet=true -o friendly=false -o header=false -o output_format=csv

You may use the PUT command to upload local files into an internal stage and the GET command to download files from an internal stage to a local disk/file system. Read more about internal and external stages in Ch 4: Loading Data.

HELPFUL HINT: Use SnowSQL for writing and running scripts in Snowflake or for automating your data load and other processes.

For more details, see the SnowSQL user guide.

OTHER DATA QUERYING AND MANAGEMENT OPTIONS

You may want to query Snowflake from your favorite SQL client or from a reporting or dashboarding tool. If you are a software developer, you may want to access or manage your data in Snowflake using your favorite programming language.

Snowflake provides various drivers or programmatic interfaces, some of which are listed below:

• Snowflake connector for Python
• Snowflake connector for Apache Spark
• JDBC driver
• Node.js driver
• Go Snowflake driver (which provides an interface for developing applications using the Go programming language)
• .NET driver
• ODBC driver
Loading Data

To start fully benefiting from the power of Snowflake, you need to load your data into a database. There are several options for loading data into Snowflake tables:

• Bulk load data
• Use Snowpipe to continuously load data
• Load data using the web interface
• Use custom/vendor applications

These options are reviewed in detail below. Choose a method that suits your use case.

You can bulk load data from files into tables in Snowflake using the COPY INTO table command. You may execute a COPY command from the Snowflake web interface, from a SnowSQL prompt, or from your favorite programming language by using the appropriate driver, as discussed earlier in the Other Data Querying and Management Options section.

1. Check that the files you intend to load are of a supported format (see below).
2. Compress files for faster loading.
3. Check that your target table already exists (use CREATE TABLE to create a table).
4. Ensure that the files are already staged in an internal or external stage.
5. Review and (where possible) adhere to best practices.

File types supported by the bulk loader:

• Any flat, delimited plain-text format (comma-separated values, tab-separated values, etc.)
• Semi-structured data in JSON, Avro, ORC, Parquet, or XML format (XML is currently supported as a preview feature)

With Matillion ETL for Snowflake, you can also take advantage of components developed to streamline efforts, such as the built-in scheduler to ensure your Snowflake tables are all updated consistently at a convenient time interval.
A stage in Snowflake is a (named) location where you store data files before loading them. Use the CREATE STAGE command to create one. Snowflake supports internal and external stages, which, in turn, can be permanent or temporary. A temporary stage is dropped automatically at the end of a session.
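As a quick illustration (stage names are arbitrary), stages are created with ordinary DDL:

-- A named internal stage
create stage my_int_stage;

-- A temporary internal stage, dropped automatically at the end of the session
create temporary stage my_temp_stage;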
Internal Stage
An internal stage stores data within your Snowflake environment. Files can be uploaded into an
internal stage using SnowSQL and the PUT command, or they can be loaded programmatically
using one of the drivers. Loading data via an internal stage:
Step 1: Upload one or more data files to a Snowflake stage (a named internal stage, a
stage for a specified table, or a stage for the current user) using the PUT command.
Step 2: Use the COPY INTO table command to load the contents of the staged file(s) into
a Snowflake database table.
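A minimal sketch of these two steps from a SnowSQL session follows; the file path, stage, and table names are illustrative. Note that PUT compresses files with gzip by default, which is why the staged file gains a .gz suffix.

put file:///tmp/data/1.csv @mystage;
copy into mytable from @mystage/1.csv.gz file_format = (type = csv);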
External Stage

An external stage stages the data in a location outside of Snowflake. This is usually within Amazon Simple Storage Service (S3), Microsoft Azure Blob Storage, or Google Cloud Storage. External stages can be named entities in Snowflake or references to a location or bucket in the relevant service (for example, s3://bucketname/path-to-file).
Accessing an external stage requires credentials. These can be passed via the CREDENTIALS parameter of the COPY INTO table command. Alternatively, the CREATE STAGE command allows you to specify the credentials required to access an external stage. Snowflake automatically uses the stored credentials if they are not passed into the COPY INTO command when loading from this stage. However, many organizations frequently rotate keys for added security, which may invalidate the stored key and introduce additional overhead when storing keys within Snowflake.
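For example, an external stage with stored credentials might be created like this (the bucket URL and key values are placeholders):

create stage my_ext_stage
  url = 's3://mybucket/path/'
  credentials = (aws_key_id = 'xxxx' aws_secret_key = 'xxxxx');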
Loading data via an external stage:

Step 1: Use the tools provided by the cloud platform to place one or more data files in the external location (for example, an S3 bucket or a Microsoft Azure container). You may already have existing or established processes for moving data files there, and you may use any tools at your disposal to upload files to or download files from an external stage.

Step 2: Use the COPY INTO table command to load the contents of the staged file(s) into a Snowflake database table.

Costs related to using an external stage appear on the bill from the respective cloud vendor.
Loading from an internal stage

Assume that you uploaded some files to an internal stage using SnowSQL and the PUT command. Here are some examples of loading a file called 1.csv from various internal stages—a named stage (@mystage), a table stage (%mytable), and a user stage (~):
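The commands below are a sketch consistent with that description; mytable is the illustrative target table, and the .gz suffix assumes the files were compressed by PUT on upload:

copy into mytable from @mystage/1.csv.gz;    -- named stage
copy into mytable from @%mytable/1.csv.gz;   -- table stage
copy into mytable from @~/1.csv.gz;          -- user stage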
Loading from an external stage (Amazon S3)

The following example loads data from a file in Amazon S3. Note that credentials are passed as part of the command.

copy into mytable
from 's3://mybucket 1/prefix 1/file 1.csv'
credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx');
Loading from a file in a named external stage

Here's an example of loading data from a file in a named external stage:

copy into mytable
from '@myextstage/some folder/file 1.csv';
Further Reading:
• Bulk Loading from a Local File System Using COPY
• Bulk Loading from Amazon S3 Using COPY
• Bulk Loading from Microsoft Azure Using COPY
• Querying Metadata for Staged Files

Snowpipe

Users can build tools that initiate loads by invoking a REST endpoint, without managing a virtual warehouse or manually running a COPY command every time a file needs to be loaded. The service is managed by Snowflake, and it automatically scales up or down based on the load on the Snowpipe service.
Snowpipe Billing

Snowpipe is billed based on the compute credits used per second. Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files. "Per-core" refers to the physical CPU cores in a compute server. You will see a new line item (SNOWPIPE) on your Snowflake bill. Go to the "Billings and Usage" page in the Snowflake web interface to get a detailed breakdown of Snowpipe usage, which can be broken down to a specific date and hour.
Central to Snowpipe is the concept of a "pipe." A Snowpipe pipe is a wrapper around the COPY command that is used to load a file into the target table in Snowflake. A few of the options from the COPY command are not supported. See CREATE PIPE for more information.
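A pipe definition is essentially a named COPY statement. A minimal sketch (the object names and the auto_ingest setting are illustrative):

create pipe mypipe
  auto_ingest = true    -- load files automatically as they arrive in the stage
as
  copy into mytable
  from @my_ext_stage
  file_format = (type = 'JSON');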
Snowpipe Benefits

Snowpipe provides a serverless data loading option that manages compute capacity on your behalf. You can also take advantage of the per-second/per-core billing to save on compute costs and pay only for the exact compute resources you use.

Instant insights. Snowpipe immediately provides fresh data to all your business users without contention.

Cost-effectiveness. You pay only for the per-second compute used to load data rather than the costs for running a warehouse continuously or by the hour.

Ease-of-use. You can point Snowpipe at an S3 bucket from within the Snowflake UI and data will automatically load asynchronously as it arrives.

Flexibility. Technical resources can interface directly with the programmatic REST API, using Java and Python SDKs to enable highly customized loading use cases.

Zero management. Snowpipe automatically provisions the correct capacity for the data being loaded. There are no servers or management to worry about.

Read more about how you can streamline data loading with Snowpipe.
Let's look at how to explicitly invoke Snowflake via the REST API to load files using the Snowpipe service. In this example, we will copy a file to an S3 staging area represented by a named external stage in Snowflake and then invoke the Snowpipe REST endpoint to ingest the file. We are ingesting a JSON file, which will be loaded into a VARIANT column in a table.
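As a rough sketch, the Snowflake objects behind such a flow might look like the following; the names are illustrative, and the pipe is created without auto_ingest so that loads are triggered through the REST API:

create stage snowpipe_stage
  url = 's3://mybucket/snowpipe/'
  credentials = (aws_key_id = 'xxxx' aws_secret_key = 'xxxxx');

create table raw_json (v variant);

create pipe json_pipe as
  copy into raw_json
  from @snowpipe_stage
  file_format = (type = 'JSON');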
You may notice that Steps 1, 2, and 3 below are exactly the same as in the previous section
except that the pipe is created without auto_ingest=true.
Step 1: Create a named stage in Snowflake.
Users cannot authenticate with the REST API using their Snowflake login
credentials. Generate a public-private key pair for making calls to the Snowpipe
REST endpoints. In addition, grant sufficient privileges on the objects for the data
load, for example, the target database, schema, and table; the stage object; and the
pipe.
For more information on generating a compliant key pair and associating it with
the relevant user, see Configure Security in the Snowflake documentation. Also
refer to the relevant SDK documentation on key-based authentication.
Further Reading:
• Loading Continuously Using Snowpipe
• Understanding Billing for Snowpipe Usage
• How Snowpipe Streamlines Your Continuous Data Loading and Your Business
• Video: Automatically Ingesting Streaming Data with Snowpipe
• Video: Load Data Fast, Analyze Even Faster
Loading Data Using the Web Interface

The Snowflake web interface provides a simple wizard for loading files into tables. The wizard uploads files to an internal stage (via PUT), and then it uses a COPY command to load data into the table. See the documentation for more information on using the web interface to upload files and load data.

Loading Data Using Custom or Vendor Applications

Custom and vendor applications can also leverage the Snowflake platform. Snowflake provides connectors and drivers for many popular languages that can be used to build custom applications. SnowSQL is a good example of this: it uses the Python connector provided by Snowflake to provide an effective CLI.
Exporting Data

There are several options for exporting (also called unloading) data:

• Bulk exporting data from a table, a view, or the result of a SELECT statement into files in an internal or external stage
• Exporting data to a local file system or other target

These options are reviewed in detail below. Choose a method that suits your use case.
Files exported to an internal stage may be downloaded to a local file system using SnowSQL
and the GET command. Files exported to an external stage may be accessed/downloaded via
interfaces provided by the respective platform (Amazon S3, Microsoft Azure or Google Cloud
Storage).
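For example, files unloaded to an internal stage can be pulled down from a SnowSQL session with GET; the stage and local path here are illustrative:

get @my_stage/result/ file:///tmp/export/;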
At the time of this writing, Snowflake can export data as single-character delimited files (CSV, TSV, etc.) or as JSON. Exports can be optionally compressed and are always transparently encrypted when written to internal stages. Exports to external stages can be optionally encrypted as well.
Sign up for a demonstration of Matillion ETL for Snowflake and an opportunity to speak to
a Solution Architect about your unique use case.
Exporting to an internal stage

The following command is an example of exporting the result of a SELECT statement to an internal stage named my_stage, to files that reside in a folder named result and whose names are prefixed with data_, with a file format object named vsv, and using gzip compression. You can then load data from these files into other tables or download the files to your local disk using the SnowSQL GET command.

copy into @my_stage/result/data_ from (select * from orders)
file_format=(format_name='vsv' compression='gzip');

Exporting to an external stage (Amazon S3)

The following command is an example of exporting from a table named mytable to CSV-formatted files in Amazon S3 by specifying the credentials* for the desired S3 bucket.

copy into s3://mybucket/unload/ from mytable
credentials = (aws_key_id='xxxx' aws_secret_key='xxxxx' aws_token='xxxxxx')
file_format = (type = csv);

Exporting to an external stage (Microsoft Azure)

The following command is an example of exporting from a table named mytable to CSV-formatted files in Microsoft Azure. The command specifies credentials for the targeted Blob storage container.

copy into azure://myaccount.blob.core.windows.net/unload/ from mytable
credentials = (azure_sas_token='xxxx')
file_format = (type = csv);

NOTE: You can avoid specifying credentials by creating named external stages in advance using the CREATE STAGE command. To do this, you specify the external stage location and, optionally, the credentials required to access this location.

Exporting Data to a Local File System or Other Target

You may also use your favorite programming language or client to query a table or a view and then write the results to one or more files in a local file system or any other target. This approach may be slower than using the "COPY INTO <location>" command, because data needs to travel to the local machine running your code or client, which then writes to the file system. This approach may not noticeably affect exports for small tables, but it will affect exports for larger tables. COPY INTO also benefits from using compression techniques when data is exported, which results in reduced network traffic and faster data movement.

You can also issue a "COPY INTO <location>" command using programming interfaces, download the files from the appropriate stage, and then access the data. This may be appropriate if you intend to download large data sets to local files.
Storage and Compute Costs

This eBook has described methods and best practices for optimizing your usage of Snowflake to control costs. This chapter discusses how billing works, recaps methods for cost optimization, and describes how you can track and control costs. The following information is based on Snowflake's pricing guide.

Snowflake's unique architecture allows for a clear separation between your storage and compute resources. This allows Snowflake's pricing model to be much simpler and to include only two items:

• The cost of storage used
• The cost of compute resources (implemented as virtual warehouses) consumed

Snowflake credits are used to pay for the processing time used by each virtual warehouse.

Storage Costs

All customers are charged a monthly fee for the data they store in Snowflake. Storage cost is measured using the average amount of storage used per month for all customer data consumed or stored in Snowflake, after compression.

Note that features such as Time Travel and Fail-safe may increase costs associated with storage, because data is not immediately deleted but instead is held in reserve to support these features. Files in internal stages and features such as Snowflake Data Sharing, also known as the Data Sharehouse™, and cloning will also affect your storage costs.

Virtual Warehouse (Compute) Costs

Snowflake charges you only when your warehouse is in a "started" state. There is no charge when it is in a "suspended" state. This allows you to create multiple warehouse definitions and suspend them to prevent you from being billed for them. You must issue an ALTER WAREHOUSE RESUME command before you intend to use a suspended virtual warehouse.
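Suspending and resuming a warehouse is a one-line statement each way; the warehouse name is illustrative:

alter warehouse reporting_wh suspend;   -- stop credit consumption while idle
alter warehouse reporting_wh resume;    -- start it again before running queries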
There is a linear relationship between the number of servers in a warehouse cluster and the number of credits the cluster consumes. Snowflake uses per-second billing (with a 60-second minimum each time the warehouse starts), so warehouses are billed only for the credits they actually consume. For more information, see Understanding Snowflake Credit and Storage.

You can profile a warehouse to understand its usage and credits spent. To profile a warehouse, use the WAREHOUSE_LOAD_HISTORY and WAREHOUSE_METERING_HISTORY functions. The information provided by these functions can tell you if scaling up a warehouse would benefit any existing loads. Conversely, you may also be able to identify underutilized warehouses and consolidate them, if appropriate. Read more about profiling here.
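As an illustration, both functions are available as INFORMATION_SCHEMA table functions; the warehouse name and the seven-day window below are arbitrary:

select *
from table(information_schema.warehouse_metering_history(
  dateadd('day', -7, current_date()), current_date(), 'ANALYTICS_WH'));

select *
from table(information_schema.warehouse_load_history(
  date_range_start => dateadd('day', -7, current_date()),
  warehouse_name   => 'ANALYTICS_WH'));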
Once you understand how best to use your virtual warehouse for your data needs, you can implement best practices for performance and cost efficiency.

OTHER FACTORS

The following factors influence the unit costs for the credits you use and the data storage you use:

• Whether you have a Snowflake On Demand account or a Capacity account
• The Region in which you create your Snowflake account
• The Snowflake Edition that your organization chooses

Most Snowflake customers use Snowflake On Demand initially to develop and test the application workload in order to gain real-world experience that enables them to estimate their monthly costs. When the application workload is understood, customers can then purchase an appropriately sized capacity.

NOTE: The Snowflake Edition your business chooses will impact billing. On Demand: Usage-based pricing with no long-term licensing requirements. Capacity: Discounted pricing based on an upfront capacity commitment.

Further Reading:
• Snowflake's Pricing Guide
• How Usage-Based Pricing Delivers a Budget-Friendly Cloud Data Warehouse
Best Practices

IMPROVING LOAD PERFORMANCE

• Use bulk loading to get the data into tables in Snowflake.
• Consider splitting large data files so the load can be efficiently distributed across servers in a cluster.
• Delete files that are no longer needed from internal stages (see the example after this list). This may improve performance in addition to saving on costs.
• Isolate load and transform jobs from queries to prevent resource contention. Dedicate separate warehouses for loading and querying operations to optimize performance for each.
• Leverage the scalable compute layer to do the bulk of the data processing.
• Consider using Snowpipe in micro-batching scenarios.
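Staged files that are no longer needed can be cleared with the REMOVE command; the stage name and pattern below are illustrative:

remove @mystage pattern='.*[.]csv[.]gz';   -- delete staged files matching a pattern
remove @mystage/old_loads/;                -- or clear an entire path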
Conclusion

We hope you enjoyed this eBook and that you have found some helpful tips on how to make the most of your Snowflake database. Implementing the best practices and optimizations described in this eBook should help you enhance big data analytics performance and reduce your Snowflake costs.

With Snowflake, you can spend fewer resources on managing database overhead and focus on what's really important: answering your organization's most pressing business questions.

About Matillion

Matillion is an industry-leading data transformation solution for cloud data warehouses. Delivering a true end-to-end data transformation solution (not just data prep or movement from one location to another), Matillion provides an instant-on experience to get you up and running in just a few clicks, a pay-as-you-go billing model to cut out lengthy procurement processes, and an intuitive user interface to minimize technical pain and speed up time to results. Matillion is available globally for Snowflake on AWS Marketplace, Microsoft Azure Marketplace, and Google Cloud Platform.