
Micro-Partitions and Clustering

What is a Snowflake micro-partition?


A micro-partition is a file stored in the blob storage
service of the cloud provider on which a
Snowflake account runs:

AWS - Amazon S3
Azure - Azure Blob Storage
GCP - Google Cloud Storage
Data Partitioning
Partitioning is the process of breaking down a table into smaller, more
manageable parts based on common criteria,
for example a date,
a geographic region,
or a product category.
Each partition is treated as a separate table and can be queried
independently, allowing faster and more efficient data retrieval. Also, keep in
mind that partitioning can help lower storage costs by putting data that is used
less often in cheaper storage space.
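For contrast, in traditional databases partitioning must be declared explicitly. A PostgreSQL-style sketch (the table and column names are illustrative, chosen to match the sales example below):

```sql
-- Manually declared range partitioning (PostgreSQL syntax);
-- Snowflake's micro-partitioning does this automatically instead.
CREATE TABLE sales (
  transaction_date date,
  store_location   text,
  sales_amount     numeric
) PARTITION BY RANGE (transaction_date);

-- Each partition must be created and maintained by the user:
CREATE TABLE sales_2023 PARTITION OF sales
  FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
```

Every new time range needs a new partition like sales_2023; forgetting one causes insert failures. Snowflake removes this maintenance burden entirely.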
Ex:
Let's consider an example to illustrate the benefits of data partitioning.
Say we have a sales database containing millions of records, organized
into year and month partitions so that data for a specific month or year can be
accessed promptly.
By partitioning the data this way, queries are processed more efficiently because only the
relevant partitions are read.

SELECT store_location, SUM(sales_amount)
FROM sales
WHERE transaction_date BETWEEN '2023-01-01' AND '2023-12-31'
  AND product_category = 'Electronics'
GROUP BY store_location;
Assume that the sales table is partitioned on the transaction_date and store_location columns.
The warehouse can then prune partitions, scanning only those that contain data within the given
time frame or for the given store location. This significantly reduces the number of records it
needs to scan, resulting in faster query times.
Benefits of Snowflake Micro-Partitions
The benefits of Snowflake's approach to partitioning table data include:

Automatic partitioning requires virtually no user oversight
Snowflake Micro-partitions are small, allowing for efficient DML operations
Snowflake Micro-partition metadata enables "zero-copy cloning", allowing for efficient
copying of tables, schemas, and databases with no extra storage costs
Original micro-partitions remain immutable, ensuring data integrity when editing data in a
Snowflake zero-copy clone
Snowflake Micro-partitions improve query performance through horizontal and vertical
query pruning, scanning only the needed micro-partitions for better query performance
Clustering metadata is recorded for each micro-partition, allowing Snowflake to further
optimize query performance.
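The zero-copy cloning mentioned above is a single statement in Snowflake; because the original micro-partitions are immutable, the clone initially shares storage with its source. A sketch (the object names are illustrative):

```sql
-- Clone a table without copying its micro-partitions;
-- both tables initially point at the same immutable storage.
CREATE TABLE sales_dev CLONE sales;

-- Schemas and databases can be cloned the same way:
CREATE DATABASE analytics_dev CLONE analytics;
```

New storage is consumed only when either table is modified afterwards, since changed data lands in new micro-partitions.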
Micro-partitions
• Snowflake has implemented a powerful and unique form of partitioning,
called micro-partitioning.
• Micro-partitioning is automatically performed on all Snowflake tables.
• Tables are transparently partitioned using the ordering of the data as it is
inserted/loaded.
• Micro-partitions are small (50 to 500 MB of uncompressed data).
• Data is compressed within micro-partitions.
• Snowflake automatically determines the most efficient compression algorithm for
the columns in each micro-partition.
E.g. observe how the blue-coloured and magenta-coloured
data are stored
Metadata of micro-partitions
• Snowflake also maintains metadata for each micro-partition, which includes:
• the number of distinct values for each field
• the range of values for each field
• other useful information to improve performance.

• Metadata is a key part of the Snowflake architecture, as it allows queries to determine
whether or not the data inside a micro-partition should be scanned.
• This way, when a query is executed, it does not need to scan the entire dataset;
instead it queries only the micro-partitions that hold relevant data.
• This process is known as query pruning, as the data is pruned before the query itself
is executed.
SELECT type, country
FROM MY_TABLE
WHERE name = 'Y';

The only micro-partitions that match this criterion are micro-partitions 3 and 4; query pruning has
reduced our total dataset to just these two partitions.
Since only the [type] and [country] fields are required in the query output, any parts of the
micro-partitions that do not contain data for these columns are also pruned.
i.e. when the micro-partitions themselves are queried, only the required columns are read.
Benefits of Micro partitioning
• In contrast to traditional static partitioning, Snowflake micro-partitions are derived
automatically; they don’t need to be explicitly defined up-front or maintained by
users.
• Micro-partitions are small in size (50 to 500 MB), which enables extremely
efficient DML and fine-grained pruning for faster queries.
• Columns are stored independently within micro-partitions, often referred to as
columnar storage.
• This enables efficient scanning of individual columns; only the columns
referenced by a query are scanned.
• Columns are also compressed individually within micro-partitions, which optimizes
storage cost.
Clustering
• Clustering is a key factor in query performance: it reduces the scanning of micro-
partitions.
• A clustering key is a subset of columns in a table that are explicitly designated to
co-locate the data in the table in the micro-partitions.
• Initially the data is stored in micro-partitions in the order in which records were
inserted; it is then realigned based on the clustering keys.
• We have to choose proper clustering keys.
• We can define clustering keys on multiple columns as well.
• We can modify the clustering keys based on our requirements; this is called re-
clustering.
• Re-clustering consumes credits; the number of credits consumed depends on the
size of the table.
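Before paying re-clustering credits, it is worth checking how well a table is already clustered. Snowflake exposes system functions for this; a sketch (the table and column names are illustrative):

```sql
-- Clustering statistics for the table's defined clustering key:
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE');

-- Average clustering depth for specific columns
-- (lower is better; a large depth means many overlapping micro-partitions):
SELECT SYSTEM$CLUSTERING_DEPTH('MY_TABLE', '(name, date)');
```

If the reported depth is already low for the columns your queries filter on, re-clustering is unlikely to be worth its cost.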
Clustering keys:
In this second example, let's say we clustered
our dataset on the [name] field, as this
was the key field most frequently used in the
WHERE clauses and JOINs of our queries.
• The data is now stored and ordered based on the value of the [name] field.
• Micro-partition 4 is now the only micro-partition that contains [name] values of Y.
• When we execute our earlier query now, query pruning reduces the target data down to just micro-
partition 4, which means the query has less data to read and will therefore perform more efficiently.
Or, if we see the type and date
columns appearing frequently in the
WHERE clause, we can ALTER the
table to include both as clustering
keys.
Defining clustering keys
Defining clustering keys on a new table:
CREATE TABLE MY_TABLE
(
  type number,
  name string,
  country string,
  date date
)
CLUSTER BY (name);

Modifying clustering keys on an existing table:

ALTER TABLE MY_TABLE CLUSTER BY (name, date);

Choosing clustering keys
Define clustering keys on:
• Columns frequently used in filter conditions (WHERE clause)
• Columns used as join keys
• Frequently used functions or expressions,
like YEAR(date) or SUBSTRING(med_cd, 1, 6)

Snowflake recommends:
• Define clustering keys on large tables, not on small tables.
• Don't define clustering keys on more than 4 columns.
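Clustering on expressions, as suggested above, uses the same CLUSTER BY syntax as column keys. A sketch (the table and column names are illustrative, reusing the med_cd example):

```sql
-- Cluster on expressions rather than raw columns:
CREATE TABLE claims
(
  med_cd string,
  claim_date date,
  amount number
)
CLUSTER BY (YEAR(claim_date), SUBSTRING(med_cd, 1, 6));
```

This co-locates rows by year and by code prefix, so queries filtering on YEAR(claim_date) or a med_cd prefix can prune micro-partitions effectively.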
Thank You
