Netezza Best Practices
Prepared By
Sivakumar Nair/India/IBM
1. Introduction
2. Distribution
3. Datatypes
4. ZoneMaps
5. Statistics
6. Groom / Reclaim
7. ETL/ELT Guidelines
Introduction
Netezza sells itself on simplicity, so best practice should not mean hundreds of
rules and regulations to follow. The recommendation is to focus on these basic
principles:
> Distribution
> Datatypes
> Statistics
> Zonemaps
> Reclaim
Alongside these, some basic standards for ETL and a few general pointers will help
applications perform well 99% of the time. Best practice means minimal effort early
on for maximum gain.
Distribution
Good distribution is the fundamental element of performance. An SPU (Snippet
Processing Unit) is the individual unit of parallelism: if all SPUs have the same
amount of work to do, a query finishes sooner than if one SPU is asked to do the
whole job.
> Bad distribution is called data skew.
> Skew to one SPU is the worst-case scenario.
> Skew affects the query in hand and other queries too, because the overloaded SPU
has more to do.
> Skew also means that the machine will fill up more quickly.
> Simple rule: good distribution, good performance.
> Never create a table without a distribution key.
> If no distribution key is specified, the NPS chooses one and there is no
guarantee what that key is. This can eventually create data skew.
When choosing the distribution key, consider the following factors:
> The more distinct values the distribution key has, the better.
> The same distribution key value always goes to the same SPU.
> Tables used together should use the same columns for their distribution keys
when possible.
> If a particular key is used heavily in equi-join clauses, that key is a good
choice for the distribution key.
> Even when records are well distributed, check that there is no accidental
processing skew.
> If in doubt, use random distribution, as it spreads rows evenly across SPUs.
> For smaller tables, random distribution is usually a good choice.
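The point about tables used together can be illustrated with a small Python sketch.
It is a toy model only: the SHA-256 based hash, the SPU count, and the table and
column names (customers, orders, customer_id, order_id) are assumptions made for
illustration and are not Netezza internals. It counts how many rows of one table sit
on a different SPU from their join partner, first when both tables share a
distribution key and then when they do not.

```python
import hashlib

NUM_SPUS = 4  # hypothetical SPU count, chosen only for illustration


def spu_for(value, num_spus=NUM_SPUS):
    """Map a key value to an SPU using a stand-in hash (not Netezza's own)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_spus


def distribute(rows, key):
    """Place each row (a dict) on the SPU chosen by hashing row[key]."""
    slices = {spu: [] for spu in range(NUM_SPUS)}
    for row in rows:
        slices[spu_for(row[key])].append(row)
    return slices


def rows_needing_movement(left_slices, right_slices, join_key):
    """Count right-hand rows whose join partner lives on a different SPU."""
    location = {}
    for spu, rows in left_slices.items():
        for row in rows:
            location[row[join_key]] = spu
    moved = 0
    for spu, rows in right_slices.items():
        for row in rows:
            if location.get(row[join_key]) not in (None, spu):
                moved += 1
    return moved


# Hypothetical tables: 1,000 customers and 5,000 orders that reference them.
customers = [{"customer_id": i, "region": i % 5} for i in range(1000)]
orders = [{"order_id": i, "customer_id": i % 1000} for i in range(5000)]

customers_by_id = distribute(customers, "customer_id")
orders_by_customer = distribute(orders, "customer_id")  # same key as customers
orders_by_order_id = distribute(orders, "order_id")     # different key

colocated = rows_needing_movement(customers_by_id, orders_by_customer, "customer_id")
mismatched = rows_needing_movement(customers_by_id, orders_by_order_id, "customer_id")
print("rows to move, same key:     ", colocated)
print("rows to move, different key:", mismatched)
```

With a shared key every order already sits beside its customer, so the join can run
locally on each SPU; with different keys a large share of rows would have to be
moved first.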
Criteria for selecting distribution keys:
> Choose columns for the distribution key that distribute table rows evenly.
> Choose columns for the distribution key based on the selection set that you use
most frequently to retrieve rows from the table.
> Choose as few columns as possible for the distribution key (maximum 4 columns).
> Do not choose Boolean columns as the distribution key.
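The criteria above can be explored with a rough simulation. This is a sketch under
stated assumptions: the hash is a SHA-256 stand-in rather than the function Netezza
uses, the SPU count is arbitrary, random distribution is modeled as dealing rows out
in turn, and the column names (customer_id, is_active) are hypothetical. It compares
the skew produced by a key with many distinct values, a Boolean key, and random
distribution.

```python
import hashlib
from collections import Counter

NUM_SPUS = 8  # hypothetical number of SPUs, chosen only for illustration


def spu_for_key(key, num_spus=NUM_SPUS):
    """Map a distribution key value to an SPU with a stand-in hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_spus


def rows_per_spu(keys, num_spus=NUM_SPUS):
    """Count how many rows land on each SPU for the given key values."""
    counts = Counter(spu_for_key(k, num_spus) for k in keys)
    return [counts.get(spu, 0) for spu in range(num_spus)]


def skew_ratio(counts):
    """Busiest SPU's row count divided by the average; 1.0 is perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average


def dealt_in_turn(num_rows, num_spus=NUM_SPUS):
    """Row counts when rows are dealt to SPUs in turn, ignoring their values."""
    return [num_rows // num_spus + (1 if spu < num_rows % num_spus else 0)
            for spu in range(num_spus)]


customer_ids = range(100_000)                         # many distinct values
is_active = [i % 10 != 0 for i in range(100_000)]     # Boolean: two distinct values

by_customer = rows_per_spu(customer_ids)
by_flag = rows_per_spu(is_active)
dealt_out = dealt_in_turn(100_000)

print("customer_id key skew:", round(skew_ratio(by_customer), 2))
print("Boolean key skew:    ", round(skew_ratio(by_flag), 2))
print("random-style skew:   ", round(skew_ratio(dealt_out), 2))
```

A Boolean key can reach at most two SPUs however many are available, which is why
it makes a poor distribution key in this model.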
Data types
Picking the right data types gives better performance.
> Having columns of uniform type produces consistent results.
> Having columns of uniform type ensures that data is stored efficiently.
> Having columns of uniform type allows the system to process queries
efficiently.
> NUMERIC columns with a scale of 0 behave like INTEGER columns, and switching them
to an INTEGER type means zone maps can be used.
> The INTERVAL datatype is cumbersome and hard to work with. Consider storing the
original TIME and TIMESTAMP values and calculating the interval on the fly.
> Floating point datatypes are, by definition, lossy in nature. There may also be a
performance hit from using them.
> Inconsistent datatypes for the same column in different tables hurt performance.
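The lossy nature of floating point values is easy to demonstrate outside the
database. The following Python sketch is only an analogy: Python's float stands in
for a floating point column and decimal.Decimal for a fixed-scale NUMERIC column.

```python
from decimal import Decimal

# The same addition carried out with two different kinds of number.
float_total = 0.1 + 0.2                        # binary floating point
exact_total = Decimal("0.1") + Decimal("0.2")  # fixed-scale decimal arithmetic

# An equality test, such as a join or a filter, behaves differently for each.
print("float sum equals 0.3:  ", float_total == 0.3)
print("decimal sum equals 0.3:", exact_total == Decimal("0.3"))
```

The floating point sum carries a small rounding error, so equality comparisons on
such values can miss rows that look identical when printed.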
ZoneMaps
> Zone maps improve the throughput and response time of SQL against large, grouped,
or continually augmented nearly ordered data.
> Zone maps are automatically generated, persistent, internal tables.
> They work with large, grouped, or nearly ordered data in date, timestamp, byteint,
smallint, integer, and bigint columns.
> Zone maps take advantage of the inherent ordering or grouping of data to reduce
the disk scanning required to retrieve data on restricted scan queries.
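The idea behind zone maps can be sketched in a few lines of Python. This is a
simplified model, not the internal implementation: the Extent class, the block size
of 100 rows, and the integer day numbers are assumptions for illustration. Each
block of rows records the minimum and maximum of a column, and a restricted scan
skips any block whose range cannot contain matching rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    """A block of rows on disk; here just a list of integer day numbers."""
    values: List[int]


def make_extents(values, rows_per_extent):
    """Cut the loaded values into consecutive blocks, in load order."""
    return [Extent(values[i:i + rows_per_extent])
            for i in range(0, len(values), rows_per_extent)]


def build_zone_map(extents):
    """Record the (min, max) of the column for every extent."""
    return [(min(e.values), max(e.values)) for e in extents]


def extents_to_scan(zone_map, low, high):
    """Indexes of extents whose [min, max] range overlaps [low, high]."""
    return [i for i, (lo, hi) in enumerate(zone_map) if hi >= low and lo <= high]


# 1,000 rows loaded in day order (nearly ordered data), 100 rows per extent.
ordered_days = list(range(1000))
# The same values loaded in a scattered order, so most extents span a wide range.
scattered_days = [(i * 7) % 1000 for i in range(1000)]

ordered_map = build_zone_map(make_extents(ordered_days, 100))
scattered_map = build_zone_map(make_extents(scattered_days, 100))

# Restricted scan, equivalent to: WHERE day BETWEEN 250 AND 260
ordered_hits = extents_to_scan(ordered_map, 250, 260)
scattered_hits = extents_to_scan(scattered_map, 250, 260)
print("extents scanned, ordered load:  ", len(ordered_hits))
print("extents scanned, scattered load:", len(scattered_hits))
```

The ordered load lets the scan skip almost every block, while the scattered load
forces it to read most of them, which is why zone maps pay off on data that is
loaded in date or key order.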
Statistics
> Netezza uses a cost-based optimizer.
> The more up to date and accurate the table statistics are, the better the plans
the query optimizer will generate.
> Statistics generation should be built into ETL or ELT processing wherever
possible.
> Regular monitoring should be deployed to check for out-of-date statistics.
Groom / Reclaim
Why is groom important?
> An update or delete of a table row does not remove the old tuple.
> Over time, outdated or deleted tuples are of no interest to any transaction
and must be removed to free up space.
When should you reclaim?
> Groom tables that receive frequent updates or deletes.
> Groom tables if you cancel or abort a large load operation.
Groom best practices
> If you have a table whose contents are deleted completely, consider using
TRUNCATE rather than DELETE, which eliminates the need to run the groom
command.
> Build groom into the ETL processing wherever possible.
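Why deleted tuples keep consuming space until a groom runs can be shown with a toy
model. The ToyTable class below is hypothetical and deliberately simple: it treats a
delete as a marker on the tuple and an update as a delete plus an insert, which is
an assumption made for illustration rather than a description of Netezza storage.

```python
class ToyTable:
    """Toy model of logical deletes; not how Netezza stores rows internally."""

    def __init__(self):
        self._tuples = []  # each entry: {"row": dict, "deleted": bool}

    def insert(self, row):
        self._tuples.append({"row": row, "deleted": False})

    def delete(self, predicate):
        # A delete only marks tuples; the stored data is left in place.
        for t in self._tuples:
            if not t["deleted"] and predicate(t["row"]):
                t["deleted"] = True

    def update(self, predicate, change):
        # An update is modeled as marking the old tuple and inserting a new one.
        new_rows = []
        for t in self._tuples:
            if not t["deleted"] and predicate(t["row"]):
                t["deleted"] = True
                new_rows.append(change(dict(t["row"])))
        for row in new_rows:
            self.insert(row)

    def groom(self):
        # Physically drop the tuples that are marked as deleted.
        self._tuples = [t for t in self._tuples if not t["deleted"]]

    def truncate(self):
        # Drop all storage at once; nothing is left to groom afterwards.
        self._tuples = []

    def visible_rows(self):
        return [t["row"] for t in self._tuples if not t["deleted"]]

    def stored_tuples(self):
        return len(self._tuples)


table = ToyTable()
for i in range(10):
    table.insert({"id": i, "qty": 1})

table.update(lambda r: r["id"] < 4, lambda r: {**r, "qty": 2})
table.delete(lambda r: r["id"] >= 8)

before = table.stored_tuples()
table.groom()
after = table.stored_tuples()
print("visible rows:", len(table.visible_rows()))
print("stored tuples before groom:", before, "after groom:", after)
```

In this model the table holds more tuples than it has visible rows until groom is
called, while truncate leaves nothing behind to clean up.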