Building and Optimizing Data Warehouse "Star Schemas" with MySQL

Bert Scalzo, Ph.D.
Bert.Scalzo@Quest.com
About the Author
 Oracle DBA for 20+ years, versions 4 through 10g
 Been doing MySQL work for past year (4.x and 5.x)
 Worked for Oracle Education & Consulting
 Holds several Oracle Masters (DBA & CASE)
 BS, MS, PhD in Computer Science and also an MBA
 LOMA insurance industry designations: FLMI and ACS
 Books
    – The TOAD Handbook (Feb 2003)
    – Oracle DBA Guide to Data Warehousing and Star Schemas (Jun 2003)
    – TOAD Pocket Reference 2nd Edition (May 2005)
 Articles
    – Oracle Magazine
    – Oracle Technology Network (OTN)
    – Oracle Informant
    – PC Week (now E-Magazine)
    – Linux Journal
    – www.linux.com
    – www.quest-pipelines.com
Star Schema Design


The “star schema” approach to dimensional data
modeling was pioneered by Ralph Kimball


Dimensions: smaller, de-normalized tables containing
business descriptive columns that users use to query



Facts: very large tables with primary keys formed from
the concatenation of related dimension table foreign key
columns, and also possessing numerically additive, non-
key columns used for calculations during user queries
Facts: 10^8 – 10^10 rows

Dimensions: 10^3 – 10^5 rows
The Ad-Hoc Challenge

How much data would a data miner mine,
if a data miner could mine data?

Dimensions: generally queried selectively to find lookup
value matches that are used to query against the fact table


Facts: must be queried selectively, since they generally
have hundreds of millions to billions of rows – even parallel
full table scans are too expensive for most systems


Business Intelligence (BI) tools generally offer end-users the
ability to perform projections of group operations on columns
from facts using restrictions on columns from dimensions …
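
The kind of query a BI tool typically generates can be sketched as follows. This is a minimal, runnable illustration that uses Python's built-in sqlite3 module as a stand-in for MySQL; the period and product dimension tables, their columns, and the sample rows are hypothetical, and only the pos_day fact table name is taken from the slides.

```python
import sqlite3

# SQLite (in memory) is only a stand-in so the sketch runs anywhere.
# Dimension tables, columns, and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE period  (period_id  INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE pos_day (period_id INTEGER, product_id INTEGER,
                          sales_unit INTEGER, sales_retail INTEGER,
                          PRIMARY KEY (product_id, period_id));
    INSERT INTO period  VALUES (1, 1999), (2, 2000);
    INSERT INTO product VALUES (10, 'BEER'), (20, 'SODA');
    INSERT INTO pos_day VALUES (1, 10, 5, 50), (2, 10, 7, 70), (2, 20, 3, 30);
""")

# Restrict on a dimension column, group by a dimension column,
# and aggregate the additive fact columns.
star_query = """
    SELECT pr.category, SUM(f.sales_unit), SUM(f.sales_retail)
    FROM pos_day f
    JOIN period  pe ON pe.period_id  = f.period_id
    JOIN product pr ON pr.product_id = f.product_id
    WHERE pe.year = 2000
    GROUP BY pr.category
    ORDER BY pr.category
"""
result = conn.execute(star_query).fetchall()
print(result)
```

The restriction (year = 2000) touches only a dimension table, while the aggregates come from the fact table, which is the access pattern the rest of the deck tunes for.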
Hardware Does Not Compensate

People often expect that expensive hardware
is the only way to obtain optimal
performance for a data warehouse

   •CPU: SMP, MPP
   •Disk: 15,000 RPM, RAID (EMC)
   •OS: UNIX, 64-bit
   •MySQL: 4.x / 5.x, 64-bit
DB Design Paramount

In reality, the database design is the key
factor to optimal query performance for a
data warehouse built as a “Star Schema”

Once certain minimum hardware and software
requirements are met, they play a very
subordinate role to tuning the database

Golden Rule: get the basic database
design and query explain plan correct
Key Tuning Requirements

 1.   MySQL 5.x

 2.   MySQL.ini

 3.   Table Design

 4.   Index Design

 5.   Data Loading Architecture

 6.   Analyze Table

 7.   Query Style

 8.   Explain plan
1. MySQL 5.X (help on the way)

•Index Merge Explain
    •Prior to 5.x, only one index used per referenced table
    •This radically affects both index design and explain plans
•Rename Table for MERGE fixed
    •With 4.x, some scenarios could cause table corruption
•New “greedy search” optimizer that can significantly reduce the
time spent on query optimization for some many-table joins

•Views
    •Useful for pre-canning/forcing query style or syntax (i.e. hints)
•Stored Procedures
•Rudimentary Triggers
•InnoDB
    •Compact Record Format
    •Fast Truncate Table
2. MySQL.ini

•query_cache_size                  = 0 (or 13% overhead)
•sort_buffer_size                  >= 4MB

•bulk_insert_buffer_size           >= 16MB
•key_buffer_size                   >= 25-50% RAM
•myisam_sort_buffer_size           >= 16MB

•innodb_additional_mem_pool_size   >= 4MB
•innodb_autoextend_increment       >= 64MB
•innodb_buffer_pool_size           >= 25-50% RAM
•innodb_file_per_table             = TRUE
•innodb_log_file_size              = 1/N of buffer pool (N = number of log files in the group)
•innodb_log_buffer_size            = 4-8 MB
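
As a rough illustration of the percentage-based settings above, the following Python sketch turns a server's RAM size into the 25-50% range quoted for key_buffer_size and innodb_buffer_pool_size. The function name and the 8 GB example server are assumptions for illustration, not part of the original slides.

```python
def buffer_range_mb(ram_mb: int, low: float = 0.25, high: float = 0.50) -> tuple:
    """Return the (low, high) buffer size in MB for a given amount of RAM."""
    return int(ram_mb * low), int(ram_mb * high)

# Hypothetical server with 8 GB of RAM.
low_mb, high_mb = buffer_range_mb(8192)
print(f"key_buffer_size / innodb_buffer_pool_size: {low_mb}M to {high_mb}M")
```

Which of the two buffers should receive the larger share presumably depends on the storage engine that holds the fact tables; giving both the upper bound at once would consume all of the RAM.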
3. Table Design

SPEED vs. SPACE vs. MANAGEMENT

    64 MB / Million Rows (Avg. Fact)
 x 500      Million Rows
===================
32,000 MB (32 GB)
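
The sizing arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the 64 MB per million rows figure is the average quoted on the slide, so the result is an estimate rather than a measurement.

```python
def fact_table_size_mb(rows: int, mb_per_million_rows: float = 64.0) -> float:
    """Estimate fact table size from a row count and an average row density."""
    return rows / 1_000_000 * mb_per_million_rows

size_mb = fact_table_size_mb(500_000_000)
print(f"{size_mb:,.0f} MB (~{size_mb / 1000:.0f} GB)")
```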

Primary storage engine options:
   •MyISAM
   •MyISAM + RAID_TYPE
   •MERGE
   •InnoDB
ENGINE = MyISAM
        CREATE TABLE ss.pos_day (
          PERIOD_ID decimal(10,0) NOT NULL default '0',
          LOCATION_ID decimal(10,0) NOT NULL default '0',
          PRODUCT_ID decimal(10,0) NOT NULL default '0',
          SALES_UNIT decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          PRIMARY KEY(PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD(PERIOD_ID),
          INDEX LOCATION(LOCATION_ID),
          INDEX PRODUCT(PRODUCT_ID)
        ) ENGINE=MyISAM
          PACK_KEYS=1
          DATA DIRECTORY='C:mysqldata'
          INDEX DIRECTORY='D:mysqldata';
Pros:
     •Non-transactional – faster, lower disk space usage, and less memory
Cons:
     •2/4GB data file limit on operating systems that don't support big files
     •2/4GB index file limit on operating systems that don't support big files
     •One big table poses data archival and index maintenance challenges
     (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc)
ENGINE = MyISAM + RAID_TYPE
        CREATE TABLE ss.pos_day (
          PERIOD_ID decimal(10,0) NOT NULL default '0',
          LOCATION_ID decimal(10,0) NOT NULL default '0',
          PRODUCT_ID decimal(10,0) NOT NULL default '0',
          SALES_UNIT decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          PRIMARY KEY(PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD(PERIOD_ID),
          INDEX LOCATION(LOCATION_ID),
          INDEX PRODUCT(PRODUCT_ID)
        ) ENGINE=MyISAM
          PACK_KEYS=1
          RAID_TYPE=STRIPED;
Pros:
     •Non-transactional – faster, lower disk space usage, and less memory
     •Can help you to exceed the 2GB/4GB limit for the MyISAM data file
     •Creates up to 255 subdirectories, each with a file named table_name.MYD
     •Distributed IO – put each table subdirectory and file on a different disk
Cons:
     •2/4GB index file limit on operating systems that don't support big files
     •One big table poses data archival and index maintenance challenges
     (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc)
ENGINE = MERGE
 CREATE TABLE ss.pos_merge (
   PERIOD_ID decimal(10,0) NOT NULL default '0',
   LOCATION_ID decimal(10,0) NOT NULL default '0',
   PRODUCT_ID decimal(10,0) NOT NULL default '0',
   SALES_UNIT decimal(10,0) NOT NULL default '0',
   SALES_RETAIL decimal(10,0) NOT NULL default '0',
   GROSS_PROFIT decimal(10,0) NOT NULL default '0',
   INDEX PK(PRODUCT_ID, LOCATION_ID, PERIOD_ID),
   INDEX PERIOD(PERIOD_ID),
   INDEX LOCATION(LOCATION_ID),
   INDEX PRODUCT(PRODUCT_ID)
 ) ENGINE=MERGE
   UNION=(pos_1998,pos_1999,pos_2000) INSERT_METHOD=LAST;
Pros:
     •Non-transactional – faster, lower disk space usage, and less memory
     •Partitioned tables offer data archival and index maintenance options
     (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc)
     •Distributed IO – put individual tables and indexes on different disks
Cons:
     •MERGE tables use more file descriptors on database server
     •MERGE key lookups are much slower on “eq_ref” searches
     •Can use only identical MyISAM tables for a MERGE table
ENGINE = InnoDB
              CREATE TABLE ss.pos_day (
                PERIOD_ID decimal(10,0) NOT NULL default '0',
                LOCATION_ID decimal(10,0) NOT NULL default '0',
                PRODUCT_ID decimal(10,0) NOT NULL default '0',
                SALES_UNIT decimal(10,0) NOT NULL default '0',
                SALES_RETAIL decimal(10,0) NOT NULL default '0',
                GROSS_PROFIT decimal(10,0) NOT NULL default '0',
                PRIMARY KEY(PRODUCT_ID, LOCATION_ID, PERIOD_ID),
                INDEX PERIOD(PERIOD_ID),
                INDEX LOCATION(LOCATION_ID),
                INDEX PRODUCT(PRODUCT_ID)
              ) ENGINE=InnoDB
                PACK_KEYS=1;
Pros:
     •Simple yet flexible tablespace datafile configuration
     innodb_data_file_path=ibdata1:1G;ibdata2:1G:autoextend:max:2G
     (autoextend may be specified only for the last data file)
Cons:
     •Uses much more disk space – typically 2.5 times as much disk space as MyISAM!!!
     •Transaction Safe – not needed, consumes resources (SET AUTOCOMMIT=0)
     •Foreign Keys – not needed, consumes resources (SET FOREIGN_KEY_CHECKS=0)
     •Prior to MySQL 4.1.1 – no “multiple tablespaces” feature (i.e. one table per tablespace)
     •One big table poses data archival and index maintenance challenges
     (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc)
Space Usage, 21 Million Records

     •MERGE:   3 GB
     •MyISAM:  3 GB
     •InnoDB:  8 GB
4. Index Design

Index design must be driven by the nature of DW users: you don’t know what
they’ll query upon, and the more successful they are at data mining, the
more they’ll try (which is actually a really good thing) …

Therefore you don’t know which dimension tables they’ll reference and
which dimension columns they will restrict upon – so:

    •Fact tables should have primary keys – for data load integrity
    •Fact table dimension reference (i.e. foreign key) columns should each
    be individually indexed – for variable fact/dimension joins
    •Dimension tables should have primary keys
    •Dimension tables should be fully indexed
         •MySQL 4.x – only one index per dimension will be used
             •If you know that one column will always be used in
             conjunction with others, create concatenated indexes
         •MySQL 5.x – new index merge will use multiple indexes
Note: Make sure to build indexes based
on cardinality (i.e. leading portion most
selective); in this case the index was
built backwards
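
The cardinality rule in the note can be sketched in Python as a helper that orders the columns of a concatenated index from most to least selective. The function name, the index name, and the distinct-value counts are hypothetical; in practice the counts would come from the data itself (for example from COUNT(DISTINCT ...) queries).

```python
from typing import Dict, List

def order_index_columns(cardinality: Dict[str, int]) -> List[str]:
    """Order columns so the one with the most distinct values leads."""
    return sorted(cardinality, key=lambda col: cardinality[col], reverse=True)

# Hypothetical distinct-value counts for the fact table's key columns.
cols = order_index_columns(
    {"PERIOD_ID": 365, "LOCATION_ID": 2_000, "PRODUCT_ID": 50_000}
)
ddl = f"ALTER TABLE pos_day ADD INDEX fact_by_cardinality ({', '.join(cols)})"
print(ddl)
```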
MyISAM Key Cache Magic

1. Utilize two key caches:
    •Default Key Cache – for fact table indexes
    •Hot Key Cache – for dimension key indexes
   command-line option:
   shell> mysqld --hot_cache.key_buffer_size=16M

   option file:
   [mysqld]
   hot_cache.key_buffer_size=16M

   CACHE INDEX t1, t2, t3 IN hot_cache;

2. Pre-Load Dimension Indexes:
   LOAD INDEX INTO CACHE t1, t2, t3 IGNORE LEAVES;
5. Data Loading Architecture
Archive:
     •ALTER TABLE fact_table UNION=(mt2, mt3, mt4)
     •DROP TABLE mt1

Load:
     •TRUNCATE TABLE staging_table
     •Run nightly/weekly data load into staging_table
     •ALTER TABLE merge_table_4 DROP PRIMARY KEY
     •ALTER TABLE merge_table_4 DROP INDEX
     •INSERT INTO merge_table_4 SELECT * FROM staging_table
     •ALTER TABLE merge_table_4 ADD PRIMARY KEY(…)
     •ALTER TABLE merge_table_4 ADD INDEX(…)
     •ANALYZE TABLE merge_table_4
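
The load steps above (from dropping the keys onward) can be sketched as a small Python helper that emits the statements in order. The function name and the key and index definitions passed in are assumptions for illustration (the index names mirror the pos_merge example earlier in the deck); the sketch only builds SQL strings and does not connect to a server.

```python
from typing import Dict, List, Sequence

def nightly_load_statements(target: str, staging: str,
                            primary_key: Sequence[str],
                            indexes: Dict[str, Sequence[str]]) -> List[str]:
    """Build the drop-keys / bulk-insert / rebuild-keys / analyze sequence."""
    stmts = [f"ALTER TABLE {target} DROP PRIMARY KEY"]
    stmts += [f"ALTER TABLE {target} DROP INDEX {name}" for name in indexes]
    stmts.append(f"INSERT INTO {target} SELECT * FROM {staging}")
    stmts.append(
        f"ALTER TABLE {target} ADD PRIMARY KEY ({', '.join(primary_key)})"
    )
    stmts += [
        f"ALTER TABLE {target} ADD INDEX {name} ({', '.join(cols)})"
        for name, cols in indexes.items()
    ]
    stmts.append(f"ANALYZE TABLE {target}")
    return stmts

# Hypothetical key and index definitions for the newest partition table.
statements = nightly_load_statements(
    target="merge_table_4",
    staging="staging_table",
    primary_key=["PRODUCT_ID", "LOCATION_ID", "PERIOD_ID"],
    indexes={
        "PERIOD": ["PERIOD_ID"],
        "LOCATION": ["LOCATION_ID"],
        "PRODUCT": ["PRODUCT_ID"],
    },
)
for stmt in statements:
    print(stmt + ";")
```

Dropping the keys before the bulk insert and rebuilding them afterwards keeps index maintenance out of the insert path, which is the point of the sequence on the slide.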
6. Analyze Table

ANALYZE TABLE ss.pos_merge;

•Analyze Table statement analyzes and stores the
key distribution for a table

•MySQL uses the stored key distribution to decide
the order in which tables should be joined

• If you have a problem with incorrect index usage,
you should run ANALYZE TABLE to update table
statistics such as cardinality of keys, which can
affect the choices the optimizer makes
7. Query Style

Many Business Intelligence (BI) and Report Writing
tools offer initialization parameters or global settings
which control the style of SQL code they generate.

Options often include:
   •Simple N-Way Join
   •Sub-Selects
   •Derived Tables
   •Reporting Engine does Join operations

For now – only the first option makes sense…
(see following section regarding explain plans)
8. Explain Plan

The EXPLAIN statement can be used either as a synonym for
DESCRIBE or as a way to obtain information about how the
MySQL query optimizer will execute a SELECT statement.

You can get a good indication of how good a join is by taking
the product of the values in the rows column of the
EXPLAIN output. This should tell you roughly how many
rows MySQL must examine to execute the query.

We’ll refer to this calculation as the explain plan cost and
use it as our primary comparative measure (along with, of
course, the actual run time) …
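
The explain plan cost described above is simply the product of the rows values, one per table in the EXPLAIN output. A minimal Python sketch, assuming the rows estimates have already been read from EXPLAIN; the function name and the sample numbers are hypothetical.

```python
from math import prod
from typing import Iterable

def explain_plan_cost(rows_estimates: Iterable[int]) -> int:
    """Multiply the 'rows' column of each EXPLAIN output line."""
    return prod(rows_estimates)

# Hypothetical rows values for a four-table join.
cost = explain_plan_cost([50, 12, 365, 1000])
print(f"Cost = {cost:.1e}")
```

Because the cost is a product, a single table with a poor rows estimate multiplies through the whole plan, which is why the costs on the following slides differ by many orders of magnitude.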
Method: Derived Tables

Huge Explain Cost, and
statement ran forever!!!   Cost = 1.8 x 10^16
Method: Sub-Selects

Sub-Select better than
Derived Table – but not
as good as Simple Join    Cost = 2.1 x 10^7
Method: Simple Joins and Single-Column Indexes

Join better than Sub-
Select and much better
than Derived Table        Cost = 7.1 x 10^6
Method: Merge Table and Single-Column Indexes

Same Explain Cost, but
statement ran 2X faster   Cost = 7.1 x 10^6
Method: Merge Table and Multi-Column Indexes

Concatenated Index
yielded best run time      Cost = 3.2 x 10^5
Method: Merge Table and Merge Indexes

MySQL 5.0 new Index
Merge = best run time    Cost = 5.2 x 10^5
Conclusion …

•MySQL can easily be used to build large and
effective “Star Schema” Data Warehouses

•MySQL Version 5.x will offer even more useful index
and join query optimizations

•MySQL can be better configured for DW use through
effective mysql.ini option settings

•Table and Index designs are paramount to success

•Query style and resulting explain plans are critical to
achieving the fastest query run times
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 

Star schema my sql

  • 1. Building and Optimizing Data Warehouse "Star Schemas" with MySQL. Bert Scalzo, Ph.D., Bert.Scalzo@Quest.com
  • 2. About the Author
      • Oracle DBA for 20+ years, versions 4 through 10g
      • Doing MySQL work for the past year (4.x and 5.x)
      • Worked for Oracle Education & Consulting
      • Holds several Oracle Masters (DBA & CASE)
      • BS, MS, PhD in Computer Science and also an MBA
      • LOMA insurance industry designations: FLMI and ACS
      • Books: The TOAD Handbook (Feb 2003); Oracle DBA Guide to Data Warehousing and Star Schemas (Jun 2003); TOAD Pocket Reference 2nd Edition (May 2005)
      • Articles: Oracle Magazine; Oracle Technology Network (OTN); Oracle Informant; PC Week (now E-Magazine); Linux Journal; www.linux.com; www.quest-pipelines.com
  • 4. Star Schema Design
    The “star schema” approach to dimensional data modeling was pioneered by Ralph Kimball.
      • Dimensions: smaller, de-normalized tables containing the descriptive business columns that users query on
      • Facts: very large tables whose primary keys are formed by concatenating the foreign key columns of the related dimension tables, and which also hold numerically additive, non-key columns used for calculations in user queries
  • 6. Facts: 10^8 to 10^10 rows. Dimensions: 10^3 to 10^5 rows.
  • 7. The Ad-Hoc Challenge
    How much data would a data miner mine, if a data miner could mine data?
      • Dimensions: generally queried selectively to find lookup values, which are then used to query the fact table
      • Facts: must be queried selectively, since they generally hold hundreds of millions to billions of rows; even parallel full table scans are too expensive for most systems
    Business Intelligence (BI) tools generally let end users perform projections of group operations on fact columns, using restrictions on dimension columns.
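The access pattern described above (restrict on dimension columns, aggregate fact columns) can be sketched as a small runnable example. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `period` and `product` dimension tables, their columns, and all sample values are hypothetical. Only the `pos_day` column names follow the fact table used elsewhere in the deck, with LOCATION_ID left out for brevity.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension tables: small, descriptive, de-normalized.
    CREATE TABLE period  (period_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);

    -- Fact table: the key concatenates the dimension keys,
    -- and the remaining columns are numerically additive measures.
    CREATE TABLE pos_day (
        period_id    INTEGER NOT NULL,
        product_id   INTEGER NOT NULL,
        sales_unit   INTEGER NOT NULL,
        sales_retail INTEGER NOT NULL,
        PRIMARY KEY (product_id, period_id)
    );
""")
conn.executemany("INSERT INTO period VALUES (?, ?, ?)",
                 [(1, 1999, "Q1"), (2, 1999, "Q2"), (3, 2000, "Q1")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(10, "BEER"), (20, "SODA")])
conn.executemany("INSERT INTO pos_day VALUES (?, ?, ?, ?)",
                 [(1, 10, 5, 50), (2, 10, 7, 70), (3, 10, 9, 90),
                  (1, 20, 2, 10), (2, 20, 4, 20)])

# Restrictions go on dimension columns; group operations go on fact columns.
star_query = """
    SELECT per.quarter, SUM(f.sales_unit), SUM(f.sales_retail)
    FROM pos_day f
    JOIN period  per ON per.period_id  = f.period_id
    JOIN product pro ON pro.product_id = f.product_id
    WHERE per.year = 1999 AND pro.category = 'BEER'
    GROUP BY per.quarter
    ORDER BY per.quarter
"""
rows = conn.execute(star_query).fetchall()
print(rows)
```

The fact table is never filtered on its own columns here; every restriction arrives through a dimension join, which is why the fact table's dimension key columns matter so much for indexing.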
  • 8. Hardware Does Not Compensate
    People often expect that expensive hardware is the only way to obtain optimal performance for a data warehouse:
      • CPU: SMP, MPP
      • Disk: 15,000 RPM, RAID (EMC)
      • OS: UNIX, 64-bit
      • MySQL: 4.x / 5.x, 64-bit
  • 9. DB Design Paramount
    In reality, database design is the key factor in optimal query performance for a data warehouse built as a “star schema”.
    There are certain minimum hardware and software requirements that, once met, play a very subordinate role to tuning the database.
    Golden Rule: get the basic database design and the query explain plan correct.
  • 10. Key Tuning Requirements
      1. MySQL 5.x
      2. MySQL.ini
      3. Table Design
      4. Index Design
      5. Data Loading Architecture
      6. Analyze Table
      7. Query Style
      8. Explain Plan
  • 11. 1. MySQL 5.x (help on the way)
      • Index Merge Explain
          • Prior to 5.x, only one index was used per referenced table
          • This radically affects both index design and explain plans
      • RENAME TABLE for MERGE tables fixed
          • With 4.x, some scenarios could cause table corruption
      • New “greedy search” optimizer that can significantly reduce the time spent on query optimization for some many-table joins
      • Views
          • Useful for pre-canning or forcing a query style or syntax (i.e. hints)
      • Stored Procedures
      • Rudimentary Triggers
      • InnoDB
          • Compact Record Format
          • Fast Truncate Table
  • 12. 2. MySQL.ini
      • query_cache_size = 0 (otherwise about 13% overhead)
      • sort_buffer_size >= 4MB
      • bulk_insert_buffer_size >= 16MB
      • key_buffer_size >= 25-50% RAM
      • myisam_sort_buffer_size >= 16MB
      • innodb_additional_mem_pool_size >= 4MB
      • innodb_autoextend_increment >= 64MB
      • innodb_buffer_pool_size >= 25-50% RAM
      • innodb_file_per_table = TRUE
      • innodb_log_file_size = 1/N of buffer pool
      • innodb_log_buffer_size = 4-8 MB
  • 13. 3. Table Design
    SPEED vs. SPACE vs. MANAGEMENT
    Sizing: 64 MB per million rows (average fact table) × 500 million rows = 32,000 MB (32 GB)
    Primary storage engine options:
      • MyISAM
      • MyISAM + RAID_TYPE
      • MERGE
      • InnoDB
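The sizing arithmetic on this slide can be reproduced with a short sketch. The function name is hypothetical. The 2.5x InnoDB multiplier is the "typically 2.5 times as much disk space as MyISAM" figure from the InnoDB slide, and applying it here assumes the 64 MB per million rows figure describes a MyISAM-style table, which the deck does not state explicitly.

```python
def fact_table_size_mb(rows: int, mb_per_million_rows: float = 64.0) -> float:
    """Estimate fact table size in MB from the row count."""
    return rows / 1_000_000 * mb_per_million_rows


# Figures from the slide: 64 MB per million rows, 500 million rows.
size_mb = fact_table_size_mb(500_000_000)
print(f"Base estimate: {size_mb:,.0f} MB ({size_mb / 1000:.0f} GB)")

# Assumption: scale by the 2.5x InnoDB overhead quoted in the deck.
innodb_mb = size_mb * 2.5
print(f"InnoDB estimate: {innodb_mb:,.0f} MB ({innodb_mb / 1000:.0f} GB)")
```

Estimates like this cover table data only; the indexes recommended later in the deck add to the total.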
  • 14. ENGINE = MyISAM
        CREATE TABLE ss.pos_day (
          PERIOD_ID    decimal(10,0) NOT NULL default '0',
          LOCATION_ID  decimal(10,0) NOT NULL default '0',
          PRODUCT_ID   decimal(10,0) NOT NULL default '0',
          SALES_UNIT   decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          PRIMARY KEY (PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD (PERIOD_ID),
          INDEX LOCATION (LOCATION_ID),
          INDEX PRODUCT (PRODUCT_ID)
        ) ENGINE=MyISAM
          PACK_KEYS=1   -- PACK_KEYS needs a value (0, 1 or DEFAULT); 1 is assumed here
          DATA DIRECTORY='C:mysqldata'
          INDEX DIRECTORY='D:mysqldata';
    Pros:
      • Non-transactional: faster, lower disk space usage, and less memory
    Cons:
      • 2/4GB data file limit on operating systems that don't support big files
      • 2/4GB index file limit on operating systems that don't support big files
      • One big table poses data archival and index maintenance challenges (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc.)
  • 15. ENGINE = MyISAM + RAID_TYPE
        CREATE TABLE ss.pos_day (
          PERIOD_ID    decimal(10,0) NOT NULL default '0',
          LOCATION_ID  decimal(10,0) NOT NULL default '0',
          PRODUCT_ID   decimal(10,0) NOT NULL default '0',
          SALES_UNIT   decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          PRIMARY KEY (PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD (PERIOD_ID),
          INDEX LOCATION (LOCATION_ID),
          INDEX PRODUCT (PRODUCT_ID)
        ) ENGINE=MyISAM
          PACK_KEYS=1   -- PACK_KEYS needs a value (0, 1 or DEFAULT); 1 is assumed here
          RAID_TYPE=STRIPED;
    Pros:
      • Non-transactional: faster, lower disk space usage, and less memory
      • Can help you exceed the 2GB/4GB limit for the MyISAM data file
      • Creates up to 255 subdirectories, each with a file named table_name.myd
      • Distributed IO: put each table subdirectory and file on a different disk
    Cons:
      • 2/4GB index file limit on operating systems that don't support big files
      • One big table poses data archival and index maintenance challenges (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc.)
  • 16. ENGINE = MERGE
        CREATE TABLE ss.pos_merge (
          PERIOD_ID    decimal(10,0) NOT NULL default '0',
          LOCATION_ID  decimal(10,0) NOT NULL default '0',
          PRODUCT_ID   decimal(10,0) NOT NULL default '0',
          SALES_UNIT   decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          INDEX PK (PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD (PERIOD_ID),
          INDEX LOCATION (LOCATION_ID),
          INDEX PRODUCT (PRODUCT_ID)
        ) ENGINE=MERGE
          UNION=(pos_1998, pos_1999, pos_2000)
          INSERT_METHOD=LAST;
    Pros:
      • Non-transactional: faster, lower disk space usage, and less memory
      • Partitioned tables offer data archival and index maintenance options (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc.)
      • Distributed IO: put individual tables and indexes on different disks
    Cons:
      • MERGE tables use more file descriptors on the database server
      • MERGE key lookups are much slower on “eq_ref” searches
      • Only identical MyISAM tables can be used in a MERGE table
  • 17. ENGINE = InnoDB
        CREATE TABLE ss.pos_day (
          PERIOD_ID    decimal(10,0) NOT NULL default '0',
          LOCATION_ID  decimal(10,0) NOT NULL default '0',
          PRODUCT_ID   decimal(10,0) NOT NULL default '0',
          SALES_UNIT   decimal(10,0) NOT NULL default '0',
          SALES_RETAIL decimal(10,0) NOT NULL default '0',
          GROSS_PROFIT decimal(10,0) NOT NULL default '0',
          PRIMARY KEY (PRODUCT_ID, LOCATION_ID, PERIOD_ID),
          INDEX PERIOD (PERIOD_ID),
          INDEX LOCATION (LOCATION_ID),
          INDEX PRODUCT (PRODUCT_ID)
        ) ENGINE=InnoDB
          PACK_KEYS=1;  -- PACK_KEYS needs a value (0, 1 or DEFAULT); 1 is assumed here
    Pros:
      • Simple yet flexible tablespace datafile configuration (only the last data file may be autoextending):
        innodb_data_file_path=ibdata1:1G;ibdata2:1G:autoextend:max:2G
    Cons:
      • Uses much more disk space: typically 2.5 times as much as MyISAM
      • Transaction safe: not needed, consumes resources (SET AUTOCOMMIT=0)
      • Foreign keys: not needed, consume resources (SET FOREIGN_KEY_CHECKS=0)
      • Prior to MySQL 4.1.1: no “multiple tablespaces” feature (i.e. one table per tablespace)
      • One big table poses data archival and index maintenance challenges (e.g. drop 1998 data, make 1999 read only, rebuild 2000 indexes, etc.)
  • 18. Space Usage (21 million records)
      • MERGE: 3 GB
      • MyISAM: 3 GB
      • InnoDB: 8 GB
  • 19. 4. Index Design
    Index design must be driven by the nature of DW users: you don’t know what they’ll query on, and the more successful they are at data mining, the more they’ll try (which is actually a really good thing). You therefore don’t know which dimension tables they’ll reference or which dimension columns they’ll restrict on, so:
      • Fact tables should have primary keys, for data load integrity
      • Fact table dimension reference (i.e. foreign key) columns should each be individually indexed, for variable fact/dimension joins
      • Dimension tables should have primary keys
      • Dimension tables should be fully indexed
      • MySQL 4.x: only one index per dimension will be used
      • If you know that one column will always be used in conjunction with others, create concatenated indexes
      • MySQL 5.x: the new index merge will use multiple indexes
  • 20. Note: Make sure to build indexes based on cardinality (i.e. with the most selective column leading); in this case the index was built backwards.
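The ordering rule in this note can be sketched as a small helper that sorts the columns of a concatenated index by their distinct-value counts, most selective first, and emits the matching DDL. The helper name, the index name `pos_day_sel`, and the cardinality figures are all hypothetical; in practice the counts would come from the data itself (for example from `SELECT COUNT(DISTINCT col)`).

```python
def concatenated_index_ddl(table, index_name, cardinalities):
    """Build an ADD INDEX statement with the most selective column leading.

    cardinalities maps a column name to its number of distinct values.
    """
    # Highest distinct-value count first, i.e. most selective column leads.
    ordered = sorted(cardinalities, key=cardinalities.get, reverse=True)
    return f"ALTER TABLE {table} ADD INDEX {index_name} ({', '.join(ordered)});"


# Hypothetical cardinalities for the fact table's dimension keys.
ddl = concatenated_index_ddl(
    "ss.pos_day",
    "pos_day_sel",
    {"PERIOD_ID": 365, "LOCATION_ID": 2000, "PRODUCT_ID": 50000},
)
print(ddl)
```

With these made-up counts the result is the same column order as the primary key on the deck's `pos_day` table: PRODUCT_ID, LOCATION_ID, PERIOD_ID.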
  • 21. MyISAM Key Cache Magic
      1. Utilize two key caches:
          • Default key cache: for fact table indexes
          • Hot key cache: for dimension key indexes
         Command-line option:
            shell> mysqld --hot_cache.key_buffer_size=16M
         Option file:
            [mysqld]
            hot_cache.key_buffer_size=16M
         Then assign the indexes to the cache:
            CACHE INDEX t1, t2, t3 IN hot_cache;
      2. Pre-load dimension indexes:
            LOAD INDEX INTO CACHE t1, t2, t3 IGNORE LEAVES;
  • 22. 5. Data Loading Architecture
    Archive:
      • ALTER TABLE fact_table UNION=(mt2, mt3, mt4)
      • DROP TABLE mt1
    Load:
      • TRUNCATE TABLE staging_table
      • Run nightly/weekly data load into staging_table
      • ALTER TABLE merge_table_4 DROP PRIMARY KEY
      • ALTER TABLE merge_table_4 DROP INDEX
      • INSERT INTO merge_table_4 SELECT * FROM staging_table
      • ALTER TABLE merge_table_4 ADD PRIMARY KEY(…)
      • ALTER TABLE merge_table_4 ADD INDEX(…)
      • ANALYZE TABLE merge_table_4
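The load sequence above can be sketched as a helper that generates the statements in order, so the drop-indexes, insert, rebuild, analyze pattern can be scripted per target table. The function name and return shape are hypothetical. The key and index definitions filled in below are borrowed from the `pos_day` examples earlier in the deck and are an assumption, since the slide itself leaves them as "…".

```python
def load_cycle(target, staging, pk_cols, indexes):
    """Return (before_load, after_load) MySQL statements for one load cycle.

    indexes maps an index name to its list of columns. The external
    nightly/weekly load into the staging table runs between the two lists.
    """
    before_load = [f"TRUNCATE TABLE {staging}"]

    # Drop keys first so the bulk insert does not maintain indexes row by row.
    after_load = [f"ALTER TABLE {target} DROP PRIMARY KEY"]
    after_load += [f"ALTER TABLE {target} DROP INDEX {name}" for name in indexes]
    after_load.append(f"INSERT INTO {target} SELECT * FROM {staging}")

    # Rebuild the keys in bulk, then refresh the optimizer statistics.
    after_load.append(f"ALTER TABLE {target} ADD PRIMARY KEY ({', '.join(pk_cols)})")
    after_load += [
        f"ALTER TABLE {target} ADD INDEX {name} ({', '.join(cols)})"
        for name, cols in indexes.items()
    ]
    after_load.append(f"ANALYZE TABLE {target}")
    return before_load, after_load


before, after = load_cycle(
    "merge_table_4",
    "staging_table",
    ["PRODUCT_ID", "LOCATION_ID", "PERIOD_ID"],
    {"PERIOD": ["PERIOD_ID"], "LOCATION": ["LOCATION_ID"], "PRODUCT": ["PRODUCT_ID"]},
)
for stmt in before + after:
    print(stmt + ";")
```

Generating the statements rather than hard-coding them keeps the drop and re-add lists in sync when an index is added to the table design.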
  • 23. 6. Analyze Table
        ANALYZE TABLE ss.pos_merge;
      • The ANALYZE TABLE statement analyzes and stores the key distribution for a table
      • MySQL uses the stored key distribution to decide the order in which tables should be joined
      • If you have a problem with incorrect index usage, run ANALYZE TABLE to update table statistics such as the cardinality of keys, which can affect the choices the optimizer makes
  • 24. 7. Query Style
    Many Business Intelligence (BI) and report writing tools offer initialization parameters or global settings that control the style of SQL code they generate. Options often include:
      • Simple N-way join
      • Sub-selects
      • Derived tables
      • Reporting engine does the join operations
    For now, only the first option makes sense (see the following section on explain plans).
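The first three styles can be illustrated with one question asked three ways. This is a sketch under assumptions: the exact SQL a given BI tool generates is not shown in the transcript, so the three formulations below are only plausible shapes for each style, the `period` table and the sample data are hypothetical, and Python's built-in sqlite3 module stands in for MySQL. It shows that the styles are logically equivalent; it says nothing about their relative cost in MySQL, which is what the explain plan slides measure.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE period  (period_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE pos_day (period_id INTEGER NOT NULL, sales_unit INTEGER NOT NULL);
    INSERT INTO period  VALUES (1, 1999), (2, 1999), (3, 2000);
    INSERT INTO pos_day VALUES (1, 5), (2, 7), (3, 9);
""")

styles = {
    # Simple N-way join: fact joined directly to the dimension.
    "simple_join": """
        SELECT SUM(f.sales_unit)
        FROM pos_day f
        JOIN period p ON p.period_id = f.period_id
        WHERE p.year = 1999
    """,
    # Sub-select: dimension restriction pushed into an IN (...) subquery.
    "sub_select": """
        SELECT SUM(f.sales_unit)
        FROM pos_day f
        WHERE f.period_id IN (SELECT period_id FROM period WHERE year = 1999)
    """,
    # Derived table: dimension restriction materialized in the FROM clause.
    "derived_table": """
        SELECT SUM(f.sales_unit)
        FROM pos_day f
        JOIN (SELECT period_id FROM period WHERE year = 1999) p
          ON p.period_id = f.period_id
    """,
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in styles.items()}
print(results)
```

Because all three return the same answer, the choice between them is purely a performance decision, which is why the tool's SQL-generation setting is worth checking.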
  • 25. 8. Explain Plan
    The EXPLAIN statement can be used either as a synonym for DESCRIBE or to obtain information about how the MySQL query optimizer will execute a SELECT statement.
    You can get a good indication of how good a join is by taking the product of the values in the rows column of the EXPLAIN output. This tells you roughly how many rows MySQL must examine to execute the query.
    We’ll refer to this calculation as the explain plan cost and use it as our primary comparative measure (along with, of course, the actual run time).
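The explain plan cost defined above is just the product of the per-table row estimates, which a short sketch makes concrete. The function name and the row estimates for the two plans are hypothetical; real values would be read from the rows column of EXPLAIN output, one per joined table.

```python
from math import prod


def explain_plan_cost(rows_estimates):
    """Multiply the 'rows' values of an EXPLAIN output, one per joined table."""
    return prod(rows_estimates)


# Hypothetical 'rows' columns for two plans of the same three-table join.
plan_a = [1000, 12, 50]      # selective access to the first table
plan_b = [250000, 12, 50]    # much broader access to the first table

cost_a = explain_plan_cost(plan_a)
cost_b = explain_plan_cost(plan_b)
print(f"plan A cost = {cost_a:.1e}")
print(f"plan B cost = {cost_b:.1e}")
print("cheaper plan:", "A" if cost_a < cost_b else "B")
```

Because the estimates multiply, one badly restricted table inflates the whole cost by the same factor, which is why the cost figures in the following slides differ by orders of magnitude.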
  • 26. Method: Derived Tables. Huge explain cost, and the statement ran forever! Cost = 1.8 × 10^16
  • 27. Method: Sub-Selects. Sub-select is better than derived table, but not as good as simple join. Cost = 2.1 × 10^7
  • 28. Method: Simple Joins and Single-Column Indexes. Join is better than sub-select and much better than derived table. Cost = 7.1 × 10^6
  • 29. Method: Merge Table and Single-Column Indexes. Same explain cost, but the statement ran 2X faster. Cost = 7.1 × 10^6
  • 30. Method: Merge Table and Multi-Column Indexes. Concatenated index yielded best run time. Cost = 3.2 × 10^5
  • 31. Method: Merge Table and Merge Indexes. MySQL 5.0 new Index Merge = best run time. Cost = 5.2 × 10^5
  • 32. Conclusion
      • MySQL can easily be used to build large and effective “star schema” data warehouses
      • MySQL version 5.x will offer even more useful index and join query optimizations
      • MySQL can be better configured for DW use through effective mysql.ini option settings
      • Table and index designs are paramount to success
      • Query style and the resulting explain plans are critical to achieving the fastest query run times