
DATA WAREHOUSING AND DATA MINING

P. Venkateshwarlu
Faculty of Computer Science
[email protected]

TOPIC: Data Warehouse Implementation

INTRODUCTION

• Data warehouses contain huge volumes of data.
• OLAP servers demand that decision support queries be answered in the order of seconds.
• Therefore, it is crucial for data warehouse systems to support:
  › highly efficient cube computation techniques
  › access methods
  › query processing techniques

Efficient Computation of Data Cubes

• The core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
• The compute cube operator aggregates over all subsets of the dimensions.

Computation of Data Cubes ..

• Suppose you would like to create a data cube for AllElectronics that contains the following:
  › item, city, year, and sales_in_euro
• It should answer the following queries (see the sketch below):
  › Compute the sum of sales, grouping by item and city
  › Compute the sum of sales, grouping by item
  › Compute the sum of sales, grouping by city
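
A minimal sketch of these three group-bys over a toy in-memory fact table, in Python; the sample rows and the helper name group_sum are illustrative, not part of the AllElectronics schema:

    from collections import defaultdict

    # Toy fact table: (item, city, year, sales_in_euro); values are made up.
    facts = [
        ("TV",    "Madrid", 2000, 400),
        ("TV",    "Berlin", 2000, 250),
        ("Phone", "Madrid", 2001, 120),
        ("Phone", "Berlin", 2001, 300),
    ]

    def group_sum(rows, key_fields):
        """Sum sales_in_euro grouped by the given subset of dimensions."""
        index = {"item": 0, "city": 1, "year": 2}
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[index[f]] for f in key_fields)
            totals[key] += row[3]
        return dict(totals)

    print(group_sum(facts, ("item", "city")))  # sum of sales by item and city
    print(group_sum(facts, ("item",)))         # sum of sales by item
    print(group_sum(facts, ("city",)))         # sum of sales by city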

Computation of Data Cubes ..

• The total number of cuboids is 2^3 = 8:
  › {(city, item, year),
  › (city, item), (city, year), (item, year),
  › (city), (item), (year),
  › ()}

• In the apex cuboid, (), the dimensions are not grouped at all
  › These group-bys form a lattice of cuboids for the data cube (enumerated in the sketch below)
  › The base cuboid contains all three dimensions
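
A minimal sketch that enumerates the cuboids of a cube as the power set of its dimensions; purely illustrative:

    from itertools import combinations

    dimensions = ("city", "item", "year")

    # Every subset of the dimensions is one cuboid: the full set is the
    # base cuboid and the empty tuple () is the apex cuboid.
    cuboids = [
        combo
        for r in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, r)
    ]

    for c in cuboids:
        print(c)        # 2**3 = 8 cuboids in total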

Lattice of a cube

[Figure: the lattice of the eight cuboids for the (city, item, year) cube, from the apex cuboid () down to the base cuboid (city, item, year)]
Computation of Cubes

• On-line analytical processing may need to access different cuboids for different queries
• Therefore, compute some cuboids in advance
  › Precomputation leads to fast response times
  › Most products support precomputation to some degree
Computation of Cubes

• Storage space may explode...
  › If there are no hierarchies, the total number of cuboids for an n-dimensional cube is 2^n
• But....
  › Many dimensions have hierarchies, for example time:
    ◦ day < week < month < quarter < year
• For an n-dimensional data cube, where L_i is the number of levels of dimension i excluding the top level "all" (for time, L_time = 5), the total number of cuboids that can be generated is

    T = ∏_{i=1}^{n} (L_i + 1)
Selected computation

• It is unrealistic to precompute and materialize (store) all cuboids that can be generated
• Partial materialization
  › Only some of the possible cuboids are generated
Materialization

• No materialization
  › Do not precompute any of the "nonbase" cuboids
    ◦ Leads to expensive computation during data analysis
• Full materialization
  › Precompute all cuboids
    ◦ Requires huge amounts of memory space....
• Partial materialization
  › Which cuboids should we precompute and which not?
Partial materialization - Selection of cuboids

• Take into account:
  › the queries, their frequencies, and their access costs
  › workload characteristics, the cost of incremental updates, and storage requirements
• This is the broad context of physical database design, including the generation and selection of indices
Heuristic approaches for cuboid selection

• Materialize the set of cuboids on which other frequently referenced cuboids are based
Advantage of materialized cuboids

• It is important to take advantage of materialized cuboids during query processing:
  › How to use available index structures on the materialized cuboids
  › How to transform the OLAP operations in a query onto the selected cuboids
Selection of operations

• Determine which operations should be performed on the available cuboids:
  › This involves transforming any selection, projection, roll-up, and drill-down operations in the query into corresponding SQL and/or OLAP operations
  › Determine to which materialized cuboid(s) the relevant operations should be applied
  › This requires identifying all materialized cuboids that may potentially be used to answer the query
Example

• Suppose that we define a data cube for AllElectronics of the form
    sales[time, item, location]: sum(sales_in_euro)

• Dimension hierarchies:
  › time: day < month < quarter < year
  › item: item_name < brand < type
  › location: city < province_or_state < country
Query

• {brand, province_or_state} with year = 2000

• Four materialized cuboids are available:
  1) {year, item_name, city}
  2) {year, brand, country}
  3) {year, brand, province_or_state}
  4) {item_name, province_or_state} where year = 2000

• Which should be selected to process the query?


Granularity

• Finer-granularity data cannot be generated from coarser-granularity data

• Cuboid 2 cannot be used, since country is a more general concept than province_or_state

• Cuboids 1, 3, and 4 can be used:
  › They have the same set, or a superset, of the dimensions in the query
  › The selection clause in the query can imply the selection in the cuboid
  › The abstraction levels for the item and location dimensions in these cuboids are at a finer level than brand and province_or_state (a feasibility check is sketched below)
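
A minimal sketch of this feasibility check, assuming hierarchies stored as finest-to-coarsest lists; the helper names (finer_or_equal, can_answer) are illustrative, not from the source:

    # Hierarchies ordered from finest to coarsest level.
    HIERARCHIES = {
        "item":     ["item_name", "brand", "type"],
        "location": ["city", "province_or_state", "country"],
    }

    def finer_or_equal(level_a, level_b, dimension):
        """True if level_a is the same as or finer than level_b."""
        order = HIERARCHIES[dimension]
        return order.index(level_a) <= order.index(level_b)

    def can_answer(cuboid, query):
        """A cuboid can answer the query if it carries every queried
        dimension at the same or a finer abstraction level."""
        return all(
            dim in cuboid and finer_or_equal(cuboid[dim], level, dim)
            for dim, level in query.items()
        )

    query = {"item": "brand", "location": "province_or_state"}
    cuboid1 = {"time": "year", "item": "item_name", "location": "city"}
    cuboid2 = {"time": "year", "item": "brand", "location": "country"}

    print(can_answer(cuboid1, query))  # True: item_name and city are finer
    print(can_answer(cuboid2, query))  # False: country is coarser
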
How would the costs of each cuboid compare?

• Cuboid 1 would cost the most, since both item_name and city are at a lower level than the brand and province_or_state requested by the query

• If there are not many year values associated with the items in the cube, and there are several item names for each brand, then cuboid 3 will be smaller than cuboid 4 and thus the better choice

• However, if efficient indices are available for cuboid 4 (e.g., bitmap indexes), cuboid 4 may be the better choice
Indexing OLAP Data: Bitmap Index

• An index on a particular column
• Each value in the column has a bit vector: bit operations are fast
• The length of each bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has that value for the indexed column
Base table              Index on Region                 Index on Type

Cust  Region   Type     RecID  Asia  Europe  America    RecID  Retail  Dealer
C1    Asia     Retail   1      1     0       0          1      1       0
C2    Europe   Dealer   2      0     1       0          2      0       1
C3    Asia     Dealer   3      1     0       0          3      0       1
C4    America  Retail   4      0     0       1          4      1       0
C5    Europe   Dealer   5      0     1       0          5      0       1

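A minimal sketch of bitmap indexing, using Python integers as bit vectors so that a two-column query reduces to one bitwise AND; the rows mirror the base table above:

    from collections import defaultdict

    rows = [
        ("C1", "Asia",    "Retail"),
        ("C2", "Europe",  "Dealer"),
        ("C3", "Asia",    "Dealer"),
        ("C4", "America", "Retail"),
        ("C5", "Europe",  "Dealer"),
    ]

    def build_bitmap(rows, col):
        """One bit vector per distinct value; bit i is set iff row i has it."""
        bitmaps = defaultdict(int)
        for i, row in enumerate(rows):
            bitmaps[row[col]] |= 1 << i
        return bitmaps

    region = build_bitmap(rows, col=1)
    rtype = build_bitmap(rows, col=2)

    # Region = 'Europe' AND Type = 'Dealer' is just bitmap arithmetic.
    hits = region["Europe"] & rtype["Dealer"]
    print([rows[i][0] for i in range(len(rows)) if hits >> i & 1])  # ['C2', 'C5']
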
Bitmap Index

• Allows quick search in data cubes
• Advantageous compared to hash and tree indices
• Useful for low-cardinality domains, because comparison, join, and aggregation operations are reduced to bitmap arithmetic
  › (Reduced processing time!)
• Significant reduction in space and I/O, since a string of characters can be represented by a single bit
Join indexing method

• The join indexing method gained popularity from its use in relational database query processing
• Traditional indexing maps the value in a given column to a list of rows having that value
• In contrast, if two relations R(RID, A) and S(B, SID) join on attributes A and B, then the join index record contains the pair (RID, SID) from the R and S relations
• Join index records can identify joinable tuples without performing costly join operations
Indexing OLAP Data: Join Indices

• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
  › E.g., for a fact table Sales and two dimensions, city and product:
    ◦ A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city
  › Join indices can span multiple dimensions (a single-dimension sketch follows below)
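
A minimal sketch of a join index as a mapping from dimension values to fact-table row IDs; the table contents and names are illustrative:

    from collections import defaultdict

    # Toy fact table: R-ID -> (city, product, sales_in_euro); values made up.
    sales = {
        "T57":  ("Madrid", "TV",    400),
        "T238": ("Madrid", "Phone", 120),
        "T884": ("Berlin", "TV",    250),
    }

    # Join index on city: each distinct city -> list of fact-table R-IDs.
    join_index_city = defaultdict(list)
    for rid, (city, _product, _amount) in sales.items():
        join_index_city[city].append(rid)

    # Joinable tuples for 'Madrid' are found without a costly join.
    print(join_index_city["Madrid"])  # ['T57', 'T238']
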
Multiway Array Aggregation

• Sometimes we need to precompute all of the cuboids for a given data cube
  › (full materialization)
• Cuboids can be stored on secondary storage and accessed when necessary
• Methods must take into account the limited amount of main memory and time
• Different techniques are used for ROLAP and MOLAP
ROLAP cube computation

• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and group related tuples
• Grouping is performed on some subaggregates as a partial grouping step
  › Speeds up computation
• An aggregate may be computed from previously computed aggregates, rather than from the base fact table (see the sketch below)
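
A minimal sketch of computing one aggregate from a previously computed, smaller aggregate instead of rescanning the base fact table; the data is illustrative:

    from collections import defaultdict

    # (city, item) cuboid, assumed already computed from the base fact table.
    city_item = {
        ("Madrid", "TV"): 400, ("Madrid", "Phone"): 120,
        ("Berlin", "TV"): 250, ("Berlin", "Phone"): 300,
    }

    # The (city) cuboid rolls up from (city, item); the much larger
    # base fact table is never touched.
    city = defaultdict(int)
    for (c, _item), total in city_item.items():
        city[c] += total

    print(dict(city))  # {'Madrid': 520, 'Berlin': 550}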
MOLAP and cube computation

• MOLAP cannot perform the value-based reordering, because it uses direct array addressing
  › Partition arrays into chunks (a small subcube which fits in memory)
  › Use compressed sparse array addressing for empty array cells
• Compute aggregates in a "multiway" fashion by visiting cube cells in an order that minimizes the number of times each cell must be visited, reducing memory access and storage costs (sketched below)
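
A minimal sketch of chunked multiway aggregation over a small dense 3-D array, assuming NumPy is available; the dimension sizes and chunk size are illustrative, and a real MOLAP engine would stream chunks from disk:

    import numpy as np

    A, B, C = 4, 4, 4                        # dimension sizes (toy example)
    cube = np.random.default_rng(0).integers(0, 10, (A, B, C))

    chunk = 2                                # chunk edge length
    ab = np.zeros((A, B), dtype=int)         # aggregate over C
    ac = np.zeros((A, C), dtype=int)         # aggregate over B
    bc = np.zeros((B, C), dtype=int)         # aggregate over A

    # Visit the cube one memory-sized chunk at a time; each cell is read
    # once and contributes to all three 2-D cuboids simultaneously.
    for i in range(0, A, chunk):
        for j in range(0, B, chunk):
            for k in range(0, C, chunk):
                part = cube[i:i+chunk, j:j+chunk, k:k+chunk]
                ab[i:i+chunk, j:j+chunk] += part.sum(axis=2)
                ac[i:i+chunk, k:k+chunk] += part.sum(axis=1)
                bc[j:j+chunk, k:k+chunk] += part.sum(axis=0)

    assert (ab == cube.sum(axis=2)).all()    # matches a direct full scan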
Iceberg Cube

• Compute only the cuboid cells whose count (or other aggregate) satisfies a condition such as
    HAVING COUNT(*) >= minsup

• Motivation
  › Only calculate "interesting" cells: data above a certain threshold
  › Avoid explosive growth of the cube

• Suppose 100 dimensions and only 1 base cell. How many aggregate cells are there if count >= 1? What about count >= 2?
  › With count >= 1, the single base cell generates 2^100 - 1 aggregate cells (each dimension either keeps its value or is rolled up to "all"), each with count 1; with count >= 2, none of them qualify, so the iceberg condition prunes them all (see the sketch below)
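
A minimal sketch of iceberg-style pruning with a count threshold: compute every group-by of a tiny base table and keep only the cells whose count clears minsup; the data and names are illustrative:

    from collections import Counter
    from itertools import combinations

    # Toy base table over three dimensions; rows are made up.
    rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1")]
    dims = (0, 1, 2)
    minsup = 2

    iceberg = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            # Count each cell of this cuboid ('*' = dimension aggregated out).
            counts = Counter(
                tuple(row[d] if d in subset else "*" for d in dims)
                for row in rows
            )
            # Keep only the cells that satisfy the iceberg condition.
            for cell, n in counts.items():
                if n >= minsup:
                    iceberg[cell] = n

    print(iceberg)   # e.g. ('a1', '*', '*'): 3 and ('a1', 'b1', '*'): 2 survive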
THANK YOU

CHAPTER 3 (TOPIC – IV) CONCLUDED
