Assignment 4-1

The document discusses snowflake schemas for data warehousing. It defines a snowflake schema as a refinement of a star schema where some dimensional hierarchies are normalized into multiple tables. It provides an example snowflake schema for a university data warehouse with dimensions for student, course, semester, and instructor, and describes OLAP operations that could be performed on the schema to analyze average grades by student.


4.1 State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable to the update-driven approach.

Solution:

The update-driven approach, which constructs and uses data warehouses, has the following benefits: data can be cleaned before queries are run on it; summary information can be calculated beforehand; and data from different sources can be aggregated so that a single query style, such as SQL, can be used.

All of this makes queries faster and easier to write (thus saving money on developer salaries and infrastructure costs). This comes at the price of the results possibly being slightly out of date, since the tables contain precalculated values. Often this is not a problem.

The query-driven approach is preferable when up-to-the-moment data is required, when the information sources are highly volatile, or when queries are posed only rarely, since in those cases the cost of building and continually refreshing a warehouse is not repaid.
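A minimal sketch of the "summary information calculated beforehand" idea above, assuming a hypothetical raw sales table and nightly summary (none of these names come from the text):

-- Precompute the summary once, e.g. during the nightly warehouse refresh:
CREATE TABLE daily_sales_summary AS
SELECT store_id, sale_date,
       SUM(amount) AS total_amount,
       COUNT(*)    AS num_sales
FROM sales
GROUP BY store_id, sale_date;

-- Analysts then query the small, pre-aggregated table instead of the raw sources:
SELECT store_id, SUM(total_amount) AS revenue
FROM daily_sales_summary
WHERE sale_date >= DATE '2010-01-01'
GROUP BY store_id;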

4.2 Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Discovery-driven cube, multifeature cube, virtual warehouse

Solution:

The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema: The most common modeling paradigm is the star schema, in which the
data warehouse contains (1) a large central table (fact table) containing the bulk of the
data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables),
one for each dimension. The schema graph resembles a starburst, with the dimension
tables displayed in a radial pattern around the central fact table.
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in the normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However, this
saving of space is negligible in comparison to the typical magnitude of the fact table.
Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted. Hence, although the snowflake schema reduces
redundancy, it is not as popular as the star schema in data warehouse design.

The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

4.3 Suppose that a data warehouse consists of the three dimensions time, doctor,
and patient, and the two measures count and charge, where charge is the fee that
a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling
data warehouses.

(a) Star schema: a fact table in the middle connected to a set of dimension tables.

Snowflake schema: a refinement of the star schema where some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Fact constellation: multiple fact tables share dimension tables. Viewed as a collection of stars, it is therefore also called a galaxy schema or fact constellation.

(b) Draw a schema diagram for the above data warehouse using one of the
schema classes listed in (a).
(c) Starting with the base cuboid [day,doctor,patient], what specific OLAP
operations should be performed in order to list the total fee collected by each
doctor in 2010?
(d) To obtain the same list, write an SQL query assuming the data are stored in a
relational database with the schema fee (day, month, year, doctor, hospital,
patient, count, charge).
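For (d), a minimal sketch of one possible query, using exactly the fee(day, month, year, doctor, hospital, patient, count, charge) schema given in the question:

-- Total fee collected by each doctor in 2010,
-- assuming one fact row per visit.
SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2010
GROUP BY doctor;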

4.4 Suppose that a data warehouse for Big University consists of the four
dimensions student, course, semester, and instructor, and two measures count
and avg grade. At the lowest conceptual level (e.g., for a given student, course,
semester, and instructor combination), the avg grade measure stores the actual
course grade of the student. At higher conceptual levels, avg grade stores the
average grade for the given combination.

In this problem we have to draw a snowflake schema diagram for the data warehouse. The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected to multiple dimensions. In the snowflake schema, dimensions are present in a normalized form in multiple related tables. The snowflake structure materializes when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables.

(a) Draw a snowflake schema diagram for the data warehouse.

The snowflake design is the result of further expansion and normalization of the dimension tables. In other words, a dimension table is said to be snowflaked if the low-cardinality attributes of the dimension have been divided into separate normalized tables. These tables are then joined to the original dimension table with referential constraints.
Generally, snowflaking is not recommended in the dimension tables, as it hampers the understandability and performance of the dimensional model: more tables must be joined to satisfy the queries.
Characteristics of the snowflake schema:

 The snowflake schema uses little disk space.
 It is easy to implement new dimensions that are added to the schema.
 Because there are multiple tables, query performance is reduced.
 The dimension table consists of two or more sets of attributes that define information at different grains.
 The sets of attributes of the same dimension table may be populated by different source systems.

Advantages:

 It provides structured data, which reduces the problem of data integrity.
 It uses little disk space because the data are highly structured.

Disadvantages:

 Snowflaking reduces the space consumed by dimension tables, but compared with the entire data warehouse the saving is usually insignificant.
 Avoid snowflaking or normalization of a dimension table unless it is required and appropriate.
 Do not snowflake hierarchies of a dimension table into separate tables; hierarchies should belong to the dimension table only and should never be snowflaked.
 Multiple hierarchies that belong to the same dimension should be designed at the lowest possible level of detail.

(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should you perform in order to list the average grade of CS courses for each Big University student?

Online analytical processing (OLAP) is a category of software that allows users to analyze information from multiple database systems at the same time. It is a technology that enables analysts to extract and view business data from different points of view. Analysts frequently need to group, aggregate, and join data. These OLAP operations in data mining are resource intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster. Starting with the base cuboid [student, course, semester, instructor]:

1. Roll-up on course from course_id to department.
2. Roll-up on student from student_id to university.
3. Dice on course, student with department = "CS" and university = "Big University".
4. Drill-down on student from university to student name.
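As an illustrative sketch only, the same list could be produced in SQL over a hypothetical relational rendering of the snowflake schema (the table and column names grade_fact, course, student, department, and university are assumptions):

SELECT s.student_name, AVG(g.grade) AS avg_grade
FROM grade_fact g
JOIN course  c ON g.course_id  = c.course_id
JOIN student s ON g.student_id = s.student_id
WHERE c.department = 'CS'
  AND s.university = 'Big University'
GROUP BY s.student_name;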

(c) If each dimension has five levels (including all), such as “student < major <
status < university < all”, how many cuboids will this cube contain
(including the base and apex cuboids)?

The cube will contain 5^4 = 625 cuboids.
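This is an instance of the standard cuboid count for a cube with concept hierarchies: with n dimensions, where dimension i has L_i levels (excluding the virtual level all), the total number of cuboids is

T = (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1)

Here each of the four dimensions has 5 levels including all, so each factor is 5 and T = 5^4 = 625.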

4.5 Suppose that a data warehouse consists of the four dimensions date,
spectator, location, and game, and the two measures count and charge, where
charge is the fare that a spectator pays when watching a game on a given
date. Spectators may be students, adults, or seniors, with each category
having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

 Star schema diagram for the given data (the diagram is not reproduced here): a central fact table holding the measures count and charge, with keys to the date, spectator, location, and game dimension tables.
(b) Starting with the base cuboid [date,spectator,location, game], what specific
OLAP operations should you perform in order to list the total charge paid by
student spectators at GM Place in 2010?

The OLAP operations to be performed are:

Roll-up on the date, spectator, location, and game dimension tables from:

i) date_id to year

ii) spectator_id to status

iii) location_id to location_name

iv) game_id to game_name

Query to be performed to get the solution:

dice(status = "students", location_name = "GM_Place", and year = 2010)
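A minimal SQL sketch of the same dice, assuming the roll-ups above have produced a hypothetical summary table charge_summary(year, status, location_name, game_name, total_charge):

SELECT SUM(total_charge) AS total_paid
FROM charge_summary
WHERE status = 'students'
  AND location_name = 'GM_Place'
  AND year = 2010;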


(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.

Advantages and disadvantages of using a bitmap index structure

Advantages:

 It helps in faster retrieval of data when columns are indexed on their low-cardinality attributes.
 Data from multiple dimension tables can be combined when executing a query.
 Handling a large number of records in tables is efficient.
 Comparison, join, and aggregation operations reduce to fast bit arithmetic on the bitmaps.

Disadvantages:

 If a table contains only a few records, bitmap indexing may bring little benefit.
 A deadlock situation may occur while performing operations such as insertion, deletion, or update on the data.
 Using a bitmap join index, only one table can be updated at a time.
 It cannot be created with the UNIQUE attribute; it is suitable only for low-cardinality columns whose values repeat across many rows.

4.6 A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.

Similarities between the snowflake schema and the star schema:

i) Both the snowflake and the star schema have a fact table that is surrounded by dimension tables.

ii) In both schemas, the fact table contains the keys to the dimension tables together with the measures.

Differences between the star schema and the snowflake schema, with advantages and disadvantages:

1) The star schema stores denormalized data; the snowflake schema stores normalized data.

2) The star schema uses a simple database design; the snowflake schema uses a more complex one.

3) In the star schema each dimension is a single categorical table; in the snowflake schema dimension tables are divided into many stages or pieces.

4) Queries are processed faster on the star schema than on the snowflake schema.

5) The star schema holds more redundant data; the snowflake schema holds less.

6) In the star schema one join is enough to create a relationship between the fact table and a dimension; in the snowflake schema a larger number of joins is required.

Which schema is better depends on certain criteria. Since the snowflake schema is more complex, it is used in situations where more detailed questions must be answered, such as how many new customers were gained in a certain project, whereas the star schema is used in simpler situations, such as finding the revenue of a certain customer. Overall, the schema that is more useful in real life is the snowflake schema: although it is more complex and a bit slower, it covers most situations.

4.7 Design a data warehouse for a regional weather bureau. The weather bureau
has about 1000 probes, which are scattered throughout various land and ocean
locations in the region to collect basic weather data, including air pressure,
temperature, and precipitation at each hour. All data are sent to the central
station, which has collected such data for more than 10 years. Your design
should facilitate efficient querying and online analytical processing, and derive
general weather patterns in multidimensional space.

Star schema: In data warehousing, the star schema is one of the more elementary and widely used schemas. In a star schema, there are one or more fact tables at the center of the warehouse referencing different dimension tables.

Star schema for a regional weather bureau: For a regional weather bureau, weather data can be collected along dimensions such as date, temperature, precipitation, and location. There will be one fact table at the center that combines data from all these dimension tables. (The star schema diagram is not reproduced here.)
This schema allows data to be queried from different perspectives, using various combinations of dimensions.

4.8 A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse, multidimensional matrix.

(a) Present an example illustrating such a huge and sparse data cube.

(b) Design an implementation method that can elegantly overcome this sparse
matrix problem. Note that you need to explain your data structures in detail and
discuss the space needed, as well as how to retrieve data from your structures.
(c) Modify your design in (b) to handle incremental data updates. Give the reasoning behind your new design.

4.9 Regarding the computation of measures in a data cube:

(a) Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube.
(b) For a data cube with the three dimensions time, location, and item, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks. Hint: The formula for computing the variance is (1/N) Σ_{i=1}^{N} (x_i − x̄)^2, where x̄ is the average of the x_i.
(c) Suppose the function is “top 10 sales.” Discuss how to efficiently compute
this measure in a data cube.

4.10 Suppose a company wants to design a data warehouse to facilitate the analysis of moving vehicles in an online analytical processing manner. The company registers huge amounts of auto movement data in the format of (Auto ID, location, speed, time). Each Auto ID represents a vehicle associated with information (e.g., vehicle category, driver category), and each location may be associated with a street in a city. Assume that a street map is available for the city.
(a) Design such a data warehouse to facilitate effective online analytical processing in multidimensional space.

To design a data warehouse for the analysis of moving vehicles, we can consider vehicle as a fact table that points to four dimensions: auto, time, location, and speed. The measures can vary depending on the desired target and the power required. Here, the measures considered are vehicles sold and vehicle mileage. (The star schema diagram is not reproduced here.)
(b) The movement data may contain noise. Discuss how you would develop a
method to automatically discover data records that were likely erroneously
registered in the data repository.

To handle the noise in data, we first need to do data cleaning. Missing values may be
filled or dropped entirely, depending on the tolerance of the system. Then we can use
some data smoothing techniques to remove noisy data points, for example, regression
and outlier analysis. Finally, we can also set up some rules to detect inconsistent data
and remove them based on domain knowledge.
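As a minimal illustration of the rule-based step, assuming a relational rendering auto_movement(auto_id, location, speed, time) of the registered format (an assumed table, not from the text), physically implausible records can be flagged directly:

-- Flag records violating simple domain rules; the 300 km/h ceiling
-- is an assumed threshold for road vehicles.
SELECT *
FROM auto_movement
WHERE speed < 0
   OR speed > 300;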

(c) The movement data may be sparse. Discuss how you would develop a method
that constructs a reliable data warehouse despite the sparsity of data.

(d) If you want to drive from A to B starting at a particular time, discuss how a
system may use the data in this warehouse to work out a fast route.
4.11 Radio-frequency identification is commonly used to trace object movement
and perform inventory control. An RFID reader can successfully read an RFID tag
from a limited distance at any scheduled time. Suppose a company wants to
design a data warehouse to facilitate the analysis of objects with RFID tags in an
online analytical processing manner. The company registers huge amounts of
RFID data in the format of (RFID, at location, time), and also has some information
about the objects carrying the RFID tag, for example, (RFID, product name,
product category, producer, date produced, price).

(a) Design a data warehouse to facilitate effective registration and online analytical processing of such data.
(b) The RFID data may contain lots of redundant information. Discuss a method
that maximally reduces redundancy during data registration in the RFID data
warehouse.

(c) The RFID data may contain lots of noise such as missing registration and
misread IDs. Discuss a method that effectively cleans up the noisy data in the
RFID data warehouse.

(d) You may want to perform online analytical processing to determine how many
TV sets were shipped from the LA seaport to BestBuy in Champaign, IL, by
month, brand, and price range. Outline how this could be done efficiently if you
were to store such RFID data in the warehouse.
(e) If a customer returns a jug of milk and complains that it has spoiled before its expiration date, discuss how you can investigate such a case in the warehouse to find out what the problem is, either in shipping or in storage.

4.12 In many applications, new data sets are incrementally added to the existing
large data sets. Thus, an important consideration is whether a measure can be
computed efficiently in an incremental manner. Use count, standard deviation,
and median as examples to show that a distributive or algebraic measure
facilitates efficient incremental computation, whereas a holistic measure does
not.
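A minimal sketch of the contrast the question asks about (the tables stats_agg and new_batch are assumptions): count is distributive and standard deviation is algebraic over the components (count, sum, sum of squares), so both can be maintained incrementally, whereas the median has no fixed-size summary and must be recomputed from all the data.

-- One row of distributive components per loaded batch:
CREATE TABLE stats_agg (batch_id INT, n BIGINT, s DOUBLE PRECISION, ss DOUBLE PRECISION);

-- Registering a new batch touches none of the old data
-- (42 is an assumed id for this batch):
INSERT INTO stats_agg
SELECT 42, COUNT(*), SUM(x), SUM(x * x) FROM new_batch;

-- Count and standard deviation recovered from the merged components:
SELECT SUM(n) AS total_count,
       SQRT(SUM(ss) / SUM(n) - POWER(SUM(s) / SUM(n), 2)) AS std_dev
FROM stats_agg;

-- The median, being holistic, cannot be derived from such components;
-- every increment forces a rescan of the full underlying data.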

Step 1:

Data mining is significant:

By examining customer market reports, purchases, and card payments, data mining enables banks to work with credit-quality and anti-fraud systems. When creating a new promotional campaign, business intelligence also helps firms fully appreciate the online behaviors and interests of their clients.
Explanation:
There are many popular data mining techniques to take into account; some of the most common are clustering, data cleaning, association, data warehousing, machine learning, data visualization, classification, neural networks, and prediction.

Step 2:

Data mining obstacles:

1. Environmental and economic challenges.

2. Insufficient and messy data.

3. Data distribution.

4. Data complexity.

5. Performance.

6. Scalability and algorithmic efficiency.

7. Improvement of mining methods.

8. Incorporation of background knowledge.

Data mining is not a fad:

Data mining isn't a new fad. Rather, the availability of large volumes of data and the imminent need to transform those data into useful information and knowledge have driven its growth. Thus, data mining can be seen as the outcome of the natural evolution of information technology.

Explanation:
Metadata can be classified as descriptive, administrative, or structural. Resource discovery, classification, and selection are made possible by descriptive metadata, which may include elements such as the title, the author, and the subjects. Resource management is facilitated by administrative metadata.

4.13 Suppose that we need to record three measures in a data cube: min(),
average(), and median(). Design an efficient computation and storage method for
each measure given that the cube allows data to be deleted incrementally (i.e., in
small portions at a time) from the cube.

Answer

For min(), keep a <min_val, count> pair for each cuboid to register the smallest value and the number of times it occurs. For each deleted tuple, if its value is greater than min_val, there is nothing to do; otherwise, decrement the count of the corresponding node. If a count goes down to zero, recompute the pair by rescanning the cell's data.

For average(), keep a <sum, count> pair for each cuboid. For each deleted tuple with value v, decrement the count and subtract v from the sum; then average = sum/count.

For median(), keep a small number p of centered values (e.g., p = 10), plus two counts: up_count and down_count. Each removal may change a count or remove a centered value. If the median no longer falls among these centered values, recompute the set; otherwise, the median can be easily calculated from it.
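A minimal sketch of the average() bookkeeping, assuming a hypothetical per-cell aggregate table cell_agg(cell_id, sum_val, cnt) and named placeholders :cell and :v for the deleted tuple's cell key and measure value:

-- Remove one deleted tuple's contribution from its cell:
UPDATE cell_agg
SET sum_val = sum_val - :v,
    cnt     = cnt - 1
WHERE cell_id = :cell;

-- The measure is then read back as sum/count:
SELECT sum_val / cnt AS avg_val
FROM cell_agg
WHERE cell_id = :cell;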

4.14

In data warehouse technology, a multiple dimensional view can be implemented by a relational database technique (ROLAP), by a multidimensional database technique (MOLAP), or by a hybrid database technique (HOLAP).
(a) Briefly describe each implementation technique.
(b) For each technique, explain how each of the following functions may be implemented: i. the generation of a data warehouse (including aggregation); ii. roll-up; iii. drill-down; iv. incremental updating.
(c) Which implementation techniques do you prefer, and why?

Answer

OLAP: Online analytical processing is a class of tools which can extract and present multidimensional data from different points of view. OLAP structures the data in a hierarchical manner. OLAP functions include trend analysis, drilling down, summarization of data, and data rotation. There are three types of OLAP:

1. ROLAP 2. MOLAP 3. HOLAP

1. ROLAP (relational OLAP servers): These servers sit between a relational back-end server and client front-end tools. They use RDBMSs or extended RDBMSs to store and manage warehouse data, and OLAP middleware to fill in the gaps. ROLAP servers can optimize for each back-end DBMS, and they implement aggregation navigation logic and other tools.

i. The generation of a data warehouse (including aggregation): Initial aggregation is achieved using SQL via group-bys. The cube operator aggregates over all subsets of the dimensions specified in the operation, which generates a single cube. ROLAP depends on tuples and relational tables as its basic data structures. The fact table stores data at the abstraction level indicated by the join keys in the schema for the given data cube. Aggregated data is also stored in fact tables. To optimize ROLAP cube computation we use techniques such as sorting, hashing, and grouping. Grouping is performed on sub-aggregates, i.e., aggregates that are derived from previously computed aggregates.
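For illustration, many relational engines expose this operator directly through GROUP BY CUBE; reusing the fee table of exercise 4.3 as an assumed example:

-- Aggregates every subset of {doctor, year}: the cuboids
-- (doctor, year), (doctor), (year), and the apex ().
SELECT doctor, year, SUM(charge) AS total_charge
FROM fee
GROUP BY CUBE (doctor, year);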

ii. Roll-up: The roll-up operation performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction such that
one or more dimensions are removed from the given cube.

iii. drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing new dimensions.

iv. Incremental updating: Data warehouse implementation can be broken down into segments or increments. An increment is a defined data warehouse implementation project that has a specified beginning and end. Incremental data capture is a time-dependent model for capturing changes to operational systems. This technique is best applied in circumstances where changes in the data are significantly smaller than the size of the data set for a specific period of time. These techniques are more complex than static capture, because they are closely tied to the DBMS or to the operational software which updates the DBMS.

2. MOLAP (multidimensional OLAP servers): These servers allow for multidimensional views of data through array-based multidimensional engines. The first generation of server-based multidimensional OLAP (MOLAP) solutions use multidimensional databases (MDDBs). The main advantage of an MDDB over an RDBMS is that an MDDB can provide information quickly since it is calculated and stored at the appropriate hierarchy level in advance.

i. The generation of a data warehouse (including aggregation): MOLAP uses array structures to store data for OLAP. Initial aggregation is done using SQL via group-bys. MOLAP follows a very different cube computation scheme than ROLAP. It uses direct array addressing, where dimension values are accessed via the position or index of the corresponding array element. The array-based cube is generated by partitioning the array into chunks. A chunk is a subcube that is small enough to fit into the memory available for cube computation. Aggregates are computed by visiting cube cells.

ii. Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction such that one or more dimensions are removed from the given cube. Here, roll-up aggregates over the chunks that make up subcubes, and these subcubes together form the array.

iii. drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data
to more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing new dimensions.

iv. Incremental updating: Data warehouse implementation can be broken down into segments or increments. An increment is a defined data warehouse implementation project that has a specified beginning and end. Incremental data capture is a time-dependent model for capturing changes to operational systems. The technique for MOLAP updates would be more sophisticated due to the additional complexity of arrays and subcubes.

3. HOLAP (hybrid OLAP servers): HOLAP combines elements from MOLAP and ROLAP. HOLAP keeps the original data in relational tables but stores aggregations in a multidimensional format, utilizing both pre-calculated cubes and relational data sources.
i. The generation of a data warehouse (including aggregation): The generation consists of a combined approach of both MOLAP and ROLAP. A HOLAP server stores large volumes of detailed data in a relational database, while aggregations are stored separately in a multidimensional store.

ii. Roll-up: It combines the roll-up operations of both ROLAP and MOLAP.
iii. Drill-down: It combines the drill-down operations of both ROLAP and MOLAP.
iv. Incremental updating: Data warehouse implementation can be broken down into segments or increments. An increment is a defined data warehouse implementation project that has a specified beginning and end. Incremental data capture is a time-dependent model for capturing changes to operational systems. HOLAP performs both the ROLAP DBMS updates and the MOLAP updates of arrays and subcubes.

(c) The HOLAP implementation technique is mostly preferred because it combines the greater scalability of ROLAP with the faster computation of MOLAP. HOLAP can store data either in an RDBMS or in an MDDB. It also improves performance and manages the storage of data.

4.15 Suppose that a data warehouse contains 20 dimensions, each with about five
levels of granularity.

Answer

(a) Users are mainly interested in four particular dimensions, each having three
frequently accessed levels for rolling up and drilling down. How would you
design a data cube structure to support this preference efficiently?

An efficient data cube structure to support this preference would be to use partial materialization, or selective computation of cuboids. By computing only the proper subset of the whole set of possible cuboids, the total amount of storage space required would be minimized while maintaining a fast response time and avoiding redundant computation.

(b) At times, a user may want to drill through the cube to the raw data for one or
two particular dimensions. How would you support this feature?

Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly. Because the user may need this feature only infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable.

4.16

A data cube, C, has n dimensions, and each dimension has exactly p distinct
values in the base cuboid. Assume that there are no concept hierarchies
associated with the dimensions.
Answer

a. What is the maximum number of cells possible in the base cuboid?

p^n.
This is the maximum number of distinct tuples that you can form with p distinct values per dimension.

b. What is the minimum number of cells possible in the base cuboid?

p.
You need at least p tuples to contain p distinct values per dimension. In this case no
tuple shares any value on any dimension.

c. What is the maximum number of cells possible (including both base cells and aggregate cells) in the data cube, C?

(p + 1)^n.
The argument is similar to that of part (a), but now we have p + 1 because in addition to
the p distinct values of each dimension we can also choose ∗.

d. What is the minimum number of cells possible in the data cube, C?

(2^n − 1) × p + 1.
The minimum number of cells occurs when each cuboid contains only p cells, except for the apex cuboid, which contains a single cell.
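As a short worked check: without concept hierarchies the cube has 2^n cuboids in total; in the minimal configuration of part (b), every cuboid except the apex contains exactly p cells, giving (2^n − 1) × p + 1 cells overall, where the trailing + 1 is the single apex cell.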

4.17

What are the differences between the three main types of data warehouse usage:
information processing, analytical processing, and data mining? Discuss the
motivation behind OLAP mining (OLAM).

Answer
