Assignment 4-1
4.1 State why, for the integration of multiple heterogeneous information sources, many
companies in the industry prefer the update-driven approach (which constructs and
uses data warehouses) rather than the query-driven approach (which applies wrappers
and integrators).
Solution:
There are situations where the query-driven approach is preferable, but the update-driven
approach, which constructs and uses a data warehouse, has the following benefits: data
can be cleaned before queries are run on it; summary information can be calculated
beforehand; and data from different sources can be integrated so that a single query
style, such as SQL, can be used.
All of this makes queries faster and easier to write (thus saving money on developer
salaries and infrastructure costs). The price is that results may be slightly out of date,
since the tables contain some precalculated values. Often this is not a problem.
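For instance (a sketch only; the table and column names are hypothetical), a summary
table precomputed during the nightly load lets one cheap, uniform query replace an
expensive on-the-fly integration across the source systems:

```sql
-- Precomputed during the nightly warehouse load
CREATE TABLE daily_sales_summary AS
SELECT sale_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS num_sales
FROM integrated_sales            -- already cleaned and consolidated
GROUP BY sale_date, region;

-- At query time: one fast, uniform SQL query instead of
-- per-source wrappers and an on-the-fly integration step
SELECT region, SUM(total_amount) AS revenue
FROM daily_sales_summary
WHERE sale_date >= DATE '2010-01-01'
GROUP BY region;
```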
4.2 Briefly compare the following concepts. You may use an example to explain
your point(s). (a) Snowflake schema, fact constellation, starnet query model (b)
Data cleaning, data transformation, refresh (c) Discovery-driven cube,
multifeature cube, virtual warehouse
Solution:
The most popular data model for a data warehouse is a multidimensional model. Such a
model can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.
Star schema: The most common modeling paradigm is the star schema, in which the
data warehouse contains (1) a large central table (fact table) containing the bulk of the
data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables),
one for each dimension. The schema graph resembles a starburst, with the dimension
tables displayed in a radial pattern around the central fact table.
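A minimal sketch of a star schema in SQL, using a hypothetical sales example (all
table and column names are illustrative):

```sql
-- Dimension tables, one per dimension, kept denormalized
CREATE TABLE time_dim   (time_id INT PRIMARY KEY, day INT, month INT,
                         quarter INT, year INT);
CREATE TABLE item_dim   (item_id INT PRIMARY KEY, item_name VARCHAR(40),
                         brand VARCHAR(40), type VARCHAR(40));
CREATE TABLE branch_dim (branch_id INT PRIMARY KEY, branch_name VARCHAR(40),
                         branch_type VARCHAR(20));

-- Central fact table holding the bulk of the data
CREATE TABLE sales_fact (
  time_id      INT REFERENCES time_dim(time_id),
  item_id      INT REFERENCES item_dim(item_id),
  branch_id    INT REFERENCES branch_dim(branch_id),
  dollars_sold DECIMAL(10,2),
  units_sold   INT
);
```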
Snowflake schema: The snowflake schema is a variant of the star schema model,
where some dimension tables are normalized, thereby further splitting the data into
additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in the normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However, this
saving of space is negligible in comparison to the typical magnitude of the fact table.
Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted. Hence, although the snowflake schema reduces
redundancy, it is not as popular as the star schema in data warehouse design (a
snowflaked version of the item dimension from the sketch above is shown below).
Fact constellation: A fact constellation schema contains multiple fact tables that
share dimension tables. This kind of schema is also called a galaxy schema.
Starnet query model: The starnet query model is not a storage schema but a query
framework: radial lines emanating from a central point represent the concept
hierarchies of the dimensions, and each point (footprint) along a line represents an
abstraction level at which OLAP operations such as roll-up and drill-down can be
applied.
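As referenced above, a minimal SQL sketch of the snowflake idea, continuing the
hypothetical sales example (the low-cardinality supplier attributes of item are
split into their own normalized table):

```sql
-- Snowflaked dimension: supplier attributes moved to a separate table
CREATE TABLE supplier_dim (supplier_id INT PRIMARY KEY,
                           supplier_name VARCHAR(40),
                           supplier_type VARCHAR(20));

-- Revised item dimension (replaces the flat version in the star sketch);
-- queries touching supplier attributes now need one extra join
CREATE TABLE item_dim (
  item_id     INT PRIMARY KEY,
  item_name   VARCHAR(40),
  brand       VARCHAR(40),
  type        VARCHAR(40),
  supplier_id INT REFERENCES supplier_dim(supplier_id)
);
```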
4.3 Suppose that a data warehouse consists of the three dimensions time, doctor,
and patient, and the two measures count and charge, where charge is the fee that
a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling
data warehouses.
(a) The three classes of schemas popularly used for modeling data warehouses are: the
star schema (a fact table in the middle connected to a set of dimension tables), the
snowflake schema (a refinement of the star schema in which some dimension tables are
normalized into a hierarchy of smaller tables), and the fact constellation schema
(multiple fact tables sharing dimension tables).
(b) Draw a schema diagram for the above data warehouse using one of the
schema classes listed in (a).
(c) Starting with the base cuboid [day,doctor,patient], what specific OLAP
operations should be performed in order to list the total fee collected by each
doctor in 2010?
(d) To obtain the same list, write an SQL query assuming the data are stored in a
relational database with the schema fee (day, month, year, doctor, hospital,
patient, count, charge).
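A sketch of the standard answer: for (c), roll up on time from day to year, dice for
year = 2010, and roll up on patient from individual patient to all, which leaves the
total charge grouped by doctor. For (d), using the flat fee table exactly as given in
the question:

```sql
-- 4.3(d): total fee collected by each doctor in 2010
SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2010
GROUP BY doctor;
```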
4.4 Suppose that a data warehouse for Big University consists of the four
dimensions student, course, semester, and instructor, and two measures count
and avg grade. At the lowest conceptual level (e.g., for a given student, course,
semester, and instructor combination), the avg grade measure stores the actual
course grade of the student. At higher conceptual levels, avg grade stores the
average grade for the given combination.
Here in this problem we have to draw a snowflake schema diagram for the data
warehouse. The snowflake schema is a variant of the star schema, in which the
centralized fact table is connected to multiple dimensions. In the snowflake schema,
dimensions are present in normalized form, in multiple related tables. The snowflake
structure materializes when the dimensions of a star schema are detailed and highly
structured, having several levels of relationship, and the child tables have multiple
parent tables. The snowflake effect affects only the dimension tables; it does not
affect the fact tables.
The snowflake design is the result of further expansion and normalization of the
dimension tables. In other words, a dimension table is said to be snowflaked if its
low-cardinality attributes have been moved to separate normalized tables, which are
then joined back to the original dimension table with referential constraints.
Generally, snowflaking is not recommended for dimension tables, as it hampers the
understandability and performance of the dimensional model: more tables must be
joined to satisfy queries.
Characteristics of snowflake schema:-
A centralized fact table is connected to normalized dimension tables, each of which
may span several related tables with multiple levels of relationship.
Advantages:-
Reduced redundancy and storage in the dimension tables, which also makes the
dimension data easier to maintain.
Disadvantages:-
More joins are needed to answer a query, which hurts browsing performance and makes
the model harder to understand.
(c) If each dimension has five levels (including all), such as “student < major <
status < university < all”, how many cuboids will this cube contain
(including the base and apex cuboids)?
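Solution: Each of the four dimensions can independently appear at any of its five
abstraction levels, so the cube contains 5^4 = 625 cuboids (including the base and
apex cuboids).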
4.5 Suppose that a data warehouse consists of the four dimensions date,
spectator, location, and game, and the two measures count and charge, where
charge is the fare that a spectator pays when watching a game on a given
date. Spectators may be students, adults, or seniors, with each category
having its own charge rate.
i) Roll-up on date from date_id to year.
Advantages:-
Bitmap indexing is efficient for low-cardinality attributes such as the spectator
category: comparison, join, and aggregation operations reduce to fast bit arithmetic,
and the bit vectors typically occupy much less space than the indexed data.
Disadvantages:-
If a table contains only a small number of records, bitmap indexing may not pay off.
A deadlock situation may occur while performing insert, delete, and update
operations on the data.
Using a bitmap join index, only one of the joined tables can be updated at a time.
It cannot be created with the UNIQUE attribute; i.e., the indexed attribute should
contain repeated values.
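For concreteness, in systems that support bitmap indexes (Oracle-style syntax shown;
the table and column names are hypothetical):

```sql
-- Bitmap index on a low-cardinality attribute of the spectator dimension
-- (status is assumed to take only the values 'student', 'adult', 'senior')
CREATE BITMAP INDEX spectator_status_bix
  ON spectator (status);
```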
i) Both the snowflake schema and the star schema have a fact table that is surrounded
by their dimension tables.
ii) In both schemas, the dimension tables describe the dimensions along which the
facts in the fact table are aggregated; the schemas differ in whether those dimension
tables are normalized.
Difference between star schema and snowflake schema with Advantages and
Disadvantages:
Which type of schema is better depends on the requirements. Since the snowflake
schema is more complex, it is used in situations that need normalized dimension
hierarchies, such as finding the number of new customers gained in a certain project.
The star schema is used in simpler situations, such as reporting the revenue of a
certain customer, and it generally answers queries faster because fewer joins are
required. Overall, although the snowflake schema reduces redundancy, the star schema
remains the more common choice in real-life data warehouse design because it is
simpler and faster to query.
4.7 Design a data warehouse for a regional weather bureau. The weather bureau
has about 1000 probes, which are scattered throughout various land and ocean
locations in the region to collect basic weather data, including air pressure,
temperature, and precipitation at each hour. All data are sent to the central
station, which has collected such data for more than 10 years. Your design
should facilitate efficient querying and online analytical processing, and derive
general weather patterns in multidimensional space.
Star schema: In data warehousing, the star schema is one of the most elementary and
widely used schemas. In a star schema, there are one or more fact tables at the
center of the warehouse referencing different dimension tables.
Star schema for a regional weather bureau: For a regional weather bureau, the data
can be organized along dimensions such as date and location (probe), with air
pressure, temperature, and precipitation as the hourly measures in the central fact
table. This schema allows the data to be queried from different perspectives and
with various combinations of dimensions. A star schema for the regional weather
bureau is sketched below.
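A minimal sketch of this star schema in SQL DDL (all names are illustrative
assumptions):

```sql
-- Dimension tables
CREATE TABLE date_dim (
  date_id INT PRIMARY KEY,
  hour INT, day INT, month INT, year INT
);
CREATE TABLE location_dim (
  location_id   INT PRIMARY KEY,
  probe_id      INT,
  region        VARCHAR(40),
  land_or_ocean VARCHAR(10)
);

-- Fact table: one row per probe per hour
CREATE TABLE weather_fact (
  date_id       INT REFERENCES date_dim(date_id),
  location_id   INT REFERENCES location_dim(location_id),
  air_pressure  DECIMAL(7,2),
  temperature   DECIMAL(5,2),
  precipitation DECIMAL(6,2)
);
```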
(a) Present an example illustrating such a huge and sparse data cube.
(b) Design an implementation method that can elegantly overcome this sparse
matrix problem. Note that you need to explain your data structures in detail and
discuss the space needed, as well as how to retrieve data from your structures.
(d) Modify your design in (b) to handle incremental data updates. Give the
reasoning behind your new design.
To design a data warehouse for the analysis of moving vehicles, we can consider
vehicle as a fact table that points to four dimensions: auto, time, location and speed.
The measures can vary depending on the desired target and the power required. Here,
measures considered are vehicles sold and vehicle mileage. In the corresponding star
schema, the vehicle fact table sits at the center, with foreign keys to the auto,
time, location, and speed dimension tables and with the two measures stored in the
fact rows.
(b) The movement data may contain noise. Discuss how you would develop a
method to automatically discover data records that were likely erroneously
registered in the data repository.
To handle the noise in data, we first need to do data cleaning. Missing values may be
filled or dropped entirely, depending on the tolerance of the system. Then we can use
some data smoothing techniques to remove noisy data points, for example, regression
and outlier analysis. Finally, we can also set up some rules to detect inconsistent data
and remove them based on domain knowledge.
(c) The movement data may be sparse. Discuss how you would develop a method
that constructs a reliable data warehouse despite the sparsity of data.
(d) If you want to drive from A to B starting at a particular time, discuss how a
system may use the data in this warehouse to work out a fast route.
4.11 Radio-frequency identification is commonly used to trace object movement
and perform inventory control. An RFID reader can successfully read an RFID tag
from a limited distance at any scheduled time. Suppose a company wants to
design a data warehouse to facilitate the analysis of objects with RFID tags in an
online analytical processing manner. The company registers huge amounts of
RFID data in the format of (RFID, at location, time), and also has some information
about the objects carrying the RFID tag, for example, (RFID, product name,
product category, producer, date produced, price).
(c) The RFID data may contain lots of noise such as missing registration and
misread IDs. Discuss a method that effectively cleans up the noisy data in the
RFID data warehouse.
(d) You may want to perform online analytical processing to determine how many
TV sets were shipped from the LA seaport to BestBuy in Champaign, IL, by
month, brand, and price range. Outline how this could be done efficiently if you
were to store such RFID data in the warehouse.
(e) If a customer returns a jug of milk and complains that it has spoiled before
its expiration date, discuss how you can investigate such a case in the
warehouse to find out what the problem is, either in shipping or in storage.
4.12 In many applications, new data sets are incrementally added to the existing
large data sets. Thus, an important consideration is whether a measure can be
computed efficiently in an incremental manner. Use count, standard deviation,
and median as examples to show that a distributive or algebraic measure
facilitates efficient incremental computation, whereas a holistic measure does
not.
Answer
Count is a distributive measure: the count of the whole data set equals the sum of
the counts of its partitions. When a new data set is added, we simply add its count
to the stored count, so the measure is computed incrementally in constant time.
Standard deviation is an algebraic measure: it can be computed from a fixed number
of distributive components. If we maintain n (the count), s = Σ x_i (the sum), and
q = Σ x_i^2 (the sum of squares), then a new increment is absorbed by adding its own
n, s, and q to the stored totals, and the standard deviation is recomputed from
these three values.
Median, however, is a holistic measure: there is no constant-size set of components
from which it can always be derived. Newly added values can shift the median to a
value that was not retained, so in general the entire data set (or a large portion
of it) must be kept and rescanned. Hence a holistic measure such as median does not
facilitate efficient incremental computation.
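For reference, the standard deviation can be recovered at any time from the three
maintained totals n, s, and q (population form shown; the sample form differs only
by the usual n/(n − 1) correction):
stddev = sqrt( q/n − (s/n)^2 )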
4.13 Suppose that we need to record three measures in a data cube: min(),
average(), and median(). Design an efficient computation and storage method for
each measure given that the cube allows data to be deleted incrementally (i.e., in
small portions at a time) from the cube.
Answer
For min, keep a <min_val, count> pair for each cuboid to register the smallest value
and the number of tuples holding it. For each deleted tuple, if its value is greater
than min_val, there is nothing to do; otherwise, decrement the count of the
corresponding node, and if the count drops to zero, recompute min_val and its count
from the remaining data.
For average, keep a <sum, count> pair for each cuboid. For each deleted tuple with
value v, decrement count and subtract v from sum; the average is then sum/count.
For median, keep a small number p of centered values (e.g., p = 10), plus two
counts: up_count and down_count for the values above and below this window. Each
removal either adjusts a count or removes a centered value. If the median no longer
falls among the retained centered values, recompute the set; otherwise, the median
can be read off directly from it.
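A minimal SQL sketch of the <sum, count> maintenance for average under deletion,
assuming a hypothetical avg_components table keyed by cuboid cell and named bind
parameters:

```sql
-- Hypothetical table holding the algebraic components of average,
-- one row per cuboid cell:
-- CREATE TABLE avg_components (cell_id INT PRIMARY KEY,
--                              sum_val DECIMAL(12,2), cnt INT);

-- Deleting a tuple with value :v from cell :cell_id
UPDATE avg_components
SET sum_val = sum_val - :v,
    cnt     = cnt - 1
WHERE cell_id = :cell_id;

-- The measure is recovered on demand from the two components
SELECT sum_val / cnt AS avg_val
FROM avg_components
WHERE cell_id = :cell_id;
```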
4.14
Answer
OLAP: Online Analytical Processing is a class of tools that can extract and present
multidimensional data from different points of view. OLAP structures the data in a
hierarchical manner. OLAP functions include trend analysis, drill-down,
summarization of data, and data rotation. There are three types of OLAP servers:
1. ROLAP (Relational OLAP servers): These servers sit between a relational back-end
server and client front-end tools. They use RDBMSs or extended RDBMSs to store and
manage warehouse data, and OLAP middleware to fill in the gaps. ROLAP servers
include optimization for each back-end DBMS, as well as aggregation navigation logic
and additional tools and services.
i. Generation of the data warehouse (including aggregation): the detail and summary
data are generated and stored as relational tables; aggregations are computed with
SQL and kept in summary tables.
ii. Roll-up: The roll-up operation performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction such that
one or more dimensions are removed from the given cube.
iii. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing new dimensions.
iv. Incremental updating: Data warehouse implementation can be broken down into
segments or increments. An increment is a defined data warehouse implementation
project with a specified beginning and end. Incremental data capture is a
time-dependent model for capturing changes to operational systems. This technique is
best applied where the changes in the data are significantly smaller than the size
of the data set for a specific period of time. These techniques are more complex
than static capture, because they are closely tied to the DBMS or to the operational
software that updates the DBMS.
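In a ROLAP server these operations map directly onto SQL: a roll-up is a GROUP BY at
a coarser level of the concept hierarchy, and many dialects also offer GROUP BY
ROLLUP to produce several aggregation levels at once. A sketch using the fee table
from 4.3(d):

```sql
-- Roll-up from (year, doctor) detail to yearly totals and a grand total
SELECT year, doctor, SUM(charge) AS total_charge
FROM fee
GROUP BY ROLLUP (year, doctor);
```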
2. MOLAP (Multidimensional OLAP servers): These servers support multidimensional
data views directly through array-based storage engines, mapping the
multidimensional views onto data cube array structures.
i. Generation of the data warehouse (including aggregation): the cube is computed
and stored as multidimensional arrays, typically partitioned into chunks so that
sparse and dense regions can be handled efficiently.
ii. Roll-up: The roll-up operation performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension reduction such that
one or more dimensions are removed from the given cube. Here roll-up aggregates
chunks into sub-cubes, and these sub-cubes form an array.
iii. Drill-down: Drill-down is the reverse of roll-up. It navigates from less
detailed data to more detailed data, either by stepping down a concept hierarchy for
a dimension or by introducing new dimensions.
iv. Incremental updating: updates are applied chunk by chunk; only the affected
chunks of the array and the aggregates derived from them need to be recomputed.
3. HOLAP (Hybrid OLAP servers): HOLAP combines elements of MOLAP and ROLAP: it keeps
the original detail data in relational tables but stores aggregations in a
multidimensional format, utilizing both pre-calculated cubes and relational data
sources.
i. Generation of the data warehouse (including aggregation): generation follows a
combined approach; large volumes of detail data are stored in a relational database
(the ROLAP part), while aggregations are generated and stored separately in a
multidimensional store (the MOLAP part).
ii. Roll-up: It combines the roll-up operations of both ROLAP and MOLAP.
iii. Drill-down: It combines the drill-down operations of both ROLAP and MOLAP.
iv. Incremental updating: increments are applied on both sides: the relational
detail tables are refreshed as in ROLAP, while the affected MOLAP arrays and
subcubes are recomputed.
d. The HOLAP implementation technique is usually preferred because it combines the
greater scalability of ROLAP with the faster computation of MOLAP. HOLAP can store
data either in an RDBMS or in an MDDB, improving query performance while keeping
storage manageable.
4.15 Suppose that a data warehouse contains 20 dimensions, each with about five
levels of granularity.
Answer
(a) Users are mainly interested in four particular dimensions, each having three
frequently accessed levels for rolling up and drilling down. How would you
design a data cube structure to support this preference efficiently?
An efficient data cube structure to support this preference would be to use partial
materialization, i.e., selective computation of cuboids. By computing only the
cuboids built from the four dimensions of interest at their three frequently
accessed levels, rather than the whole set of possible cuboids, the total storage
required is minimized while a fast response time is maintained and redundant
computation is avoided.
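As an illustration (CREATE MATERIALIZED VIEW as in Oracle or PostgreSQL; the sales
table and its columns are hypothetical), a frequently accessed cuboid can be
materialized once and then queried like an ordinary table:

```sql
-- Materialize the frequently used (month, product, city) cuboid
CREATE MATERIALIZED VIEW sales_month_product_city AS
SELECT month, product, city, SUM(amount) AS total_amount
FROM sales
GROUP BY month, product, city;

-- Subsequent roll-ups at these levels read the materialized
-- cuboid instead of the base data
SELECT month, SUM(total_amount)
FROM sales_month_product_city
GROUP BY month;
```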
(b) At times, a user may want to drill through the cube to the raw data for one or
two particular dimensions. How would you support this feature?
Since the user may want to drill through the cube to the raw data for only one or
two particular dimensions, this feature could be supported by computing the required
cuboids on the fly. Because the user may need this feature only infrequently, the
time required for computing aggregates on those one or two dimensions on the fly
should be acceptable.
4.16
A data cube, C, has n dimensions, and each dimension has exactly p distinct
values in the base cuboid. Assume that there are no concept hierarchies
associated with the dimensions.
Answer
a. p^n.
This is the maximum number of distinct tuples (base cells) that can be formed with p
distinct values per dimension.
b. p.
At least p tuples are needed to contain p distinct values per dimension. The minimum
is achieved when no two tuples share a value on any dimension.
c. What is the maximum number of cells possible (including both base cells and
aggregate cells) in the data cube C?
(p + 1)^n.
The argument is similar to that of part (a), but now we have p + 1 because in addition to
the p distinct values of each dimension we can also choose ∗.
d. (2^n − 1) × p + 1.
The minimum number of cells occurs when each cuboid other than the apex contains
only p cells (this is achieved by the configuration of part (b), where no two base
tuples share a value on any dimension), while the apex cuboid contains a single
cell.
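As a quick sanity check, take n = 2 and p = 2, with base tuples (a1, b1) and
(a2, b2) sharing no values: maximum base cells = 2^2 = 4; maximum total cells =
(2 + 1)^2 = 9; minimum total cells = (2^2 − 1) × 2 + 1 = 7 (a base cuboid with 2
cells, two 1-D cuboids with 2 cells each, and the apex with 1 cell).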
4.17
What are the differences between the three main types of data warehouse usage:
information processing, analytical processing, and data mining? Discuss the
motivation behind OLAP mining (OLAM).
Answer
Information processing supports querying and basic statistical analysis, with
reporting via tables, charts, and graphs; it answers questions about the data that
is explicitly stored. Analytical processing supports OLAP operations such as
slice-and-dice, drill-down, roll-up, and pivoting on historical, multidimensional
data. Data mining goes further: it discovers implicit patterns and knowledge,
supporting association, classification, prediction, and clustering, and presents the
results with visualization tools. The motivation behind OLAP mining (OLAM) is to
combine OLAP and data mining: data in a warehouse is already cleaned and integrated,
providing high-quality input for mining; the existing information-processing and
OLAP infrastructure can be reused; and OLAP-style exploration lets the user
interactively select the portion of data and the abstraction level at which to mine,
switching among mining functions on the fly.