Csb4318 DWDM Unit - 1 Revised
Data Warehousing and Data Mining
B.Tech – VI Semester
Dr. M.KATHIRAVAN
Assistant Professor –(SG)
School of Computing Sciences,
Department of Computer Science and Engineering
Lecture 1
Department of Computer Science and Engineering CSB4318 – Data Warehousing and Data Mining
What is Data Warehouse?
A data warehouse (DW) is suited to top-level management.
A decision support database that is maintained separately from the
organization’s operational database
Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”—W. H. Inmon
A data warehouse is a place where heterogeneous data are organized under
a unified schema at a single site to facilitate the management
decision-making process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data
sources
relational databases, flat files, on-line transaction records
Data Warehouse: Subject-Oriented
Star Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
4 Dimension Tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, state_or_province, country
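As a rough illustration of how the sales star schema could be queried, the sketch below builds the fact table and the four dimension tables in an in-memory SQLite database and joins them in one aggregate query. The sample rows, the function names, and the choice to compute avg_sales in the query instead of storing it are assumptions made for this example only.

```python
import sqlite3

def build_star_schema(conn):
    """Create the four dimension tables and the central sales fact table."""
    conn.executescript("""
        CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER,
                               day_of_the_week TEXT, month INTEGER,
                               quarter TEXT, year INTEGER);
        CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                               brand TEXT, type TEXT, supplier_type TEXT);
        CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY,
                               branch_name TEXT, branch_type TEXT);
        CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                               city TEXT, state_or_province TEXT, country TEXT);
        CREATE TABLE sales    (time_key INTEGER REFERENCES time,
                               item_key INTEGER REFERENCES item,
                               branch_key INTEGER REFERENCES branch,
                               location_key INTEGER REFERENCES location,
                               units_sold INTEGER, dollars_sold REAL);
    """)

def load_sample_rows(conn):
    """Insert a few made-up rows so the query has something to aggregate."""
    conn.executemany("INSERT INTO time VALUES (?,?,?,?,?,?)",
                     [(1, 5, "Mon", 1, "Q1", 2024), (2, 9, "Tue", 4, "Q2", 2024)])
    conn.executemany("INSERT INTO item VALUES (?,?,?,?,?)",
                     [(1, "TV", "BrandA", "electronics", "wholesale"),
                      (2, "PC", "BrandB", "electronics", "retail")])
    conn.executemany("INSERT INTO branch VALUES (?,?,?)", [(1, "Main", "store")])
    conn.executemany("INSERT INTO location VALUES (?,?,?,?,?)",
                     [(1, "1 First St", "Chennai", "Tamil Nadu", "India"),
                      (2, "2 Second St", "Toronto", "Ontario", "Canada")])
    conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)",
                     [(1, 1, 1, 1, 10, 5000.0),
                      (1, 2, 1, 1, 4, 3200.0),
                      (2, 1, 1, 2, 6, 3000.0)])

def sales_by_country_quarter(conn):
    """Join the fact table to two dimensions and aggregate the measures."""
    return conn.execute("""
        SELECT l.country, t.quarter,
               SUM(s.units_sold), SUM(s.dollars_sold), AVG(s.dollars_sold)
        FROM sales s
        JOIN time t     ON s.time_key = t.time_key
        JOIN location l ON s.location_key = l.location_key
        GROUP BY l.country, t.quarter
        ORDER BY l.country, t.quarter
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample_rows(conn)
for row in sales_by_country_quarter(conn):
    print(row)
```

Every dimension is reached from the fact table with a single join, which is the property that gives the star schema its name.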
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
A physically separate store of data transformed from the operational
environment
Operational update of data does not occur in the data warehouse
environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
Lecture 2
Data Warehouse vs. Heterogeneous DBMS
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive (cyclic, dull)     ad hoc
access              read/write,                   lots of scans
                    index/hash on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100 MB to GB                  100 GB to TB
metric              transaction throughput        query throughput, response
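The "unit of work" and "access" rows of the comparison can be made concrete with a small sketch. It assumes a hypothetical orders table in an in-memory SQLite database. The first function is an OLTP-style short read/write transaction on one row found by primary key, and the second is an OLAP-style read-only query that scans and summarizes all rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, "
    "year INTEGER, amount REAL)"
)
# Made-up rows for illustration only.
conn.executemany("INSERT INTO orders VALUES (?,?,?,?)", [
    (1, "north", 2023, 100.0),
    (2, "south", 2023, 250.0),
    (3, "north", 2024, 175.0),
    (4, "south", 2024, 300.0),
])

def oltp_update_amount(conn, order_id, new_amount):
    """OLTP-style unit of work: short transaction touching one row by key."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?",
                     (new_amount, order_id))

def olap_total_by_region_year(conn):
    """OLAP-style unit of work: read-only scan that summarizes many rows."""
    return conn.execute(
        "SELECT region, year, SUM(amount) FROM orders "
        "GROUP BY region, year ORDER BY region, year"
    ).fetchall()

oltp_update_amount(conn, 1, 120.0)
for row in olap_total_by_region_year(conn):
    print(row)
```

With four rows the difference is invisible, but the shape of the work is the point: the update needs an index on the key and concurrency control, while the summary query benefits from scans and pre-aggregation.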
Lecture 3
From Tables and Spreadsheets to Data Cubes
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency control,
recovery
Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which operational
DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources naturally use inconsistent data
representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis
directly on relational databases
From Tables and Spreadsheets to Data Cubes
Cube: Lattice of cuboids
0-D (apex) cuboid: all
3-D cuboids: (time, item, location), (time, item, supplier),
(time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
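The full lattice can be enumerated mechanically: each cuboid is a subset of the dimensions, so n dimensions (ignoring concept hierarchies) give 2^n cuboids. The following sketch lists them for the four dimensions named on this slide; the function name is an assumption of the example.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")

def cuboid_lattice(dimensions):
    """Return {k: [all cuboids with exactly k dimensions]} for k = 0..n."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("total cuboids:", sum(len(c) for c in lattice.values()))
```

The empty tuple at level 0 is the apex cuboid ("all"), and the single tuple at level 4 is the base cuboid.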
SAMPLE DATA CUBE
[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A, Canada, Mexico), with a "sum" member on each dimension.]
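The "sum" members of the sample cube are aggregates over one or more dimensions. A minimal sketch of how they could be derived from base cells is given below. The cell values are invented for illustration, and add_sum_margins is a hypothetical helper, not part of any OLAP product.

```python
from itertools import product

# Hypothetical base cells: (date, product, country) -> units sold.
base_cells = {
    ("1Qtr", "TV", "U.S.A"): 10,
    ("1Qtr", "PC", "U.S.A"): 5,
    ("2Qtr", "TV", "Canada"): 7,
    ("3Qtr", "VCR", "Mexico"): 3,
}

def add_sum_margins(cells):
    """Add a 'sum' member on every dimension by aggregating the base cells."""
    cube = {}
    for key, value in cells.items():
        # Each coordinate is either kept or replaced by 'sum', so one base
        # cell contributes to 2**3 = 8 cells of the cube (itself included).
        for mask in product((False, True), repeat=len(key)):
            target = tuple("sum" if use_sum else member
                           for member, use_sum in zip(key, mask))
            cube[target] = cube.get(target, 0) + value
    return cube

cube = add_sum_margins(base_cells)
print("all TV units:", cube[("sum", "TV", "sum")])
print("1Qtr units in U.S.A:", cube[("1Qtr", "sum", "U.S.A")])
print("grand total:", cube[("sum", "sum", "sum")])
```

The cell ("sum", "sum", "sum") corresponds to the apex cuboid, and the cells without any "sum" member are the base cuboid.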
DATA WAREHOUSE USAGE (APPLICATION)
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools
Lecture 4
Data Warehouse: A Multi-Tiered Architecture
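[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs, other sources) feed Extract, Transform, Load, and Refresh, coordinated by a Monitor & Integrator with Metadata. Data storage holds the Data Warehouse and Data Marts. An OLAP Server serves the front-end tools: Analysis, Query, Reports, and Data mining.]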
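The Extract, Transform, Load path at the bottom tier can be sketched in a few lines. The example assumes two hypothetical operational sources that encode the same facts differently (field names, gender codes, date formats), and the transform step reconciles them into one warehouse format. All records and names here are invented.

```python
from datetime import date

# Hypothetical extracts from two operational sources with different conventions.
source_a = [{"cust": "C1", "gender": "M", "sale_date": "2024-01-15", "amount": "120.50"}]
source_b = [{"customer_id": "C2", "sex": "female", "date": "15/02/2024", "amt": 80}]

def transform_a(row):
    """Source A uses M/F codes, ISO dates, and amounts stored as text."""
    return {
        "customer_id": row["cust"],
        "gender": {"M": "male", "F": "female"}[row["gender"]],
        "sale_date": date.fromisoformat(row["sale_date"]),
        "amount": float(row["amount"]),
    }

def transform_b(row):
    """Source B spells out gender and uses day/month/year dates."""
    day, month, year = (int(part) for part in row["date"].split("/"))
    return {
        "customer_id": row["customer_id"],
        "gender": row["sex"],
        "sale_date": date(year, month, day),
        "amount": float(row["amt"]),
    }

def etl(warehouse):
    """Extract from both sources, transform to one format, then load (append)."""
    warehouse.extend(transform_a(row) for row in source_a)
    warehouse.extend(transform_b(row) for row in source_b)
    return warehouse

warehouse = etl([])
for record in warehouse:
    print(record)
```

Loading only appends records, which matches the nonvolatile property: the warehouse is loaded and read, not updated in place.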
Data Mart
Meta data
Meta data is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definitions, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
OLAP Server Architectures
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage
warehouse data, plus OLAP middleware
Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
Flexibility, e.g., low level: relational, high level: array
Specialized SQL servers (e.g., Redbrick)
Specialized support for SQL queries over star/snowflake schemas
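To make the MOLAP idea of sparse array-based storage with pre-computed summaries more tangible, here is a toy sketch. SparseCube and its methods are hypothetical names for this example; real MOLAP engines use far more elaborate chunking and compression.

```python
class SparseCube:
    """Toy MOLAP-style store: only non-empty cells are kept, keyed by coordinates."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = {}      # (member, member, ...) -> measure; empty cells are absent
        self.summaries = {}  # (dimension name, member) -> pre-computed total

    def add(self, coords, value):
        """Store a measure and keep the per-member summaries up to date."""
        self.cells[coords] = self.cells.get(coords, 0) + value
        for name, member in zip(self.dimensions, coords):
            key = (name, member)
            self.summaries[key] = self.summaries.get(key, 0) + value

    def cell(self, coords):
        """Look up one cell; cells never stored read as 0."""
        return self.cells.get(coords, 0)

    def total(self, dimension, member):
        """Answer a summary query from the pre-computed totals, without a scan."""
        return self.summaries.get((dimension, member), 0)

molap = SparseCube(["time", "item", "location"])
molap.add(("Q1", "TV", "Chennai"), 10)
molap.add(("Q1", "PC", "Chennai"), 4)
molap.add(("Q2", "TV", "Toronto"), 6)
print(molap.total("item", "TV"), molap.total("time", "Q1"))
print(molap.cell(("Q2", "PC", "Toronto")), len(molap.cells))
```

A ROLAP server would instead keep the same facts as rows in relational tables and compute the totals with SQL, which scales better but answers each summary with a query.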
ROLAP
HOLAP
Lecture 6
Multidimensional OLAP Operations
Roll up (drill-up): summarize data
by ascending location hierarchy from the level of city to country.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables
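A roll-up along the location hierarchy from city to country can be sketched as a simple re-aggregation. The city-to-country mapping and the sales figures below are made up for the example.

```python
# Hypothetical location hierarchy: city -> country.
CITY_TO_COUNTRY = {"Chennai": "India", "Mumbai": "India", "Toronto": "Canada"}

# Hypothetical city-level cells: (quarter, city) -> dollars_sold.
city_sales = {
    ("Q1", "Chennai"): 500,
    ("Q1", "Mumbai"): 300,
    ("Q1", "Toronto"): 200,
    ("Q2", "Chennai"): 400,
}

def roll_up_to_country(cells, hierarchy):
    """Climb the location hierarchy: replace each city by its country and sum."""
    rolled = {}
    for (quarter, city), dollars in cells.items():
        key = (quarter, hierarchy[city])
        rolled[key] = rolled.get(key, 0) + dollars
    return rolled

country_sales = roll_up_to_country(city_sales, CITY_TO_COUNTRY)
for key, dollars in sorted(country_sales.items()):
    print(key, dollars)
```

The result has fewer, coarser cells than the input, which is what summarizing by ascending a hierarchy means.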
Star Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
4 Dimension Tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, state_or_province, country
Example of Snowflake Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
Dimension Tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_key
        supplier: supplier_key, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city_key
        city: city_key, city, state_or_province, country
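The step from a star to a snowflake schema is a normalization of a dimension table. As a sketch, the function below splits denormalized location rows into a location table and a city table linked by city_key, following the attribute names of the snowflake example. The input rows and the key assignment scheme are assumptions of this illustration.

```python
def snowflake_location(location_rows):
    """Split star-schema location rows into snowflake location and city tables."""
    city_keys = {}  # (city, state_or_province, country) -> city_key
    location_table, city_table = [], []
    for row in location_rows:
        city_id = (row["city"], row["state_or_province"], row["country"])
        if city_id not in city_keys:
            # First time this city is seen: assign the next surrogate key.
            city_keys[city_id] = len(city_keys) + 1
            city_table.append({
                "city_key": city_keys[city_id],
                "city": row["city"],
                "state_or_province": row["state_or_province"],
                "country": row["country"],
            })
        location_table.append({
            "location_key": row["location_key"],
            "street": row["street"],
            "city_key": city_keys[city_id],
        })
    return location_table, city_table

# Made-up star-schema rows; two of them share a city.
star_rows = [
    {"location_key": 1, "street": "1 First St", "city": "Chennai",
     "state_or_province": "Tamil Nadu", "country": "India"},
    {"location_key": 2, "street": "2 Second St", "city": "Chennai",
     "state_or_province": "Tamil Nadu", "country": "India"},
    {"location_key": 3, "street": "3 Third St", "city": "Toronto",
     "state_or_province": "Ontario", "country": "Canada"},
]
location_table, city_table = snowflake_location(star_rows)
print(location_table)
print(city_table)
```

The city details are now stored once per city instead of once per street address, at the cost of one more join at query time.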
Fact Constellation Schema (Enterprise data warehouse)
Sales Fact Table: time_key, item_key, branch_key
Shipping Fact Table: time_key, item_key, shipper_key, from_location
Shared Dimension Tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short
turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
DATA WAREHOUSE MODELS
Enterprise warehouse
collects all of the information about subjects spanning the entire
organization
Data Mart
a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected groups,
such as a marketing data mart
Independent vs. dependent (sourced directly from the warehouse) data
mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
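A virtual warehouse can be pictured as summary views defined directly over an operational table. The sketch below assumes a hypothetical orders table in SQLite and a helper named materialize that snapshots a view into a table; it contrasts a plain view, which is always current, with a materialized copy, which is only as fresh as its last refresh.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 100.0), (2, 'south', 250.0),
                              (3, 'north', 50.0);

    -- Virtual warehouse: a summary view over the operational table.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

def materialize(conn, view, table):
    """Store the current contents of a view as a table (a snapshot)."""
    conn.execute(f"CREATE TABLE {table} AS SELECT * FROM {view}")

materialize(conn, "sales_by_region", "sales_by_region_mat")

# A new operational record arrives after the snapshot was taken.
conn.execute("INSERT INTO orders VALUES (4, 'south', 50.0)")

view_rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
mat_rows = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region").fetchall()
print("view:        ", view_rows)
print("materialized:", mat_rows)
```

The view costs a scan of the operational table on every query, while the materialized copy answers quickly but must be refreshed, which is one reason only some summary views are materialized.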
Lecture 9
From OLAP to On-Line Analytical Mining
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data.
Available information processing structure surrounding
data warehouses
Includes accessing, integration, consolidation, and transformation of
multiple heterogeneous data sources.
ODBC, OLEDB, Web accessing, service facilities, reporting and
OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Provides flexibility to select desired DM functions by integrating
multiple mining functions with the OLAP server and swapping data
mining tasks dynamically.
Data warehousing to Mining
How does DM relate to IP and OLAP?
Information processing (IP) can find useful information stored directly
in the DB, but not the sophisticated patterns or regularities buried in it.
So information processing is not data mining.
OLAP is multidimensional data analysis for user-directed data
summary and aggregation.
DM covers a much broader spectrum than OLAP because it performs not only
summarization and aggregation but also association,
classification, prediction, clustering, and time-series analysis.
Unit 1
James Daly
DATABASE VS DATA WAREHOUSE VS DATA MINING
Match the following
I. Database
II. Data warehouse
III. Data mining
Answer (Set 1): 1 – A, C, E, H, J, L; 2 – B, D, F, I, M; 3 – G, K, N
Question No: 2
_______ is subject-oriented, integrated, non-volatile,
and time-referenced.
Data Mining
Data Warehouse
Data Base
Virtual Datawarehouse
Question No: 3
________ describes the data contained in the
data warehouse.
Data Mart
Meta Data
Virtual Data
Multidimensional Data
Question No: 4
Among the given choices, which is a specialized data
warehouse database?
Oracle
DB2
Sybase
Redbrick
None of the above
Question No: 5
A star schema is composed of _________ fact
table.
1
2
3
4
Question No: 6
The key used in the operational environment may not
have an element of ___________.
Cost
Time
Quality
Frequency
Question No: 7
A data warehouse contains ____________ data
that is never found in the operational environment.
Normalized
Informational
Summarized
Denormalized
Question No: 8
A Data Warehouse is __________
Question No: 9
ETL stands for ________
Extract, Transfer and Load
Extract, Transact and Load
Exhibit, Transform and Load
Extract, Transform and Load
Question No: 10
Convert data from legacy or host format to
warehouse format.
Data Extraction
Data Cleaning
Data Transformation
Data Loading
Question No: 11
___________ uses a sparse array-based multidimensional
storage engine.
ROLAP
MOLAP
HOLAP
SPECIALIZED SQL SERVER
Question No: 12
_____________ involves more than one fact
table.
Roll Up
Roll Down
Drill Across
Drill Through
Question No: 13
___________ selects on more than one dimension of
a given cube.
Slice
Dice
Pivot
Roll Down
Question No: 14
A refinement of star schema where some
dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar
to __________
Star
Snow Flake
Fact Constellation
Entity Relational Model
Question No: 15
Question No: 16
Question No: 17
Data Warehouse
Virtual Warehouse
Enterprise warehouse
Question No: 18
Question No: 19
Among the given choices, find the application of
data warehouse?
Information Processing
Analytical Processing
Data Mining
All the above
Question No: 20
________ is the data stage in data warehouse
processing.
Operational environment
Data warehouse environment
Extract, Transform, Load and Refresh
Online analytical environment
DATA MINING
Which of the following is an essential process in which
intelligent methods are applied to extract data
patterns?
1. Warehousing
2. Data Mining - Data mining is a process in
which intelligent methods are applied to extract
meaningful patterns from a huge collection (or set) of
data.
3. Text Mining
4. Data Selection
Which of the following refers to the problem of finding
abstracted patterns (or structures) in the unlabeled
data?
1. Supervised learning
2. Unsupervised learning - Unsupervised learning is a
type of machine learning that is generally
used to find hidden structures and patterns in
unlabeled data.
3. Hybrid learning
4. Reinforcement learning
Which of the following can be considered as the correct
process of Data Mining?
1. Infrastructure, Exploration, Analysis, Interpretation,
Exploitation - The process of data mining contains many
sub-processes in a specific order. The order in
which the sub-processes of data mining execute is
Infrastructure, Exploration, Analysis, Interpretation, and
Exploitation.
2. Exploration, Infrastructure, Analysis, Interpretation,
Exploitation
3. Exploration, Infrastructure, Interpretation, Analysis,
Exploitation
4. Exploration, Infrastructure, Analysis, Exploitation,
Interpretation
For what purpose do analysis tools pre-compute
summaries of huge amounts of data?
1. In order to maintain consistency
2. For authentication
3. For data access
4. To obtain fast query responses - Because the summaries are
computed in advance, a query can be answered from them quickly
instead of scanning all of the underlying data. To understand this in
more detail, consider the following example:
Suppose that, to get some information about something, you
type a keyword into Google search. The quick output
depends on large amounts of data having been processed
before the query arrives.
What are the functions of Data Mining?
1. Association and correlation analysis, classification
2. Prediction and characterization
3. Cluster analysis and evolution analysis
4. All of the above
Data mining offers several functionalities for
performing different types of tasks. The
common ones are cluster analysis, prediction,
characterization, and evolution analysis.
Association and correlation analysis and
classification are equally important
functionalities of data mining.