Week 7-1

The document discusses key concepts related to data warehousing and data mining. It provides an overview of: 1) The components and characteristics of a data warehouse, including its multi-tiered architecture, hardware, database management system, front-end access tools, and other tools. 2) The process of getting data into the data warehouse from multiple sources within and outside the organization, and transforming the data through cleansing, consolidation and normalization. 3) Common data warehousing concepts like OLTP vs OLAP, dimensions, facts and measures, and data modeling techniques like classification and nearest neighbor analysis.

Uploaded by

Abdul Hannan
Copyright
© All Rights Reserved
CS-822 Data Mining

Data Warehouse & Data Modeling

Week # 7
Agenda
What is Data Warehousing?
Data warehousing Concepts:
• OLTP vs OLAP
• OLAP Cubes and operations
• Dimensions, Facts & measures
• Schemas
• Data Modelling
• Classification – Problem
• Nearest Neighbor (KNN)
• Splitting of Training and Test Data
• Cross Validation
• Training and Evaluation

CS-822 Data Mining, Spring 2023

Transformation of Data to Information

[Figure: layered diagram, listed from top to bottom: Information; Exploration / Analysis; SQL reporting; Relational Warehouse; Cleansing & Normalization; Data Transaction Processing]

The Past and The Problem

• Organizations only had scattered transactional systems, with data spread among different systems
• Transactional systems were not designed for decision support analysis
• Data constantly changes on transactional systems
• Lack of historical data
• Resources were often taxed by serving both needs on the same systems

The Past and The Problem

• Operational databases are designed to record transactions from daily operations. They are optimized to create or update individual records efficiently
• A database for analysis, on the other hand, needs to be geared toward flexible requests or queries (ad hoc queries, statistical analysis)

Need for Data Warehousing

• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational systems and data (for improved performance)

Operational system – a system that is used to run a business in real time, based on current data; also called a system of record

Informational system – a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications

Issues with Company-Wide View

Inconsistent key structures

Synonyms

Free-form vs. structured fields

Inconsistent data values

Missing data

What is a Data Warehouse?

• A data warehouse (DWH) is like a relational database designed for analytical needs.
• It functions on the basis of OLAP.
• It is a central location where consolidated data from multiple locations (databases) is stored.

Data warehousing is an architectural model designed to gather data from various sources into a single unified data model for analysis purposes.

• Data warehousing is the act of organizing and storing data so that its retrieval is efficient and insightful.
• It is also described as the process of transforming data into information.

Features/Characteristics of Data Warehouse

• A managed database in which the data is:
  • Subject Oriented
  • Integrated
  • Time Variant
  • Non Volatile
• To support management's decision making process

Subject Oriented

• Organized around major subject areas in the enterprise (Sales, Inventory, Financial, etc.)
• Only includes data which is used in the decision making processes
• Elements used for transactional processing are removed

Integrated

• Data from different sources are brought together and consolidated
• The data is cleaned and made consistent
• Example – Bank Systems using Different Codes
  • Gender – COMM
  • Date - C

Time Variant

• Data in a Data Warehouse contains both current and historical information
• Operational Systems contain only current data

Systems typically retain data:
• Operational Systems – 60 to 90 Days
• Data Warehouse – 5 to 10 Years

Non-Volatile

• Operational systems have continually changing data
• Data Warehouses continually absorb current data and integrate it with their existing data (Aggregate or Summary tables)
• Example: an account balance at a bank

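The account-balance example can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the account ID, dates, and balances are made up, and `operational`, `warehouse`, and `record_balance` are hypothetical names.

```python
from datetime import date

# Operational system: one current value per account, overwritten in place.
operational = {"ACC-1": 500.0}

# Warehouse: append-only history of (account, snapshot date, balance).
warehouse = []

def record_balance(account, day, balance):
    """Update the operational value and append a snapshot to the warehouse."""
    operational[account] = balance              # the old value is lost here
    warehouse.append((account, day, balance))   # the history is kept here

record_balance("ACC-1", date(2023, 1, 31), 650.0)
record_balance("ACC-1", date(2023, 2, 28), 420.0)

# The operational side only knows the latest balance...
current = operational["ACC-1"]
# ...while the warehouse can still answer "what was the balance in January?"
january = [b for (a, d, b) in warehouse if a == "ACC-1" and d.month == 1]
```

The operational dictionary is volatile (each update replaces the previous value), while the warehouse list only grows, which is what makes historical queries possible.
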
What Is a Data Warehouse?

• Not a product, it is a process
• Combination of hardware and software

The concept of a Data Warehouse is not new, but the technology that allows it is.

• Can often be set up as one VLDB (Very Large Database) or a collection of subject areas called Data Marts.
• There are now tools which "unify" these Data Marts and make them appear as a single database.

Data Warehouse: A Multi-Tiered Architecture

Components of a Data Warehouse

Four General Components:
• Hardware
• DBMS - Database Management System
• Front End Access Tools
• Other Tools

• In all components scalability is vital.

Scalability is the ability to grow as your data and processing needs increase

Components of a Data Warehouse
Hardware:

• Power – number of processors, memory, I/O bandwidth, and speed of the bus
• Availability – redundant equipment
• Disk Storage – speed and enough storage for the loaded data set
• Backup Solution – automated, and able to allow for incremental backups and archiving of older data

Components of a Data Warehouse
DBMS:

• Physical storage capacity of the DBMS
• Loading, indexing, and processing speed
• Availability
• Handle your data needs
• Operational integrity, reliability, and manageability
  • Operational integrity – enforces its rules: security and "atomic transactions"
  • Reliability – can it recover from failure quickly and easily?
  • Manageability – ability to do the day-to-day tasks with little or no effort

Components of a Data Warehouse
Front End & Other Tools:

• Query Tools (SQL & GUI based)
• Report Writers
• Metadata Repositories
• OLAP (Online Analytical Processing)
• Data Mining Products

Getting the Data In

• Data will come from multiple databases and files within the organization
• Data can also come from outside sources
  • Examples:
    • Weather Reports
    • Demographic information by Zip Code

Getting the Data In

Three Steps:
1. Extraction Phase
2. Transformation Phase
3. Loading Phase

Extraction Phase:
• Source systems export data via files, or populate the warehouse directly when the databases can "talk" to each other
• Transfers the data to the Data Warehouse server and puts it into some sort of staging area

Transformation Phase:
• Takes data and turns it into a form that is suitable for insertion into the warehouse
• Combines related data
• Removes redundancies
• Common Codes (Commercial Customer)
• Spelling Mistakes (Lozenges)
• Consistency (PA, Pa, Penna, Pennsylvania)
• Formatting (Addresses)

Loading Phase:
• Places the cleaned data into the DBMS in its final, usable form
• Compares data from source systems and the Data Warehouse
• Documents the load information for the users

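The three phases can be sketched as three small Python functions. This is a minimal illustration, not a production ETL design: the sample rows, the lookup tables `STATE_CODES` and `CUSTOMER_TYPES`, and all function names are assumptions made for the example.

```python
# Hypothetical raw records as they might arrive from different source systems.
source_rows = [
    {"customer": "Acme Corp ", "state": "Penna", "type": "COMM"},
    {"customer": "acme corp", "state": "PA", "type": "Commercial"},
    {"customer": "Bolt Ltd", "state": "Pennsylvania", "type": "C"},
]

# Assumed lookup tables for the "consistency" and "common codes" steps.
STATE_CODES = {"pa": "PA", "penna": "PA", "pennsylvania": "PA"}
CUSTOMER_TYPES = {"comm": "Commercial", "commercial": "Commercial", "c": "Commercial"}

def extract(rows):
    """Extraction: copy source rows into a staging area."""
    return [dict(row) for row in rows]

def transform(staged):
    """Transformation: standardize formats and codes, then remove redundancies."""
    cleaned, seen = [], set()
    for row in staged:
        record = (
            row["customer"].strip().title(),       # formatting
            STATE_CODES[row["state"].lower()],     # consistency
            CUSTOMER_TYPES[row["type"].lower()],   # common codes
        )
        if record not in seen:                     # drop duplicates after cleaning
            seen.add(record)
            cleaned.append(record)
    return cleaned

def load(cleaned, table):
    """Loading: place cleaned rows in the warehouse table and report the count."""
    table.extend(cleaned)
    return len(cleaned)

warehouse_table = []
loaded = load(transform(extract(source_rows)), warehouse_table)
```

Two of the three source rows describe the same customer with different spellings and codes, so only two rows reach `warehouse_table` once the values are standardized.
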
OLTP to OLAP

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

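The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions (ignoring concept hierarchies) give 2^n cuboids. The following sketch uses the four dimensions from the slide; the variable names are illustrative only.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group every subset of the dimensions by its size (0-D apex up to 4-D base).
lattice = {
    k: [", ".join(combo) or "all" for combo in combinations(dimensions, k)]
    for k in range(len(dimensions) + 1)
}

total_cuboids = sum(len(level) for level in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")
```

For four dimensions this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids.
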
OLAP (Online Analytical Processing)

• OLAP is a flexible way to perform complicated analysis of multidimensional data.
• A DWH is modeled on the concept of OLAP, while operational databases are modeled on the concept of OLTP.
• OLTP systems use data stored in the form of two-dimensional rows and columns.

Multidimensional Data

• Sales volume as a function of product, month, and region

A Sample Data Cube

[Figure: sample data cube, with an annotation marking the total annual sales of TVs in U.S.A.]

Data Cube Example

Data Cube Example

Data Cube Example (4-D)

Typical OLAP Operations

• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up, going from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes

Roll-up

Roll-up performs aggregation on a data cube in any of the following ways:

• By climbing up a concept hierarchy for a dimension
• By dimension reduction

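Both forms of roll-up can be illustrated with plain Python. This is a sketch under assumptions: the fact rows, the city-to-country hierarchy, and the helper `aggregate` are made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, units_sold).
facts = [
    ("Toronto", "Canada", "Q1", 100),
    ("Vancouver", "Canada", "Q1", 150),
    ("Chicago", "USA", "Q1", 200),
    ("Chicago", "USA", "Q2", 250),
]

def aggregate(rows, key):
    """Sum units_sold over the groups produced by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Roll-up by climbing the location hierarchy: city -> country.
by_country_quarter = aggregate(facts, key=lambda r: (r[1], r[2]))

# Roll-up by dimension reduction: drop the time dimension entirely.
by_country = aggregate(facts, key=lambda r: r[1])
```

In both cases the result has fewer, coarser cells than the input, and each cell holds the sum of the detailed cells it covers.
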
Drill-down

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:

• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension

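Drill-down needs the more detailed data to be available. A minimal sketch, with made-up rows and a hypothetical helper `summarize`, starts from a per-quarter summary and then refines it in the two ways listed above.

```python
from collections import defaultdict

# Hypothetical detailed facts: (quarter, month, item, units_sold).
facts = [
    ("Q1", "Jan", "TV", 30),
    ("Q1", "Feb", "TV", 50),
    ("Q1", "Mar", "Phone", 20),
    ("Q2", "Apr", "TV", 40),
]

def summarize(rows, *columns):
    """Sum units_sold grouped by the given column positions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[c] for c in columns)] += row[3]
    return dict(totals)

QUARTER, MONTH, ITEM = 0, 1, 2

summary = summarize(facts, QUARTER)                 # starting point: per quarter
by_month = summarize(facts, QUARTER, MONTH)         # step down: quarter -> month
by_quarter_item = summarize(facts, QUARTER, ITEM)   # introduce the item dimension
```
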
Slice

The slice operation performs a selection on one particular dimension of a given cube (fixing it to a single value) and provides a new sub-cube

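A slice can be sketched by representing the cube as a dictionary keyed by dimension values. The cells and the helper `slice_cube` below are hypothetical and only illustrate the idea.

```python
# Hypothetical cube cells keyed by (time, item, location) -> units_sold.
cube = {
    ("Q1", "TV", "Toronto"): 100,
    ("Q1", "Phone", "Toronto"): 60,
    ("Q2", "TV", "Toronto"): 120,
    ("Q1", "TV", "Chicago"): 200,
}

def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }

TIME = 0
q1_slice = slice_cube(cube, TIME, "Q1")
```

The result has one dimension fewer than the original cube: here a 3-D cube becomes a 2-D (item, location) view for Q1.
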
Dice

Dice performs a selection on two or more dimensions of a given cube and provides a new sub-cube

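Dice can be sketched the same way, this time filtering on several dimensions at once. The cube contents and the helper `dice_cube` are assumptions for illustration.

```python
# Hypothetical cube cells keyed by (time, item, location) -> units_sold.
cube = {
    ("Q1", "TV", "Toronto"): 100,
    ("Q1", "Phone", "Toronto"): 60,
    ("Q2", "TV", "Toronto"): 120,
    ("Q3", "TV", "Toronto"): 90,
    ("Q1", "TV", "Chicago"): 200,
}

def dice_cube(cells, criteria):
    """Keep cells whose value on every listed axis is in the allowed set."""
    return {
        key: measure
        for key, measure in cells.items()
        if all(key[axis] in allowed for axis, allowed in criteria.items())
    }

TIME, ITEM, LOCATION = 0, 1, 2
sub_cube = dice_cube(cube, {TIME: {"Q1", "Q2"}, LOCATION: {"Toronto"}})
```

Unlike this sketch's treatment of slice, the keys keep all three dimensions; dice only narrows the range of values along the selected ones.
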
Pivot

The pivot operation is also known as rotation.

It rotates/transposes the data axes in view in order to provide an alternative presentation of data

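For a 2-D view, pivoting amounts to swapping the row and column axes. The labels, values, and the helper `pivot` below are made up for the example.

```python
# Hypothetical 2-D view: rows are items, columns are locations.
items = ["TV", "Phone"]
locations = ["Toronto", "Chicago", "Vancouver"]
view = [
    [100, 200, 150],   # TV
    [60, 80, 40],      # Phone
]

def pivot(row_labels, col_labels, table):
    """Swap the row and column axes of a 2-D view."""
    rotated = [list(column) for column in zip(*table)]
    return col_labels, row_labels, rotated

new_rows, new_cols, rotated_view = pivot(items, locations, view)
```

The data is unchanged; only its presentation differs, with locations now as rows and items as columns.
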
OLAP Operations

OLAP vs OLTP

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.

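The difference in access pattern (rows 7 to 10 of the table) can be illustrated with Python's built-in `sqlite3` module. This is a toy sketch: the `orders` table and its contents are hypothetical, and a single SQLite database stands in for both kinds of system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("East", 2022, 100.0), ("East", 2023, 150.0), ("West", 2023, 200.0)],
)

# OLTP-style work: touch a single record by key.
conn.execute("UPDATE orders SET amount = 175.0 WHERE id = 2")

# OLAP-style work: scan many records and summarize them.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The update reads and writes one row, while the aggregate query reads every row to produce a consolidated result per region.
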
Dimensions

• The tables that describe the dimensions involved are called dimension tables.
• Dividing a DWH project into dimensions provides structured information for analysis & reporting.

Dimensions

• End users run queries against these dimension tables, which contain descriptive information.

Facts and Measures

• A fact is a measure that can be summed, averaged or manipulated.
• A Fact table contains two kinds of data: dimension keys and measures.
• Every Dimension table is linked to a fact table.

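The relationship between a fact table and its dimension tables can be sketched with `sqlite3`. The table and column names (`dim_item`, `dim_time`, `fact_sales`, `dollars_sold`) and the sample rows are assumptions for illustration, loosely following the item and time dimensions mentioned earlier in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
-- The fact table holds only dimension keys plus the measure.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    dollars_sold REAL
);
INSERT INTO dim_item VALUES (1, 'TV', 'BrandA'), (2, 'Phone', 'BrandB');
INSERT INTO dim_time VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO fact_sales VALUES (1, 1, 500.0), (1, 2, 700.0), (2, 1, 300.0);
""")

# Descriptive attributes come from the dimensions, numbers from the fact table.
rows = conn.execute("""
    SELECT i.item_name, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
conn.close()
```

Every dimension table joins directly to the fact table here, which is the layout usually called a star schema.
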
Schemas

• A schema gives the logical description of the entire database.
• It gives details about the constraints placed on the tables, the key values present, and how the key values are linked between the different tables.
• A database uses a relational model, while a data warehouse uses:
  • Star
  • Snowflake
  • Fact constellation (galaxy)

Star Schema

Snowflake Schema

A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables.

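A small `sqlite3` sketch shows what "joining through other dimension tables" looks like in practice. The tables `dim_country`, `dim_city`, and `fact_sales`, and their rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, country TEXT);
-- dim_city is normalized: it points at dim_country instead of storing the name.
CREATE TABLE dim_city (
    city_key INTEGER PRIMARY KEY,
    city TEXT,
    country_key INTEGER REFERENCES dim_country(country_key)
);
CREATE TABLE fact_sales (
    city_key INTEGER REFERENCES dim_city(city_key),
    units_sold INTEGER
);
INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'USA');
INSERT INTO dim_city VALUES (1, 'Toronto', 1), (2, 'Vancouver', 1), (3, 'Chicago', 2);
INSERT INTO fact_sales VALUES (1, 100), (2, 150), (3, 200);
""")

# dim_country is reachable from the fact table only through dim_city.
by_country = conn.execute("""
    SELECT co.country, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_city AS ci ON ci.city_key = f.city_key
    JOIN dim_country AS co ON co.country_key = ci.country_key
    GROUP BY co.country
    ORDER BY co.country
""").fetchall()
conn.close()
```

Compared with a star layout, the country name is stored once per country rather than once per city, at the cost of an extra join in the query.
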
Components of a Data Warehouse – Data Mining

• Answers the questions you didn't know to ask
• Analyzes great amounts of data (usually contained in a Data Warehouse) and looks for trends in the data
• Technology now allows us to do this better than in the past

Components of a Data Warehouse – Data Mining

• Most famous example is the Huggies - Heineken case
• Used in Retail sector to analyze buying habits
• Used in financial areas to detect fraud
• Used in the stock market to find trends
• Used in scientific research
• Used in national security