• Data Warehousing
• OLAP
• Data Mining
• Further Reading
Enroll Now
https://goo.gl/QbTVal
Data Warehousing
• OLTP (online transaction processing) systems
– range in size from megabytes to terabytes
– high transaction throughput
• Decision makers require access to all data
– Historical and current
– 'A data warehouse is a subject-oriented, integrated, time-
variant and non-volatile collection of data in support of
management’s decision-making process' (Inmon 1993)
Benefits
• Potential high returns on investment
– 90% of companies in 1996 reported a return on investment
(over 3 years) of more than 40%
• Competitive advantage
– Data can reveal previously unknown, unavailable and
untapped information
• Increased productivity of corporate decision-makers
– Integration allows more substantive, accurate and
consistent analysis
Typical Architecture
[Figure: typical data warehouse architecture. Operational data sources (mainframe operational n/w, h/w data; departmental RDBMS data; private data; external data) feed the load manager. The warehouse manager maintains meta-data, highly summarized data, lightly summarized data and detailed data in the DBMS, together with archive/backup. The query manager serves reporting, query, application development and EIS tools, OLAP tools and data-mining tools. Source: Connolly and Begg, p. 1157]
Data Warehouses
• Types of Data
– Detailed
– Summarised
– Meta-data
– Archive/Back-up
Information Flows
[Figure: information flows in a data warehouse. Operational data sources 1 to n, the load manager, the warehouse manager (meta-data, highly summarized, lightly summarized and detailed data in the DBMS, plus archive/backup) and the query manager serving reporting, query, application development and EIS tools, OLAP tools and data-mining tools, connected by five flows: inflow, upflow, downflow, outflow and meta-flow. Source: Connolly and Begg, p. 1162]
Information Flow Processes
• Five primary information flows
– Inflow - extraction, cleansing and loading of data from
source systems into warehouse
– Upflow - adding value to data in warehouse through
summarizing, packaging and distributing data
– Downflow - archiving and backing up data in warehouse
– Outflow - making data available to end users
– Metaflow - managing the metadata
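The first two flows can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names (`branch`, `rent`) and the function names are hypothetical and do not come from Connolly and Begg.

```python
from collections import defaultdict

# Hypothetical rows as they might arrive from operational source systems.
SOURCE_ROWS = [
    {"branch": " B003 ", "rent": "650"},
    {"branch": "b003", "rent": "720"},
    {"branch": "B005", "rent": "n/a"},   # unusable value, rejected during cleansing
    {"branch": "B005", "rent": "800"},
]


def cleanse(row):
    """Normalise one source row, or return None if it cannot be used."""
    try:
        rent = float(row["rent"])
    except ValueError:
        return None
    return {"branch": row["branch"].strip().upper(), "rent": rent}


def inflow(rows):
    """Inflow: extract, cleanse and load rows into the detailed-data store."""
    return [clean for clean in map(cleanse, rows) if clean is not None]


def upflow(detailed):
    """Upflow: add value by summarising detailed data (total rent per branch)."""
    totals = defaultdict(float)
    for row in detailed:
        totals[row["branch"]] += row["rent"]
    return dict(totals)


detailed_data = inflow(SOURCE_ROWS)
summary = upflow(detailed_data)
print(summary)
```

Downflow, outflow and metaflow are left out of the sketch; in practice they would be handled by backup, query and metadata tooling rather than by a few functions.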
Problems of Data Warehousing
1. Underestimation of resources for data loading
2. Hidden problems with source systems
3. Required data not captured
4. Increased end-user demands
5. Data homogenization
6. High demand for resources
7. Data ownership
8. High maintenance
9. Long duration projects
10. Complexity of integration
Data Warehouse Design
• Data must be designed to allow ad-hoc queries to be
answered within acceptable performance constraints
• Queries usually require access to factual data
generated by business transactions
– e.g. find the average number of properties rented out with a
monthly rent greater than £700 at each branch office over the
last six months
• Uses Dimensionality Modelling
Dimensionality Modelling
• Similar to E-R modelling but with constraints
– composed of one fact table with a composite primary key
– dimension tables have a simple primary key which
corresponds exactly to one foreign key in the fact table
– uses surrogate keys based on integer values
– can efficiently and easily support ad-hoc end-user queries
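A minimal sketch of these constraints, using Python's built-in sqlite3 module. The table and column names (`branch_dim`, `property_dim`, `time_dim`, `rental_fact`) are assumptions chosen for illustration, not a schema taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each dimension table has a simple, integer surrogate primary key.
CREATE TABLE branch_dim (
    branch_key INTEGER PRIMARY KEY,   -- surrogate key
    branch_no  TEXT NOT NULL,         -- natural key from the source system
    city       TEXT
);
CREATE TABLE property_dim (
    property_key INTEGER PRIMARY KEY,
    property_no  TEXT NOT NULL,
    type         TEXT
);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
-- The fact table's composite primary key is made of one foreign key per dimension.
CREATE TABLE rental_fact (
    branch_key   INTEGER NOT NULL REFERENCES branch_dim(branch_key),
    property_key INTEGER NOT NULL REFERENCES property_dim(property_key),
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    monthly_rent REAL,
    PRIMARY KEY (branch_key, property_key, time_key)
);
""")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk);
# pk is non-zero for columns that are part of the primary key.
fact_pk = [row[1] for row in conn.execute("PRAGMA table_info(rental_fact)") if row[5] > 0]
print(fact_pk)
```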
Star Schemas
• The most common dimensional model
• A fact table surrounded by dimension tables
• Fact tables
– contain a foreign key (FK) for each dimension table
– are large relative to the dimension tables
– are read-only
• Dimension tables
– hold reference data
– speed up queries by denormalising reference data into a
single dimension table
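As an illustration, a star schema lets an ad-hoc question be answered by joining the fact table to its dimensions. The sketch below is a reduced version of the Data Warehouse Design example query; the tables, rows and the year filter are hypothetical, and the average is taken only over months that have at least one qualifying rental.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE rental_fact (
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    property_no  TEXT,
    monthly_rent REAL
);
-- Hypothetical sample data.
INSERT INTO branch_dim VALUES (1, 'Glasgow'), (2, 'London');
INSERT INTO time_dim VALUES (1, 2000, 1), (2, 2000, 2);
INSERT INTO rental_fact VALUES
    (1, 1, 'PG4', 650), (1, 1, 'PG16', 750), (1, 2, 'PG16', 750),
    (2, 1, 'PL94', 900), (2, 2, 'PL94', 900), (2, 2, 'PL21', 720);
""")

# Average number of properties per month rented above 700, for each branch city.
QUERY = """
SELECT b.city,
       COUNT(*) * 1.0 / COUNT(DISTINCT f.time_key) AS avg_rentals_per_month
FROM rental_fact AS f
JOIN branch_dim AS b ON b.branch_key = f.branch_key
JOIN time_dim   AS t ON t.time_key   = f.time_key
WHERE f.monthly_rent > 700 AND t.year = 2000
GROUP BY b.city
ORDER BY b.city
"""
rows = conn.execute(QUERY).fetchall()
for city, average in rows:
    print(city, average)
```

Every dimension is one join away from the fact table, which is what keeps queries of this kind short.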
E-R Model Example
[Figure: E-R model example]
Star Schema Example
[Figure: star schema example]
Other Schemas
• Snowflake schemas
– variant of star schema
– each dimension can have its own dimensions
• Starflake schemas
– hybrid structure
– contains mixture of (denormalised) star and
(normalised) snowflake schemas
OLAP
• Online Analytical Processing
– dynamic synthesis, analysis and consolidation of large
volumes of multi-dimensional data
– normally implemented using specialized multi-
dimensional DBMS
• a method of visualising and manipulating data with
many inter-relationships
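Consolidation of multi-dimensional data can be illustrated with a tiny in-memory cube. This is a sketch only: the dimension names, cell values and the `roll_up` function are hypothetical, and a real multi-dimensional DBMS would not store its data in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, property_type, quarter) -> revenue.
DIMENSIONS = ("city", "property_type", "quarter")
CUBE = {
    ("Glasgow", "Flat", "Q1"): 100,
    ("Glasgow", "House", "Q1"): 150,
    ("Glasgow", "Flat", "Q2"): 120,
    ("London", "Flat", "Q1"): 300,
    ("London", "House", "Q2"): 250,
}


def roll_up(cube, keep):
    """Consolidate the cube onto the dimensions named in `keep`, summing over the rest."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for coords, value in cube.items():
        result[tuple(coords[p] for p in positions)] += value
    return dict(result)


by_city = roll_up(CUBE, ["city"])
by_city_quarter = roll_up(CUBE, ["city", "quarter"])
print(by_city)
print(by_city_quarter)
```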
Codd’s OLAP Rules
1. Multi-dimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multi-user support
9. Unrestricted cross-dimensional operations
10. Intuitive data manipulation
11. Flexible reporting
12. Unlimited dimensions and aggregation levels
OLAP Tools
• Categorised according to architecture of underlying database
– Multi-dimensional OLAP
• data typically aggregated and stored according to predicted
usage
• use array technology
– Relational OLAP
• use of relational meta-data layer with enhanced SQL
– Managed Query Environment
• deliver data direct from DBMS or MOLAP server to desktop
in form of a datacube
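The difference between the first two categories can be sketched as follows. Everything here is an assumption made for illustration: the array contents, the member lists and both function names are hypothetical, and real MOLAP and ROLAP servers are far more elaborate.

```python
# MOLAP style: measure values held in an array addressed by dimension positions.
# The extra last row and last column hold totals aggregated ahead of time.
CITIES = ["Glasgow", "London"]
QUARTERS = ["Q1", "Q2"]
ARRAY = [
    [250, 120, 370],   # Glasgow: Q1, Q2, all quarters
    [300, 250, 550],   # London:  Q1, Q2, all quarters
    [550, 370, 920],   # all cities
]


def molap_lookup(city=None, quarter=None):
    """Read one cell; leaving a dimension out selects its pre-computed total."""
    row = CITIES.index(city) if city is not None else len(CITIES)
    col = QUARTERS.index(quarter) if quarter is not None else len(QUARTERS)
    return ARRAY[row][col]


def rolap_sql(measure, dimensions, fact_table="rental_fact"):
    """ROLAP style: translate the same kind of request into SQL for a relational DBMS."""
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM({measure}) AS total FROM {fact_table} GROUP BY {cols}"


print(molap_lookup("London", "Q1"))
print(molap_lookup(city="Glasgow"))
print(rolap_sql("monthly_rent", ["city", "quarter"]))
```

The array lookup is a direct index into stored aggregates, while the relational variant defers the aggregation to the database at query time.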
MOLAP
[Figure: MOLAP architecture. The RDB server loads data into the MOLAP server (database/application logic layer); the presentation layer sends requests to the MOLAP server and receives results.]
ROLAP
[Figure: ROLAP architecture. The presentation layer sends requests to the ROLAP server (application logic layer), which issues SQL to the RDB server (database layer) and passes the results back.]
MQE
[Figure: MQE architecture. End-user tools send SQL to the RDB server or requests to a MOLAP server loaded from the RDB server, and receive results from either.]
Data Mining
• ‘The process of extracting valid, previously unknown,
comprehensible and actionable information from
large databases and using it to make crucial business
decisions’
– focus is to reveal information which is hidden or unexpected
– patterns and relationships are identified by examining the
underlying rules and features of the data
– work from data up
– require large volumes of data
Example Data Mining Applications
• Retail/Marketing
– Identifying buying patterns of customers
– Finding associations among customer demographic
characteristics
– Predicting response to mailing campaigns
– Market basket analysis
• Banking
– Detecting patterns of fraudulent credit card use
– Identifying loyal customers
– Predicting customers likely to change their credit card
affiliation
– Determining credit card spending by customer groups
Data Mining Techniques
• Four main techniques
– Predictive Modelling
– Database Segmentation
– Link Analysis
– Deviation Detection
• Predictive Modelling
– using observations to form a model of the important
characteristics of some phenomenon
• Techniques:
– Classification
– Value Prediction
Classification Example: Tree Induction
[Figure: decision tree. Root test: customer renting property > 2 years? No: rent property. Yes: customer age > 25 years? No: rent property. Yes: buy property.]
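Once induced, a tree of this kind is applied as nested tests. The sketch below encodes the tree from this slide (renting for more than two years, then age over 25); it shows how the finished tree classifies a customer, not the induction algorithm that would learn it from data. The function and parameter names are hypothetical.

```python
def predict_outcome(years_renting, age):
    """Classify a customer with the two tests of the example tree."""
    if years_renting > 2:
        if age > 25:
            return "buy property"
        return "rent property"
    return "rent property"


print(predict_outcome(years_renting=3, age=30))
print(predict_outcome(years_renting=3, age=22))
print(predict_outcome(years_renting=1, age=40))
```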
• Database Segmentation:
– to partition a database into an unknown number of
segments (or clusters) of records which share a number of
properties
• Techniques:
– Demographic clustering
– Neural clustering
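To illustrate segmentation into a number of clusters that is not fixed in advance, here is a sketch of a simple leader-style clustering pass. It is a stand-in chosen for brevity, not an implementation of demographic or neural clustering; the customer records (age, monthly rent) and the distance threshold are hypothetical, and the two attributes are left unscaled.

```python
import math


def segment(records, threshold):
    """Put each record in the first segment whose leader is within `threshold`,
    otherwise start a new segment with the record as its leader."""
    leaders, segments = [], []
    for record in records:
        for index, leader in enumerate(leaders):
            if math.dist(record, leader) <= threshold:
                segments[index].append(record)
                break
        else:
            # No existing leader was close enough.
            leaders.append(record)
            segments.append([record])
    return segments


# Hypothetical (age, monthly rent) pairs.
CUSTOMERS = [(23, 400), (25, 420), (47, 900), (45, 880), (24, 410), (70, 300)]
segments = segment(CUSTOMERS, threshold=50)
for number, members in enumerate(segments, start=1):
    print(number, members)
```

The result depends on the order of the records and on the threshold, which is one reason production tools use more robust methods.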
Segmentation: Scatterplot Example
[Figure: scatterplot example of database segmentation]
• Link Analysis
– establish associations between individual records (or sets of
records) in a database
• e.g. ‘when a customer rents property for more than two years
and is more than 25 years old, then in 40% of cases, the
customer will buy the property’
– Techniques
• Association discovery
• Sequential pattern discovery
• Similar time sequence discovery
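An association such as the one in the example is usually reported with a support and a confidence figure. The sketch below computes both for the rule "rents for more than two years and is older than 25, then buys"; the customer records are invented so that the confidence comes out at 40%, and the field names are hypothetical.

```python
# Hypothetical customer records.
CUSTOMERS = [
    {"years_renting": 3, "age": 30, "bought": True},
    {"years_renting": 4, "age": 41, "bought": False},
    {"years_renting": 5, "age": 28, "bought": True},
    {"years_renting": 3, "age": 35, "bought": False},
    {"years_renting": 6, "age": 52, "bought": False},
    {"years_renting": 1, "age": 33, "bought": False},
    {"years_renting": 3, "age": 22, "bought": True},
]


def antecedent(customer):
    """Left-hand side of the rule: rented > 2 years and older than 25."""
    return customer["years_renting"] > 2 and customer["age"] > 25


def rule_stats(customers):
    """Return (support, confidence) for 'antecedent implies bought'."""
    matching = [c for c in customers if antecedent(c)]
    both = [c for c in matching if c["bought"]]
    support = len(both) / len(customers)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


support, confidence = rule_stats(CUSTOMERS)
print(f"support={support:.2f} confidence={confidence:.2f}")
```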
• Deviation Detection
– identify ‘outliers’, something which deviates from some
known expectation or norm
• Techniques:
– Statistics
– Visualisation
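A statistical check for outliers can be as simple as flagging values that lie far from the mean. The sketch below uses a z-score with Python's standard statistics module; the rent figures and the threshold of two standard deviations are assumptions for illustration, and other definitions of "norm" are equally valid.

```python
import statistics


def find_outliers(values, z_threshold=2.0):
    """Return the values lying more than `z_threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > z_threshold]


# Hypothetical monthly rents; one value deviates strongly from the rest.
MONTHLY_RENTS = [650, 700, 720, 680, 710, 690, 2500]
outliers = find_outliers(MONTHLY_RENTS)
print(outliers)
```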
Deviation Detection: Visualisation Example
[Figure: visualisation example of deviation detection]
Mining and Warehousing
• Data mining needs single, separate, clean, integrated, self-
consistent data source
• Data warehouse well equipped:
– populated with clean, consistent data
– contains multiple sources
– utilises query capabilities
– capability to go back to data source
Further Reading
• Connolly and Begg, chapters 31 to 34.
• W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993.
• Beynon-Davies P, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.
Diabetic neuropathy peripheral autonomicDiabetic neuropathy peripheral autonomic
Diabetic neuropathy peripheral autonomic
Pankaj Patawari
 
Ad

Difference between data warehouse and data mining

  • 2. • Data Warehousing • OLAP • Data Mining • Further Reading
  • 3. Data Warehousing • OLTP (online transaction processing) systems – range in size from megabytes to terabytes – high transaction throughput • Decision makers require access to all data – Historical and current – 'A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process' (Inmon 1993)
  • 4. Benefits • Potential high returns on investment – 90% of companies in 1996 reported return on investment (over 3 years) of > 40% • Competitive advantage – Data can reveal previously unknown, unavailable and untapped information • Increased productivity of corporate decision-makers – Integration allows more substantive, accurate and consistent analysis
  • 5. Typical Architecture (diagram) • Data sources: mainframe operational n/w, h/w data; departmental RDBMS data; private data; external data • Components: load manager, warehouse manager, query manager, DBMS • Stored data: meta-data, highly summarized data, lightly summarized data, detailed data, archive/backup • End-user access: reporting, query, app development and EIS tools; OLAP tools; data-mining tools • Source: Connolly and Begg p1157
  • 6. Data Warehouses • Types of Data – Detailed – Summarised – Meta-data – Archive/Back-up
  • 7. Information Flows (diagram) • Components: operational data sources 1 to n, load manager, warehouse manager, query manager, DBMS (meta-data, highly summarized data, lightly summarized data, detailed data), archive/backup, end-user tools (reporting, query, app development and EIS tools; OLAP tools; data-mining tools) • Flows shown: Meta-flow, Inflow, Downflow, Upflow, Outflow • Source: Connolly and Begg p1162
  • 8. Information Flow Processes • Five primary information flows – Inflow - extraction, cleansing and loading of data from source systems into warehouse – Upflow - adding value to data in warehouse through summarizing, packaging and distributing data – Downflow - archiving and backing up data in warehouse – Outflow - making data available to end users – Metaflow - managing the metadata
  • 9. Problems of Data Warehousing 1. Underestimation of resources for data loading 2. Hidden problems with source systems 3. Required data not captured 4. Increased end-user demands 5. Data homogenization 6. High demand for resources 7. Data ownership 8. High maintenance 9. Long duration projects 10. Complexity of integration
  • 10. Data Warehouse Design • Data must be designed to allow ad-hoc queries to be answered with acceptable performance constraints • Queries usually require access to factual data generated by business transactions – e.g. find the average number of properties rented out with a monthly rent greater than £700 at each branch office over the last six months • Uses Dimensionality Modelling
  • 11. Dimensionality Modelling • Similar to E-R modelling but with constraints – composed of one fact table with a composite primary key – dimension tables have a simple primary key which corresponds exactly to one foreign key in the fact table – uses surrogate keys based on integer values – can efficiently and easily support ad-hoc end-user queries
  • 12. Star Schemas • The most common dimensional model • A fact table surrounded by dimension tables • Fact tables – contain an FK for each dimension table – large relative to dimension tables – read-only • Dimension tables – reference data – query performance speeded up by denormalising into a single dimension table
  • 13. E-R Model Example (figure)
  • 14. Star Schema Example (figure)
  • 15. Other Schemas • Snowflake schemas – variant of star schema – each dimension can have its own dimensions • Starflake schemas – hybrid structure – contains mixture of (denormalised) star and (normalised) snowflake schemas
  • 16. OLAP • Online Analytical Processing – dynamic synthesis, analysis and consolidation of large volumes of multi-dimensional data – normally implemented using specialized multi-dimensional DBMS • a method of visualising and manipulating data with many inter-relationships
  • 17. Codd’s OLAP Rules 1. Multi-dimensional conceptual view 2. Transparency 3. Accessibility 4. Consistent reporting performance 5. Client-server architecture 6. Generic dimensionality 7. Dynamic sparse matrix handling 8. Multi-user support 9. Unrestricted cross-dimensional operations 10. Intuitive data manipulation
  • 18. OLAP Tools • Categorised according to architecture of underlying database – Multi-dimensional OLAP • data typically aggregated and stored according to predicted usage • use array technology – Relational OLAP • use of relational meta-data layer with enhanced SQL – Managed Query Environment • deliver data direct from DBMS or MOLAP server to desktop in form of a datacube
  • 19. MOLAP (diagram) • RDB server loads data into the MOLAP server (database/application logic layer) • Presentation layer sends requests to the MOLAP server and receives results
  • 20. ROLAP (diagram) • Presentation layer sends requests to the ROLAP server (application logic layer) and receives results • ROLAP server sends SQL to the RDB server (database layer) and receives results
  • 22. Data Mining • ‘The process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions’ – focus is to reveal information which is hidden or unexpected – patterns and relationships are identified by examining the underlying rules and features of the data – work from data up – require large volumes of data
  • 23. Example Data Mining Applications • Retail/Marketing – Identifying buying patterns of customers – Finding associations among customer demographic characteristics – Predicting response to mailing campaigns – Market basket analysis
  • 24. Example Data Mining Applications • Banking – Detecting patterns of fraudulent credit card use – Identifying loyal customers – Predicting customers likely to change their credit card affiliation – Determining credit card spending by customer groups
  • 25. Data Mining Techniques • Four main techniques – Predictive Modelling – Database Segmentation – Link Analysis – Deviation Detection
  • 26. Data Mining Techniques • Predictive Modelling – using observations to form a model of the important characteristics of some phenomenon • Techniques: – Classification – Value Prediction
  • 27. Classification Example - Tree Induction (diagram) • Customer renting property > 2 years? – No: rent property – Yes: customer age > 25 years? • No: rent property • Yes: buy property
  • 28. Data Mining Techniques • Database Segmentation: – to partition a database into an unknown number of segments (or clusters) of records which share a number of properties • Techniques: – Demographic clustering – Neural clustering
  • 29. Segmentation: Scatterplot Example (figure)
  • 30. Data Mining Techniques • Link Analysis – establish associations between individual records (or sets of records) in a database • e.g. ‘when a customer rents property for more than two years and is more than 25 years old, then in 40% of cases, the customer will buy the property’ – Techniques • Association discovery • Sequential pattern discovery • Similar time sequence discovery
  • 31. Data Mining Techniques • Deviation Detection – identify ‘outliers’, something which deviates from some known expectation or norm – Statistics – Visualisation
  • 32. Deviation Detection: Visualisation Example (figure)
  • 33. Mining and Warehousing • Data mining needs single, separate, clean, integrated, self-consistent data source • Data warehouse well equipped: – populated with clean, consistent data – contains multiple sources – utilises query capabilities – capability to go back to data source
  • 34. Further Reading • Connolly and Begg, chapters 31 to 34. • W H Inmon, Building the Data Warehouse, New York, Wiley and Sons, 1993. • Benyon-Davies P, Database Systems (2nd ed), Macmillan Press, 2000, ch 34, 35 & 36.
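
The star schema described in slides 11 to 15 can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The table names, column names and sample rows below are hypothetical and are not taken from the Connolly and Begg example; the property dimension is reduced to a bare key to keep the sketch short. It shows a fact table whose composite primary key is built from integer dimension keys, and an ad-hoc query in the spirit of the slide 10 example (rentals above a rent threshold, per branch city).

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory star schema: one fact table, two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables: simple integer surrogate primary keys.
        CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        -- Fact table: composite primary key made up of the dimension keys.
        -- property_key stands in for a property dimension that is omitted here.
        CREATE TABLE rental_fact (
            branch_key   INTEGER REFERENCES branch_dim(branch_key),
            time_key     INTEGER REFERENCES time_dim(time_key),
            property_key INTEGER,
            monthly_rent REAL,
            PRIMARY KEY (branch_key, time_key, property_key)
        );
    """)
    # Hypothetical sample data, invented purely for illustration.
    conn.executemany("INSERT INTO branch_dim VALUES (?, ?)",
                     [(1, "Glasgow"), (2, "London")])
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                     [(1, 2024, 1), (2, 2024, 2)])
    conn.executemany("INSERT INTO rental_fact VALUES (?, ?, ?, ?)", [
        (1, 1, 101, 650.0),
        (1, 1, 102, 720.0),
        (1, 2, 102, 720.0),
        (2, 1, 201, 900.0),
        (2, 2, 201, 900.0),
        (2, 2, 202, 750.0),
    ])
    return conn


def rentals_above(conn, threshold):
    """Count rental facts with monthly rent above the threshold, per branch city."""
    rows = conn.execute(
        """
        SELECT b.city, COUNT(*)
        FROM rental_fact AS f
        JOIN branch_dim AS b ON b.branch_key = f.branch_key
        WHERE f.monthly_rent > ?
        GROUP BY b.city
        ORDER BY b.city
        """,
        (threshold,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    connection = build_star_schema()
    for city, count in rentals_above(connection, 700):
        print(city, count)
```

Each dimension is reached from the fact table with a single join, which is the property the slides credit for good ad-hoc query performance.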

Editor's Notes

  • #3: Data Warehousing – a repository of information, or archive information, gathered from multiple sources and stored under a unified schema. OLAP – on-line analytical processing – a method of analysing data based on multi-dimensional databases. Data Mining – analyses data and discovers rules and patterns from the data. Both OLAP and Data Mining are end-user tools for data analysis.
  • #4: DBMSs are pervasive throughout industry. They are designed to handle high transaction throughput, where transactions typically make small changes to the operational data, for the day-to-day running of the organisation. They can range in size: small databases may be megabytes, whereas large databases can require terabytes or even petabytes. Decision makers require access to all the organisation’s data to provide comprehensive analysis of the organisation, its business, its requirements and its trends. This needs access to both current and past data. A data warehouse is different to an OLTP system in that it fits the definition given on this slide. Subject-oriented: organised around the major subjects of the enterprise (e.g. customers, products, sales) rather than the major application areas (e.g. customer invoicing, stock control, product sales). Integrated: the coming together of the source data from different enterprise-wide application systems; the source data is often inconsistent, e.g. in different formats. Time-variant: data in the warehouse is only accurate at some point in time or over some time interval; the data represents a series of snapshots. Non-volatile: data is not updated in real time but refreshed from operational systems on a regular basis. New data is always added, rather than replacing existing data.
  • #5: These are the benefits. Organisations normally must commit a huge amount of investment and resources to developing the data warehouse, but the potential returns on that investment, due to increased productivity and the competitive advantage it gives, can be very large. The IDC quote is that 90% of companies had >40% return over three years (normal ROI would be 8-15%). The competitive advantage is gained through access to data that was previously unavailable, and productivity increases because the data is integrated from previously incompatible systems. Competitive advantage is gained by giving management decision makers access to this data for forecasting trends etc.
  • #6: This shows the typical architecture of a data warehouse. It shows: Data sources - can vary from mainframes to departmental databases to external data. There are lots of different sources and different data types. Load manager (or frontend) performs the extraction and loading of data into the warehouse. Warehouse manager performs all the operations associated with the management of the data in the warehouse. Operations include - ensuring consistency of data, indexes and views, denormalising, aggregating data, backing up and archiving. Query manager (backend) manages the user queries. Complexity depends on flexibility of end-user access tools. Can include directing queries to tables, scheduling execution of queries, generating query profiles to assist warehouse manager in managing indexes and views. Detailed data: this is all the detailed data in the schema. Normally stored offline and aggregated into next level of data. Lightly/highly summarised data: this is the aggregated data generated by the warehouse manager. This is subject to change on an on-going basis depending on the types of queries. Purpose is to speed up queries. Meta-data: description of data in warehouse. Changes according to structure of data in warehouse.
  • #7: There are numerous types of data in a data warehouse. Detailed data is the actual data which has been pulled in from the various sources. Summarised data tends to create various views of the detailed data, to answer specific queries. It needs to be summarised because there is such a large amount of data. Because these views can change, there also needs to be meta-data. Archive/backup – the data warehouse will always grow, so some of the older data can be archived, in a way that it can still be included in queries if required.
  • #8: This diagram shows the main flows of data in the warehouse. The next slide explains each of the five flows.
  • #9: This explains the processes associated with each of the information flows. Inflow – cleans dirty data, restructures data to suit new requirements, and ensures the source is consistent with data already in the warehouse. Upflow – summarises data into more convenient views, packs data into more useful formats, and distributes data to increase availability/accessibility. Downflow – transfers data of limited value to the archive; restores data following a crash. Outflow – 2 activities: 1. accessing – satisfying individual end-user requests for data from tools; 2. delivering – delivery of information to end users. Metaflow – the process which moves metadata in response to changing needs, i.e. updating metadata accordingly.
  • #10: These are the typical problems with a data warehouse; most arise from the problems of integrating the source data (see pages 1050-1052, Connolly & Begg). (1) 80% of development time is spent on data loading. (2) Problems with source systems, e.g. nulls allow incomplete data, which needs to be fixed. (3) OLTP systems may not store the data needed, so the OLTP systems may need to be altered. (4) Users become aware of capabilities and need better tools. (5) Homogenization can lessen the value of data: similarities v. differences in data. (6) Disk space, large number of indexes. (7) Data is accessible to all users. (8) Reorganisation of business processes: change to the DW. (9) Can take 3 years to build; data marts support only one department so may be quicker. (10) Need to integrate all tools to ensure the warehouse benefits the organisation.
  • #11: The types of queries we need to be able to perform are different to those in an OLTP system as they are more factual, analytical and temporal. An example is given - try doing this in a relational system. So normal modelling techniques (E-R model) are not suitable as the relationships between the data can sometimes be too complex, therefore we use dimensionality modelling: a logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access.
  • #12: Our dimensional model is based on the E-R model but with some restrictions, to support the types of queries required. A model with these restrictions is called a Star Schema. The use of surrogate keys aids performance in joins.
  • #13: A star schema contains two types of tables as defined. An example is given on the next couple of slides. Fact tables contain factual data. The FK can be classed as either intelligent (a unique identifier represented by the actual data) or unintelligent (a surrogate FK). Fact tables are unlikely to change, as facts normally occur in the past. Dimension tables contain reference data, i.e. the data which supports the fact. Star schemas give increased performance as dimension tables are denormalised, thus minimising the number of joins.
  • #14: This is an E-R model taken from Connolly and Begg for the Dream Home database. Notice that it contains complex relationships between the various objects, which would make it difficult to answer the types of queries required. So we redesign using a star schema as given on the next slide...
  • #15: This is the Star Schema version. Now we have one table in the centre which contains all the links to the dimension tables, which contain the data. The fact table is just like an M:N relationship in a relational database. Note that there can be more than one fact table in a star schema.
  • #16: Snowflake: a star schema is denormalised, whereas snowflake schemas contain no denormalised data; the data is normalised. Starflake: a combination of normalised and denormalised data. This is the most appropriate schema to use.
  • #17: There are 2 tools commonly used for data analysis. These are OLAP and data mining. OLAP tools are query centric – database schemas are array oriented and multi-dimensional in nature, e.g. market analysis. So OLAP tools work on the concept of multi-dimensional (i.e. >2 dimensions) data to support complex analytical applications. So there is a new type of database, the multi-dimensional DB, as there is a need to retrieve large numbers of records from very large data sets and summarise this data on the fly. For example, the dimensions of a database could be property type, city and time. Typical operations which can be performed are: Consolidation – aggregate data, roll up, e.g. branch offices -> city, city -> country. Drill down – the reverse of consolidation, displays detailed data. Slicing and dicing, aka pivoting – look at data from different viewpoints, normally along a time axis.
  • #18: Codd defined a number of rules for OLAP systems. Must be intuitively analytical and easy to use. Transparent to users – users are familiar with particular front-end tools. All data sources must be accessible (network, hierarchical, relational, etc.). As the number of dimensions increases, performance must remain consistent. Must operate efficiently in a client-server architecture. There must be no bias towards any one dimension. There may be instances where a large number of nulls are stored – this must not have any adverse impact on accuracy and speed of access. Must support concurrent users. Must be able to support the typical operations on the last slide, e.g. performing roll-up within/across dimensions. Slicing and dicing, drill down and consolidation must be intuitive, e.g. via a point and click interface, or drag and drop operations on a data cube. Must be able to retrieve any view of the data. Number of dimensions should be unlimited.
  • #19: Types of tools are categorised according to the architecture of the underlying database. There are 3 main categories. MOLAP – uses specialised data structures and MDDBMSs to organise, navigate and analyse data. Data is typically aggregated to enhance performance. Uses array technology and sparse data management. ROLAP – this is the fastest growing technology. Works by providing multi-dimensional views of 2D data. SQL is enhanced to increase performance and support complex operations on multiple dimensions. MQE – this is the newest technology. Data can be delivered either directly from the RDB or from a MOLAP/ROLAP server in the form of a datacube. The datacube is stored and analysed locally, so these tools are simple to install, and each user can build a custom data cube.
  • #23: There are various end-user tools which can be used with data warehouses, and data mining is one of those sets of tools which is used for analysing data within a database to find hidden/unexpected information within a database.
  • #24: These are some examples of typical applications
  • #26: Although there are four operations which can be used independently, many applications work well when several or a combination of operations are used. There are specific techniques used with each operation.
  • #27: Predictive modelling uses observations to form a model of the important characteristics of some phenomenon. Can be used to analyse an existing database to determine some essential characteristics about the data set. Two main techniques: Classification: used to establish a specific predetermined class for each record in a database from a finite set of possible class values, e.g. if a customer has rented for > 2 years and > 25 years old then they are most likely to buy property. Can use neural/tree induction. Value prediction: used to estimate a continuous numeric value that is associated with a database record, uses statistical techniques, e.g. linear/non-linear regression. An example of classification is given on the next slide.
  • #29: The second technique is database segmentation. Aims to cluster records so that they share a number of properties, i.e. are homogeneous. Uses unsupervised learning to discover sub-populations in the database. Two types – demographic and neural clustering. An example of a scatterplot is given on the next slide.
  • #31: Link analysis establishes associations between records. An example is given. Various techniques look for associations, patterns and similar time sequences. Association – items which imply the presence of other items in the same event. Sequential – the presence of one set of items implies the presence of another within a period of time (e.g. long-term customer buying behaviour). Similar time sequence – discovery of a link between 2 sets of data that are time dependent, e.g. buying property -> buying household goods within 2 months.
  • #32: Deviation detection identifies records where a value is out of the ordinary. Can be done either statistically (e.g. linear regression) or by visualisation (e.g. graphically), as in the example on the next slide. Good for fraud detection.
  • #34: So to finish off on warehousing, if we look at the requirements for a data mining tool and then compare this to what we get from a data warehouse, then we can see that the ideal data source for data mining is a data warehouse. DW data is clean and consistent, which is a prerequisite for data mining. Multiple sources allow as many inter-relationships as possible to be discovered. Query capabilities allow for selection of relevant subsets of records and fields. Going back to the data source provides a way for data mining results to allow further investigation of uncovered patterns.
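
The link analysis example on slide 30 ('when a customer rents property for more than two years and is more than 25 years old, then in 40% of cases, the customer will buy the property') is the kind of statement whose percentage is usually called the confidence of an association rule. The following is a minimal sketch of how such a figure could be computed. The record layout, field names and customer data are all hypothetical, invented so that the rule happens to hold in 40% of matching cases; they are not drawn from the slides or from Connolly and Begg.

```python
def rule_confidence(records, antecedent, consequent):
    """Return the fraction of records matching the antecedent that also
    match the consequent, or None if no record matches the antecedent."""
    matching = [r for r in records if antecedent(r)]
    if not matching:
        return None
    supported = sum(1 for r in matching if consequent(r))
    return supported / len(matching)


# Hypothetical customer records (illustration only).
customers = [
    {"years_renting": 3, "age": 30, "bought": True},
    {"years_renting": 3, "age": 40, "bought": True},
    {"years_renting": 4, "age": 28, "bought": False},
    {"years_renting": 5, "age": 35, "bought": False},
    {"years_renting": 6, "age": 50, "bought": False},
    {"years_renting": 1, "age": 30, "bought": True},   # rented too briefly
    {"years_renting": 3, "age": 22, "bought": False},  # too young
]


def rents_long_and_over_25(r):
    """Antecedent: rented for more than two years and older than 25."""
    return r["years_renting"] > 2 and r["age"] > 25


def buys_property(r):
    """Consequent: the customer bought the property."""
    return r["bought"]


if __name__ == "__main__":
    confidence = rule_confidence(customers, rents_long_and_over_25, buys_property)
    print(f"Rule confidence: {confidence:.0%}")
```

Five of the seven sample customers satisfy the antecedent and two of those five bought, so the sketch reports a confidence of 2/5. A real association discovery tool would also search for candidate rules rather than evaluate a single rule supplied by hand.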